Advanced Business Intelligence Techniques 4: Data, distances and similarities

I wanted to show you a couple of things. As I did for PageRank, I also uploaded a little piece of code for the HITS scores, so that you have the hub and authority values. It is absolutely equivalent to the other code: you define a matrix, you maintain two vectors, hub and authority, and at every step you compute a new version of each. You determine the hub vector given the authority vector, and the authority vector given the hub vector. It just plays around with these small matrices.

Next, here is a folder with the code. You can download it, but it is rather large, so it will probably exceed some disk quotas. Let me show it graphically. Here you have a Python program. It goes back to last time, when we opened the Congress website, and I added everything that is needed to download not only that page. The idea was: start from this page, download the pages of the representatives that are listed in it, and have a look at the lists of sponsored and co-sponsored bills in order to build a network between representatives. This network is computed by this piece of code.

Essentially, we can compute many things by looking at those pages. The first one is: given two representatives A and B, connect A to B if there is some bill that is sponsored by B and co-sponsored by A. If the bill has been proposed by B and A endorses it, this is a formal endorsement from A to B. So this is one possible analysis, and by this analysis we get a directed graph that hopefully connects most representatives. There is also a second thing that you can do, and that will be useful in the future.
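The endorsement rule described above (an arc from A to B for every bill sponsored by B and co-sponsored by A) can be sketched as follows. This is only an illustration under stated assumptions: the dictionaries `sponsored` and `cosponsored`, the function name `endorsement_edges` and the toy bill ids are hypothetical, and each bill is assumed to have exactly one sponsor.

```python
from collections import Counter


def endorsement_edges(sponsored, cosponsored):
    """Count, for each ordered pair (a, b), the bills sponsored by b and co-sponsored by a."""
    # Invert the sponsorship lists: bill id -> sponsor (assumes one sponsor per bill).
    sponsor_of = {bill: rep for rep, bills in sponsored.items() for bill in bills}
    edges = Counter()
    for a, bills in cosponsored.items():
        for bill in bills:
            b = sponsor_of.get(bill)
            if b is not None and b != a:
                edges[(a, b)] += 1  # one more endorsement from a to b
    return edges


# Toy data with hypothetical representative codes and bill ids.
sponsored = {"A": ["hr3"], "B": ["hr1", "hr2"], "C": []}
cosponsored = {"A": ["hr1", "hr2"], "B": ["hr3"], "C": ["hr1"]}
edges = endorsement_edges(sponsored, cosponsored)
for (a, b), weight in sorted(edges.items()):
    print(a, "->", b, weight)
```

The counter value attached to each ordered pair is the number of bills behind that endorsement.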

By the way, these directed arcs carry a number that counts the bills that are sponsored by B and co-sponsored by A, so each of these connections has a weight.

The second thing that you can build is an undirected graph, where connections don't have a direction and every connection just says how many bills have been signed by both representatives. So if this connection is 10, it means that there are 10 bills where both A and B appear as either sponsor or co-sponsor. Of course, two representatives that are politically very close would have many bills in common, so you can expect a very high number here. If they are very far away, politically speaking, then we can expect either no connection or a weak connection related to, let's say, wide bills that apply to the whole political arena.

So the idea is to mine that website looking for these two kinds of information. They are both related to proximity, but one is directed and can be used to directly compute, for example, PageRank or HITS scores; the other one is undirected, and we will use it when speaking of clustering, in order to identify the groups of people that are close together. This is just an example; there are many kinds of data that are available for this kind of analysis.

Let's just have a look at the code. I was trying to find the way to enlarge the characters; anyway, you should be able to see it such as it is. Let me show you the initial part. The first thing that we download is this web page, the base URL. Every time we want to download something, we first check whether it has already been cached: everything I download, I also store as a copy in a local cache.
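The "check the cache first, download only if missing" idea can be sketched like this. It is not the lecture's actual code: the function name `fetch_cached`, the hashed file names and the `download` callback are assumptions made for the example. In real use the callback could wrap something like `urllib.request.urlopen(url).read()`; here a fake downloader is used so the sketch runs without a network.

```python
import hashlib
import os
import tempfile


def fetch_cached(url, cache_dir, download):
    """Return the content at `url`, calling `download` only if no cached copy exists."""
    os.makedirs(cache_dir, exist_ok=True)
    # Hypothetical naming scheme: one file per URL, named after a hash of the URL.
    name = hashlib.sha1(url.encode("utf-8")).hexdigest()
    path = os.path.join(cache_dir, name)
    if os.path.exists(path):
        with open(path, "rb") as f:
            return f.read()  # cache hit: no download
    data = download(url)
    with open(path, "wb") as f:
        f.write(data)  # keep a local copy for later runs
    return data


# Demonstration with a fake downloader that records how often it is called.
calls = []


def fake_download(url):
    calls.append(url)
    return b"<html>page for " + url.encode("utf-8") + b"</html>"


cache_dir = tempfile.mkdtemp()
first = fetch_cached("http://example.org/reps", cache_dir, fake_download)
second = fetch_cached("http://example.org/reps", cache_dir, fake_download)
print(len(calls))  # the second call is served from the cache
```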

Of course, we are going to download data concerning more than 400 people and thousands of bills, so it takes quite a long time. If we have everything cached it's better, because we can rewrite the program for different analyses and so on. I'm using urllib to download files given a URL.

So this is the page containing all the names of the representatives. We scan that file looking for some specific lines that give us the name and other data. I'm looking for lines like this one, where I have a reference to one person, the name, some other data and the political party. So I just look for those lines and collect them, putting them together in an array. What I'm doing here is creating this dictionary, where every representative has a code and is associated with a list of properties. Among these properties there are two arrays that will contain the bills that have been sponsored by the representative and the bills that have been co-sponsored by him, which are found by looking in this page. So we collect basically the IDs of the bills that have been sponsored, followed by the IDs of the bills that have been co-sponsored. The list can be divided into more than one page, so you have to collect a few pages to do that.

Next, once we have a list of sponsored and co-sponsored bills for every representative, we just have to compute the adjacency matrix as I told you before, with a link for every bill that has been sponsored by one and co-sponsored by the other. This is the matrix of references, and to use it for the PageRank computation I need to normalize it: every row must be divided by the row sum. Some rows are zero, so they need separate treatment. Then we transpose it.
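A minimal sketch of the normalization step, using plain Python lists. How the lecture's program treats all-zero rows is not clear from the transcript; replacing them with a uniform row, as done here, is an assumption (a common convention), and the names `row_normalize`, `transpose` and `counts` are made up for the example.

```python
def row_normalize(matrix):
    """Divide every row by its sum; all-zero rows become uniform (an assumption)."""
    n = len(matrix)
    result = []
    for row in matrix:
        total = sum(row)
        if total == 0:
            # Assumed treatment: someone who endorses nobody spreads weight evenly.
            result.append([1.0 / n] * n)
        else:
            result.append([x / total for x in row])
    return result


def transpose(matrix):
    """Swap rows and columns."""
    return [list(column) for column in zip(*matrix)]


# counts[a][b] = number of bills sponsored by b and co-sponsored by a (toy values).
counts = [
    [0, 2, 0],
    [1, 0, 1],
    [0, 0, 0],
]
P = row_normalize(counts)  # every row sums to 1
M = transpose(P)           # every column sums to 1
for row in M:
    print(row)
```

After the transpose, entry `M[i][j]` is the share of j's endorsements that goes to i, which is the form needed for a matrix-times-vector update.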

This is the code for computing PageRank, which is exactly like the code we have already seen: I just do a row-by-row multiplication of the matrix and the vector. Actually, I maintain two vectors, the first and the second, and I alternate between the two in order not to destroy the first while I compute the second. So at every step I take my input from the first one and put my output in the second. The same goes for the HITS scores; in that case I need two different versions of each vector. At the end I have vectors that represent the PageRank score, the hub score and the authority score of every representative.

The second thing that I wanted to do was to find also the undirected graph. What I do is, for every pair of representatives, I compute the intersection of the lists of bills that they have, and the size of the intersection is exactly the weight.

Finally, I output everything to a CSV file. Every representative comes with the name, the party, the number of sponsored bills, the number of co-sponsored bills, and the PageRank, hub and authority values that I computed. I rescaled the scores, multiplying by 10, to get more convenient values.

I also wanted to display this information. It turns out that almost all representatives are connected to each other, because there are some very popular bills that are signed by nearly everybody. So it is a matter of filtering out some of the most popular bills. For this I chose to insert two thresholds in the program.
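The two-vector PageRank update described above (read from one vector, write into the other, then swap) might look like the sketch below. The damping factor of 0.85 and the fixed number of steps are assumptions, since the transcript mentions neither; the matrix `M` is a toy column-normalized example and the function name `pagerank` is hypothetical.

```python
def pagerank(M, damping=0.85, steps=100):
    """Power iteration on a column-normalized matrix M (M[i][j] = share of j's weight going to i)."""
    n = len(M)
    current = [1.0 / n] * n  # vector we read from
    nxt = [0.0] * n          # vector we write into
    for _ in range(steps):
        for i in range(n):
            # Row-by-vector product: input from `current`, output into `nxt`.
            nxt[i] = (1.0 - damping) / n + damping * sum(
                M[i][j] * current[j] for j in range(n)
            )
        # Swap roles so the freshly computed vector becomes the next input.
        current, nxt = nxt, current
    return current


# Toy column-normalized matrix: node 1 receives all of node 0's endorsements.
M = [
    [0.0, 0.5, 1 / 3],
    [1.0, 0.0, 1 / 3],
    [0.0, 0.5, 1 / 3],
]
scores = pagerank(M)
print(scores)
```

Keeping two buffers means no entry of the input vector is overwritten while the output is still being computed from it.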

I decided to connect only representatives that have 10 or more bills in common, and I exclude all bills that have been signed by more than ten percent of the representatives. In this case I wanted to consider only, let's say, local interactions, on bills that are not signed by too many people. So here at the end I only list the connections between two representatives if they satisfy those two criteria.

Rather than writing the raw number, for some reasons that we will see later, I was interested in putting something else here. You see, this number, meaning the number of bills in common, is a sort of similarity measure: the larger the number, the more similar the two representatives are, politically speaking. I wanted to transform it into a measure of distance. Distance is somehow the opposite of similarity, right? So I want a large number if they should be far away and a small number if their distance is small, meaning that they are similar. If the distance is d and the number of common bills is s, then d = sqrt(1/s), the square root of 1/s, turned out to be a reasonable distance. If s is very large, d tends to zero; if s is small, d is larger, but it will never be larger than 1, because I am not considering connections when s is zero, and if s is 1 then the distance is 1.

For example, this line means that this representative is connected to representative number seven in the list, and the value listed for this connection is 3.46. So the formula I used for the distance is: if the number of common bills is less than 10, the two are not connected, and we can consider the distance infinite; otherwise the distance is 1 divided by the square root of s. So these measures are between 0 and 1.
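The two thresholds and the distance d = 1/sqrt(s) can be put together in a short sketch. The function name `pair_distances`, the input dictionary `bills_of` (all bills a representative signed, as sponsor or co-sponsor) and the toy data are assumptions made for illustration; pairs below the threshold are simply left out, which stands for an infinite distance.

```python
import math
from itertools import combinations


def pair_distances(bills_of, min_common=10, max_share=0.10):
    """Map each connected pair (a, b) to the distance 1/sqrt(s), s = number of common bills."""
    n = len(bills_of)
    # Count how many representatives signed each bill.
    signers = {}
    for bills in bills_of.values():
        for bill in set(bills):
            signers[bill] = signers.get(bill, 0) + 1
    # Drop bills signed by more than `max_share` of the representatives.
    kept = {
        rep: {bill for bill in bills if signers[bill] <= max_share * n}
        for rep, bills in bills_of.items()
    }
    result = {}
    for a, b in combinations(sorted(kept), 2):
        s = len(kept[a] & kept[b])  # size of the intersection
        if s >= min_common:         # otherwise: not connected (infinite distance)
            result[(a, b)] = 1.0 / math.sqrt(s)
    return result


# Toy data; the thresholds are loosened because there are only four people.
toy = {"A": [1, 2, 3, 4, 5, 9], "B": [1, 2, 3, 4, 9], "C": [5, 9], "D": [9]}
toy_distances = pair_distances(toy, min_common=4, max_share=0.5)
print(toy_distances)
```

In the toy data, bill 9 is signed by everybody and is filtered out, so only A and B, with four remaining bills in common, end up connected.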

Now the problem is: what can we do with this list? Well, for example, I chose the comma-separated values representation because it can be easily opened, for example, with a spreadsheet. If you open OpenOffice you should be able to import a CSV file. Of course it requires you to specify the format of your data: in our case fields are separated by commas, and text, if necessary, is delimited by double quotes. If we import them correctly, with this list we can already start doing something, for example sorting people according to their PageRank or hub value or, maybe more interesting, finding correlations between the values, and so on. So my idea for today is to open a spreadsheet, or something else, and try to find something out in this collection of data.

If you want to play with the program, rather than downloading all the files to your directory, which can exceed your disk quota, maybe you can just use my cache. You can find all the files, and they should be readable, in a folder under /home. You can copy the congress.py file to your directory; in it there are the names of the cache directories where the files are stored, and rather than using these names you can point the program at my cache.
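For those who prefer a script to a spreadsheet, the CSV conventions mentioned above (comma-separated fields, double quotes around text where needed) are what Python's `csv` module produces by default. The sketch below writes a few rows, reads them back and sorts them by PageRank; the column names and the people are invented for the example and are not the ones in the lecture's file.

```python
import csv
import io

# Hypothetical rows; the real file has more columns (bill counts, hub, authority).
rows = [
    {"name": "Doe, Jane", "party": "D", "pagerank": 0.012},
    {"name": "Roe, Richard", "party": "R", "pagerank": 0.034},
    {"name": "Poe, Pat", "party": "D", "pagerank": 0.021},
]

# Write to an in-memory buffer; names containing a comma get double quotes.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["name", "party", "pagerank"])
writer.writeheader()
writer.writerows(rows)
text = buffer.getvalue()
print(text)

# Read the same text back and sort by PageRank, highest first.
reader = csv.DictReader(io.StringIO(text))
ranked = sorted(reader, key=lambda r: float(r["pagerank"]), reverse=True)
top_names = [r["name"] for r in ranked]
print(top_names)
```

Values come back from the reader as strings, so they have to be converted with `float` before sorting numerically.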

That way you use my cache; the files should be readable. So start having a look at the code, and then I will show you one of the possible outcomes of this kind of work.

Here is an example of something that we will see during the course. The idea is that obtaining the raw numbers is only part of the story. Next, what you usually want to do is to show those numbers so that they can actually be understood by people. For example, one result of this data mining operation could be something like this. This graph shows the representatives: every dot is a representative, and the layout places the representatives so that their mutual distances are as close as possible to the distances that have been determined by this formula, based on the number of bills that have been signed together. Of course some pairs are not connected; in that case they are free to be at any distance.

The problem is that we cannot enforce too many constraints on this system, because we will see that, in theory, given a more or less arbitrary set of distances, it is impossible to find a layout that realizes exactly these distances between all pairs. You have to allow for some elasticity in the system. So the best you can do is an approximation that tries to keep together the people who have many bills in common, and far apart the people who have few.

Here you see I chose to color the nodes based on the political party: Democrats are blue, Republicans are red. And you see in this case that the layout actually corresponds to their political position. There is nothing here that imposes the decision to put all the Democrats on one side and all the Republicans on the other: the positions of these two groups are obtained just by looking at how many bills have been signed together. So in this case you have clear clusters of nodes that roughly

correspond to the political party. So if no one told you that in the U.S. there is basically a two-party system, just by looking at that, even if we didn't color these nodes, you could already tell that the American system is more or less bipolar.

This is not the complete story, because if you look further away you will find a lot of people who are far away from all the others. These representatives are people who are not active, so they don't participate in the parliamentary process, or they have different tasks: for example, they can be the head of some important subcommittee of the House, so they don't participate in the parliamentary discussions but they participate in some committees and so on. Hovering with the pointer can also help you to see the names of the single nodes. You can see here that there are some Democrats and Republicans that are more in the middle. I don't recognize any of the names here; I don't know very much about American politics.

Or you can do different exercises. For example, here I drew another graph that lets me filter out people. I can decide to filter out people who don't sponsor many bills: I choose to only represent people having 16 or more sponsored bills. These are the more active part of the parliament, so they are the people who put out more proposals. They don't always correspond to the people having the highest PageRank. What PageRank means in this context is that the people having a high PageRank not only sponsor bills: they are co-sponsored by many people, and the people that co-sponsor those bills have themselves sponsored bills that are co-sponsored by high-ranking people. So these are the people whose bills are co-sponsored by people with a high PageRank on average. I am not very good at saying recursive sentences, but I think you understand. We can say the same about hub and authority.

One other interesting point: let's see the display.

You can also try to find correlations, although it is not so easy to see them here. For example, you have different measures for people, in this case PageRank and also hub and authority scores. Is there any relationship between the three? If I take all the representatives and I put, for example, the PageRank on the x-axis and the hub value on the y-axis (these are the blue rectangles), is there any relationship between the two, or are the two measures actually independent, so that they show different things?

When you look at the definitions of PageRank, hub and authority, they all have the same structure: they are all derived by adding up values that arrive along edges from neighboring nodes. So one doubt that I had when I first learned of these kinds of measures is: are they really independent, or do they say exactly the same thing? Of course they are defined in different ways, but if you plot one against the other, what can you expect? If the PageRank score and the hub score more or less measured the same thing, then you could expect them to be roughly proportional, so the blue squares would align somehow along the diagonal. I mean, if a high PageRank corresponded to a high hub score, we would expect these rectangles to be aligned and to trace a correspondence between the two measures, which actually does not happen. From this graph you can understand that, more or less, hub values and PageRank values are

actually independent of each other. Sorry, to be precise: PageRank is on the x-axis, and the y-axis represents hub with the blue squares and authority with the red ones. So just don't look at the red squares for the moment; just consider this as a graph that puts into correlation PageRank, which is on the x-axis, and hub. You see that there are people who have a very high PageRank and a very low hub score, and vice versa. Of course most people, let's say the ordinary people, have low scores in both, and only a few representatives get enough popularity to have a higher score. Actually it is much easier to have a low PageRank with a high hub value than vice versa. This is probably a problem of scale.

You can also try to put into relation the hub and the authority values. If I plot hub and authority one against the other, of course still with a scatter plot, I get this kind of distribution. Here I'm just looking at the graph; of course you can also use numbers, for example correlation coefficients and so on. But from here you see that there is a sort of relationship between hub and authority values: they are not completely random with respect to each other. If someone has a high hub value, his authority value will not be below this threshold, for example. But again, they are not so strongly correlated that one score can replace the other, so in this case we can assume that they say different things. Of course, by looking at the graph you get an intuition, and then you should use numbers to support this intuition, but this is something that we will see later in the course.
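As a numerical companion to the scatter plots, the Pearson correlation coefficient is one of the "numbers" that can back up the visual impression. The sketch below is a plain implementation with made-up values, not the lecture's data: a coefficient near 1 means the two scores are roughly proportional, while a coefficient near 0 means no linear relationship.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up scores, only to show the two extreme cases.
hub = [1.0, 2.0, 3.0, 4.0]
proportional = [2.0, 4.0, 6.0, 8.0]  # exactly proportional to hub
unrelated = [1.0, -1.0, -1.0, 1.0]   # no linear relationship with hub
r_prop = pearson(hub, proportional)
r_unrel = pearson(hub, unrelated)
print(r_prop, r_unrel)
```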

Or you could do, for example, a comparison between Democrats and Republicans, based maybe on the PageRank score. If you sort the representatives according to PageRank, is there one of the groups that comes out ahead of the other? Which is the average rank of the two? These are all simple statistics that can be computed directly with a spreadsheet.
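The same comparison can also be scripted. The sketch below computes, for each party, the average PageRank and the average position in the PageRank ordering; the record layout, the function names and the four toy representatives are assumptions made for the example.

```python
from statistics import mean


def average_by_party(reps):
    """Average PageRank score of each party."""
    groups = {}
    for rep in reps:
        groups.setdefault(rep["party"], []).append(rep["pagerank"])
    return {party: mean(values) for party, values in groups.items()}


def average_rank_by_party(reps):
    """Average position (1 = highest PageRank) of each party's members."""
    ordered = sorted(reps, key=lambda r: r["pagerank"], reverse=True)
    positions = {}
    for position, rep in enumerate(ordered, start=1):
        positions.setdefault(rep["party"], []).append(position)
    return {party: mean(p) for party, p in positions.items()}


# Toy records with hypothetical names and scores.
reps = [
    {"name": "rep1", "party": "D", "pagerank": 0.5},
    {"name": "rep2", "party": "R", "pagerank": 0.25},
    {"name": "rep3", "party": "D", "pagerank": 0.15},
    {"name": "rep4", "party": "R", "pagerank": 0.1},
]
avg_score = average_by_party(reps)
avg_rank = average_rank_by_party(reps)
print(avg_score, avg_rank)
```

A lower average position means that the group tends to sit closer to the top of the ranking.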