PageRank
PageRank™ is an algorithm used by the Google search engine to rank websites in search results. PageRank measures the importance of a page by counting the quantity and quality of links pointing to it. It is not the only algorithm used by Google to rank web pages, but it is the first one used by the company and the best known. Its properties are widely discussed by search engine optimization (SEO) experts. The PageRank process was patented by Stanford University in the United States under number 6,285,999. Only the name PageRank is a registered trademark of Google. Google holds exclusive licensing rights to the PageRank patent. Stanford University received 1.8 million shares of Google stock in exchange for the use of the patent. The shares were sold in 2005 for $336 million. ==Description== In constructing the PageRank metric, the web is viewed as a network of citations; each node corresponds to a page, and each link corresponds to a reference from one page to another (hyperlink). The metric assigns a value to each node (page) in the network; a higher value corresponds to a more important node in the network. From the perspective of network theory, PageRank is a centrality metric. This metric leverages the structure of hyperlinks on the web to produce a value for each page in the network. A hyperlink to a page counts as a supporting "vote." The PageRank value of a page depends on the number of pages and the PageRank metric of those pages that point to it. A page has a higher PageRank value if: * many pages point to it * some pages point to it with a high PageRank metric (a page is important if important pages point to it) [[Image:PageRanks-Example.svg|right|thumb|400px|PageRank metric for nodes in a simple network, expressed as percentages. (Google uses a logarithmic scale). Node C has a higher PageRank value than node E, even though there are few links to C, the link to C comes from an important node and therefore has a high value. If a user starts at a random node with an 85% probability of choosing a random link from the node they are currently visiting, and a 15% probability of jumping to a randomly chosen node from the entire network, that user will reach node E 8.1% of the time. (A 15% probability of jumping to an arbitrary node corresponds to a damping factor of 85%). Without damping, any user would end up at nodes A, B, or C, and all others would have a PageRank value of zero. By using the damping factor, node A is connected to all nodes in the network, even if it has no connections to other nodes.]] == Google and PageRank == The PageRank system is used by the search engine [[Google]] to help determine the relevance or importance of a page. It was developed by Google founders [[Larry Page]] and [[Sergey Brin]] while they were students at [[Stanford University]] in [[1998]]. [[Google]] maintains a list of billions of pages in order of importance, that is, each page has its importance on the Web as a whole; this Page Database maintains everything from the most important page in the world to the least important. This importance is determined by the number of votes a page receives. A vote is a "[[link]]" anywhere on the web for that page. Votes for more important pages are worth more than votes for less important pages. This page ranking criterion, according to many people, is quite democratic, reflecting what the "web thinks" about a given term. Remember that about ten billion pages are taken into account. The [[quality]] of the most important pages is naturally guaranteed, ranked, and elected by the web itself. Furthermore, all pages have the same chance of rising in this ranking, gaining votes across the web. A good unit of measurement for defining a page's PageRank can be the percentage (%) of pages where it is most important. For example, if a page has a PageRank of 33%, it means it is more important than a third of the entire web. If its PageRank is 99%, it means it is superior to almost all web pages. However, it is possible to manipulate PageRank by assigning "links" that are out of context with the page's purpose, modifying the ordering of results in Google searches and inducing irrelevant or biased results. A recent example of this is the search for [[Google:failure|failure]] or [[Google:miserable+failure|miserable failure]] which returned as the first site the official biography of the [[White House]] for the President of the [[United States]], [[George W. Bush]] and then the page of [[Michael Moore]], a declared enemy of the US president. This process became known as "[[Google bomb|Googlebombing]]". Despite this, Google has removed some results resulting from "Googlebombing". == History == PageRank was developed at Stanford University by Larry Page (hence the name PageRank) and Sergey Brin in 1996, as part of a research project on a new type of search engine. Sergey Brin had the idea that information on the web could be ordered in a hierarchy of "link popularity": A page is more important if it has more hyperlinks pointing to it. It was co-authored by Rajeev Motwani and Winograd Terry. The first paper on the project, describing the PageRank metric and the initial prototype of the Google search engine, was published in 1998.Soon after, Page and Brin founded Google Inc., the company behind the Google search engine. The PageRank metric was inspired by citation analysis, developed by Eugene Garfield in 1950 at the University of Pennsylvania, and by the "Hyper Search" method, developed by Massimo Marchiori of the University of Padua. In the same year that PageRank was introduced (1998), Jon Kleinberg published his work on HITS. The founders of Google cited Marchiori and Kleinberg in their original article. A search engine called “RankDex” from IDD Information Services, designed by Robin Li, has been exploring a similar strategy for scoring and ranking pages since 1996. The technology used in RankDex was patented in 1999 and later used when Li founded Baidu in China. Li's work is referenced in some patents, including Google's search methods, and Larry Page's. The algorithm: The PageRank metric of a page represents the probability of a person reaching that page by randomly clicking on hyperlinks. The PageRank calculation is scalable, meaning it can be executed in a timely manner if the number of pages on the network increases considerably. The PageRank calculation is iterative, meaning it requires several passes, called "iterations," where the values obtained in each iteration converge to the desired PageRank values. In the first iteration, an initial PageRank value is assigned equally to all pages (N is the total number of pages). === The Simplified Algorithm === Imagine a network of only 4 pages: "A", "B", "C", and "D". Links from a page to itself and multiple links between two pages are ignored. Initially, the sum of the PageRank values of all web pages corresponded to the number of pages on the web. In later versions, PageRank began to assume values between 0 and 1, representing a probability distribution, that is, the probability of a user, browsing links randomly, reaching a specific page. In the first step of the iterative PageRank calculation process, all pages have the same PageRank value. In our example of 4 pages, the first step consists of assigning a PageRank value of 0.25 to each of the four pages. Note that the sum of the PageRank values of all pages is 1. [[File:Simple 4 nodes graph 3 nodes link to one.jpg|thumb|center|alt=Fig. 1- All pages have only one reference to page A.|Fig. 1- All pages have only one reference to page A.]] In a network with the configuration shown in the figure above, in the second iteration, each link transfers the value 0.25 to the PageRank of A, that is: : [[File:Pagerank.pt.fig2.jpg|thumb|center|alt=Fig. 2- Pages that reference more than one page|Fig. 2- Pages that reference more than one page.]] In the case of the network above, in the second iteration, the value of "'B'" is transferred half to "'A'" (0.125) and the other half to "'C'" (0.125). As "'D'" references 3 pages, its value to be transferred is divided by three, in this case the PageRank of "'A'" receives the following values. Therefore, the contribution of a link to the PageRank of the referenced page is equal to the PageRank value of the page with the link, divided by the number of links the page contains. If we represent the number of links of a page by L(), we can rewrite the expression above for our network of 4 pages: Generalizing, the PageRank value for a page u can be expressed as follows: The PageRank value of a page "u" depends on the PageRank values of each page "v" contained in the set "B".uThe set of all pages that reference "u" divided by the number of references "L" ("v") existing in "v". === Pages without links === The iterative process of calculating PageRank encounters problems when a page has no links to other pages. [[File:Pagerank.pt.fig3.jpg|thumb|center|alt=Fig. 3- Page without links|Fig. 3- Page without links.]] If the calculation is applied to the network in the previous figure, we end up obtaining a value of zero for both pages "A" and "B". In each iteration, "B" receives some of the PageRank of "A" (in this particular case "B" receives all of the PageRank of "A", but in a more complex network where "A" had links to other pages, "B" would receive only a part of the PageRank), but since "B" does not It has links, but it doesn't pass its value to other pages, in this case "'A'". This produces a PageRank draining effect out of the network. === Rank Sink === Another problem encountered in PageRank calculation occurs when a network contains a rank sink. [[File:Pagerank.pt.fig4.jpg|thumb|center|alt=Fig. 4- Example of a rank sink|Fig. 4- Example of a rank sink.]] Consider a closed loop of interconnected pages, but none of the pages connect to a page outside the loop. In such a scenario, the PageRank calculation gets "stuck" in the infinite loop; in each iteration, the PageRank value is transmitted from one page to another in the loop, without ever distributing the value to pages outside the loop. ...and without the values converging to stationary PageRank values. === Damping Factor === The problems described above are solved by a concept introduced by PageRank and called the damping factor. PageRank theory considers that an imaginary user (or surfer) who randomly follows the links between pages will eventually get bored and stop following the links. The probability, at each step, that the user will continue to follow the links is the damping factor "d". The damping factor, being a probability, can vary between 0 and 1. Therefore, the value of "'PR(A)'" has a component corresponding to the contribution of pages pointing to "'A'", weighted by the probability "'d'" of the user following the page links: : and a component corresponding to the user having selected the page randomly, weighted by the probability of the user not following the page links (1-d): With the introduction of the damping factor "'d'", the calculation of the PageRank value has the following expression, representing the total number of pages: : There are other variations for calculating PageRank, but the expression above has the particularity that the sum of the PageRank values of all pages is 1. This gives a probability distribution, that is, the probability of a user reaching page A. The damping factor introduces the following characteristics to the PageRank calculation: * A page, simply by existing, has a probability equal to all the others. Other options include being selected by a user's random choice. * A page without links is connected to all pages on the network. * Solving problems with pages without links and loops (rank sink). The damping factor "d" can take values between zero and 1, as already indicated. With "d" = 1, we move to the simplified form of the algorithm; with "d" = 0, no weight is assigned to the hyperlink structure between pages on the network, all pages have a PageRank value equal to , where is the number of pages on the network. Therefore, the closer "d" is to 1, the greater the weight given to the network structure. A value of 0.85 is normally assigned to the damping factor. === Matrix Representation === Assuming the network consists of pages "'P1'", "'P2'", ..., "'Pn'", "'M(Pi)'" represents the set of pages that reference "'Pi'", "'L(Pj)'" represents the number of references to page "'Pj'". The expression for calculating the PageRank value can be rewritten as follows: : (*) The vector "'R'" containing the PageRank value for all pages can be represented as follows: Constructing a transition matrix "'M'" of nxn, where n is the total number of pages, an element "'Mij'" (row "'i'" and column "'j'") is given by the function: :, if there is no reference to page "'pj"' to the page "'pi”' :, if there is a reference to the page ”'pj"' to the page "'pi”', : is the number of references existing in ”'pj”' (degree of output or number of links that come out of ”'pjNote that the function is normalized, that is: The expression for calculating PageRank for all pages can be written in the following matrix form: If we represent by 1 the vector of ones, with the value 1 in all elements, with n rows and one column, we have: Since the sum of all the elements of the vector "'R'" is 1 (that is, "'PR(P1)+PR(P2)+...PR(Pn)" = 1), then the product of "'R'" by the matrix "'E' nxn", with the value 1 in all its elements, is equal to the vector "'1'". Therefore, we can rewrite the expression for calculating R as follows: Factoring out "'R'", we obtain: That is, "'R'" is the eigenvector of the modified adjacency matrix, for the eigenvalue 1, where: Therefore, the PageRank metric can be seen as a variant of the eigenvector centrality metric. === Iterative Calculation of the PageRank Metric === Let's designate the initial PageRank values as "x(0)" and the calculated PageRank values in iteration "t" as "x(t)". In the first iteration "t=0", the value is assigned to all pages, that is, each element of the vector "x(0)" has the value: : In each iteration "t+1", we calculate the value of "x(t+1)" by multiplying the matrix by the vector "x(t)" (PageRank values calculated in the previous iteration): : The matrix has the following properties: * irreducible * primitive * stochastic Based on these properties of , it is proven that "x(t)" converges to the eigenvector "R". The iterative calculation ends when the variation of x(t) to x(t+1) is less than a certain predefined value. This form of calculation is scalable; in a network with 322 million connections, convergence is observed, with reasonable tolerance, in approximately 52 iterations. The convergence speed, in this calculation method, depends on the damping value "d". == Determining the PageRank == To check the PageRank of a given page, there are two options: * Install the [[Google Toolbar]]



Post comment