Google’s PageRank (Sort of) Explained

Posted on December 21, 2007
Filed Under Algorithms, Computer Science, Mathematics |

In this post we’ll take a look at the algorithm which defines Google’s pagerank (PR) analysis. A webpage’s PR is a number between 0 and 10 that Google uses to estimate the usefulness of that page. For example, CNN.com has a PR of 9. A “typical” web site might have a PR of 5. Pages with spam or malicious content get a PR of 0, also known as the page rank of death, feared by all spammers worldwide.

So, how does Google go about determining the rank of a page? The idea is as follows. Suppose you were looking for a doctor to treat some disease you have. You would probably go about asking for recommendations from different people you know. The doctor who got the most recommendations would be the one, right? Well, almost. You would naturally weight each reference by the importance of the person who gave it. The recommendation of your brother, who is a doctor himself, will probably be given a more significant weight than, say, the recommendation given to you by your friend who is really just a farmer. The same idea governs the PR algorithm used by Google.

Let’s take a simple example which will make things clearer. Take a look at the “mini-Internet” I’ve drawn in this figure:

Miniature Internet Model: The Internet as a Graph

My mini-Internet has 3 sites, labeled A, B, and C. I’ve drawn them as a graph. The sites link to each other using regular hyperlinks, which I drew using arrows. An arrow between site A and B means that A has a link leading to site B.

The importance of a site, according to the Google PR algorithm, is obtained by considering each site that links to it. Each referring site’s contribution to this “importance score” is proportional to the number of total outgoing links of the referring site, times to the importance of the linking site. Thus, the more a certain site “recommends” other sites (i.e., links to other sites), the less important is each recommendation (each link). This is much like in real life: the words of a person who speaks little have more weight attached to them than the words of a person who speaks a lot.

The situation described above may seem cyclical (a site’s importance is determined by the importance of the sites linking to it), but it becomes quite straightforward when put into equations. Let’s try it. We denote by P_i the page rank of site i, where i=A,B,C. We also denote by N_i the number of outgoing links from site i, so N_A = 2, N_B = 2, N_C = 1. The page rank of each site is given by:

 P_i = \sum_j \frac{P_j}{N_j}

where the sum ranges over all sites j linking to site i. Explicitly, we have for our mini-Internet:

P_A = \frac{P_B}{N_B} \\ P_B = \frac{P_A}{N_A} + \frac{P_C}{N_C} \\ P_C = \frac{P_A}{N_A} + \frac{P_B}{N_B}

(note, for example, that site C does not appear in the first equation because C doesn’t link to A.)
Plugging numbers into the N’s:

P_A = \frac{P_B}{2} \\ P_B = \frac{P_A}{2} + \frac{P_C}{1} \\ P_C = \frac{P_A}{2} + \frac{P_B}{2}

Those of you who are keen on linear algebra might write this using matrix notation as follows:

\left( P_A, P_B, P_C \right) = \left( P_A, P_B, P_C \right) \left( \begin{array}{ccc} 0 & \frac{1}{2} & \frac{1}{2} \\ \frac{1}{2} & 0 & \frac{1}{2} \\ 0 & 1 & 0<tex> \end{array} \right) ” class=”simple” /></tex></p>
<p><img src=

“AHA!” says you, “an eigenvalue equation!” - and, indeed, what we have obtained here is a standard eigenvalue equation of the sort you meet in linear algebra every day. Basically what Google does in order to compute a page’s PR is solve this eigenvalue equation.

Reality Check

In reality, though, things are somewhat more complicated. First of all, the web consists of billions of pages - about 8 last time I Googled the number (no pun intended) - and solving an eigenvalue with an 8,000,000,000 x 8,000,000,000 matrix is quite insane. Google uses an iterative solution, which computes the pageranks in several attempts, improving precision with each iteration, until a certain threshold is reached. Furthermore storing and manipulating such large matrices is an issue all in itself, and one needs to simply glance at a typical Google server installation to get a drift of the computing power needed to get things done (about 450,000, according to a recent New York Times estimate. Yep, you read that number right!).

Improvements on Basic Notion

The basic PR algorithm outlined above is a much simplified version of the one Google uses. The actual algorithm is a trade secret, but here are some potential problems which require a modification of the above scheme:

There are of course many other things to consider - I am no Google expert, mind you - but I hope I have given you a small taste of a very concrete application of linear algebra to our modern world.

Comments

Leave a Reply