Random Walkers and Web Crawlers
Brief Overview: This was put together for a class assignment in Probability Theory 2. We were to discuss random walks and how they relate to the PageRank algorithm.
An overview of the relevant mathematics is given, and a small web crawler is implemented to collect links between Wikipedia articles. The links are saved in a SQLite database and later used to construct a graph and rank articles by centrality.
If you’re interested in the code and results, most of that is at the bottom of the article. We begin with a simple overview of the concepts of graphs, centrality, and random walks.
Page Ranking Overview
At its core, Google’s PageRank algorithm is an application of fundamental mathematics. It combines techniques from several disparate but related fields: the theory of graphs and random graphs, linear algebra, probability, statistics, and computer science.
The world wide web is an enormous network, the scale of which few human inventions can match. This fact plays an important role in understanding the PageRank algorithm. Each web page can be thought of as a node in a giant graph, and each hyperlink from one page to another as a directed edge between the corresponding nodes.
The PageRank algorithm assigns a rank to a webpage based on its “centrality”. In essence, the algorithm returns the likelihood of a user arriving at a given page if they were simply making random jumps from one web page to another.
Random Graphs and Centrality
A graph is a pair of sets $G = (V, E)$, where $V$ is a set of vertices (or nodes) and $E$ is a set of ordered pairs $E_{ij} = (V_i, V_j)$ that represent edges connecting those vertices. A random graph is one in which the edges, the nodes, or both are determined probabilistically.
One of the simplest measures of centrality is that of degree centrality. The degree of a node is equal to the number of edges attached to it. On a directed graph, such as the kind we’re concerned with, the degree centrality can be measured in terms of indegree (the number of edges coming into the node) and outdegree (the number of edges coming out of the node). When looking at degree centrality, the nodes with the highest number of edges are the ones that are “most central”.
Another measure of centrality is closeness centrality. We are often concerned with paths on graphs. A path is a sequence $\{p_{1}, l_{1}, p_{2}, l_{2}, \ldots, p_{k-1}, l_{k-1}, p_{k}\}$ of nodes and edges, where each edge connects the previous node to the next. In examining closeness centrality, we are only concerned with paths that do not contain cycles or loops; these are equivalent to self-avoiding walks on a graph. To measure closeness centrality, we examine the average shortest-path distance from a node to all other nodes on the graph. The nodes with the smallest average distance are the most central.
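As a concrete formulation (writing $d(i, j)$ for the length of the shortest path between nodes $i$ and $j$), one common normalization of closeness centrality is:
$C_{C}(i) = \frac{n - 1}{\sum_{j \neq i} d(i, j)}$
so a node that can reach every other node in a small number of hops receives a score close to 1.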
An important measure of centrality for our study of PageRank is eigenvector centrality. Eigenvector centrality utilizes the adjacency matrix of a given network: the $n \times n$ matrix with a 1 in cell $(i, j)$ indicating a connection between nodes $i$ and $j$ and a 0 indicating no connection. The eigenvector centrality score of vertex $i$ is the $i$-th entry of the eigenvector associated with the largest eigenvalue of the adjacency matrix.
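In symbols, with $A$ the adjacency matrix, $\lambda_{\max}$ its largest eigenvalue, and $x$ the corresponding eigenvector (which, by the Perron-Frobenius theorem, can be chosen with nonnegative entries):
$A x = \lambda_{\max} x, \qquad C_{E}(i) = x_{i}$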
There exist many other measures of centrality, such as betweenness centrality, harmonic centrality, percolation centrality, and so on.
PageRank centrality is a modification of Katz centrality. In Katz centrality, the centrality $C(i)$ of node $i$ is equal to:
$C(i) = \sum_{k=1}^{\infty}\sum_{j=1}^{n} \alpha^{k} (A^{k})_{ij}$
Here, $\alpha$ is known as the damping factor and $A$ is the adjacency matrix of the graph in question. The entry $(A^{k})_{ij}$ counts the walks of length $k$ (that is, walks passing through $k - 1$ intermediate nodes) from node $i$ to node $j$, and the factor $\alpha^{k}$ attenuates the contribution of longer walks.
This has a convenient matrix formulation:
$C = ((I - \alpha A^{T})^{-1} - I)J_{n,1} $
Here, $I$ is the identity matrix and $J_{n,1}$ is an $n \times 1$ column vector of ones.
Further details are given in the original Katz paper “A New Status Index Derived From Sociometric Analysis”, which can be found here. Overviews of other centrality measures can be found here and here.
PageRank modifies the Katz centrality measure by acknowledging that some nodes link to other nodes prolifically, and that this can adversely affect measures of centrality. PageRank remedies this with an iterative algorithm that initializes every node’s rank to $1/n$ and then, in each iteration, has each node donate its current rank to the nodes it links to in equal proportion. As a result, pages that link to many other pages donate only a small quantity of rank to each of the nodes they link to.
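Written out, a common form of this per-iteration update (using the damping factor $\alpha$ introduced above, $B(i)$ for the set of pages linking to page $i$, and $L(j)$ for the number of outgoing links on page $j$) is:
$PR_{t+1}(i) = \frac{1 - \alpha}{n} + \alpha \sum_{j \in B(i)} \frac{PR_{t}(j)}{L(j)}$
Each page splits its rank evenly across its outgoing links, which is exactly the equal-proportion donation described above.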
Random Walks and Markov Chains
Any measure of centrality can be normalized (if it is not already). Doing so generates a probability distribution over the nodes in the network, which can be interpreted as the long term probability of a “random walker” arriving at that node.
A random walker is a form of Markov chain. A Markov chain is a stochastic process (a sequence of random variables) $\{X_{n} : n \in \mathbb{N}\}$ such that $P(X_{t} = j \mid X_{0} = i_{0}, X_{1} = i_{1}, \ldots, X_{t-1} = i_{t-1}) = P(X_{t} = j \mid X_{t-1} = i_{t-1})$.
That is, a Markov chain is a system whose next state depends only on its current state, not on the rest of its history. This memoryless property is also known as the Markov property.
A random web surfer can be thought of as a Markov chain whose states are webpages and whose transitions are hyperlinks away from the present webpage, with each outgoing link equally probable. The damping factor is the probability that, at each step, the surfer continues following links; with the remaining probability, the surfer jumps to a page chosen uniformly at random.
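Under this interpretation (and ignoring dangling pages with no outgoing links), the surfer’s transition matrix is a mixture of the link-following behaviour and the uniform random jump. Writing $\hat{A}$ for the adjacency matrix with each row divided by its row sum, $J_{n,n}$ for the all-ones matrix, and $\alpha$ for the damping factor:
$P = \alpha \hat{A} + \frac{1 - \alpha}{n} J_{n,n}$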
A consequence is that the PageRank of a webpage, as determined by this algorithm, is equal to the probability of finding the random surfer at that page as the number of page transitions approaches infinity. This is a result of the Fundamental Theorem of Markov Chains:
(Fundamental Theorem of Markov Chains) For a connected Markov chain (one in which every vertex or state is reachable from every other vertex or state), there exists a unique probability vector $\pi$ satisfying $\pi P = \pi$, where $P$ is the transition probability matrix for the Markov chain. Further, for any starting distribution, $\lim_{t \to \infty} a^{(t)} = \pi$, where $a^{(t)}$ is the average of the walk’s state distributions over the first $t$ steps.
The unique probability vector $\pi$ is known as the stationary distribution for the random walk, and its entries are the values that the PageRank algorithm converges to over time, assuming no changes in the underlying set of pages and links.
The Wikipedia Webcrawler:
This is a very simple web crawler. I’m not super familiar with web crawling, but there are several good posts that I referenced while working, including this one. I used the HTMLAgilityPack library to do much of the heavy lifting - it provides methods for loading web pages and selecting content by ID, class, et cetera.
Note that the crawler doesn’t limit how quickly it accesses Wikipedia. On my end, it runs relatively slowly due to all of the link parsing and database access, so this hasn’t been an issue. If you adapt it to crawl faster, you’ll want to throttle it so it doesn’t hammer Wikipedia’s servers.
Code for the crawler is below:
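What follows is a minimal sketch of that crawler logic rather than the full program: the SQLite table Links(FromArticle, ToArticle), the seed article, the crawl cap, and the link filter are illustrative assumptions.

```csharp
// Minimal crawler sketch (not the original code). Assumes the HtmlAgilityPack and
// Microsoft.Data.Sqlite packages; table/column names and the seed article are illustrative.
using System;
using System.Collections.Generic;
using HtmlAgilityPack;
using Microsoft.Data.Sqlite;

class WikiCrawler
{
    static void Main()
    {
        var toVisit = new Queue<string>();
        var visited = new HashSet<string>();
        toVisit.Enqueue("/wiki/Random_walk");   // arbitrary seed article

        using var db = new SqliteConnection("Data Source=wikilinks.db");
        db.Open();
        new SqliteCommand(
            "CREATE TABLE IF NOT EXISTS Links (FromArticle TEXT, ToArticle TEXT)", db)
            .ExecuteNonQuery();

        var web = new HtmlWeb();
        while (toVisit.Count > 0 && visited.Count < 500)   // cap the crawl
        {
            string current = toVisit.Dequeue();
            if (!visited.Add(current)) continue;           // skip already-crawled articles

            HtmlDocument doc = web.Load("https://en.wikipedia.org" + current);

            // Only follow links inside the article body.
            var anchors = doc.DocumentNode.SelectNodes("//div[@id='bodyContent']//a[@href]");
            if (anchors == null) continue;

            foreach (var anchor in anchors)
            {
                string href = anchor.GetAttributeValue("href", "");
                // Keep only links to other articles (skip Special:, File:, Category:, etc.).
                if (!href.StartsWith("/wiki/") || href.Contains(":")) continue;

                var insert = new SqliteCommand(
                    "INSERT INTO Links (FromArticle, ToArticle) VALUES (@from, @to)", db);
                insert.Parameters.AddWithValue("@from", current);
                insert.Parameters.AddWithValue("@to", href);
                insert.ExecuteNonQuery();

                toVisit.Enqueue(href);
            }
        }
    }
}
```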
Page Ranking:
Using the data collected from running the web crawler for a short while, we’re now going to do some page ranking. When I ran it, I collected 67,672 records of one article linking to another, from crawling a total of 348 articles. I then pared the records down to just the links between pages that were themselves crawled, leaving 19,577 records. This parsing was done with the following SQL:
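A query along the following lines handles that filtering, assuming the Links(FromArticle, ToArticle) table from the crawler sketch above; it keeps only the rows whose target article also appears as a source, i.e. was itself crawled:

```sql
-- Keep only links whose target was itself crawled (appears as a source article).
SELECT FromArticle, ToArticle
FROM Links
WHERE ToArticle IN (SELECT DISTINCT FromArticle FROM Links);
```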
With Python, we can get the degree centrality of each node using the NetworkX library and write it to a file. This gives us a quick, rough approximation of the rankings of our webpages.
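A short script along these lines does the job; the CSV export of the filtered links and the output format are assumptions on my part:

```python
# Sketch of the degree-centrality step. Assumes the filtered links were exported
# from SQLite to a "source,target" CSV file; file names are illustrative.
import csv
import networkx as nx

G = nx.DiGraph()
with open("links.csv", newline="") as f:
    for row in csv.reader(f):
        if len(row) != 2:
            continue
        G.add_edge(row[0], row[1])

# degree_centrality returns degree / (n - 1); for a DiGraph it uses total
# (in + out) degree. Use in_degree_centrality to count only incoming links.
centrality = nx.degree_centrality(G)

with open("degree_centrality.txt", "w") as out:
    for node, score in sorted(centrality.items(), key=lambda kv: kv[1], reverse=True):
        out.write(f"{node}\t{score:.6f}\n")
```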
Below, we implement the simplified PageRank algorithm ourselves in C#. This is still in progress.
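For now, here is a minimal sketch of the iterative scheme described earlier; the in-memory adjacency list, the damping factor of 0.85, and the convergence threshold are illustrative assumptions, not the final implementation.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class SimplePageRank
{
    // links[page] = list of pages that 'page' links to (illustrative in-memory graph).
    public static Dictionary<string, double> Rank(
        Dictionary<string, List<string>> links,
        double damping = 0.85,
        double tolerance = 1e-8)
    {
        var pages = links.Keys.ToList();
        int n = pages.Count;

        // Initialize every page's rank to 1/n.
        var rank = pages.ToDictionary(p => p, p => 1.0 / n);

        double delta;
        do
        {
            // Start each new rank with the random-jump term.
            var next = pages.ToDictionary(p => p, p => (1.0 - damping) / n);

            foreach (var page in pages)
            {
                var outLinks = links[page].Where(next.ContainsKey).ToList();
                if (outLinks.Count == 0)
                {
                    // Dangling page: spread its rank evenly over all pages.
                    foreach (var p in pages)
                        next[p] += damping * rank[page] / n;
                }
                else
                {
                    // Donate rank in equal proportion to each linked page.
                    foreach (var target in outLinks)
                        next[target] += damping * rank[page] / outLinks.Count;
                }
            }

            delta = pages.Sum(p => Math.Abs(next[p] - rank[p]));
            rank = next;
        } while (delta > tolerance);

        return rank;
    }
}
```

Feeding in the filtered Wikipedia links and sorting the resulting dictionary by value would give a ranking directly comparable to the degree-centrality output above.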
To Do Still:
- Implement simplified PageRank iterative algorithm in C#
- Implement PageRank using MCMC (see: this paper)