Completing Wikipedia's Hyperlink Structure
 through Dimensionality Reduction

This is the website for a research project on which I've been working as part of my Master's degree in Computer Science at McGill University in Montreal, Canada. Details can be found in the following paper:

Robert West, Doina Precup, and Joelle Pineau. Completing Wikipedia's Hyperlink Structure through Dimensionality Reduction. In Proceedings of the 18th ACM Conference on Information and Knowledge Management (CIKM-09), pp. 1097–1106, Hong Kong, 2009. [PDF]

Abstract

Wikipedia is the largest monolithic repository of human knowledge. In addition to its sheer size, it represents a new encyclopedic paradigm by interconnecting articles through hyperlinks. However, since these links are created by human authors, links one would expect to see are often missing. The goal of this work is to detect such gaps automatically. In this paper, we propose a novel method for augmenting the structure of hyperlinked document collections such as Wikipedia. It does not require the extraction of any manually defined features from the article to be augmented. Instead, it is based on principal component analysis, a well-founded mathematical generalization technique, and predicts new links purely based on the statistical structure of the graph formed by the existing links. Our method does not rely on the textual content of articles; we are exploiting only hyperlinks. A user evaluation of our technique shows that it improves the quality of top link suggestions over the state of the art and that the best predicted links are significantly more valuable than the 'average' link already present in Wikipedia. Beyond link prediction, our algorithm can potentially be used to point out topics an article misses to cover and to cluster articles semantically.

Demo

To show the result of our link suggestion algorithm, we took a reduced version of Wikipedia that contains only the 5,500 most important articles (available at schools-wikipedia.org) and augmented it by adding the 17,000 highest-ranking links as suggested by our method.

That is, every article contains on average about three links that are not present in the original version but that have been proposed by our algorithm. To highlight such new links, they have been colored orange, while the original links stay blue.

You can browse the complete result starting here. The best way of getting an idea is by simply drifting from one article to the next, randomly clicking links. You might start with the article about beer, for instance.

It is important to note that, while this demo is based on a small Wikipedia version, our technique works just as well with the online, 'real' Wikipedia, as shown in the paper linked above.

Evaluation data

For reasons of transparency, I'm publishing the evaluation results obtained on Amazon Mechanical Turk. An interpretation of the results is included in the paper cited above.

Questions? Comments? Ideas?

Just send me an email: . I'm happy to discuss my research with you.

Last modified on December 01, 2009