The Dao of Wikipedia: Extracting Knowledge from the Structure of Wikilinks

PhD candidate Cristian Consonni

24 October 2019

Location: Polo Ferrari 1 - Via Sommarive 5, Povo (TN) - Room Garda
Time: 09:00 am

PhD Candidate

Cristian Consonni

Abstract of Dissertation

Wikipedia is a multilingual encyclopedia written collaboratively by volunteers online, and it is now the largest, most visited encyclopedia in existence. Wikipedia has arisen through the self-organized collaboration of contributors, and since its launch in January 2001, its potential as a research resource has become apparent to scientists, its appeal lying in the fact that it strikes a middle ground between accurate, manually created, limited-coverage resources, and noisy knowledge mined from the web. For this reason, Wikipedia's content has been exploited for a variety of applications: to build knowledge bases, to study interactions between users on the Internet, and to investigate social and cultural issues such as gender bias in history, or the spreading of information.

Similarly to what happened for the Web at large, a structure has emerged from the collaborative creation of Wikipedia: its articles contain hundreds of millions of links. In Wikipedia parlance, these internal links are called wikilinks. These connections explain the topics being covered in articles and provide a way to navigate between different subjects, contextualizing the information, and making additional information available.

In this thesis, we argue that the information contained in the link structure of Wikipedia can be harnessed to gain useful insights by extracting it with dedicated algorithms. More prosaically, in this thesis, we explore the link structure of Wikipedia with new methods.

In the first part, we discuss in depth the characteristics of Wikipedia, and we describe the process and challenges we have faced to extract the network of links. Since Wikipedia is available in several language editions and its entire edition history is publicly available, we have extracted the wikilink network at various points in time, and we have performed data integration to improve its quality.

In the second part, we show that the wikilink network can be effectively used to find the most relevant pages related to an article provided by the user. We introduce a novel algorithm, called CycleRank, that takes advantage of the link structure of Wikipedia considering cyckes of links, thus giving weight to both incoming and outgoing connections, to produce a ranking of articles with respect to an article chosen by the user.

In the last part, we explore applications of CycleRank. First, we describe the Engineroom EU project, where we faced the challenge to find which were the most relevant Wikipedia pages connected to the Wikipedia article about the Internet.

Finally, we present another contribution using Wikipedia article accesses to estimate how the information about diseases propagates.

In conclusion, with this thesis, we wanted to show that browsing Wikipedia's wikilinks is not only fascinating and serendipitous, but it is an effective way to extract useful information that is latent in the user-generated encyclopedia.

Contact: ict.school [at] unitn.it (ICT International Doctoral School)