seedubjay

Beating the Wikipedia Game

3 January 2021

Here’s a game to pass the time while you’re trapped at home 10 months into a pandemic.

Pick two random pages on Wikipedia – your goal is to get from one to the other in as few clicks as possible. (Using the box full of links at the bottom of the page is cheating…)

For example, you can get from Kangaroo to Dave Brubeck (a jazz pianist) in just 3 clicks: KangarooMatchSpringfield, MassachusettsDave Brubeck.

There’s an online game too for those with a competitive streak.

There are over 6.2 million English Wikipedia articles, but this ecosystem is remarkably well connected. In fact, when you randomly pick two pages, there’s an 82.1% chance that you’ll be able to get from one to the other in 5 clicks or less, assuming that it’s possible in the first place.

Dont believe me? Pick any two pages here and see the shortest sequence of clicks you’d need to get between them.

Pages which are ‘obscure’ (i.e. contain less than 20 links or are linked to by less than 20 other pages) have been excluded here for efficiency’s sake.

Breaking down the problem

The sheer volume of Wikipedia is remarkable, but also makes it challenging to analyse since it is so unwieldy. This is especially true for a task like finding a shortest sequence of clicks, as you may have to consider the entirety of Wikipedia all in one go to guarantee that you’ve found the shortest of all possible sequences.

To overcome this, we have to break the problem down.

We don’t care about every word on every Wikipedia page. Instead, we can just think about all of Wikipedia’s links from one page to the next.

The term for this in mathematics is a directed graph. In fact, any collection of ‘things’ and connections between them can be thought of as a directed graph - cross-country road networks, family trees, the electricity grid, etc.

For example, we can visualise a small group of Wikipedia pages and the links from each to the next like this:

Drag to move pages

Doing this transformation is useful for two reasons.

First, picking the right way to represent data gives us a new way to think about the problem. In other words, thinking about lots of dots and arrows is easier than thinking about pages upon pages of Wikipedia text.

Second, converting our obscure problem into a common mathematical format gives us access to a wealth of research and algorithms to help us solve the problem.

Researchers have spent decades developing algorithms and writing code to analyse directed graphs. They may have been designed for a completely different purpose, but they work just as well for navigating Wikipedia pages as they do for finding ancestors in a family tree, optimising trucking routes between cities, and lots of other unrelated problems, since they are all directed graphs under the hood.

We can use this existing knowledge to find our shortest paths and analyse our data in a whole new set of ways.

More tidbits

Not only is there a 82% chance that you’ll be able to get from one page to another in at most 5 clicks, there’s a 99.3% chance you’ll be able to do it in 7 clicks, assuming that it’s possible in the first place.

The final 0.7% of paths gets pretty dicey though – in the very worst case, 48 clicks are required. There are many starting pages that can achieve this, but only one ending page… Yes, as you’ve probably already guessed, it’s the 1948–49 FC Dinamo București season.

Why? Here’s an excerpt of the 48 glorious clicks to get there when starting at Nong Khai railway station, a railway station on the border between Thailand and Laos:

Nong Khai railway stationFirst Thai–Lao Friendship BridgeAsian Highway NetworkBulgariaPFC Levski SofiaList of unbeaten football club seasons1991–92 FC Dinamo București season1990–91 FC Dinamo București season → … → 1952 FC Dinamo București season1951 FC Dinamo București season1950 FC Dinamo București season1948–49 FC Dinamo București season

Spot the pattern?

Ironically, the journey back is completely average, taking just 5 clicks:

1948–49 FC Dinamo București seasonRudolf WetzerGreeceMetre-gauge railwayRail transport in ThailandNong Khai railway station

Another tool we can use to measure this directed graph is the betweenness centrality of each page. This value represents the likelihood that a page gets used in a path between two other randomly chosen pages in the graph. The higher a page’s betweenness centrality, the more central it is in the network, and the more pivotal it is to Wikipedia’s tight connectivity.

Curiously, the shortest path between pages very often involves a country. (You’ll probably notice this if you try enough pages in the search panel above.) In fact, the page for the United States is so common that some variations of the Wikipedia Game ban the page entirely.

So, if a friend ever challenges you to the Wikipedia Game, you’ll have the upper hand! All you need to do is:

  1. download about 20GB of Wikipedia data,
  2. scan through ~20 million pages to find links,
  3. filter out all the redirect pages and other dud pages,
  4. transform ~220 million links into a graph format that is actually usable, and
  5. use any ol’ path-finding algorithm to find a route to get you from one page to the other

In my case processing this takes about 6 hours… So you might need to stall for a while.

(Or, you could just secretly check this page and use the search panel instead…)

Analysis is current as of 20 December 2020.

It's probably worth noting that my system is far from perfect. For instance, it ignores the big boxes full of links at the bottom of most Wikipedia pages (a) because it makes the game much less interesting and (b) because they are really hard to track.

The betweenness centrality ranking is also an estimate since the exact calculation would take over a month to compute.


You're a Wizard, Harry

And other lies computer scientists tell themselves


Why Does JPEG Look So Weird?

And why do we hardly ever notice?