You're a Wizard, Harry
And other lies computer scientists tell themselves
Here’s a game to pass the time while you’re trapped at home 10 months into a pandemic.
Pick two random pages on Wikipedia – your goal is to get from one to the other in as few clicks as possible. (Using the box full of links at the bottom of the page is cheating…)
There’s an online game too for those with a competitive streak.
There are over 6.2 million English Wikipedia articles, but this ecosystem is remarkably well connected. In fact, when you randomly pick two pages, there’s an 82.1% chance that you’ll be able to get from one to the other in 5 clicks or less, assuming that it’s possible in the first place.
Dont believe me? Pick any two pages here and see the shortest sequence of clicks you’d need to get between them.
Pages which are ‘obscure’ (i.e. contain less than 20 links or are linked to by less than 20 other pages) have been excluded here for efficiency’s sake.
The sheer volume of Wikipedia is remarkable, but also makes it challenging to analyse since it is so unwieldy. This is especially true for a task like finding a shortest sequence of clicks, as you may have to consider the entirety of Wikipedia all in one go to guarantee that you’ve found the shortest of all possible sequences.
To overcome this, we have to break the problem down.
We don’t care about every word on every Wikipedia page. Instead, we can just think about all of Wikipedia’s links from one page to the next.
The term for this in mathematics is a directed graph. In fact, any collection of ‘things’ and connections between them can be thought of as a directed graph - cross-country road networks, family trees, the electricity grid, etc.
For example, we can visualise a small group of Wikipedia pages and the links from each to the next like this:
Doing this transformation is useful for two reasons.
First, picking the right way to represent data gives us a new way to think about the problem. In other words, thinking about lots of dots and arrows is easier than thinking about pages upon pages of Wikipedia text.
Second, converting our obscure problem into a common mathematical format gives us access to a wealth of research and algorithms to help us solve the problem.
Researchers have spent decades developing algorithms and writing code to analyse directed graphs. They may have been designed for a completely different purpose, but they work just as well for navigating Wikipedia pages as they do for finding ancestors in a family tree, optimising trucking routes between cities, and lots of other unrelated problems, since they are all directed graphs under the hood.
We can use this existing knowledge to find our shortest paths and analyse our data in a whole new set of ways.
Not only is there a 82% chance that you’ll be able to get from one page to another in at most 5 clicks, there’s a 99.3% chance you’ll be able to do it in 7 clicks, assuming that it’s possible in the first place.
The final 0.7% of paths gets pretty dicey though – in the very worst case, 48 clicks are required. There are many starting pages that can achieve this, but only one ending page… Yes, as you’ve probably already guessed, it’s the 1948–49 FC Dinamo București season.
Why? Here’s an excerpt of the 48 glorious clicks to get there when starting at Nong Khai railway station, a railway station on the border between Thailand and Laos:
Nong Khai railway station → First Thai–Lao Friendship Bridge → Asian Highway Network → Bulgaria → PFC Levski Sofia → List of unbeaten football club seasons → 1991–92 FC Dinamo București season → 1990–91 FC Dinamo București season → … → 1952 FC Dinamo București season → 1951 FC Dinamo București season → 1950 FC Dinamo București season → 1948–49 FC Dinamo București season
Spot the pattern?
Ironically, the journey back is completely average, taking just 5 clicks:
Another tool we can use to measure this directed graph is the betweenness centrality of each page. This value represents the likelihood that a page gets used in a path between two other randomly chosen pages in the graph. The higher a page’s betweenness centrality, the more central it is in the network, and the more pivotal it is to Wikipedia’s tight connectivity.
Curiously, the shortest path between pages very often involves a country. (You’ll probably notice this if you try enough pages in the search panel above.) In fact, the page for the United States is so common that some variations of the Wikipedia Game ban the page entirely.
So, if a friend ever challenges you to the Wikipedia Game, you’ll have the upper hand! All you need to do is:
In my case processing this takes about 6 hours… So you might need to stall for a while.
(Or, you could just secretly check this page and use the search panel instead…)
Analysis is current as of 20 December 2020.
It's probably worth noting that my system is far from perfect. For instance, it ignores the big boxes full of links at the bottom of most Wikipedia pages (a) because it makes the game much less interesting and (b) because they are really hard to track.
The betweenness centrality ranking is also an estimate since the exact calculation would take over a month to compute.