Frequently Asked Questions

What am I looking at?

Each dot represents a subreddit (a subsection of the discussion website Reddit), and you can mouse over it to see its name. Dot positions were determined by the excellent t-SNE algorithm: the distance between two dots is non-linearly related to how similar the posting patterns are among the users who posted to the corresponding subreddits. The size of each dot is related to the number of posts its subreddit receives. The colour of each dot was assigned by the K-Means clustering algorithm. A final technical note: the subreddits were first embedded in a 100-dimensional space with the Truncated SVD algorithm, and both K-Means and t-SNE were then run on the coordinates of the subreddits in that space (see below for full details).
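The three steps described above can be sketched with scikit-learn roughly as follows. This is a minimal illustration, not the code that produced the graphic: the function name embed_subreddits, the random stand-in matrix, the cluster count, the t-SNE perplexity and the reduced dimension used in the demo call are all assumptions chosen so the example runs quickly.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.manifold import TSNE


def embed_subreddits(post_counts, n_dims=100, n_clusters=10, seed=0):
    """Hypothetical helper: rows are subreddits, columns are users."""
    # Step 1: embed every subreddit in an n_dims-dimensional space.
    svd = TruncatedSVD(n_components=n_dims, random_state=seed)
    coords = svd.fit_transform(post_counts)

    # Step 2: cluster in that space; the labels drive the dot colours.
    # n_clusters is an assumption, the source does not state the value used.
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    labels = kmeans.fit_predict(coords)

    # Step 3: reduce the same coordinates to 2 dimensions for plotting.
    # perplexity=10 is an assumption that suits the small demo matrix.
    tsne = TSNE(n_components=2, perplexity=10, init="random", random_state=seed)
    xy = tsne.fit_transform(coords)
    return coords, labels, xy


# Stand-in data (an assumption): 60 subreddits by 200 users, random post counts.
rng = np.random.default_rng(0)
post_counts = rng.poisson(0.3, size=(60, 200)).astype(float)

# The toy matrix has only 60 rows, so the demo uses 20 dimensions, not 100.
coords, labels, xy = embed_subreddits(post_counts, n_dims=20, n_clusters=4)
```

Note that K-Means and t-SNE both receive the SVD coordinates, so the colours reflect structure in the higher-dimensional space rather than in the final 2-dimensional layout.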

Some interesting patterns emerge when you move your cursor over the dots and look at what appears near what! In particular, the colours show broad patterns (e.g. light blue contains a lot of sports, dark green contains many video games) and the positions show some finer patterns (e.g. individual sports like soccer and football, and individual video games, form distinct clusters).

Note: some subreddit names may be offensive and/or cryptic to non-users of Reddit and no effort was made to make the names 'safe for work'.

Why is X (not) near Y?

Some groups of subreddits show up in unintuitive locations: for example, the soccer, football and hockey groups are not near the other sports in the graphic above, even though they are in the same cluster (light blue). Being in the same cluster suggests that they are near each other in the 100-dimensional space generated by the SVD step, so why are they not near each other after the t-SNE step?

This may be due to some unintuitive pattern in the data, or just to the way the t-SNE algorithm worked through its constraints to reduce a 100-dimensional space to 2 dimensions. To understand these constraints by analogy, think about a typical world map with the Americas on the left, which is a 2-dimensional representation of a globe: why is western Alaska not near eastern Russia? Because the globe's surface had to get 'cut' somewhere to fit into 2 dimensions.

Where did the data come from?

The data and a description of how it was collected can be found here. It was collected by the authors of a paper called "Navigating the massive world of reddit", who were not involved in the generation of the above graphic. The authors of this paper have their own interactive map here.

What technology was used here?

The plot was rendered using the Bokeh Python plotting library. The data pipeline that produced this plot was implemented using scikit-learn and is shown in the graphic below. The production of this graphic was the topic of a detailed presentation at Montreal Python on Dec 1, 2014.
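For readers curious how such a plot might be put together in Bokeh, here is a small sketch of a scatter plot with hover labels. It is not the original rendering code: the coordinates, post counts and colours are made-up placeholder values, and the square-root scaling of dot size is an assumption, since the exact relationship between dot size and post count is not specified here.

```python
import math

from bokeh.models import ColumnDataSource, HoverTool
from bokeh.plotting import figure

# Placeholder values (assumptions), standing in for the real t-SNE output,
# post counts and K-Means cluster colours.
names = ["soccer", "football", "hockey", "gaming"]
x = [0.0, 0.4, 3.0, -2.0]
y = [0.0, 0.3, 2.5, 1.0]
posts = [400, 900, 250, 1600]
colors = ["#87ceeb", "#87ceeb", "#87ceeb", "#006400"]

source = ColumnDataSource(data=dict(
    x=x,
    y=y,
    name=names,
    # Assumed scaling: dot size grows with the square root of the post count.
    size=[math.sqrt(p) / 2 for p in posts],
    color=colors,
))

plot = figure(title="Subreddit map (illustrative)", tools="pan,wheel_zoom,reset")
plot.scatter("x", "y", size="size", color="color", alpha=0.7, source=source)

# Show the subreddit name when the cursor is over a dot.
plot.add_tools(HoverTool(tooltips=[("subreddit", "@name")]))
```

Passing the plot to bokeh.plotting.show would open it in a browser; that call is left out so the sketch has no side effects.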

Who is behind this?

This graphic was produced by Nicolas Kruchten from Datacratic.