scikit-learn
# Embed the accompanying YouTube video in the notebook output.
from IPython.display import YouTubeVideo
YouTubeVideo('2lpS6gUwiJQ')
scikit-learn and friends: numpy, scipy, pandas, and bokeh.

Links:
- scikit-learn — becoming the de-facto machine learning library for Python
- scipy and numpy
- pandas
- bokeh
File comes from here: http://figshare.com/articles/reddit_user_posting_behavior/874101
%%bash
# Preview the raw data: one line per user, followed by a comma-separated
# list of the subreddits that user posted in (lines are ragged).
head reddit_user_posting_behavior.csv
import pandas as pd
# Peek at the first 10 rows with pandas. The rows are ragged (a user followed
# by a variable number of subreddits), so we supply explicit column names for
# up to 25 subreddit columns and blank out the missing cells.
# list(range(25)) works on both Python 2 and 3; a bare range(25) is not a
# list on Python 3 and cannot be concatenated with ["user"].
pd.read_csv("reddit_user_posting_behavior.csv", nrows=10, names=["user"]+list(range(25))).fillna("")
%%time
user_ids = []
subreddit_ids = []
subreddit_to_id = {}
i=0
with open("reddit_user_posting_behavior.csv", 'r') as f:
for line in f:
for sr in line.rstrip().split(",")[1:]:
if sr not in subreddit_to_id:
subreddit_to_id[sr] = len(subreddit_to_id)
user_ids.append(i)
subreddit_ids.append(subreddit_to_id[sr])
i+=1
import numpy as np
from scipy.sparse import csr_matrix

# Build a sparse adjacency matrix: one row per subreddit, one column per user,
# with a 1 at (s, u) for each time user u listed subreddit s.
rows = np.array(subreddit_ids)
cols = np.array(user_ids)
data = np.ones((len(user_ids),))
num_rows = len(subreddit_to_id)  # number of distinct subreddits
num_cols = i                     # number of input lines == number of users
# the code above exists to feed this call
adj = csr_matrix( (data,(rows,cols)), shape=(num_rows, num_cols) )
# print() is valid on both Python 2 and 3 (the original `print x` statement
# form is Python-2-only)
print(adj.shape)
print("")
# now we have our matrix, so let's gather up a bit of info about it
# row sums: how many (user, subreddit) memberships each subreddit has
users_per_subreddit = adj.sum(axis=1).A1
# Invert subreddit_to_id into an id -> name array. Ids were assigned
# consecutively from 0 via len(dict), so sorting the names by their id
# reproduces the positional array exactly. (The original mutated a range()
# object in place, which fails on Python 3 where range is immutable.)
subreddits = np.array(sorted(subreddit_to_id, key=subreddit_to_id.get))
Our adjacency matrix is a bit problematic to deal with as-is: it is large and sparse. Fortunately, scikit-learn has a decomposition package — we'll use TruncatedSVD from scikit-learn.
%%time
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import normalize
svd = TruncatedSVD(n_components=100)
embedded_coords = normalize(svd.fit_transform(adj), norm='l1')
print embedded_coords.shape
The output is kind of neat:
%matplotlib inline
# Cumulative explained-variance ratio: how much of the original matrix's
# variance the first k SVD components capture, as k grows to 100.
pd.DataFrame(np.cumsum(svd.explained_variance_ratio_)).plot(figsize=(13, 8))
# this function will show you the axes on which a particular subreddit scores the highest/lowest
def pickOutSubreddit(sr):
    # Locate this subreddit's row in the embedding, then order the 100 axes
    # by that row's score, highest first.
    row_index = list(subreddits).index(sr)
    sorted_axes = embedded_coords[row_index].argsort()[::-1]
    # For each of those axes (in that order), rank every subreddit from
    # highest to lowest score along the axis.
    ranking = np.argsort(embedded_coords[:, sorted_axes], axis=0)[::-1]
    return pd.DataFrame(subreddits[ranking], columns=sorted_axes)
pickOutSubreddit("soccer")
# Hand-picked SVD dimensions with human-readable labels describing what each
# axis appears to separate (labels were inferred by eyeballing the rankings);
# each column lists subreddits from one extreme of that axis to the other.
pd.DataFrame(subreddits[np.argsort(embedded_coords[:,[0, 1, 44,51,84,50,47,40]], axis=0)[::-1]],
columns=[
"0: big - small",
"1: big - small",
"44: soccer - guns",
"51: programming - food",
"84: music - bikes",
"50: osx - books",
"47: wow - starcraft",
"40: male grooming - life hacks"
])
# not shown but also amusing:
# 14: music - pot
# 24: science - porn
# Interactive scatter of subreddits positioned by SVD dimensions 0 and 1;
# hovering a point shows the subreddit name.
import bokeh.plotting as bp
# NOTE(review): bokeh.objects was renamed bokeh.models in later bokeh
# releases — this import pins the notebook to an old bokeh; confirm version.
from bokeh.objects import HoverTool
bp.output_notebook()
# only plot subreddits with more than 100 member-users, to keep the figure legible
row_selector = np.where(users_per_subreddit>100)
bp.figure(plot_width=900, plot_height=700, title="Subreddit Map by Most Informative Dimensions",
x_axis_label = "Dimension 0",
y_axis_label = "Dimension 1",
tools="pan,wheel_zoom,box_zoom,reset,hover,previewsave",
min_border=1)
bp.scatter(
x = embedded_coords[:,0][row_selector],
y = embedded_coords[:,1][row_selector],
# point size grows with the log of the subreddit's user count
radius= np.log2(users_per_subreddit[row_selector])/6000,
source=bp.ColumnDataSource({"subreddit": subreddits[row_selector]})
).select(dict(type=HoverTool)).tooltips = {"/r/":"@subreddit"}
bp.show()
# Same interactive scatter, but on the hand-labelled "interesting" axes:
# dimension 44 (guns vs soccer) against dimension 51 (food vs programming).
bp.figure(plot_width=900, plot_height=700, title="Subreddit Map by Interesting Dimensions",
x_axis_label = "Guns <–> Soccer (Dimension 44)",
y_axis_label = "Food <–> Programming (Dimension 51)",
tools="pan,wheel_zoom,box_zoom,reset,hover,previewsave",
min_border=1)
bp.scatter(
x = embedded_coords[:,44][row_selector],
y = embedded_coords[:,51][row_selector],
# point size grows with the log of the subreddit's user count
radius= np.log2(users_per_subreddit[row_selector])/6000,
source=bp.ColumnDataSource({"subreddit": subreddits[row_selector]})
).select(dict(type=HoverTool)).tooltips = {"/r/":"@subreddit"}
bp.show()
scikit-learn has a clustering package — we'll use KMeans from scikit-learn.
%%time
from scipy.stats import rankdata
embedded_ranks = np.array([rankdata(c) for c in embedded_coords.T]).T
from sklearn.cluster import KMeans
n_clusters = 20
km = KMeans(n_clusters)
clusters = km.fit_predict(embedded_ranks)
pd.DataFrame( [subreddits[clusters == i][users_per_subreddit[clusters == i].argsort()[-6:][::-1]] for i in range(n_clusters)] )
# 20 distinct hex colors, one per KMeans cluster (n_clusters == 20);
# looks like d3's category20 palette — TODO confirm provenance.
colormap = np.array([
"#1f77b4", "#aec7e8", "#ff7f0e", "#ffbb78", "#2ca02c",
"#98df8a", "#d62728", "#ff9896", "#9467bd", "#c5b0d5",
"#8c564b", "#c49c94", "#e377c2", "#f7b6d2", "#7f7f7f",
"#c7c7c7", "#bcbd22", "#dbdb8d", "#17becf", "#9edae5"
])
# The dimension-44/51 scatter again, now colored by KMeans cluster assignment.
bp.figure(plot_width=900, plot_height=700, title="Subreddit Map by Interesting Dimensions",
x_axis_label = "Guns <–> Soccer (Dimension 44)",
y_axis_label = "Food <–> Programming (Dimension 51)",
tools="pan,wheel_zoom,box_zoom,reset,hover,previewsave",
min_border=1)
bp.scatter(
x = embedded_coords[:,44][row_selector],
y = embedded_coords[:,51][row_selector],
# color encodes cluster membership
color= colormap[clusters[row_selector]],
radius= np.log2(users_per_subreddit[row_selector])/6000,
source=bp.ColumnDataSource({"subreddit": subreddits[row_selector]})
).select(dict(type=HoverTool)).tooltips = {"/r/":"@subreddit"}
bp.show()
scikit-learn has a manifold learning package — we'll use TSNE from scikit-learn.
%%time
from sklearn.manifold import TSNE
xycoords = TSNE().fit_transform(embedded_coords[row_selector])
# Scatter the t-SNE 2-D coordinates, colored by the earlier KMeans clusters.
# The axes are suppressed because t-SNE coordinates have no intrinsic meaning.
bp.figure(plot_width=900, plot_height=700, title="Subreddit Map by t-SNE",
tools="pan,wheel_zoom,box_zoom,reset,hover,previewsave",
x_axis_type=None, y_axis_type=None, min_border=1)
bp.scatter(
x = xycoords[:,0],
y = xycoords[:,1],
color= colormap[clusters[row_selector]],
# note the /60 (vs /6000 above): t-SNE output spans a much larger range
radius= np.log2(users_per_subreddit[row_selector])/60,
source=bp.ColumnDataSource({"subreddit": subreddits[row_selector]})
).select(dict(type=HoverTool)).tooltips = {"/r/":"@subreddit"}
bp.show()
To recap, we used:

- TruncatedSVD to turn our matrix into a surprisingly informative smaller matrix
- KMeans to group our subreddits into some clusters
- TSNE to get coordinates for a scatterplot
- bokeh to make an interactive graphic

TruncatedSVD and TSNE worked as before. TruncatedSVD, KMeans and TSNE are some of my go-to algorithms, but each belongs to a family with other options. scikit-learn is nice, but some of these algorithms can be slow and hard to use.