Data Science and (Unsupervised) Machine Learning with scikit-learn

By at Datacratic

Presented Dec 1, 2014 at Montreal Python

In [216]:
from IPython.display import YouTubeVideo
YouTubeVideo('2lpS6gUwiJQ')
Out[216]:
  • Proud to be a Montreal startup
  • Specializing in machine learning
  • Founded in 2009
  • 27 employees downtown (Peel Metro), 1 in Ottawa, 3 in NYC

This talk

  • A different way to look at graph analysis and visualization,
  • as an introduction to a few cool algorithms: Truncated SVD, K-Means and t-SNE
  • with a practical walkthrough using scikit-learn and its friends numpy and bokeh,
  • and finishing off with some more general commentary on this approach to data analysis.

A map of Reddit

  • Reddit is "the front page of the internet"
  • Basically a discussion board, with sub-boards called subreddits
  • Figure from this paper: Navigating the massive world of reddit: Using backbone networks to map user interests in social media
  • Seems like a totally natural approach but I want to show another way of doing this

scikit-learn and friends

  • scikit-learn is becoming the de-facto machine learning library for Python
  • Works in conjunction with scipy and numpy
  • Part of the PyData toolbox, along with pandas and bokeh

A first look at the data

File comes from here: http://figshare.com/articles/reddit_user_posting_behavior/874101

In [177]:
%%bash
head reddit_user_posting_behavior.csv
603,politics,trees,pics
604,Metal,AskReddit,tattoos,redditguild,WTF,cocktails,pics,funny,gaming,Fitness,mcservers,TeraOnline,GetMotivated,itookapicture,Paleo,trackers,Minecraft,gainit
605,politics,IAmA,AdviceAnimals,movies,smallbusiness,Republican,todayilearned,AskReddit,WTF,IWantOut,pics,funny,DIY,Frugal,relationships,atheism,Jeep,Music,grandrapids,reddit.com,videos,yoga,GetMotivated,bestof,ShitRedditSays,gifs,technology,aww
606,CrohnsDisease,birthcontrol,IAmA,AdviceAnimals,AskReddit,Endo,WTF,TwoXChromosomes,pics,funny,Jeep,Mustang,4x4,CCW,dogpictures,Cartalk,aww
607,space,Fitment,cars,Economics,Libertarian,240sx,UserCars,AskReddit,WTF,Autos,formula1,pics,funny,bodybuilding,gaming,Drifting,Justrolledintotheshop,atheism,gadgets,videos,business,gamernews,Cartalk,worldnews,carporn,technology,motorsports,Nissan,startrek
608,politics,Flagstaff,Rainmeter,fffffffuuuuuuuuuuuu,pcgaming,screenshots,truegaming,AdviceAnimals,Guildwars2,gonewild,gamingsuggestions,Games,AskReddit,dubstep,skyrim,SuggestALaptop,battlefield3,WTF,starcraft,creepy,pics,funny,darksouls,books,gaming,mw3,hentai,halo,atheism,magicTCG,swtor,SOPA,anime,IndieGaming,Jokes,wow,gifs,Design,NAU,Android,technology,Minecraft,aww,GameDeals,playitforward,pokemon
609,Clarinet,AdviceAnimals,festivals,SubredditDrama,InternetAMA,AskReddit,aves,cringe,MemesIRL,Music,AmISexy,electricdaisycarnival,ForeverAlone
610,RedHotChiliPeppers,fffffffuuuuuuuuuuuu,tifu,civ,gameofthrones,IAmA,AdviceAnimals,movies,explainlikeimfive,SubredditDrama,gonewild,todayilearned,trees,AskReddit,soccer,skyrim,WTF,germany,pics,funny,seduction,circlebroke,sto,gaming,4chan,atheism,circlejerk,Music,apple,cats,videos,John_Frusciante,minimalism,trackers,worldnews,gifs,beermoney,Android,technology,startrek,Frisson
611,beertrade,AskReddit,WTF,beer,batman,BBQ,beerporn,Homebrewing
612,politics,2012Elections,Parenting,IAmA,fresno,picrequests,AskReddit,loseit,WTF,Marriage,Mommit,pics,funny,VirginiaTech,loseit_classic,RedditLaqueristas,atheism,LadyBoners,GradSchool

Formatting it a little bit...

In [178]:
import pandas as pd 

pd.read_csv("reddit_user_posting_behavior.csv", nrows=10, names=["user"] + list(range(25))).fillna("")
Out[178]:
user 0 1 2 3 4 5 6 7 8 ... 15 16 17 18 19 20 21 22 23 24
0 603 politics trees pics ...
1 604 Metal AskReddit tattoos redditguild WTF cocktails pics funny gaming ... trackers Minecraft gainit
2 605 politics IAmA AdviceAnimals movies smallbusiness Republican todayilearned AskReddit WTF ... atheism Jeep Music grandrapids reddit.com videos yoga GetMotivated bestof ShitRedditSays
3 606 CrohnsDisease birthcontrol IAmA AdviceAnimals AskReddit Endo WTF TwoXChromosomes pics ... Cartalk aww
4 607 space Fitment cars Economics Libertarian 240sx UserCars AskReddit WTF ... Drifting Justrolledintotheshop atheism gadgets videos business gamernews Cartalk worldnews carporn
5 608 politics Flagstaff Rainmeter fffffffuuuuuuuuuuuu pcgaming screenshots truegaming AdviceAnimals Guildwars2 ... SuggestALaptop battlefield3 WTF starcraft creepy pics funny darksouls books gaming
6 609 Clarinet AdviceAnimals festivals SubredditDrama InternetAMA AskReddit aves cringe MemesIRL ...
7 610 RedHotChiliPeppers fffffffuuuuuuuuuuuu tifu civ gameofthrones IAmA AdviceAnimals movies explainlikeimfive ... skyrim WTF germany pics funny seduction circlebroke sto gaming 4chan
8 611 beertrade AskReddit WTF beer batman BBQ beerporn Homebrewing ...
9 612 politics 2012Elections Parenting IAmA fresno picrequests AskReddit loseit WTF ... RedditLaqueristas atheism LadyBoners GradSchool

10 rows × 26 columns

Loading the data into a sparse matrix

  • We want a matrix: rows are subreddits, columns are users
    • cells will be 1 if user posted to subreddit, otherwise 0
  • This step was the most technically challenging
  • Naïve first few tries: 20-minute run time
  • The code below runs in about 10 seconds
In [179]:
%%time
user_ids = []
subreddit_ids = []
subreddit_to_id = {}
i=0
with open("reddit_user_posting_behavior.csv", 'r') as f:
    for line in f:
        for sr in line.rstrip().split(",")[1:]: 
            if sr not in subreddit_to_id: 
                subreddit_to_id[sr] = len(subreddit_to_id)
            user_ids.append(i)
            subreddit_ids.append(subreddit_to_id[sr])
        i+=1
  
import numpy as np
from scipy.sparse import csr_matrix 

rows = np.array(subreddit_ids)
cols = np.array(user_ids)
data = np.ones((len(user_ids),))
num_rows = len(subreddit_to_id)
num_cols = i

# the code above exists to feed this call
adj = csr_matrix( (data,(rows,cols)), shape=(num_rows, num_cols) )
print(adj.shape)
print("")

# now we have our matrix, so let's gather up a bit of info about it
users_per_subreddit = adj.sum(axis=1).A1
# invert the name -> id mapping so that subreddits[id] gives the name
subreddits = [None] * len(subreddit_to_id)
for sr in subreddit_to_id:
    subreddits[subreddit_to_id[sr]] = sr
subreddits = np.array(subreddits)
(15122, 876961)

CPU times: user 9.77 s, sys: 288 ms, total: 10.1 s
Wall time: 10.1 s
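
To make the `csr_matrix( (data,(rows,cols)), shape=... )` call easier to picture, here is a minimal sketch of the same construction on three made-up lines. The `toy_*` names and the data are hypothetical, not taken from the real file.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical toy input: "user_id,subreddit,subreddit,..." like the real CSV
toy_lines = [
    "1,pics,funny",
    "2,funny,gaming",
    "3,pics",
]

toy_rows, toy_cols, toy_sr_to_id = [], [], {}
for user_idx, line in enumerate(toy_lines):
    for sr in line.split(",")[1:]:
        # assign the next free integer id the first time a subreddit is seen
        sr_id = toy_sr_to_id.setdefault(sr, len(toy_sr_to_id))
        toy_rows.append(sr_id)    # row index = subreddit
        toy_cols.append(user_idx) # column index = user

toy_adj = csr_matrix(
    (np.ones(len(toy_rows)), (toy_rows, toy_cols)),
    shape=(len(toy_sr_to_id), len(toy_lines)),
)
print(toy_adj.toarray())
```

One thing to keep in mind with this constructor: repeated (row, column) pairs are summed, so a user listing the same subreddit twice would produce a 2 rather than a 1.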

Unwieldy data

Our adjacency matrix is a bit problematic to deal with as-is:

  • It's very wide: about 877,000 columns
  • It's very sparse: only about 0.06% full
  • It's a binary matrix: only 0's and 1's
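
A quick way to check numbers like these is to compare the count of stored entries to the total cell count. The sketch below does this on a tiny made-up matrix; the `density` helper is a hypothetical name, not part of scipy.

```python
import numpy as np
from scipy.sparse import csr_matrix

def density(m):
    """Fraction of cells in sparse matrix `m` that are explicitly stored."""
    n_rows, n_cols = m.shape
    return m.nnz / float(n_rows * n_cols)

# Hypothetical 2 x 4 binary matrix with two non-zero cells
toy = csr_matrix(np.array([[1, 0, 0, 0],
                           [0, 0, 1, 0]]))
print(toy.shape)        # width is the second entry
print(density(toy))     # fraction of cells that are filled
print(np.unique(toy.data))  # distinct stored values
```

Calling `density(adj)` on the matrix built above should give the fill fraction for the real data.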

Dimensionality reduction

  • Family of algorithms for solving this problem
  • AKA decomposition, compression, feature extraction
    • it's to big matrices what JPEG is to photos, MP3 to music, MPEG to video, etc.
  • The output will be as wide as we ask for
    • The wider it is the less lossy the compression will be
  • The output matrix will be dense
  • The output matrix will have continuous real values
  • scikit-learn has a decomposition package

We'll use TruncatedSVD

  • TruncatedSVD from scikit-learn
  • SVD stands for Singular Value Decomposition, Truncated because we only want part of the computation
  • Good mathy description of how it works in Chapter 11 of online book Mining of Massive Datasets
  • Right now let's focus on how to use it and explore the output
In [180]:
%%time
from sklearn.decomposition import TruncatedSVD 
from sklearn.preprocessing import normalize 

svd = TruncatedSVD(n_components=100)
embedded_coords = normalize(svd.fit_transform(adj), norm='l1')
print(embedded_coords.shape)
(15122, 100)
CPU times: user 1min 8s, sys: 4.94 s, total: 1min 13s
Wall time: 1min 14s

The output is kind of neat:

  • Each row is like a set of coordinates in a 100-dimensional space for a subreddit
  • Each column defines one axis of this 100-dimensional space, ordered by how much information they capture
  • We can look at how much of the original matrix we captured with the first N dimensions
  • The first 2 capture around 25%
  • The 100 we will use capture around 60%
In [181]:
%matplotlib inline
pd.DataFrame(np.cumsum(svd.explained_variance_ratio_)).plot(figsize=(13, 8))
Out[181]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f4fdd633a50>
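
To read specific values off a curve like this rather than eyeballing the plot, the cumulative sum can be indexed directly. Here is a minimal sketch on a random sparse matrix. The data, sizes and `toy_*` names are made up, so the printed percentages say nothing about the Reddit matrix.

```python
import numpy as np
from scipy.sparse import random as sparse_random
from sklearn.decomposition import TruncatedSVD

# Hypothetical stand-in for the adjacency matrix: 200 rows, 1000 columns, 5% full
toy = sparse_random(200, 1000, density=0.05, format="csr", random_state=0)

toy_svd = TruncatedSVD(n_components=10, random_state=0)
toy_coords = toy_svd.fit_transform(toy)  # dense array, one row per input row

# cumulative[n - 1] is the share of variance captured by the first n components
cumulative = np.cumsum(toy_svd.explained_variance_ratio_)
print(toy_coords.shape)
print("first 2 components: %.1f%%" % (100 * cumulative[1]))
print("all 10 components:  %.1f%%" % (100 * cumulative[-1]))
```

With the notebook's fitted `svd`, the same indexing (`np.cumsum(svd.explained_variance_ratio_)[1]` and `[-1]`) gives the figures quoted for 2 and 100 dimensions.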
In [190]:
# this function will show you the axes on which a particular subreddit scores the highest/lowest
def pickOutSubreddit(sr):
    sorted_axes = embedded_coords[list(subreddits).index(sr)].argsort()[::-1]
    return pd.DataFrame(subreddits[np.argsort(embedded_coords[:,sorted_axes], axis=0)[::-1]], columns=sorted_axes)

pickOutSubreddit("soccer")
Out[190]:
44 46 45 40 0 42 26 27 47 21 ... 62 52 43 54 69 61 50 49 22 25
0 bevandele PrettyOlderWomen ParisSG malefashionadvice AskReddit DoesAnybodyElse CraftMasters blackmailporn ShitLiamDoes ManchesterVegan ... knockoutgifs JapaneseFiction gameofthrones applehelp PicsOfJithSleeping outercourse osx ShitLiamDoes gonewildstories CraftMasters
1 ParisSG soccerbot bevandele malefashion funny UrethraPorn Nbz underboob silverandguns asutosh ... outercourse davidfosterwallace shewantstofuck macsetups EdmontonOilers mindashq simpleios techsupport BDSMpersonals Nbz
2 soccerbot bevandele FootballMedia goodyearwelt pics HotDogPorn pokemon Sexfight techsupport goatedition ... mindashq whatsthatbook asoiaf apple hockeylayouttest NBA2k ios bubbleswithfaces BDSMcommunity pokemon
3 FootballMedia FTLStrikers soccerbot rawdenim WTF SpidersGoneWild pokemonteams gangbang uscgames GOAT_TRUTH ... Articles memphismayfire aSongOfMemesAndRage ios hockey StateFarm retina talesofmybowels bdsm vanguardtips
4 FTLStrikers ParisSG FTLStrikers frugalmalefashion gaming SomeRandomReddit vanguardtips The_Porn_Family Goblinism worldnews ... postcolonialism Earlyjazz asoiafcirclejerk retina hockeyplayers nba iOSProgramming 24hoursupport polyamory AsianFeet
5 PrettyOlderWomen FootballMedia PrettyOlderWomen yusufcirclejerk AdviceAnimals FuckingFish AsianFeet ChangingRooms Transmogrification sticknpokes ... PeterL alt_lit AGOTBoardGame jasmineapp leafs Basketball appletv ouch BDSMGW thongsandals
6 soccer reddevils coys AustralianMFA IAmA SnooPorn thongsandals CumFacials wow CatsInBusinessAttire ... Basketball walterjohnson tesothemereddit ipad Habs BasketballTips mac Buttcoin JL2579 ebonyfeet
7 reddevils LiverpoolFC footballtactics mfacirclejerk videos FlyingFuck ebonyfeet wetandwild wowscrolls douchebagfoundation ... AskMen booklists TGOD AlienBlue rangers heat apple IWantToBeAMod chickflixxx TruePokemon
8 Gunners soccer soccer shittymspaints todayilearned KittyPorn EvolutionofKits TVnewsbabes WowUI nnDMT ... BasketballTips artshub kdubz1298 mac canucks lakers macapps nexusq gonewildaudio pokemonteams
9 FantasyPL MCFC Gunners europeanmalefashion atheism gnomewild ShinyPokemon LolitaCheng WoWGoldMaking AustralianFilm ... StateFarm Adamphotoshopped asoiafreread iphone DetroitRedWings benchgifs applehelp prestashop pregnant pokemontrades
10 coys Gunners soccercirclejerk malehairadvice Omaha HobosGoneWild nuzlocke booty_gifs wowstrat AscensionOmaha ... Mavericks metacsec Dreadfort commonsense AnaheimDucks mildlyimpressive macsetups sergaljerk sexpertslounge pokemonrng
11 Barca chelseafc reddevils ldshistory aww PicsOfHorseVaginas Pokemongiveaway SpyShots woweconomy JKP ... benchgifs breakingbad CK2GameOfthrones appletv losangeleskings LAClippers iPhoneDev purple DeadBedrooms EvolutionofKits
12 chelseafc Fifa13 Barca malefoodadvice UIUC HalloweenPorn pokemonrng plugged wowguilds indiansports ... NBA2k mildlyinteresting CTI GeekTool OttawaSenators NBASpurs ipad labradoodles sex pokemonarts
13 realmadrid realmadrid football itafterdark houston WeirdSubreddits pokemontrades exgirlfriendpictures WoWStreams iknewgoatsweretrouble ... lakers laundryview WesterosiProblems JamieChung penguins DavidsQuotes WebApps TrackThrows TwoXSex Pokemoncollege
14 LiverpoolFC FantasyPL NUFC TeenMFA Columbus WhyWouldYouFuckThat stunfisk RedHeelsGW wowpodcasts traphentai ... nba vinyldjs Asoiafspoilersall jailbreak BlueJackets torontoraptors iphone bttf CowLand nuzlocke
15 MCFC coys LiverpoolFC preppy Dallas SantaPorn Pokemoncollege CumSwallowing wowraf Turkey ... heat SexyMusicVideos livechat osx Flyers SchooledUp iphonehelp RegretfulSexStories diamondmine ShinyPokemon
16 footballmanagergames FIFA12 footballmanagergames adventureporn politics burningporn normalboots souse 24hoursupport KateeOwen ... torontoraptors bookhaul GavinQuotes booklists devils memphisgrizzlies jailbreak drawings FanPuns Pokemongiveaway
17 bootroom EA_FIFA Bundesliga breitling Austin CucumberPorn PokemonROMhacks xxxstash bubbleswithfaces shewantstofuck ... LAClippers murakami Tumba AppHookup EA_NHL warriors AlienBlue computertechs MinecraftChampions stunfisk
18 Fifa13 FIFA chelseafc uniqlo kansascity PigsGoneWild PokemonLeagueDS drawngonewild worldofwarcraft Israel ... AskWomen circlejerkbreakingbad HypnoHookup iphonehelp Coyotes kanyewest AppHookup Sandwiches WolfPAChq GenerationOne
19 fcbayern Barca MilitaryGear gayforsaulyd Tucson Clotheshangers Pokemonexchange Cumonboobs WoWNostalgia haligonients ... SneakerDeals skinnytail OliviaMunn simpleios BostonBruins Mavericks obits Kirbs2002 Swingers pokemonconspiracies
20 football bootroom MCFC malelivingspace Purdue WTF_Wallpapers denpamen manbetterporn FTH Filipinology ... dating_advice booksuggestions maturewoman iOSthemes SanJoseSharks bostonceltics commonsense tesothemereddit DissidiaCraft pokemonchallenges
21 FIFA footballmanagergames realmadrid Watches orlando thismemeshoulddie pkmntcgtrades DragonageNSFW redditguild typescript ... DavidsQuotes bookclub Spartacus_TV SEGA32X stlouisblues nbacirclejerk SEGA32X ABotOfIceAndFire boule Pokemonexchange
22 FIFA12 fcbayern KitSwap swagteamsix Boise windowshots GenerationOne ToeSucking Rift HealthyWeightLoss ... Oaxaca books freakyfetishstories apps winnipegjets kings macgaming asoiaf minecraftium pokemonrp
23 EA_FIFA Aleague LigaMX paulrudd Atlanta traversecity TruePokemon assgifs ALS arabic ... mildlyimpressive BooksAMA Pokenawa macgaming hawks chicagobulls jasmineapp gameofthrones TriPixel PokemonROMhacks
24 borussiadortmund ACMilan donenad mensfashionadvice Charlotte ThisDayInHistory pokemonconspiracies joi MMORPG Syria ... NBASpurs publishing Banshee iosgaming NewYorkIslanders OrlandoMagic ScreenplayCoverage SurplusEngineering mathias fireemblem
25 SoccerBetting TrueGunners fcbayern soccergaming fsu failedpilots pkmntcgcollections cumov MovieWallpapers mexico ... AlienExchange DogsWithCatHeads CelebsInTights Panera ColoradoAvalanche suns iOSthemes Dreadfort GWAsians pkmntcg
26 ACMilan footballtactics seriea FantasyPL Louisville wikipedia PokePlayThru2013 asstastic wowtcg pernicus ... kanyewest bookshelf Conservativebooks classics hockeygoalies GoNets flextweak kuro5hit lmm pkmntcgtrades
27 Aleague borussiadortmund LeedsUnited supremeclothing SaltLakeCity bronycringe pkmntcg SymphonieVonBondage punkshots Palestine ... OrlandoMagic illusionporn justified askjailbreak sabres NYKnicks zsh PlayPassOrPause mccountercraft PokemonLeagueDS
28 NUFC soccergaming ACMilan PrettyOlderWomen VirginiaTech carlhprogramming pokemonbattles MNGoneWild turnoverpie 1428 ... SchooledUp antiatheism lickingdick iWallpaper caps RNBA2KFantasy Panera kdubz1298 Hypermine SoloPokes
29 footballtactics NUFC bootroom malegrooming Hawaii pic pokemonrp gonewild diablo3 BDS ... bostonceltics infographic illustratingreddit CSUDH fantasyhockey rockets CSUDH Kazakhstan massage hotmidgets
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
15092 shittygunpictures TreesSuckingOnThings bathroomgraffitiporn Iron Bad_ass_girlfriends LadiesofScience CuriosityCube craftit xenominer360 fragworks ... asianproblems browncoats blackgirlgamers Adamphotoshopped dpmansen baby Holmes DoctorWhoFreedom StLouisRams nfl
15093 Shotguns lisp_ja DAE ReVenture services trashynovels computer Pinterest wtfgames Trathira ... BanjoMonkey euphonium LadiesofScience CAmmunity Jaguars steak ournameisfun paganmusic 49ers mccountercraft
15094 scguns comedywriting polycomics Neopsychedelia Pornoeverywherexxx NaturalBeauty ykwih CherylCole starcraft2_class Chydrego ... WeirdFiction classicwho ForeverAloneWomen cachatfanfic miamidolphins FoodVideos horrorlit ios CatTeamBrotherhood Browns
15095 progun freesoftware PlayPassOrPause Fitness Hardcore_PornGifs IdeaExchange pcgamingtechsupport electricents DNE DLGOKN ... Scoobydoo thelook fashion HouseMD pickupfootball NFLhuddle gallifrey apple nfl lmm
15096 300BLK tinycode DYR homegym FitnessGirls duckface adelaidenews TwoXSex quicklooks Shaskel ... RadicalFeminism maisiewilliams adventureporn GunslingerMusic Hawkeye_Football PacificNorthwest AmyAdams macsetups BIRDTEAMS miamidolphins
15097 opencarry steamforlinux wtfgames AlternativeHealth ShemaleSwag FashionPlus XXX_Animated_Gifs TheGirlSurvivalGuide Hobbies androidtablets ... sidewayspony gallifrey vegproblems cell Lincolnshire caps DoctorWhumour macapps fcs minnesotavikings
15098 gunpolitics systems DNE fargo Entrenched mydrunkkitchen infinityblade ABraThatFits starcraft_strategy PleX ... malepolish hilaribad 2XLite irvine eroticcomics AskMen lexlinguae crazysteve NYGiants TriPixel
15099 Glocks allthingsterranbot The_Cheated ecig_vendors spacedikdiks NonverbalComm CableManagement ProjectRunway Destiny Philanthropy ... FemmeThoughtsFeminism teefury shittymspaints avatarvideos Texans dating_advice write gamegenome Chargers mathias
15100 CZFirearms coding WriteWithMe Hutchinson frontiama circlejerk HarvardTonight StreetArtPorn skypepartycirclejerk numerical ... hugenaturals ASMRmusic EdmontonGoneWild nwo syrianrefugees ravens logophilia drwho Mariners boule
15101 tc_archery cloudcomputing PicsOfHorseVaginas askedreddit front InfertilityBabies worms GetEmployed asd yoyohoneysingh ... olderlesbians DoesAnybodyElse Cheetahs juanbuiza UniversityOfHouston nflblogs litvideos Sherlock EvilLeagueOfEvil minecraftium
15102 knives puremathematics FlyingFuck kettlebell frontscience closetswap eddfaction entwives castit Valesrubberduckies ... SeattleHistory MLPArt femmit FlashForward Madden_NFL bengals doctorwho toolSquad gifrequests sportsbook
15103 gundeals PostgreSQL HotDogPorn StudentNurse 2012askanything technews buildapcsales 2XLookbook StarcraftCirclejerk Surface ... Cheetahs MovieWallpapers veggieteens louie ConnectedCareers postcolonialism neilgaiman Johnlock sexmix247 JordanCox2
15104 Mini14 mathisbeautiful FuckingFish kingcounty bipoliticalandcurious GillianJacobs ClearBackblast sheltie HotSBetaKeys bttf ... OperationSayFuckALot Digihentai femalefashionadvice Jericho ravens freddiemercury 80sElectro classicwho cowboys NFL_Draft
15105 guns dalcs4168 SomeRandomReddit volunteeringsolutions johndollarfullofcrap AlienExchange Zendaya Mommit itmejp AbletonProductions ... TwoXChromosomes bedhg frugalbeauty iamgoingtohellforthis boats JordanCox2 booksuggestions doctorwhocirclejerk Tennesseetitans MinecraftChampions
15106 CompetitionShooting ocaml SnooPorn todayiwatched alfrankenstein TPOP Jab LumiaLovers slothswitharms PiCases ... BreastPumps DoctorWhumour Endo huntedseries Seahawks effzeh ROIO whovianents CollegeBasketball DissidiaCraft
15107 CCW vim Clotheshangers RAMD GreenCarLovers preppy HardcoreSex USMilitarySO StarcraftDeutschland Ermahgerd ... Rights4Men Humiliation BodyAcceptance RandomActsOfBidets podcast LinusDiaries orderofthephoenix karengillan SaintLouisRams WolfPAChq
15108 ak47 scribblenauts PigsGoneWild fitmeals AutomobileTechnology TheSexCave AvaSambora VintageLadyBoners starcraft AvaDevine ... grooveshark GeeKnitting malefashionadvice maggielawson ConnectedCareers2 sabres hyperfurs dwlounge Patriots CowLand
15109 saiga pervasivecomputing CucumberPorn xxfitness albanianarchandpics StrandedWhale WhatsInThisPool rant allthingsprotoss GNURadio ... GirlsinPinkUndies t:latestoneage 2XLookbook HHN circuloidiota Redskins Random_Acts_of_Books davidtennant internetdefense 49ers
15110 Gunsforsale emacs UrethraPorn shittyshitredditsays engineersissues Admin MoundofVenus TomeNET wowbro Information_Security ... twilightzonedates IsabelleFuhrman feetish circlejerkbreakingbad sext Flyers artshub IsabelleFuhrman Texans NYGiants
15111 gats scheme burningporn ShittyTheoryOfReddit nonseq Gingers buildapcforme ThriftStoreHauls BarCraft braddoingbradthings ... henna bronycringe rawdenim Dexter Colts NFLFandom selfpublish mattandbenedict nyjets LinusDiaries
15112 prepping xmonad SantaPorn microgreens TotalFark SRSFartsAndCrafts TwinGirls birthcontrol allthingszerg VicPD ... trashynovels cissp findfashion thingsoncats nflblogs detroitlions OldEnglish gallifrey CFB diamondmine
15113 Firearms CasualMath HobosGoneWild naturalbodybuilding OmegleVideos MichelleTrachtenberg aquajewhungerforce cafe starcraftnakama internet2012 ... PunkLovers bootstrap asiantwoX Pedberg LinusDiaries Bigtitssmalltits bookshelf gallifreyan NFL_Draft eagles
15114 reloading ProgrammerHumor gnomewild EatCheapAndHealthy CandidBikiniGirls dataisterrifying internetdefense stepparents Tumba amateurastronomy ... TrackThrows CREST TwoXChromosomes Allison_Scagliotti JordanCox2 40something BooksAMA Torchwood CHIBears oaklandraiders
15115 ar15 gnu WTF_Wallpapers PronePaddling aitaiwan LifeProTips gamingpc bigboobproblems starcraft2clans cavalierkingcharles ... femalewriters AmyAdams TodayIWore pipesgonewild effzeh GreenBayPackers books mattsmith Redskins cowboys
15116 longrange nethack KittyPorn DeathProTips baozoumanhua mcwex buildapc childfree AllThingsTerran BitcoinMagazine ... parentdeals wmnf malefashion ohnoghosts NFLFandom knockoutgifs alt_lit wholock eagles Patriots
15117 Hunting baduk WhyWouldYouFuckThat DepressedRage NSFW_HotSlut_gonewild wmnf sexmix247 2XLite Naytopia trueshreddit ... amateursologirls FailedFedora blackgirls vinyldjs freddiemercury happierNYC 52book MovieWallpapers nflcirclejerk Redskins
15118 BrassSwap libredesign WeirdSubreddits 2Chainz sexy_nonnude_teens allstars Reflections curvygirls Limu technology ... SexyNerds doctorwho SRSTechnology breakingbadcomics NFLhuddle buffalobills bookhaul hyperfurs Reflections CFB
15119 GunPorn gtd HalloweenPorn IndianClubs NSFW_nude_asian_teens UKHealthcare overclocking elections DoesAnybodyElse TakeOneStepForward ... MilitaryFamilies dwlounge UnsentMusic breakingbad Bingo MontrealCanadiens BookCollecting AmyAdams panthers Texans
15120 USMilitia compscipapers SpidersGoneWild Nicaragua NSFW_sexynudeteens CarboholicsAnonymous NSFWSector mormoncringe UKStarcraft slandrogangstas ... veggieteens hugs goodyearwelt memphismayfire gameslist Articles booklists doctorwho HighlightGIFS nyjets
15121 1911 LANL_French DoesAnybodyElse LifeProTips wild_naked_girls eisley gifs mobile allthingsterranbot preggocows ... SRSTechnology hyperfurs freeforallfashion Earlyjazz happierNYC steelers WoTreread DoctorWhumour gifs CHIBears

15122 rows × 100 columns

A few interesting dimensions

  • I went through each column and picked out some interesting dimensions
In [191]:
pd.DataFrame(subreddits[np.argsort(embedded_coords[:,[0, 1, 44,51,84,50,47,40]], axis=0)[::-1]], 
             columns=[
            "0: big - small", 
            "1: big - small",
            "44: soccer - guns",
            "51: programming - food",
            "84: music - bikes", 
            "50: osx - books",
            "47: wow - starcraft", 
            "40: male grooming - life hacks"
])

# not shown but also amusing:
# 14: music - pot
# 24: science - porn
Out[191]:
0: big - small 1: big - small 44: soccer - guns 51: programming - food 84: music - bikes 50: osx - books 47: wow - starcraft 40: male grooming - life hacks
0 AskReddit todayilearned bevandele threads BibleBelievers osx ShitLiamDoes malefashionadvice
1 funny worldnews ParisSG smart christianblogs simpleios silverandguns malefashion
2 pics politics soccerbot hocnet mormonapologetics ios techsupport goodyearwelt
3 WTF videos FootballMedia pervasivecomputing beerpong retina uscgames rawdenim
4 gaming technology FTLStrikers Suomipelit anarchist_aid iOSProgramming Goblinism frugalmalefashion
5 AdviceAnimals blog PrettyOlderWomen lisp_ja Jesus appletv Transmogrification yusufcirclejerk
6 IAmA promos soccer gnu TrueChristian mac wow AustralianMFA
7 videos science reddevils ProgrammerArt biogas apple wowscrolls mfacirclejerk
8 todayilearned til Gunners app Christianity macapps WowUI shittymspaints
9 atheism atheism FantasyPL jenkinsci JustChristians applehelp WoWGoldMaking europeanmalefashion
10 Omaha SOPA coys DanceTutorials ChristianBooks macsetups wowstrat malehairadvice
11 aww IAmA Barca illjustleavethishere ChristianCreationists iPhoneDev woweconomy ldshistory
12 UIUC ColbertRally chelseafc cloudcomputing therealcollective ipad wowguilds malefoodadvice
13 houston USNEWS realmadrid softwaredevelopment Reformed WebApps WoWStreams itafterdark
14 Columbus CMDY LiverpoolFC ComputerTips stickers iphone wowpodcasts TeenMFA
15 Dallas NorthKoreaNews MCFC scheme TheArk iphonehelp wowraf preppy
16 politics OperationGrabAss footballmanagergames ProgrammerHumor RadicalChristianity jailbreak 24hoursupport adventureporn
17 Austin WikiLeaks bootroom ocaml truestchristian AlienBlue bubbleswithfaces breitling
18 kansascity sandy Fifa13 tmbo Sidehugs AppHookup worldofwarcraft uniqlo
19 Tucson AnythingGoesUltimate fcbayern QueerTheory biblestudy obits WoWNostalgia gayforsaulyd
20 Purdue reddit.com football emacs Catacombs commonsense FTH malelivingspace
21 orlando northkorea FIFA csbooks ChristianApologetics SEGA32X redditguild Watches
22 Boise athiesm FIFA12 d_language PrayerRequests macgaming Rift swagteamsix
23 Atlanta Futurology EA_FIFA coding Fiveheads jasmineapp ALS paulrudd
24 Charlotte silentcinema borussiadortmund CatsWithPeopleFeet sketches ScreenplayCoverage MMORPG mensfashionadvice
25 fsu occupywallstreet SoccerBetting feministFAQ OpenChristian iOSthemes MovieWallpapers soccergaming
26 Louisville softscience ACMilan Big_Ood cristianoronaldo flextweak wowtcg FantasyPL
27 SaltLakeCity peoplesliberation Aleague securityCTF ShitZimnySendsMe zsh punkshots supremeclothing
28 VirginiaTech ArtisanVideos NUFC Clojure ahmadiyya Panera turnoverpie PrettyOlderWomen
29 Hawaii movies footballtactics blackflag ChristianMusic CSUDH diablo3 malegrooming
... ... ... ... ... ... ... ... ...
15092 Bad_ass_girlfriends Foamyfrogismoe shittygunpictures castiron Boats_and_Beauties Holmes xenominer360 Iron
15093 services SiriAmA Shotguns ramen MagicCardPulls ournameisfun wtfgames ReVenture
15094 Pornoeverywherexxx HelloProperAlice scguns Halloween_Town Magicdeckbuilding horrorlit starcraft2_class Neopsychedelia
15095 Hardcore_PornGifs allHailKingNoel progun asianeats melodicmetal gallifrey DNE Fitness
15096 FitnessGirls LambofGod 300BLK HiphopWorldwide TechnicalDeathMetal AmyAdams quicklooks homegym
15097 ShemaleSwag VictorianSluts opencarry shittyhomes Alisonhaislip DoctorWhumour Hobbies AlternativeHealth
15098 Entrenched dogdad gunpolitics grilling mtgcube lexlinguae starcraft_strategy fargo
15099 spacedikdiks VintageMilf Glocks KitchenConfidential magicTCG write Destiny ecig_vendors
15100 frontiama Onderka CZFirearms icecreamery Deathmetal logophilia skypepartycirclejerk Hutchinson
15101 front softcorenights tc_archery Chefit mtgaltered litvideos asd askedreddit
15102 frontscience CarShowHunnies knives recipes spikes doctorwho castit kettlebell
15103 2012askanything RimofReality gundeals VictoriaSecret EDH neilgaiman StarcraftCirclejerk StudentNurse
15104 bipoliticalandcurious HappyBaracky Mini14 TakeaPlantLeaveaPlant Metal 80sElectro HotSBetaKeys kingcounty
15105 johndollarfullofcrap shitclintonsays guns sushi metalmusicians booksuggestions itmejp volunteeringsolutions
15106 alfrankenstein babel CompetitionShooting Breadit AliceInChains ROIO slothswitharms todayiwatched
15107 GreenCarLovers notocispa CCW DavisSquare Musex orderofthephoenix StarcraftDeutschland RAMD
15108 AutomobileTechnology ProjectScarlett ak47 GMO ISU_GDC hyperfurs starcraft fitmeals
15109 albanianarchandpics bestplacetolearn saiga 52weeksofbaking MetalLadyBoners Random_Acts_of_Books allthingsprotoss xxfitness
15110 engineersissues mediocregiraffes Gunsforsale judytrinh 7String artshub wowbro shittyshitredditsays
15111 nonseq HIV gats Daleks mtgfinance selfpublish BarCraft ShittyTheoryOfReddit
15112 TotalFark superheroinesdefeated prepping Charcuterie spacedikdiks OldEnglish allthingszerg microgreens
15113 OmegleVideos violentpornography Firearms 52weeksofcooking Entrenched bookshelf starcraftnakama naturalbodybuilding
15114 CandidBikiniGirls necroporn reloading sousvide Cichlid BooksAMA Tumba EatCheapAndHealthy
15115 aitaiwan deandrefall ar15 FoodPorn vomitporn books starcraft2clans PronePaddling
15116 baozoumanhua kasperrosa longrange food TheStuntMuffins alt_lit AllThingsTerran DeathProTips
15117 NSFW_HotSlut_gonewild novafactory Hunting CajunMusic BestSleepOfYourLife 52book Naytopia DepressedRage
15118 sexy_nonnude_teens marathi BrassSwap FoodVideos nationalsciencebowl bookhaul Limu 2Chainz
15119 NSFW_nude_asian_teens PsiUEI GunPorn sharksinclothes johndollarfullofcrap BookCollecting DoesAnybodyElse IndianClubs
15120 NSFW_sexynudeteens Factories USMilitia creepywiki alfrankenstein booklists UKStarcraft Nicaragua
15121 wild_naked_girls AskReddit 1911 appetizers Metal_Alberta WoTreread allthingsterranbot LifeProTips

15122 rows × 8 columns

Visualizing these dimensions

  • Bokeh makes it easy to get hover-tooltips!
  • Each dot in our plot will be a subreddit (mouse over to see which one), sized by its number of users (log scale)
  • We'll only look at subreddits with more than 100 users (the top 3,500 or so), for speed
In [185]:
import bokeh.plotting as bp
from bokeh.objects import HoverTool 
bp.output_notebook()
row_selector = np.where(users_per_subreddit>100)
BokehJS successfully loaded.
In [186]:
bp.figure(plot_width=900, plot_height=700, title="Subreddit Map by Most Informative Dimensions",
       x_axis_label = "Dimension 0",
       y_axis_label = "Dimension 1",
       tools="pan,wheel_zoom,box_zoom,reset,hover,previewsave",
       min_border=1)
bp.scatter(
    x = embedded_coords[:,0][row_selector],
    y = embedded_coords[:,1][row_selector],
    radius= np.log2(users_per_subreddit[row_selector])/6000, 
    source=bp.ColumnDataSource({"subreddit": subreddits[row_selector]})
).select(dict(type=HoverTool)).tooltips = {"/r/":"@subreddit"}
bp.show()

Hm... let's try some other dimensions!

In [192]:
bp.figure(plot_width=900, plot_height=700, title="Subreddit Map by Interesting Dimensions",
       x_axis_label = "Guns <–> Soccer (Dimension 44)",
       y_axis_label = "Food <–> Programming (Dimension 51)",
       tools="pan,wheel_zoom,box_zoom,reset,hover,previewsave",
       min_border=1)
bp.scatter(
    x = embedded_coords[:,44][row_selector],
    y = embedded_coords[:,51][row_selector],
    radius= np.log2(users_per_subreddit[row_selector])/6000, 
    source=bp.ColumnDataSource({"subreddit": subreddits[row_selector]})
).select(dict(type=HoverTool)).tooltips = {"/r/":"@subreddit"}
bp.show()

Can we add some colour?

  • Would be nice to colour-code these dots by overall topic
  • 20 colours is probably on the upper end of what we can perceive effectively in a single plot

Clustering

  • Family of algorithms for grouping items into buckets
  • scikit-learn has a clustering package

We'll use KMeans

  • KMeans from scikit-learn
  • Take K points and try to distribute them in the space so that each sits in the middle of a cluster, i.e. at the mean of the items assigned to it; every item is assigned to its nearest point
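
As a rough illustration of that idea, here is a small sketch on two made-up, well-separated blobs of points. The data and `toy_*` names are hypothetical. The fitted centres should land near the two blob means.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical data: 50 points around (0, 0) and 50 points around (5, 5)
rng = np.random.RandomState(0)
blob_a = rng.normal(loc=0.0, scale=0.1, size=(50, 2))
blob_b = rng.normal(loc=5.0, scale=0.1, size=(50, 2))
toy_points = np.vstack([blob_a, blob_b])

toy_km = KMeans(n_clusters=2, n_init=10, random_state=0)
toy_labels = toy_km.fit_predict(toy_points)  # one cluster id per point

# each centre is the mean of the points assigned to it
print(np.sort(toy_km.cluster_centers_[:, 0]))
```

The cluster ids themselves are arbitrary: without a fixed `random_state`, the same groups can come back numbered differently from run to run.
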
In [193]:
%%time
from scipy.stats import rankdata
embedded_ranks = np.array([rankdata(c) for c in embedded_coords.T]).T

from sklearn.cluster import KMeans
n_clusters = 20
km = KMeans(n_clusters)
clusters = km.fit_predict(embedded_ranks)
CPU times: user 26.4 s, sys: 4 ms, total: 26.4 s
Wall time: 26.5 s
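
Note that the cell above clusters on `embedded_ranks` rather than on the raw coordinates. The notebook does not say why. One plausible reading (an assumption) is that replacing each value by its rank within its column puts every dimension on the same scale and limits the pull of extreme values. A tiny made-up example of the transform:

```python
import numpy as np
from scipy.stats import rankdata

# Hypothetical coordinates: the second column has a much larger spread
toy_coords = np.array([
    [0.10,  -5.0],
    [0.30, 100.0],
    [0.20,   0.0],
])

# rank each column independently, exactly as the notebook cell does
toy_ranks = np.array([rankdata(c) for c in toy_coords.T]).T
print(toy_ranks)
```

After ranking, both columns hold the values 1 to 3, so neither dominates the Euclidean distances that K-Means relies on.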

In [194]:
pd.DataFrame( [subreddits[clusters == i][users_per_subreddit[clusters == i].argsort()[-6:][::-1]] for i in range(n_clusters)] )
Out[194]:
0 1 2 3 4 5
0 Libertarian conspiracy philosophy Economics PoliticalDiscussion business
1 doctorwho thewalkingdead scifi community TheLastAirbender masseffect
2 AskReddit funny pics WTF gaming AdviceAnimals
3 self photography TrueReddit travel space AskHistorians
4 Games starcraft Diablo wow battlefield3 DotA2
5 programming geek linux talesfromtechsupport learnprogramming sysadmin
6 Guitar listentothis WeAreTheMusicMakers Metal vinyl Bass
7 anime magicTCG zelda darksouls 3DS Naruto
8 gonewild nsfw RealGirls gonewildcurvy NSFW_GIF GoneWildPlus
9 TwoXChromosomes loseit relationships AskWomen tattoos cats
10 mylittlepony gamegrumps MLPLounge homestuck furry ClopClop
11 Drugs hiphopheads woahdude seduction electronicmusic drunk
12 guns cars motorcycles Autos Military formula1
13 bestof MensRights lgbt SubredditDrama gaybros gaymers
14 fffffffuuuuuuuuuuuu Minecraft gifs pokemon skyrim buildapc
15 Christianity NoFap TrueAtheism DebateReligion islam Poetry
16 bicycling beer running Homebrewing Coffee snowboarding
17 cringepics cringe mildlyinteresting 4chan JusticePorn reactiongifs
18 reddit.com Android explainlikeimfive malefashionadvice DoesAnybodyElse Frugal
19 nfl soccer nba hockey fantasyfootball baseball
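
The one-liner that builds this table packs three steps together: a boolean mask picks the members of cluster `i`, `argsort` orders them by user count, and `[-6:][::-1]` keeps the six largest in descending order. A minimal sketch of the same logic, with made-up names, sizes and labels (none of these values come from the Reddit data) and the top 2 instead of the top 6:

```python
import numpy as np

# Hypothetical toy data: 6 items, their sizes, and their cluster labels
names = np.array(["a", "b", "c", "d", "e", "f"])
sizes = np.array([10, 50, 30, 5, 40, 20])
labels = np.array([0, 0, 0, 1, 1, 1])

top_k = 2
top_per_cluster = []
for i in range(2):
    members = labels == i                    # boolean mask for cluster i
    order = sizes[members].argsort()         # member positions, ascending by size
    largest_first = order[-top_k:][::-1]     # keep the top_k largest, descending
    top_per_cluster.append(names[members][largest_first].tolist())

print(top_per_cluster)
```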

Visualization in Technicolour

  • This time we'll colour-code each dot by cluster
In [199]:
colormap = np.array([
    "#1f77b4", "#aec7e8", "#ff7f0e", "#ffbb78", "#2ca02c", 
    "#98df8a", "#d62728", "#ff9896", "#9467bd", "#c5b0d5", 
    "#8c564b", "#c49c94", "#e377c2", "#f7b6d2", "#7f7f7f", 
    "#c7c7c7", "#bcbd22", "#dbdb8d", "#17becf", "#9edae5"
])

bp.figure(plot_width=900, plot_height=700,  title="Subreddit Map by Interesting Dimensions",
       x_axis_label = "Guns <–> Soccer (Dimension 44)",
       y_axis_label = "Food <–> Programming (Dimension 51)",
       tools="pan,wheel_zoom,box_zoom,reset,hover,previewsave",
       min_border=1)
bp.scatter(
    x = embedded_coords[:,44][row_selector],
    y = embedded_coords[:,51][row_selector],
    color= colormap[clusters[row_selector]], 
    radius= np.log2(users_per_subreddit[row_selector])/6000, 
    source=bp.ColumnDataSource({"subreddit": subreddits[row_selector]})
).select(dict(type=HoverTool)).tooltips = {"/r/":"@subreddit"}
bp.show()
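
The `color=` argument above relies on numpy integer-array indexing: indexing the palette with an array of cluster labels returns one colour per dot. A minimal sketch, with a hypothetical three-colour palette (the first three entries of `colormap`) and made-up labels:

```python
import numpy as np

# Hypothetical 3-colour palette
palette = np.array(["#1f77b4", "#aec7e8", "#ff7f0e"])
# Made-up cluster labels, one per point
point_clusters = np.array([2, 0, 0, 1])

# Integer-array indexing: one palette entry per label
point_colours = palette[point_clusters]
print(point_colours.tolist())
```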

Can we combine all these dimensions into one plot?

  • Just plotting the first two dimensions wasn't very interesting
  • We face the same problem map-makers face: making a high-dimensional thing 2-dimensional
  • There are lots of ways to make a 2-d map of our 3-d planet!

Manifold learning

  • Family of algorithms to transform many dimensions to 2 or 3
  • AKA low-dimensional embedding
  • scikit-learn has a manifold learning package

We'll use TSNE

  • TSNE from scikit-learn
  • Original paper is awesome
  • The math is fairly involved, but the basic idea is to preserve local structure (points that are close in the original space stay close on the map), sacrificing some global structure
  • Gives very good, human-readable results on real-world datasets; check the examples in the paper
In [197]:
%%time
from sklearn.manifold import TSNE
xycoords = TSNE().fit_transform(embedded_coords[row_selector])
CPU times: user 5min 32s, sys: 1min 1s, total: 6min 33s
Wall time: 6min 35s
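
`TSNE()` above runs with scikit-learn's defaults. This is a minimal sketch on made-up data, assuming a hypothetical 60 × 10 array in place of `embedded_coords[row_selector]`. `n_components`, `perplexity`, `init` and `random_state` are real `TSNE` parameters, but the values here are illustrative choices, not the settings used in the talk. The output depends on the random seed, so coordinates differ between runs unless `random_state` is fixed.

```python
import numpy as np
from sklearn.manifold import TSNE

# Hypothetical stand-in data: 3 well-separated groups of 20 points in 10 dimensions
rng = np.random.RandomState(0)
toy_points = np.vstack([rng.normal(loc=10.0 * g, size=(20, 10)) for g in range(3)])

# perplexity must be smaller than the number of samples
tsne = TSNE(n_components=2, perplexity=10, init="random", random_state=0)
toy_xy = tsne.fit_transform(toy_points)

print(toy_xy.shape)
```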

In [198]:
bp.figure(plot_width=900, plot_height=700, title="Subreddit Map by t-SNE",
       tools="pan,wheel_zoom,box_zoom,reset,hover,previewsave",
       x_axis_type=None, y_axis_type=None, min_border=1)
bp.scatter(
    x = xycoords[:,0],
    y = xycoords[:,1],
    color= colormap[clusters[row_selector]], 
    radius= np.log2(users_per_subreddit[row_selector])/60, 
    source=bp.ColumnDataSource({"subreddit": subreddits[row_selector]})
).select(dict(type=HoverTool)).tooltips = {"/r/":"@subreddit"}
bp.show()

Taking a step back

  • We loaded a dataset into a big sparse matrix
  • We used TruncatedSVD to make it into a surprisingly informative smaller matrix
  • We used KMeans to group our subreddits into some clusters
  • We used TSNE to get coordinates for a scatterplot
  • We used bokeh to make an interactive graphic
  • So what?

A hidden agenda

  1. Use these tools on a dataset that you can a priori reason about
  2. Show that meaningful patterns come out
  3. Convince you that you can try this on data you can't a priori reason about
  4. So that you will feel confident that the algo has found something useful

A totally different example

Where to next?

  • Notebook is available at https://github.com/datacratic/mtlpy50
  • You can use these on your own data (dense or sparse!)
  • Big spreadsheets or datasets where each row is an entity are also good candidates for this type of analysis
  • Watch patterns appear before your very eyes!
  • NB: TruncatedSVD, KMeans and TSNE are some of my go-to algorithms, but each belongs to a family with other options
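
As a starting point for your own data, the whole pipeline can be sketched on a small synthetic sparse matrix. Everything below (matrix size, density, number of components and clusters, random seeds) is a made-up assumption chosen so the example runs in seconds; it is not the configuration used in the talk. Random data has no real structure, so the clusters here are meaningless: the point is only the shape of the calls.

```python
import numpy as np
import scipy.sparse as sp
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.manifold import TSNE

# Hypothetical data: 200 entities (rows) x 500 binary features, about 5% non-zero
matrix = sp.random(200, 500, density=0.05, format="csr", random_state=0)
matrix.data[:] = 1.0

# 1. Reduce the sparse matrix to a small dense one
svd = TruncatedSVD(n_components=10, random_state=0)
reduced = svd.fit_transform(matrix)

# 2. Group the rows into clusters
cluster_ids = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(reduced)

# 3. Embed the rows in 2-D for a scatterplot
xy = TSNE(n_components=2, perplexity=30, init="random", random_state=0).fit_transform(reduced)

print(reduced.shape, cluster_ids.shape, xy.shape)
```

Swapping in your own matrix (dense or sparse) for `matrix` is the only change needed to try this on real data.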

One more thing...

  • scikit-learn is nice, but some of these algorithms can be slow and hard to use
  • Early next year, Datacratic will release an early version of our Machine Learning Database
  • It lets you do this type of analysis by calling a REST API and it's lightning-fast!
  • Stay tuned :)