Earlier this year, I collaborated with a reporter from the Montreal Gazette to analyze a dataset containing information about 1.4 million service requests received by the City of Montreal from its citizens. The resulting article, entitled "Montreal's 311 records shed light on residents' concerns — to a point", credits me at the bottom. I have also published my own interactive analysis of the dataset here: Montreal 311 Service Requests, an Analysis. The dataset, obtained from the city's Gestion des demandes clients (GDC) system via an Access to Information request, covered the five years from 2008 to 2012 and contained, for each request, the date, a very short description and, in most cases, an address. The service requests were received by the city through its 311 phone line or at service counters throughout the city.
My major contribution to this article was coming up with a categorization for these requests, and finding some interesting patterns and anomalies in the categorized data for Roberto at the Gazette to dig into. In the rest of this blog post, I detail how I first approached the dataset and categorized the 1.4 million calls using the Jupyter Notebook system.
First we load up the data into a Pandas DataFrame and get a very high-level overview.
import pandas as pd
# read_csv transparently decompresses the gzipped file
requests = pd.read_csv("311.csv.gz")
requests.describe()
We seem to have 1826 days or 5 years of data across all 19 of Montreal's boroughs, so let's make a quick chart of request counts over time by borough. Daily is too fine a grain so we'll go monthly.
from datetime import date
requests.creation_date = pd.to_datetime(requests.creation_date)
# Bucket each request into the first day of its month for a monthly series
requests["year_month"] = requests.creation_date.apply(
    lambda x: date(year=x.year, month=x.month, day=1)
)
requests["Year"] = requests.creation_date.dt.year
requests["Month"] = requests.creation_date.dt.strftime("%b")
%matplotlib inline
from matplotlib import rcParams, style
style.use('ggplot')
rcParams['figure.figsize'] = 12, 8
requests.pivot_table(
    index='year_month', columns='Borough',
    values='Nature', aggfunc=len
).plot().legend(loc='center right', bbox_to_anchor=(1.45, 0.5))
This is not a particularly readable chart, but it certainly looks like we have data from every borough for every month. There appear to be some major inter-borough differences (Saint-Laurent is only the 8th most populous borough but has by far the most requests) as well as temporal patterns (a yearly cycle of low winter and high summer). The idea behind this analysis is to get a handle on these differences.
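To pin down that inter-borough claim, a quick tally of total requests per borough (a one-liner of my own, not from the original notebook) shows which boroughs dominate:
requests.groupby("Borough").size().sort_values(ascending=False).head(10)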
Let's take a look at that Nature column; it seems to have too many different values to easily make sense of.
nature_counts = requests.Nature.value_counts().reset_index(name="counts")
nature_counts
And now graphically:
p = nature_counts["counts"].cumsum().plot(legend=False)
p.set_xlabel("Term Rank")
p.set_ylabel("Cumulative Number of Requests")
To dig into the kinds of requests in this dataset and get a better sense of its temporal and spatial patterns, we'll need to group the 4000+ values of the Nature column into a smaller number of categories. A manageable number would be between 10 and 20 categories, including a catch-all "Other" which should not be the biggest, and with less than a 10-to-1 difference in size between the biggest and the smallest categories.
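These rules of thumb are easy to spot-check in code once the Category column is built further down; this helper is my own addition, not part of the original analysis:
def check_categorization(counts):
    # counts: requests per category, e.g. requests.groupby("Category").size()
    return (
        10 <= len(counts) <= 20               # a manageable number of categories
        and counts.idxmax() != "Other"        # the catch-all is not the biggest
        and counts.max() < 10 * counts.min()  # under a 10-to-1 size spread
    )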
The general approach we'll take is to group the requests based on keywords present in the Nature field. We'll start with some basic prep on the data by normalizing it all to ASCII. This isn't strictly necessary, but it makes it easier to write string-matching code without having to type accents on a standard US keyboard. It will also help with string-matching in case a French word was mistyped without accents (admittedly, not a particularly common occurrence).
import unicodedata
def to_ascii(input_str):
    # Decompose accented characters (NFKD), then drop the non-ASCII combining marks
    nfkd_form = unicodedata.normalize('NFKD', str(input_str))
    return nfkd_form.encode('ASCII', 'ignore').decode('ASCII')
requests.Nature = requests.Nature.apply(to_ascii)
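A quick sanity check on a made-up value (illustrative, not from the dataset):
to_ascii("Déneigement requis")  # -> 'Deneigement requis'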
A fairly subjective, iterative categorization process yields the following categorization function. The keys of the keywords dictionary are the category names, and the values are the sets of words which, if present in the Nature field, cause that category to be applied to a given request. The first category to match is the one that is applied to the request.
The process used was to start with no filters (i.e. all requests categorized as "Other") and then iterate: review the most frequent values of Nature still landing in "Other", add or reorder keywords to capture them, and check the top values of Nature per category to make sure nothing was miscategorized.
from collections import OrderedDict
import re
# Category keywords, in priority order: the first pattern to match wins
keywords = OrderedDict()
keywords["Dogs"] = ["chien"]
keywords["Taxes"] = ["changement d'adresse", "taxes", "taxe"]
keywords["Permits"] = ["permis", "occupation", "zonage"]
keywords["Selective Collection"] = ["bac", "bacs", "sac", "sacs", "recyclage", "residus",
                                    "collecte selective", "collecte de branches",
                                    "collecte de feuilles"]
keywords["Bulky Collection"] = ["volumineux"]
keywords["Garbage Collection"] = ["collecte", "dechets"]
keywords["Trees"] = ["arbre"]
keywords["Water"] = ["eau"]
keywords["Sewers"] = ["egout", "puisard"]
keywords["Signage"] = ["signalisation", "marquage", "feux"]
keywords["Snow"] = ["neige", "deneigement"]
keywords["Roads"] = ["voirie", "voie", "travaux", "rue", "chaussee", "charretiere", "trottoir"]
keywords["Environment"] = ["environnement", "maisons fleuries", "horticulture", "pesticide",
                           "compost", "animal", "punaises", "jardins", "extermination",
                           "animaux", "sauvage", "gazon", "bruit", "nuisances"]
keywords["Parking"] = ["stationnement", "ads", "srrr", "parcometre"]
keywords["Lighting"] = ["eclairage", "lampadaire"]
keywords["Parks"] = ["parc", "parcs", "chalet", "chalets", "sport", "sports", "piscine"]
keywords["Security"] = ["surveillance", "police", "securite"]
# Compile one whole-word, case-insensitive regex per category, once, up front
keyword_res = OrderedDict(
    (k, re.compile(r"\b%s\b" % r"\b|\b".join(v), re.IGNORECASE))
    for k, v in keywords.items()
)
def categorize(x):
    for k, pattern in keyword_res.items():
        if pattern.search(x):
            return k
    return "Other"
requests["Category"] = requests.Nature.map(
{x: categorize(x) for x in requests.Nature.unique()}.get
)
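A couple of spot checks on made-up inputs (illustrative only, not actual Nature values) show the priority ordering at work: "trottoir" would put the second request under Roads, but Snow is checked first.
categorize("Collecte des ordures")     # -> 'Garbage Collection'
categorize("Deneigement du trottoir")  # -> 'Snow', not 'Roads'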
requests.groupby("Category").size().plot(kind="barh", title="Requests by Category")
A cursory glance at the list reveals that it is close to being a mutually-exclusive, collectively-exhaustive (MECE) description of the service areas the city is involved in. This intuitively makes sense: a citizen request about anything else wouldn't be recorded in this dataset, and it would be very surprising if the city offered a service that few people inquired or complained about.
Of special note are the three "Collection" categories: waste-collection requests were split up into three to avoid having one huge "Waste Collection" category. "Selective" collection covers recycling, branches and the like; "Bulky" collection covers large items such as mattresses and furniture; and "Garbage" collection is everything else.
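To check that the split achieved its goal of comparably-sized categories (my own check, not in the original notebook):
requests[requests.Category.str.endswith("Collection")].groupby("Category").size()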
So what kinds of requests ended up in each category? Here's a list of the top 5 values of Nature per category.
x = requests.groupby(["Category", "Nature"]).size().reset_index(name="counts")\
    .sort_values(by=["Category", "counts"], ascending=False)
# method="first" breaks ties so that ranks within each category are unique
x["grouprank"] = x.groupby(["Category"])['counts'].rank(ascending=False, method="first")
x.groupby(["Category"]).head(5).pivot_table(
    index="Category", columns="grouprank", values="Nature", aggfunc="first")
We can also look at how many different values of Nature ended up in each category, and see that things make sense: a big variety in the "Other" category, and a fair number of different values collapsed into the various specific categories, roughly in proportion to how many requests fall into each.
requests.groupby(["Category", "Nature"]).size().reset_index(name="counts")\
.groupby("Category").size().plot(kind="barh", title="Natures by Category")
We now have our categorized data, ready for some more interactive analysis using PivotTable.js, an in-browser interactive pivot table tool. We'll export the data to a CSV to load into the pivot table.
requests.groupby([
    "Year", "Month", "Borough", "Category"
]).size().reset_index(name="Requests").to_csv("311_pivot.csv", index=False)
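As an aside, if you'd rather explore without leaving the notebook, the pivottablejs Python package (assuming it is installed, e.g. via pip install pivottablejs) wraps the same PivotTable.js UI:
from pivottablejs import pivot_ui
# Aggregate first so the browser isn't handed 1.4 million raw rows
summary = requests.groupby(["Year", "Month", "Borough", "Category"]).size().reset_index(name="Requests")
pivot_ui(summary)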
You can follow along with the rest of the analysis here: Montreal 311 Service Requests, an Analysis
⁂