MTL Data: Montreal 311 Service Requests, an Analysis

Earlier this year, I collaborated with a reporter from the Montreal Gazette to analyze a dataset containing information about 1.4 million service requests received by the City of Montreal from its citizens. The resulting article was entitled "Montreal's 311 records shed light on residents' concerns — to a point" and credits me at the bottom. I have also published my own interactive analysis of the dataset here: Montreal 311 Service Requests, an Analysis. The dataset, obtained from the city's Gestion des demandes clients (GDC) system via an Access to Information request, covered the five years from 2008 to 2012 and contained the date and a very short description for each request, and in most cases, an address. The service requests were received by the city through its 311 phone line or at service counters throughout the city.

My major contribution to this article was to have come up with a categorization for these requests, and to have found some interesting patterns and anomalies in the categorized data for Roberto at the Gazette to dig into. In the rest of this blog post, I detail how I first approached the dataset and categorized the 1.4 million calls using the Jupyter Notebook system.

Diving into the data

First we load up the data into a Pandas DataFrame and get a very high-level overview.

In [1]:
import pandas as pd

requests = pd.DataFrame.from_csv("311.csv.gz")
requests.describe()
Out[1]:
creation_date Borough Nature
count 1249761 1249761 1249761
unique 1826 19 4397
top 2012-05-29 Saint-Laurent Chien - Permis
freq 1952 142198 61138

Time and Space

We seem to have 1826 days or 5 years of data across all 19 of Montreal's boroughs, so let's make a quick chart of request counts over time by borough. Daily is too fine a grain so we'll go monthly.

In [2]:
from datetime import date

requests.creation_date = pd.to_datetime(requests.creation_date)
requests["year_month"] = requests.creation_date.apply(
    lambda x: date(year=x.year, month=x.month, day=1)
)
requests["Year"] = requests.creation_date.apply(lambda x: x.year)
requests["Month"] = requests.creation_date.apply(lambda x: x.strftime("%b"))

%matplotlib inline
from matplotlib import rcParams, style
style.use('ggplot')
rcParams['figure.figsize'] = 12, 8

requests.pivot_table(
    index='year_month', columns='Borough', 
    values='Nature', aggfunc=len
).plot().legend(loc='center right', bbox_to_anchor=(1.45, 0.5))
Out[2]:
<matplotlib.legend.Legend at 0x117545750>

This is not a particularly readable chart, but it certainly looks like we have data from every borough for every month. There appear to be some major inter-borough differences (Saint-Laurent is only the 8th most populous borough but has by far the most requests) as well as temporal patterns (a yearly low-winter/high-summer cycle). The idea behind this analysis is to get a handle on these differences.
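The borough observation above comes down to a one-line ranking of request counts. Here's a minimal sketch on a toy stand-in for the `requests` DataFrame (the borough names are real, but these rows and counts are made up for illustration):

```python
# Toy illustration (NOT the real dataset): ranking boroughs by request
# volume with value_counts(), which sorts in descending order.
import pandas as pd

toy = pd.DataFrame({
    "Borough": ["Saint-Laurent"] * 5 + ["Ville-Marie"] * 3 + ["Verdun"] * 2,
    "Nature": ["Chien - Permis"] * 10,
})

# Count rows per borough, largest first
by_borough = toy.Borough.value_counts()
print(by_borough)
# Saint-Laurent tops this toy ranking with 5 requests
```

Running the same one-liner on the full dataset is what surfaces Saint-Laurent's outsized request volume.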


The Nature of Things

Let's take a look at that Nature column; it seems to have too many different values to easily make sense of.

In [3]:
nature_counts = requests.Nature.value_counts().reset_index(name="counts")
nature_counts
Out[3]:
index counts
0 Chien - Permis 61138
1 Déchets - Objet volumineux à ramasser 54913
2 Éclairage existant - Entretien 52255
3 Environnement - Bac ou sac de récupération - D... 44306
4 Déchets - Collecte 44141
5 Déchets - Dépôt illégal 43487
6 Arbre - Élagage 31731
7 Feux de circulation - Entretien 30841
8 Environnement - Collecte sélective 29808
9 Intervention stationnement 29550
10 Neige - Divers 27685
11 Eau - Fermeture d'entrée d'eau 26518
12 Égout - Puisard 24656
13 Environnement - Collecte de branches d'arbre 23812
14 Voirie - Divers 21869
15 Eau - Divers 20665
16 Signalisation écrite - Nouveau et entretien 19844
17 Voirie - Nid-de-poule 19833
18 20 - Surveillance particulière 18546
19 Changement d'adresse - Taxes eau et déchets 13822
20 Changement d'adresse - Taxes foncières 13466
21 Voirie - Nettoyage du domaine public 13029
22 Voirie - Débris sur la voie publique 12556
23 Eau - Ouverture d'entrée d'eau 11889
24 Voirie - Pavage - Réparation 11556
25 Signalisation - Circulation - Études 11289
26 Eau - Fuite d'eau 10424
27 Environnement - Terrain insalubre 10259
28 Voirie - Animal mort 9845
29 Arbre - Danger potentiel 9200
... ... ...
4367 Autre - Résidus verts 1
4368 Panneau de signalisation entretien prioritaire... 1
4369 Signalisation - Circulation - Études Urgence 1
4370 Occupation du domaine public herbes 1
4371 Logement insalubre sélective 1
4372 Surveillance et sécurité travaux - Construction 1
4373 Voirie - Balai mécanique Réparation 1
4374 Autre - Éclairage existant - Entretien Entretien 1
4375 Égout - Puisard services municipaux 1
4376 Environnement - oiseaux 1
4377 Environnement - Collecte de branches d'arbre Rats 1
4378 Chien - Fourrière et patrouille branches d'arbre 1
4379 Environnement - Terrain insalubre Réparation 1
4380 Environnement - Bac ou sac de récupération - D... 1
4381 Environnement - Bac montréalais (67 litres) ré... 1
4382 Surveillance et sécurité publique foncières 1
4383 Eau - Fermeture d'entrée d'eau GDC 1
4384 Arbre - Élagage Urgence 1
4385 Éclairage existant - Entretien Existant 1
4386 *Services administratifs Sud- Ouest ancien 1
4387 *Retour d'affiche ramasser 1
4388 Signalisation écrite - Nouveau et entretien - ... 1
4389 Parc - Menuiserie municipaux 1
4390 Subvention - Habitation Non résident 1
4391 Permis - Logement récupération - Distribution 1
4392 Déchets - Objet volumineux non ramassé Distrib... 1
4393 Déchets - Collecte Travaux publics - Divers 1
4394 Environnement - Herbes hautes - Terrain privé ... 1
4395 *Voirie - Borne Fontaine - Entretien et répara... 1
4396 Voirie - Gazon - Réparation patrouille 1

4397 rows × 2 columns

And now graphically:

In [4]:
p = nature_counts["counts"].cumsum().plot(legend=False)
p.set_xlabel("Term Rank")
p.set_ylabel("Cumulative Number of Requests")
Out[4]:
<matplotlib.text.Text at 0x108f5c5d0>

Cardinality Reduction

In order to get a better sense of the temporal and spatial patterns in this data by digging into the kinds of requests it contains, we'll need to group the 4000+ values of the Nature column into a smaller number of categories. A manageable number would be between 10 and 20 categories, including a catch-all "Other" that should not be the biggest, and with less than a 10-to-1 size difference between the biggest and smallest categories.
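These goals can be expressed as a small predicate to test after each categorization pass. This is a sketch, not code from the original analysis, and the category counts in the examples are hypothetical:

```python
# A sketch of a helper that tests the categorization goals stated above:
# "Other" must not be the largest category, and the largest category must
# be less than 10x the smallest.
def goals_met(category_counts):
    """category_counts: dict mapping category name -> number of requests."""
    biggest = max(category_counts, key=category_counts.get)
    sizes = list(category_counts.values())
    return biggest != "Other" and max(sizes) < 10 * min(sizes)

print(goals_met({"Other": 50, "Dogs": 60, "Water": 20}))  # True: 60 < 10 * 20
print(goals_met({"Other": 90, "Dogs": 60, "Water": 20}))  # False: "Other" is biggest
```

A check like this makes the stopping condition of the iterative process below mechanical rather than eyeballed.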

The general approach we'll take is to group the requests based on keywords present in the Nature field. We'll start with some basic prep on the data by normalizing it all to ASCII. This isn't strictly necessary, but it makes it easier to write string-matching code without having to type accents on a standard US keyboard. It will also help with matching in case a French word was mistyped without accents (admittedly, not a particularly common occurrence).

In [5]:
import unicodedata

def to_ascii(input_str):
    if isinstance(input_str, str):
        input_str = input_str.decode('utf-8')
    elif not isinstance(input_str, unicode):
        input_str = str(input_str).decode('utf-8')
    nkfd_form = unicodedata.normalize('NFKD', input_str)
    only_ascii = nkfd_form.encode('ASCII', 'ignore')
    return only_ascii

requests.Nature = requests.Nature.apply(to_ascii)

A fairly subjective iterative categorization process yields the following categorization function. The keys of the keywords dictionary are the category names and the values are the set of words which will cause that category to be applied to a given request if present in the Nature field. The first category to match is the one that is applied to the request.

The process used was to start with no filters (i.e. all requests are categorized as "Other") and then iterate through:

  1. Generating a list of the most frequent values of Nature per category
  2. Adding or modifying the list of filters based on the list above to move requests out of the "Other" category
  3. Stopping if the categories meet the goals, otherwise iterating again
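Step 1 of that loop can be sketched as follows. The DataFrame here is a toy stand-in for the real `requests` frame, with made-up rows; the point is just the pattern of pulling out the most frequent Nature values still stuck in "Other":

```python
# A sketch of step 1 of the loop above: list the most frequent Nature
# values still landing in "Other", to suggest the next keywords to add.
import pandas as pd

toy = pd.DataFrame({
    "Nature": ["Chien - Permis", "Egout - Puisard", "Egout - Puisard",
               "Subvention - Habitation", "Subvention - Habitation",
               "Subvention - Habitation"],
    "Category": ["Dogs", "Other", "Other", "Other", "Other", "Other"],
})

# Most common Natures in "Other", largest first
top_other = toy[toy.Category == "Other"].Nature.value_counts().head(10)
print(top_other)
# "Subvention - Habitation" (3) and "Egout - Puisard" (2) would drive
# the next round of keyword additions.
```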
In [6]:
from collections import OrderedDict
import re

def categorize(x):
    keywords = OrderedDict()
    keywords["Dogs"] = ["chien"]
    keywords["Taxes"] = ["changement d'adresse", "taxes", "taxe"]
    keywords["Permits"] = ["permis", "occupation","zonage"]
    keywords["Selective Collection"] = ["bac","bacs","sac","sacs","recyclage","residus",
                                        "collecte selective","collecte de branches",
                                        "collecte de feuilles"]
    keywords["Bulky Collection"] = ["volumineux"]
    keywords["Garbage Collection"] = ["collecte","dechets"]
    keywords["Trees"] = ["arbre"]
    keywords["Water"] = ["eau"]
    keywords["Sewers"] = ["egout","puisard"]
    keywords["Signage"] = ["signalisation","marquage","feux"]
    keywords["Snow"] = ["neige","deneigement"]
    keywords["Roads"] = ["voirie","voie","travaux","rue","chaussee","charretiere","trottoir"]
    keywords["Environment"] = ["environnement","maisons fleuries","horticulture","pesticide",
                               "compost","animal", "punaises", "jardins","extermination",
                               "animaux","sauvage","gazon","bruit","nuisances"]
    keywords["Parking"] = ["stationnement","ads","srrr","parcometre"]
    keywords["Lighting"] = ["eclairage","lampadaire"]
    keywords["Parks"] = ["parc", "parcs", "chalet", "chalets","sport","sports","piscine"]
    keywords["Security"] = ["surveillance", "police", "securite"]

    keyword_res = OrderedDict() 
    for k in keywords:
        pat = "\\b%s\\b" % "\\b|\\b".join(keywords[k])
        keyword_res[k] = re.compile(pat, re.IGNORECASE)

    for r in keyword_res:
        if keyword_res[r].search(x): return r
    return "Other"

requests["Category"] = requests.Nature.map( 
    {x: categorize(x) for x in requests.Nature.unique()}.get 
)
requests.groupby("Category").size().plot(kind="barh", title="Requests by Category")
Out[6]:
<matplotlib.axes._subplots.AxesSubplot at 0x10e790450>

A cursory glance at the list reveals that it is close to being a mutually-exclusive, collectively-exhaustive (MECE) description of the service areas the city is involved in. This intuitively makes sense: any citizen request about anything else wouldn't be recorded in this dataset and it would be very surprising if the city offered a service that few people inquired or complained about.

Of special note are the three "Collection" categories: waste collection requests were split into three separate categories to avoid having one huge "Waste Collection" category. "Selective" collection covers recycling, branches and the like; "Bulky" collection covers large waste items such as mattresses and furniture; and "Garbage" is everything else.
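This split only works because categorize() applies the first matching category: "Dechets - Objet volumineux a ramasser" contains both "volumineux" and "dechets", and lands in "Bulky Collection" only because that entry precedes "Garbage Collection" in the OrderedDict. A minimal sketch of that first-match behavior, using just those two entries:

```python
# Demonstrating why keyword order matters: the first category whose
# keyword pattern matches wins, mirroring the categorize() function above.
import re
from collections import OrderedDict

keywords = OrderedDict()
keywords["Bulky Collection"] = ["volumineux"]
keywords["Garbage Collection"] = ["collecte", "dechets"]

def first_match(nature):
    for cat, words in keywords.items():
        # Build a pattern like \bvolumineux\b, one alternative per keyword
        pat = "\\b%s\\b" % "\\b|\\b".join(words)
        if re.search(pat, nature, re.IGNORECASE):
            return cat
    return "Other"

print(first_match("Dechets - Objet volumineux a ramasser"))  # Bulky Collection
print(first_match("Dechets - Collecte"))                     # Garbage Collection
```

Swapping the two entries would funnel every bulky-item request into "Garbage Collection", since "dechets" matches first.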

So what kinds of requests ended up in each category? Here's a list of the top 5 values of Nature per category.

In [7]:
x = requests.groupby(["Category", "Nature"]).size().reset_index(name="counts")\
    .sort(columns=["Category", "counts"], ascending=False)
x["grouprank"] = x.groupby(["Category"])['counts'].rank(ascending=False)
x.groupby(["Category"]).head(5).pivot_table(
    index="Category", columns="grouprank", values="Nature", aggfunc=lambda x:x)
Out[7]:
grouprank 1.0 2.0 3.0 4.0 5.0
Category
Bulky Collection Dechets - Objet volumineux a ramasser Dechets - Objet volumineux non ramasse Dechets - Objet volumineux a Dechets - Objet volumineux non Dechets - Objet volumineux non ramass
Dogs Chien - Permis Chien - Fourriere et patrouille *Chien sans permis Chien - Fourriere et Chien qui aboie
Environment Environnement - Terrain insalubre Environnement - Graffitis - Propriete privee Environnement - Divers Maisons fleuries Environnement - Animal : pollution et nourriture
Garbage Collection Dechets - Collecte Dechets - Depot illegal Environnement - Collecte d'ecran (ordinateur e... *Dechets robotises - Non ramasses *Autre - Dechets robotises
Lighting Eclairage existant - Entretien Lampadaire - Reparation Eclairage - Etudes Autre - Eclairage existant - Entretien Eclairage insuffisant
Other Subvention - Habitation Reclamation UIR - Utilite publique Retour DDS Services Techniques - Varia
Parking Intervention stationnement Intervention constat d'infraction SRRR SRRR Intervention de stationnement - Enquete Agent de stationnement - Plainte contre un ADS
Parks Parc - Varia Parc - Proprete Parc - Mobilier 15 - Expulsion chalets, parcs, stationnements Sports et loisirs - Installations
Permits Permis - Divers Permis - Logement Permis - Construction Permis - Bruit Occupation du domaine public
Roads Voirie - Divers Voirie - Nid-de-poule Voirie - Nettoyage du domaine public Voirie - Debris sur la voie publique Voirie - Pavage - Reparation
Security 20 - Surveillance particuliere UIR - Police Surveillance et securite publique Perimetre de securite - Urgence Securite incendie - Themes administratifs
Selective Collection Environnement - Bac ou sac de recuperation - D... Environnement - Collecte selective Environnement - Collecte de branches d'arbre Environnement - Bac roulant Environnement - Residus verts
Sewers Egout - Puisard Egout - Refoulement Egout - Divers Puisard - Nettoyage - Rue Puisard - A nettoyer
Signage Feux de circulation - Entretien Signalisation ecrite - Nouveau et entretien Signalisation - Circulation - Etudes Panneau de signalisation entretien prioritaire Marquage de la chaussee - Existant
Snow Neige - Divers Neige - Chaussee glissante Neige - Trottoir glissant Neige - Avis d'infraction Deneigement - Plainte
Taxes Changement d'adresse - Taxes eau et dechets Changement d'adresse - Taxes foncieres Taxes foncieres Taxe d'eau Changement d'adresse - Taxes
Trees Arbre - Elagage Arbre - Danger potentiel Arbre - Maladie et insectes Arbre - Branche tombee Arbre - Varia
Water Eau - Fermeture d'entree d'eau Eau - Divers Eau - Ouverture d'entree d'eau Eau - Fuite d'eau Eau - Borne-fontaine - Entretien ou deplacement

We can also look at how many different values of Nature ended up in each category, and see that things make sense: a big variety in the "Other" category, and a fair number of different values collapsed into each specific category, roughly in proportion to how many requests it contains.

In [8]:
requests.groupby(["Category", "Nature"]).size().reset_index(name="counts")\
    .groupby("Category").size().plot(kind="barh", title="Natures by Category")
Out[8]:
<matplotlib.axes._subplots.AxesSubplot at 0x10edf9ed0>

Next Steps

We now have our categorized data, ready for some more interactive analysis using PivotTable.js, an in-browser interactive pivot table tool. We'll export the data to a CSV to load into the pivot table.

In [9]:
requests.groupby([
"Year", "Month", "Borough", "Category"
]).size().reset_index(name="Requests").to_csv("311_pivot.csv", index=False)

You can follow along with the rest of the analysis here: Montreal 311 Service Requests, an Analysis


© Nicolas Kruchten 2010-2017