Earlier this year, I collaborated with a reporter from the Montreal Gazette to analyze a dataset containing information about 1.4 million service requests received by the City of Montreal from its citizens. The resulting article, entitled "Montreal's 311 records shed light on residents' concerns — to a point", credits me at the bottom. I have also published my own interactive analysis of the dataset here: Montreal 311 Service Requests, an Analysis. The dataset, obtained from the city's Gestion des demandes clients (GDC) system via an Access to Information request, covered the five years from 2008 to 2012 and contained, for each request, the date, a very short description and, in most cases, an address. The service requests were received by the city through its 311 phone line or at service counters throughout the city.
My major contribution to this article was coming up with a categorization for these requests, and finding some interesting patterns and anomalies in the categorized data for Roberto at the Gazette to dig into. In the rest of this blog post, I detail how I first approached the dataset and categorized the 1.4 million calls using the Jupyter Notebook system.
First we load up the data into a Pandas DataFrame and get a very high-level overview.
import pandas as pd
# read_csv transparently decompresses the gzipped file
requests = pd.read_csv("311.csv.gz")
requests.describe()
We seem to have 1826 days or 5 years of data across all 19 of Montreal's boroughs, so let's make a quick chart of request counts over time by borough. Daily is too fine a grain so we'll go monthly.
from datetime import date
requests.creation_date = pd.to_datetime(requests.creation_date)
# Bucket each request into the first day of its month for a monthly series
requests["year_month"] = requests.creation_date.apply(
    lambda x: date(year=x.year, month=x.month, day=1)
)
requests["Year"] = requests.creation_date.dt.year
requests["Month"] = requests.creation_date.dt.strftime("%b")
%matplotlib inline
from matplotlib import rcParams, style
style.use('ggplot')
rcParams['figure.figsize'] = 12, 8
requests.pivot_table(
    index='year_month', columns='Borough',
    values='Nature', aggfunc=len
).plot().legend(loc='center right', bbox_to_anchor=(1.45, 0.5))
This is not a particularly readable chart, but it certainly looks like we have data from every borough for every month. There appear to be some major inter-borough differences (Saint-Laurent is only the 8th most populous borough but has by far the most requests) as well as temporal patterns (a yearly cycle of low winter and high summer). The idea behind this analysis is to get a handle on these differences.
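To pin down that inter-borough claim, a quick tally of total requests per borough (a one-liner of my own, not from the original notebook) shows which boroughs dominate:
requests.groupby("Borough").size().sort_values(ascending=False).head(10)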
Let's take a look at that Nature column; it seems to have too many different values to easily make sense of.
nature_counts = requests.Nature.value_counts().reset_index(name="counts")
nature_counts
And now graphically:
p = nature_counts["counts"].cumsum().plot(legend=False)
p.set_xlabel("Term Rank")
p.set_ylabel("Cumulative Number of Requests")
To dig into the kinds of requests in this dataset and get a better sense of its temporal and spatial patterns, we'll need to group the 4000+ values of the Nature column into a smaller number of categories. A manageable number would be between 10 and 20 categories, including a catch-all "Other" which should not be the biggest, and with less than a 10-to-1 difference in size between the biggest and the smallest categories.
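These rules of thumb are easy to spot-check in code once the Category column is built further down; this helper is my own addition, not part of the original analysis:
def check_categorization(counts):
    # counts: requests per category, e.g. requests.groupby("Category").size()
    return (
        10 <= len(counts) <= 20               # a manageable number of categories
        and counts.idxmax() != "Other"        # the catch-all is not the biggest
        and counts.max() < 10 * counts.min()  # under a 10-to-1 size spread
    )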
The general approach we'll take is to group the requests based on keywords present in the Nature field. We'll start with some basic prep on the data by normalizing it all to ASCII. This isn't strictly necessary, but it makes it easier to write string-matching code without having to type accents on a standard US keyboard. It will also help with string-matching in case a French word was mistyped without accents (admittedly, not a particularly common occurrence).
import unicodedata
def to_ascii(input_str):
    # Decompose accented characters (NFKD), then drop the non-ASCII combining marks
    nfkd_form = unicodedata.normalize('NFKD', str(input_str))
    return nfkd_form.encode('ASCII', 'ignore').decode('ASCII')
requests.Nature = requests.Nature.apply(to_ascii)
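A quick sanity check on a made-up value (illustrative, not from the dataset):
to_ascii("Déneigement requis")  # -> 'Deneigement requis'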
A fairly subjective, iterative categorization process yields the following categorization function. The keys of the keywords dictionary are the category names, and the values are the sets of words which, if present in the Nature field, cause that category to be applied to a given request. The first category to match is the one that is applied to the request.
The process used was to start with no filters (i.e. all requests categorized as "Other") and then iterate: review the most frequent values of Nature still landing in "Other", add or reorder keywords to capture them, and check the top values of Nature per category to make sure nothing was miscategorized.
from collections import OrderedDict
import re
# Category keywords, in priority order: the first pattern to match wins
keywords = OrderedDict()
keywords["Dogs"] = ["chien"]
keywords["Taxes"] = ["changement d'adresse", "taxes", "taxe"]
keywords["Permits"] = ["permis", "occupation", "zonage"]
keywords["Selective Collection"] = ["bac", "bacs", "sac", "sacs", "recyclage", "residus",
                                    "collecte selective", "collecte de branches",
                                    "collecte de feuilles"]
keywords["Bulky Collection"] = ["volumineux"]
keywords["Garbage Collection"] = ["collecte", "dechets"]
keywords["Trees"] = ["arbre"]
keywords["Water"] = ["eau"]
keywords["Sewers"] = ["egout", "puisard"]
keywords["Signage"] = ["signalisation", "marquage", "feux"]
keywords["Snow"] = ["neige", "deneigement"]
keywords["Roads"] = ["voirie", "voie", "travaux", "rue", "chaussee", "charretiere", "trottoir"]
keywords["Environment"] = ["environnement", "maisons fleuries", "horticulture", "pesticide",
                           "compost", "animal", "punaises", "jardins", "extermination",
                           "animaux", "sauvage", "gazon", "bruit", "nuisances"]
keywords["Parking"] = ["stationnement", "ads", "srrr", "parcometre"]
keywords["Lighting"] = ["eclairage", "lampadaire"]
keywords["Parks"] = ["parc", "parcs", "chalet", "chalets", "sport", "sports", "piscine"]
keywords["Security"] = ["surveillance", "police", "securite"]
# Compile one whole-word, case-insensitive regex per category, once, up front
keyword_res = OrderedDict(
    (k, re.compile(r"\b%s\b" % r"\b|\b".join(v), re.IGNORECASE))
    for k, v in keywords.items()
)
def categorize(x):
    for k, pattern in keyword_res.items():
        if pattern.search(x):
            return k
    return "Other"
requests["Category"] = requests.Nature.map(
{x: categorize(x) for x in requests.Nature.unique()}.get
)
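A couple of spot checks on made-up inputs (illustrative only, not actual Nature values) show the priority ordering at work: "trottoir" would put the second request under Roads, but Snow is checked first.
categorize("Collecte des ordures")     # -> 'Garbage Collection'
categorize("Deneigement du trottoir")  # -> 'Snow', not 'Roads'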
requests.groupby("Category").size().plot(kind="barh", title="Requests by Category")
A cursory glance at the list reveals that it is close to being a mutually-exclusive, collectively-exhaustive (MECE) description of the service areas the city is involved in. This intuitively makes sense: a citizen request about anything else wouldn't be recorded in this dataset, and it would be very surprising if the city offered a service that few people inquired or complained about.
Of special note are the three "Collection" categories: waste-collection requests were split up into three to avoid having one huge "Waste Collection" category. "Selective" collection covers recycling, branches and the like; "Bulky" collection covers large items such as mattresses and furniture; and "Garbage" collection is everything else.
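To check that the split achieved its goal of comparably-sized categories (my own check, not in the original notebook):
requests[requests.Category.str.endswith("Collection")].groupby("Category").size()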
So what kinds of requests ended up in each category? Here's a list of the top 5 values of Nature per category.
x = requests.groupby(["Category", "Nature"]).size().reset_index(name="counts")\
    .sort_values(by=["Category", "counts"], ascending=False)
# method="first" breaks ties so that ranks within each category are unique
x["grouprank"] = x.groupby(["Category"])['counts'].rank(ascending=False, method="first")
x.groupby(["Category"]).head(5).pivot_table(
    index="Category", columns="grouprank", values="Nature", aggfunc="first")
We can also look at how many different values of Nature ended up in each category, and see that things make sense: a big variety in the "Other" category, and a fair number of different values collapsed into the various specific categories, roughly in proportion to how many requests fall into each.
requests.groupby(["Category", "Nature"]).size().reset_index(name="counts")\
.groupby("Category").size().plot(kind="barh", title="Natures by Category")
We now have our categorized data, ready for some more interactive analysis using PivotTable.js, an in-browser interactive pivot table tool. We'll export the data to a CSV to load into the pivot table.
requests.groupby([
    "Year", "Month", "Borough", "Category"
]).size().reset_index(name="Requests").to_csv("311_pivot.csv", index=False)
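As an aside, if you'd rather explore without leaving the notebook, the pivottablejs Python package (assuming it is installed, e.g. via pip install pivottablejs) wraps the same PivotTable.js UI:
from pivottablejs import pivot_ui
# Aggregate first so the browser isn't handed 1.4 million raw rows
summary = requests.groupby(["Year", "Month", "Borough", "Category"]).size().reset_index(name="Requests")
pivot_ui(summary)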
You can follow along with the rest of the analysis here: Montreal 311 Service Requests, an Analysis
⁂