writes code and visualizes data
in Montréal, Québec, Canada.
The business world is full of streams of items that need to be filtered or evaluated: parts on an assembly line, resumés in an application pile, emails in a delivery queue, transactions awaiting processing. Machine learning techniques are increasingly being used to make such processes more efficient: image processing to flag bad parts, text analysis to surface good candidates, spam filtering to sort email, fraud detection to lower transaction costs etc.
In this article, I show how you can take business factors into account when using machine learning to solve these kinds of problems with binary classifiers. Specifically, I show how the concept of expected utility from the field of economics maps onto the Receiver Operating Characteristic (ROC) space often used by machine learning practitioners to compare and evaluate models for binary classification. I begin with a parable illustrating the dangers of not taking such factors into account. This concrete story is followed by a more formal mathematical look at the use of indifference curves in ROC space to avoid this kind of problem and guide model development. I wrap up with some recommendations for successfully using binary classifiers to solve business problems.
As you can see in the video above, during the talk I just scrolled through an R file in RStudio. What you see below is the result of slightly modifying that file and running it through the RMarkdown process to capture the output.
pivottablejs module. This has been possible for RStudio users for a while now via rPivotTable, but why should they have all the fun?
For the latest in my series of maps of the results of the 2013 Montreal municipal election, I’ve produced a pair of graduated symbol maps, representing the results as a pie charts overlaid on a base map. It’s interesting to compare this type of visualization to my previous efforts: the dot map, the choropleth, and the ternary plot.
I had the pleasure of visiting with many members of my wife’s family this summer, some of whom are genealogy enthusiasts. I made a pair of visualizations of the data they had collected: one in the run-up to a family reunion and one to find my way around the large family we visited in Saskatchewan.
I learned to program in the late nineties through a game called RoboWar. RoboWar provides a simulated arena in which two virtual robots try to destroy each other by running a program written by their respective players. The goal was to create a robot which could win a tournament, which were held about twice a year; entrants would email their creation to someone with a fast computer who would simulate hundreds of battles and then let everyone know who had won and make the entries public.
Earlier this year, I collaborated with a reporter from the Montreal Gazette to analyze a dataset containing information about 1.4 million service requests received by the City of Montreal from its citizens. The resulting article was entitled "Montreal's 311 records shed light on residents' concerns — to a point" and credits me at the bottom. I have also published my own interactive analysis of the dataset here: Montreal 311 Service Requests, an Analysis. The dataset, obtained from the city's Gestion des demandes clients (GDC) system via an Access to Information request, covered the five years from 2008 to 2012 and contained the date and a very short description for each request, and in most cases, an address. The service requests were received by the city through its 311 phone line or at service counters throughout the city.