writes code and visualizes data
in Montréal, Québec, Canada.
I recently did a guest lecture (in French!) at the Université de Montréal in the context of the École d’été en Architecture de l’information (Summer program for Information Architecture).
As part of my second collaboration with data journalist Roberto Rocha, I made an interactive map for his recent piece on where and when Car2Go vehicles park in Montreal (shorter english version). Earlier in the year, Roberto told me about people in certain neighbourhoods complaining about Car2Go vehicles causing parking problems. He and I hit upon the idea of querying Car2Go’s API every few minutes to find out where all their available cars were parked in Montreal, to take a look at some real data on this issue. I’m a huge fan and user of car-sharing services and in my neighbourhood of Rosemont I feel they prevent parking problems by enabling lower car ownership. As my map makes clear, however, this is not the case in areas like the Mile End. In any case, the CBC articles do a great job of reporting on the situation, and I wanted to share some of the thinking and code that went into making the map.
After my photographic metro platform maps went viral last week, I received a lot of feedback in the form of emails and comments, telling me about the experiences of subway riders in other cities. Here are some interesting vignettes.
The photo above (click here for a zoomable version) is a collage of panoramic scans of the Angrignon-bound platforms of the Montreal metro’s green line. I used my phone to record videos from the rear-most window of the train and wrote a bit of software to stitch the frames together. My goal was to create a way to figure out where to stand while waiting for the metro so as to get out closest to where you want to go at your destination, and I used these scans to build a little interactive comparison page for just this purpose.
Visualizing datasets as circle-and-arrow networks or graphs is a popular and easy way to make attention-grabbing graphics. As the number of data points grows, however, these graphics become crowded and marginally useful. Dimensionality-reduction algorithms such as t-SNE represent a different approach to visualizing the relationships between large numbers of data points, which in certain cases can produce graphics which do not suffer from the same types of problems as graph-visualization approaches. In this talk I compare and contrast the two approaches and give pointers to those who wish to try them out.
By using machine learning algorithms, we are increasingly able to use computers to perform intellectual tasks at a level approaching that of humans. Given that computers cost less than employees, many people are afraid that humans will therefore necessarily lose their jobs to computers. Contrary to this belief, in this article I show that even when a computer can perform a task more economically than a human, careful analysis suggests that humans and computers working together can sometimes yield even better business outcomes than simply replacing one with the other.
Specifically, I show how a classifier with a reject option can increase worker productivity for certain types of tasks, and I show how to construct and tune such a classifier from a simple scoring function by using two thresholds. I begin with a parable featuring the same characters as the one from Part 1 of this Machine Learning Meets Economics series. I recommend reading Part 1 first, as it sets up much of the terminology I use here.
I presented MLDB today at the BigData Innovators Gathering (BIG) 2016 conference.
I was recently invited to give a talk about auction theory and online advertising at Concordia University for a course entitled Social and Information Networks, which uses a really interesting textbook called Networks, Crowds, and Markets.
The business world is full of streams of items that need to be filtered or evaluated: parts on an assembly line, resumés in an application pile, emails in a delivery queue, transactions awaiting processing. Machine learning techniques are increasingly being used to make such processes more efficient: image processing to flag bad parts, text analysis to surface good candidates, spam filtering to sort email, fraud detection to lower transaction costs etc.
In this article, I show how you can take business factors into account when using machine learning to solve these kinds of problems with binary classifiers. Specifically, I show how the concept of expected utility from the field of economics maps onto the Receiver Operating Characteristic (ROC) space often used by machine learning practitioners to compare and evaluate models for binary classification. I begin with a parable illustrating the dangers of not taking such factors into account. This concrete story is followed by a more formal mathematical look at the use of indifference curves in ROC space to avoid this kind of problem and guide model development. I wrap up with some recommendations for successfully using binary classifiers to solve business problems.