Nicolas Kruchten

Nicolas Kruchten is a software engineer in Montréal, Québec, Canada.

Datacratic

I worked for six years at Datacratic, a software company specializing in machine learning. The articles I wrote and the talks I gave during my time there are collected here.

Machine Learning Meets Economics, Part 2


With machine learning algorithms, we are increasingly able to use computers to perform intellectual tasks at a level approaching that of humans. Because computers cost less than employees, many people fear that humans will inevitably lose their jobs to computers. Contrary to this belief, in this article I show that even when a computer can perform a task more economically than a human, careful analysis suggests that humans and computers working together can sometimes yield even better business outcomes than simply replacing one with the other.

Specifically, I show how a classifier with a reject option can increase worker productivity for certain types of tasks, and I show how to construct and tune such a classifier from a simple scoring function by using two thresholds. I begin with a parable featuring the same characters as the one from Part 1 of this Machine Learning Meets Economics series. I recommend reading Part 1 first, as it sets up much of the terminology I use here.
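As a rough illustration of that two-threshold construction (the threshold values and routing labels here are mine, not from the post), a reject-option classifier can be sketched in a few lines of Python:

```python
# A minimal sketch of a reject-option classifier built from a scoring
# function and two thresholds; the threshold values below are illustrative.

def classify_with_reject(score, t_low=0.2, t_high=0.8):
    """Route an item based on its model score: confident cases are handled
    automatically, ambiguous ones are rejected, i.e. escalated to a human."""
    if score >= t_high:
        return "accept"   # confident positive: handle automatically
    if score <= t_low:
        return "discard"  # confident negative: handle automatically
    return "reject"       # uncertain: send to a human worker

# Only the ambiguous middle band of scores reaches a human.
for s in (0.05, 0.5, 0.95):
    print(s, classify_with_reject(s))
```

Tuning the two thresholds then becomes a business decision: widening the reject band sends more items to humans, narrowing it automates more of the work.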

Full post »


BIG 2016: The Machine Learning Database


I presented MLDB today at the BigData Innovators Gathering (BIG) 2016 conference.

The whitepaper is available as a PDF.

Full post »


Machine Learning Meets Economics


The business world is full of streams of items that need to be filtered or evaluated: parts on an assembly line, résumés in an application pile, emails in a delivery queue, transactions awaiting processing. Machine learning techniques are increasingly being used to make such processes more efficient: image processing to flag bad parts, text analysis to surface good candidates, spam filtering to sort email, fraud detection to lower transaction costs, and so on.

In this article, I show how you can take business factors into account when using machine learning to solve these kinds of problems with binary classifiers. Specifically, I show how the concept of expected utility from the field of economics maps onto the Receiver Operating Characteristic (ROC) space often used by machine learning practitioners to compare and evaluate models for binary classification. I begin with a parable illustrating the dangers of not taking such factors into account. This concrete story is followed by a more formal mathematical look at the use of indifference curves in ROC space to avoid this kind of problem and guide model development. I wrap up with some recommendations for successfully using binary classifiers to solve business problems.
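To make the mapping concrete, here is a small sketch of how the per-item expected utility of operating a classifier at a given point in ROC space can be computed; the utility values and base rate are illustrative, not figures from the article:

```python
# Expected utility of a binary classifier at ROC point (fpr, tpr),
# given a base rate of positives and per-outcome utilities.

def expected_utility(tpr, fpr, base_rate, u_tp, u_fp, u_fn, u_tn):
    p_pos, p_neg = base_rate, 1.0 - base_rate
    return (p_pos * (tpr * u_tp + (1 - tpr) * u_fn)
            + p_neg * (fpr * u_fp + (1 - fpr) * u_tn))

# Two classifiers at different ROC points: which is "better" depends on
# the economics of the task, not only on accuracy-style metrics.
econ = dict(base_rate=0.1, u_tp=10.0, u_fp=-1.0, u_fn=-5.0, u_tn=0.0)
print(expected_utility(tpr=0.9, fpr=0.30, **econ))
print(expected_utility(tpr=0.7, fpr=0.05, **econ))
```

Since this formula is linear in tpr and fpr, its level sets are straight lines in ROC space, which is where the indifference curves mentioned above come from.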

Full post »


The Programmatic Waterfall Mystery: Solved


A recent article on AdExchanger asks “In the supposedly super-efficient world of RTB, why would publishers continue to waterfall their demand sources?”. The article goes on to say that the publisher’s justification is “Because it works” but that “Any economist could tell you that this is a bad idea”. I’m not an economist but I can still pull together enough auction theory to show that this practice isn’t necessarily a bad one today.

Full post »


PAPIs.io 2014: Behind the scenes with a Predictive API


I gave a talk in Barcelona at the PAPIs.io 2014 Predictive APIs conference last November.

Full post »


Interactive Subreddit Map with t-SNE


For part of my presentation at Montreal Python, I made an interactive map of the various sub-sections of the website Reddit (called subreddits). You can take a look at the interactive version or see a static annotated one above. The interactive one includes basic info on how I made it, and full details are in the presentation. I got some nice comments on the /r/DataIsBeautiful subreddit post.

Full post »


Montreal Python: Unsupervised ML with scikit-learn


I gave a talk at Montreal Python on Data Science and Unsupervised Machine Learning with scikit-learn. The video is above and I posted all of my presentation materials online.

Full post »


Arbitraging an RTB Exchange


Last week, Bloomberg came out with an article on RTB arbitrage, which included a couple of sentences that made it sound a lot like it was possible to front-run an RTB auction: “Some buy from an exchange and sell it right back to that very same exchange” and “Some agencies are poorly connected to exchanges and can’t respond to a first auction in time, allowing middlemen to buy and flip within the same market”. This seemed surprising to me at first, given that all auction participants (as far as I know) get the same opportunity to bid on an impression, so how could you make money buying and selling the same impression on the same exchange? Upon further thought, however, here’s a theory about how it might work.

Full post »


Visualizing High-Dimensional Data in the Browser with SVD, t-SNE and Three.js


Data visualization, by definition, involves making a two- or three-dimensional picture of data, so when the data being visualized inherently has many more dimensions than two or three, a big component of data visualization is dimensionality reduction. Dimensionality reduction is also often the first step in a big-data machine-learning pipeline, because most machine-learning algorithms suffer from the Curse of Dimensionality: more dimensions in the input means you need exponentially more training data to create a good model. Datacratic’s products operate on billions of data points (big data) in tens of thousands of dimensions (big problem), and in this post, we show off a proof of concept for interactively visualizing this kind of data in a browser, in 3D (of course, the images on the screen are two-dimensional but we use interactivity, motion and perspective to evoke a third dimension).
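As a rough sketch of this kind of two-stage reduction (using scikit-learn on made-up data, rather than the in-browser implementation the post describes), a linear SVD step can bring the data down to a few dozen dimensions before t-SNE produces the final 3D coordinates:

```python
# Illustrative SVD -> t-SNE pipeline; the data here is random noise,
# standing in for many points in tens of thousands of dimensions.

import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X = rng.random((1000, 10000))          # 1,000 points in 10,000 dimensions

# Step 1: cheap linear reduction with truncated SVD.
X_svd = TruncatedSVD(n_components=50).fit_transform(X)

# Step 2: non-linear reduction with t-SNE down to 3 dimensions for display.
X_3d = TSNE(n_components=3, init="random", perplexity=30).fit_transform(X_svd)
print(X_3d.shape)                      # (1000, 3)
```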

Full post »


Big Data Week Montreal: From Big Data to Big Value


Video and slides from my talk at the kickoff of Big Data Week Montreal 2014.

Full post »


Peeking Into the Black Box, Part 4 - Shady Bidding


In Part 1 of this series, I said that in real-time bidding you should “bid truthfully”, i.e. bid whatever winning is worth to you. To compute this truthful value, given a target cost per action (CPA) for a campaign, I said you could simply multiply that target by the computed probability of seeing an action after the impression, and that would give you your bid value.

I added that by calculating an expected cost of winning an auction, you could compute the expected surplus for that auction. To pace your spending efficiently, you would bid truthfully only when this expected surplus was above some threshold value, and not bid otherwise. This threshold value would be the output of a closed-loop pace control system (described in Part 0) whose job is to keep the spend rate close to some target.

In Part 3 of this series, I then showed that the approach in the second claim of Part 1 was in fact not optimal: instead of setting an expected-surplus threshold, you should set an expected return-on-investment (ROI) threshold.

In this post, Part 4 of the series, I show that the meaning of “bidding truthfully” can be slipperier than expected, and that you can get the same results as an ROI-based pacing strategy with a perfect expected-cost model, without even needing to use an expected-cost model.
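For readers who want the recap in code, here is a toy version of the Part 1 computations referred to above (the numbers are made up, and this is not Datacratic's production code):

```python
# "Bid truthfully": bid the target cost-per-action scaled by the predicted
# probability of an action, then compare the expected surplus to a threshold.

def truthful_bid(target_cpa, p_action):
    """Bid what winning is worth to you: target CPA x P(action | impression)."""
    return target_cpa * p_action

def expected_surplus(bid, expected_cost):
    """Expected value of winning minus the expected price paid."""
    return bid - expected_cost

bid = truthful_bid(target_cpa=2.50, p_action=0.01)      # 0.025
print(bid, expected_surplus(bid, expected_cost=0.015))   # surplus of 0.01
```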

Full post »


Peeking Into the Black Box, Part 3 - Beyond Surplus


In Part 1 of this Peeking Into the Black Box series, I described how you could compute the expected economic surplus of truthfully bidding on an impression in an RTB context. I then explained that you could use this computation to decide which bidding opportunities were “better” than others, and therefore decide when to bid and when not to bid, based on the output of a closed-loop pace control system such as the one described in Part 0.

In this post, I show that in order to maximize the economic surplus over a whole campaign, the quantity you should use on an auction-by-auction basis to decide when to bid is actually the expected return on investment (ROI) rather than the expected surplus. At Datacratic, we actually switched to an ROI-based strategy in late 2012.
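A minimal sketch of the two decision rules being contrasted (with made-up numbers and thresholds) shows why they can select different auctions:

```python
# Surplus-based vs. ROI-based bid/no-bid decisions. The thresholds would come
# from the pacing controller; the values here are purely illustrative.

def should_bid_surplus(bid, expected_cost, surplus_threshold):
    return (bid - expected_cost) >= surplus_threshold

def should_bid_roi(bid, expected_cost, roi_threshold):
    # ROI: surplus relative to what you expect to spend.
    return (bid - expected_cost) / expected_cost >= roi_threshold

# Two opportunities with the same expected surplus (0.01) but very different
# expected costs: the surplus rule treats them alike, the ROI rule does not.
print(should_bid_surplus(0.02, 0.01, surplus_threshold=0.005))  # True
print(should_bid_surplus(0.11, 0.10, surplus_threshold=0.005))  # True
print(should_bid_roi(0.02, 0.01, roi_threshold=0.5))            # True
print(should_bid_roi(0.11, 0.10, roi_threshold=0.5))            # False
```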

Full post »


Rubicon Tech Talk: The Algorithms Automating Advertising


I was invited to speak on a panel at a Rubicon Project product launch, and this is the video of the event.

Full post »


My QR-Code Business Card


This is the machine-readable back of my new nerdy QR-code business card!

Full post »


Peeking Into the Black Box, Part 2 - Algorithm Meets World


In Part 1 of this series, I claimed that Datacratic’s RTB algorithm is able to take advantage of other bidders’ sub-optimal behaviour and navigate around publisher price floors in order to achieve advertiser goals. I then described the algorithm, which applies what can be called a “bid truthfully, pace economically” approach. In this second part, I show how this algorithm can in fact live up to these claims.

When you bid truthfully and pace economically, you are always trying to allocate your budget to the auctions which look like the best deals, whether that means that the user is very likely to click, or that the price is low because fewer bidders are in the running or there is no publisher price floor.

Full post »


Peeking Into the Black Box, Part 1 - Datacratic’s RTB Algorithms


This article is about the statistical and economic theory that underlies Datacratic’s real-time bidding strategies, as a follow-up to my previous article on how we apply control theory to pace our spending, which I’m grandfathering in as “Part 0” of this series called Peeking Into the Black Box.

At Datacratic, we develop real-time bidding algorithms. In order to accomplish advertiser goals, our algorithms automatically take advantage of other bidders’ sub-optimal behaviour, as well as navigate around publisher price floors. These are bold claims, and we want our partners to understand how our technology works and be comfortable with it. No “trust the Black Box” value proposition for us. In Part 1 of this series I’ll explain the basics of our algorithm: first how we determine how much to bid, and then how we determine when to bid. In Part 2 I’ll show how this approach responds in some real-world situations.

Full post »


Peeking Into the Black Box, Part 0 - RTB Pacing, is everyone doing it wrong?


I've written some follow-up posts to this one, as a series called Peeking into the Black Box, so I'm grandfathering this post in as "Part 0" of the series.

I read an interesting post on the AppNexus tech blog about their campaign monitoring tools, and the screenshots there almost exclusively contained various pacing measurements. Some of the graphs there looked a lot like the ones I had sketched up while trying to solve the pacing problem for our real-time bidding (RTB) client. Here are the basics of the problem: if someone gives you a fixed amount of money to run a display advertising campaign over a specific time period, it's generally advisable to spend exactly that amount of money, spread out reasonably evenly over that time period. Over-spending could mean you're on the hook for the difference, and under-spending doesn't look great if you want repeat business. And if you don't spend it evenly, you'll get some pissed-off customers, like this guy who had his $50 budget blown in minutes. Sounds obvious, right? Apparently it's harder than it looks!
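To give a flavour of what a closed-loop pacing controller can look like, here is a toy proportional controller I'm sketching for illustration; it is not the system described in this post or in the AppNexus one:

```python
# Nudge a "how picky to be" threshold so that actual spend tracks an even
# spend target over the campaign. Gain and spend figures are illustrative.

def update_threshold(threshold, spent_so_far, target_spend_so_far, gain=0.1):
    """If under-spending, lower the bar so we bid on more auctions;
    if over-spending, raise it so we bid on fewer."""
    error = spent_so_far - target_spend_so_far   # positive means over-spending
    return max(0.0, threshold + gain * error)

threshold = 1.0
for hour, spent, target in [(1, 3.0, 4.0), (2, 9.0, 8.0), (3, 12.0, 12.0)]:
    threshold = update_threshold(threshold, spent, target)
    print(hour, round(threshold, 2))
```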

Full post »


Real Time Bidding, Characterized


There doesn't appear to be a good Wikipedia entry on RTB for me to link to at the moment when I want to blog about it, so I'll draft my own explanation here. (Edit: there is an entry now, but I like my characterization better!) Keep in mind while reading this that I'm looking at RTB as a software engineer with an interest in economics, rather than as an ad industry veteran!

Full post »


Using make to Orchestrate Machine Learning Tasks


One of the things we do at Datacratic is to use machine learning algorithms to optimize real-time bidding (RTB) policies for online display advertising. This means we train software models to predict, for example, the cost and the value of showing a given ad impression, and we then incorporate these prediction models into systems which make informed bidding decisions on behalf of our clients to show their ads to their potential customers.

Full post »


Datacratic's Dataviz System


At Datacratic, working with data often means data visualization (or dataviz): making pretty pictures with data. This usually means fully machine-generated images rather than carefully laid-out "infographics" of the Information Is Beautiful school, but I find they end up looking pretty good. There are lots of good tools for graphing data, like matplotlib, R, or plain old Excel-clone spreadsheets, but what we use most often is Protovis, the Javascript library for generating SVG, coupled with CoffeeScript, a concise and expressive language that compiles down to Javascript.

Full post »


Statsd, Graphite and Nagios


At Datacratic we tend to worship, like Etsy (and AppNexus!), at the Church of Graphs. We've even started using Statsd, the system Etsy released to collect stats and relay them to Carbon for display in Graphite. And by display, I mean display on a dashboard visible to the entire dev team at the office, as seen above! Statsd is a very simple system: you send it UDP messages about the stats you want to track, it aggregates them and passes them along to Carbon, and Carbon stores them in Whisper, Graphite's back-end data store. That's a lot of moving parts, but it works very well. Sending stats to statsd is extremely easy from any language (we do it from Javascript and C++) and carries low overhead, which is key for the type of work we do.
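To show just how simple the protocol is, here is a minimal Python sketch of sending one stat; the metric names are made up, and statsd listens on UDP port 8125 by default:

```python
# One statsd update is a tiny plaintext UDP datagram: "name:value|type",
# where the type is e.g. "c" for a counter or "ms" for a timing.

import socket

def send_stat(metric, value, kind="c", host="localhost", port=8125):
    payload = f"{metric}:{value}|{kind}".encode("ascii")
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))

send_stat("bidder.bids", 1)                    # increment a counter
send_stat("bidder.win_latency_ms", 12, "ms")   # record a timing
```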

Full post »



© Nicolas Kruchten 2010-2017