<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>nicolas.kruchten.com</title>
	<atom:link href="http://nicolas.kruchten.com/content/feed/" rel="self" type="application/rss+xml" />
	<link>http://nicolas.kruchten.com/content</link>
	<description>fighting entropy in the 21st century</description>
	<lastBuildDate>Sat, 18 Feb 2012 17:07:04 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
		<item>
		<title>Peeking Into the Black Box, Part 1: Recoset’s RTB Algorithms</title>
		<link>http://nicolas.kruchten.com/content/2011/12/peeking-into-the-black-box-part-1/</link>
		<comments>http://nicolas.kruchten.com/content/2011/12/peeking-into-the-black-box-part-1/#comments</comments>
		<pubDate>Thu, 22 Dec 2011 22:17:43 +0000</pubDate>
		<dc:creator>Nicolas Kruchten</dc:creator>
				<category><![CDATA[Work]]></category>
		<category><![CDATA[algorithm]]></category>
		<category><![CDATA[black box]]></category>
		<category><![CDATA[economics]]></category>
		<category><![CDATA[pacing]]></category>
		<category><![CDATA[real-time bidding]]></category>
		<category><![CDATA[recoset]]></category>
		<category><![CDATA[rtb]]></category>
		<category><![CDATA[surplus]]></category>

		<guid isPermaLink="false">http://nicolas.kruchten.com/content/?p=532</guid>
		<description><![CDATA[This article is about the statistical and economic theory that underlies Recoset’s real-time bidding strategies, as a follow-up to my previous article on how we apply control theory to pace our spending, which I’m grandfathering in as “Part 0” of this short series. At Recoset, we develop real-time bidding algorithms. In order to accomplish advertiser [...]]]></description>
			<content:encoded><![CDATA[<p style="text-align: center;"><a href="http://nicolas.kruchten.com/content/wp-content/uploads/2011/12/Surplus.png" rel="lightbox[532]"><img class="aligncenter  wp-image-538" title="The Surplus Function" src="http://nicolas.kruchten.com/content/wp-content/uploads/2011/12/Surplus.png" alt="" width="430" height="295" /></a></p>
<p><em>This article is about the statistical and economic theory that underlies <a href="http://www.recoset.com/" target="_blank">Recoset</a>’s real-time bidding strategies, as a follow-up to my previous article on </em><em><a title="RTB Pacing: is everyone doing it wrong?" href="http://nicolas.kruchten.com/content/2011/09/rtb-pacing-is-everyone-doing-it-wrong/">how we apply control theory to pace our spending</a></em><em>, which I’m grandfathering in as “Part 0” of this short series.</em></p>
<p>At <a href="http://www.recoset.com/" target="_blank">Recoset</a>, we develop <a title="Real Time Bidding, Characterized" href="http://nicolas.kruchten.com/content/2011/09/real-time-bidding-characterized/">real-time bidding</a> algorithms. In order to accomplish advertiser goals, our algorithms automatically take advantage of other bidders’ sub-optimal behaviour, as well as navigate around publisher price floors. These are bold claims, and we want our partners to understand how our technology works and be comfortable with it.  No “trust the Black Box” value proposition for us.</p>
<p><span id="more-532"></span></p>
<p>In Part 1 of this series I’ll explain the basics of our algorithm, first how we determine how much to bid, and then how we determine when to bid. In <a title="Peeking Into the Black Box, Part 2: Algorithm Meets World" href="http://nicolas.kruchten.com/content/2011/12/peeking-into-the-black-box-part-2/">Part 2</a> I’ll show how this approach responds in some real-world situations.</p>
<h2>First, Tell the Truth</h2>
<p>Let’s say, for the sake of illustration, that you are Recoset’s bidder.  You’re configured for a 3-month direct-response campaign with a total budget of $100,000 and a target cost-per-click (CPC) of $1.00.  Like every other participant in real-time auctions, you subscribe to a stream of several thousand bid requests (also known as ad calls) per second, each representing an opportunity to immediately show an ad to a particular user on a particular site. You must respond to each bid-request which matches your targeting criteria with a bid, within a few dozen milliseconds. How do you decide what amount to bid?</p>
<p>Most real-time auctions are what are known as <a href="http://en.wikipedia.org/wiki/Vickrey_auction" target="_blank">Vickrey, or second-price, auctions</a>.  The winner is the one who places the highest bid, but pays the amount of the second-highest bid. This type of auction was likely chosen by the designers of real-time exchanges because the best strategy for bidders in such a situation is to “bid truthfully” (one of the well-known results of auction theory). This means that the bid you place should be equal to the amount that the good is worth to you.  In this case the “good” is the right to show an ad.  How much is that “worth”?</p>
<p>The target CPC is $1.00, so it’s not a bad assumption that a click is worth at least that much to the advertiser.  Let’s say that the average click-through rate (CTR) is 0.05% for this campaign: one out of every two thousand impressions gets clicked on.  So if you win this auction, you have a 0.05% chance of getting something that is worth a dollar, and multiplying those together we can say that the expected value of winning this auction is 0.05 cents.  If you’re bidding truthfully (which auction theory says you should), that’s how much you should bid.  One equation you could use to come up with a bid is therefore:</p>
<p>$$ bid = value = CPC_{target} * CTR $$</p>
<p>That’s a bit too simplistic, though, and doesn’t take advantage of the real-time nature of the exchange.  We’re a predictive analytics company, so as our bidder you have access to some fancy models.  These models predict, in real-time and on a bid-request by bid-request basis, the probability that this specific user will click on this specific ad in this specific context at this specific time. This means that you don’t have to use the campaign-wide average CTR when computing your bid, you can use the probability that our predictive model has generated for this request:</p>
<p>$$ bid = value = CPC_{target} * P(click) $$</p>
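<p>As a minimal sketch of the two equations above (the function name and the second probability are made up for illustration, and this is not our production code), the bid is just a product:</p>

```python
def truthful_bid(target_cpc: float, p_click: float) -> float:
    """Expected value, in dollars, of winning a single impression."""
    if not 0.0 <= p_click <= 1.0:
        raise ValueError("p_click must be a probability between 0 and 1")
    return target_cpc * p_click


# Campaign-wide average CTR of 0.05%, as in the example above.
average_bid = truthful_bid(1.00, 0.0005)

# Hypothetical per-request prediction for a user who looks much more likely to click.
likely_clicker_bid = truthful_bid(1.00, 0.004)

print(f"average request: ${average_bid:.4f}, promising request: ${likely_clicker_bid:.4f}")
```

<p>The only thing that changes between the two equations is where the probability comes from: a campaign-wide average in the first case, a per-request model prediction in the second.</p>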
<p>So far so good, but not particularly groundbreaking: where is the secret sauce, other than in the probability-of-click prediction model which is beyond the scope of this article? That predictive model is certainly part of it, but there’s actually an additional wrinkle to the above equation: it only tells you what the bid should be <em>if you actually decide to bid. </em>And when deciding whether or not to bid, you shouldn’t just look at the value of winning the auction, but the cost as well.</p>
<h2>To Bid or Not To Bid, or: Cost is not Value</h2>
<p>As I talked about in <a title="RTB Pacing: is everyone doing it wrong?" href="http://nicolas.kruchten.com/content/2011/09/rtb-pacing-is-everyone-doing-it-wrong/">Part 0</a>, if you just bid on every bid-request that comes your way, you’ll spend the budgeted $100,000 way ahead of the 3-month schedule. In that article I also detailed a system for spreading your spend out over time by using a closed-loop control system to vary the probability of a bid being submitted.  Let’s try to take that idea further: what if instead of randomly selecting which requests to bid on, you could find a way to bid on only the best requests? I just showed a way to model the expected value of winning an auction but in this context “best” doesn’t just mean “highest value”; that’s only half the picture.</p>
<p>As an example of this, let’s look at an importer purchasing one of two types of goods for resale. Good A has a 10% lower resale value than Good B, but it costs half as much:</p>
<div id="attachment_539" class="wp-caption aligncenter" style="width: 413px"><a href="http://nicolas.kruchten.com/content/wp-content/uploads/2011/12/value-cost-surplus.png" rel="lightbox[532]"><img class=" wp-image-539  " title="Value, Cost, Surplus" src="http://nicolas.kruchten.com/content/wp-content/uploads/2011/12/value-cost-surplus.png" alt="" width="403" height="333" /></a><p class="wp-caption-text">Two goods with similar values but different surpluses.</p></div>
<p>&nbsp;</p>
<p>In this case the less valuable Good A is “better” because what matters isn’t just the value or the cost but the difference between the two, or the <a href="http://en.wikipedia.org/wiki/Economic_surplus">surplus</a>.  In business this is called <a href="http://en.wikipedia.org/wiki/Profit_(accounting)">profit</a> and in game theory this is usually called the <a href="http://en.wikipedia.org/wiki/Normal_form_game">payoff</a>.</p>
<p>$$ surplus = profit = payoff = value - cost  $$</p>
<p>The equation above holds for the RTB context if you win the auction, but what if you don’t? In that case, you get no chance at a click, so the value is zero, and you also don’t pay anything, so the cost is zero (modulo fixed costs, which are out of the scope of this article).  So for our purposes, the expected surplus is the same as above, multiplied by the probability of you actually winning the auction:</p>
<p>$$ surplus = (value - cost) * P(win) $$</p>
<p>Now assuming that alongside Recoset’s click-probability predictor, you also have access to a clearing-price predictor (again beyond the scope of the current article) that tells you for each bid-request how likely you are to win the auction at each price, you can compute the expected surplus.  (As it happens, one can also maximize this equation as a function of the bid to prove the well-known “bid truthfully” result mentioned above: the bid which gives you the highest surplus, no matter the price distribution, is equal to the value.)</p>
<div id="attachment_537" class="wp-caption aligncenter" style="width: 430px"><a href="http://nicolas.kruchten.com/content/wp-content/uploads/2011/12/3d.png" rel="lightbox[532]"><img class=" wp-image-537 " title="Surplus Surface" src="http://nicolas.kruchten.com/content/wp-content/uploads/2011/12/3d-1024x807.png" alt="" width="420" height="330" /></a><p class="wp-caption-text">A surplus surface as a function of bid and value. For any given value, the maximum surplus occurs where bid=value.</p></div>
<p>&nbsp;</p>
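<p>The “bid truthfully” result can also be checked numerically. The sketch below is illustrative only: the log-normal price distribution, its parameters and the grid of candidate bids are all assumptions I picked for the example, not the output of our clearing-price predictor. In a second-price auction you win whenever the highest competing bid is below yours, and you pay that competing bid.</p>

```python
import random


def expected_surplus(bid, value, competing_prices):
    """Average of (value - price) over the auctions won; lost auctions count as zero."""
    total = sum(value - price for price in competing_prices if price < bid)
    return total / len(competing_prices)


random.seed(0)
# Hypothetical, skewed distribution of the highest competing bid, in dollars.
prices = [random.lognormvariate(-7.5, 1.0) for _ in range(10_000)]

value = 0.0005  # the expected value of the impression from the example above

# Try bids from 0% to 300% of the value, in steps of 10%.
multipliers = [k / 10 for k in range(31)]
best_multiplier = max(
    multipliers,
    key=lambda m: expected_surplus(value * m, value, prices),
)
print("best bid as a multiple of value:", best_multiplier)
```

<p>Bidding below the value gives up auctions that would have been profitable, and bidding above it wins auctions where the price exceeds the value, so on this grid the best multiple comes out as 1.0 whatever price sample you plug in.</p>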
<p>So now when our <a title="RTB Pacing: is everyone doing it wrong?" href="http://nicolas.kruchten.com/content/2011/09/rtb-pacing-is-everyone-doing-it-wrong/">closed-loop pacing control system</a> says that in order to meet the budget requirements, you should only bid on a quarter of the request stream, you can choose to bid only on the bid-requests which have expected surpluses in the top quartile. This means that the control value is no longer the probability of bidding, but some measure of selectivity in the auctions you are willing to participate in: the lower the control value, the more selective you should be. You’re still placing the optimal bid every time, but now you’re specifically targeting the auctions where you think you will make the most profit given the target CPC, thereby increasing your chances of actually achieving a much lower CPC.</p>
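<p>Here is one hedged way to picture “bid only on the top quartile”: keep a window of recently computed expected surpluses and turn the fraction requested by the pacer into a cut-off. The function names and the idea of a fixed window are assumptions made for this sketch, not a description of our actual pacer.</p>

```python
def surplus_threshold(recent_surpluses, bid_fraction):
    """Smallest expected surplus that still falls in the top `bid_fraction` of the window."""
    if not 0.0 < bid_fraction <= 1.0:
        raise ValueError("bid_fraction must be in (0, 1]")
    ranked = sorted(recent_surpluses, reverse=True)
    keep = max(1, round(len(ranked) * bid_fraction))
    return ranked[keep - 1]


def should_bid(expected_surplus, threshold):
    """Bid only when the request is at least as good as the cut-off."""
    return expected_surplus >= threshold


# Hypothetical expected surpluses for eight recent bid-requests (arbitrary units).
window = [3, 8, 1, 6, 2, 7, 5, 4]
threshold = surplus_threshold(window, bid_fraction=0.25)
selected = [s for s in window if should_bid(s, threshold)]
print(threshold, selected)
```

<p>With a quarter of the stream requested, only the two best requests out of eight clear the cut-off; lowering the fraction raises the cut-off, which is the “more selective” behaviour described above.</p>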
<h2>Algorithm Meets World</h2>
<p>This post explains the two major components of our system: bidding truthfully and pacing economically. In <a title="Peeking Into the Black Box, Part 2: Algorithm Meets World" href="http://nicolas.kruchten.com/content/2011/12/peeking-into-the-black-box-part-2/">Part 2 of this series</a>, I’ll show how this approach to RTB handles some real-world situations advantageously.</p>
]]></content:encoded>
			<wfw:commentRss>http://nicolas.kruchten.com/content/2011/12/peeking-into-the-black-box-part-1/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Peeking Into the Black Box, Part 2: Algorithm Meets World</title>
		<link>http://nicolas.kruchten.com/content/2011/12/peeking-into-the-black-box-part-2/</link>
		<comments>http://nicolas.kruchten.com/content/2011/12/peeking-into-the-black-box-part-2/#comments</comments>
		<pubDate>Thu, 22 Dec 2011 22:17:28 +0000</pubDate>
		<dc:creator>Nicolas Kruchten</dc:creator>
				<category><![CDATA[Work]]></category>
		<category><![CDATA[algorithm]]></category>
		<category><![CDATA[black box]]></category>
		<category><![CDATA[economics]]></category>
		<category><![CDATA[pacing]]></category>
		<category><![CDATA[price floors]]></category>
		<category><![CDATA[profit]]></category>
		<category><![CDATA[real-time bidding]]></category>
		<category><![CDATA[recoset]]></category>
		<category><![CDATA[rtb]]></category>
		<category><![CDATA[surplus]]></category>

		<guid isPermaLink="false">http://nicolas.kruchten.com/content/?p=544</guid>
		<description><![CDATA[In Part 1 of this series, I claimed that Recoset’s RTB algorithm is able to take advantage of other bidders’ sub-optimal behaviour and navigate around publisher price floors in order to achieve advertiser goals. I then described the algorithm, which applies what can be called a “bid truthfully, pace economically” approach. In this second part, [...]]]></description>
			<content:encoded><![CDATA[<p style="text-align: center;"><a href="http://nicolas.kruchten.com/content/wp-content/uploads/2011/12/Prezi006-e1324592429687.jpg" rel="lightbox[544]"><img class="aligncenter  wp-image-545" title="Black Box" src="http://nicolas.kruchten.com/content/wp-content/uploads/2011/12/Prezi006-e1324592429687.jpg" alt="" width="401" height="295" /></a></p>
<p><em>In </em><em><a title="Peeking Into the Black Box, Part 1: Recoset’s RTB Algorithms" href="http://nicolas.kruchten.com/content/2011/12/peeking-into-the-black-box-part-1/">Part 1 of this series</a></em><em>, I claimed that <a href="http://www.recoset.com/" target="_blank">Recoset</a>’s RTB algorithm is able to take advantage of other bidders’ sub-optimal behaviour and navigate around publisher price floors in order to achieve advertiser goals. I then described the algorithm, which applies what can be called a “bid truthfully, pace economically” approach. In this second part, I show how this algorithm can in fact live up to these claims.</em></p>
<p>When you bid truthfully and pace economically, you are always trying to allocate your budget to the auctions which look like the best deals, whether that means that the user is very likely to click, or that the price is low because fewer bidders are in the running or there is no publisher price floor.</p>
<p><span id="more-544"></span></p>
<h2>How to Take Advantage of Other Bidders’ Behaviour</h2>
<p>In <a title="RTB Pacing: is everyone doing it wrong?" href="http://nicolas.kruchten.com/content/2011/09/rtb-pacing-is-everyone-doing-it-wrong/">Part 0 of this series</a>, I showed that the average win-price for <a title="Real Time Bidding, Characterized" href="http://nicolas.kruchten.com/content/2011/09/real-time-bidding-characterized/">RTB</a> auctions jumps up at midnight Eastern Time and described a pacing system which, if used by everyone, would prevent this jump, by spending at a constant rate. This system works by frequently updating a control value which is interpreted as the probability of bidding on any given auction. There is an implicit assumption underlying this system though: that any two randomly-selected bids are equivalent, regardless of, for example, the time of day. So while I introduced this system by talking about variability in the clearing price over the hours of the day, the simple pacer actually ignores this fact. This did not escape all readers of my post, some of whom commented in private that if we know the price is much higher at certain times, then we just shouldn’t bid at those times!</p>
<p>Now that I’ve described the way we compute expected surplus on a bid by bid basis, I can explain a principled way for a pacer to simultaneously handle variability in clearing prices and impression quality that doesn’t involve hardcoding hours when not to bid. All that is needed is for the pacing system to update the control value much less frequently, say once a day, and to treat the control value as a “minimum acceptable expected surplus” threshold. Consider the following graph comparing the evolution of average clearing-price and average expected surplus over a few days, noting how at midnight the price jumps and the surplus drops:</p>
<div id="attachment_553" class="wp-caption aligncenter" style="width: 610px"><a href="http://nicolas.kruchten.com/content/wp-content/uploads/2011/12/winprice_vs_surplus.png" rel="lightbox[544]"><img class="size-large wp-image-553" title="Clearing Price vs Surplus" src="http://nicolas.kruchten.com/content/wp-content/uploads/2011/12/winprice_vs_surplus-1024x445.png" alt="" width="600" height="260" /></a><p class="wp-caption-text">The evolution of clearing price and surplus over time.</p></div>
<p>A well-tuned, infrequently-updating pacing system will essentially draw a horizontal line across this graph, and our bidder will bid whenever the surplus is higher than this threshold. The key insight is that the curves here represent the means of some fairly wide and skewed distributions. Even if the mean price goes up and therefore the expected surplus drops, that doesn’t mean there aren’t any “good deals” to be had: it just means there are fewer of them. By setting a hard, slowly-changing threshold, our bidder can continue bidding throughout the day, but it will probably spend less in the hour right after midnight than in the hour right before midnight.</p>
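<p>To make the “horizontal line” concrete, here is a toy sketch with invented numbers: two hours of hypothetical expected surpluses, where the hour after midnight has a lower mean but still contains a few good deals. The data and the function name are assumptions for illustration only.</p>

```python
def bids_per_hour(surpluses_by_hour, threshold):
    """Count, for each hour, the requests whose expected surplus clears the threshold."""
    return {
        hour: sum(1 for s in surpluses if s > threshold)
        for hour, surpluses in surpluses_by_hour.items()
    }


# Hypothetical expected surpluses (arbitrary units) for the hours around midnight.
sample = {
    "23:00": [1, 2, 4, 5, 6, 7, 9, 12],
    "00:00": [0, 1, 1, 2, 3, 3, 6, 10],
}

counts = bids_per_hour(sample, threshold=5)
print(counts)
```

<p>The same fixed threshold lets through four requests in the hour before midnight and two in the hour after: the bidder backs off when surplus is scarce without ever being told which hours to avoid.</p>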
<p>We like to say that this is taking advantage of other bidders’ sub-optimal behaviour because some bidders are irrationally piling into the market right after midnight and driving the price up. Our algorithm continues to “skim the cream”, bidding in the best-looking auctions, but does back off a bit, holding its budget in reserve for such times as other bidders have run out of money for the day. As that happens, the drop in demand causes a drop in clearing price and a concurrent rise in expected surplus, causing our bidder to spend more.</p>
<h2>Navigating Around Publisher Price Floors</h2>
<p>Having covered how our algorithm responds to the actions of other bidders, how about publishers? How does this system respond to publisher <a href="http://en.wikipedia.org/wiki/Price_floor" target="_blank">price floors</a>? A price floor, or reserve price, is basically a statement by a publisher that they will not sell below a certain price. It’s almost as if a publisher is bidding on its own inventory. If there are no higher bids than the price floor, there is no transaction. If there is more than one bid above the price floor then the price floor does nothing and the winner pays the second price. If there is only one higher bid, however, that bidder wins and pays the reserve price. In this third case, the publisher gets some of the surplus from the auction that the winner would have gotten had there been no floor.</p>
<table style="width: 600px; margin: 0 auto;">
<tbody>
<tr>
<th style="text-align: center;">First Price</th>
<th style="text-align: center;">Second Price</th>
<th style="text-align: center;">Reserve Price</th>
<th style="text-align: center;">Clearing Price</th>
</tr>
<tr>
<td style="text-align: center;">5</td>
<td style="text-align: center;">3</td>
<td style="text-align: center;">(n/a)</td>
<td style="text-align: center;">3</td>
</tr>
<tr>
<td style="text-align: center;">5</td>
<td style="text-align: center;">3</td>
<td style="text-align: center;">2</td>
<td style="text-align: center;">3</td>
</tr>
<tr>
<td style="text-align: center;">5</td>
<td style="text-align: center;">3</td>
<td style="text-align: center;">4</td>
<td style="text-align: center;">4</td>
</tr>
<tr>
<td style="text-align: center;">5</td>
<td style="text-align: center;">3</td>
<td style="text-align: center;">6</td>
<td style="text-align: center;">(no transaction)</td>
</tr>
</tbody>
</table>
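<p>The rules in the table above can be written down in a few lines. This is a sketch of the logic as I’ve described it, not any exchange’s actual implementation; in particular, how a bid exactly equal to the reserve is treated is an assumption here (I let it clear).</p>

```python
from typing import Optional


def clearing_price(first: float, second: float,
                   reserve: Optional[float] = None) -> Optional[float]:
    """Price paid by the highest bidder, or None when the reserve blocks the sale."""
    if reserve is None:
        return second  # plain second-price auction
    if reserve > first:
        return None  # no bid reaches the floor: no transaction
    # The floor only matters when it sits between the two highest bids.
    return max(second, reserve)


# The four rows of the table: (first price, second price, reserve price).
rows = [(5, 3, None), (5, 3, 2), (5, 3, 4), (5, 3, 6)]
for first, second, reserve in rows:
    print(first, second, reserve, clearing_price(first, second, reserve))
```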
<p>This all makes perfect sense at the micro level, as by definition if we win the auction, we were willing to pay more than we did: our bid was how much it was worth to us, and we paid the second price so the difference between the two is profit for us. Any publisher would rather that we had just paid them the entire bid amount so they could get that profit, and therefore they have an incentive to raise the price floor all the way to our bid amount to “capture back” as much of that surplus as they can.</p>
<p>Our algorithm responds to a publisher price floor the same way it responds to the actions of other bidders causing the price to rise. At any given time, the pacer’s control value tells our bidder the minimum expected surplus required for it to bid. If a publisher raises their price floor (and if our price model captures this), the expected surplus of auctions for their impressions will drop. If it drops below the control value, we will simply not bid on those auctions at all, rather than accepting the higher price. This is what we call navigating around publisher price floors.</p>
<p>The control value is set so that the budget will be met, and is based on the availability of surplus on the whole exchange, not just on any one publisher’s sites. In effect, this puts an upper bound on the price floor any publisher can set before losing our business, so long as we can get a higher expected surplus elsewhere. This upper bound will be lower than the value to us of winning the auction, but could still be higher than the next-highest bid, so we do give up some of the surplus to the publisher. This is pretty much the way the exchange should work: it’s a market mechanism which uses competition to divide up the surplus pie.</p>
<div id="attachment_546" class="wp-caption aligncenter" style="width: 453px"><a href="http://nicolas.kruchten.com/content/wp-content/uploads/2011/12/price-floor.png" rel="lightbox[544]"><img class=" wp-image-546   " title="Price Floors" src="http://nicolas.kruchten.com/content/wp-content/uploads/2011/12/price-floor.png" alt="" width="443" height="366" /></a><p class="wp-caption-text">The control value places an upper bound on the price floor that is lower than the value.</p></div>
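<p>One way to write that upper bound down, as a sketch under simplifying assumptions rather than a statement of how our system computes it: suppose the floor sits above the next-highest bid, so that the floor is the price we pay when we win, and treat the probability of winning as given and greater than zero. We keep bidding as long as the expected surplus stays at or above the control value:</p>
<p>$$ (value - floor) * P(win) \geq control $$</p>
<p>Rearranging for the floor gives the highest floor we would tolerate on that auction:</p>
<p>$$ floor \leq value - \frac{control}{P(win)} $$</p>
<p>Since the control value is positive whenever we are being at all selective, this bound is strictly below the value, which is what the figure shows.</p>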
<p>It’s worth mentioning that at the macro level, publishers don’t necessarily sell all their inventory on a spot basis on the RTB exchange: they try to sell inventory at a much higher price on a <a href="http://en.wikipedia.org/wiki/Forward_contract" target="_blank">forward contract</a> basis ahead of time. In industry jargon, inventory sold on a forward contract basis is called “premium” and everything else is called “remnant”. Without getting too deeply into this, publishers have an incentive to put up very high floor prices so that advertisers don’t stop buying expensive premium inventory because they think they can buy the same impressions more cheaply on the spot market (i.e. as remnant inventory). The industry lingo for this effect is “cross-channel conflict”, because the remnant &#8220;channel&#8221; is perceived to be hurting the premium &#8220;channel&#8221;. If a publisher is using this type of rationale to set a price floor, they will likely set it higher than the amount of our bid, rather than trying to squeeze it in between the first and second highest bids. This would translate into an expected surplus of zero, as there would be no chance of us winning the auction. Again, our algorithm just doesn’t bid on this type of auction; it certainly doesn’t raise the bid value to win an unprofitable impression. <a href="http://www.mikeonads.com/2010/09/25/price-floors-second-price-auctions-and-market-dynamics/" target="_blank">Mike on Ads has a good post</a> about why it’s a bad idea for publishers to do this, based on some hypothetical behaviour of bidders. Our algorithm can automatically implement the types of behaviour he describes.</p>
]]></content:encoded>
			<wfw:commentRss>http://nicolas.kruchten.com/content/2011/12/peeking-into-the-black-box-part-2/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>RTB Pacing: is everyone doing it wrong?</title>
		<link>http://nicolas.kruchten.com/content/2011/09/rtb-pacing-is-everyone-doing-it-wrong/</link>
		<comments>http://nicolas.kruchten.com/content/2011/09/rtb-pacing-is-everyone-doing-it-wrong/#comments</comments>
		<pubDate>Thu, 08 Sep 2011 20:35:12 +0000</pubDate>
		<dc:creator>Nicolas Kruchten</dc:creator>
				<category><![CDATA[Work]]></category>
		<category><![CDATA[arbitrage]]></category>
		<category><![CDATA[budget]]></category>
		<category><![CDATA[campaign]]></category>
		<category><![CDATA[control]]></category>
		<category><![CDATA[pacing]]></category>
		<category><![CDATA[real-time bidding]]></category>
		<category><![CDATA[recoset]]></category>
		<category><![CDATA[rtb]]></category>

		<guid isPermaLink="false">http://nicolas.kruchten.com/content/?p=413</guid>
		<description><![CDATA[&#160; I&#8217;ve written some follow-up posts to this one, as a series called Peeking into the Black Box, so I&#8217;m grandfathering this post in as &#8220;Part 0&#8221; of the series. I read an interesting post on the AppNexus tech blog about their campaign monitoring tools and the screenshots there almost exclusively contained various pacing measurements. [...]]]></description>
			<content:encoded><![CDATA[<p>&nbsp;</p>
<p style="text-align: center;"><a href="http://nicolas.kruchten.com/content/wp-content/uploads/2011/08/pace.png" rel="lightbox[413]"><img class="size-full wp-image-423 aligncenter" title="Pace" src="http://nicolas.kruchten.com/content/wp-content/uploads/2011/08/pace.png" alt="Spend Rate vs Bid Probability" width="426" height="257" /></a></p>
<p><em>I&#8217;ve written some follow-up posts to this one, as a series called <a title="Peeking Into the Black Box, Part 1: Recoset’s RTB Algorithms" href="http://nicolas.kruchten.com/content/2011/12/peeking-into-the-black-box-part-1/">Peeking into the Black Box</a>, so I&#8217;m grandfathering this post in as &#8220;Part 0&#8221; of the series.</em></p>
<p>I read an <a href="http://techblog.appnexus.com/2011/behind-the-screens-the-making-of-campaign-monitor/#comment-78" target="_blank">interesting post on the AppNexus tech blog</a> about their campaign monitoring tools and the screenshots there almost exclusively contained various pacing measurements. Some of the graphs there looked a lot like the ones I had sketched up while trying to solve the pacing problem for our real-time bidding (RTB) client.</p>
<p>Here are the basics of the problem: if someone gives you a fixed amount of money to run a display advertising campaign over a specific time period, it&#8217;s generally advisable to spend exactly that amount of money, spread out reasonably evenly over that time-period. Over-spending could mean you&#8217;re on the hook for the difference, and under-spending doesn&#8217;t look great if you want repeat business. And if you don&#8217;t spend it evenly, you&#8217;ll get some pissed-off customers, like <a href="http://blog.fotobookapp.com/59958087">this guy who had his $50 budget blown in minutes</a>. Sounds obvious, right? Apparently it&#8217;s harder than it looks!<span id="more-413"></span></p>
<h2>Doin&#8217; it wrong</h2>
<p>When Recoset first dipped its toe into RTB (real-time bidding, characterized in <a title="Real Time Bidding, Characterized" href="http://nicolas.kruchten.com/content/2011/09/real-time-bidding-characterized/">this sister post</a>), the main tool at our disposal to control budget/pacing was a daily spending cap. It took about 48 hours for us to figure out that that wasn&#8217;t going to work for us: depending on the types of filters we used to decide when to bid at all, we&#8217;d either not manage to spend the daily budget (more on that at the bottom) or blow the budget very quickly, usually the latter. In practice this meant that we would start buying impressions at midnight (all times in this post are in Montreal local time: EST or EDT) and generally hit our cap mid-morning.</p>
<div id="attachment_431" class="wp-caption aligncenter" style="width: 512px"><a href="http://nicolas.kruchten.com/content/wp-content/uploads/2011/08/dailycap.png" rel="lightbox[413]"><img class="size-full wp-image-431" title="Daily Cap" src="http://nicolas.kruchten.com/content/wp-content/uploads/2011/08/dailycap.png" alt="Daily Cap" width="502" height="345" /></a><p class="wp-caption-text">No bidding occurs after daily cap is reached</p></div>
<p>&nbsp;</p>
<p>On a daily spend graph over a 30-60 day period resulting from such a policy, this would probably look just fine: you&#8217;re spending the budget in the time given, and spending exactly the right amount per day. But having this sort of &#8216;dead time&#8217; where we don&#8217;t bid at all bothered us, so as soon as we spun up our own bidder we added options like &#8220;bid on X% of the request stream&#8221; and when we looked at the resulting data we noticed something really interesting: the impressions we were buying before were actually among the most expensive of the day!</p>
<p>The average win-price graph below (i.e. what the next-highest bid was to our winning one) from a high-bidding campaign shows a pretty clear pattern: if you start at noon, the dollar cost for a thousand impressions (the CPM) generally drops until precisely midnight, then jumps something like 40%. It fluctuates a bit, hits a high in the morning and then drops down again until noon to start the cycle again. This pattern repeats day after day:</p>
<div id="attachment_452" class="wp-caption aligncenter" style="width: 501px"><a href="http://nicolas.kruchten.com/content/wp-content/uploads/2011/08/prices.png" rel="lightbox[413]"><img class="size-full wp-image-452  " title="Win Price" src="http://nicolas.kruchten.com/content/wp-content/uploads/2011/08/prices.png" alt="Win Price" width="491" height="354" /></a><p class="wp-caption-text">Notice the discontinuous jumps every midnight</p></div>
<p>&nbsp;</p>
<p>So what&#8217;s with that midnight jump? This is supposed to be an efficient marketplace, but are impressions a minute past midnight really worth 40% more than impressions 2 minutes earlier? Had we not had our initial experiences described above it might have taken us a long time to puzzle this out! Our current working theory is in fact that there must be lots of bidders out there that behave just like our first one did: they have a fixed daily budget and they reset their counters at midnight. This would mean that starting at midnight, there&#8217;s a whole lot more demand for impressions, and thanks to supply and demand, the price spikes way up (alternatively, odds are that at least one of the many bidders coming back online is set up to bid higher than the highest-bidder in the smaller pool of 11:59PM bidders). And as bidders run out of budget throughout the day, at different times, depending on their budget and how quickly they spend it, the price drops due to reduced demand. It&#8217;s not obvious from the above graph but when you plot the averages over multiple days together there is a similar step at 3AM, when it&#8217;s midnight on the West Coast, where a lot of other developers live and work and code:</p>
<div id="attachment_443" class="wp-caption aligncenter" style="width: 490px"><a href="http://nicolas.kruchten.com/content/wp-content/uploads/2011/08/prices2.png" rel="lightbox[413]"><img class="size-large wp-image-443 " title="Average Price Per Hour" src="http://nicolas.kruchten.com/content/wp-content/uploads/2011/08/prices2-1024x701.png" alt="Average Price Per Hour" width="480" height="328" /></a><p class="wp-caption-text">Average Price Per Minute of the Day For the Month of August</p></div>
<p>&nbsp;</p>
<p>Now there might be some other explanations for this price jump. We&#8217;re in the Northeast and that impacts the composition of our bid-request stream, so maybe this is just a regional pattern (do other people see this also?) although the 3AM jump suggests otherwise. Midnight also happens to be when our bid-request volume starts its nightly dip, but that dip is gradual and isn&#8217;t enough to alone explain that huge spike. Maybe some major sites have some sort of reserve price that kicks in at midnight. Or maybe everyone else is way smarter than us and noticed that click-through rates are much higher a minute past midnight than they are a minute before. I have trouble believing those explanations, though, so if anyone reading this has any thoughts on this I&#8217;d love to hear about them by email or in the comments below. Whatever the reason, it looks like <em>if you&#8217;re going to only be bidding for a few hours a day, don&#8217;t choose the time-period starting at midnight and ending at 10AM!</em></p>
<h2>Closed-Loop Control to the Rescue</h2>
<p>So let&#8217;s assume that this post gets a ton of coverage among the right circles and that this before/after-midnight arbitrage opportunity eventually disappears, or maybe you just want to actually spread your spending evenly throughout the day, how would you do it? As I mentioned above, we have the ability to specify a percentage of the bid-volume to bid on. The obvious way to use this capability is to see how far into the day (percentage-wise) we get with our daily budget and bid only on that percentage of the bid-request stream, thereby ensuring that we run out of budget just as midnight rolls around. In practice, though, bidding on a fixed percentage of all requests can and does result in a different amount spent each day, sometimes higher than the daily budget (if you remove the daily cap) and sometimes lower. Maybe some days you get a different volume of bid-requests, because your upstream provider is making changes. Maybe there&#8217;s an outage. Maybe the composition of your bid-request stream is changing such that all of a sudden the percentage of all bid-requests that&#8217;s in the geo-area you&#8217;re targeting goes way up (or down). Maybe your bidding logic doesn&#8217;t result in a consistent spend. All of these factors could amount to a pretty jagged spend profile:</p>
<div id="attachment_432" class="wp-caption aligncenter" style="width: 512px"><a href="http://nicolas.kruchten.com/content/wp-content/uploads/2011/08/openloop.png" rel="lightbox[413]"><img class="size-full wp-image-432" title="Open Loop Control" src="http://nicolas.kruchten.com/content/wp-content/uploads/2011/08/openloop.png" alt="Open Loop Control" width="502" height="345" /></a><p class="wp-caption-text">Spend rate can change despite constant bid probability</p></div>
<p>&nbsp;</p>
<p>So what&#8217;s the solution? For us it&#8217;s to look at this problem as a control problem and to move from open-loop control to <a href="http://en.wikipedia.org/wiki/Control_theory">closed-loop control</a>. If you theorize about the appropriate bid probability, set it up and let things run for a week and look at the results later, that&#8217;s basically open-loop control. If every morning you look at what the spend was the day before and adjust the coming day&#8217;s bid probability accordingly, you&#8217;re doing a form of closed-loop control: you&#8217;re feeding back information from the output into the input. It&#8217;s pretty easy to automate (at least we thought so), and once that&#8217;s done, there&#8217;s basically no reason not to shorten the feedback loop from daily to every few minutes.</p>
<p>Our RTB client was designed from the ground up to facilitate this sort of thing so we were able to control spend to within 1-2% of target with the following simple control loop: every 2 minutes, set the bid probability for the subsequent 2 minutes to that which would have resulted in the &#8216;correct spend&#8217; over the preceding 2 minutes (maybe with a bit of damping). So if for whatever reason we get 20% fewer bid requests, for example at night, the bid probability will rise so that the spend rate stays roughly constant. There are more sophisticated ways to do closed-loop control, like the <a href="http://en.wikipedia.org/wiki/PID_controller">PID controller</a> which I <a title="Rancilio Silvia iPhone Remote Control" href="http://nicolas.kruchten.com/content/2011/05/rancilio-silvia-iphone-remote-control/" target="_blank">hacked into my espresso machine</a>, but this really basic system works pretty well:</p>
<div id="attachment_467" class="wp-caption aligncenter" style="width: 550px"><a href="http://nicolas.kruchten.com/content/wp-content/uploads/2011/09/beforeafter.png" rel="lightbox[413]"><img class="size-large wp-image-467 " title="Before and After" src="http://nicolas.kruchten.com/content/wp-content/uploads/2011/09/beforeafter-1024x727.png" alt="Before and After" width="540" height="383" /></a><p class="wp-caption-text">This is actual data from around the time we activated our simple controller. Notice how Open Loop Control gives output that looks like the mountains of Mordor whereas Closed Loop Control makes it look like the fields of the Shire</p></div>
<p>&nbsp;</p>
<div id="attachment_423" class="wp-caption aligncenter" style="width: 527px"><a href="http://nicolas.kruchten.com/content/wp-content/uploads/2011/08/pace.png" rel="lightbox[413]"><img class="size-full wp-image-423  " title="Pace" src="http://nicolas.kruchten.com/content/wp-content/uploads/2011/08/pace.png" alt="Spend Rate vs Bid Probability" width="517" height="312" /></a><p class="wp-caption-text">Actual production data, smoothed a little bit to reveal underlying trend. Pace (blue, prob*1000) rises at night when volume drops; Spend Rate (green) stays close to target (red)</p></div>
<p>&nbsp;</p>
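<p>The simple control loop described above can be sketched in a few lines of Python. This is a minimal illustration rather than our production code: the function name, the <code>damping</code> parameter and the choice to hold the probability steady when nothing was spent are all assumptions made for the example.</p>

```python
def next_bid_probability(prob, spent, target_spend, damping=0.5):
    """Return the bid probability to use for the next control interval.

    prob         -- bid probability used over the interval that just ended
    spent        -- amount actually spent over that interval
    target_spend -- amount we wanted to spend over that interval
    damping      -- 1.0 jumps straight to the 'ideal' value, smaller values
                    move only part of the way there (hypothetical knob)
    """
    if spent <= 0:
        # Assumption: with no spend there is no signal to scale from,
        # so hold the current probability instead of guessing.
        return prob
    # The probability that would have produced the correct spend,
    # assuming spend scales linearly with bid probability.
    ideal = prob * target_spend / spent
    new_prob = prob + damping * (ideal - prob)
    # A probability has to stay between 0 and 1.
    return max(0.0, min(1.0, new_prob))
```

<p>For example, if volume drops by 20% so that a probability of 0.10 only spends 80 against a target of 100, calling <code>next_bid_probability(0.10, 80.0, 100.0, damping=1.0)</code> raises the probability to about 0.125 for the next interval.</p>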
<p>Like any good optimizer, the scheme described above does exactly what you ask it to: it tries to keep the error between &#8216;instantaneous&#8217; spend rate and the target as close to zero as possible. If there&#8217;s a full-on outage where no bidding occurs for an hour, this system will not try to make up the spend, it will just try to get the spend <em>rate</em> back to where it should be according to the originally-specified target, once bidding resumes:</p>
<div id="attachment_430" class="wp-caption aligncenter" style="width: 512px"><a href="http://nicolas.kruchten.com/content/wp-content/uploads/2011/08/closedloop.png" rel="lightbox[413]"><img class="size-full wp-image-430" title="Closed Loop Control" src="http://nicolas.kruchten.com/content/wp-content/uploads/2011/08/closedloop.png" alt="Closed Loop Control" width="502" height="345" /></a><p class="wp-caption-text">Spend rate after outage is too low to catch up (slope remains parallel to initial target)</p></div>
<p>&nbsp;</p>
<p>In order to deal with outages, as well as over- or under-spending due to accumulated error or drift from the instantaneous control system&#8217;s imperfect responses, we have to dynamically adjust this spend-rate target as well. The way we&#8217;ve come up with to deal with this is to define the error function in terms of the original problem: continuously work to keep the spend rate at a level that will spend the remaining budget in the remaining time. This means that if there is an outage for an hour or a day, the spend rate for the rest of the campaign will be marginally higher to make up for it, without human intervention. This system also has the nice property of automatically stopping right on the dot when the remaining budget is zero:</p>
<div id="attachment_429" class="wp-caption aligncenter" style="width: 512px"><a href="http://nicolas.kruchten.com/content/wp-content/uploads/2011/08/adaptive.png" rel="lightbox[413]"><img class="size-full wp-image-429" title="Closed Loop Control w Adaptive Target" src="http://nicolas.kruchten.com/content/wp-content/uploads/2011/08/adaptive.png" alt="Closed Loop Control w Adaptive Target" width="502" height="345" /></a><p class="wp-caption-text">Spend rate after outage increases to catch up (slope gets steeper as outage gets longer)</p></div>
<p>&nbsp;</p>
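<p>The adaptive target boils down to one division: remaining budget over remaining time. Here is a hedged sketch of that idea, with hypothetical names and arbitrary units (any consistent currency and time unit will do):</p>

```python
def adaptive_target_rate(budget, spent_so_far, now, end_time):
    """Spend rate that would use up the remaining budget in the remaining time."""
    remaining_budget = budget - spent_so_far
    remaining_time = end_time - now
    if remaining_budget <= 0 or remaining_time <= 0:
        # Out of budget or out of time: stop spending.
        return 0.0
    return remaining_budget / remaining_time
```

<p>With a budget of 2400 over 24 hours, the target starts at 100 per hour. If a one-hour outage means only 1100 has been spent by hour 12, <code>adaptive_target_rate(2400, 1100, 12, 24)</code> gives roughly 108.3 per hour for the rest of the day, and the target falls to zero once the budget is gone.</p>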
<p>Now obviously you might not want to always have exactly the same spend rate all the time, in fact I almost guarantee that you don&#8217;t. A closed-loop control system can support pretty much whatever spending profile you like. You can also use one to manipulate other variables than the bid-probability (think: the actual bid price, or the targeting criteria, or whatever other scheme your quant brain can conjure up).</p>
<p>Finally, the astute reader will notice that the approach described above is only really useful in situations where there is more supply than your budget permits you to buy, and so you need to allocate your budget over time. The approach above won&#8217;t help at all if the optimal bid probability to meet your needs is up above 100%, that is to say that even when you bid on and win everything that matches your targeting criteria, you&#8217;re not able to spend your budget. That&#8217;s probably a topic for a different blog post.</p>
<p><strong>Follow-up</strong>: I also posted this question on <a href="http://www.quora.com/Real-Time-Bidding/Why-does-the-average-RTB-win-price-jump-up-significantly-at-midnight-EST-and-do-others-see-this-jump-at-midnight-in-their-own-timezone" target="_blank">Quora, where I got a few responses</a>.</p>
<p><strong>Follow-up posts</strong>: This post is now Part 0 of a series I&#8217;m calling Peeking into the Black Box. Check out <a title="Peeking Into the Black Box, Part 1: Recoset’s RTB Algorithms" href="http://nicolas.kruchten.com/content/2011/12/peeking-into-the-black-box-part-1/">Part 1</a> and <a title="Peeking Into the Black Box, Part 2: Algorithm Meets World" href="http://nicolas.kruchten.com/content/2011/12/peeking-into-the-black-box-part-2/">Part 2</a> of the series, where I show how to really take advantage of other bidders&#8217; behaviour.</p>
]]></content:encoded>
			<wfw:commentRss>http://nicolas.kruchten.com/content/2011/09/rtb-pacing-is-everyone-doing-it-wrong/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>Real Time Bidding, Characterized</title>
		<link>http://nicolas.kruchten.com/content/2011/09/real-time-bidding-characterized/</link>
		<comments>http://nicolas.kruchten.com/content/2011/09/real-time-bidding-characterized/#comments</comments>
		<pubDate>Thu, 08 Sep 2011 20:34:38 +0000</pubDate>
		<dc:creator>Nicolas Kruchten</dc:creator>
				<category><![CDATA[Work]]></category>
		<category><![CDATA[auction]]></category>
		<category><![CDATA[real-time bidding]]></category>
		<category><![CDATA[recoset]]></category>
		<category><![CDATA[rtb]]></category>

		<guid isPermaLink="false">http://nicolas.kruchten.com/content/?p=455</guid>
		<description><![CDATA[There doesn&#8217;t appear to be a good Wikipedia entry for RTB for me to link at the moment, when I want to blog about it so I&#8217;ll draft my own explanation here. (Edit: there is an entry now, but I like my characterization better!) Keep in mind while reading this that I&#8217;m looking at RTB [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://nicolas.kruchten.com/content/wp-content/uploads/2011/09/auction-gavil.jpg" rel="lightbox[455]"><img class="aligncenter size-medium wp-image-460" title="Auction" src="http://nicolas.kruchten.com/content/wp-content/uploads/2011/09/auction-gavil-300x240.jpg" alt="Auction" width="300" height="240" /></a></p>
<p>There doesn&#8217;t appear to be a good Wikipedia entry for RTB for me to link at the moment, when I want to blog about it so I&#8217;ll draft my own explanation here. (Edit: <a href="http://en.wikipedia.org/wiki/Real_Time_Bidding" target="_blank">there is an entry now</a>, but I like my characterization better!) Keep in mind while reading this that I&#8217;m looking at RTB as a software engineer with an interest in economics, rather than as an ad industry veteran!</p>
<h2>What is RTB?</h2>
<p>Real-Time Bidding (RTB) is a way for advertisers to pay publishers to show ad impressions on their websites. In this context, an &#8216;advertiser&#8217; can be anyone with a product to sell: a car company, a small e-commerce website etc. A &#8216;publisher&#8217; can be anyone with a website: a newspaper, a blogger, Facebook, Google etc. An &#8216;impression&#8217; is a single instance of a banner ad being shown on a web page. The term RTB is usually not applied to other ad formats such as text ads, or ads alongside Google search results or Facebook news feeds, as different technology is used to buy and sell those other formats.</p>
<p>The word &#8216;bidding&#8217; comes from the fact that the price of impressions is set via an auction mechanism, where the role of the auctioneer is played by an &#8216;ad exchange&#8217; which mediates between advertisers and publishers. The term &#8216;real-time&#8217; stems from the fact that advertisers bidding in the auction must specify the price they are willing to pay separately for each impression, and within a strict time-constraint: usually around one tenth of a second, so the auction can run as the web page on which the impression will be shown is loading.<span id="more-455"></span></p>
<h2>How Does RTB Work on a Technical Level?</h2>
<p>In terms of the technical aspects of participating in the auctions as a bidder, a computer subscribes to a stream of many thousands of bid-requests per second from the exchange, each one representing an impression to be shown very shortly. For each one, within the hard-real-time constraints, the computer must respond by specifying if/how much it wants to bid, in units of millionths of a dollar (μ$). The cost of online advertising is usually discussed in terms of cost-per-mille (CPM) in dollars, so $1 CPM implies 1 million microdollars for 1 thousand impressions, or μ$1000 per impression. The bidding computer can make its decisions based on the various attributes contained within the bid-request: the URL, the position of the ad spot in the page, the user&#8217;s location and local time, the number of times this user has previously been shown a given ad, any additional data the bidder has on file or has purchased on that particular user, etc.</p>
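<p>The unit conversion above is easy to get wrong by a factor of a thousand, so here it is spelled out as a small Python sketch. The function and constant names are made up for this example.</p>

```python
MICROS_PER_DOLLAR = 1_000_000   # 1 dollar = 1 million microdollars
IMPRESSIONS_PER_MILLE = 1_000   # CPM prices are quoted per thousand impressions


def cpm_to_micros_per_impression(cpm_dollars):
    """Convert a CPM price in dollars to microdollars per single impression."""
    return round(cpm_dollars * MICROS_PER_DOLLAR / IMPRESSIONS_PER_MILLE)


def micros_per_impression_to_cpm(micros):
    """Convert a per-impression price in microdollars back to CPM in dollars."""
    return micros * IMPRESSIONS_PER_MILLE / MICROS_PER_DOLLAR
```

<p>So <code>cpm_to_micros_per_impression(1.0)</code> gives 1000, matching the $1 CPM example, and a bid of μ$2500 on a single impression corresponds to a $2.50 CPM.</p>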
<h2>What do RTB Auctions Look Like?</h2>
<p>From my reading on auction theory, here is how I would characterize what goes on in most RTB auctions:</p>
<ul>
<li>There exists a sequence of auctions to sell impressions, and the participants (bidders) are not always the same from auction to auction.</li>
<li>The auctions are closed (aka sealed-bid) and single-round, so participants do not see each other&#8217;s bids. All bidders submit their bids once and the winner is determined.</li>
<li>The auctions are second-price, so the winner is the person who bids the highest, but pays whatever the amount of the next-highest bid was. Incidentally, only the winner knows what the winning bid was, and they learn what the next-highest price was, also, because that&#8217;s what they paid.</li>
<li>The auctions are single-unit, in that participants are bidding on single ad impressions at a time.</li>
<li>The goods being auctioned are perishable: you use them immediately when you buy them and then they&#8217;re gone. They have no resale value.</li>
<li>The goods being auctioned are slightly different from auction to auction: the history of the user who will be seeing the eventual ad the winner will show matters to each bidder, as does for example the time of day etc.</li>
<li>The valuations are uncertain, in that each bidder only has an estimate of how much it is worth to them to win this auction, and usually not even a very good estimate.</li>
<li>The valuations are independent, in that the value of winning the auction to each bidder doesn&#8217;t impact the value of winning the auction to any other bidder.</li>
<li>The valuations are private, in that the value of winning the auction to each bidder is not known to the others.</li>
<li>Not every publisher is willing to sell to every advertiser, and vice-versa, due to various brand, appropriateness and competitive concerns.</li>
<li>Publishers will sometimes set a reserve price: a price below which they will not sell, so as not to enable advertisers to use RTB to undercut existing sales channels.</li>
</ul>
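<p>To make the sealed-bid, second-price mechanics concrete, here is a toy Python version of a single auction. It is only a sketch: real exchanges differ in the details, and the handling of the reserve price (bids below it are dropped, and a lone eligible bidder pays the reserve) and of ties (first bidder listed wins) are assumptions made for the example.</p>

```python
def run_second_price_auction(bids, reserve=0):
    """Run one sealed-bid, second-price, single-unit auction.

    bids    -- dict mapping bidder name to bid in microdollars
    reserve -- publisher's reserve price in microdollars (assumed semantics)
    Returns (winner, price_paid), or None if no bid meets the reserve.
    """
    eligible = {name: bid for name, bid in bids.items() if bid >= reserve}
    if not eligible:
        return None
    # Highest bid first; the sort is stable, so ties keep their input order.
    ranked = sorted(eligible.items(), key=lambda item: item[1], reverse=True)
    winner = ranked[0][0]
    # The winner pays the next-highest eligible bid, or the reserve if alone.
    price = ranked[1][1] if len(ranked) > 1 else reserve
    return winner, price
```

<p>With bids of <code>{"a": 1500, "b": 1200, "c": 900}</code>, bidder <code>a</code> wins and pays 1200. Add a reserve of 1300 and <code>a</code> still wins but pays 1300, since no other bid clears the reserve.</p>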
<p>These characteristics make the RTB marketplace fairly different from, say, a stock exchange, where people buy and sell large pools of indistinguishable units of stock, whose value is intimately linked with the price at which it can later be resold, and in general no one cares who is buying or selling.</p>
<h2>Why?</h2>
<p>Participating in RTB auctions is fairly technically challenging, and good bidding strategies are not obvious, so there must be some perceived advantage over the alternative. In this case the alternative is a complicated web of relationships between publishers and advertisers, sometimes with &#8216;ad networks&#8217; and other middlemen in between. The purpose of RTB is to set up a more efficient marketplace for advertising inventory. One where fine-grained control over bids allows advertisers to save money by purchasing a smaller quantity of better targeted ads, even if they pay more per impression. One where publishers are able to earn more from their web properties because they can more economically reach the advertisers who are willing to pay a high price per impression for the ads they do show. And one where the end-users who actually see the ads are happier because they see fewer cheap and irrelevant ads.</p>
<p>Now obviously, for buyers to spend less and sellers to earn more something has to give, and that&#8217;s where the word &#8216;efficient&#8217; comes in. Right now there is a lot of opportunity cost and operations cost on both sides of the market and RTB is meant to reduce a lot of that. Whether or not any of the promise of RTB is actually materializing for any side of the market equation is still an open question, and it&#8217;s a rapidly-evolving area of the online advertising industry.</p>
<h2>More Reading &amp; Listening</h2>
<p><a href="http://www.mikeonads.com/" target="_blank">Mike on Ads</a> is a good blog by the CTO of AppNexus, which is a big company in this space. Here are a <a href="http://www.google.com/search?q=rtb+site%3Amikeonads.com" target="_blank">bunch of his posts on RTB</a>. I&#8217;ve also enjoyed a few episodes of the <a href="http://www.exchangewire.com/tradertalk/">Trader Talk podcast</a> on RTB.</p>
]]></content:encoded>
			<wfw:commentRss>http://nicolas.kruchten.com/content/2011/09/real-time-bidding-characterized/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Using make to Orchestrate Machine Learning Tasks</title>
		<link>http://nicolas.kruchten.com/content/2011/07/using-make-for-machine-learning/</link>
		<comments>http://nicolas.kruchten.com/content/2011/07/using-make-for-machine-learning/#comments</comments>
		<pubDate>Tue, 26 Jul 2011 13:56:54 +0000</pubDate>
		<dc:creator>Nicolas Kruchten</dc:creator>
				<category><![CDATA[Work]]></category>
		<category><![CDATA[gnu make]]></category>
		<category><![CDATA[machine learning]]></category>
		<category><![CDATA[make]]></category>
		<category><![CDATA[models]]></category>
		<category><![CDATA[parallelization]]></category>
		<category><![CDATA[recoset]]></category>
		<category><![CDATA[rtb]]></category>
		<category><![CDATA[testing]]></category>
		<category><![CDATA[training]]></category>

		<guid isPermaLink="false">http://nicolas.kruchten.com/content/?p=345</guid>
		<description><![CDATA[One of the things we do at Recoset is to use machine learning algorithms to optimize real-time bidding (RTB) policies for online display advertising. This means we train software models to predict, for example, the cost and the value of showing a given ad impression, and we then incorporate these prediction models into systems which [...]]]></description>
			<content:encoded><![CDATA[<p style="text-align: center;"><a href="http://nicolas.kruchten.com/content/wp-content/uploads/2011/07/dependencies.png" rel="lightbox[345]"><img class="aligncenter size-large wp-image-377" title="Dependencies" src="http://nicolas.kruchten.com/content/wp-content/uploads/2011/07/dependencies-1024x787.png" alt="" width="420" height="323" /></a></p>
<p>One of the things we do at <a href="http://www.recoset.com/" target="_blank">Recoset</a> is to use machine learning algorithms to optimize real-time bidding (RTB) policies for online <a href="http://en.wikipedia.org/wiki/Display_advertising" target="_blank">display advertising</a>. This means we <a href="http://en.wikipedia.org/wiki/Supervised_learning" target="_blank">train software models</a> to predict, for example, the cost and the value of showing a given ad impression, and we then incorporate these prediction models into systems which make informed bidding decisions on behalf of our clients to show their ads to their potential customers. The basic process of training a model is pretty simple: you show it a bunch of examples of whatever you want it to predict and apply whatever algorithm you&#8217;ve elected to use to set some internal parameters such that it&#8217;s likely to give you the right output for each example. A trained model is basically a set of parameters for whatever algorithm you&#8217;re using. Once you&#8217;ve trained your model, you test it: show it a bunch of new examples, to see if the training actually succeeded at making it any good at outputting whatever it is you wanted it to. The testing step often results in some datasets which we can feed into <a title="Recoset’s Dataviz System" href="http://nicolas.kruchten.com/content/2011/07/recoset-dataviz/" target="_blank">our data visualization system</a>.</p>
<p><span id="more-345"></span>Now the bidding policies we&#8217;ve developed depend upon multiple such trained models, as mentioned above. Let&#8217;s use the example of a policy P which depends on a value model V and a cost model C, so P(V,C). We won&#8217;t need to train this policy per se, but we will want to run some simulations on it to test how good we would expect it to be at meeting our clients&#8217; advertising goals. We might come up with 2 different types of value models (say V1 and V2) and 2 different types of cost models (say C1 and C2) to train and test.  So we end up with 4 possible combinations of value and cost models and thus 4 different possible policies to test: P(V1, C1), P(V1, C2), P(V2, C1), P(V2, C2). Note that in practice, we will likely come up with many more than 2 variations on value and cost models, but I&#8217;ll use 2 of each to keep this example manageable. So how do we actually go about doing all this testing and training efficiently, both from the point of view of the computer and the computer&#8217;s operator (e.g. me)?</p>
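<p>Enumerating the policies to test is just a Cartesian product of the two model lists. A quick Python illustration, using the placeholder model names from the example above:</p>

```python
from itertools import product

value_models = ["V1", "V2"]
cost_models = ["C1", "C2"]

# One policy per (value model, cost model) pair: 2 x 2 = 4 policies.
policies = [f"P({v}, {c})" for v, c in product(value_models, cost_models)]
```

<p>The number of policies grows as the product of the two list lengths, which is why doing this work efficiently starts to matter once there are more than a couple of variations of each model.</p>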
<p>When you&#8217;re just training and testing models independently of each other, you can easily get by with a simple loop in a script of some kind: <em>for each model, call train and test functions</em>. When you have some second step that depends on combinations of models, like the situation described above, you can just add a second loop after the first one: <em>for each valid combination, call test function</em>. This is how we first built our training and testing systems, but then we decided to parallelize the process to take advantage of our 16-core development machine. While brainstorming how we might build something which could manage the sequence of parallel jobs that would need to run, we realized that we&#8217;d already been using such a system throughout the development of these models: <a href="http://www.gnu.org/software/make/" target="_blank">make</a>. More specifically: make with the -j option for &#8216;number of jobs to run simultaneously&#8217;. The make command basically tries to ensure that whatever file you&#8217;ve asked it to make exists and is up to date, first by recursively ensuring that all the files your file depends on also exist and are up to date, and then by executing some command you give it to create your file. Casting our training/testing problem in terms of commands which make can understand is fairly simple and gives rise to the diagram at the top of this page: rectangles represent files on disk and ovals represent the processes which create those files. The only change required to our existing code was that the train and test functions had to be broken out into separate executable commands which read and write to files.</p>
<p>&nbsp;</p>
<div id="attachment_388" class="wp-caption aligncenter" style="width: 610px"><a href="http://nicolas.kruchten.com/content/wp-content/uploads/2011/07/make-exp-system.jpg" rel="lightbox[345]"><img class="size-large wp-image-388" title="make-exp-system" src="http://nicolas.kruchten.com/content/wp-content/uploads/2011/07/make-exp-system-1024x764.jpg" alt="" width="600" height="447" /></a><p class="wp-caption-text">Here&#39;s the actual whiteboard brainstorm that led to this design :)</p></div>
<p>&nbsp;</p>
<p>Why not just parallelize our simple for-loops using some threading/multiprocessing primitives? Because using those kinds of primitives can be difficult and frustrating, while using make is comparatively easy and gives us a number of other benefits. Beyond the aforementioned -j option which gives us parallelism, the -k option makes it less painful for some steps to error out without having to build any specific error-handling logic: make just keeps going with whatever doesn&#8217;t depend on the failed targets. And when we fix the code that crashed and re-make, make can just pick up with those failed targets, because all the rest of the files on disk don&#8217;t need to be updated and essentially act as a reusable cache or checkpoint for our computation (unless they caused the crash, in which case you delete them and make rebuilds them and keeps going). Having named targets also makes it easy to ask make to just give us a subset of the test results: it automatically runs only what it needs in order to generate the targets we ask for. When you invoke make, you have to tell it what to make, in terms of targets, for example &#8216;make all_test_results&#8217; or &#8216;make trained_model_V1&#8217; etc. Adding new steps, such as the ability to test multiple types of policies, for example, or a &#8216;deploy to production&#8217; step, becomes fairly trivial. Finally, the decomposition of our code into separate executables was actually helpful in decoupling modules and encouraging a more maintainable architecture. And whenever we get around to buying/building a cluster (or spinning up a bunch of Amazon instances), we can just prefix our executable commands with whatever command causes them to be run on a remote machine!</p>
<p>So, armed with our diagram, and our excitement about the benefits that using make would bring us, we implemented a script which essentially steps through the following pseudo-code, given a single master configuration file that defines the models to be used:</p>
<ol>
<li>for each command to be run (ovals in the diagram), generate a command-specific configuration file and write it to disk, <em>unless an identical one is already there</em>
<ul>
<li>Setup, training and testing configuration files for V and C models</li>
<li>Testing configuration files for P policies</li>
</ul>
</li>
<li>for each target (rectangles in the diagram), write to the makefile the command to run and the targets it depends on, adding the configuration file to the list of dependencies
<ul>
<li>Setup targets for V and C models</li>
<li>Training and testing targets for each V and C models</li>
<li>Testing targets for policies P for each combination of V and C models</li>
</ul>
</li>
<li>call make, passing through whatever command-line arguments were given (e.g. -j8 -k targetName)</li>
</ol>
<p>Step 1 is helpful in allowing us to change a few values in the master configuration file, and having make only rebuild the parts of the dependency tree that are affected, given that make rebuilds a target only if any of its dependencies have more recent file modification dates. This means that, for example, I can choose to change some value in the master configuration file related to the training of C models, and when I call make again, it will not retrain/retest any V models. Doing this in effect adds dependencies from the various commands to certain <em>parts</em> of the master configuration file and lets make figure out what work actually needs to be done. And, let it be said again: it will do everything in parallel!</p>
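<p>Since the real script is too tied to our models to share, here is a stripped-down Python sketch of the two ideas that matter: only touching a configuration file when its contents change, and emitting makefile rules that list that file as a dependency. Every file name and command here (<code>./train</code>, <code>./test_policy</code>, <code>V1.train.cfg</code> and so on) is hypothetical and only serves the illustration.</p>

```python
import os


def write_if_changed(path, content):
    """Write content to path unless an identical file is already there.

    Leaving an identical file alone preserves its modification time,
    so make does not rebuild the targets that depend on it.
    Returns True if the file was (re)written.
    """
    if os.path.exists(path):
        with open(path) as f:
            if f.read() == content:
                return False
    with open(path, "w") as f:
        f.write(content)
    return True


def make_rule(target, dependencies, command):
    """Format one makefile rule; the recipe line must start with a tab."""
    return f"{target}: {' '.join(dependencies)}\n\t{command}\n"


def build_makefile(value_models, cost_models):
    """Return makefile text with training rules and policy-testing rules."""
    rules = []
    for model in value_models + cost_models:
        # Each trained model depends on its own command-specific config.
        rules.append(make_rule(f"{model}.trained", [f"{model}.train.cfg"],
                               f"./train {model}.train.cfg {model}.trained"))
    for v in value_models:
        for c in cost_models:
            name = f"P_{v}_{c}"
            # A policy test depends on both trained models and its config.
            rules.append(make_rule(
                f"{name}.results",
                [f"{v}.trained", f"{c}.trained", f"{name}.test.cfg"],
                f"./test_policy {name}.test.cfg {name}.results"))
    return "\n".join(rules)
```

<p>In this sketch, changing a training setting for C models only rewrites the C configuration files, so on the next run make leaves <code>V1.trained</code> and <code>V2.trained</code> alone and rebuilds just the C models and the policy results that depend on them.</p>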
<p>Unfortunately, I don&#8217;t have any code to put up on GitHub yet as the script we wrote and I described above is very specific to our model implementation, but I wrote this all up to share our experiences with this structure and how pleased we are with the final result!</p>
]]></content:encoded>
			<wfw:commentRss>http://nicolas.kruchten.com/content/2011/07/using-make-for-machine-learning/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Recoset&#8217;s Dataviz System</title>
		<link>http://nicolas.kruchten.com/content/2011/07/recoset-dataviz/</link>
		<comments>http://nicolas.kruchten.com/content/2011/07/recoset-dataviz/#comments</comments>
		<pubDate>Thu, 07 Jul 2011 19:40:05 +0000</pubDate>
		<dc:creator>Nicolas Kruchten</dc:creator>
				<category><![CDATA[Work]]></category>
		<category><![CDATA[coffeescript]]></category>
		<category><![CDATA[dataviz]]></category>
		<category><![CDATA[protovis]]></category>
		<category><![CDATA[recoset]]></category>
		<category><![CDATA[visualization]]></category>

		<guid isPermaLink="false">http://nicolas.kruchten.com/content/?p=342</guid>
		<description><![CDATA[At Recoset, working with data often means data visualization (or dataviz): making pretty pictures with data. This is usually more like making fully machine-generated images than carefully laying out &#8220;infographics&#8221; of the Information Is Beautiful school but I find they usually end up looking pretty good. There are lots of good tools for graphing data, [...]]]></description>
			<content:encoded><![CDATA[<p style="text-align: center;"><a href="http://nicolas.kruchten.com/content/wp-content/uploads/2011/07/streamgraph.png" rel="lightbox[342]"><img class="aligncenter size-large wp-image-357" title="Streamgraph" src="http://nicolas.kruchten.com/content/wp-content/uploads/2011/07/streamgraph-1024x825.png" alt="" width="420" height="338" /></a></p>
<p>At <a href="http://www.recoset.com" target="_blank">Recoset</a>, working with data often means data visualization (or dataviz): making pretty pictures with data. This is usually more like making fully machine-generated images than carefully laying out &#8220;infographics&#8221; of the Information Is Beautiful school but I find they usually end up looking pretty good. There are lots of good tools for graphing data, like matplotlib or R or just plain old Excel-clone spreadsheets but what we use most often is <a href="http://mbostock.github.com/protovis/" target="_blank">Protovis</a>, the Javascript library for generating SVG, coupled with <a href="http://jashkenas.github.com/coffee-script/" target="_blank">CoffeeScript</a>, which is a concise and expressive language that compiles down to Javascript. The appeal of this combination for me is that it&#8217;s a very data-centric, declarative way of writing code to generate graphics. If you read the code for some of the very pretty examples on the Protovis site, you&#8217;ll see that they basically all have the same structure: a JSON-friendly object (usually an array) with all of the data to be visualized, and some small amount of code that declares how each element in that data object should be drawn. To me, the relationship between data and Protovis code is very similar to the relationship between semantic HTML and CSS: a separation of content and presentation.</p>
<p><span id="more-342"></span>When we first started using Protovis to make pretty pictures, we&#8217;d do something similar to the Protovis examples: make an HTML file which contained a bit of markup and a script block with the Javascript code to call Protovis, plus a script tag in the header, which would load an external file which contained our JSON data, preceded by something like &#8220;this.data=&#8221;. This suited us just fine, because the data to be visualized was usually the output of some Python or NodeJS or C++ process, and writing JSON to disk is really easy from pretty much any language. The thing is, once you&#8217;ve written a bunch of similar dataviz code a few times, you look for a way to reduce the repetitive work, and so you pull your code out into an external file and you end up creating a whole bunch of very small HTML files to combine code and data. Say you&#8217;ve settled on a nice way of graphing a certain type of data, for example the output of a per-partition Click-Through-Rate vs Cost-Per-Click analysis script. You&#8217;ll probably want to view the same graph for various partitions: per site, per vertical, per day of week, per time of day, whatever. So you generate a data file per partition, and you use the same code file, applied to each data file to generate each graph. And automating this task is just what the dataviz system that <a href="http://blog.francoismaillet.com/" target="_blank">François</a> and I built does.</p>
<p>At its core, it&#8217;s just a very small PHP file which generates a page with script tags that include a data file on the one hand and a code file on the other, based on GET parameters. This allowed us to create a few reusable Javascript or CoffeeScript files which specified how to visualize certain types of data, and use this script to quickly pull up a page for a given combination of data and code to have the browser show us the resulting SVG. Of course we don&#8217;t want to type GET parameters into the URL bar, so we wrapped this up in a little web-app that draws a file-tree based on what&#8217;s on disk and lets you click on the file you want and loads up our core PHP file in an iframe. No sweat. But then things get interesting.</p>
<p>Because Protovis is so data-centric and declarative, each code file implicitly requires a certain format in its data: some scripts work on arrays of arrays, others on dictionaries. We decided to create a simple convention for how we name our files on disk so our web-app would know which data files went with which code files. For example, a file in the &#8216;data&#8217; directory called &#8216;stats-daily.js&#8217; could be visualized with a code file in the &#8216;viz&#8217; directory called &#8216;stats.js&#8217;. And then at some point we decided that some data formats could be visualized in more than one way, so we extended the code-file naming scheme to enable mixing and matching, say, data files named &#8216;stats-daily.js&#8217; or &#8216;stats-hourly.js&#8217; with code files called &#8216;bargraph-stats.js&#8217; or &#8216;scattergraph-stats.js&#8217; (see campaign example below). Essentially, for a given data-type X, we have a one-to-many relationship with data files X-F1.js, X-F2.php etc. and a one-to-many relationship with code files C1-X.js, C2-X.coffee etc. You&#8217;ll notice from the file extensions in the examples just given that the data files don&#8217;t have to be static on disk: they can be dynamically generated with PHP. The code files can also be written in either Javascript or, my favoured option, CoffeeScript.</p>
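<p>The naming convention above can be sketched in a few lines of Python. This is only an illustration of the matching rule, not the code in the app itself: the function names are hypothetical, and it assumes that a data-type name never contains a hyphen.</p>

```python
import os


def data_type_of_data_file(filename):
    """Return the data-type X of a data file named X-F.ext, e.g. 'stats-daily.js'."""
    stem = os.path.splitext(filename)[0]
    return stem.split("-", 1)[0]


def data_type_of_viz_file(filename):
    """Return the data-type X of a code file named C-X.ext or plain X.ext."""
    stem = os.path.splitext(filename)[0]
    return stem.rsplit("-", 1)[-1]


def match_files(data_files, viz_files):
    """Map each data file to the sorted list of code files for its data-type."""
    by_type = {}
    for viz in viz_files:
        by_type.setdefault(data_type_of_viz_file(viz), []).append(viz)
    return {
        data: sorted(by_type.get(data_type_of_data_file(data), []))
        for data in data_files
    }
```

<p>Given the listings of the &#8216;data&#8217; and &#8216;viz&#8217; directories, the resulting dictionary is all a file-tree UI would need in order to decide which visualization links to show next to each data file.</p>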
<div id="attachment_358" class="wp-caption aligncenter" style="width: 300px"><a href="http://nicolas.kruchten.com/content/wp-content/uploads/2011/07/ui.png" rel="lightbox[342]"><img class="size-medium wp-image-358" title="UI" src="http://nicolas.kruchten.com/content/wp-content/uploads/2011/07/ui-290x300.png" alt="" width="290" height="300" /></a><p class="wp-caption-text">Basic Tree UI</p></div>
<p>The next thing we noticed after having used this system for a while is that sometimes you want to compare or otherwise look at 2 datasets of the same format, side by side. So we added generalized support for &#8220;multi&#8221; visualizations, where you can check off which data files you want to pass to the multi-data-set visualization code.</p>
<div id="attachment_355" class="wp-caption aligncenter" style="width: 310px"><a href="http://nicolas.kruchten.com/content/wp-content/uploads/2011/07/multi.png" rel="lightbox[342]"><img class="size-medium wp-image-355" title="Multi" src="http://nicolas.kruchten.com/content/wp-content/uploads/2011/07/multi-300x227.png" alt="" width="300" height="227" /></a><p class="wp-caption-text">Multi-selection UI</p></div>
<p>So now whenever someone at the office wants to visualize some data, it&#8217;s just a matter of having whatever code is generating this data output a JSON file to our &#8216;data&#8217; directory with the appropriate name for the format, and our app will display links to visualize the file with whatever appropriate code files exist. And if the desired visualization doesn&#8217;t exist yet, it&#8217;s just a matter of creating an appropriately-named file in the &#8216;viz&#8217; directory, which can then be reused to look at other data files in the same format. This makes for some nice collaborative workflows where we&#8217;re all working on trying to build models that do the same thing and we can compare results really easily (see classifier example below).</p>
<h2>Code &amp; Demo</h2>
<p>The code for our app is available on GitHub at <a href="https://github.com/recoset/visualize">https://github.com/recoset/visualize</a> and a demo install of this code is available at <a href="http://visualize.recoset.com/">http://visualize.recoset.com/</a> where you can see some sample data-sets and play with the UI. <a href="http://www.google.com/chrome/" target="_blank">Google Chrome</a> is the recommended browser for these visualizations as they&#8217;re pretty memory and CPU intensive so Firefox has trouble with them, and they&#8217;re SVG-based so IE has some trouble with that!</p>
<h2>Classifier Example</h2>
<p>If you go look at our <a href="http://visualize.recoset.com/" target="_blank">demo</a>, in the &#8216;classifier&#8217; folder, you&#8217;ll see 3 data sets. Each data set is the result of training and testing a different type of classifier to predict a certain type of conversion. The data set is an array of objects, each of which contains the results of running the classifier at a specific probability threshold. We can use this data to plot <a href="http://en.wikipedia.org/wiki/Receiver_operating_characteristic" target="_blank">Receiver Operating Characteristic</a> (ROC) curves, as well as Precision-Recall (PVR) curves and Lift curves. In the screenshot below, you can see the &#8216;multi&#8217; option in action, plotting the results of our Boosted Bagged Decision Trees, Generalized Linear Model and Stacked Denoising Autoencoders against each other (they seem to perform about the same for this task).</p>
<div id="attachment_361" class="wp-caption aligncenter" style="width: 610px"><a href="http://nicolas.kruchten.com/content/wp-content/uploads/2011/07/pvr.png" rel="lightbox[342]"><img class="size-large wp-image-361" title="pvr" src="http://nicolas.kruchten.com/content/wp-content/uploads/2011/07/pvr-1024x756.png" alt="" width="600" height="442" /></a><p class="wp-caption-text">Precision-Recall Curves</p></div>
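<p>As a rough sketch of how per-threshold results turn into the points on these curves, here is some Python. The field names (&#8216;threshold&#8217;, &#8216;tp&#8217;, &#8216;fp&#8217;, &#8216;tn&#8217;, &#8216;fn&#8217;) are assumptions made for illustration and are not necessarily the ones used in the demo data sets.</p>

```python
def curve_points(results):
    """Turn per-threshold confusion counts into ROC, precision-recall and lift points."""
    points = []
    for r in results:
        tp, fp, tn, fn = r["tp"], r["fp"], r["tn"], r["fn"]
        total = tp + fp + tn + fn
        # Recall is the true-positive rate: the y-axis of the ROC curve.
        recall = tp / (tp + fn) if tp + fn else 0.0
        # False-positive rate: the x-axis of the ROC curve.
        fpr = fp / (fp + tn) if fp + tn else 0.0
        # By convention, precision is 1.0 when nothing is predicted positive.
        precision = tp / (tp + fp) if tp + fp else 1.0
        # Lift compares precision to the overall rate of positives.
        base_rate = (tp + fn) / total if total else 0.0
        lift = precision / base_rate if base_rate else 0.0
        points.append({
            "threshold": r["threshold"],
            "recall": recall,
            "fpr": fpr,
            "precision": precision,
            "lift": lift,
        })
    return points
```

<p>Plotting &#8216;fpr&#8217; against &#8216;recall&#8217; gives a ROC curve, and &#8216;recall&#8217; against &#8216;precision&#8217; gives a precision-recall curve, one point per threshold.</p>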
<h2>Campaign Example</h2>
<p>In the &#8216;campaign&#8217; folder of our <a href="http://visualize.recoset.com/" target="_blank">demo</a>, you can see 2 datasets generated by our campaign-analysis system (note: this data has been obfuscated, and should not be interpreted as actual performance data!). The &#8217;50 top hosts&#8217; data is a good example of how we can visualize the same dataset in 4 different ways. In this case, the dataset is the performance of the 50 top hosts where we bought impressions over the course of this campaign. We can look at Click-Through Rate or Cost Per Click by host, (including confidence intervals!) or we can look at the relationship between these two quantities in a scatterplot. We can also look at a pie-chart of where we bought impressions. Same data, 4 views.</p>
<div id="attachment_354" class="wp-caption aligncenter" style="width: 610px"><a href="http://nicolas.kruchten.com/content/wp-content/uploads/2011/07/ctr.png" rel="lightbox[342]"><img class="size-large wp-image-354" title="CTR" src="http://nicolas.kruchten.com/content/wp-content/uploads/2011/07/ctr-1024x742.png" alt="" width="600" height="434" /></a><p class="wp-caption-text">Click-Through Rate with Confidence Intervals</p></div>
<div id="attachment_353" class="wp-caption aligncenter" style="width: 610px"><a href="http://nicolas.kruchten.com/content/wp-content/uploads/2011/07/cpc.png" rel="lightbox[342]"><img class="size-large wp-image-353" title="CPC" src="http://nicolas.kruchten.com/content/wp-content/uploads/2011/07/cpc-1024x766.png" alt="" width="600" height="448" /></a><p class="wp-caption-text">Cost Per Click with Confidence Intervals</p></div>
<div id="attachment_356" class="wp-caption aligncenter" style="width: 610px"><a href="http://nicolas.kruchten.com/content/wp-content/uploads/2011/07/scatter.png" rel="lightbox[342]"><img class="size-large wp-image-356" title="Scatter" src="http://nicolas.kruchten.com/content/wp-content/uploads/2011/07/scatter-1024x767.png" alt="" width="600" height="449" /></a><p class="wp-caption-text">CTR vs CPM scatterplot with CPC iso-curves</p></div>
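<p>For readers who want to reproduce charts like the ones above, here is a minimal Python sketch of the two calculations involved. The interval method is an assumption: this post doesn&#8217;t say which one our analysis scripts use, so the sketch uses the Wilson score interval, a common choice for rates like CTR. The CPC iso-curves follow from the definitions of the metrics: CPC = CPM / (1000 &times; CTR).</p>

```python
import math


def wilson_interval(clicks, impressions, z=1.96):
    """Wilson score interval for a click-through rate (z=1.96 is roughly 95%)."""
    if impressions == 0:
        return (0.0, 0.0)
    p = clicks / impressions
    denom = 1.0 + z * z / impressions
    center = (p + z * z / (2.0 * impressions)) / denom
    half = z * math.sqrt(
        p * (1.0 - p) / impressions + z * z / (4.0 * impressions ** 2)
    ) / denom
    return (center - half, center + half)


def cpc_from_cpm(cpm, ctr):
    """Cost per click implied by a cost per thousand impressions and a CTR."""
    return cpm / (1000.0 * ctr)
```

<p>Every point on a given CPC iso-curve has the same value of cpc_from_cpm, which is why the iso-curves are straight lines through the origin on a linear CTR vs CPM plot.</p>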
<p>The second dataset in this folder allows us to draw what we call a campaign stream-graph. This is a visual representation of the progression of a campaign over time (day by day in this case, although we look at hourly versions as well). This graph shows various quantities of interest for each day of the campaign: impressions, total spend, CPC, CTR, CPM, site-mix etc. I encourage anyone interested in this visualization to go to our <a href="http://visualize.recoset.com/" target="_blank">demo page</a> and mouse over the various pieces of the graph.</p>
<div id="attachment_357" class="wp-caption aligncenter" style="width: 610px"><a href="http://nicolas.kruchten.com/content/wp-content/uploads/2011/07/streamgraph.png" rel="lightbox[342]"><img class="size-large wp-image-357" title="Streamgraph" src="http://nicolas.kruchten.com/content/wp-content/uploads/2011/07/streamgraph-1024x825.png" alt="" width="600" height="483" /></a><p class="wp-caption-text">Campaign Stream-graph</p></div>
]]></content:encoded>
			<wfw:commentRss>http://nicolas.kruchten.com/content/2011/07/recoset-dataviz/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Longreads to Reeder to Readability to Kindle</title>
		<link>http://nicolas.kruchten.com/content/2011/06/longreads-to-reeder-to-readability-to-kindle/</link>
		<comments>http://nicolas.kruchten.com/content/2011/06/longreads-to-reeder-to-readability-to-kindle/#comments</comments>
		<pubDate>Sat, 18 Jun 2011 18:05:35 +0000</pubDate>
		<dc:creator>Nicolas Kruchten</dc:creator>
				<category><![CDATA[Thoughts]]></category>
		<category><![CDATA[kindle]]></category>
		<category><![CDATA[longreads]]></category>
		<category><![CDATA[readability]]></category>
		<category><![CDATA[reeder]]></category>
		<category><![CDATA[the future]]></category>

		<guid isPermaLink="false">http://nicolas.kruchten.com/content/?p=326</guid>
		<description><![CDATA[longreads.com is a great source of long-form stuff to read online. I subscribe to their RSS feed and many others via Google Reader. I also subscribe to a service called Readability, which lets me read stuff online without the visual clutter that surrounds most content, and to directly pay the content producers for any lost [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://nicolas.kruchten.com/content/wp-content/uploads/2011/06/My-Diagram-6-e1308420500921.png" rel="lightbox[326]"><img class="aligncenter size-full wp-image-327" title="Logos" src="http://nicolas.kruchten.com/content/wp-content/uploads/2011/06/My-Diagram-6-e1308420500921.png" alt="Logos" width="381" height="301" /></a><a href="http://longreads.com/" target="_blank">longreads.com</a> is a great source of long-form stuff to read online. I subscribe to their RSS feed and many others via <a href="http://www.google.com/reader" target="_blank">Google Reader</a>. I also subscribe to a service called <a href="http://www.readability.com" target="_blank">Readability</a>, which lets me read stuff online without the visual clutter that surrounds most content, and to directly pay the content producers for any lost revenue from my stripping out those ads. I also own an <a href="http://www.amazon.com/kindle" target="_blank">Amazon Kindle</a>.</p>
<p>And now these great systems all play together really smoothly, thanks to <a href="http://reederapp.com/" target="_blank">Reeder</a>, and to the Readability-Kindle integration. Reeder is a really smooth and easy-to-use Mac app which lets me plow through my Google Reader feeds and easily add content to Readability. And my <a href="https://www.readability.com/nicolaskruchten/archives" target="_blank">Readability list</a> is now synced to my Kindle just like a magazine subscription. So longreads posts a new article, which shows up in Reeder, I add it to Readability for later, and I can read it on my Kindle in the park!</p>
]]></content:encoded>
			<wfw:commentRss>http://nicolas.kruchten.com/content/2011/06/longreads-to-reeder-to-readability-to-kindle/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Rancilio Silvia iPhone Remote Control</title>
		<link>http://nicolas.kruchten.com/content/2011/05/rancilio-silvia-iphone-remote-control/</link>
		<comments>http://nicolas.kruchten.com/content/2011/05/rancilio-silvia-iphone-remote-control/#comments</comments>
		<pubDate>Fri, 13 May 2011 02:49:06 +0000</pubDate>
		<dc:creator>Nicolas Kruchten</dc:creator>
				<category><![CDATA[Projects]]></category>
		<category><![CDATA[arduino]]></category>
		<category><![CDATA[hardware]]></category>
		<category><![CDATA[iphone]]></category>
		<category><![CDATA[mod]]></category>
		<category><![CDATA[pid]]></category>
		<category><![CDATA[remote control]]></category>
		<category><![CDATA[silvia]]></category>
		<category><![CDATA[wifi]]></category>

		<guid isPermaLink="false">http://nicolas.kruchten.com/content/?p=313</guid>
		<description><![CDATA[This is a screenshot of what I pull up on my iPhone every morning now after its alarm clock wakes me up. That&#8217;s right, it&#8217;s an interface to turn on my espresso machine so that it will warm up to a specific temperature by the time I&#8217;m done snoozing! I can even look at a [...]]]></description>
			<content:encoded><![CDATA[<p style="text-align: center;"><a href="http://nicolas.kruchten.com/content/wp-content/uploads/2011/05/IMG_0288.png" rel="lightbox[313]"><img class="aligncenter size-full wp-image-315" style="border: 1px solid gray;" title="Silvia iPhone app" src="http://nicolas.kruchten.com/content/wp-content/uploads/2011/05/IMG_0288.png" alt="Silvia iPhone app" width="211" height="317" /></a></p>
<p>This is a screenshot of what I pull up on my iPhone every morning now after its alarm clock wakes me up. That&#8217;s right, it&#8217;s an interface to turn on my espresso machine so that it will warm up to a specific temperature by the time I&#8217;m done snoozing! I can even look at a real-time plot of the temperature to confirm that it&#8217;s holding where it should be and doesn&#8217;t need to be bumped up or down a degree.</p>
<p><span id="more-313"></span>More screenshots:</p>
<div id="attachment_316" class="wp-caption aligncenter" style="width: 330px"><a href="http://nicolas.kruchten.com/content/wp-content/uploads/2011/05/IMG_0276.png" rel="lightbox[313]"><img class="size-full wp-image-316" title="Silvia iPhone app icon" src="http://nicolas.kruchten.com/content/wp-content/uploads/2011/05/IMG_0276.png" alt="Silvia iPhone app icon" width="320" height="480" /></a><p class="wp-caption-text">Silvia iPhone app icon (second from the bottom on the left)</p></div>
<div id="attachment_317" class="wp-caption aligncenter" style="width: 490px"><a href="http://nicolas.kruchten.com/content/wp-content/uploads/2011/05/IMG_0287.png" rel="lightbox[313]"><img class="size-full wp-image-317" title="Silvia iPhone graph" src="http://nicolas.kruchten.com/content/wp-content/uploads/2011/05/IMG_0287.png" alt="Silvia iPhone graph" width="480" height="320" /></a><p class="wp-caption-text">Silvia iPhone graph</p></div>
<h2>Why?</h2>
<p>As outlined in <a href="http://nicolas.kruchten.com/content/2010/12/silvia-mod-plan/">this previous post where I laid out my plan</a>, I wanted the extra precision of digital temperature control in my Silvia, but I didn&#8217;t want to slap a PID box on the side of it with an LED readout of the temperature. So I decided to build my own controller, and I opted to add Wifi to it for a web-based interface, which also opened up the possibility of turning it on remotely from bed, so it would be all warmed up by the time I was ready for my morning shot of espresso. Additionally, the controller automatically shuts off 1 hour after first being activated, so I essentially leave my Silvia powered up all the time, and the boiler only turns on when I ask it to.</p>
<h2>How?</h2>
<p>For the temperature control, I contacted the guy behind <a href="http://www.pidkits.com/" target="_blank">PIDKits.com</a> and asked for everything except a PID controller, and he obliged, and also sold me a pre-assembled <a href="http://code.google.com/p/tc4-shield/" target="_blank">TC4v1 Arduino shield</a> to interface with both the thermocouple (to read the boiler temperature) and the solid-state relay (to control the temperature). So with a bit of assembly, it was pretty easy to essentially bypass the built-in thermostat with a digital one running on my Arduino. I played around with the P, I and D parameters on the controller, and I was able to get the temperature to be stable within ±1°C; much better than the ±15°C that the stock thermostat can achieve!</p>
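<p>For anyone unfamiliar with the P, I and D parameters, here is a minimal textbook PID loop in Python. It is a sketch for illustration only, not the Arduino sketch linked below: the class name and the clamping to a 0-to-1 heater duty cycle are assumptions, and a real controller would also want protection against integral wind-up.</p>

```python
class PID:
    """Textbook PID controller whose output is clamped to a 0..1 duty cycle."""

    def __init__(self, kp, ki, kd, setpoint):
        self.kp = kp
        self.ki = ki
        self.kd = kd
        self.setpoint = setpoint
        self.integral = 0.0
        self.prev_error = None

    def update(self, measured, dt):
        """Return the heater duty cycle for one reading taken dt seconds after the last."""
        error = self.setpoint - measured
        # I term: accumulate the error over time.
        self.integral += error * dt
        # D term: rate of change of the error (zero on the very first reading).
        if self.prev_error is None:
            derivative = 0.0
        else:
            derivative = (error - self.prev_error) / dt
        self.prev_error = error
        output = self.kp * error + self.ki * self.integral + self.kd * derivative
        # The solid-state relay can only be on between 0% and 100% of the time.
        return min(1.0, max(0.0, output))
```

<p>Each time the thermocouple is read, the result of update() says what fraction of the next time window the relay should keep the boiler on.</p>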
<p>Next I ordered a <a href="http://www.cutedigi.com/product_info.php?products_id=4564" target="_blank">CuHead Wifi shield</a> for my Arduino and built a very simple REST-like interface using WiServer, which returns JSON, to run on the controller itself. This essentially turns the controller into a web service, which is in turn controlled by a very simple PHP-based web app that runs on my iMac. This app contains all the right magical HTML incantations to look like a slick app on an iPhone/iPad, and can also be used on my iMac directly and even my new Kindle 3! (<em>Update Jan 2012</em>: I now have an HTML5 cached app on the phone: no iMac required after the first load!)</p>
<p>The whole thing is right now tucked away on the shelf beneath my Silvia, totally invisible to the casual eye, but I&#8217;ll probably put it in a little box or something just to hide exposed wires. (<em>Update Jan 2012</em>: in order to keep the connection stable I actually ended up hanging the controller on the wall hidden behind the Silvia.)</p>
<h2>Thoughts</h2>
<p>So I didn&#8217;t quite implement everything in the original plan, and I haven&#8217;t yet been able to run the entire app directly on the controller, nor have I been able to set up WebSockets through the Wifi shield, but overall I managed to accomplish the original design goals with a minimum of fuss, and hey, I use this thing every morning to make great espresso, so I&#8217;m a happy camper!</p>
<h2>Links</h2>
<ul>
<li><a href="https://github.com/nicolaskruchten/arduino/tree/master/SilviaController" target="_blank">Arduino sketch on GitHub</a></li>
<li><a href="https://github.com/nicolaskruchten/silvia" target="_blank">web-based &#8220;iPhone app&#8221; code on GitHub</a></li>
<li><a href="http://nicolas.kruchten.com/content/tag/silvia/">other posts about this</a></li>
</ul>
<h2>Thanks</h2>
<ul>
<li>Jim Galt from PIDKits.com for all his advice</li>
<li>The authors and contributors to the following open-source projects for making this type of hacking so much hassle-free fun:
<ul>
<li><a href="http://code.google.com/p/tc4-shield/" target="_blank">the TC4 libraries</a></li>
<li><a href="http://asynclabs.com/wiki/index.php?title=WiServer" target="_blank">the WiServer library</a></li>
<li><a href="http://smoothiecharts.org/" target="_blank">SmoothieCharts</a></li>
<li><a href="http://www.arduino.cc/" target="_blank">the Arduino toolchain</a></li>
</ul>
</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://nicolas.kruchten.com/content/2011/05/rancilio-silvia-iphone-remote-control/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Statsd, Graphite and Nagios</title>
		<link>http://nicolas.kruchten.com/content/2011/05/statsd-graphite-and-nagios/</link>
		<comments>http://nicolas.kruchten.com/content/2011/05/statsd-graphite-and-nagios/#comments</comments>
		<pubDate>Mon, 02 May 2011 02:59:31 +0000</pubDate>
		<dc:creator>Nicolas Kruchten</dc:creator>
				<category><![CDATA[Work]]></category>
		<category><![CDATA[alarming]]></category>
		<category><![CDATA[dashboard]]></category>
		<category><![CDATA[devops]]></category>
		<category><![CDATA[graphite]]></category>
		<category><![CDATA[monitoring]]></category>
		<category><![CDATA[nagios]]></category>
		<category><![CDATA[open source]]></category>
		<category><![CDATA[opsview]]></category>
		<category><![CDATA[recoset]]></category>
		<category><![CDATA[software]]></category>
		<category><![CDATA[statsd]]></category>

		<guid isPermaLink="false">http://nicolas.kruchten.com/content/?p=296</guid>
		<description><![CDATA[At Recoset we tend to worship, like Etsy (and AppNexus!), at the Church of Graphs. We&#8217;ve even started using Statsd, the system they&#8217;ve released to collect stats and relay them to Carbon for display in Graphite. And by display, I mean display on a dashboard visible to the entire dev team at the office, as [...]]]></description>
			<content:encoded><![CDATA[<p style="text-align: center;"><a href="http://nicolas.kruchten.com/content/wp-content/uploads/2011/05/dashboard.jpg" rel="lightbox[296]"><img class="size-medium wp-image-301 aligncenter" style="border: 1px solid gray;" title="dashboard" src="http://nicolas.kruchten.com/content/wp-content/uploads/2011/05/dashboard-300x225.jpg" alt="The Recoset Dashboard" width="300" height="225" /></a></p>
<p>At <a href="http://www.recoset.com">Recoset</a> we tend to worship, like Etsy (and <a href="http://techblog.appnexus.com/2011/metrics" target="_blank">AppNexus</a>!), at <a href="http://codeascraft.etsy.com/2011/02/15/measure-anything-measure-everything/">the Church of Graphs</a>. We&#8217;ve even started using <a href="https://github.com/etsy/statsd">Statsd</a>, the system they&#8217;ve released to collect stats and relay them <a href="http://graphite.wikidot.com/">to Carbon for display in Graphite</a>. And by display, I mean display on a dashboard visible to the entire dev team at the office, as seen above!</p>
<p>Statsd is a very simple system to which you can send UDP messages about various stats you want to track, which it then aggregates and passes along to Carbon, which stores them in Whisper, Graphite&#8217;s back-end data store. That&#8217;s a lot of moving parts but it works very well. Sending stats to statsd is extremely easy from any language (we do it from Javascript and C++) and carries low overhead, which is key for the type of work we do.</p>
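<p>To give a sense of how little code it takes to send a stat, here is a minimal Python sketch of a statsd client. It is not our production code (which is Javascript and C++). The stat names are made up, and the host and port are assumptions: 8125 is the default statsd port.</p>

```python
import socket


def format_counter(name, delta=1):
    """Build a statsd counter message, e.g. 'bids.won:1|c'."""
    return "{0}:{1}|c".format(name, delta)


def format_timer(name, millis):
    """Build a statsd timing message, e.g. 'auction.latency:12|ms'."""
    return "{0}:{1}|ms".format(name, millis)


def send_stat(message, host="127.0.0.1", port=8125):
    """Fire-and-forget: one UDP datagram per stat, with no reply expected."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(message.encode("ascii"), (host, port))
    finally:
        sock.close()
```

<p>Because UDP is connectionless, a call like send_stat(format_counter("bids.won")) doesn&#8217;t wait on statsd, which is where the low overhead comes from.</p>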
<p><span id="more-296"></span></p>
<div id="attachment_299" class="wp-caption aligncenter" style="width: 310px"><a href="http://nicolas.kruchten.com/content/wp-content/uploads/2011/05/graphite.png" rel="lightbox[296]"><img class="size-medium wp-image-299" title="graphite" src="http://nicolas.kruchten.com/content/wp-content/uploads/2011/05/graphite-300x219.png" alt="Graphite Screenshot" width="300" height="219" /></a><p class="wp-caption-text">Graphite Screenshot</p></div>
<p>Graphite is basically a Django web app with a few different fancy front-ends to their &#8220;/render&#8221; URL, which returns lovely graphs depending on the query string, like you see above. There&#8217;s the &#8216;composer&#8217; GUI interface, which is a point-and-click graph builder, as well as a web-based command line interface which can be scripted to generate lots of graphs quickly. If you tack on &#8220;rawData=true&#8221; to the query-string you pass to what they call the <a href="http://graphite.wikidot.com/url-api-reference">URL API</a>, you get what you&#8217;d expect: the raw data that would have been used to generate a graph had that parameter not been set.</p>
<p>Now our dashboard doesn&#8217;t just show Graphite graphs, it cycles through multiple Firefox tabs (using the <a href="https://addons.mozilla.org/en-us/firefox/addon/tab-slideshow/">Tab Slideshow plugin</a>) one of which is <a href="http://www.opsview.com/">Opsview</a>, which is a web front-end to the monitoring tool <a href="http://www.nagios.org/">Nagios</a>. We use Nagios to monitor a variety of systems, and to notify us if something goes wrong. Here&#8217;s a screenshot of Opsview telling us Nagios found nothing wrong and everything is great:</p>
<div id="attachment_300" class="wp-caption aligncenter" style="width: 310px"><a href="http://nicolas.kruchten.com/content/wp-content/uploads/2011/05/opsview.png" rel="lightbox[296]"><img class="size-medium wp-image-300" title="opsview" src="http://nicolas.kruchten.com/content/wp-content/uploads/2011/05/opsview-300x219.png" alt="Opsview Screenshot" width="300" height="219" /></a><p class="wp-caption-text">Opsview Screenshot</p></div>
<p>You can probably see where this is going: since we&#8217;re already shuttling stats to Graphite, and we want to use Nagios for alarming, and Graphite has this rawData mode&#8230; I built a generic little Nagios plugin called <a href="https://github.com/recoset/check_graphite">check_graphite</a> which can be used to create Nagios service-checks so that it can monitor stats in Graphite and fire off alarms if needed. This was made pretty trivial by the excellent <a href="http://packages.python.org/NagAconda/plugin.html">Nagaconda</a> python module, but the end result is pretty powerful. We can now very easily set Nagios alarms on any stat we send to Graphite through Statsd, just by creating a service-check that contains the right query-string.</p>
<p style="text-align: center;"><a href="https://github.com/recoset/check_graphite">The check_graphite code is available on github under an MIT license</a>.</p>
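<p>The core idea of such a check can be sketched in Python as follows. This is a simplified illustration and not the check_graphite source: it doesn&#8217;t use NagAconda, the function names are hypothetical, and it assumes that each line of rawData output looks like &#8220;target,start,end,step|v1,v2,...&#8221; with &#8220;None&#8221; for missing points. The return values are the standard Nagios plugin exit codes.</p>

```python
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def parse_raw_line(line):
    """Split one rawData line into its target name and its non-missing values."""
    header, _, body = line.strip().partition("|")
    # The target may itself contain commas, so peel start, end and step off the right.
    target = header.rsplit(",", 3)[0]
    values = [float(v) for v in body.split(",") if v and v != "None"]
    return target, values


def check_latest(values, warning, critical):
    """Compare the most recent value against thresholds, Nagios-style."""
    if not values:
        return UNKNOWN
    latest = values[-1]
    if latest >= critical:
        return CRITICAL
    if latest >= warning:
        return WARNING
    return OK
```

<p>A real plugin would fetch the line from the /render URL, print a one-line status message and exit with the code returned by check_latest.</p>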
<p>Want to know more about what <a href="http://www.recoset.com/" target="_blank">Recoset</a> does or how our algorithms work? You should check out some other Recoset-related posts, such as &#8220;<a title="RTB Pacing: is everyone doing it wrong?" href="http://nicolas.kruchten.com/content/2011/09/rtb-pacing-is-everyone-doing-it-wrong/">RTB Pacing: is everyone doing it wrong?</a>&#8221; or the &#8220;<a title="Peeking Into the Black Box, Part 1: Recoset’s RTB Algorithms" href="http://nicolas.kruchten.com/content/2011/12/peeking-into-the-black-box-part-1/">Peeking into the Black Box</a>&#8221; series.</p>
<p>(Update Aug 2011: for some of our most frequent stats we now bypass statsd and instead aggregate counters at their point of origin to send directly to Carbon, which is Graphite&#8217;s back-end. This cuts down on UDP messages and CPU usage considerably when sending tens of thousands of messages per second from one process through statsd :)</p>
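<p>Here is a minimal Python sketch of that kind of at-source aggregation. Again, this is an illustration and not our actual code: the class and function names are hypothetical, and it assumes Carbon&#8217;s plaintext protocol (one &#8220;path value timestamp&#8221; line per stat) on its default port, 2003.</p>

```python
import socket
import time
from collections import Counter


class LocalAggregator:
    """Count events in-process and emit one Carbon line per stat on flush."""

    def __init__(self):
        self.counts = Counter()

    def increment(self, name, delta=1):
        # Just a dictionary update: no network traffic per event.
        self.counts[name] += delta

    def flush_lines(self, now=None):
        """Return Carbon plaintext lines for everything counted, then reset."""
        timestamp = int(time.time() if now is None else now)
        lines = [
            "{0} {1} {2}\n".format(name, value, timestamp)
            for name, value in sorted(self.counts.items())
        ]
        self.counts.clear()
        return lines


def send_to_carbon(lines, host="127.0.0.1", port=2003):
    """Send the flushed lines to Carbon over a single TCP connection."""
    if not lines:
        return
    conn = socket.create_connection((host, port))
    try:
        conn.sendall("".join(lines).encode("ascii"))
    finally:
        conn.close()
```

<p>Calling flush_lines() on a timer, say every ten seconds, replaces thousands of UDP messages with a handful of lines per flush.</p>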
]]></content:encoded>
			<wfw:commentRss>http://nicolas.kruchten.com/content/2011/05/statsd-graphite-and-nagios/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Graphing Silvia Temperature on iPad</title>
		<link>http://nicolas.kruchten.com/content/2011/04/graphing-siliva-temperature-on-ipad/</link>
		<comments>http://nicolas.kruchten.com/content/2011/04/graphing-siliva-temperature-on-ipad/#comments</comments>
		<pubDate>Mon, 11 Apr 2011 02:56:19 +0000</pubDate>
		<dc:creator>Nicolas Kruchten</dc:creator>
				<category><![CDATA[Projects]]></category>
		<category><![CDATA[arduino]]></category>
		<category><![CDATA[hardware]]></category>
		<category><![CDATA[silvia]]></category>

		<guid isPermaLink="false">http://nicolas.kruchten.com/content/?p=279</guid>
		<description><![CDATA[So I finally got around to working on my Silvia Mod Plan, getting all the way to Step 5! The video above is a demo of the setup I have to show a real-time graph on my iPad of the boiler temperature in the Silvia. Having installed the thermocouple in the Silvia and played with [...]]]></description>
			<content:encoded><![CDATA[						<div class="flickr-gallery video medium none">
							<object type="application/x-shockwave-flash" wmode="opaque" width="400" height="265" data="http%3A%2F%2Fwww.flickr.com%2Fapps%2Fvideo%2Fstewart.swf%3Fv%3D109786%26photo_id%3D5608043281%26photo_secret%3D59e7f44451" classid="clsid:D27CDB6E-AE6D-11cf-96B8-444553540000">
								<param name="movie" value="http%3A%2F%2Fwww.flickr.com%2Fapps%2Fvideo%2Fstewart.swf%3Fv%3D109786%26photo_id%3D5608043281%26photo_secret%3D59e7f44451"></param>
								<param name="bgcolor" value="#000000"></param>
								<param name="allowFullScreen" value="true"></param>
								<param name="wmode" value="opaque" />
								<embed type="application/x-shockwave-flash" src="http://www.flickr.com/apps/video/stewart.swf?v=109786&photo_id=5608043281&photo_secret=59e7f44451" bgcolor="#000000" allowfullscreen="true" flashvars="intl_lang=en-us&amp;photo_secret=59e7f44451&amp;photo_id=5608043281" height="265" width="400" wmode="opaque"></embed>
							</object>
						</div>
					
<p>So I finally got around to working on my <a href="http://nicolas.kruchten.com/content/2010/12/silvia-mod-plan">Silvia Mod Plan</a>, getting all the way to Step 5! The video above is a demo of the setup I have to show a real-time graph on my iPad of the boiler temperature in the Silvia.</p>
<p>Having installed the thermocouple in the Silvia and played with my <a href="http://code.google.com/p/tc4-shield/">TC4 shield</a>, my initial plan was to use the Arduino to transmit data to my iMac using <a href="http://www.digi.com/products/wireless-wired-embedded-solutions/zigbee-rf-modules/zigbee-mesh-module/xbee-zb-module.jsp#overview">XBee</a> as a wireless serial link, where I would run a <a href="http://nodejs.org/">NodeJS</a> process which would read data from the USB port and which would communicate with the iPad via a WebSocket over Wifi (phew, mouthful!). Ideally the Arduino would speak Wifi but in the meantime I figured I&#8217;d play with this setup. I chose NodeJS because it seemed really easy to set up <a href="http://socket.io/">WebSockets using socket.io</a>, and that seemed like a good way to feed data to <a href="http://smoothiecharts.org/">Smoothie Charts</a> for real-time graphing. I rewrote the code in <a href="http://jashkenas.github.com/coffee-script/">CoffeeScript</a>, because it&#8217;s the best way to write NodeJS code IMO (a discovery I made after writing the first version of this code 4 months ago) and because it&#8217;s so fitting for this project!</p>
<p><span id="more-279"></span></p>
<p>The first snag I ran into was the fact that there doesn&#8217;t seem to be a good way to read from a USB port from Node, and it felt too much like my day-job to write an add-on in C++ to do it, so I had to work around it. My solution was to run a separate Python process which would read data from the USB port and would POST to the NodeJS web server whenever the Arduino sent a temperature reading.</p>
<p>The second snag I ran into was the fact that the XBee didn&#8217;t seem to be able to transmit all the way to my iMac, so I hung the XBee from my laptop, located halfway between the iMac and the Silvia and ran the Python script there, POSTing over Wifi to the iMac. Then I smartened up and installed the NodeJS process on my laptop. So really, this system is just a convoluted way of doing what could have been done using a USB cable to graph data on my laptop, but bouncing signals around over multiple protocols, machines and wireless technologies is so much more fun!</p>
<p>The code is, of course, up on Github: <a href="https://github.com/nicolaskruchten/coffeegraph">https://github.com/nicolaskruchten/coffeegraph</a></p>
						<div class="flickr-gallery image none"><a href="http://www.flickr.com/photos/nicolaskruchten/5608059329"><img class="flickr medium" title="Inside Silvia" alt="Inside Silvia" src="http://farm6.static.flickr.com/5267/5608059329_265a12c584.jpg" /></a></div>
					
						<div class="flickr-gallery image none"><a href="http://www.flickr.com/photos/nicolaskruchten/5608642978"><img class="flickr medium" title="iPad Screenshot" alt="iPad Screenshot" src="http://farm6.static.flickr.com/5141/5608642978_808d83b093.jpg" /></a></div>
					
						<div class="flickr-gallery image none"><a href="http://www.flickr.com/photos/nicolaskruchten/5608642900"><img class="flickr medium" title="Silvia and Arduino" alt="Silvia and Arduino" src="http://farm6.static.flickr.com/5310/5608642900_6579f7515f.jpg" /></a></div>
					
]]></content:encoded>
			<wfw:commentRss>http://nicolas.kruchten.com/content/2011/04/graphing-siliva-temperature-on-ipad/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
	</channel>
</rss>

