The paradox of machine learning – what leaders need to know
Do the work – and don’t buy the snake oil
When I (Duncan) was leading data science for Uber’s Marketplace Experimentation (MX) team, one of our earliest big bets was on synthetic control technology, a machine learning-based causal inference technique used to measure the effects of a product launch when you can’t run a clean user A/B test, for example because there are big network effects.1
Our first use case for synthetic controls was a biggie: Uber was undertaking a massive investment in carpooling with a new product called Express Pool. Express Pool was designed to improve efficiency over traditional Uber Pool by leveraging a host of new product and infrastructure innovations. It required so much compute under the hood that we joked that the lights would flicker down in San Jose during an optimization run.
My team was responsible for measuring the effects of Express Pool on the business. Express Pool was a major initiative with C-level visibility, so it was critical that we get it right; the revenue and profitability impact numbers generated by our models – positive or negative – would shape tech investment decisions for the firm. The stakes were high for me personally, for my team, and for the company overall. The pressure was on.
Four months from the launch, we had a clean Gantt chart plotting our work for the synthetic control MVP. Our plan was oriented around finding, understanding, and cleaning the data, and getting the first basic models off the ground. While we’d need to burn the candle at both ends, we felt we would get it done.
But as that quarter’s company-wide planning wrapped up and headcount asks were in, we were disheartened to learn that a rival team – four times the size of ours! – in a different org had caught wind of our project and pitched a competing solution.
Their proposal was far more exciting, and was entirely oriented around sophisticated deep learning techniques with names like LSTMs and autoencoders. And of course they had asked for and received even more headcount to hire against this ambitious plan. Those models would knock the socks off ours… if they worked.
I instantly got lots of questions from ambitious engineers: why aren’t we doing that stuff? What will happen to our team when they land their tech? Leaders in my org pressed me in stomach-turning meetings on whether we were going to be left behind.
We had a significant dilemma: should we change course?
But conspicuously absent from the rival team’s plan was any mention of actually getting the data right. When I asked about it, I would be redirected to the deep learning models; I’d be sent conference papers on arXiv, and made to feel like I wasn’t sufficiently on top of the research literature to even ask the right questions.
Something smelled off. I knew the other team wasn’t worried about the right things. The work over the next quarter wasn’t going to be about who could build the fanciest model. It was about getting the nuts and bolts right. The devil is in the details; fancy new models can only go so far.
Data preparation, data cleaning, building features, measuring performance, setting benchmarks — all of these steps would be slow and hard. They’re the backbone of successful machine learning initiatives, even if they aren’t as impressive-sounding as building a transformer in PyTorch.
In the end, we were right. Express Pool launched successfully, and our measurements helped inform next steps. The rival team’s tech didn’t work, and after a few more similar missteps in the quarters that followed, leadership lost confidence in their ability to deliver and the team was eventually dismantled. The saga was memorialized in a Harvard Business School case study.
My takeaway?
Leaders need to have a realistic view of what it takes to build ML products that deliver value — and they need to make sure their teams are actually doing that work, not something else.
There’s a deep irony here: For all the automation it promises, making machine learning happen is deeply manual work.
Next time, we’ll do a deep dive into the specifics of the most painstaking steps – what they are, why they’re hard, and what leaders should do about them.
The idea behind synthetic controls is simple: to measure the impact of a product launch somewhere, say in San Francisco, we would use machine learning to construct a synthetic version of San Francisco from data on other cities where the new product wasn’t launched, perhaps Los Angeles, Sacramento, Chicago, and New York City. Then we would compare key metrics like trips and revenue for the real SF against our “synthetic” SF; the difference between the two is our estimate of the launch’s impact.
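To make that footnote concrete, here is a minimal sketch of the synthetic control idea in Python. Everything in it (the city list, the weekly trip numbers, the plain least-squares fit) is an illustrative assumption, not Uber’s actual data or implementation; production synthetic control methods typically constrain the weights and validate the fit far more carefully.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative weekly trip counts for donor cities (columns: LA, Sacramento,
# Chicago, NYC) over a 20-week pre-launch window. All numbers are made up.
donor_pre = rng.normal(loc=[100, 80, 120, 150], scale=5, size=(20, 4))

# San Francisco's trips over the same window (simulated here as a noisy
# combination of the donor cities, just so the example has a known answer).
true_weights = np.array([0.4, 0.1, 0.2, 0.3])
sf_pre = donor_pre @ true_weights + rng.normal(scale=2, size=20)

# Learn weights that best reconstruct pre-launch SF from the donor cities.
# (A real synthetic control would typically constrain these weights.)
weights, *_ = np.linalg.lstsq(donor_pre, sf_pre, rcond=None)

# Post-launch, "synthetic SF" is the same weighted combination of donor cities;
# the gap between actual SF and synthetic SF estimates the launch's impact.
donor_post = rng.normal(loc=[100, 80, 120, 150], scale=5, size=(8, 4))
sf_post_actual = donor_post @ true_weights + 12  # pretend the launch adds ~12 trips/week
sf_post_synthetic = donor_post @ weights

print(f"Estimated lift: {(sf_post_actual - sf_post_synthetic).mean():.1f} trips per week")
```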