When I (Duncan) was on Uber’s Marketplace team, we would (semi) joke that we were lurching from crisis to crisis.
Our dozens of core machine learning products directly controlled billions of company dollars — targeted promotions, surge pricing, driver incentives, ETAs, pool matching, upfront rider fares, subscription upsells, the list goes on. We lived in paranoia that these were fundamentally broken in a way that would sink the business.[1]
Every week brought a new potential Data Science Disaster. The turmoil would always start the same way: someone would find something suspicious in the data. It might have been a spike in our internal metrics like “zeros”, which measured riders who opened the app and didn’t see any cars available. Were we under-surging and losing riders because of poor reliability? Had we deployed too many rider promos? The wrong driver incentives?
Or it could have been a tweet — perhaps a celebrity saw a fare that seemed high, and then they walked 20 feet and the price changed materially — and of course screenshots went viral and started spiraling.
Inevitably, this would happen at 7pm. Our teams at HQ in SF would be starting to grab in-office dinner and maybe an uBeer; data on the evening rush would have landed for all of our key markets; and meanwhile, our non-tech colleagues on the East Coast would be logging back on for the late shift. As soon as there was smoke, uChats (Uber’s internal messaging system) would start firing and a whirlwind of analysis would get kicked up, with “jams” set up that evening to dig in or tech reviews scheduled with leadership for the next day. Most of the time we wouldn’t find a real problem, but sometimes we would — resulting in urgent code diffs, roadmap shifts, and sometimes unexpected reorgs and abrupt departures.
It was intense, partly because Uber’s culture was intense. But as I reflect back, I realize it was intense because ML itself is super intense.
Archimedes once said that given a lever long enough, and a place from which to stand, he would move the world. ML can be the lever that moves the world — when it works well, it delivers incredible, transformative value. But the lever is only as good as its weakest fibers. And when ML is even merely mediocre, it can snap back and slap you in the face.
This is different from most tech (and non-tech) functions, which can tolerate things that aren’t quite perfect. Mediocre product management, mediocre engineering, or mediocre finance aren’t good, but they are often tolerable.
But the returns to quality in ML are highly nonlinear. Great ML is game changing, while mediocre ML is a whole other beast — and it’s downright dangerous. Here’s how we think about it:
The Danger Zone is the region of mediocre quality in ML: it looks OK, but it isn’t. That level of quality wouldn’t be great in most functions, but it would be tolerable. In ML, it can do tremendous damage to your business.
Why mediocre ML is especially destructive
As you’re reading this, you might be saying, Yeah, mediocre ML can be bad — but so can mediocre work in any area. We’re going to try to change your mind. We believe that mediocre ML is uniquely dangerous for three reasons:
1. ML’s massive scale amplifies missteps
ML operates at a scale unlike most other business functions. A single model can make millions of high-stakes decisions every hour, which means that even minor flaws can have significant consequences.
Consider an ML model that an airline uses to set prices for flights. That model could price billions of dollars of flights each year. The model literally decides the firm’s revenue.
At that scale, the difference between good and bad is staggering. A well-tuned model might generate hundreds of millions of dollars of incremental profit; a subpar one could just as easily incinerate the same amount.
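To make the scale concrete, here’s a back-of-the-envelope sketch in Python. Every number in it is hypothetical; the point is simply that a small relative error multiplied by a huge base is a huge absolute number.

```python
# Back-of-the-envelope: small relative pricing errors at airline scale.
# All numbers below are hypothetical, for illustration only.

annual_booked_revenue = 5_000_000_000  # say the model prices $5B of flights/year
profit_margin = 0.08                   # assume an 8% margin on those bookings

# Suppose a mediocre model systematically misprices by just 2%.
mispricing_rate = 0.02

revenue_at_risk = annual_booked_revenue * mispricing_rate
annual_profit = annual_booked_revenue * profit_margin

print(f"Revenue at risk per year: ${revenue_at_risk:,.0f}")               # $100,000,000
print(f"Share of annual profit:   {revenue_at_risk / annual_profit:.0%}")  # 25%
```

A 2% error sounds small in a slide deck; a quarter of the profit pool does not.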
It’s now common for ML to power dozens of critical applications across the enterprise: deploying marketing budgets, buying and managing inventory, setting work schedules, blocking fraud, managing customer support. In each of these, the opportunity is big in absolute terms but the margin for error is small. The model needs to get it right, or things can go very wrong.
2. ML systems are opaque & outputs are hard to monitor
ML systems are made up of many different parts, each operating with minimal human oversight, and these systems can be deceiving.
On the surface, everything might look like it “works”: data gets ingested, predictions are made, decisions are set. But inside, the system could be on fire. It could be optimizing for something you didn’t really want, like clicks instead of quality. Or worse: you did intend to optimize for clicks, but the clicks data is simply broken, and nobody knows it. And so, unbeknownst to anyone, the system’s outputs could be burning a hole in your business.
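What does catching this look like? Here’s a sketch of the kind of cheap invariant checks that can flag silently broken click data before it trains a model. The schema (a `clicked` column of 0/1 events) and all the thresholds are hypothetical stand-ins; real bounds should come from your own historical baselines.

```python
import pandas as pd

def sanity_check_clicks(df: pd.DataFrame) -> list[str]:
    """Cheap invariant checks for a hypothetical click log with a 'clicked' column."""
    problems = []

    # 1. Volume: a pipeline that "works" can still silently drop most of its rows.
    if len(df) < 100_000:
        problems.append(f"suspiciously low row count: {len(df)}")

    # 2. Nulls: upstream schema changes often surface as missing values.
    null_rate = df["clicked"].isna().mean()
    if null_rate > 0.01:
        problems.append(f"null rate {null_rate:.1%} exceeds 1%")

    # 3. Base rate: a click-through rate far outside its historical band is a
    #    classic sign of double-logging or dropped events, not of user behavior.
    ctr = df["clicked"].mean()
    if not 0.005 <= ctr <= 0.10:
        problems.append(f"CTR {ctr:.2%} outside expected band [0.5%, 10%]")

    return problems
```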
ML models typically deliver hyper-targeted predictions, and it’s this targeting that makes them so valuable — but it’s also this targeting that makes it impractical-to-impossible for any human team to monitor the outputs with precision. There’s just way too much data coming out of the system to actually review and understand.
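You can’t review millions of individual predictions, but you can monitor their distribution. One common technique for this (a generic sketch, not a description of Uber’s actual tooling) is the Population Stability Index: compare today’s prediction distribution against a trusted baseline and alert when it shifts.

```python
import numpy as np

def population_stability_index(baseline, current, n_bins=10):
    """PSI between two score distributions.

    Rule of thumb: < 0.1 stable, 0.1-0.25 investigate, > 0.25 likely drift.
    """
    # Bin edges from baseline quantiles, so each bin holds similar mass.
    edges = np.quantile(baseline, np.linspace(0, 1, n_bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf  # catch out-of-range scores

    base_frac = np.histogram(baseline, bins=edges)[0] / len(baseline)
    curr_frac = np.histogram(current, bins=edges)[0] / len(current)

    # Epsilon floor avoids log(0) on empty bins.
    eps = 1e-6
    base_frac = np.clip(base_frac, eps, None)
    curr_frac = np.clip(curr_frac, eps, None)

    return float(np.sum((curr_frac - base_frac) * np.log(curr_frac / base_frac)))

# Hypothetical usage: last month's scores vs. today's.
rng = np.random.default_rng(0)
baseline = rng.beta(2.0, 5.0, size=100_000)
current = rng.beta(2.3, 5.0, size=100_000)
print(f"PSI = {population_stability_index(baseline, current):.3f}")
```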
The difficulty inherent in monitoring ML systems is especially troubling in life-or-death situations like medicine, where ML is increasingly being used to diagnose diseases and determine treatment plans. That isn’t to say we shouldn’t bring ML into these contexts (its positive impact on healthcare can literally be life-saving), but we have a lot of work to do to make sure we can detect problems. (Needless to say, this is getting even harder in the age of generative AI!)
3. There aren’t established ways of doing it “right”
ML is relatively new, and there simply aren’t that many experts in the field. Most business leaders lack ML expertise — and data scientists lack business experience.
This is hard for both sides: leaders find it difficult to critically evaluate ML models and methodologies. They might not even know what questions to ask, and might not recognize issues like overfitting or biased training data.
At the same time, since the field is new and specialists are in high demand, many data scientists are junior. These freshly minted PhDs excel in statistics, but they aren’t taught how to build production ML models in school. They also haven’t had the chance to develop business acumen and are figuring it out from scratch every time.
Overall, this makes it way too easy to inadvertently make major mistakes — which would be quickly caught in more mature functions.
Signs you might be in the danger zone — and what to do about it
It can be hard to know if you’re in the danger zone, because in ML it sits just spitting distance from transformative impact. The scariest part of the danger zone is where you think everything checks out but in reality it doesn’t. Making it to the happy place in the far right tail means getting ALL of the pieces right.
In our experience, success requires having crisp, affirmative (and correct) answers to each of the following:
Do you know what you are trying to solve for?
Do you have high quality data?
Have you built features that capture the key patterns in your data?
Have you chosen appropriate models: flexible enough to capture the patterns, but not so flexible that they overfit? (See the sketch after this list.)
Do you have rigorous training, testing, and validation procedures?
Do you run careful, long-term experiments to measure impacts?
Do you have production monitoring in place?
Do you have robust and scalable pipelines and infrastructure to train and serve your models?
Have you revisited all of this recently, since what worked a year ago might be broken now?
Perhaps most importantly, do you have a culture of curiosity and inquiry, where it’s rewarded to question whether the pieces make sense? There will inevitably be problems — so you need a team that will dig in.
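To make the model-flexibility and validation questions above concrete, here’s a minimal sketch of a time-based train/validation/test split plus the simplest overfitting smell test: the gap between training and validation error. The data, model, and split fractions are all illustrative stand-ins, not a prescription.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error

def time_based_split(X, y, timestamps, train_frac=0.7, val_frac=0.15):
    """Split chronologically: train on the past, validate and test on the future.

    Expects numpy arrays; returns (train, val, test) as (X, y) pairs.
    """
    order = np.argsort(timestamps)
    n = len(order)
    i, j = int(n * train_frac), int(n * (train_frac + val_frac))
    tr, va, te = order[:i], order[i:j], order[j:]
    return (X[tr], y[tr]), (X[va], y[va]), (X[te], y[te])

# Synthetic stand-ins for your real feature matrix, target, and event times.
rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 5))
y = X[:, 0] * 2 + rng.normal(size=10_000)
ts = np.arange(10_000)

train, val, test = time_based_split(X, y, ts)
model = GradientBoostingRegressor().fit(*train)

train_err = mean_absolute_error(train[1], model.predict(train[0]))
val_err = mean_absolute_error(val[1], model.predict(val[0]))

# A large train/validation gap is the classic overfitting smell.
print(f"train MAE={train_err:.3f}  val MAE={val_err:.3f}  gap={val_err - train_err:.3f}")
```

The design choice that matters most here is splitting by time rather than at random: random splits let the model peek at the future and flatter your offline metrics.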
At Uber, one of our data science values was “As simple as possible, but no simpler”. Our recommendation is to bring that approach, and a relentless focus on quality, to every part of the ML workflow. That’s how you end up on the happy path.
[1] With Uber’s market cap at $140B as of this writing, I think it’s safe to say that we did not in fact sink the business.