Navigating uncertainty in Amazon's middle-mile network

Before the “last mile” delivery driver sets off for your home, your Amazon item has moved through the middle-mile network of fulfillment centers and sort centers, which brings products close enough to customers to make our same-day or next-day shipping promises possible. For years, Amazon engineers and scientists have been pushing computational boundaries to optimize this network under uncertainty, and that push has accelerated as the network has grown more complex. What happens when a huge snowstorm closes major highways, a sort center is hit by a power outage, or demand for a viral product spikes? These headline disruptions get attention because they’re obvious system shocks that vividly illustrate the challenge of planning for uncertainty. But the most important sources of uncertainty are far more subtle: the day-to-day variations in demand and travel times that, if you don’t look closely enough, erode efficiency across the entire network. We’ve found that even when we consider demand variability alone, optimizing for uncertainty offers potential savings of 0.5%. This is a small percentage, but we obsess over small percentages because real customer experiences lie behind them. And demand variability is just one piece of a puzzle that includes road delays, processing-time fluctuations, and countless other microvariations.

Months before a customer clicks “Buy Now”, Amazon’s logistics experts consider a multitude of middle-mile routing questions: What routes should trucks take between warehouses? When should shipments depart? Where should inventory be positioned to meet customer demand? This proactive shaping of the network’s structure and timing is called network design. Our challenge is not to optimize for perfect conditions but rather to build plans that remain effective even when things don’t go as expected.
A computational puzzle of staggering complexity

Even if we could count on perfect conditions, optimizing the middle-mile network is challenging because it requires coordinating tens of millions of different products moving through hundreds of facilities, each with limited capacity and specific operating hours. A key difficulty is the mix of optimization decisions involved. Some are a matter of degree (what volume of packages to send down a particular route). Others are binary (open this shipping lane or not; depart now or wait for more cargo). Put them together, and you get what’s called a mixed-integer optimization problem, a kind of problem whose solution strategies explode combinatorially in both computational time and memory space. Consider that with only 300 yes-or-no decisions, there are already more possible combinations than atoms in the observable universe. Amazon’s network involves millions of such decisions, compounded by delivery windows that restrict when shipments can arrive or depart. State-of-the-art optimization software struggles to solve this problem, even with “perfect information”. In the real world, information is far from perfect, and a plan that looks optimal on paper can unravel when conditions change.

The challenge of handling uncertainty

Uncertainty shows up in two different ways. First, there are the day-to-day fluctuations in variables like demand or travel times. Second, there are the structural shocks: a weather-driven road closure or an unexpected facility shutdown. In academic work, a common strategy is to model many scenarios the network might face and then “robustify” the solution so that it performs well across them. But at Amazon’s scale, this approach founders. There is always a staggering number of things that can go wrong, and trying to robustify against each of them individually is a hopeless task. Instead of chasing an impossible guarantee, we shift to a more practical goal: optionality.
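The combinatorial claim about binary decisions is easy to verify with a little arithmetic; the atom count below is the commonly cited rough estimate of 10^80:

```python
# With n independent yes-or-no decisions, a network plan has 2**n
# possible configurations. At n = 300, that already exceeds the
# estimated ~10**80 atoms in the observable universe.
n_decisions = 300
configurations = 2 ** n_decisions
atoms_in_observable_universe = 10 ** 80

print(len(str(configurations)))                       # 91 digits (~2 x 10**90)
print(configurations > atoms_in_observable_universe)  # True
```

And this is the count for just 300 decisions, before any delivery-window constraints; the real network's millions of binary decisions make exhaustive search unthinkable.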
Our aim is to design a system with enough alternative routes and workable options that day-to-day fluctuations and shocks trigger effective adaptation rather than crises. In practice, our sought-after flexibility requires designing candidate networks with built-in options and stress-testing those designs against many plausible futures. That’s where Amazon’s in-house computational tools come in.

Making network design tractable

Amazon’s network design tool makes the middle-mile-network problem solvable at scale. It starts with a simple insight: not every possible route is worth considering. If you were planning a road trip, you would naturally focus on a handful of sensible routes. The tool applies this principle by identifying possible “consolidation points”, such as sort centers where packages from multiple origins can share trucks to common destinations, and then finding efficient routes that use them. We must also respect the clock, because Amazon facilities run on precise operational schedules. For example, a sort center might accept inbound shipments from 2:00 a.m. to 6:00 a.m. and dispatch outbound trucks from 8:00 a.m. to 12:00 noon. Ideally, planners would model these schedules at fine resolution (say, 15-minute intervals), but this creates another explosion of possibilities. On the other hand, a coarse resolution of, say, 24-hour intervals would make for fast but useless planning: packages would arrive after a facility has closed for the night, and trucks would be scheduled to depart before loading their cargo. Amazon planners overcame this stubborn problem while still supporting operational reality. The optimization approach solves at a fairly coarse time resolution, but for each candidate route, it includes precomputed “timing bounds” — the latest feasible truck departure and earliest feasible arrival — with 15-minute precision. That way, when the tool chooses routes, it’s choosing those that will work on real-world schedules.
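To make the timing-bounds idea concrete, here is a minimal sketch of how such bounds might be precomputed for a single route leg. The operating windows, drive time, and function names are hypothetical illustrations, not Amazon's actual implementation:

```python
from datetime import timedelta

SLOT = timedelta(minutes=15)  # planning precision

def floor_slot(td):
    """Snap a time offset down to the previous 15-minute slot."""
    return (td // SLOT) * SLOT

def ceil_slot(td):
    """Snap a time offset up to the next 15-minute slot."""
    return -(-td // SLOT) * SLOT

def timing_bounds(origin_dispatch, dest_receive, drive_time):
    """Latest feasible departure and earliest feasible arrival for one
    route leg. Windows are (open, close) offsets from midnight on day 1;
    all numbers here are invented for illustration."""
    latest_departure = min(origin_dispatch[1], dest_receive[1] - drive_time)
    earliest_arrival = max(dest_receive[0], origin_dispatch[0] + drive_time)
    feasible = (latest_departure >= origin_dispatch[0]
                and earliest_arrival <= dest_receive[1])
    # Snap conservatively: departures round down, arrivals round up.
    return floor_slot(latest_departure), ceil_slot(earliest_arrival), feasible

# Origin dispatches 8:00 a.m.-noon; the downstream sort center accepts
# inbound 2:00-6:00 a.m. the following day; the drive takes 16 hours.
dep, arr, ok = timing_bounds(
    origin_dispatch=(timedelta(hours=8), timedelta(hours=12)),
    dest_receive=(timedelta(hours=26), timedelta(hours=30)),
    drive_time=timedelta(hours=16),
)
print(dep, arr, ok)  # latest departure noon; earliest arrival 2:00 a.m. next day
```

With bounds like these attached to every candidate route, a coarse-resolution solver can pick only routes that fit real facility schedules, without modeling every 15-minute slot as a separate decision.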
Risk-aware network adjustments at scale

Even with these algorithmic advances, the solution to a single deterministic planning problem of Amazon’s scale can take hours to compute because of the difficulties of parallelizing the underlying algorithm. Adding uncertainty compounds the challenge. One naïve way to account for uncertainty on the middle-mile network would be to simplify the problem by assuming that more packages flow between locations that are large and close together. But the middle mile isn’t a set of independent pipes. Product flows interact. A spike in demand at one fulfillment center affects nearby facilities in particular ways; a new delivery station changes a region’s patterns. To better capture those complex dependencies, we developed an approach enabling risk-aware network design via Monte Carlo methods.

Amazon’s risk-aware network-design models start by creating many permutations of synthetic origin-destination flow data to represent both day-to-day fluctuations in demand and larger structural shocks. One critical component of the models is a graph attention network that represents the middle-mile network as two interconnected graphs. The first is a site graph whose nodes represent fulfillment centers and delivery stations, with edges representing both shipping routes and geographic proximity. This allows the model to learn spatial patterns, such as higher demand around dense population centers. The second graph works at a higher level: each node represents a specific origin-destination pair. This structure lets us see correlations too subtle for the site graph to capture. It is like understanding traffic patterns: knowing that two highways are close (site graph) doesn’t tell you whether they compete for the same commuters (origin-destination graph). To illustrate, consider two nearby fulfillment centers in northern Connecticut, both serving New York City.
A model using only the site graph might estimate that each facility sends 8,000 packages to NYC, when in reality the volumes are much lower because the two facilities share that demand. The site graph understands that the fulfillment centers are proximal, but it doesn’t fully capture that their flows to NYC are interdependent. The origin-destination graph solves this by representing each facility-to-destination pair as its own node, allowing the model to learn that when two similar facilities serve the same area, their shipments are interdependent. More broadly, this structure lets the model discover that origin-destination pairs with similar characteristics — such as suburban fulfillment centers delivering to urban areas — may exhibit correlated demand, even when they are far apart. Armed with demand scenarios that respect these spatial correlations, we can generate candidate network designs that work well under a variety of demand conditions.

Keeping delivery promises under uncertainty

Because the models train on historical shipping data, we can generate realistic demand scenarios that respect spatial correlations. And crucially, because the tools understand how network disturbances propagate across space, we can produce plausible scenarios the network has never encountered before, such as demand shifts driven by a new facility opening or a major regional weather event. That’s the missing half of the loop: one product designs candidate future networks, while another generates the scenarios to stress-test them. So instead of optimizing a single forecast, Amazon planners can evaluate their network designs across hundreds of plausible scenarios and preserve the options that keep the network flexible in the face of uncertainty.
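As a toy illustration of this loop, the sketch below generates correlated origin-destination demand via Monte Carlo sampling with a simple shared-shock model, then scores two candidate designs on average and worst-case cost. The facility names, shock sizes, and cost functions are invented for illustration; Amazon's actual scenario generator is the graph-attention model described above, not this stand-in:

```python
import random

def sample_scenarios(base_demand, region_of, n_scenarios, seed=7):
    """Monte Carlo demand scenarios in which origin-destination pairs in
    the same region share a common shock, so their volumes move together
    (a crude stand-in for learned spatial correlations)."""
    rng = random.Random(seed)
    scenarios = []
    for _ in range(n_scenarios):
        # One shared multiplier per region, plus per-pair day-to-day noise.
        regional = {r: rng.gauss(1.0, 0.10) for r in set(region_of.values())}
        scenarios.append({
            pair: max(0.0, mean * regional[region_of[pair]] * rng.gauss(1.0, 0.05))
            for pair, mean in base_demand.items()
        })
    return scenarios

def evaluate(cost_fn, scenarios):
    """Average and worst-case cost of one candidate design."""
    costs = [cost_fn(s) for s in scenarios]
    return sum(costs) / len(costs), max(costs)

# Two toy designs: "lean" is cheap until volume exceeds its capacity, then
# pays steep overflow penalties; "flexible" pays more per package but
# absorbs spikes through alternative routes.
def lean(scenario):
    total = sum(scenario.values())
    return total + max(0.0, total - 9000) * 5.0

def flexible(scenario):
    return sum(scenario.values()) * 1.25

base = {("CT1", "NYC"): 4000.0, ("CT2", "NYC"): 4000.0}
region = {("CT1", "NYC"): "northeast", ("CT2", "NYC"): "northeast"}
scenarios = sample_scenarios(base, region, n_scenarios=500)

print(evaluate(lean, scenarios))      # cheaper on average, fragile in spikes
print(evaluate(flexible, scenarios))  # costlier on average, steadier worst case
```

Because the two Connecticut facilities share a regional shock, their volumes rise and fall together, and the evaluation surfaces exactly the trade-off planners care about: the lean design wins on average cost, while the flexible design wins on worst-case cost.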
Overall, this enables us to distinguish between network designs that appear efficient on average but are fragile under stress and those that may incur slightly higher steady-state costs yet deliver more-stable performance. For customers, this research translates into more-reliable delivery promises, including during peak shopping periods and genuine disruptions. By combining advanced optimization techniques with machine learning, Amazon is building a middle-mile network designed to adapt to the world as it really is. So when a winter storm buries a region under two feet of snow on the same day a new must-have product goes viral, the network can absorb the shock and recover as quickly as conditions allow. But the work of building resilience against uncertainty is not finished. As the network grows, so does our commitment to advancing the computational tools that keep delivery promises reliable, day after day.
