Building and Operating a Pretty Big Storage System Called S3 (Video)
Andy Warfield, a Distinguished Engineer at AWS, presents a keynote talk at FAST '23 reflecting on his five years working on Amazon S3. The talk explores S3's remarkable scale: 280 trillion objects, 100 million requests per second, 400 terabits/second peak bandwidth, and operations across 31 regions with 99 availability zones. Warfield describes how his first day at Amazon involved a six-hour whiteboard session learning S3's architecture—a system composed of more than 300 microservices that strongly exhibits Conway's Law, where the software architecture mirrors the organizational structure.
The discussion of the hard drive fleet reveals fascinating engineering challenges. Although modern SSDs handle most common storage problems, hard drives remain essential at S3's massive scale. Warfield offers a plane-flying-over-grass analogy: if a hard drive head were a 747, it would fly two sheets of paper above the grass at 75 mph, with tracks 4.6 blades of grass wide and bits one blade wide, misreading only about one blade per 10^15 reads. Drive capacities have grown 7.2-million-fold since 1956 while seek times have improved only about 150-fold, so S3 must carefully manage "heat" (IOPS bottlenecks) through intelligent data placement. By spreading customer data across potentially millions of drives, S3 enables massive burst parallelism: a single bucket's data might span 19,000 drives to meet peak IOPS demand, even though its capacity alone would fit on 143 drives.
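The capacity-versus-heat tradeoff comes down to back-of-envelope arithmetic. The per-drive figures below are illustrative assumptions (not S3 internals), chosen so the example reproduces the 143-versus-19,000 drive counts from the talk:

```python
import math

def drives_for_capacity(dataset_tb: float, drive_tb: float) -> int:
    """Drives needed just to hold the bytes."""
    return math.ceil(dataset_tb / drive_tb)

def drives_for_iops(peak_iops: float, drive_iops: float) -> int:
    """Drives needed to serve the peak request rate in parallel."""
    return math.ceil(peak_iops / drive_iops)

# Assumed figures: a 2.86 PB bucket on 20 TB drives, with a
# 2.28M IOPS burst against roughly 120 IOPS per spindle.
print(drives_for_capacity(2860, 20))    # drives for the bytes: 143
print(drives_for_iops(2_280_000, 120))  # drives for the heat: 19000
```

The two orders of magnitude between the answers is exactly why spreading data wide, rather than packing it dense, is the dominant placement concern.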
Durability at S3 extends far beyond the famous "11 nines" metric, which is specifically a hardware failure model: annual failure rates are combined with repair rates to choose appropriate erasure encodings. The organization also runs durability reviews modeled after security threat models, in which teams must document what could go wrong and how each risk is mitigated. For Shard Store, S3's Rust-based single-host object store, the team uses formal verification via a lightweight executable Rust model (about 1% the size of the full implementation) that runs property-based tests in CI pipelines. Deployment leverages availability-zone redundancy with over-encoding, allowing gradual rollout with months of soak time before full production exposure.
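As a rough illustration of the kind of hardware failure model behind the "11 nines" figure, the sketch below estimates the annual loss probability for a hypothetical k-of-n erasure code with a union-bound approximation; the AFR, repair window, and code parameters are all assumptions for illustration, not S3's actual values:

```python
import math

def annual_loss_probability(n: int, k: int, afr: float,
                            repair_hours: float) -> float:
    """Rough estimate of losing a k-of-n erasure-coded object in one
    year, given a per-drive annual failure rate (afr) and the time to
    repair a failed shard. Assumes independent drive failures."""
    # Chance a given shard's drive fails within one repair window.
    p = afr * repair_hours / 8760.0
    # An object is lost only if more than n - k shards fail concurrently.
    window_loss = sum(
        math.comb(n, j) * p**j * (1 - p) ** (n - j)
        for j in range(n - k + 1, n + 1)
    )
    # Union bound over the repair windows in a year.
    return window_loss * (8760.0 / repair_hours)

# e.g. a hypothetical 6-of-9 code, 1% AFR, 24-hour repairs:
print(annual_loss_probability(9, 6, 0.01, 24))
```

Even this toy model shows why the metric pairs failure rates with repair rates: halving the repair window shrinks the chance of enough overlapping failures far faster than halving the AFR does.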
The talk concludes with reflections on scaling oneself through ownership. Warfield draws parallels between his experience as a professor and Amazon's culture: successful projects require teams to genuinely own their work rather than follow directives handed down from above. He recalls a colleague who gave students a list of five problems to choose from, and the students never succeeded because "it wasn't their idea." At Amazon, he learned to ask questions rather than provide answers, enabling teams to own their decisions. This ownership model also enabled him to continue research through Amazon's Scholars Program, which brings sabbatical faculty in to work with product teams on innovative projects.
Key Concepts
- Conway's Law in Practice: S3's 300+ microservices directly reflect its organizational structure, with each box in the architecture diagram representing a distinct team.
- Heat Management: Managing IOPS bottlenecks through intentional data spreading—customer data is distributed across as many drives as possible to enable burst parallelism and prevent hotspots.
- Workload Aggregation Flattening: Multi-tenant aggregation causes individual spiky workloads to flatten into smoother, more predictable demand curves on the storage fleet.
- 11 Nines as Hardware Model: S3's durability guarantee is specifically about hardware failures, combining AFR with repair rates to select appropriate erasure codes, not a comprehensive data protection guarantee.
- Durability Reviews: Threat model-style documentation requiring teams to enumerate failure modes and mitigations before making durability-impacting changes.
- Shard Store: S3's Rust-based single-host object store using log-structured merges and soft updates with formal verification through executable Rust specifications.
- Ownership Culture: Teams must genuinely own their systems to be passionate about operating them—you can't mandate enthusiasm, only enable it through autonomy and empowerment.
- Customer Durability Features: Features like object versioning, object lock, and cross-region replication with account isolation address durability concerns beyond hardware failures, particularly operator errors.
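The workload-aggregation flattening described above can be demonstrated with a small simulation; the burst sizes and rates are arbitrary assumptions chosen only to show the shape of the effect, not measured S3 behavior:

```python
import random

def peak_to_mean(num_tenants: int, steps: int = 1000,
                 seed: int = 7) -> float:
    """Simulate tenants that idle at 1 IOPS but burst to 100 IOPS
    about 2% of the time; return the peak-to-mean ratio of the
    aggregate load across all tenants."""
    rng = random.Random(seed)
    totals = [0.0] * steps
    for _ in range(num_tenants):
        for t in range(steps):
            totals[t] += 100.0 if rng.random() < 0.02 else 1.0
    mean = sum(totals) / steps
    return max(totals) / mean

# A single spiky tenant has a far higher peak-to-mean ratio than
# the aggregate of a thousand such tenants.
print(peak_to_mean(1), peak_to_mean(1000))
```

Because independent bursts rarely coincide, the aggregate curve the storage fleet actually sees is much flatter than any individual tenant's, which is what makes provisioning for the multi-tenant whole tractable.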