S3 Deep Dive: Building and Operating for Resiliency
This AWS re:Invent session features Amy Therrien (Director of Engineering) and Seth Markle (Senior Principal Engineer) from Amazon S3, who together have over two decades of experience building the service. They take the audience behind the scenes of S3's internal architecture, focusing on how the team approaches resiliency through systematic threat modeling. The talk reveals that S3 has grown from initial projections of gigabytes to now storing over 350 trillion objects across more than 350 microservices per region, serving over a petabyte per second of bandwidth.
The presenters explain how S3 evolved from reactive firefighting in its early days to a proactive threat modeling culture in which every feature goes through rigorous documentation and review. Threats are categorized by likelihood and area of impact, ranging from a single object version to all objects for all customers in a region. They emphasize that mitigations must be chosen carefully, since a mitigation can itself introduce new threats, so the model is refined iteratively until the system is robust.
A significant portion of the talk covers how customers can leverage S3's scale for their applications. Key techniques include multi-part uploads for parallelizing writes (roughly 5X throughput from splitting a 100 MB object into 20 MB parts), range GETs for parallelizing reads, and multi-value DNS answers to spread requests across S3's massive front-end fleet. The AWS Common Runtime (CRT) library encapsulates these best practices and is included in AWS SDKs, CLIs, and tools like Mountpoint for Amazon S3.
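The 5X figure follows from simple part arithmetic. As a minimal sketch (the helper below is illustrative, not an SDK API), splitting an object into fixed-size byte ranges yields the parts that can be uploaded, or fetched with HTTP Range headers, in parallel:

```python
def split_into_ranges(total_size: int, part_size: int) -> list[tuple[int, int]]:
    """Return inclusive (start, end) byte ranges covering total_size bytes."""
    ranges = []
    start = 0
    while start < total_size:
        end = min(start + part_size, total_size) - 1
        ranges.append((start, end))
        start = end + 1
    return ranges

MB = 1024 * 1024
parts = split_into_ranges(100 * MB, 20 * MB)
print(len(parts))                             # 5 parts moved in parallel
print(f"bytes={parts[0][0]}-{parts[0][1]}")   # Range header for part 1
```

With the CRT, this splitting, the parallel transfers, and the retries happen automatically; the sketch only shows why five 20 MB parts can saturate five connections instead of one.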
The session dives deep into S3's durability architecture, explaining how the 11 nines (99.999999999%) durability design is achieved through end-to-end checksumming (4 billion checksums calculated per second), erasure coding across storage devices, and constant background auditing. Seth explains that durability is influenced by three factors: drives (hardware failure rates), zones (physical facilities), and people (software bugs and operator errors). For standard storage classes, data is automatically spread across multiple availability zones from the moment a PUT request receives a 200 response, providing protection against the loss of an entire zone.
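The recovery property erasure coding provides can be illustrated with a toy single-parity XOR code. This is only a sketch of how a lost shard is rebuilt from survivors; S3's actual code parameters and shard placement are far more sophisticated and are not detailed in the talk:

```python
from functools import reduce

def xor_shards(shards: list[bytes]) -> bytes:
    """XOR equal-length shards together byte-wise."""
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), shards)

def encode(data: bytes, k: int) -> list[bytes]:
    """Split data into k equal data shards plus one XOR parity shard."""
    size = len(data) // k  # assume len(data) is a multiple of k (pad in practice)
    shards = [data[i * size:(i + 1) * size] for i in range(k)]
    return shards + [xor_shards(shards)]

def recover(shards: list, lost: int) -> bytes:
    """Rebuild the shard at index `lost` by XOR-ing all surviving shards."""
    return xor_shards([s for i, s in enumerate(shards) if i != lost])
```

A production code tolerates multiple simultaneous failures and repairs shards in parallel across the fleet, which is what keeps the re-replication window short enough for the durability math to hold.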
Key Concepts
- Threat Modeling: S3's proactive culture of documenting every potential threat with its likelihood, area of impact, and mitigations before building a feature; the team aims for an exhaustive threat list before designing solutions.
- 11 Nines Durability: A mathematical model based on observed drive failure rates and re-replication speeds, designed for 99.999999999% durability and achieved through erasure coding, checksumming, and continuous monitoring.
- End-to-End Checksumming: Data is checksummed throughout the entire request pipeline, and every transformation is reversed and re-verified against the original checksum before S3 returns a 200, ensuring stored data matches what was received.
- Erasure Coding: Data is divided into shards spread across multiple storage devices so objects survive drive failures, with re-replication parallelized across hundreds or thousands of drives.
- Prefix Partitioning: S3's index partitions data lexicographically by key name, supporting 3,500 PUT and 5,500 GET requests per second per prefix. Keep high-cardinality characters on the left of the key and dates on the right to scale across prefixes.
- Multi-Part Upload: Large objects are divided into parts uploaded in parallel for roughly 5X throughput, with resilience against partial failures since only the failed parts need re-uploading.
- Common Runtime (CRT): Open-source AWS library encapsulating best practices for multi-part uploads, range GETs, multi-value DNS, checksums, and retry algorithms; included in AWS SDKs and Mountpoint for Amazon S3.
- Zonal Replication: Standard storage classes spread data across multiple availability zones on every PUT, so availability continues through an AZ failure without customers noticing.
- S3 Express One Zone: A high-performance storage class that trades zone redundancy for lower latency by keeping data in a single AZ; customers must plan their own disaster recovery.
- Guardrails and Shadow Mode: S3 runs new code paths alongside old ones in shadow mode against billions of requests before enabling them, assuming incorrectness even after building for correctness as defense in depth.
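The end-to-end checksumming concept above can be sketched as a write path that verifies its own transformations before acknowledging. The helper name and the trivial "transformation" are hypothetical stand-ins, not S3 internals:

```python
import zlib

def put_with_verification(payload: bytes) -> int:
    """Checksum on receipt, transform, reverse, re-verify, then acknowledge."""
    received_crc = zlib.crc32(payload)           # checksum on receipt
    half = len(payload) // 2
    shards = [payload[:half], payload[half:]]    # toy stand-in for erasure coding
    reassembled = b"".join(shards)               # reverse the transformation
    if zlib.crc32(reassembled) != received_crc:  # re-verify before responding
        raise IOError("checksum mismatch; withhold the 200")
    return 200
```

The key point from the talk is ordering: the 200 is only returned after the stored form has been reversed and checked against the checksum computed when the bytes arrived.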
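The prefix-partitioning advice can be made concrete with a hypothetical key scheme; the function and fan-out below are illustrative, and the per-prefix rates are the figures quoted in the talk:

```python
import hashlib

def partitioned_key(customer_id: str, date: str, name: str) -> str:
    """Prepend a short hash so hot date ranges don't pile onto one prefix."""
    fanout = hashlib.md5(customer_id.encode()).hexdigest()[:4]
    return f"{fanout}/{customer_id}/{date}/{name}"

# Aggregate request rate scales linearly with the number of prefixes.
PUTS_PER_PREFIX, GETS_PER_PREFIX = 3_500, 5_500
n_prefixes = 16  # e.g. one hex character of fan-out yields 16 prefixes
print(n_prefixes * PUTS_PER_PREFIX, n_prefixes * GETS_PER_PREFIX)  # 56000 88000
```

Putting the high-cardinality hash on the left lets S3's lexicographic index split load across partitions, while the date on the right keeps keys listable within a customer.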