Finding a needle in Haystack: Facebook’s photo storage

Facebook's Haystack is a custom object storage system designed to handle the massive scale of their photo application, which stores billions of photos and serves over one million images per second at peak. The paper describes the limitations of their previous NFS-based architecture, where the "long tail" of requests for less popular photos caused excessive disk I/O due to filesystem metadata lookups. Traditional POSIX filesystems incur multiple disk operations just to locate a file (directory lookups, inode reads) before reading the actual data, creating a significant bottleneck for read throughput.

To address this, Haystack was designed with the primary goal of reducing the per-photo metadata so drastically that all of it could be kept in main memory. By doing so, the system ensures that reading a photo requires at most one physical disk operation: reading the needle that holds the photo. Haystack achieves this by aggregating millions of photos into large files called physical volumes, rather than managing each photo as a distinct file on the filesystem. Each photo is stored as a "needle" within these volumes, containing its own header and data.
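To make the needle idea concrete, here is a minimal Python sketch of packing a photo into a self-describing record and reading it back. The field order, field widths, the `MAGIC` value, and the function names are assumptions chosen for illustration, not the layout the paper specifies.

```python
import struct
import zlib

# Hypothetical, simplified layout (big-endian, no padding):
# magic (4 bytes), key (8), alternate key (4), flags (1), data size (4).
HEADER = struct.Struct(">IQIBI")
FOOTER = struct.Struct(">I")      # CRC32 of the photo data
MAGIC = 0x4E45444C                # arbitrary marker chosen for this sketch


def encode_needle(key, alt_key, data, flags=0):
    """Serialize one photo as header + data + checksum."""
    header = HEADER.pack(MAGIC, key, alt_key, flags, len(data))
    return header + data + FOOTER.pack(zlib.crc32(data))


def decode_needle(buf):
    """Parse a needle and verify its integrity; returns (key, alt_key, flags, data)."""
    magic, key, alt_key, flags, size = HEADER.unpack_from(buf, 0)
    if magic != MAGIC:
        raise ValueError("not a needle: bad magic number")
    start = HEADER.size
    data = buf[start:start + size]
    if len(data) != size:
        raise ValueError("truncated needle")
    (checksum,) = FOOTER.unpack_from(buf, start + size)
    if zlib.crc32(data) != checksum:
        raise ValueError("checksum mismatch")
    return key, alt_key, flags, data
```

Because each record carries its own size and checksum, a volume file can be treated as a plain concatenation of needles, with no per-photo filesystem entry.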

The system architecture consists of three main components: the Haystack Store (persistent storage), the Haystack Directory (metadata and mapping management), and the Haystack Cache (an internal cache that shelters the Store from requests for the most popular photos). The Directory maintains the mapping between logical volumes and physical volumes across different store machines. The Store machines manage the actual data on disk and keep an in-memory mapping of photo IDs to their physical location (file offset and size). This design allows Facebook to serve the long tail of photo requests efficiently and cost-effectively, providing high throughput without the need for expensive, commercially available NAS appliances.
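The Directory's mapping role can be sketched as a small lookup table. This is a toy model under stated assumptions: the class and method names, the machine names, and the random choice of a replica on reads are all hypothetical, and a real Directory would be a replicated service rather than an in-process dictionary.

```python
import random


class Directory:
    """Toy model: logical volume id -> store machines holding a physical copy."""

    def __init__(self):
        self.volumes = {}

    def add_volume(self, logical_id, machines):
        # Each listed machine is assumed to hold one physical volume
        # backing this logical volume.
        self.volumes[logical_id] = list(machines)

    def replicas(self, logical_id):
        """All machines that must receive a write to this logical volume."""
        return list(self.volumes[logical_id])

    def locate(self, logical_id, rng=random):
        """Pick one machine to serve a read (random choice is an assumption)."""
        return rng.choice(self.volumes[logical_id])
```

A caller would resolve a photo's logical volume through `locate` and then ask the returned machine for the needle, so the Directory itself never touches photo data.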

Key Concepts

Metadata Overhead in Traditional Filesystems: The authors identify that using standard filesystems (like those on NAS appliances) for billions of small files results in multiple disk operations per read just to resolve metadata (directories, inodes), which becomes the primary bottleneck.
The Long Tail Problem: While CDNs effectively serve popular, recently uploaded content, social media traffic is characterized by a "long tail" of requests for older content. These requests often miss the CDN and hit the backing storage, making efficient random access critical.
One Disk Operation Per Read: Haystack's core design philosophy is to minimize disk activity. By keeping all necessary metadata to locate a photo in RAM, the system guarantees that fetching a photo requires only a single disk seek and read.
Needle-based Storage: Instead of one file per photo, Haystack stores many photos in a single large "volume file." Each photo is encapsulated in a "needle" structure that includes its own flags, size, and integrity checks, simplifying management and reducing overhead.
In-Memory Metadata: Store machines maintain a lightweight in-memory map of (Key, Alternate Key) to (Offset, Size). This allows them to jump directly to the data on disk without traversing on-disk filesystem structures.
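
The in-memory map and the single-seek read path can be sketched together as follows. This is a simplified illustration, not the Store's actual implementation: `VolumeStore` and its methods are hypothetical names, the sketch stores raw photo bytes without a needle header, and it does not rebuild the index when reopening an existing file.

```python
import os


class VolumeStore:
    """One append-only volume file plus an in-memory index of its photos."""

    def __init__(self, path):
        # "a+b": every write is appended; reads may seek anywhere.
        self.file = open(path, "a+b")
        self.index = {}   # (key, alt_key) -> (offset, size)

    def put(self, key, alt_key, data):
        """Append the photo and remember where it landed."""
        self.file.seek(0, os.SEEK_END)
        offset = self.file.tell()
        self.file.write(data)
        self.file.flush()
        self.index[(key, alt_key)] = (offset, len(data))
        return offset

    def get(self, key, alt_key):
        """Look up the location in RAM, then do one seek and one read."""
        offset, size = self.index[(key, alt_key)]
        self.file.seek(offset)
        return self.file.read(size)

    def close(self):
        self.file.close()
```

The lookup in `get` touches only the dictionary before going to disk, which is the property the design relies on: no directory or inode traversal stands between the request and the photo bytes.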