Data Modeling
Data modeling in system design interviews is less about creating a perfectly normalized schema and more about designing a structure that supports the system's requirements. It involves defining entities, how they are identified, and how they relate to one another. The goal is to produce a model that is clear, functional, and capable of scaling, rather than one that is academically perfect.
The process usually begins with identifying core entities during requirements gathering and then sketching a basic schema during the high-level design phase. Key decisions are driven by three main factors: data volume, access patterns, and consistency requirements. For instance, access patterns dictate how data is queried and thus how it should be indexed or sharded, while consistency needs determine if data must reside in a single ACID-compliant database or can be distributed with eventual consistency.
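As a concrete illustration of access patterns driving schema and index design, here is a minimal sketch using SQLite and a hypothetical feed service whose dominant query is "fetch the most recent posts for a given user". The table and column names are invented for the example; the point is that the composite index mirrors the query's filter and sort order.

```python
import sqlite3

# Hypothetical feed service: the primary access pattern is
# "fetch the most recent posts for a given user".
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE posts (
        id         INTEGER PRIMARY KEY,
        user_id    INTEGER NOT NULL,
        body       TEXT    NOT NULL,
        created_at TEXT    NOT NULL
    );
    -- Composite index shaped by the access pattern: equality filter on
    -- user_id, then sort on created_at. The query below becomes an
    -- index range scan instead of a full table scan.
    CREATE INDEX idx_posts_user_recent ON posts (user_id, created_at DESC);
""")

conn.execute("INSERT INTO posts (user_id, body, created_at) VALUES (1, 'hi', '2024-01-01')")
conn.execute("INSERT INTO posts (user_id, body, created_at) VALUES (1, 'again', '2024-01-02')")

rows = conn.execute(
    "SELECT body FROM posts WHERE user_id = ? ORDER BY created_at DESC LIMIT 10",
    (1,),
).fetchall()
print([r[0] for r in rows])  # most recent first
```

If the dominant query changed (say, "all posts in a time window across users"), a different index, or a different schema entirely, would be the right answer; the schema follows the queries, not the reverse.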
Choosing the right database technology is a foundational step. While specialized databases like Graph or Wide-Column stores have their place, Relational Databases (SQL) cover the needs of most interview problems and are the recommended default. Alternatives like Document Databases or Key-Value stores should be chosen only when specific requirements, such as rapidly evolving schemas or simple key-based lookups, clearly demand them. Ultimately, a good data model ties schema choices directly back to the system's functional and non-functional requirements.
Key Concepts
- Relational Databases (SQL): The default choice for most system design interviews. They enforce structured schemas and relationships through foreign keys, providing ACID guarantees which are essential for many applications.
- Access Patterns: The most critical factor driving schema design. You should design your schema and indexes specifically to support the queries your API endpoints will need to run efficiently.
- Normalization vs. Denormalization: Start with a normalized model to ensure data consistency. Only denormalize (duplicate data) when necessary for read performance, and be aware of the trade-offs regarding update complexity and consistency.
- Indexing: Data structures that improve query speed. In an interview, explicitly naming which columns to index (e.g., foreign keys, frequently filtered columns) demonstrates an understanding of real-world performance.
- Sharding: The strategy of partitioning data across multiple machines when it grows too large for a single node. The choice of a "shard key" should be based on the system's primary access pattern to minimize expensive cross-shard queries.
- Document Databases: Store data as flexible, JSON-like documents. They are useful for data with varying structures or deeply nested relationships but often require handling integrity and joins in the application logic.
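The normalization trade-off above can be made concrete with a small sketch. This is a hypothetical example using plain dictionaries: an `orders` table denormalizes the customer's name so reads skip a join, at the cost of a fan-out update whenever the name changes.

```python
# Normalized source of truth: one row per customer.
customers = {1: {"name": "Ada"}}

# Denormalized copy of customer_name on each order: reads need no join,
# but every copy must be kept in sync with the source of truth.
orders = [
    {"order_id": 100, "customer_id": 1, "customer_name": "Ada"},
    {"order_id": 101, "customer_id": 1, "customer_name": "Ada"},
]

def rename_customer(customer_id: int, new_name: str) -> None:
    customers[customer_id]["name"] = new_name
    # The cost of denormalization: a fan-out write touching every
    # order row that duplicated the name.
    for order in orders:
        if order["customer_id"] == customer_id:
            order["customer_name"] = new_name

rename_customer(1, "Ada L.")
```

In a real system the fan-out write would be a batch update or an asynchronous job, which is exactly where the consistency trade-off bites: until it completes, some copies are stale.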
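The shard-key idea can likewise be sketched in a few lines. This is a minimal hash-based routing example, assuming `user_id` is the shard key because the dominant access pattern is "fetch one user's data"; the shard count and function names are invented for illustration.

```python
import hashlib

NUM_SHARDS = 4

def shard_for(user_id: int) -> int:
    # Stable hash so the same key always routes to the same shard,
    # regardless of which application server computes it.
    digest = hashlib.sha256(str(user_id).encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

# All rows for one user land on a single shard, so the primary access
# pattern touches one node. A query that is NOT keyed by user_id
# (e.g., "all orders placed today") must scatter-gather across all
# shards, which is exactly the expensive case to design around.
print(shard_for(42) == shard_for(42))  # deterministic: True
```

Note that naive modulo routing reshuffles most keys when `NUM_SHARDS` changes; consistent hashing is the usual refinement when resharding matters.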