In the bestseller "Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems," published by O'Reilly, Martin Kleppmann offers an insightful dive into the foundational principles of building robust data systems. This review gives a detailed summary of Chapter 1 and Chapter 2, offering a glimpse into the opening stages of Kleppmann's exploration. It barely scratches the surface of the book's comprehensive treatment of the subject, however; to truly grasp the breadth and depth of Kleppmann's expertise on data-intensive systems, we highly recommend acquiring a copy and immersing yourself in the entire book. The detailed summaries of Chapter 1 and Chapter 2 follow below.
- Chapter 1: Reliable, Scalable, and Maintainable Applications
- Chapter 2: Data Models and Query Languages
Chapter 1: Reliable, Scalable, and Maintainable Applications
The initial focus is on the three pillars of data systems:
- Reliability: Systems should persistently function correctly even when faced with challenges.
- Scalability: Systems should adapt and manage growth effectively.
- Maintainability: Over its lifecycle, diverse teams should be able to efficiently work with and adapt the system.
Typically, databases, caches, and other tools are perceived as distinct. But evolving technologies blur these boundaries. New tools offer multifaceted functionalities, and they no longer fit neatly into traditional categories. For instance, Redis functions as both a datastore and message queue, while Apache Kafka offers database-like durability. Modern applications often require combining multiple tools, creating a composite data system tailored to specific needs, with the application code acting as a bridge.
Designing these composite data systems involves addressing challenges related to data consistency, performance consistency, scalability, and API design. Although various factors influence a data system's design, the book emphasizes the aforementioned three pillars. These terms are often used without clear definitions, but understanding them is crucial for effective engineering. The subsequent chapters of the book will delve deeper into techniques and strategies to achieve these goals.
Reliability in Software Systems
Understanding Reliability:
- Reliability intuitively refers to the system "working correctly" and maintaining this correctness even when things go wrong.
- Disruptions in the system are termed "faults". Systems prepared for these disruptions are "fault-tolerant" or "resilient".
- It's crucial to differentiate between a "fault" (a system component deviating from its spec) and a "failure" (the entire system not delivering the expected service). The goal is to design mechanisms that prevent faults from becoming failures.
Deliberate Fault Induction:
- Counterintuitively, it can be beneficial to introduce faults deliberately so that the system's fault-tolerance mechanisms are continually exercised and tested. Netflix's Chaos Monkey is an example.
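As a rough illustration of this idea, here is a minimal Python sketch of deliberate fault injection: a wrapper that randomly fails a small fraction of calls so that retry logic is exercised continuously. It is only a sketch of the principle, not Netflix's actual Chaos Monkey; every name and rate below is invented for illustration.

```python
import random

class FaultInjectingClient:
    """Wraps a service call and randomly injects failures, loosely in the
    spirit of deliberate fault induction. Names and rates are illustrative."""

    def __init__(self, real_call, fault_rate=0.05):
        self.real_call = real_call
        self.fault_rate = fault_rate

    def __call__(self, *args, **kwargs):
        # Deliberately fail a small fraction of calls so that retry and
        # fallback paths are exercised continuously, not only during outages.
        if random.random() < self.fault_rate:
            raise ConnectionError("injected fault (testing fault tolerance)")
        return self.real_call(*args, **kwargs)

def fetch_profile(user_id):
    return {"id": user_id, "name": "example"}

flaky_fetch = FaultInjectingClient(fetch_profile, fault_rate=0.1)

def fetch_profile_with_retry(user_id, attempts=3):
    # The caller's fault-tolerance mechanism: retry on transient errors.
    for attempt in range(attempts):
        try:
            return flaky_fetch(user_id)
        except ConnectionError:
            if attempt == attempts - 1:
                raise

print(fetch_profile_with_retry(42))  # usually succeeds despite injected faults
```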
Hardware Faults:
- Hardware components like hard disks, RAM, etc., can fail. Such failures are frequent in large datacenters.
- Redundancy (like RAID configurations, dual power supplies) can reduce the failure rate, but can't completely eliminate it.
- With the rise in data and computation demands, there's a move towards systems that use software fault-tolerance techniques.
Software Errors:
- Hardware faults often appear random, but software errors are systematic and often more catastrophic.
- These can be caused by bugs that become apparent under specific conditions or scenarios.
- Solutions include thorough testing, process isolation, monitoring, and self-check mechanisms.
Human Errors:
- Despite best intentions, humans make errors. Studies indicate that human configuration errors are a significant cause of outages.
- To handle this:
- Design systems to reduce error opportunities.
- Decouple the places where people make the most mistakes from the places where mistakes can cause failures (for example, by providing sandbox environments).
- Allow for quick recovery from mistakes.
- Implement comprehensive monitoring.
- Focus on good management practices and training.
Importance of Reliability:
- Reliability isn't just for mission-critical applications. Even seemingly trivial applications must be reliable to prevent loss, frustration, or damage to reputation.
- While there are scenarios where reliability might be compromised to cut costs or for other reasons, it's essential to make such decisions consciously.
Bottom Line:
Reliability in software ensures that systems function correctly and consistently, even amidst faults. This requires understanding hardware and software errors, acknowledging human errors, and using deliberate mechanisms like redundancy, monitoring, and continuous testing. The importance of reliability spans from mission-critical systems to everyday applications, emphasizing the need to prioritize and invest in it.
Scalability:
- Scalability refers to a system's ability to handle increased load.
- It's not enough to label a system as scalable or not. Instead, the focus should be on how a system can cope with growth and what resources can be added to manage added load.
Describing Load:
- To understand scalability, one needs to first determine the current load on a system, which is typically represented using load parameters.
- These parameters can vary based on the system in question and can include metrics like requests per second to a server, the number of active users, hit rates on caches, etc.
Twitter as a Scalability Example:
- Two main Twitter operations are posting tweets and viewing home timelines.
- Initially, Twitter stored tweets in a global collection. For a user's home timeline, the system would fetch and merge tweets of everyone the user follows.
- Due to scalability issues, Twitter shifted to a caching approach, wherein each user’s home timeline was pre-computed and stored. When a tweet was posted, it was added to the home timelines of all the user’s followers.
- Although more efficient for reading, this second approach significantly increased the workload when writing tweets, especially for users with millions of followers.
- Consequently, Twitter adopted a hybrid system, keeping the cache approach for most users but reverting to the original method for celebrities with a vast number of followers.
The discussion underscores the importance of load parameters and a system's architecture when considering scalability, using Twitter as a case study.
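The trade-off can be sketched in a few lines of Python. This is a hypothetical in-memory model of the two strategies, not Twitter's implementation; all names and data structures are invented. Approach 1 does the work when a timeline is read; approach 2 ("fan-out on write") does it when a tweet is posted, which is exactly what becomes expensive for accounts with millions of followers.

```python
from collections import defaultdict, deque

follows = defaultdict(set)      # follower_id -> set of followee_ids
tweets = defaultdict(list)      # user_id -> tweets that user has posted
timelines = defaultdict(deque)  # user_id -> precomputed home timeline cache

def followers_of(user_id):
    return [f for f, followees in follows.items() if user_id in followees]

# Approach 1: do the work at read time (merge tweets of everyone followed).
def home_timeline_on_read(user_id):
    merged = []
    for followee in follows[user_id]:
        merged.extend(tweets[followee])
    return sorted(merged, key=lambda t: t["ts"], reverse=True)

# Approach 2: do the work at write time (fan out to every follower's cache).
def post_tweet_fan_out(user_id, text, ts):
    tweet = {"user": user_id, "text": text, "ts": ts}
    tweets[user_id].append(tweet)
    for follower in followers_of(user_id):   # expensive for celebrity accounts
        timelines[follower].appendleft(tweet)

follows["alice"].add("bob")                  # alice follows bob
post_tweet_fan_out("bob", "hello", ts=1)
print(list(timelines["alice"]))              # alice's cached home timeline
```

The hybrid scheme described above would keep the fan-out path for most users but skip it for very high-follower accounts, merging their tweets into timelines at read time instead.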
Describing Performance and Scalability
Understanding Load and Performance
- Performance analysis begins with understanding the load on your system.
- Two ways to look at it:
- How performance changes when load increases and system resources are kept unchanged.
- How much resources must be increased to keep performance unchanged when load increases.
Batch vs Online Systems
- In batch processing systems like Hadoop, throughput (records processed per second or total time to run a job) is crucial.
- In online systems, the service’s response time (time between client request and response) is vital.
Latency vs Response Time
- Though often used interchangeably, latency and response time differ.
- Response time includes the time to process the request, network delays, and queueing delays. Latency is the waiting time before a request is handled.
Understanding Response Times
- It can vary greatly, even with similar requests. Factors influencing response times include background processes, network issues, hardware interruptions, etc.
- Average response time is commonly reported but may not reflect typical user experience.
- Percentiles (median, p95, p99, etc.) are more informative. For example, a median response time of 200 ms indicates half the requests are faster and half are slower.
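A minimal sketch of how such percentiles might be computed over a window of recent response times follows; the nearest-rank method and the sample values are illustrative. In production one would typically use a streaming approximation (for example, t-digest or HdrHistogram) rather than sorting the full window on every report.

```python
def percentile(samples_ms, p):
    """Return the p-th percentile (0-100) of a list of response times in ms."""
    ordered = sorted(samples_ms)
    # Nearest-rank method: the sample at or above the requested percentile.
    rank = max(0, int(round(p / 100 * len(ordered))) - 1)
    return ordered[rank]

window = [12, 15, 200, 18, 22, 1500, 25, 30, 17, 19]  # last N response times (ms)
print("p50:", percentile(window, 50))   # median: roughly half the requests are faster
print("p95:", percentile(window, 95))
print("p99:", percentile(window, 99))   # tail latency, dominated by outliers
```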
Importance of High Percentiles
- High percentiles (tail latencies) affect the user experience.
- Major companies like Amazon focus on the 99.9th percentile because slower requests often come from valuable customers.
Percentiles in Practice
- Backend services affect end-user response times: when a single end-user request requires multiple backend calls, the slowest call determines the overall response time, so even a small percentage of slow backend calls hurts many user requests.
- To monitor response time percentiles, one must calculate them continuously using efficient algorithms.
Dealing with Increasing Load
- An architecture appropriate for one level of load is unlikely to cope with ten times that load, so growing systems often need to rethink their architecture.
- Options include scaling up (more powerful single machine) or scaling out (distributing the load across multiple machines).
- Some systems are elastic (automatic scaling with load), while others require manual intervention.
- Distributing stateless services is simpler than distributing stateful data systems.
- While many tools aid distributed systems, a one-size-fits-all solution doesn't exist. Architecture should be specific to application needs.
Architecture & Application Specificity
- An effective, scalable architecture is built around assumptions about which of the application's operations will be common and which will be rare (its load parameters).
- Though architectures are tailored to specific applications, they often use general-purpose building blocks in familiar patterns.
In essence, to handle increased loads, it's crucial to understand and measure performance accurately. This involves analyzing system behavior with varied loads and implementing the right architecture to cater to specific application needs.
Maintainability
- The majority of software costs come from maintenance, which includes tasks like bug fixes, system updates, and adapting to new platforms.
- Maintenance of legacy systems is often unpopular because of the challenges associated with fixing past mistakes, outdated platforms, or misused systems.
- To avoid becoming a burdensome legacy system, software should be designed for easier maintenance. Three key principles for this are:
- Operability: Simplifying tasks for operations teams.
- Simplicity: Removing complexity so new engineers can understand and work with the system.
- Evolvability: Designing the software to adapt to future changes and unexpected requirements.
Operability: Making Life Easy for Operations
- Good operations teams can often work around the limitations of bad (or incomplete) software, but even good software cannot run reliably with bad operations.
- Operations teams are crucial for tasks such as:
- Monitoring system health.
- Problem troubleshooting.
- Software and platform updates.
- Anticipating and preventing future problems.
- Complex maintenance tasks.
- Security maintenance.
- For optimal operability, systems should provide clear insights into their behavior, support automation, be independent of individual machines, and be predictable.
Simplicity: Managing Complexity
- As software projects grow, they often become more complex, which can hinder maintenance.
- Complexity can manifest in many ways, including tight module coupling, inconsistent terminology, and unintended workarounds.
- Reducing complexity improves software maintainability.
- Simplifying a system doesn't mean reducing functionality; it often involves removing "accidental complexity" or complexity arising solely from implementation.
- Abstraction, or hiding implementation details behind a clean façade, is an effective tool against complexity.
Evolvability: Making Change Easy
- Software requirements change frequently due to various factors such as user demands, platform changes, legal requirements, and system growth.
- Agile methodologies offer a framework for adapting to such changes, with techniques like test-driven development (TDD) and refactoring.
- The book seeks to address agility for larger data systems that comprise multiple applications or services.
- Evolvability in a data system, or its ease of modification, is closely tied to its simplicity and abstractions. Simple systems tend to be more adaptable.
In essence, to ensure software remains useful and cost-effective in the long run, it's vital to prioritize its operability, simplicity, and evolvability.
Chapter 2: Data Models and Query Languages
Data models are integral to software development, influencing not just the coding approach but also the conceptual understanding of the problem. They work in layers, each layer representing the data in terms of the layer immediately below it:
- Application Perspective: Developers look at the real-world entities (people, organizations, etc.) and create objects, data structures, and APIs specific to their application.
- General-Purpose Data Models: These objects and structures are then represented in widely-used data models like JSON, XML, relational databases, or graph models.
- Database Perspective: The creators of database software devise ways to represent these data models in terms of bytes, whether in memory, on disk, or over a network, facilitating various operations on the data.
- Hardware Representation: At a more fundamental level, these bytes are represented as electrical currents, light pulses, magnetic fields, etc.
Such layered abstractions hide the complexities of the underlying layers, fostering effective collaboration between various stakeholders like application developers and database engineers.
The variety in data models stems from their different intended usages. Some models make certain operations efficient while hindering others; some models naturally support certain transformations, while others make them cumbersome. Considering the effort required to master a single data model, and its significant impact on software capabilities, it’s crucial to select one that aligns with the application's needs.
This chapter delves into general-purpose data models like the relational model, document model, and several graph-based models. It also explores diverse query languages and their applications. The subsequent chapter will address how these models are technically implemented.
Relational Model Versus Document Model
- Relational Model: Rooted in the proposal made by Edgar Codd in 1970, data is organized into relations (tables in SQL). By the mid-1980s, relational database management systems (RDBMSes) and SQL had become the tools of choice for structured data storage and querying. Their adoption grew out of business data processing tasks such as transaction processing and batch processing. The key feature of the relational model was to abstract away data representation details, giving users a cleaner interface.
- Data Model Evolution: In the 1970s and 1980s, network and hierarchical data models were popular, but the relational model soon dominated. Various other models like object databases and XML databases emerged in the subsequent decades but found only niche adoption.
- NoSQL Emergence: From the 2010s, NoSQL emerged as the latest attempt to challenge the relational model's dominance. The term "NoSQL" was coined as a catchy hashtag and was later reinterpreted as "Not Only SQL". NoSQL's rise was due to:
- A need for greater scalability.
- Preference for free and open-source software.
- Support for specialized query operations that are not well supported by the relational model.
- Desire for a more dynamic and expressive data model.
- Object-Relational Mismatch: Modern application development often uses object-oriented programming. This leads to a challenge when interfacing with SQL databases, creating what's known as an "impedance mismatch". Object-relational mapping (ORM) frameworks aim to bridge this gap but can't completely mask the differences.
- Document-Oriented Databases: These databases like MongoDB and CouchDB support storing data as JSON documents. For structures like a résumé, a JSON representation is seen as appropriate due to its simplicity and better data locality. This representation also allows for more straightforward querying.
- Many-to-One and Many-to-Many Relationships: Storing data using IDs rather than duplicated text is beneficial for ensuring consistent representation and facilitating updates. However, this creates many-to-one relationships, which do not fit naturally into the document model because its support for joins is often weak. Over time, application features tend to make data more interconnected, straining the document model's fit.
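To make the contrast concrete, here is a small, hypothetical sketch of the same résumé-style record in both models. The field names are invented; the point is that the document groups one-to-many data together with good locality, while many-to-one references (region, industry) are stored as IDs that a relational database would resolve with joins.

```python
# Document model: one self-contained JSON-like document with good locality.
resume_document = {
    "user_id": 251,
    "name": "Jane Doe",
    "region_id": "us:91",          # ID instead of free text, so updates stay consistent
    "industry_id": 131,            # many-to-one reference shared across users
    "positions": [                 # one-to-many data nested directly in the document
        {"title": "Engineer", "organization": "ExampleCorp"},
    ],
}

# Relational model: the same data normalized into rows, joined by IDs.
users = [(251, "Jane Doe", "us:91", 131)]
positions = [(1, 251, "Engineer", "ExampleCorp")]   # (id, user_id, title, org)
regions = [("us:91", "Greater Seattle Area")]       # many-to-one targets live in
industries = [(131, "Technology")]                  # their own tables
```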
Relational vs. Document Databases
Data Model Differences:
- Document data models are favored for their schema flexibility, improved performance due to data locality, and resemblance to application data structures.
- Relational models excel in supporting joins and handling many-to-one and many-to-many relationships.
Simplifying Application Code:
- Document models are apt for data with a document-like structure.
- The document model can become complex if the application utilizes many-to-many relationships, necessitating emulated joins in the application code.
- The ideal data model depends on the interconnectedness of the data. For highly intertwined data, graph models are optimal.
Schema Flexibility:
- Document databases, often dubbed "schemaless", generally don't impose a rigid schema, though the application code assumes a structure, leading to an implicit schema-on-read.
- This flexibility is advantageous in handling heterogeneous data.
- In situations where uniformity in structure is expected, schemas serve as a useful mechanism.
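Schema-on-read can be sketched in a few lines: the database accepts documents with or without a given field, and the application fills the gap when it reads. The field names below are illustrative; the book uses a similar name-splitting example when discussing schema changes.

```python
def read_user(doc):
    user = dict(doc)
    # Older documents may only have "name"; newer ones also carry "first_name".
    if "first_name" not in user and "name" in user:
        user["first_name"] = user["name"].split(" ")[0]
    return user

old_doc = {"name": "Ada Lovelace"}                        # written before the change
new_doc = {"name": "Alan Turing", "first_name": "Alan"}   # written after the change
print(read_user(old_doc)["first_name"])   # "Ada"
print(read_user(new_doc)["first_name"])   # "Alan"
```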
Data Locality:
- Document databases store each document as a single contiguous blob, which is an advantage if the application frequently needs large parts of the document at once.
- For updates, documents often need to be completely rewritten.
- The concept of grouping related data for better access performance isn't exclusive to document databases.
Convergence of Data Models:
- Many relational databases have incorporated XML support and are increasingly adding JSON support.
- Document databases like RethinkDB are introducing relational-like features.
- Over time, both types of databases are adopting features from one another, signaling a move towards hybrid models in the future.
Query Languages for Data:
The text delves into the nature and differences between imperative and declarative query languages used in databases.
- SQL is a declarative query language introduced with the relational model. In contrast, systems like IMS and CODASYL used imperative code for querying.
- Imperative languages, like many programming languages, specify operations in a sequence. The text provides an example of how one might write a function to retrieve all sharks from a list using an imperative approach.
- On the other hand, relational algebra offers a more concise way of querying data, using operators such as selection (σ). SQL's definition closely followed this form of querying.
- The major difference between the two is that imperative languages specify the exact steps a computer should take, whereas declarative languages like SQL just describe the desired result pattern, leaving the methods used to achieve it up to the system. This makes declarative languages like SQL more concise, easier to work with, and open to performance optimizations without changing the queries (a sketch contrasting the two styles appears after this list).
- Additionally, declarative languages, due to their nature, can be parallelized more easily, making them efficient with multi-core processors.
- The utility of declarative query languages isn't restricted to databases. Their advantage is also evident in web development. For instance, CSS (a declarative language) offers a simpler way to style elements than using JavaScript (an imperative approach) to achieve the same.
- The discussion then shifts to MapReduce, a model popularized by Google for processing data across many machines. MongoDB's implementation of MapReduce is highlighted. In essence, MapReduce is a hybrid approach: it's neither entirely declarative nor imperative. It uses snippets of code to express query logic that are repeatedly executed by a framework.
- The final sections touch on the limitations of MapReduce and introduce MongoDB's aggregation pipeline, a declarative query language. This language is as expressive as a subset of SQL but adopts a JSON-based syntax. The takeaway is that even NoSQL systems might end up creating features similar to SQL in different forms.
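To illustrate the contrast, here is a Python rendering of the imperative approach described above, with an equivalent declarative SQL query shown as a comment. The data and function names are invented; the book's original imperative example is written in JavaScript.

```python
animals = [
    {"name": "Great White", "family": "Sharks"},
    {"name": "Bottlenose", "family": "Dolphins"},
    {"name": "Hammerhead", "family": "Sharks"},
]

# Imperative style: spell out exactly how to walk the list, in what order,
# and which records to keep.
def get_sharks(records):
    sharks = []
    for animal in records:          # the loop order and data structure are fixed here
        if animal["family"] == "Sharks":
            sharks.append(animal)
    return sharks

# Declarative style: state only the pattern of the desired result and let the
# system choose the access path, ordering, and any parallel execution, e.g.:
#   SELECT * FROM animals WHERE family = 'Sharks';
print(get_sharks(animals))
```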
Graph-Like Data Models
- Many-to-many relationships are what distinguish data models. If data mostly has one-to-many relationships or no relationships at all, the document model is the best fit. Where many-to-many relationships are prevalent and increasingly complex, a graph model is more appropriate.
- Graphs have vertices (nodes/entities) and edges (relationships/arcs). They can represent various data, such as:
- Social graphs: Vertices are people, edges show friendships.
- Web graphs: Vertices are web pages, edges are HTML links.
- Road or rail networks: Vertices are junctions, edges represent roads or railways.
- Algorithms like the shortest path and PageRank can operate on these graphs.
- Graphs can store different types of data, providing a consistent storage method for diverse objects. For instance, Facebook's graph has various vertices and edges representing people, locations, events, and more.
- There are different ways to structure and query data in graphs. Some notable models include the property graph model and the triple-store model, and query languages include Cypher, SPARQL, and Datalog.
- Property Graphs:
- Vertices have a unique identifier, outgoing and incoming edges, and properties (key-value pairs).
- Edges have a unique identifier, a starting vertex (tail), an ending vertex (head), a label to describe the relationship, and properties.
- Graph stores can be visualized as two relational tables: one for vertices and one for edges. Vertices can connect to any other vertex without any schema restrictions; a minimal sketch of this structure appears below.
- The Cypher Query Language:
- Cypher is designed for the Neo4j graph database.
- It is declarative and allows users to express complex relationships and queries with a simple syntax.
- A Cypher query can be used to insert data into a graph database and to query it. The language is efficient and can handle intricate searches within the graph.
In essence, the text delves deep into how graph-like data models function, especially when many-to-many relationships are dominant in the dataset. Such models offer flexibility and expressiveness for intricate data relationships, with tools like the Cypher query language enhancing its utility.
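Here is a minimal sketch of the property graph model as two "tables", one of vertices and one of edges, together with a traversal that follows WITHIN edges zero or more times (the idea behind the :WITHIN*0.. pattern discussed in the next section). The identifiers, labels, and property values are invented for illustration.

```python
vertices = {
    1: {"label": "Person",  "properties": {"name": "Lucy"}},
    2: {"label": "City",    "properties": {"name": "Moscow, Idaho"}},
    3: {"label": "State",   "properties": {"name": "Idaho"}},
    4: {"label": "Country", "properties": {"name": "United States"}},
}
edges = [
    # (tail vertex, head vertex, label, properties)
    (1, 2, "BORN_IN", {}),
    (2, 3, "WITHIN",  {}),
    (3, 4, "WITHIN",  {}),
]

def heads(tail, label):
    """Follow all outgoing edges with the given label from a vertex."""
    return [h for t, h, l, _ in edges if t == tail and l == label]

def within_transitive(vertex):
    """Follow WITHIN edges zero or more times from a starting vertex."""
    reachable, frontier = {vertex}, [vertex]
    while frontier:
        nxt = [h for v in frontier for h in heads(v, "WITHIN") if h not in reachable]
        reachable.update(nxt)
        frontier = nxt
    return reachable

# Where was vertex 1 born, including every enclosing region?
birth_places = {r for b in heads(1, "BORN_IN") for r in within_transitive(b)}
print(sorted(birth_places))   # [2, 3, 4]: the city, state, and country vertices
```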
Graph Queries in SQL
- It is possible to represent graph data in a relational database, as the book's Example 2-2 demonstrates.
- While SQL can be used to query graph data in a relational structure, it might be challenging. Typically, in a relational database, you know the exact joins in advance. But in a graph query, the number of joins (or edge traversals) can vary.
- In graph-based languages like Cypher, there are concise ways to represent these traversals. For example, :WITHIN*0.. means to "follow a WITHIN edge, zero or more times." This is similar to the * operator in regular expressions.
- SQL can emulate these graph traversals using recursive common table expressions, introduced in SQL:1999. This is supported in databases like PostgreSQL, IBM DB2, Oracle, and SQL Server. However, the syntax in SQL is more cumbersome compared to Cypher.
- An example provided demonstrates how to use SQL's recursive common table expressions to achieve the same outcome as a Cypher query. In this example, the goal is to find people born in the US and living in Europe.
- The SQL version requires significantly more lines of code than its Cypher counterpart, highlighting the difference in expressiveness between the two query languages.
- Such differences underscore the importance of choosing the right data model for specific application needs. Different data models cater to various use cases, and what is concise in one language may be verbose in another.
Triple-Stores and SPARQL:
- Triple-store and property graph models are largely similar, but with different terminologies.
- Information in a triple-store is kept as three-part statements: (subject, predicate, object). For instance, (Jim, likes, bananas).
- The subject of a triple corresponds to a vertex in a graph.
- The object can be a value like a string or a number. Here, the predicate and object represent the key and value of a property on the subject vertex.
- Alternatively, the object can be another vertex in the graph, making the predicate an edge connecting two vertices; a small sketch of both cases appears after this list.
- Data can be represented in Turtle format, a subset of Notation3 (N3).
- The semantic web concept relates to making web information machine-readable. The triple-store data model is independent of this, but they are often linked.
- RDF (Resource Description Framework) is a format for publishing consistent web data. The Turtle language is a human-readable format for RDF data. RDF/XML is a more verbose alternative.
- SPARQL is a query language tailored for triple-stores using the RDF data model. It is efficient and powerful for application use.
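A triple-store can be sketched as nothing more than a list of (subject, predicate, object) tuples plus a small matching function. Only the (Jim, likes, bananas) triple comes from the text above; the other facts and the match helper are invented for illustration.

```python
triples = [
    ("Jim", "likes", "bananas"),       # object is a plain value (a property)
    ("Jim", "friend_of", "Alice"),     # object is another vertex (an edge)
    ("Alice", "lives_in", "Idaho"),
]

def match(subject=None, predicate=None, obj=None):
    """Return all triples matching the fields that are not None."""
    return [
        (s, p, o) for s, p, o in triples
        if (subject is None or s == subject)
        and (predicate is None or p == predicate)
        and (obj is None or o == obj)
    ]

print(match(subject="Jim"))             # everything known about Jim
print(match(predicate="friend_of"))     # all friendship edges
```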
Graph Databases vs. Network Model:
- CODASYL's network model, which seems similar to the graph model, differs in several ways:
- CODASYL had a rigid schema. Graph databases allow any vertex to connect to any other.
- To reach a specific record in CODASYL, you needed to traverse specific access paths. In graph databases, any vertex can be directly referred to by its unique ID or through an index.
- CODASYL maintained an order of child records, whereas graph databases don't.
- CODASYL had imperative queries, but graph databases support both high-level, declarative languages and imperative code.
Conclusion
Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems offers a comprehensive look into the world of data systems, as the summaries of its first two chapters demonstrate. However, this article only scratches the surface of what the book has to offer. From delving deep into Storage and Retrieval to envisioning The Future of Data Systems, chapters 3 to 12 promise to be a treasure trove of information and expertise. The book, already a bestseller, is a must-have for anyone serious about understanding and harnessing the potential of data systems. I wholeheartedly recommend investing in it; it is worth every penny and promises to elevate your knowledge in this domain.