DDIA | Chapter 1 | Reliable, Scalable, And Maintainable Applications – Reliability

Data-intensive applications are applications whose main challenge is the data itself rather than raw CPU power. The typical problems are the amount of data, its complexity, the speed at which it changes, and the effort of maintaining it.

Building blocks of data intensive applications – 

  1. Databases – store/retrieve data
  2. Caches – remember data to speed up things
  3. Search indexes – searching using keywords
  4. Stream processing – asynchronous messaging
  5. Batch processing – processing chunks of data together

As we can see, many different types of systems handle data. Even within one category, such as caches, there are several approaches. Each has its own characteristics, advantages, and disadvantages.

So what is a data system?

Imagine we have a database, a message queue, and a search index, plus an application/API that talks to all of them. The application hides the complexity of handling the data and makes sure the data remains consistent across the three systems. This combination of common components and application code is the data system. It has been designed to serve one specific purpose.

To summarize, a data system is a combination of general-purpose systems that handle data and an application that ties them together to provide a specific service.
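As a rough illustration of the idea above, the sketch below puts three in-memory stand-ins (a dict as the database, a deque as the message queue, and a keyword map as the search index) behind one application class. Every name here (`ArticleService`, `save`, `process_pending`, `search`) is hypothetical and chosen only for illustration. A real system would use actual database, queue, and search products, and keeping them consistent is much harder than this.

```python
from collections import defaultdict, deque


class ArticleService:
    """Hypothetical application layer that hides three data components."""

    def __init__(self):
        self._database = {}                    # stand-in for a real database
        self._queue = deque()                  # stand-in for a message queue
        self._search_index = defaultdict(set)  # keyword -> set of article ids

    def save(self, article_id, text):
        old_text = self._database.get(article_id)
        # The database is the source of truth, so it is written first.
        self._database[article_id] = text
        # The index update is queued and handled later, not done inline.
        self._queue.append((article_id, old_text, text))

    def process_pending(self):
        """Drain the queue and bring the search index in line with the database."""
        while self._queue:
            article_id, old_text, new_text = self._queue.popleft()
            if old_text is not None:
                # Remove index entries that belonged to the previous version.
                for word in set(old_text.lower().split()):
                    self._search_index[word].discard(article_id)
            for word in set(new_text.lower().split()):
                self._search_index[word].add(article_id)

    def get(self, article_id):
        return self._database.get(article_id)

    def search(self, keyword):
        return sorted(self._search_index.get(keyword.lower(), set()))


service = ArticleService()
service.save(1, "Reliable data systems")
service.save(2, "Scalable data systems")
service.process_pending()
print(service.search("data"))  # ids of the articles that contain "data"
```

Callers only see `save`, `get`, and `search`. The fact that the index lags behind the database until `process_pending` runs is exactly the kind of consistency detail the application code has to manage.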

When thinking about such a system, one is thinking not only as an application developer but also as a data engineer. One needs to think about which combination of these general-purpose tools will give the most reliable, scalable, and maintainable data system.

Let’s look at what reliability means for data systems, or for systems in general.

 

The system should continue to work correctly (performing the correct function at the desired level of performance) even in the face of adversity (hardware or software faults, and even human error).

Definition of reliability from the book

We want our systems to be reliable, or in other words, as fault-tolerant as possible. Let’s look at how we can handle some types of faults.

Hardware Faults 

Common hardware faults are hard disk failure, RAM corruption, power failure, and so on. 

These faults are tackled with redundancy at the hardware level. For example, data centers have batteries and generators for backup power, disks are set up in a RAID configuration, and servers have dual power supplies and hot-swappable CPUs. When one component fails, the redundant component takes over.

Until large cloud-based systems became common, this way of handling hardware faults was sufficient. A huge data center has so many machines that hardware faults happen all the time, so software-based fault-tolerance techniques are now used in addition to hardware redundancy.
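One simple software-based approach is failover: if one machine cannot answer, ask another. The sketch below is a minimal illustration of that idea, not a description of any particular product. The replicas are plain Python functions, and `ReplicaUnavailable`, `read_with_failover`, `failed_replica`, and `healthy_replica` are hypothetical names invented for this example.

```python
class ReplicaUnavailable(Exception):
    """Raised by a replica that cannot serve the request (hypothetical)."""


def read_with_failover(replicas, key):
    """Try each replica in turn and return the first successful answer."""
    errors = []
    for replica in replicas:
        try:
            return replica(key)
        except ReplicaUnavailable as exc:
            # This machine is down; remember why and try the next one.
            errors.append(str(exc))
    raise ReplicaUnavailable(f"all {len(replicas)} replicas failed: {errors}")


def failed_replica(key):
    # Simulates a machine with a hardware fault.
    raise ReplicaUnavailable("disk failure")


def healthy_replica(key):
    # Simulates a working machine that holds a copy of the data.
    return {"user:1": "Ada"}.get(key)


print(read_with_failover([failed_replica, healthy_replica], "user:1"))
```

The caller never learns that the first machine was broken. The fault is tolerated in software instead of being prevented in hardware.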

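We often see planned downtime on utility websites. A planned outage to replace or patch faulty machines is one way to deal with hardware faults, and it is typical of systems that run on a single server. A system that can tolerate the loss of a whole machine can instead be patched one machine at a time, without downtime.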

Software Errors

These are bugs introduced while a system was being developed, often because of assumptions about its environment that later turn out to be wrong. There is no easy way to prevent such errors.

Thorough testing, load testing, and performance testing are a few ways to find and fix these errors early. Process isolation helps limit the damage from the errors that remain.
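To make process isolation a bit more concrete, here is a minimal sketch that runs a task in a separate operating system process using Python's standard `subprocess` module. The helper name `run_isolated` and the example snippets are assumptions made for illustration. Real systems usually get this isolation from containers, separate services, or a process supervisor.

```python
import subprocess
import sys


def run_isolated(code, timeout_seconds=10):
    """Run a snippet of Python in a separate process and report how it ended."""
    result = subprocess.run(
        [sys.executable, "-c", code],  # a fresh interpreter for this task only
        capture_output=True,
        text=True,
        timeout=timeout_seconds,
    )
    return result.returncode, result.stdout.strip()


# A buggy task crashes its own process, not the parent.
exit_code, output = run_isolated("raise RuntimeError('bug in task')")
print("buggy task exit code:", exit_code)

# The parent is still running and can start the next task.
exit_code, output = run_isolated("print('ok')")
print("healthy task:", exit_code, output)
```

A nonzero exit code tells the parent that the task failed, so it can log the failure or restart the task while everything else keeps working.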

Human Errors

Humans design systems, and humans maintain them. They can introduce errors at any stage: design, development, release, or maintenance.

Common ways to reduce faults caused by humans

  1. Design systems in a way that minimizes opportunities for error.
  2. Test thoroughly at all levels, from unit tests to integration tests.
  3. Use metrics and monitoring to detect faults quickly.
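The third point can be sketched in a few lines. The example below tracks the error rate over the most recent requests and signals when it crosses a threshold. `ErrorRateMonitor`, the window size, and the threshold are all hypothetical values picked for this illustration. In practice, a monitoring system would collect these numbers and send the alerts.

```python
from collections import deque


class ErrorRateMonitor:
    """Tracks the outcome of the most recent requests (hypothetical helper)."""

    def __init__(self, window_size=100, alert_threshold=0.05):
        # Only the last `window_size` outcomes are kept; True means "failed".
        self._outcomes = deque(maxlen=window_size)
        self._alert_threshold = alert_threshold

    def record(self, failed):
        self._outcomes.append(bool(failed))

    def error_rate(self):
        if not self._outcomes:
            return 0.0
        return sum(self._outcomes) / len(self._outcomes)

    def should_alert(self):
        # Alert only when the rate is strictly above the threshold.
        return self.error_rate() > self._alert_threshold


monitor = ErrorRateMonitor(window_size=10, alert_threshold=0.2)
for failed in [False] * 8 + [True] * 2:
    monitor.record(failed)
print(monitor.error_rate(), monitor.should_alert())

# One more failure pushes the oldest success out of the window.
monitor.record(True)
print(monitor.error_rate(), monitor.should_alert())
```

Using a sliding window means old successes cannot hide a recent burst of failures, which is what makes quick detection possible.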

Thanks for stopping by! I hope this gives you a brief overview of the reliability aspect of data systems. I am eager to hear your thoughts, so please leave a comment below and we can discuss.

