So far, this series dealt with how we can store data on a single system. From this post onwards, we will see how to manage data when we distribute it across multiple systems.
Why distribute –
- Scalability – multiple machines can be added if data load increases
- Low latency – if data is distributed across multiple regions, requests can be served from the region closest to the user, which means lower latency.
- Fault tolerance/high availability – If one machine goes down, another machine can take over without loss of availability.
Scaling to Higher Load / Vertical Scaling / Scaling Up
In this case, whenever needed, we add more RAM, CPUs, storage, etc. to a single machine so it can handle the increasing load.
Drawbacks –
- High cost – a machine with twice the RAM, twice the CPU power, or twice the storage typically costs far more than twice as much as the machine you already have.
- Low fault tolerance – a single machine in a single location is a single point of failure; failed parts can be replaced, but that takes time, during which the system is unavailable.
- High latency – with a single machine in one geographic location, requests from far-away users spend more time in transit.
Shared-disk architecture
Another approach is the shared-disk architecture, which uses several machines with independent CPUs and RAM, but stores data on an array of disks that is shared between the machines, which are connected via a fast network. This architecture is used for some data warehousing workloads, but contention and the overhead of locking limit the scalability of the shared-disk approach.
Shared-Nothing Architectures / Horizontal Scaling / Scaling Out
Increase the number of machines when needed. Each machine (also called a node) has its own CPU, RAM, disk, etc; nothing is shared. Any coordination between nodes is done at the software level, using a conventional network.
Replication Versus Partitioning
Replication
Keeping a copy of the same data on several different nodes, potentially in different locations. Replication provides redundancy: if some nodes are unavailable, the data can still be served from the remaining nodes. Replication can also help improve performance.
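To make the redundancy idea concrete, here is a minimal sketch of replication, assuming an in-memory key-value store; the names (`Node`, `ReplicatedStore`) are illustrative, not any real library. Every write is copied to all replicas, so a read still succeeds after a node fails.

```python
# Illustrative sketch of replication (hypothetical names, not a real library).

class Node:
    def __init__(self, name):
        self.name = name
        self.data = {}   # this node's local copy of the data
        self.up = True   # whether the node is currently reachable

class ReplicatedStore:
    def __init__(self, nodes):
        self.nodes = nodes

    def write(self, key, value):
        # Copy the value to every available replica.
        for node in self.nodes:
            if node.up:
                node.data[key] = value

    def read(self, key):
        # Serve the read from any replica that is still up.
        for node in self.nodes:
            if node.up and key in node.data:
                return node.data[key]
        raise KeyError(key)

nodes = [Node("a"), Node("b"), Node("c")]
store = ReplicatedStore(nodes)
store.write("user:1", "alice")
nodes[0].up = False          # one node goes down...
print(store.read("user:1"))  # ...the data is still served: "alice"
```

Real systems must also handle writes that arrive while a replica is down and bring that replica back in sync; that is exactly the kind of trade-off later posts dig into.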
Partitioning
Splitting a big database into smaller subsets called partitions so that different partitions can be assigned to different nodes (also known as sharding).
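A common way to decide which partition a record belongs to is to hash its key. The sketch below, with illustrative names only, maps each key to one of a fixed number of partitions, each of which could then be assigned to a different node.

```python
# Illustrative sketch of hash partitioning (hypothetical names).
import hashlib

NUM_PARTITIONS = 4

def partition_for(key: str) -> int:
    # Use a stable hash (not Python's randomized hash()) so the same key
    # always maps to the same partition across processes.
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest, "big") % NUM_PARTITIONS

# Each partition holds a disjoint subset of the keys; in a real system,
# each partition would live on a different node.
partitions = {i: {} for i in range(NUM_PARTITIONS)}

def put(key, value):
    partitions[partition_for(key)][key] = value

def get(key):
    return partitions[partition_for(key)][key]

put("user:1", "alice")
put("user:2", "bob")
print(get("user:1"))  # "alice", fetched from user:1's partition
```

Note that with this naive modulo scheme, changing `NUM_PARTITIONS` remaps most keys; schemes like consistent hashing exist to avoid that, which is part of the trade-off discussion to come.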
In subsequent posts, we will go into the details of shared-nothing architectures and the trade-offs involved in implementing them.
Thanks for stopping by! Hope this gives you a brief overview of distributed data. Eager to hear your thoughts; please leave comments below and we can discuss.