Big Data Patterns, Mechanisms > Data Processing Patterns > Automatic Data Sharding
Automatic Data Sharding (Buhler, Erl, Khattak)
How can very large amounts of data be stored without degrading the access performance of the underlying storage technology?
Problem
Solution
Application
Mechanisms
A NoSQL database is used for applying the Automatic Data Sharding pattern. Generally, a field in the dataset is specified by the user for configuring the sharding process. Based on the field value, rows are automatically allocated to different shards. When a user specifies a query, the NoSQL automatically determines which shard should be contacted for retrieving the required rows.
The performance, however, may deteriorate if the query requires data from multiple shards, which requires that query patterns be examined in order to best shard the dataset.
The Automatic Data Sharding pattern is normally applied together with the Automatic Data Replication and Reconstruction patterns to achieve fault-tolerance through the automatic replication of shards.
Instead of storing the entire dataset as a single unit, the dataset is automatically divided into parts where each part, called a shard, holds only a subset of rows and is stored on a separate machine. When a user queries data, data is automatically retrieved from the shard that holds the corresponding shard. By making each machine responsible for only part of the data, the overall performance of the underlying storage technology remains unaffected when a number of users start querying different parts of a dataset.
- A dataset is stored in a NoSQL database that automatically divides the dataset into four shards and stores them on four different machines of a cluster.
- (a,b,c) A handful of users access the data simultaneously without incurring any significant delay as each user is served by a different machine.
- (a,b,c,d) Over a period of time, more data is added to the same dataset, which results in a very large dataset and further increasing the size of each shard.
- (a,b,c,d) When a large number of users tries to access the data simultaneously, they still do not incur any significant delay, for users are served by different machines.