Data Size Reduction (Buhler, Erl, Khattak)
How can the size of the data be reduced to enable more cost-effective storage and increased data mobility when faced with very large amounts of data?
Problem
With a reasonable rate of data acquisition, IT spending increases only slightly over time. As the amount of acquired data increases exponentially, however, IT spending tends to increase exponentially as well, because storage capacity must be expanded in proportion to the data being stored.

Solution
A component is added to the Big Data platform that reduces the size of the data before it is saved to the storage device. This not only keeps storage costs low but also facilitates faster data movement within the cluster, which helps achieve quicker processing of the data. With a compression engine in place, storage capacity no longer needs to grow in proportion to the acquired data, so IT spending increases only slightly even as acquisition grows.
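As a minimal sketch of the solution, the snippet below gzip-compresses an acquired file before it is written to the storage location. The file paths, and the choice of gzip itself, are assumptions made purely for illustration:

```python
# Minimal sketch: compress acquired data with gzip before it reaches
# the storage device. Paths and codec choice are illustrative assumptions.
import gzip
import shutil

def store_compressed(src_path: str, dst_path: str) -> None:
    """Write a gzip-compressed copy of src_path to the storage location."""
    with open(src_path, "rb") as src, gzip.open(dst_path, "wb") as dst:
        shutil.copyfileobj(src, dst)

store_compressed("/data/incoming/events.log",   # hypothetical acquired file
                 "/storage/events.log.gz")      # hypothetical storage path
```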
Application
A compression engine mechanism is introduced within the Big Data platform that works closely with the data transfer engine to compress data as it is acquired. Alternatively, already acquired data can be reprocessed to create a reduced-size dataset, or the output from the processing engine can be configured to be compressed automatically, as in the sketch below.
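One plausible way to realize the last option, assuming a Spark-based processing engine; the input and output paths and the codec choice are illustrative, not prescribed by the pattern:

```python
# Illustrative PySpark job: the processing engine's output is compressed
# automatically by choosing a compressed columnar format and codec.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("compressed-output").getOrCreate()

df = spark.read.json("/data/raw/events")    # hypothetical input path

# Snappy favors speed; gzip would trade more CPU time for a smaller footprint.
(df.write
   .option("compression", "snappy")
   .parquet("/data/processed/events"))      # hypothetical output path
```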
The application of this pattern requires some attention, as incorrect application may increase overall data processing time and waste processing resources. It calls for an efficient compression engine, one that consumes few processing cycles to compress and decompress data while still providing an optimal reduction in dataset size. In general, a compression engine that achieves greater compression requires more computing power and time, and vice versa; the sketch below makes this trade-off measurable.
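A quick way to see the trade-off is to time the same payload at different compression levels. The snippet uses Python's zlib with a synthetic payload, so the exact figures are indicative only:

```python
# Rough illustration of the ratio-versus-CPU trade-off using zlib levels.
# Timings and ratios depend on the data and hardware; this payload is synthetic.
import time
import zlib

payload = b"sensor_id=42,temp=21.5,status=OK\n" * 100_000

for level in (1, 6, 9):        # fast, default, and maximum compression
    start = time.perf_counter()
    compressed = zlib.compress(payload, level)
    elapsed = time.perf_counter() - start
    ratio = len(payload) / len(compressed)
    print(f"level={level}  ratio={ratio:.1f}x  time={elapsed * 1000:.1f} ms")
```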
Mechanisms
Compression Engine, Data Transfer Engine, Processing Engine, Storage Device
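To put rough numbers on the cost discussion above, a back-of-the-envelope calculation follows; the data volume, unit cost, and compression ratio are all assumed figures chosen purely for illustration:

```python
# Back-of-the-envelope storage cost, with entirely assumed figures:
# 500 TB of acquired data, a notional $25/TB/month storage rate, and a
# 4x average compression ratio across the datasets.
acquired_tb = 500
cost_per_tb_month = 25.0
compression_ratio = 4.0

raw_cost = acquired_tb * cost_per_tb_month
compressed_cost = raw_cost / compression_ratio
print(f"monthly cost, raw: ${raw_cost:,.0f}; compressed: ${compressed_cost:,.0f}")
```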