Data Size Reduction (Buhler, Erl, Khattak)

How can the size of the data be reduced to enable more cost-effective storage and greater data mobility when faced with very large amounts of data?

Problem

Storing increasingly large amounts of data inside a Big Data solution environment can quickly exhaust existing storage capacity, requiring frequent capacity expansion that drives up costs. At the same time, transferring very large files inside a cluster lengthens the overall data processing time.

Solution

The storage footprint of incoming raw data is reduced before the data is stored inside the Big Data platform.

Application

Compression techniques are applied to acquired data, either in-flight in the case of streaming data or after the complete dataset has been acquired in the case of batch data.

A compression engine mechanism is introduced within the Big Data platform that works closely with the data transfer engine to compress data as it is acquired. In other circumstances, already acquired data can be processed to create a reduced-size dataset, or the output from the processing engine can be configured to be compressed automatically.
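
For illustration, the following minimal Python sketch (not part of the pattern catalog; all names are hypothetical) shows both acquisition paths: compressing records in-flight as a stream is consumed, and compressing an already-acquired batch file, using the standard-library gzip codec as an assumed compression engine.

    import gzip
    import shutil

    def compress_stream(records, out_path):
        # In-flight mode: records are compressed as they arrive, so the
        # uncompressed dataset is never materialized on disk.
        with gzip.open(out_path, "wt", encoding="utf-8") as sink:
            for record in records:
                sink.write(record + "\n")

    def compress_batch(src_path, out_path):
        # Batch mode: an already-acquired file is compressed in one pass.
        with open(src_path, "rb") as src, gzip.open(out_path, "wb") as dst:
            shutil.copyfileobj(src, dst)

    # Hypothetical usage: a stream of sensor readings compressed on arrival.
    readings = (f"sensor-{i},{i * 0.1:.1f}" for i in range(100_000))
    compress_stream(readings, "readings.csv.gz")

Writing through gzip.open keeps the read side transparent: downstream consumers can decompress the file with the same module.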

The application of this pattern requires some attention, as incorrect application may increase overall data processing time and waste processing resources. An efficient compression engine is needed: one that consumes fewer processing cycles to compress and decompress data while still providing an optimal reduction in dataset size. In general, the stronger the compression an engine provides, the more computing power and time it requires, and vice versa.
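
This trade-off can be observed directly by timing a codec at different effort levels. The sketch below uses Python's standard-library zlib on a synthetic, repetitive payload (an assumption; real-world ratios depend heavily on the data), showing that higher compression levels buy a better ratio at the cost of more CPU time.

    import time
    import zlib

    # Synthetic, repetitive payload standing in for acquired data
    # (an assumption; actual ratios vary with the dataset).
    data = b"timestamp,device_id,temperature,humidity\n" * 50_000

    for level in (1, 6, 9):  # fast, default, maximum compression
        start = time.perf_counter()
        compressed = zlib.compress(data, level)
        elapsed = time.perf_counter() - start
        print(f"level {level}: ratio {len(data) / len(compressed):.1f}x, "
              f"time {elapsed * 1000:.1f} ms")

On typical text-heavy datasets, moving from the fastest to the strongest level often multiplies compression time several-fold while improving the ratio only modestly, which is why Big Data platforms frequently favor fast codecs for hot data and reserve heavier codecs for cold, archival data.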

Data Size Reduction: A component is added to the Big Data platform that reduces the size of the data before it is saved to the storage device. This not only keeps storage costs low but also facilitates faster data movement within the cluster, which helps achieve quicker data processing.

In the preceding diagram, when the amount of acquired data remains moderate, IT spending increases only slightly over time. As the amount of acquired data grows exponentially, IT spending tends to grow exponentially as well. However, if a data compression engine is introduced, storage capacity does not need to be increased proportionally, and IT spending again increases only slightly.
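
A rough back-of-the-envelope calculation makes this effect concrete. All figures below are illustrative assumptions, not measurements: a fixed monthly intake, a flat storage price, and an average 4:1 compression ratio.

    # Illustrative assumptions only: fixed monthly intake, flat storage
    # price, and an average 4:1 compression ratio.
    monthly_intake_tb = 50       # raw data acquired per month
    cost_per_tb_month = 20.0     # assumed storage price in dollars
    compression_ratio = 4.0      # assumed average ratio for the codec

    for month in (6, 12, 24):
        raw = monthly_intake_tb * month
        reduced = raw / compression_ratio
        print(f"month {month:2d}: raw {raw:5.0f} TB "
              f"(${raw * cost_per_tb_month:,.0f}/mo) vs compressed "
              f"{reduced:5.0f} TB (${reduced * cost_per_tb_month:,.0f}/mo)")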


This pattern is covered in BDSCP Module 10: Fundamental Big Data Architecture.

For more information regarding the Big Data Science Certified Professional (BDSCP) curriculum, visit www.arcitura.com/bdscp.

Big Data Fundamentals

The official textbook for the BDSCP curriculum is:

Big Data Fundamentals: Concepts, Drivers & Techniques
by Paul Buhler, PhD, Thomas Erl, Wajid Khattak
(ISBN: 9780134291079, Paperback, 218 pages)

Please note that this textbook covers fundamental topics only and does not cover design patterns.
For more information about this book, visit www.arcitura.com/books.