Big Data Patterns | Design Patterns | Streaming Storage

Streaming Storage (Buhler, Erl, Khattak)

How can large datasets be accessed in a way that lends itself to efficient processing of data in batch mode?

Problem

Batch data processing techniques require contiguous blocks of input data to achieve high throughput. However, databases, which are typically optimized for random access to individual records, do not deliver data in this way.

Solution

A Big Data storage device with streaming data access capability is used.

Application

Streaming data access technology is implemented to store datasets for non-random, simple sequential access, which achieves higher data transfer throughput.
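The throughput benefit of sequential access can be illustrated with a rough cost model. The sketch below is not taken from the pattern itself: the function name and every figure (dataset size, block size, seek time, transfer rate) are assumptions chosen only to make the arithmetic easy to follow.

```python
def estimated_read_time(size_bytes, seeks, seek_seconds, bytes_per_second):
    """Simple model: total time = seek overhead + pure transfer time."""
    return seeks * seek_seconds + size_bytes / bytes_per_second


# All values below are illustrative assumptions, not measurements.
SIZE = 1_000_000_000   # 1 GB dataset
BLOCK = 100_000        # 100 KB fetched per random read
SEEK = 0.01            # 10 ms per seek
RATE = 100_000_000     # 100 MB/s sustained transfer rate

# Streaming access: one seek to the start, then a continuous transfer.
streaming = estimated_read_time(SIZE, 1, SEEK, RATE)

# Random access: one seek for every block that is fetched.
random_access = estimated_read_time(SIZE, SIZE // BLOCK, SEEK, RATE)

print(f"streaming: {streaming:.2f} s, random: {random_access:.2f} s")
```

Under these assumed figures the streaming read takes about 10 seconds and the block-by-block random read about 110 seconds, because seek overhead dominates once it is paid for every block.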

Mechanisms

Processing Engine, Resource Manager, Storage Device

A distributed file system storage device is used to enable streaming data access. When data is required for batch processing, only the start position of the file needs to be found; the rest of the file is then output as a continuous stream until the end of the file. Although it enables batch data processing, a distributed file system does not provide any file search capability. A file can only be accessed at a known location, and its data can only be searched by sequentially scanning the whole file.
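The access model described above can be sketched in a few lines of Python. This is a minimal illustration that uses an ordinary local file as a stand-in for a distributed file system; the function names stream_file and scan_for are hypothetical and do not belong to any particular product.

```python
def stream_file(path, chunk_size=64 * 1024):
    """Yield the file's bytes as consecutive chunks, front to back."""
    # Opening the file positions the reader at its start; after that
    # the data is read strictly in order, with no repositioning.
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:          # an empty read means end of file
                break
            yield chunk


def scan_for(path, needle):
    """Return 1-based numbers of the lines that contain `needle`.

    There is no index to consult, so the only way to search is to
    read every line of the file in sequence.
    """
    matches = []
    with open(path, "rb") as f:
        for lineno, line in enumerate(f, start=1):
            if needle in line:
                matches.append(lineno)
    return matches
```

A batch job would consume stream_file(path) chunk by chunk, while a lookup such as scan_for(path, b"error") has to touch the whole file even when only one line matches.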

This pattern is generally applied together with the Large-Scale Batch Processing pattern to provide a complete solution.

Streaming Storage: A storage device capable of providing non-random data access is used to store large amounts of data in support of batch data processing. Restricting access to non-random mode allows data to be provisioned as contiguous blocks without requiring multiple seek operations.

A distributed file system is used to store large amounts of unstructured data.
When the data is required for batch processing, the distributed file system performs a single seek to find the start position of the file, then streams the file without any further seeks.
This results in very high throughput and reduces the overall data processing time.
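The difference in seek counts between the two access styles can be made concrete with a small in-memory experiment. The sketch below is an illustration only: SeekCountingFile, read_streaming, and read_random are hypothetical helpers built on Python's io.BytesIO, not part of any distributed file system API.

```python
import io


class SeekCountingFile(io.BytesIO):
    """In-memory file that counts explicit seek() calls (hypothetical helper)."""

    def __init__(self, data):
        self.seek_count = 0
        super().__init__(data)

    def seek(self, offset, whence=io.SEEK_SET):
        self.seek_count += 1
        return super().seek(offset, whence)


def read_streaming(f, block_size):
    """Seek once to the start, then read blocks back to back."""
    f.seek(0)
    blocks = []
    while True:
        block = f.read(block_size)
        if not block:
            break
        blocks.append(block)
    return blocks


def read_random(f, block_size, order):
    """Fetch blocks in an arbitrary order, seeking before every read."""
    blocks = {}
    for index in order:
        f.seek(index * block_size)
        blocks[index] = f.read(block_size)
    # Reassemble in file order so the result is comparable to streaming.
    return [blocks[index] for index in sorted(blocks)]
```

Both functions return the same blocks, but read_streaming issues a single seek regardless of file size, whereas read_random issues one seek per block. On a physical storage device each of those extra seeks adds latency, which is the overhead this pattern avoids.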

This pattern is covered in BDSCP Module 10: Fundamental Big Data Architecture.

For more information regarding the Big Data Science Certified Professional (BDSCP) curriculum,
visit www.arcitura.com/bdscp.

The official textbook for the BDSCP curriculum is:

Big Data Fundamentals: Concepts, Drivers & Techniques
by Paul Buhler, PhD, Thomas Erl, Wajid Khattak
(ISBN: 9780134291079, Paperback, 218 pages)

Please note that this textbook covers fundamental topics only and does not cover design patterns.
For more information about this book, visit www.arcitura.com/books.