Streaming Storage (Buhler, Erl, Khattak)
How can large datasets be accessed in a way that lends itself to efficient processing of data in batch mode?
A distributed file system is used as the storage device to enable streaming data access. When data is required for batch processing, only the start position of the file needs to be located; the rest of the file is then read as a continuous stream until the end of the file. Although this enables batch data processing, a distributed file system does not provide any file search capability: a file can only be accessed at a known location, and its data can only be searched by sequentially scanning the whole file.
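The search limitation can be illustrated with a minimal sketch. It assumes a local text file stands in for a file held in a distributed file system; the names `scan_for` and `make_sample_file` are hypothetical and not part of any particular product's API. Because there is no index, the only way to find matching records is to read every line in order.

```python
import os
import tempfile


def scan_for(path, needle):
    """Return 1-based line numbers that contain `needle`.

    There is no index to consult, so the file is read from the
    first line to the last, in order.
    """
    hits = []
    with open(path, "r", encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            if needle in line:
                hits.append(line_no)
    return hits


def make_sample_file():
    """Create a small sample file (hypothetical data) and return its path."""
    fd, path = tempfile.mkstemp(suffix=".log")
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        f.write("alpha\nbeta error\ngamma\ndelta error\n")
    return path
```

The cost of such a scan grows with the size of the file regardless of how few records match, which is acceptable for batch jobs that read everything anyway but poor for targeted lookups.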
This pattern is generally applied together with the Large-Scale Batch Processing pattern to provide a complete solution.
A storage device that provides non-random (sequential) data access is used to store large amounts of data in support of batch data processing. Restricting data access to non-random mode allows data to be provisioned as contiguous blocks without requiring multiple seek operations.
- A distributed file system is used to store large amounts of unstructured data.
- When the data is required for batch processing, the distributed file system only needs to perform a single seek to find the start position of the file. Then, the distributed file system starts streaming the file without any further seeks.
- This results in very high throughput and reduces the overall data processing time.
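The access pattern in the steps above can be sketched as follows. This is only an illustration under stated assumptions: a local binary file stands in for a file in a distributed file system, and `stream_blocks`, `total_bytes`, and the 64 KiB default block size are hypothetical choices rather than anything prescribed by the pattern. The point is that a single seek positions the reader, after which every read continues from where the previous one ended.

```python
def stream_blocks(path, start=0, block_size=64 * 1024):
    """Yield the file's contents from `start` to end of file as blocks."""
    with open(path, "rb") as f:
        f.seek(start)  # the only seek: locate the start position
        while True:
            block = f.read(block_size)  # sequential read, no further seeks
            if not block:  # an empty result means end of file
                break
            yield block


def total_bytes(path, start=0, block_size=64 * 1024):
    """Example batch consumer: count the bytes streamed from `start`."""
    return sum(len(block) for block in stream_blocks(path, start, block_size))
```

A batch job would replace `total_bytes` with its own processing loop over the yielded blocks; the storage side of the sketch stays the same.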