

Intermediate Results Storage (Buhler, Erl, Khattak)

How can the complete re-execution of a series of processing steps be avoided in case an error occurs partway through?

Problem

When a series of processing steps cannot be restarted from the specific step that caused an error, the entire series must be re-executed, which wastes time and resources.

Solution

The intermediate output from each step is stored temporarily until the final result is computed and validated.

Application

The processing logic is modified so that the output of each processing step is persisted to a storage device. The persisted outputs are deleted only after the final processing step has executed and its results have been verified.

The processing engine's logic is modified so that it saves the output generated by each step to a storage device. Result validation can either be left to the user or be automated, for example through range checking. With automated validation, the user can be notified if the final output is found to be erroneous. Once the final output has been successfully validated, the intermediate outputs can be deleted, for example by a workflow designed for that purpose in a workflow engine.
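The save, validate, and delete cycle described above can be sketched in a few lines. This is a minimal illustration and not the pattern's reference implementation: the function names, the JSON file layout, the three example steps, and the range bounds are all assumptions made for the example.

```python
import json
import shutil
import tempfile
from pathlib import Path


def run_with_intermediate_storage(data, steps, validate, store_dir):
    """Run each step, persisting its output; clean up only after validation."""
    store = Path(store_dir)
    store.mkdir(parents=True, exist_ok=True)
    current = data
    for name, step in steps:
        current = step(current)
        # Persist the intermediate output before handing it to the next step.
        (store / f"{name}.json").write_text(json.dumps(current))
    if validate(current):
        # Final output verified: the intermediate outputs are no longer needed.
        shutil.rmtree(store)
        return current, True
    # Validation failed: keep the intermediate outputs for inspection.
    return current, False


def in_range(value, low=0.0, high=100.0):
    """Automated validation by range checking (bounds are illustrative)."""
    return low <= value <= high


if __name__ == "__main__":
    # Hypothetical three-step job: drop missing values, scale, average.
    steps = [
        ("step_a", lambda xs: [x for x in xs if x is not None]),
        ("step_b", lambda xs: [x * 2 for x in xs]),
        ("step_c", lambda xs: sum(xs) / len(xs)),
    ]
    workdir = Path(tempfile.mkdtemp()) / "intermediate"
    result, ok = run_with_intermediate_storage([1, None, 3], steps, in_range, workdir)
    print(result, ok, workdir.exists())  # 4.0 True False
```

When validation fails, the function leaves the stored files in place and returns False, which is the point at which a user notification could be triggered.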

When applied, the Intermediate Results Storage pattern may slow down processing due to the additional time required for saving intermediate results.

Applying this pattern consumes more storage space because the intermediate results must also be stored; the Data Size Reduction pattern can be applied to reduce this storage footprint. This pattern can also be applied together with the Automated Processing Metadata Insertion pattern to speed up debugging.
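As a rough illustration of reducing the storage footprint, intermediate outputs can be compressed when they are written. The sketch below assumes that gzip compression is an acceptable form of size reduction and that the outputs are JSON-serializable. The helper names and the sample data are hypothetical and are not taken from the Data Size Reduction pattern itself.

```python
import gzip
import json
import tempfile
from pathlib import Path


def save_compressed(output, path):
    """Write an intermediate output as gzip-compressed JSON."""
    with gzip.open(path, "wt", encoding="utf-8") as fh:
        json.dump(output, fh)


def load_compressed(path):
    """Read back an intermediate output written by save_compressed."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        return json.load(fh)


if __name__ == "__main__":
    # Hypothetical intermediate output with highly repetitive records.
    rows = [{"sensor": "s1", "reading": i % 10} for i in range(10_000)]
    workdir = Path(tempfile.mkdtemp())
    raw_path = workdir / "run_a.json"
    gz_path = workdir / "run_a.json.gz"
    raw_path.write_text(json.dumps(rows))
    save_compressed(rows, gz_path)
    # Compare the on-disk size of the plain and compressed copies.
    print(raw_path.stat().st_size, gz_path.stat().st_size)
    assert load_compressed(gz_path) == rows
```

Compression trades extra CPU time for disk space, so it adds to the processing slowdown that saving intermediate results already introduces.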

Intermediate Results Storage: The intermediate output generated by each processing run is saved so that it can be retrieved later. Once the last processing run has generated the final output and that output has been validated as correct, the intermediate outputs are deleted. If the final output is erroneous, the intermediate outputs can be validated individually to find the cause of the error.

  1. A large dataset needs to be processed via three separate processing runs to compute x, a statistic.
  2. Processing Run A is executed, and the intermediate output is saved as well as passed to Processing Run B.
  3. Processing Run B is executed, and the intermediate output is saved as well as passed to Processing Run C.
  4. Processing Run C is executed, and the output (statistic) is saved as well as passed to the user.
  5. The user verifies the computed statistic and finds that it is incorrect.
  6. (a,b,c) The user then examines the saved intermediate output from each processing run to find the source of error.
  7. The user finds that intermediate output from Processing Run B is erroneous.
  8. The logic in Processing Run B is fixed.
  9. The user then partially re-executes the processing logic, starting with Processing Run B.
  10. (a,b,c,d) After successful validation of the statistic, the intermediate/duplicate results are deleted.
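
The ten steps above can be sketched as a small script. The run names mirror the scenario, but everything else is an assumption made for illustration: the statistic (a mean of squares), the bug in Processing Run B, the JSON files, and the hypothetical `start_at` parameter used for partial re-execution.

```python
import json
import shutil
import tempfile
from pathlib import Path

calls = []  # records which runs actually executed


def run_a(values):
    calls.append("A")
    return [v for v in values if v >= 0]  # drop invalid negative values


def run_b_buggy(values):
    calls.append("B")
    return [v * 2 for v in values]  # bug: doubles instead of squaring


def run_b_fixed(values):
    calls.append("B")
    return [v ** 2 for v in values]


def run_c(values):
    calls.append("C")
    return sum(values) / len(values)


def run_pipeline(dataset, runs, store, start_at=None):
    """Execute runs in order, saving each output under `store`.

    If start_at is given, earlier runs are skipped and the saved output of
    the run just before start_at is loaded as the input instead.
    """
    names = [name for name, _ in runs]
    first = names.index(start_at) if start_at else 0
    if first == 0:
        current = dataset
    else:
        current = json.loads((store / f"{names[first - 1]}.json").read_text())
    for name, run in runs[first:]:
        current = run(current)
        (store / f"{name}.json").write_text(json.dumps(current))
    return current


if __name__ == "__main__":
    store = Path(tempfile.mkdtemp())
    dataset = [1, -5, 2, 3, 4]
    # Steps 2-4: full execution, saving every output.
    x = run_pipeline(dataset, [("A", run_a), ("B", run_b_buggy), ("C", run_c)], store)
    print(x)  # 5.0, but the expected mean of squares is 7.5
    # Steps 6-7: inspecting the saved output of Run B reveals the error.
    print(json.loads((store / "B.json").read_text()))  # [2, 4, 6, 8]
    # Steps 8-9: fix Run B and re-execute from B, reusing Run A's saved output.
    x = run_pipeline(dataset, [("A", run_a), ("B", run_b_fixed), ("C", run_c)],
                     store, start_at="B")
    print(x)  # 7.5
    # Step 10: after successful validation, delete the intermediate outputs.
    shutil.rmtree(store)
```

In this sketch Processing Run A executes only once: the second invocation reads A.json from storage instead of recomputing it, which is the saving the pattern is meant to provide.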

This pattern is covered in BDSCP Module 11: Advanced Big Data Architecture.

For more information regarding the Big Data Science Certified Professional (BDSCP) curriculum,
visit www.arcitura.com/bdscp.

The official textbook for the BDSCP curriculum is:

Big Data Fundamentals: Concepts, Drivers & Techniques
by Paul Buhler, PhD, Thomas Erl, Wajid Khattak
(ISBN: 9780134291079, Paperback, 218 pages)

Please note that this textbook covers fundamental topics only and does not cover design patterns.
For more information about this book, visit www.arcitura.com/books.