Intermediate Results Storage (Buhler, Erl, Khattak)
How can the complete re-execution of a series of processing steps be avoided in case an error occurs partway through?
The processing engine's logic is modified so that the generated output is saved to a storage device. Result validation can either be left to the user or be automated, for example through range checking. With automated validation, the user can be notified if the final output is found to be erroneous. If the final output is successfully validated, a workflow engine can be used to design a workflow that deletes the intermediate outputs.
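The automated validation and cleanup described above could be sketched as follows. This is a minimal illustration, not a prescribed implementation: the function names `validate_range` and `finalize`, the `notify` callback, and the numeric bounds are all hypothetical, and a plain directory on local disk stands in for the storage device.

```python
import shutil
import tempfile
from pathlib import Path


def validate_range(value, low, high):
    """Range check: accept the value only if low <= value <= high."""
    return low <= value <= high


def finalize(final_output, intermediate_dir, low, high, notify=print):
    """Delete intermediate outputs if the final output validates; otherwise notify the user."""
    if validate_range(final_output, low, high):
        shutil.rmtree(intermediate_dir)  # validated: intermediates are no longer needed
        return True
    # Validation failed: keep the intermediates so they can be inspected.
    notify(
        f"Final output {final_output!r} is outside [{low}, {high}]; "
        f"intermediate outputs kept in {intermediate_dir}"
    )
    return False


# Hypothetical usage: one saved intermediate output, two candidate final outputs.
workdir = Path(tempfile.mkdtemp())
(workdir / "run_a.json").write_text("[1, 2, 3]")

messages = []
kept = finalize(150.0, workdir, 0.0, 100.0, notify=messages.append)    # fails the check
cleaned = finalize(42.0, workdir, 0.0, 100.0, notify=messages.append)  # passes the check
```

In a real deployment the deletion step would more likely be a task in a workflow engine than a direct function call; the direct call here only keeps the sketch self-contained.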
When applied, the Intermediate Results Storage pattern may slow down processing due to the additional time required for saving intermediate results.
Applying this pattern consumes more storage space because the intermediate results must also be stored. The Data Size Reduction pattern can be applied to reduce this storage footprint. Furthermore, this pattern can be applied together with the Automated Processing Metadata Insertion pattern in order to speed up debugging.
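As one illustration of reducing the storage footprint, intermediate results can be written in compressed form. The sketch below assumes that compression is the chosen size-reduction technique and that the intermediate output is JSON-serializable; the helper names `save_compressed` and `load_compressed` and the sample records are hypothetical.

```python
import gzip
import json
import tempfile
from pathlib import Path


def save_compressed(records, path):
    """Write an intermediate result as gzip-compressed JSON."""
    with gzip.open(path, "wt", encoding="utf-8") as f:
        json.dump(records, f)


def load_compressed(path):
    """Read back an intermediate result written by save_compressed."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return json.load(f)


# Hypothetical intermediate output from a processing run.
workdir = Path(tempfile.mkdtemp())
records = [{"sensor": "s1", "reading": 20.0} for _ in range(10_000)]

# Uncompressed copy, written only to compare sizes.
raw_path = workdir / "run_a.json"
raw_path.write_text(json.dumps(records), encoding="utf-8")

gz_path = workdir / "run_a.json.gz"
save_compressed(records, gz_path)
```

Compression trades storage space for extra CPU time on every save and load, which adds to the processing slowdown noted earlier.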
The intermediate output generated by each processing run is saved so that it can be retrieved later. Once the last processing run has generated the final output and that output has been validated as correct, the intermediate outputs are deleted. If the final output is erroneous, the intermediate outputs can be validated individually to find the cause of the error.
- A large dataset needs to be processed via three separate processing runs to compute x, a statistic.
- Processing Run A is executed, and the intermediate output is saved as well as passed to Processing Run B.
- Processing Run B is executed, and the intermediate output is saved as well as passed to Processing Run C.
- Processing Run C is executed, and the output (statistic) is saved as well as passed to the user.
- The user verifies the computed statistic and finds that it is incorrect.
- The user then examines the saved intermediate output from each processing run to find the source of error.
- The user finds that intermediate output from Processing Run B is erroneous.
- The logic in Processing Run B is fixed.
- The user then partially re-executes the processing logic, starting with Processing Run B.
- After successful validation of the statistic, the intermediate/duplicate results are deleted.
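The walkthrough above can be sketched in a few lines of Python. Everything here is assumed for illustration: the three run functions, the deliberate bug in Run B, the sample data, the `execute` helper, and the use of JSON files in a temporary directory as the intermediate store. The statistic x is taken to be a mean.

```python
import json
import shutil
import tempfile
from pathlib import Path

executed = []  # names of the runs that actually executed, in order


def run_a(values):
    # Run A: drop missing values.
    return [v for v in values if v is not None]


def run_b_buggy(values):
    # Run B (erroneous): meant to convert cents to dollars, but multiplies.
    return [v * 100 for v in values]


def run_b_fixed(values):
    # Run B (fixed): convert cents to dollars.
    return [v / 100 for v in values]


def run_c(values):
    # Run C: compute the statistic x (here, the mean).
    return sum(values) / len(values)


def execute(steps, data, store, start_at=0):
    """Run steps in order, saving each output to the store.

    With start_at > 0, the input is loaded from the saved output of the
    preceding step, so earlier steps are not re-executed.
    """
    if start_at > 0:
        prev_name = steps[start_at - 1][0]
        data = json.loads((store / f"{prev_name}.json").read_text())
    for name, fn in steps[start_at:]:
        data = fn(data)
        (store / f"{name}.json").write_text(json.dumps(data))
        executed.append(name)
    return data


store = Path(tempfile.mkdtemp())
steps = [("run_a", run_a), ("run_b", run_b_buggy), ("run_c", run_c)]

wrong_x = execute(steps, [1000, None, 3000], store)  # full execution, bad statistic

# Inspecting run_a.json and run_b.json shows Run B's output is erroneous.
steps[1] = ("run_b", run_b_fixed)
x = execute(steps, None, store, start_at=1)  # partial re-execution from Run B

if 0 <= x <= 100:  # hypothetical range check on the statistic
    shutil.rmtree(store)  # validated: delete the intermediate results
```

Run A executes only once; the second call reads `run_a.json` from the store instead of reprocessing the raw dataset, which is the saving the pattern is meant to provide.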