Automated Processing Metadata Insertion (Buhler, Erl, Khattak)
How can confidence be instilled in results whose computation involves applying a series of processing steps in a Big Data environment?
Problem
Solution
Application
Mechanisms
A standard data structure for the metadata is agreed upon first. Details about the operations applied during each processing run are then recorded as metadata that conforms to this structure. The metadata is appended automatically by code inserted into the processing routines of the processing engine.
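As a rough illustration of this mechanism, the sketch below uses a Python decorator to stand in for the code inserted into a processing routine. The payload layout (a dict with "data" and "metadata" keys), the record fields and the names with_processing_metadata and cleanse are all assumptions made for this example; the pattern itself does not prescribe them.

```python
import functools
import json
from datetime import datetime, timezone


def with_processing_metadata(operation):
    """Wrap a processing step so it appends one metadata record per run.

    The record layout below is a hypothetical standardized structure.
    """
    def decorator(step):
        @functools.wraps(step)
        def wrapper(payload, **params):
            result = step(payload["data"], **params)
            record = {
                "step": step.__name__,
                "operation": operation,
                "parameters": params,
                "input_count": len(payload["data"]),
                "output_count": len(result),
                "timestamp": datetime.now(timezone.utc).isoformat(),
            }
            # Return a new payload; the incoming metadata list is not mutated.
            return {"data": result, "metadata": payload["metadata"] + [record]}
        return wrapper
    return decorator


@with_processing_metadata("drop records with missing values")
def cleanse(rows):
    return [r for r in rows if r is not None]


payload = {"data": [4, None, 7, 9], "metadata": []}
output = cleanse(payload)
print(json.dumps(output["metadata"], indent=2))
```

Because the step author only writes the cleanse logic, the metadata is produced without manual effort, and the JSON output is machine-readable by construction.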
If the data is manipulated via the query engine, support for accessing the metadata may need to be added to the query engine, depending on the functionality it already provides. Because the metadata is added in machine-readable form, it can be processed programmatically and does not have to be interpreted by a human.
This pattern can also be applied in association with the Complex Logic Decomposition pattern or Intermediate Results Storage pattern to provide details about intermediate processing steps.
Details about the operation(s) applied to the data during each processing step are automatically added in a machine-readable format to the output of the respective processing step as metadata. A form of interface, textual or graphical, is provided for the user to view the metadata. The application of the Automated Processing Metadata Insertion pattern also facilitates testing, debugging and code management.
In the diagram, a statistic, x, needs to be computed from a dataset. The computation involves multiple processing steps consisting of data cleansing, data transformation and the application of an algorithm. At the end of each step, metadata is added to the output, with details about the operations performed on the input. When the user views the statistic, they have high confidence in its validity because they have access to the metadata that tells them how the statistic was calculated.
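The scenario described above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the run_step helper, the sample values and the choice of the arithmetic mean as the algorithm are hypothetical and are not taken from the diagram.

```python
from statistics import mean


def run_step(payload, name, description, fn):
    """Apply fn to the data and append a metadata record describing the step."""
    result = fn(payload["data"])
    record = {"step": name, "operation": description}
    return {"data": result, "metadata": payload["metadata"] + [record]}


# Hypothetical raw dataset containing missing and empty values.
raw = {"data": ["12", "15", None, "18", ""], "metadata": []}

cleansed = run_step(
    raw, "cleansing", "removed missing and empty values",
    lambda rows: [r for r in rows if r not in (None, "")],
)
transformed = run_step(
    cleansed, "transformation", "parsed strings to floats",
    lambda rows: [float(r) for r in rows],
)
computed = run_step(
    transformed, "algorithm", "arithmetic mean",
    lambda rows: mean(rows),
)

x = computed["data"]

# A simple textual view of how x was calculated.
print(f"x = {x}")
for i, rec in enumerate(computed["metadata"], start=1):
    print(f"{i}. {rec['step']}: {rec['operation']}")
```

Each intermediate payload carries the metadata of every step applied so far, so the trail can be inspected at any point in the run, not only alongside the final statistic.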