File-based Source (Buhler, Erl, Khattak)
How can large amounts of unstructured data be imported into a Big Data platform from a variety of different sources in a reliable manner?
A file data transfer engine mechanism that internally employs an agent-based system is used. The engine is configured with the location(s) of the data source(s) and the target location(s). Using polling or filesystem capabilities, such as a file watcher component, the agents scan the configured locations for new files. When files appear in those locations, they are forwarded to the target location(s) in the Big Data platform.
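The polling variant of such an agent can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not how any particular file data transfer engine is implemented: the function names `poll_once` and `run_agent`, the in-memory `seen` set, and the use of local directories for both source and target are all hypothetical choices made for the example.

```python
import shutil
import time
from pathlib import Path


def poll_once(source_dir: Path, target_dir: Path, seen: set) -> list:
    """Scan source_dir once and copy files not seen before into target_dir.

    `seen` holds the names of files already forwarded (an assumption of this
    sketch; a real engine would likely persist this state).
    """
    target_dir.mkdir(parents=True, exist_ok=True)
    forwarded = []
    for path in sorted(source_dir.iterdir()):
        if path.is_file() and path.name not in seen:
            destination = target_dir / path.name
            shutil.copy2(path, destination)  # copy contents and metadata
            seen.add(path.name)
            forwarded.append(destination)
    return forwarded


def run_agent(source_dir: Path, target_dir: Path,
              interval_seconds: float = 5.0, rounds: int = 3) -> None:
    """Poll a bounded number of times, sleeping between scans."""
    seen = set()
    for _ in range(rounds):
        for copied in poll_once(source_dir, target_dir, seen):
            print(f"forwarded {copied}")
        time.sleep(interval_seconds)
```

Note that this sketch copies a file as soon as it is visible, so a file that is still being written could be forwarded incompletely. A production agent would need some way to detect that a file is complete before forwarding it.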
It should be noted that this pattern can also be used to ingest semi-structured data, such as web server log files. Whether importing semi-structured or unstructured data, the File-based Source pattern is only applicable for batch ingress of data. Furthermore, this pattern is normally applied together with the Data Size Reduction pattern in order to reduce the storage footprint of the data before it is persisted to the storage device.
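As a rough illustration of reducing the footprint before persisting, the sketch below compresses a file with gzip while writing it to the target location. The choice of gzip is an assumption made for this example; the text does not say which technique the Data Size Reduction pattern prescribes. The function name `persist_compressed` and the `.gz` naming convention are likewise hypothetical.

```python
import gzip
import shutil
from pathlib import Path


def persist_compressed(source_file: Path, target_dir: Path) -> Path:
    """Write a gzip-compressed copy of source_file into target_dir.

    Returns the path of the compressed file. gzip is an assumed stand-in
    for whatever size-reduction technique the platform actually uses.
    """
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / (source_file.name + ".gz")
    # Stream the bytes so large files are not loaded into memory at once.
    with source_file.open("rb") as src, gzip.open(target, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return target
```

Repetitive text such as web server logs tends to compress well this way, whereas already-compressed media such as video usually gains little.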
The manual copying of files is automated through the introduction of a system into the Big Data platform that can be configured in a centralized manner to look for files at more than one location. Such a system removes the inefficiencies linked with the ad-hoc copying of files and provides a central interface for configuring multiple data sources.
- A user configures the file data transfer engine mechanism to import data from Data Sources A and B.
- Files containing textual data are automatically copied from Data Source A by the file data transfer engine.
- The file data transfer engine then automatically inserts the textual data into the configured storage device.
- Files containing videos are automatically copied from Data Source B by the file data transfer engine.
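The steps above can be sketched as a single, centrally defined configuration that lists each data source with its target. Everything named here is hypothetical and chosen only for illustration: the `Route` dataclass, its fields, the `run_transfer` function, and the use of filename patterns to select textual or video files.

```python
import shutil
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class Route:
    """One configured data source and where its files should land."""
    name: str
    source_dir: Path
    target_dir: Path
    pattern: str  # glob pattern, e.g. "*.txt" or "*.mp4"


def run_transfer(routes: list) -> dict:
    """Copy matching files for every route; return files copied per route."""
    copied = {}
    for route in routes:
        route.target_dir.mkdir(parents=True, exist_ok=True)
        count = 0
        for path in sorted(route.source_dir.glob(route.pattern)):
            if path.is_file():
                shutil.copy2(path, route.target_dir / path.name)
                count += 1
        copied[route.name] = count
    return copied
```

A caller would build one `Route` for Data Source A (for example with the pattern `"*.txt"`) and one for Data Source B (for example `"*.mp4"`), then pass both to `run_transfer`. Adding a further source then means adding an entry to the list rather than writing another ad-hoc copy script.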