Direct Data Access (Buhler, Erl, Khattak)
How can large amounts of raw data be analyzed in place by contemporary data analytics tools without having to export data?
A two-way connector enables a direct connection between the analytics tool and the Big Data platform. To access data stored in the platform, the user first specifies a connection string, which the connector uses to locate the underlying resource to connect to and the file or dataset to retrieve. Each type of resource generally requires its own connector. The actual connection is made between the analytics tool and either the query engine or the storage device. Once connected, the user specifies the operations to perform on the data. At runtime, the connector connects to the storage device or the query engine and retrieves the data that the analytics tool needs to execute those operations.
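As a rough illustration of how a connection string might identify both the resource and the dataset, the sketch below parses a URL-style string and selects a connector type from its scheme. The scheme names `bdstore` and `bdquery`, the `Target` class, and the `resolve` function are hypothetical and chosen only for this example; real connectors define their own connection string formats.

```python
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass
class Target:
    kind: str     # "storage" or "query_engine"
    host: str     # where the resource lives
    dataset: str  # file or dataset to retrieve


# Hypothetical mapping: one connector type per resource type.
CONNECTOR_KINDS = {
    "bdstore": "storage",
    "bdquery": "query_engine",
}


def resolve(connection_string: str) -> Target:
    """Locate the resource and dataset named by a connection string."""
    parts = urlparse(connection_string)
    kind = CONNECTOR_KINDS.get(parts.scheme)
    if kind is None:
        raise ValueError(f"no connector registered for scheme {parts.scheme!r}")
    return Target(kind=kind, host=parts.netloc, dataset=parts.path.lstrip("/"))


target = resolve("bdstore://node1:9000/data/sales.csv")
print(target.kind, target.host, target.dataset)
```

Keeping the scheme-to-connector mapping in one table reflects the idea that a separate connector serves each resource type: supporting a new resource means registering one more entry rather than changing the analytics tool.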
Functionality is added to the analytics tool so that it can connect directly to the Big Data platform. Depending on the type of functionality required, the connection is made to either the query engine or the storage device.
- A large dataset is stored in a storage device.
- A data analyst uses a contemporary analytics tool to apply a machine learning algorithm to the dataset.
- A connection is made to the storage device via a connector.
- The storage device provides the required data via the connector.
- The tool then applies the required machine learning algorithm to the data.
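The steps above can be sketched end to end under some loud assumptions: an in-memory SQLite database stands in for the storage device, the `StorageConnector` class is a hypothetical connector, the sample rows are made up, and a least-squares line fit stands in for the machine learning algorithm. The point is only that the tool reads the data through the connector instead of working from an exported copy.

```python
import sqlite3


class StorageConnector:
    """Hypothetical connector; sqlite3 stands in for the storage device."""

    def __init__(self, conn):
        self._conn = conn

    def fetch(self, table, columns):
        # Pull only the requested columns; nothing is exported to a file.
        # Identifiers are trusted here because this is a sketch.
        cols = ", ".join(columns)
        return self._conn.execute(f"SELECT {cols} FROM {table}").fetchall()


def fit_line(points):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# A dataset is stored in the storage device (made-up sample rows).
device = sqlite3.connect(":memory:")
device.execute("CREATE TABLE readings (x REAL, y REAL)")
device.executemany(
    "INSERT INTO readings VALUES (?, ?)",
    [(1, 3), (2, 5), (3, 7), (4, 9)],
)

# A connection is made via the connector, which provides the required data.
connector = StorageConnector(device)
rows = connector.fetch("readings", ["x", "y"])

# The tool applies the algorithm to the retrieved data.
slope, intercept = fit_line(rows)
print(slope, intercept)
```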