Feature Imputation | Arcitura Patterns

Feature Imputation (Khattak)

How can a dataset with missing feature values be used for model development without having to delete entire rows or columns of valuable data?

Problem

One way to utilize a dataset containing missing values is to delete entire rows or columns of data. However, this comes at the expense of losing valuable data that could have contributed towards developing a more accurate model.

Solution

Instead of deleting data, missing feature values are inferred from the observed data, either from the same feature or from the other features, through the application of statistical techniques or machine learning algorithms.

Application

Statistical techniques, such as the mean, median, or mode, or machine learning algorithms, such as k-NN and linear regression, are applied to the dataset to estimate the values of the missing fields.
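The two families of techniques named above can be sketched in a few lines of standard-library Python. This is a minimal illustration rather than a reference implementation: the function names `impute_statistic` and `impute_knn`, the use of `None` as the missing-value marker, and the assumption that all non-target columns are numeric and fully observed are choices made only for this example.

```python
import math
import statistics


def impute_statistic(values, strategy="mean"):
    """Fill None entries in one column with a statistic of its observed entries."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("column has no observed values to impute from")
    statistic = {
        "mean": statistics.mean,
        "median": statistics.median,
        "mode": statistics.mode,  # also works for categorical values
    }[strategy]
    fill = statistic(observed)
    return [fill if v is None else v for v in values]


def impute_knn(rows, target, k=2):
    """Fill None in column `target` using the k nearest complete rows.

    Distance is Euclidean over the other columns (assumed numeric and
    fully observed); the fill value is the neighbours' mean target value.
    """
    others = [i for i in range(len(rows[0])) if i != target]
    complete = [r for r in rows if r[target] is not None]
    result = []
    for row in rows:
        filled = list(row)
        if row[target] is None:
            point = [row[i] for i in others]
            nearest = sorted(
                complete,
                key=lambda c: math.dist(point, [c[i] for i in others]),
            )[:k]
            filled[target] = statistics.mean([n[target] for n in nearest])
        result.append(filled)
    return result


# Hypothetical data: the second column is missing in one row.
rows = [[1.0, 10.0], [2.0, None], [3.0, 30.0], [10.0, 100.0]]
print(impute_statistic([4.0, None, 6.0, 8.0], "mean"))
print(impute_knn(rows, target=1, k=2))
```

A statistic-based fill assigns the same value to every gap in a column, while the k-NN variant lets the fill vary with the other features of each row. In practice, a library implementation would normally be preferred over hand-rolled code like this.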

Mechanisms

Query Engine, Analytics Engine, Processing Engine, Resource Manager, Storage Device

A training dataset contains missing values for Feature B and cannot be used to train a model (1). The imputation technique is applied in order to fill in the missing values (2). In the resulting dataset, the missing values are imputed by using the mean value of Feature B (3). The imputed dataset is then used to train a model (4, 5). The resulting model’s accuracy is within the expected range (6).
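Steps (1) to (3) of this scenario can be traced with a small, made-up dataset. The rows, the column layout (Feature A, Feature B, label), and the variable names below are assumptions for illustration only. The model training in steps (4) to (6) is left out because the pattern does not prescribe a particular algorithm.

```python
import statistics

# Hypothetical training rows: (feature_a, feature_b, label).
# None marks a missing Feature B value.
training_rows = [
    (1.0, 10.0, 0),
    (2.0, None, 0),
    (3.0, 30.0, 1),
    (4.0, None, 1),
    (5.0, 20.0, 1),
]

# (1) Count the rows that cannot be used for training as they stand.
missing_count = sum(1 for _, b, _ in training_rows if b is None)

# (2) Compute the mean of the observed Feature B values.
feature_b_mean = statistics.mean(
    [b for _, b, _ in training_rows if b is not None]
)

# (3) Replace every missing Feature B value with that mean.
imputed_rows = [
    (a, feature_b_mean if b is None else b, label)
    for a, b, label in training_rows
]

# For comparison: the row-deletion alternative described under Problem.
dropped_rows = [row for row in training_rows if row[1] is not None]

print(f"rows with missing Feature B: {missing_count}")
print(f"mean of observed Feature B:  {feature_b_mean}")
print(f"rows kept by imputation:     {len(imputed_rows)}")
print(f"rows kept by deletion:       {len(dropped_rows)}")
```

With this toy data, imputation keeps all five rows, whereas deleting incomplete rows would leave only three for training.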

This pattern is covered in Machine Learning Module 2: Advanced Machine Learning.

For more information regarding the Machine Learning Specialist curriculum, visit www.arcitura.com/machinelearning.