Feature Imputation (Khattak)
How can a dataset with missing feature values be used for model development without having to delete entire rows or columns of valuable data?
One way to use a dataset containing missing values is to delete the rows or columns in which they occur. However, this discards valuable data that could have contributed to a more accurate model.
Instead of deleting data, the missing feature values are inferred from the rest of the data through the application of statistical techniques or machine learning algorithms.
Statistical measures, such as the mean, median, or mode, or machine learning algorithms, such as k-NN and linear regression, are applied to the dataset to estimate the values of the missing fields.
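The two families of techniques named above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not a reference implementation: the helper names `impute_column` and `knn_impute`, the use of `None` to mark a missing value, and the toy data are all assumptions made for the example. The k-NN variant also assumes that only the target column has gaps and that the other columns are numeric.

```python
import math
import statistics


def impute_column(values, strategy="mean"):
    """Fill None entries with a statistic of the observed entries (hypothetical helper)."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    stat = {
        "mean": statistics.mean,
        "median": statistics.median,
        "mode": statistics.mode,
    }[strategy]
    fill = stat(observed)
    return [fill if v is None else v for v in values]


def knn_impute(rows, target, k=2):
    """Fill None in column `target` with the mean of that column over the
    k nearest complete rows; distance is Euclidean on the other columns.
    Assumes the other columns contain no missing values."""
    others = [i for i in range(len(rows[0])) if i != target]
    complete = [r for r in rows if r[target] is not None]
    result = []
    for row in rows:
        filled = list(row)
        if row[target] is None:
            nearest = sorted(
                complete,
                key=lambda r: math.dist(
                    [r[i] for i in others], [row[i] for i in others]
                ),
            )[:k]
            filled[target] = statistics.mean(r[target] for r in nearest)
        result.append(filled)
    return result


# Toy data (assumed for illustration only).
feature_b = [4.0, None, 6.0, 8.0, None]
mean_filled = impute_column(feature_b, "mean")

rows = [[1.0, 10.0], [2.0, 20.0], [3.0, None], [10.0, 100.0]]
knn_filled = knn_impute(rows, target=1, k=2)
```

Mean, median, and mode imputation use only the column that has gaps, whereas the k-NN variant draws on the other features of similar rows, which is why it needs those features to be present and comparable in scale.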
Query Engine, Analytics Engine, Processing Engine, Resource Manager, Storage Device
A training dataset contains missing values for Feature B and cannot be used to train a model (1). The imputation technique is applied in order to fill in the missing values (2). In the resulting dataset, the missing values are imputed by using the mean value of Feature B (3). The imputed dataset is then used to train a model (4, 5). The resulting model’s accuracy is within the expected range (6).
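The steps in this scenario can be traced with a small, self-contained sketch. Everything in it is assumed for illustration: the toy rows, the use of `None` for a missing Feature B, and the nearest-centroid classifier, which merely stands in for whatever model the scenario trains. Only the mean imputation of Feature B comes from the description above.

```python
import statistics

# Step 1: toy training rows (feature_a, feature_b, label).
# None marks a missing Feature B value; the data is invented for this sketch.
train = [
    (1.0, 2.0, 0), (1.5, None, 0), (2.0, 3.0, 0),
    (6.0, 8.0, 1), (6.5, None, 1), (7.0, 9.0, 1),
]

# Steps 2-3: replace each missing Feature B with the mean of the observed values.
b_mean = statistics.mean(b for _, b, _ in train if b is not None)
imputed = [(a, b_mean if b is None else b, y) for a, b, y in train]


# Steps 4-5: train a model on the imputed dataset.
# A nearest-centroid classifier is used here only because it is short.
def fit_centroids(rows):
    centroids = {}
    for label in {y for _, _, y in rows}:
        members = [(a, b) for a, b, y in rows if y == label]
        centroids[label] = (
            statistics.mean(a for a, _ in members),
            statistics.mean(b for _, b in members),
        )
    return centroids


def predict(centroids, a, b):
    # Return the label whose centroid is closest in squared Euclidean distance.
    return min(
        centroids,
        key=lambda label: (a - centroids[label][0]) ** 2
        + (b - centroids[label][1]) ** 2,
    )


model = fit_centroids(imputed)

# Step 6: measure accuracy on held-out rows that have no missing values.
held_out = [(1.2, 2.5, 0), (6.8, 8.5, 1)]
accuracy = sum(predict(model, a, b) == y for a, b, y in held_out) / len(held_out)
```

Without steps 2 and 3, the two rows with a missing Feature B would have to be dropped or would break the centroid computation, which is the situation the pattern is meant to avoid.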