Missing Data Imputation
Datasets intended for analysis often contain missing values. The analyst then faces a choice: ignore, discard, or fill in the gaps. Filling them in often, and quite reasonably, seems to be the preferred solution, but this is not always the case.
A poorly chosen imputation method can not only fail to improve the results but actually make them worse. This article discusses simple and commonly used techniques for handling missing data, along with their advantages and disadvantages.
Excluding or ignoring rows with missing values is the default behavior in some popular software packages, which can mislead novice analysts into thinking it is the best course of action. In addition, there are so-called ad-hoc methods that are easy to implement and apply. This simplicity may be why analysts choose them so often:
- Filling in the gaps with zeros
- Filling in the gaps with median values
- Filling in the gaps with arithmetic mean values
- Introduction of flag variables, etc.
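As an illustration, here is a minimal pandas sketch of these ad-hoc fills; the column name `income` and the toy values are hypothetical and used only to show the operations.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with two gaps.
df = pd.DataFrame({"income": [42.0, 57.5, np.nan, 61.2, np.nan, 48.9]})

df["fill_zero"]   = df["income"].fillna(0)                      # zeros
df["fill_median"] = df["income"].fillna(df["income"].median())  # median
df["fill_mean"]   = df["income"].fillna(df["income"].mean())    # arithmetic mean
df["is_missing"]  = df["income"].isna().astype(int)             # flag variable

print(df)
```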
This simplicity is probably why ad-hoc methods were so widely used at the dawn of modern missing data theory. Although it is now known that these methods can distort the statistical properties of the sample and thus degrade the results obtained after such processing [Horton, 2007], they remain very common.
Several studies examine how such methods are applied to fill gaps in medical research [Burton, 2004; Karahalios, 2012; Rezvan, 2015]. They reveal that even scientists often resort to intuitive ad-hoc methods, or simply ignore or delete data, even though these approaches can compromise the research.
Using ad-hoc methods and deleting rows has many pitfalls that every analyst needs to know about. In this article, we will briefly describe these methods and point out the main problems associated with their use in practice.
Missing Data Types
In order to understand how to handle missing data properly, we need to look at the mechanisms by which it arises. Depending on the mechanism, missing data falls into three types:
- MCAR (Missing Completely At Random) is the case where the probability of a gap is the same for every record in the set. For example, in a survey every tenth respondent is not asked one randomly selected question, while all other questions are answered by all respondents. Here MCAR applies, and ignoring or excluding the records that contain missing data will not distort the results.
- MAR (Missing At Random): In practice, data is usually missing not purely by chance but according to some pattern. Missing data is classified as MAR if the probability of a gap can be explained by other information in the dataset (gender, age, position, education, etc.) that has no gaps itself. In this case, too, deleting the missing data or replacing it with "Null" will not distort the results too severely.
- MNAR (Missing Not At Random) is the case where the probability of a gap depends on factors that are not recorded. The gaps could, in principle, be explained by some attributes, but those attributes are absent from the dataset, so the probability of missing data cannot be deduced from the information it contains.
Let's look at the differences between MAR and MNAR using an example.
People who hold managerial positions and/or were educated at a prestigious university are more likely than other respondents to skip the question about their income. Since job title and education are strongly correlated with income here, the missing data in the income field can no longer be considered completely random; that is, it does not belong to the MCAR type.
If the dataset contains information about the respondents' education and position, then the relationship between this information and the increased probability of a gap in the income field can be expressed mathematically, so the MAR hypothesis holds. In the case of MAR, ignoring the missing data is acceptable.
However, if we do not have information about the position and education, then MNAR applies. With MNAR, it is no longer possible to simply ignore or exclude the missing data, as this will significantly distort the distribution of statistical properties of the sample.
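To make the three mechanisms concrete, here is a small simulated sketch of the income example; the column names, probabilities, and distributions are all hypothetical and chosen only for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000

# Hypothetical survey: managers earn more on average.
manager = rng.integers(0, 2, n)                       # 1 = managerial position
income = 40 + 30 * manager + rng.normal(0, 5, n)
df = pd.DataFrame({"manager": manager, "income": income})

# MCAR: every record has the same 20% chance of a gap in income.
mcar = df.copy()
mcar.loc[rng.random(n) < 0.2, "income"] = np.nan

# MAR: managers skip the income question more often; the probability of a
# gap depends only on the observed 'manager' column.
p = np.where(df["manager"] == 1, 0.4, 0.05)
mar = df.copy()
mar.loc[rng.random(n) < p, "income"] = np.nan

# MNAR: the same mechanism, but the 'manager' column is not recorded,
# so the gap probability cannot be deduced from the data that remain.
mnar = df.copy()
mnar.loc[rng.random(n) < p, "income"] = np.nan
mnar = mnar.drop(columns="manager")

for name, d in [("MCAR", mcar), ("MAR", mar)]:
    share = d["income"].isna().groupby(df["manager"]).mean().round(2)
    print(name, "share of missing income by position:", dict(share))
print("MNAR share of missing income overall:",
      round(mnar["income"].isna().mean(), 2))
```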
Let's look at simple methods for handling missing data and related problems.
Deletion/Ignoring of Missing Data
Complete Case Analysis (or Listwise Deletion) is a method of handling missing data that many application packages use by default. It consists of excluding records (rows) or attributes (columns) containing missing data from the dataset.
In the case of the first missing data type (MCAR), this method will not significantly distort the parameters of the model. However, deleting rows means that not all of the available information is used in subsequent calculations: the sample shrinks, the standard errors of the estimates grow, and the results become less representative. When the data contains many gaps, this becomes a noticeable problem.
For the second (MAR) and particularly the third (MNAR) type, the bias in the statistical properties of the sample, the parameter estimates of the constructed models, and the standard deviations all become even more pronounced.
Thus, despite its widespread use, the range of practical problems for which this method is suitable is limited.
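A minimal sketch of Complete Case Analysis with pandas; the toy dataset and column names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical dataset with gaps in two columns.
df = pd.DataFrame({
    "age":    [25, 31, np.nan, 47, 52, 38],
    "income": [48.0, np.nan, 55.1, 61.2, np.nan, 50.3],
})

# Complete Case Analysis: any row with at least one gap is dropped.
complete = df.dropna()
print(f"{len(df)} rows before, {len(complete)} rows after listwise deletion")
print(complete)
```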
Available Case Analysis (or Pairwise Deletion) is a processing method that simply ignores missing values in calculations. Like Complete Case Analysis, it is often applied by default.
Statistics such as means and standard deviations are calculated for each attribute (column) from all of its available, non-missing values. As with Complete Case Analysis, if the MCAR hypothesis holds, this method will not distort the model parameters significantly.
The advantage of this approach is that all available information is used for model construction.
There are two main drawbacks to this method. First, it cannot be used to calculate every indicator. Second, it often requires more complex algorithms and more computation. On top of that, the estimates it produces can be mutually inconsistent.
For example, the calculated correlation coefficients may fall outside the range [-1; 1]. In addition, it is not always clear which sample size should be used when computing standard deviations, since different statistics are based on different numbers of observations.
Here is an example that demonstrates the problems of Available Case Analysis.
Imagine that you need to calculate the linear correlation coefficient (Pearson correlation coefficient) between two variables X and Y, the true values of which are shown in Table 1.
X | Y | X | Y | X | Y | X | Y |
---|---|---|---|---|---|---|---|
1 | -0.1983 | 9 | -7.9492 | 11 | -12.1990 | 18 | -14.9021 |
1 | 0.3767 | 6 | -7.3456 | 10 | -11.0234 | 17 | -18.9796 |
5 | -2.2939 | 7 | -9.8047 | 11 | -14.2284 | 19 | -19.5486 |
4 | -3.0886 | 7 | -10.6003 | 14 | -13.1966 | 20 | -20.5562 |
5 | -3.1942 | 14 | -9.9677 | 16 | -15.8271 | 21 | -21.0093 |
Table 1. A dataset with no gaps
Based on Table 1, we will determine the true values of statistical parameters.
- Average value of X = 10.8000.
- Average value of Y = -10.7768.
Estimation of covariance:
- \sigma_{11} = \sum X_i^2 / n - (\sum X_i / n)^2 = 37.7600,
- \sigma_{12} = \sum (X_i \cdot Y_i) / n - (\sum X_i \cdot \sum Y_i) / n^2 = -38.1691,
- \sigma_{22} = \sum Y_i^2 / n - (\sum Y_i / n)^2 = 42.6608,
where n is the number of observations (n=20).
Pearson correlation coefficient:
r = \sigma_{12} / \sqrt{\sigma_{11} \cdot \sigma_{22}} = -0.9510
Let's conduct similar calculations when there are gaps in the data (see Table 2).
X | Y | X | Y | X | Y | X | Y |
---|---|---|---|---|---|---|---|
? | -0.1983 | 9 | -7.9492 | 11 | -12.1990 | 18 | -14.9021 |
? | 0.3767 | 6 | -7.3456 | 10 | -11.0234 | 17 | -18.9796 |
5 | -2.2939 | 7 | -9.8047 | 11 | -14.2284 | 19 | -19.5486 |
4 | -3.0886 | 7 | -10.6003 | 14 | -13.1966 | 20 | -20.5562 |
5 | -3.1942 | 14 | -9.9677 | 16 | -15.8271 | 21 | -21.0093 |
Table 2. A dataset with gaps (the first two X values are missing)
Here, we are working with the same dataset as in Table 1, with the only difference that in this case we do not know the first two values of the variable X.
Applying Available Case Analysis, we calculate the average value using all available information, that is, for the variable X based on 18 known values, and for the variable Y based on all 20 values.
- Average of X = 11.8889.
- Average of Y = -10.7768.
Estimation of covariance:
- \sigma_{11} = 30.0988,
- \sigma_{12} = -43.6174,
- \sigma_{22} = 60.2952.
Here n is the number of observations for which both the X and Y values are known (n = 18).
The calculated correlation coefficient value is
r = -1.0239.
Because Available Case Analysis computes the individual terms on different subsets of the data, it skews the estimates and pushes the correlation coefficient below -1. This violates the theoretical range of [-1; 1], so the result has no physical meaning.
If we calculate the value of the correlation coefficient after applying Complete Case Analysis, we get the value of -0.9311.
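These calculations can be reproduced with a short NumPy sketch using the data of Tables 1 and 2. Which sample each term is computed over is an interpretation consistent with the figures quoted above (note, in particular, that \sigma_{22} divides the sum of all twenty squared Y values by the 18 complete pairs); the printed values should match the ones above up to rounding.

```python
import numpy as np

# Data from Table 1; in Table 2 the first two X values are unknown.
x = np.array([1, 1, 5, 4, 5, 9, 6, 7, 7, 14,
              11, 10, 11, 14, 16, 18, 17, 19, 20, 21], dtype=float)
y = np.array([-0.1983, 0.3767, -2.2939, -3.0886, -3.1942,
              -7.9492, -7.3456, -9.8047, -10.6003, -9.9677,
              -12.1990, -11.0234, -14.2284, -13.1966, -15.8271,
              -14.9021, -18.9796, -19.5486, -20.5562, -21.0093])

print("true r:", round(float(np.corrcoef(x, y)[0, 1]), 4))        # about -0.95

x_miss = x.copy()
x_miss[:2] = np.nan                   # Table 2
obs = ~np.isnan(x_miss)               # the 18 complete (X, Y) pairs
n = obs.sum()

# Available Case Analysis: each term uses all the values available to it,
# so different terms are based on samples of different sizes.
mean_x = x_miss[obs].mean()           # mean of X over 18 values
mean_y = y.mean()                     # mean of Y over all 20 values
s11 = (x_miss[obs] ** 2).mean() - mean_x ** 2
s12 = (x_miss[obs] * y[obs]).mean() - mean_x * mean_y
s22 = (y ** 2).sum() / n - mean_y ** 2
print("Available Case r:", round(float(s12 / np.sqrt(s11 * s22)), 4))  # about -1.02

# Complete Case Analysis: drop the two incomplete rows entirely.
print("Complete Case r:",
      round(float(np.corrcoef(x_miss[obs], y[obs])[0, 1]), 4))         # about -0.93
```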
When the MCAR hypothesis is not met, both Available Case Analysis and Complete Case Analysis lead to significant distortions of the statistical properties of the sample (mean, median, variance, correlation, etc.).
Another issue with these two methods of handling missing data is that deleting rows from a dataset is sometimes simply not an option. Post-processing procedures often require all rows and columns to be present in the calculations, and when each column has only a few gaps but almost every row is missing at least one value, listwise deletion would throw away nearly the entire dataset.
In the next parts of the article we will look at some methods that involve filling in gaps based on the information available. Often these techniques are collectively called Single-Imputation Methods.
Mean Substitution
The name of the method speaks for itself: The gaps are filled in with the mean value. Variations of this technique include replacing gaps with zero, median, etc.
However, all subtypes of this approach share the same disadvantages. To see them, let's look at two simple cases: filling the gaps of a continuous attribute with the arithmetic mean, and filling the gaps of a categorical attribute with the mode.
Example 1. Figure 1 shows the distribution of continuous attribute values before and after filling in the gaps with the mean value.
Figure 1 demonstrates that the distribution after filling in the gaps looks extremely unnatural. This distorts all of the indicators that characterize the distribution (except the mean itself): the variance and standard deviation are understated and the correlations with other attributes are weakened.
Thus, this method leads to a significant distortion of the attribute distribution even in the case of MCAR.
Example 2. In the case of categorical discrete attributes, mode-filling is most often used.
Figure 2 shows the distribution of the categorical characteristic before and after filling in the gaps.
Thus, filling in the gaps of a categorical attribute with a mode has the same drawbacks as filling in the gaps of a continuous attribute with the arithmetic mean (zero, median, etc.).
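A small simulated sketch makes the distortion visible; the data, the 30% missing rate, and the class probabilities are hypothetical. After mean filling, the standard deviation shrinks and the correlation with another attribute weakens; after mode filling, the share of the modal class is inflated.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 500

# Hypothetical continuous attribute y, correlated with x; 30% of y is MCAR.
x = rng.normal(0, 1, n)
y = 2 * x + rng.normal(0, 1, n)
y_obs = pd.Series(np.where(rng.random(n) < 0.3, np.nan, y))

y_filled = y_obs.fillna(y_obs.mean())             # mean substitution
mask = y_obs.notna().to_numpy()
print("std of observed values:", round(float(y_obs.std()), 3))
print("std after mean filling:", round(float(y_filled.std()), 3))   # shrinks
print("corr(x, y) on complete pairs:",
      round(float(np.corrcoef(x[mask], y_obs[mask])[0, 1]), 3))
print("corr(x, y) after mean fill  :",
      round(float(np.corrcoef(x, y_filled)[0, 1]), 3))              # attenuated

# Hypothetical categorical attribute with 30% of values missing:
# filling with the mode inflates the most frequent class.
cat = pd.Series(rng.choice(["A", "B", "C"], n, p=[0.5, 0.3, 0.2]))
cat[rng.random(n) < 0.3] = np.nan
cat_filled = cat.fillna(cat.mode()[0])
print(cat_filled.value_counts(normalize=True).round(2))
```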
Last Observation Carried Forward
The LOCF (Last Observation Carried Forward) method fills each gap with the value of the last known observation. It is frequently used to fill gaps in time series, where each value is strongly related to the previous ones.
Let's have a look at two cases when the use of LOCF is justified.
Case 1. If we measure open air temperature at a certain geographical point once a minute, then under normal conditions — not during natural disasters — we can expect the temperature not to change dramatically (by 10-20 °C) in such a short time interval between subsequent measurements. Therefore, filling in the gaps with the previous known value is reasonable.
Case 2. If the data is the result of simultaneous measurements in neighboring geographical locations (say, air temperature measurements again), then we can assume that the values will not change significantly from one point to the next in the dataset. In this case, LOCF is applicable again.
Situations where the use of LOCF is justified are not limited to these two cases.
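For the time series case, LOCF is a one-liner in pandas; the temperature values below are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical minute-by-minute air temperature readings with gaps.
idx = pd.date_range("2024-06-01 12:00", periods=8, freq="min")
temp = pd.Series([21.4, 21.5, np.nan, 21.6, 21.6, np.nan, np.nan, 21.8], index=idx)

# LOCF: carry the last known observation forward.
# A limit can be set so that long runs of gaps are not filled blindly.
temp_locf = temp.ffill(limit=2)
print(pd.DataFrame({"observed": temp, "LOCF": temp_locf}))
```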
While the LOCF method seems reasonable in certain scenarios, it can introduce significant distortions in the statistical properties of the sample even for MCAR data [Molenberghs, 2007]. For instance, LOCF can duplicate outliers by propagating an observed value that is itself an outlier into the gaps that follow it. It also becomes unsuitable when many consecutive values are missing: the assumption of small changes between neighboring observations is then violated, potentially leading to inaccurate results.
Indicator Method
Indicator Method is a technique that replaces missing values with zeros and adds a special indicator attribute (flag) that gets value 0 in the cases where the data was not initially missing, and value 1 for the records where there was previously missing data [Miettinen, 1985].
For better understanding, let's look at an example.
Example. Table 3 shows the data before filling in the gaps.
Value | 12.7 | 7.5 | ? | 3.1 | ? | ? | 5 | 5.8 | 3.7 |
---|---|---|---|---|---|---|---|---|---|
Missing Data | | | Yes | | Yes | Yes | | | |
Table 3. A dataset before filling in the gaps
The "?" sign indicates missing data in the dataset.
Table 4 shows the data after filling in the gaps.
Value | 12.7 | 7.5 | 0 | 3.1 | 0 | 0 | 5 | 5.8 | 3.7 |
---|---|---|---|---|---|---|---|---|---|
Flag | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 |
Table 4. The dataset after filling in the gaps
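A minimal pandas sketch of the Indicator Method that reproduces the transformation from Table 3 to Table 4:

```python
import numpy as np
import pandas as pd

# Values from Table 3; NaN marks the gaps shown there as "?".
value = pd.Series([12.7, 7.5, np.nan, 3.1, np.nan, np.nan, 5.0, 5.8, 3.7])

flag = value.isna().astype(int)    # 1 where the value was missing, 0 otherwise
filled = value.fillna(0)           # the gaps replaced with zero

print(pd.DataFrame({"Value": filled, "Flag": flag}).T)   # reproduces Table 4
```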
In practice, some variations of this method fill the gaps with a value other than zero (for example, with the average). An important point: with this kind of filling the indicator can also be defined the other way around, that is, 0 where the value in the source data was initially missing and a non-zero value where the source field contained data.
When the gaps are filled with non-zero values, the flag field and the source field are often correlated with each other.
The advantages of this method include the following:
- Using the whole dataset
- Making missing values explicit
Despite these advantages, even with the MCAR hypothesis met and a small number of missing values, this approach can lead to significant distortion of the results [Vach, 1991, Knol, 2010].
Regression Imputation
This method restores the missing values using a linear regression model built on the known values in the dataset.
Figure 3 shows the results of restoration of missing values of attribute 1 based on the known values of attribute 2.
Linear regression produces plausible-looking filled values. However, real data always shows some natural scatter around the regression line, and this scatter disappears when the gaps are filled with the regression predictions. As a result, the variance of the attribute becomes smaller and the correlation between attribute 2 and attribute 1 is artificially strengthened. The method performs worse as the natural scatter of the attribute grows and as the share of missing rows increases.
It is worth noting that there is a method that solves this problem — the stochastic linear regression method illustrated in Figure 4.
The stochastic regression model reflects not only the linear relationship between the attributes but also the deviations from that linear dependence. It retains the advantages of regression imputation while distorting the correlation coefficients much less.
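Here is a small simulated sketch of both variants (the attribute names, coefficients, and 30% missing rate are hypothetical): a regression of attribute 1 on attribute 2 is fitted on the complete cases, and the gaps are filled either with the bare predictions or with predictions plus noise drawn from the residual spread.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 400

# Hypothetical correlated attributes; attribute 1 has 30% MCAR gaps.
attr2 = rng.normal(0, 1, n)
attr1 = 1.5 * attr2 + rng.normal(0, 0.8, n)
missing = rng.random(n) < 0.3
attr1_obs = np.where(missing, np.nan, attr1)

# Fit a simple linear regression of attribute 1 on attribute 2 (complete cases only).
slope, intercept = np.polyfit(attr2[~missing], attr1_obs[~missing], 1)
pred = intercept + slope * attr2
resid_std = np.std(attr1_obs[~missing] - pred[~missing])

# Deterministic regression imputation: imputed points lie exactly on the line.
det = np.where(missing, pred, attr1_obs)

# Stochastic regression imputation: add noise drawn from the residual spread.
sto = np.where(missing, pred + rng.normal(0, resid_std, n), attr1_obs)

for name, filled in [("regression", det), ("stochastic regression", sto)]:
    print(f"{name:22s} corr = {np.corrcoef(attr2, filled)[0, 1]:.3f}, "
          f"std = {filled.std():.3f}")
```

The deterministic version typically shows a shrunken standard deviation and an inflated correlation, while the stochastic version stays closer to the properties of the original data.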
Of all the methods that we have discussed in this part of the article, filling in the gaps using stochastic linear regression generally leads to the least distortion in the statistical properties of the sample. Moreover, when there is a purely linear dependence between the attributes, the stochastic linear regression method is often superior to even more complex methods.
Conclusion
In this article, we have considered simple methods for filling in the gaps. Although these methods can significantly distort the statistical properties of the dataset (such as the mean, the median, variance, and correlation) even in the case of MCAR, they remain popular not only among ordinary users but also in the scientific community (at least in fields related to medicine).
A study by Burton (2004) examined 100 cancer research papers published in 2002. The authors of 82% of these papers reported encountering missing data. Burton's analysis further revealed that among the 32 papers that discussed their missing data handling methods:
- 12 used Complete Case Analysis;
- 12 used Available Case Analysis;
- 4 used the Indicator Method;
- 3 used ad-hoc (improvised) methods;
- only 1 paper employed a more complex method.
Fast forward a decade, and a similar trend persists. Karahalios (2012) reviewed scientific papers and found that Complete Case Analysis remained the dominant approach, used in 54% of the cases (in 21 articles). Other methods used included:
- LOCF in 7 cases;
- Mean Substitution in 3 cases;
- the Indicator Method in 1 case.
More sophisticated techniques for filling in gaps (such as multiple imputation or maximum likelihood methods) are still rarely used in scientific medical articles [Rezvan, 2015].
In conclusion, let us note that simple methods such as row deletion or ad-hoc imputation do not always degrade the results; moreover, when their use is justified, they are preferable.