# Missing data imputation

Often there are missing data in the data requiring processing, so the analyst is faced with a choice: to ignore, discard, or fill in the missing values. Filling in the gaps often and quite reasonably seems to be the preferred solution. However, this is not always the case.

An unsuccessful choice of the method of filling in the gaps can not only be useless in terms of improvement, but can also deteriorate the results. This article discusses simple methods for processing of missing data that are widely used in practice, their advantages, and disadvantages.

Exclusion and ignoring of rows with missing values has become the default solution in some popular application packages, which may give novice analysts the impression that this solution is the right one. In addition, there are quite easy-to-implement and easy-to-use methods for missing data processing, called ad-hoc methods, the simplicity of which may be the reason for them being chosen:

- filling in the gaps with zeros;
- filling in the gaps with median values;
- filling in the gaps with arithmetic mean values;
- introduction of indicator variables, etc.

It is probably because of its simplicity that ad-hoc methods were widely used at the dawn of modern missing data processing theory. Although it is currently known that the use of these methods can lead to distortion of the statistical properties of the sample and, as a result, to deterioration of the results obtained after such processing of missing data [Horton, 2007], they are still used quite often.

For example, there are articles on collection and evaluation of statistical data on the use of methods for filling in the gaps in medical research papers [Burton, 2004, Karahalios, 2012, Rezvan, 2015], from the results of which it can be concluded that even scientists often prefer intuitive ad-hoc methods and ignoring/deleting of lines, despite the fact that the use of these methods in the context of the posed problem is sometimes inappropriate.

Using ad-hoc methods and deleting of rows has many pitfalls that every analyst needs to know about. In this article, we will briefly describe these methods and point out the main problems associated with their use in practice.

## Missing Data Mechanisms

In order to understand how to properly handle missing data, you need to determine the mechanisms for their formation. There are the following 3 missing-data mechanisms: MCAR, MAR, MNAR:

**MCAR**(Missing Completely At Random) is a missing data mechanisms, in which the probability of omission for each record of the set is equal. For example, if during a sociological survey one randomly selected question was not asked to every tenth respondent, and all other questions were answered by the respondents, MCAR is applied. In this case, ignoring or excluding of records that contain missing data does not lead to distortion of the results.**MAR**(Missing At Random) — in practice, data is usually missing not by chance, but due to some regular pattern. The missing data are classified as MAR if the probability of missing data can be determined based on other information available in the data set (gender, age, position, education...) that does not contain missing data. In this case, deleting or replacing of the missing data with the "Missing Data" value, as in the case of MCAR, will not significantly distort the results.**MNAR**(Missing Not At Random) — is a missing data mechanisms, in which the amount of data missing depends on unknown factors. MNAR assumes that the probability of missing data could be described based on other attributes, but there is no information about these attributes in the data set. As a result, the probability of missing data cannot be expressed based on the information contained in the data set.

Let's look at the differences between the MAR and MNAR mechanisms using an example.

People who hold managerial positions and/or have received education at a prestigious University are more likely than other respondents to not answer the question about their income. Since the position and education are strongly correlated with the income in this case, the missing data in the income field can no longer be considered completely random, that is, it is not possible to talk about applicability of MCAR.

If the data set contains information about the respondent's education and position, then the relationship between the increased probability of missing data in the income graph and this information can be expressed mathematically, therefore, the MAR hypothesis is fulfilled. In the case of MAR, missing data is acceptable.

However, if we do not have information about the position and education, then there MNAR is applied. With MNAR, it is no longer possible to simply ignore or exclude the missing data, as this will significantly distort the distribution of statistical properties of the sample.

Let's look at simple methods for handling missing data and related problems.

## Deletion/Ignoring of Missing Data

**Complete-case Analysis** (or Listwise Deletion Method) is a method for missing data handling that is used in many application packages as the default method. It consists in excluding records/rows or attributes/columns containing missing data from the data set.

In the case of the first missing data mechanism (MCAR), this method will not significantly distort the parameters of teh model. However, deletion of rows means that not all available information is used for further calculations, the standard deviations increase, and the results obtained become less representative. In cases where there are a lot of gaps in the data, this becomes a noticeable problem.

In addition, in the case of the second (MAR), and especially the third mechanism (MNAR) th offset of the statistical properties of the sample, parameter values of the constructed models, and the increase in the standard deviations become even stronger.

Thus, despite its widespread use, implementation range of this method for solution of practical problems is limited.

**Available-case analysis** (or Pairwise Deletion) — processing methods based on ignoring of missing data in calculations. These methods, just as Complete-Case Analysis, are also often used by default.

Statistical attributes such as averages and standard deviations can be calculated using all the missing values for each of the attributes/columns. As in the case of Complete-Case Analysis, if the MCAR hypothesis is fulfilled, this method will not distort the model parameters significantly.

The advantage of this approach is that all available information is used during construction of the model.

The main drawback of these methods is that they are not applicable for calculation of all indicators and, as a rule, are associated with algorithmic and computation power related difficulties that lead to incorrect results.

For example, the calculated values of the correlation coefficients may be outside the range [-1; 1]. In addition, it is not always possible to unambiguously answer the question about the optimal choice of the number of samples used for calculation of the standard deviations.

Here is an example that demonstrates the problems of the Available-Case Analysis methods.

Let's consider the following problem: you need to calculate the linear correlation coefficient (Pearson correlation coefficient) between two factors/variables X and Y, the true values of which are shown in Table 1.

X | Y | X | Y | X | Y | X | Y |
---|---|---|---|---|---|---|---|

1 | -0,1983 | 9 | -7,9492 | 11 | -12,1990 | 18 | -14,9021 |

1 | 0,3767 | 6 | -7,3456 | 10 | -11,0234 | 17 | -18,9796 |

5 | -2,2939 | 7 | -9,8047 | 11 | -14,2284 | 19 | -19,5486 |

4 | -3,0886 | 7 | -10,6003 | 14 | -13,1966 | 20 | -20,5562 |

5 | -3,1942 | 14 | -9,9677 | 16 | -15,8271 | 21 | -21,0093 |

Table 1 - Data Set without Missing Data

Based on Table 1, we will determine the true values of statistical parameters.

- Average value X = 10,8000.
- Average value Y = -10,7768.

Estimation of covariance:

- \sigma_{11} = \sum X_i^2 / n - ( \sum X_i / n)^2 = 37,7600,
- \sigma_{12} = \sum (X_i \cdot Y_i) / n - ( \sum X_i \cdot \sum Y_i)/ n^2 = -38,1691,
- \sigma_{22} = \sum Y_i^2 / n - ( \sum Y_i / n)^2 = 42,6608,

where n is the number of observations (n=20).

Сorrelation coefficient value:

\ r = \sigma_{12}/ \sqrt{\sigma_{11} \cdot \sigma_{22}} = -0,9510

Let's look at the results of similar calculations when there are gaps in the data (the data is presented in Table 2).

X | Y | X | Y | X | Y | X | Y |
---|---|---|---|---|---|---|---|

? | -0,1983 | 9 | -7,9492 | 11 | -12,1990 | 18 | -14,9021 |

? | 0,3767 | 6 | -7,3456 | 10 | -11,0234 | 17 | -18,9796 |

5 | -2,2939 | 7 | -9,8047 | 11 | -14,2284 | 19 | -19,5486 |

4 | -3,0886 | 7 | -10,6003 | 14 | -13,1966 | 20 | -20,5562 |

5 | -3,1942 | 14 | -9,9677 | 16 | -15,8271 | 21 | -21,0093 |

Table 2 - Data Set without Missing Data

In other words, we are working with the same data set (as in Table 1), with the only difference that in this case we do not know the first two values of the variable X.

In the Available-Case Analysis approach, we calculate the average value using all available information, that is, for the variable X based on 18 known values, and for the variable Y based on all 20 values.

- Average of X = 11,8889,
- Average of Y = -10,7768,

Estimation of covariance:

- \sigma_{11} = 30,0988,
- \sigma_{12} = -43,6174,
- \sigma_{22} = 60,2952.

The calculations considered n observations for which both values X and Y (n=18) are known.

Correlation coefficient value:

r = -1,0239.

Thus, the calculation of the average value based on the Available-Case Analysis approach led to an offset in this value, which, in turn, was manifested in the calculated value of the correlation coefficient less than -1. Thus, the calculated value went beyond the theoretically possible range [-1; 1], which contradicts the physical meaning.

If we calculate the value of the correlation coefficient in the framework of the Complete-Case Analysis approach, we get the value of the correlation coefficient of -0.9311.

When the MCAR hypothesis is not met, the Available-Case Analysis methods, as well as the Complete-Case Analysis methods, lead to significant distortions of the statistical properties of the sample (mean, median, variance, correlation...).

The disadvantages of the first two methods of missing data handling (Complete-Case Analysis and Available-Case Analysis) include the fact that string exclusion is not always acceptable in principle. Often, post-processing procedures assume that all rows and columns are considered in the calculations (for example, when there are not very many gaps in each column, but there are few rows that do not contain any missing fields).

Later in this article, we will look at methods that involve filling in gaps based on available information. Often these methods are combined into a single group called Single-Imputation Methods.

## Filling in the Gap with the Average Value

**Filling in the gap with the average value** (Mean Substitution) (other options: filling in with zero, median, etc.); the name of the method speaks for itself.

All variants of this method have the same disadvantages. Let's look at these disadvantages using an example of one of the simplest ways to fill in the gaps of a continuous attribute: filling in the gaps with the arithmetic mean and the mode.

**Example 1. ** Figure 1 shows the distribution of continuous attribute values before and after filling in of the gaps with the average value.

Figure 1 clearly shows that the distribution after filling in the gaps looks extremely unnatural. This results in a distortion of all indicators that characterize the properties of the distribution (except for the average value), an underestimated correlation, and an overestimated standard deviations.

Thus, this method leads to a significant distortion of the attribute distribution even if MCAR is used.

**Example 2. ** In the case of a categorical discrete attributes, mode filling is most often used.

Figure 2 shows the distribution of the categorical characteristic before and after filling in of the gaps.

Thus, when filling in the gaps of a categorical attribute with a mode, the same disadvantages appear as when filling in the gaps of a continuous attribute with the arithmetic mean (zero, median, etc.).

## Repeating the Result of the last Observation

**LOCF** (Last Observation Carried Forward) repeats the result of the last observation. This method is used, as a rule, when filling in gaps in time series, when subsequent values are a priori strongly interrelated with the previous ones.

Let's consider 2 cases when the use of LOCF is justified.

**Case 1.** If we measure the air temperature at a certain geographical point in an open space, and the measurements are made every minute, then under normal conditions — if we exclude natural disasters — the measured value a priori cannot change dramatically (by 10-20 °C) in such a short time interval between the subsequent measurements. Therefore, filling in the gaps with the previous known value is reasonable in this situation.

**Case 2.** If the data is the result of measurements (for example, air temperature again) at the same time in close geographical locations in such a way that the hypothesis of small changes in values from one point of the data set to another remains valid, then again, the use of LOCF is logical.

Situations where the use of LOCF is justified are not limited to these two cases.

Although the method is logical and reasonable in the situations described above, it can also lead to significant distortions of statistical properties, even in the case of MCAR [Molenberghs, 2007]. Thus, it is possible that implementation of LOCF will lead to duplicate outliers (filling in the gaps with an abnormal value). In addition, if there are many consecutive missing values in the data, then the hypothesis of small changes is no longer fulfilled and, as a result, the use of LOCF leads to incorrect results.

## Indicator Method

**Indicator Method** is a method that replaces missing values with zeros and adds a special indicator attribute that takes zero values in the cases of records where the data initially did not contain missing data, and non-zero values where there were previously missing data [Miettinen, 1985].

It is easier and clearer to demonstrate this method with an example.

**Example.** Table 3 shows the data before filling in the gaps.

Param | 12,7 | 7,5 | ? | 3,1 | ? | ? | 5 | 5,8 | 3,7 |
---|---|---|---|---|---|---|---|---|---|

Missing Data | Yes | Yes | Yes |

Table 3 — Data before Filling in the Gaps

"?" sign indicates missing data in the data set.

Table 4 shows the data after filling in the Gaps.

Param | 12,7 | 7,5 | 0 | 3,1 | 0 | 0 | 5 | 5,8 | 3,7 |
---|---|---|---|---|---|---|---|---|---|

Flag | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 |

Table 4 — Table after Filling in the Gaps

In practice, modifications of this method which involve filling in the gaps with non-zero values are also used. It should be noted that for this type of filling (for example, the average), you can use the inverse values of the flag field (i.e. 0 — for the case when the values in the source data were initially missing, and a non-zero value for cases when the value of the source data field was known).

Also, when filling in gaps with non-zero values, the flag field and the source field are often interconnected.

The advantages of this method include the following:

Despite these advantages, even with the MCAR hypothesis and a small number of missing values, this method can lead to significant distortion of the results [Vach, 1991, Knol, 2010].

## Restoration of the Gaps on the Basis of Regression Models

This method consists in restoration of the missing values using the linear regression model based on the known values of the data set.

Figure 3 shows an example of the results of restoration of the missing values of attribute 1 based on the known values of attribute 2.

The linear regression method allows you to get reasonably filled data. However, the real data are characterized with a certain range of values, which is missing when you fill in the blanks based on linear regression. As a result, the variance in the attribute values becomes smaller and the correlation between attribute 2 and attribute 1 is artificially enhanced. As a result, this method of filling in gaps becomes worse the higher the variance in the values of the attributes, in which we fill in the gaps, gets, and the higher the percentage of the missing rows gets.

It is worth noting that there is a method that solves this problem: the stochastic linear regression method illustrated in Figure 4 (similar to Figure 3).

The stochastic linear regression model reflects not only the linear relationship between the attributes, but also the deviations from this linear dependence. This method has advantages of filling in gaps based on the linear regression and, in addition, does not distort the values of correlation coefficients so much.

Of all the methods that we have discussed in this part of the article, filling in the gaps using stochastic linear regression generally leads to the least distortion in the statistical properties of the sample. Moreover, when there is clear linear dependence between the attributes, the stochastic linear regression method is often superior to even more complex methods.

## Conclusion

In this article, we have considered simple methods for filling in the gaps. Although the use of these methods can lead to significant distortion of the statistical properties of the data set (mean, median, variance, correlation...) even in the case of MCAR, they remain frequently used not only among ordinary users, but also in the scientific community (at least in areas related to medicine).

Thus, according to [Burton, 2004], in 82% of 100 papers on cancer that were published in 2002 the authors indicated that they were faced with the need to fill in "missing data" gaps in the data. In 32 cases, the missing data handling method was explicitly specified. In 12 of these 32 papers the Complete Case Analysis was used, in 12 — the Available Case Analysis, in 4 — the Indicator Method, in 3— Ad-Hoc methods, and only 1 case involved a more complex method.

A decade later, the situation had not changed much for the better. [Karahalios, 2012] It is reported that among the scientific papers reviewed, the Complete Case Analysis was used in 54% of cases (in 21 articles), LOCF was used in 7 cases, average value filling was used in 3 cases, and the Indicator Method was used in 1 case.

Even as of 2014, the recommended methods for filling in the gaps (Multiple Imputation, maximum likelihood function methods) are still rarely used in scientific medical articles [Rezvan, 2015].

As a conclusion, I would like to note that the use of simple methods, such as deletion of rows or ad-hoc methods, does not always lead to deteriorated results. Moreover, when applicable, the simple methods are preferable.