Detection and correction of one-dimensional outliers in data

Many analysts need to deal with outliers in their data. Thankfully, Megaladata has a built-in component that quickly detects extreme values and can automatically delete or replace them.

Detecting and eliminating outliers is an essential part of data management. Extreme values affect the results of analysis, so it is important to know how to reduce their negative impact.

An outlier is a value in the data which is far beyond the limits of other observations.

This widely used definition covers both outliers and extreme values. In this article, however, we will treat them as two distinct concepts:

  • Extreme values are avoidable or unavoidable errors and possible dummy values.
  • Outliers are genuine observations caused by exceptional conditions. They deserve the expert's attention: they may reveal specific features of the phenomenon reflected in the data set, or call into question whether the assumed distribution (usually the normal one) really applies.

There are many different causes for outliers:

  1. Measurement or instrument errors. This is the most common cause of outliers. Most often the problem lies with the equipment being used, for example, sensors that operate incorrectly. The reason may also be an incorrectly configured data collection process, for instance, when gathering website performance metrics. With large amounts of data or long-standing errors, finding and correcting them can require significant resources.
  2. Data entry errors. This problem is caused by human error during data collection, recording, or input. For example, an operator types an extra zero into a numeric value, omits a digit, or puts the decimal separator in the wrong place, so the value becomes 10 to 100 times larger or smaller. Such errors are often associated with the lack of rules regulating data entry.
  3. Data processing errors. In automated analysis, data is often extracted from several different sources. Outliers can appear in the resulting data set because of incorrect data transformations or inconsistent units of measurement (such as tons vs. grams or inches vs. meters).
  4. Sampling errors. This problem results from incorrectly combining data of different natures or from changes in the data collection methodology. For example, when collecting information about real estate, records about apartments and industrial facilities may end up in the same sample. Naturally, the cost of a factory is much higher than the cost of a standard apartment.
  5. Experimental errors. These occur during experiments or measurements that depend on previously unaccounted-for environmental conditions: vibration, radiation, pollution, pressure, and so on. If such factors are ignored, even reliable measuring instruments can record incorrect values or outliers.
  6. Natural outliers. These deviations are not errors, although they stand out against the rest of the data. An example is a sudden spike in website traffic caused by an advertising campaign, which looks like an outlier compared to normal traffic. There is a real reason for the change, even though the analyzed sample contains no information about the campaign.

The Need to Identify Outliers

It is necessary to search for and eliminate extreme values for several reasons.

Firstly, when working with data sets that contain a noticeable number of extreme values, most analytical algorithms try to find a solution that describes the entire data set. As a rule, the result is a model that poorly describes both the extreme values and the bulk of the data. Therefore, when building predictive models, simply removing extreme values can noticeably improve the accuracy of the solution.

Secondly, many graphical and statistical characteristics of a data set are sensitive to extreme values. For example, a few extreme outliers in a sales history can seriously shift the average, and a chart based on this data will no longer reflect the actual state of affairs.
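As a quick, made-up illustration of this sensitivity, a couple of extreme values are enough to pull the mean far away from the typical level, while the median barely moves:

```python
import numpy as np

# Hypothetical daily sales: mostly ordinary values plus two extreme ones
sales = np.array([120, 115, 130, 125, 118, 122, 128, 119, 2500, 3100])

print(f"Mean:   {sales.mean():.1f}")      # pulled far above the typical level
print(f"Median: {np.median(sales):.1f}")  # barely affected by the two extremes
```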

Thirdly, in some cases, the removal of outliers contributes to obtaining objects of study with a normal distribution. This expands the range of tools for subsequent analysis.

Unfortunately, there is no universal method or algorithm for finding extreme values, if only because there are multiple criteria and approaches for identifying outliers. The decision on which one to apply remains with the analyst.

The simplest tool for detecting outliers is visualization. It lets you immediately spot deviations that are easy to miss when scanning a large data set as raw numbers. This is one of the most important stages of exploratory data analysis.

One popular way to visualize one-dimensional data is a histogram; Figure 1 shows an example.

Figure 1. Histogram with a superimposed graph of parameter density
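A histogram with a superimposed density curve like the one in Figure 1 can be sketched with standard Python tooling; the data below is a synthetic placeholder rather than the sample used in the article:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

# Synthetic, roughly normal sample standing in for the analyzed parameter
rng = np.random.default_rng(42)
values = rng.normal(loc=100, scale=15, size=1000)

# Histogram normalized to a density so the kernel density estimate can be overlaid
plt.hist(values, bins=30, density=True, alpha=0.6, label="histogram")
grid = np.linspace(values.min(), values.max(), 200)
plt.plot(grid, gaussian_kde(values)(grid), label="density")

plt.xlabel("parameter value")
plt.ylabel("density")
plt.legend()
plt.show()
```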

Standard Deviation Method

Figure 1 shows that the distribution of data in this set is close to normal. Let's use the following formulas to determine the upper (Lim_{max}) and lower (Lim_{min}) outlier limits:

Lim_{max} = \bar{x} + N_{s}*S

Lim_{min} = \bar{x} - N_{s}*S

\bar{x} is the average value

S is the standard deviation

N_{s}=3 is the specified number of standard deviations

The number 3 reflects the fact that 99.73% of normally distributed values lie within three standard deviations of the mean. When the data is highly dispersed, the N_{s} parameter can be increased.
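A minimal sketch of these limits in Python (the data is a synthetic placeholder, and the function name is ours, not part of Megaladata):

```python
import numpy as np

def sd_limits(values, n_s=3.0):
    """Outlier limits by the standard deviation method: mean +/- n_s * S."""
    x_bar = np.mean(values)
    s = np.std(values, ddof=1)  # sample standard deviation
    return x_bar - n_s * s, x_bar + n_s * s

rng = np.random.default_rng(0)
values = rng.normal(loc=100, scale=15, size=1000)  # placeholder data
lim_min, lim_max = sd_limits(values)
outliers = values[(values < lim_min) | (values > lim_max)]
print(f"limits: [{lim_min:.1f}, {lim_max:.1f}], outliers found: {outliers.size}")
```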

Interquartile Distance Method

Figure 2 shows another version of the histogram:

Figure 2. A histogram with a superimposed graph of the parameter density

Figure 2 shows that the data distribution differs from the normal one and is skewed to the right. The previous method of setting outlier boundaries cannot be applied here. In this case, a different approach should be used: the interquartile distance method. It uses the following criteria for finding the maximum and minimum (extreme) limits:

Lim_{max} = Q_{3} + N_{i}*IQR

Lim_{min} = Q_{1} - N_{i}*IQR

Q_{3} is the third quartile

Q_{1} is the first quartile

IQR is the interquartile distance (or interquartile range), determined by the formula IQR=Q_{3}-Q_{1}

N_{i}=1.5 is the specified number of interquartile ranges. When the data is highly dispersed, this parameter can be increased.
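The same kind of sketch for the interquartile distance method, this time with synthetic, right-skewed placeholder data similar in shape to Figure 2:

```python
import numpy as np

def iqr_limits(values, n_i=1.5):
    """Outlier limits by the interquartile distance method: Q1 - n_i*IQR, Q3 + n_i*IQR."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1 - n_i * iqr, q3 + n_i * iqr

rng = np.random.default_rng(0)
values = rng.lognormal(mean=3.0, sigma=0.5, size=1000)  # right-skewed placeholder data
lim_min, lim_max = iqr_limits(values)
outliers = values[(values < lim_min) | (values > lim_max)]
print(f"limits: [{lim_min:.1f}, {lim_max:.1f}], outliers found: {outliers.size}")
```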

A graphical representation of the quartiles is the box plot, also known as the "box-and-whiskers" plot. It is one of the most well-known and informative ways to visualize the distribution of one-dimensional data.

It compactly shows several parameters at once (from left to right): the minimum, the lower quartile (25th percentile), the median (50th percentile), the upper quartile (75th percentile), and the maximum. Depending on the data, outliers may be absent or present on the diagram (they are usually drawn as dots), and they can appear only below the minimum, only above the maximum, or on both sides.

The minimum and maximum here are marked by vertical lines connected to the neighboring quartiles by the "whiskers" (dashed lines).

Figure 3. Example of a box plot
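A box plot like Figure 3 can be drawn with matplotlib; here whis=1.5 corresponds to the N_{i} = 1.5 rule above, and the data is again a placeholder:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
values = rng.lognormal(mean=3.0, sigma=0.5, size=300)  # placeholder skewed sample

# Points farther than 1.5 * IQR from the box are drawn as individual dots (outliers)
plt.boxplot(values, vert=False, whis=1.5)
plt.xlabel("parameter value")
plt.show()
```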

For both methods, all values below the lower limit or above the upper limit are considered outliers. This holds both when the distribution of the one-dimensional data is normal and when it deviates from normality.

Elimination of Outliers

The decision on how to eliminate outliers depends on the characteristics of the data set as well as the purpose of a particular project. The choice of follow-up actions with respect to extreme values is in many ways similar to the processing of missing data described in the article "Processing Data Gaps".

In most applied problems, analysts focus on average behavior: a dependable averaged trend estimated with a high degree of reliability. This makes sense, because conclusions based on predicting extremely large or small values are unreliable and often useless, since such values are not persistent.

  1. Deleting values. Extreme values can be deleted if it is known for certain that they contain incorrect data, or if the cause of the outlier is very unlikely to recur in the future. Removing outliers can also improve a data set's compliance with the normality requirement. Many analysts remove extreme values simply to exclude their influence on the calculation of averages.
  2. Changing values. If the cause of the outliers is known, incorrect values can sometimes be corrected. For example, when errors are caused by defects or breakdowns of a measuring instrument, replacing or repairing the device makes it possible to repeat the measurements and replace the erroneous data with up-to-date values.
  3. Replacing values. The most widely used options for replacing outliers are the:
  • median
  • average value
  • boundary value chosen by an expert
  • average value from the most likely interval

For continuous attributes, the average value from the most likely interval is calculated as the mean of the values that fall into that interval; for discrete attributes, the mode of the distribution is used.
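To make the replacement options concrete, here is a small sketch of how such a substitution could be done (the function and strategy names are illustrative, not the Outlier Editing component's API):

```python
import numpy as np

def replace_outliers(values, lim_min, lim_max, strategy="median"):
    """Replace values outside [lim_min, lim_max] with a chosen substitute."""
    values = np.asarray(values, dtype=float)
    mask = (values < lim_min) | (values > lim_max)
    inliers = values[~mask]

    if strategy == "median":
        substitute = np.median(inliers)
    elif strategy == "mean":
        substitute = np.mean(inliers)
    elif strategy == "clip":  # boundary value chosen by an expert
        return np.clip(values, lim_min, lim_max)
    else:
        raise ValueError(f"unknown strategy: {strategy}")

    result = values.copy()
    result[mask] = substitute
    return result
```

For discrete attributes, the substitute would be the mode of the remaining values rather than their median or mean.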

There are cases when discarding extreme values calls for caution.

If there is confidence in the correctness of the sampling, there is no reason to discard reliable data just because it fails to meet some preconceived expectations about how the data "should" behave. It is quite possible that if the available data is supplemented with additional attributes, the extreme values will become meaningful and will no longer be considered outliers.

Sometimes, "real outliers" (in the terms of this article) can be interesting "by themselves". They can indicate structural changes in the data, a discrepancy with expectations, a change in variance or the presence of clusters. Alternatively, they can just be a big noise that needs to be smoothed out. Understanding the reasons why a particular observation differs from the others brings the analyst closer to solving the problem or optimizing the analysis process.

All the methods and approaches for handling extreme values and outliers described above are implemented in the Megaladata analytical platform in the Outlier Editing component. It works with ordered and unordered data, as well as with data that follows a normal distribution and data that does not.

In this component, values that deviate strongly from the main mass are separated into outliers and extreme values by setting two threshold levels:

  • For the standard deviation method, all values beyond 3 sigma are considered outliers, and those beyond 5 sigma are considered extreme values.
  • For the interquartile distance method, all values beyond 1.5 * IQR are considered outliers, and those beyond 3 * IQR are considered extreme values.

You can either keep the component's default presets or change the specified boundary values.
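The two-level logic can be illustrated with a short sketch based on the standard deviation thresholds described above (this is our own illustration of the rule, not the component's internal implementation):

```python
import numpy as np

def classify_deviations(values, n_outlier=3.0, n_extreme=5.0):
    """Label each value as 'normal', 'outlier' (beyond n_outlier sigma)
    or 'extreme' (beyond n_extreme sigma)."""
    values = np.asarray(values, dtype=float)
    z = np.abs(values - values.mean()) / values.std(ddof=1)
    return np.where(z > n_extreme, "extreme",
                    np.where(z > n_outlier, "outlier", "normal"))
```

An analogous function for the interquartile method would compare the distance from the box to 1.5 * IQR and 3 * IQR instead of sigma multiples.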

Conclusion

To avoid creating the illusion that the described problem is simple to solve, it is worth mentioning a few circumstances that were not covered above.

An analyst or a subject-matter expert may be confident that the data (or part of it) "must" satisfy the normality condition, while in reality it does not. In this case, there are methods to bring the data closer to a normal form for further processing and analysis with statistical methods.

Such transformations come in handy when the detected outliers indicate a strong distortion of the studied data set. Converting variables can eliminate outliers: for example, taking the natural logarithm of a value reduces the variation caused by extreme values. This works only for data sets with strictly positive values. Sometimes it is also useful to apply data normalization, which brings the data to a single scale.
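A tiny example of the log transform on synthetic, strictly positive, right-skewed data; the ratio of mean to median is used here only as a rough indicator of skew:

```python
import numpy as np

rng = np.random.default_rng(2)
values = rng.lognormal(mean=3.0, sigma=0.8, size=1000)  # strictly positive, right-skewed

log_values = np.log(values)  # natural logarithm compresses the long right tail

print(f"mean/median before: {values.mean() / np.median(values):.2f}")
print(f"mean/median after:  {log_values.mean() / np.median(log_values):.2f}")
```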

It is important to remember that statistical tests alone cannot provide a reliable answer to the question of whether the detected outliers should be discarded or corrected. Such a decision should be made based on knowledge of the subject area and the specifics of the data collection process.
