How to clean data before uploading it to a storage

When you create a data warehouse is not enough attention is paid to cleaning the incoming information into it. Apparently, it is believed that the larger the storage, the better. This is a surefire way to turn your data warehouse into a garbage dump.

Data must be cleaned! After all, information is heterogeneous and is almost always collected from a variety of sources. It is the presence of different data collection points that makes the cleaning process so complex and relevant.

Mistakes will always take place and it is impossible to completely get rid of them. Perhaps, it sometimes makes sense to put up with them rather than to spend time fixing them. But in general, we should aim to reduce the number of errors to an acceptable level by any means. The methods used for analysis are already fraught with inaccuracies, so why aggravate the situation?

In addition, we need to take into account the psychological aspect of the problem. If the analyst or decision-maker is not sure about the numbers they get from the data storage, they will rely on information they get from other sources. This severely reduces the value of the storage.

Types of Errors

We will not consider simple errors, such as mismatches of types or differences in input formats and encodings. These are cases when information comes from several sources and different conventions are adopted to denote the same fact.

A typical example of such an error is the designation of a person's gender. Somewhere it is denoted as M/W, somewhere as 1/0 and somewhere as True/False. To prevent such kind of errors, we can set rules for transcoding and typecasting. So these kinds of errors are relatively easy to resolve. But we are interested in more complex problems that cannot be solved in elementary ways.

The types of complex errors are rather numerous. In addition to universal errors, some errors are specific only to a particular subject area or task. But let us look at those that do not depend on the task.

  1. Inconsistency of information
  2. Omissions in the data
  3. Abnormal values
  4. Noise
  5. Data entry errors

There are proven methods for solving each of these problems. Of course, errors can be corrected manually. But with large amounts of data, this becomes time-consuming and problematic. Therefore, we will consider options for solving these problems automatically with minimal human involvement.

Inconsistency of Information

First, we need to decide what exactly to consider a contradiction. Oddly enough, this is a non-trivial task. For example, in Russia, a pension card needs to be changed if someone changes their last name, first name, patronymic or gender. It turns out that there is no contradiction in the fact that a person was born as a woman and retired as a man!

After we decide what to consider a contradiction and detect such records, there are several options for action.

  • If several conflicting records are found, we can either delete them all or leave one of the options that was selected through a simple rule. For example, this can be the most recent entry. This method is relatively basic and therefore easy to implement. Sometimes that is enough.
  • Correct conflicting data based on statistics. For example, we can calculate the probability of the occurrence of each of the conflicting values and select the most likely one. Most often, this method gives more accurate results.

Omissions in the Data

This problem is a scourge for many data storages. Most forecasting methods assume that the data comes in a constant, uniform stream. In practice, this happens rarely. Therefore, one of the most popular applications of data storages, forecasting, is implemented poorly or with significant limitations. To combat this phenomenon, we can use the following methods.

  • Approximation. If there is no data for a certain point, we take its neighborhood. We calculate the value at this point using known formulas and add the corresponding record to the storage. This works well for rank-order data — for example, information about daily sales of products.
  • Determining the most plausible value. To do this, we take not the neighborhood of the point but all the data. This method is used for non-patterned information. We apply it to cases when we are not able to determine what is the neighborhood of the point under study.

Abnormal Values

Quite often, we come across data or events that are significantly out of character. For example, the price of a product is 10 times higher than the average. These values need to be adjusted. The fact is that the analysis algorithms do not know anything about the nature of the processes. Therefore, any anomaly will be perceived as a completely normal value. Because of this, the model will be greatly distorted. A random failure or success will be considered a pattern.

There is a way to deal with this problem — it is called robust estimates. These are methods that are resistant to strong perturbations. A median filter can serve as an example.

We evaluate the available data and apply one of the following actions to everything that goes beyond the acceptable limits.

  • Abnormal values are removed.
  • The anomalous data is replaced with the nearest boundary values.

Noise

When carrying out analysis, we almost always come across noise. Most often, the noise does not carry any useful information and only prevents us from seeing the picture clearly. There are several methods to combat this phenomenon.

  • Spectral analysis. We can cut off the high-frequency components of the data — that is, noise, frequent and insignificant fluctuations near the main signal. Moreover, by changing the width of the spectrum, we can choose what kind of noise we want to remove.
  • Autoregressive methods. This fairly common method is actively used in the analysis of time series. It boils down to finding a function that describes the process as signal plus noise. Actually, we can remove the noise after that and leave the main signal.

Data Entry Errors

This is a topic for a separate conversation. The types of such errors are too numerous — for example, typos, deliberate distortion of data or inconsistency of formats. Plus, there are errors related to the features of the application for data entry.

There are proven methods to deal with most of them. Some things are obvious. For example, we can check the formats before adding data to the storage. Others are more sophisticated. For instance, we can correct typos based on various types of thesaurus. But in any case, we need to clean up these types of errors.

Conclusion

Dirty data is a very big problem. In fact, it can negate all our efforts to fill the data storage. At the same time, we should keep in mind that we are not talking about a one-time clean but constant work in this direction. As they say, "to keep the place clean, you don't need to tidy it up but to stop making a mess there". The ideal option is to implement special procedures in the ETL process that will ensure that the data is cleaned before being uploaded to the storage.

If the data changed during the cleaning process, it must be labeled to take this aspect into account in the subsequent analysis. Otherwise, there is a risk of relying on it as on real information, which can lead to incorrect conclusions.

The solutions described above are not the only ones. There are quite a few other processing methods that can help improve the quality of data, ranging from expert systems to neural networks. At the same time, we should take into account that the cleaning methods can be strongly tied to the subject area. For example, what is noise for some is highly valuable information for others.

In view of the above, the ETL process should be considered more broadly, and not only as procedures for importing data from receivers, transforming it minimally and loading it into storage. Data cleaning should be a mandatory stage of work. The value of the storage is determined not only and not so much by the volume of data — but by the quality of the collected information.

See also

Enterprise Information Ecosystem
In today's digitally driven business landscape, data has become an indispensable asset. The ability to effectively harness and utilize information is a key determinant of a company's success. To remain...
Sampling Methods and Algorithms in Data Analysis: Nonprobability Sampling
In our previous article, we discussed probability sampling, a method where samples are selected randomly, giving each population member an equal chance of inclusion. This article will explore...
Sampling Methods and Algorithms in Data Analysis: Probability Sampling
Sampling is a fundamental process in data analysis. It involves selecting a subset of individuals from a larger population to study. By analyzing this sample, researchers can draw reliable conclusions about...

About Megaladata

Megaladata is a low code platform for advanced analytics

A solution for a wide range of business problems that require processing large volumes of data, implementing complex logic, and applying machine learning methods.
GET STARTED!
It's free