Data leakage in machine learning

In machine learning, data leakage refers to a situation where one or more input features used during the model's training become unavailable when the model is applied in practice. Data leakage results in lower model accuracy than the estimates obtained during testing. Detecting this problem can be quite challenging, which is why preventing data leakage is a pressing task in the field of machine learning.

Data leakage is widely recognized as one of the key challenges in machine learning. It occurs when information used to construct an ML model is not accessible during its practical application. Despite the significant impact that data leakage can have on the work of analysts and business users, it is often not given sufficient attention in research.

The problem is exacerbated by the difficulty of formalizing data leakage in practice. This lack of formalization hinders the development of rigorous, general approaches to combating it. In most cases, methods for detecting and mitigating leaks rely on knowledge of the data, case-by-case analysis, and common sense.

The consequence of data leakage is an overestimation of the ML model's quality at the training stage relative to the performance it will demonstrate on real data in practical use. Consequently, a model deemed to be of high quality based on training results may perform poorly or even fail completely in practice.

Data leakage should be viewed as a complex and multifaceted phenomenon, encompassing several types:

  1. Feature leakage.
  2. Target leakage.
  3. Leakage of the training/test set.

Feature Leakage

This is the primary type of data leakage. It occurs when features present in the training dataset are not available during the model's operation. It is especially common for data generated in the external environment, beyond the organization's control. However, if such data is simply disregarded, potentially valuable information from the external environment remains unused during training, reducing the model's quality.

The life cycle of any ML model typically involves at least two stages: training and prediction. During the training process, the model's parameters are adjusted using a specific algorithm and a training dataset. At the prediction stage, the model generates or predicts values for the target variable for new observations that were not included in the training process.

For instance, let's consider a scenario where a model is trained on a dataset that includes the attribute "Client Age." Clearly, this attribute holds importance in predicting borrower creditworthiness in credit scoring or determining customer loyalty in marketing. However, it is possible that during the prediction phase, data on the age of new customers may be unavailable. Consequently, the information used during the training process becomes unusable in the practical operation of the model. This situation is referred to as a data leak.

Feature leakage can be illustrated with the following figure.

Figure 1. Feature leakage

Let the training dataset contain five input variables x_1, x_2, ..., x_5 and a single target variable Y. By the end of the training stage, the model must "learn" to implement the function Y=f(x_1,x_2,x_3,x_4,x_5) with some acceptable accuracy.

Let's assume that when the model was put into operation, a data leak occurred: the attribute x_5 became unavailable. As a result, instead of the function f, the model will implement some other function f′, and it is not known in advance which one. It will certainly differ from f and produce a result Y′ that differs from the result that would be obtained if all feature values were available.

The expected result of the leak is a decrease in the accuracy of the model's predictions on practical data relative to the accuracy observed during training and testing (where the "leaked" variable is present). The reason is intuitive: less information is used in prediction than was used in training.
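
A minimal sketch of this effect is shown below, assuming a synthetic scikit-learn dataset; the model choice and the way the "leaked" x_5 is substituted at prediction time (with its training mean) are illustrative assumptions, not part of the original example.

```python
# Illustrative sketch: a model trained on five features loses accuracy when
# x_5 is unavailable in production and has to be replaced with a neutral value.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data in which all five features, including x_5, carry information.
X, y = make_classification(n_samples=3000, n_features=5, n_informative=5,
                           n_redundant=0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Accuracy estimated during testing, where all five features are present.
print("Test accuracy (x_5 available):", model.score(X_test, y_test))

# In production x_5 has "leaked"; here it is crudely replaced by its training
# mean, so the model effectively implements f' instead of f.
X_prod = X_test.copy()
X_prod[:, 4] = X_train[:, 4].mean()
print("Accuracy without x_5:", model.score(X_prod, y_test))
```

Here the score on the full test set plays the role of the optimistic estimate, while the run with the substituted x_5 mimics the model operating as f′ on practical data.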

Target Leakage

This type of leakage is associated with the unintentional or erroneous use of information about the target variable, information that will not be available once the learning process is over and the model is used for prediction.

A trivial example of this occurs when the target variable is mistakenly included in the training set as an input feature. In such cases, the function that the model needs to "learn" is explicitly defined within the training sample but cannot be accessed beyond it. This is similar to drawing conclusions such as "it rains on rainy days." This function is often referred to as a "give-away function," signifying the inadvertent disclosure of a secret.

In supervised learning, each training example comprises two vectors. The input vector contains the values of the input features (independent variables), while the target vector contains the corresponding values of the dependent variables that the model should learn to produce. The target values are compared to the output generated by the model, and the resulting output error is used to adjust the model's parameters according to a given learning algorithm.

Let's denote X as the input vector and Y as the output one. Then, the training example can be represented as a tuple:

⟨X(x_1,x_2,...,x_n),Y(y_1,y_2,...,y_m)⟩,

where x_i and y_j are specific values of input and target features for this example. n and m are the numbers of input and output features, respectively.

For simplicity, let's assume that the model has a single output variable (m=1). Then, if the target attribute leaks into the input set, the training example takes this form:

⟨X(x_1,x_2,...,x_n,y),Y(y)⟩.

Let's illustrate this with a figure.

Figure 2. Target leakage

During training, y should only be used to calculate the output error E=y−y', where y' is the actual output of the model for this example. However, as shown in the figure, the target variable is also included among the input variables, which constitutes the leak.

For instance, let's consider the construction of a model aimed at predicting the probability of loan delinquency. In this case, the input features would consist of borrower characteristics such as age, income, property information, and so on. The target feature would be the occurrence of a delinquency, which is known for previously observed borrowers.

However, if we include the overdue amount as one of the input features, the model will appear highly accurate. This is because that single input variable carries all the necessary information about the output: an overdue amount greater than zero indicates the presence of a delinquency. Yet overdue amounts are only available for borrowers observed in the past, not for new borrowers.

Consequently, we encounter a situation where the information about the target variable, which is accessible during model training, cannot be utilized in real-world business processes. As a result, the model's performance will be severely compromised. A key indication of a target leak is often an unrealistically high model quality based on the training results.
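
The effect can be reproduced with a small sketch; the synthetic credit data, the column names (age, income, overdue_amount), and the rule that generates delinquencies are hypothetical and serve only to illustrate the mechanism.

```python
# Illustrative sketch: the overdue amount is derived from the target, so
# including it as an input feature makes the model look almost perfect
# on historical data, even though it is unknown for new borrowers.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 5000
df = pd.DataFrame({
    "age": rng.integers(21, 70, n),
    "income": rng.normal(50_000, 15_000, n),
})
# Hypothetical target: whether the borrower became delinquent.
df["delinquent"] = ((df["income"] < 45_000) & (rng.random(n) < 0.7)).astype(int)
# Leaky feature: the overdue amount is non-zero only when a delinquency occurred.
df["overdue_amount"] = df["delinquent"] * rng.normal(1_000, 300, n).clip(min=1)

X_train, X_test, y_train, y_test = train_test_split(
    df[["age", "income", "overdue_amount"]], df["delinquent"], random_state=0)

leaky = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
clean = DecisionTreeClassifier(random_state=0).fit(X_train[["age", "income"]], y_train)

print("With overdue_amount (target leak):", leaky.score(X_test, y_test))
print("Without the leaky feature:", clean.score(X_test[["age", "income"]], y_test))
```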

Leakage of the Training/Test Set

This type of leakage is commonly referred to as train-test contamination (TTC). It is associated with data preprocessing techniques, such as normalization, scaling, smoothing, quantization, and outlier suppression, which change the data values.

Preprocessing enhances the efficiency of the learning process and improves the model's accuracy, but it is applied only to the training data. Consequently, the model is fitted to the modified data rather than to the actual data likely to be encountered in practice, and its performance estimated on the training data tends to be overestimated.

To identify this form of leakage, a straightforward approach is to build the model on preprocessed data and then assess its performance on a portion of the data that has not undergone preprocessing. A convenient way to do this is to create a validation set (a hold-out set) in addition to the preprocessed test set. The examples in the validation set are used as they are, without any preprocessing. Evaluating the trained model on the validation set provides an estimate of its performance on practical, real-world data rather than on training data alone.

Figure 3. Train-test contamination (TTC)

Note that the test set cannot be used for this purpose: its examples are fed to the model together with the training ones and must also be preprocessed.

If the model is trained using cross-validation, i.e. the training and test folds are rotated, then, to avoid a TTC-type leak, preprocessing must be repeated at each iteration, using only the current training folds.
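
A minimal sketch of such per-iteration preprocessing, assuming scikit-learn and standardization as the preprocessing step, is given below; wrapping the scaler in a Pipeline makes it re-fit on the training folds of every split.

```python
# Illustrative sketch: comparing preprocessing applied once to the whole
# dataset (contaminated) with preprocessing performed inside each
# cross-validation iteration via a Pipeline.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Leaky variant: the scaler sees the entire dataset before cross-validation.
leaky_X = StandardScaler().fit_transform(X)
leaky_scores = cross_val_score(LogisticRegression(max_iter=1000), leaky_X, y, cv=5)

# Safe variant: scaling is fitted anew on the training folds of each iteration.
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
safe_scores = cross_val_score(pipeline, X, y, cv=5)

print("Scaled before CV (contaminated):", leaky_scores.mean())
print("Scaled inside each iteration:", safe_scores.mean())
```

For simple standardization the numeric difference between the two estimates is often small, but the pipeline variant is the methodologically safe one, since no statistics computed on the test folds influence the preprocessing.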

Finally, if test data "leaks" into the training data, i.e. is used both for adjusting the model's parameters and for testing, the model's quality will be overestimated.

Figure 4. Test data leaking into the training set

If the model's quality control is based on a simple split of the original dataset into training and test sets, it is important to ensure that no test data ends up in the training data.
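
A simple sanity check is to verify that no row of the test set also occurs in the training set. The sketch below assumes the data is held in pandas DataFrames and treats exact duplicate rows as potential leakage; the function and column names are illustrative.

```python
# Illustrative sketch: count test rows that also appear in the training data.
import pandas as pd

def check_train_test_overlap(train: pd.DataFrame, test: pd.DataFrame) -> int:
    """Return the number of test rows that are also present in the training set."""
    return len(test.merge(train.drop_duplicates(), how="inner"))

# Hypothetical usage: one row appears in both sets.
train = pd.DataFrame({"age": [25, 40, 33], "income": [30_000, 55_000, 42_000]})
test = pd.DataFrame({"age": [40, 51], "income": [55_000, 61_000]})
print(check_train_test_overlap(train, test))  # prints 1
```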

Methods to Prevent Data Leakage

In practical scenarios, detecting and assessing the consequences of data leakage can be quite challenging. This difficulty is further compounded by the fact that analytical models are typically developed and operated by individuals working in different companies or divisions. For example, an IT specialist may train the model, assess its quality, and deploy it in the marketing department. However, a marketing specialist, unaware of any potential data leakage, may wonder why the model, which was supposed to be highly effective based on assurances from the IT department, performs poorly when applied to real data.

To identify the root cause, it is essential to establish effective communication and collaboration among specialists involved in both the development and operation of the model.

The most evident approach to prevent data leakage is through organizational measures. This involves structuring data management during the data mining process to minimize the risk of leakage. However, there are no universally applicable approaches in this regard. Instead, reliance should be placed on experience, a deep understanding of the data, and the specific domain in which the model is utilized. Nevertheless, several general recommendations can be offered:

  1. When training the model, it is advisable to refrain from using variables that may potentially be unavailable during its use. The drawback of this approach is apparent: we will obtain a subpar model from the start, because we deliberately discard some of the information that could be used in the learning process.
  2. Utilize all available variables, including those prone to leakage, to construct a superior model. If a leak does occur, take measures to organize the collection of "leaked" data. However, this approach has its downsides. The costs of gathering data may outweigh the benefits, and there is a possibility that such data may be non-existent or unavailable. An example illustrating this is when the target variable is used as an input during supervised learning. Clearly, the values of the target variable in prediction mode are inherently unknown. In the medical field, consider a situation where the model utilizes data from a sample of previously observed patients, and temperature is one of the variables. New patients may simply be unavailable for temperature measurement.
  3. Restore the "leaked" data based on the information available in the training sample. For instance, during prediction, replace a missing value in a new observation with an artificially generated value drawn from the distribution of previously known observations (see the sketch after this list). However, there is a risk that the data distribution may change over time, as income levels do, so instead of a simply missing value the model may receive an implausible one.
  4. Assign specific labels to indicators and examples that suggest the possibility of leakage. Subsequently, "leaky" indicators can be excluded from consideration or treated with caution during the learning process.
  5. Acknowledge data leakage as an unavoidable challenge and accept it if its consequences are not significant.
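
Recommendation 3 from the list above can be sketched as follows, assuming the leaked feature is numeric and that a value taken from (or summarizing) the training distribution is an acceptable stand-in; the function restore_leaked_feature, the column names, and the data are hypothetical.

```python
# Illustrative sketch: when a feature is absent at prediction time, substitute
# a value derived from its distribution in the training sample.
import numpy as np
import pandas as pd

def restore_leaked_feature(new_obs, train, column, strategy="mean", rng=None):
    """Fill a missing value in a new observation using the training data."""
    obs = new_obs.copy()
    if pd.isna(obs[column]):
        if strategy == "mean":
            obs[column] = train[column].mean()
        elif strategy == "sample":  # draw from the empirical training distribution
            rng = rng or np.random.default_rng()
            obs[column] = rng.choice(train[column].dropna().to_numpy())
    return obs

# Hypothetical usage: "income" is unavailable for a new client at prediction time.
train = pd.DataFrame({"age": [25, 40, 33], "income": [30_000, 55_000, 42_000]})
new_client = pd.Series({"age": 29, "income": np.nan})
print(restore_leaked_feature(new_client, train, "income", strategy="sample"))
```

As noted in the recommendation itself, if the distribution drifts over time (for example, incomes rise), the substituted value may turn out to be implausible.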

Data leakage can result in substantial financial losses for businesses that rely on data analysis using ML models. Therefore, it is crucial to pay close attention to monitoring signs that indicate the occurrence of data leaks. Timely detection of the problem will help mitigate its consequences.
 
