Data quality criteria
Sooner or later, most companies arrive at the idea of making decisions based on data. This is known as the data-driven approach, an alternative to HiPPO (the highest paid person's opinion). It helps reduce the number of errors and improves both the objectivity of decisions and the quality of management.
The need to collect and store data for decision-making is an axiom for most organizations. Meanwhile, much less attention is paid to ensuring the data's quality, which is surprising: bad data inevitably leads to bad decisions, in full accordance with the GIGO principle (garbage in, garbage out).
Large companies that depend on data more than others have paid particular attention to assessing their losses from poor data quality. According to a Gartner survey conducted in 2018, organizations estimate that poor data quality costs them $15 million per year on average. The same research found that the later bad data is detected, the more expensive the errors are to correct.
According to a report by Forrester Research, low-quality data stored in corporate systems drains business productivity, because the data has to be checked constantly to confirm its accuracy.
Data quality is a generalized concept that reflects "the degree of suitability of the information for solving the corresponding problem" (according to GOST R ISO 8000-2-2019). The evaluation of data as "high-quality" or "substandard", as well as its various grades (bad, good or excellent), is subjective and should be considered in the immediate context of the task being solved.
There are a number of criteria used to assess the correctness, completeness, accuracy and reliability of data. The most frequently used ones are listed below.
Data Quality Criteria
The criteria can be divided into three groups: requirements for the content of the data, the consistency of the information, and the convenience of working with it.
Problems in the first group are the most critical: ignoring them can make the data impossible to analyze at all. A lack of consistency reduces the credibility of the decisions made, while inconvenience of use increases costs.
Data Content
Accuracy is the correspondence of data to reality and the correctness of its interpretation. For example, the correctness of data on the number of manufactured products depends on how this data was received, accounted for and entered into accounting systems. Evidently, incorrect data cannot be used to make decisions.
Completeness is the sufficiency of the volume, depth and breadth of data sets. Incompleteness makes analysis either impossible or dependent on suppositions and assumptions about the missing information. Completeness may concern omissions in the attributes of the analyzed objects, for example incomplete information in the product directory. It can also be related to the absence of part of the data under study, for instance information about a certain period of time.
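To make the idea concrete, here is a minimal sketch of a completeness check using pandas. The table, the column names and the 5% limit on missing values are assumptions made purely for the example.

```python
import pandas as pd

# Hypothetical sales extract; the table and column names are assumed for illustration.
sales = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "customer_id": [10, None, 12, 13],
    "amount": [99.9, 45.0, None, 12.5],
})

# Share of missing values per attribute.
missing_ratio = sales.isna().mean()

# Flag attributes that exceed an assumed 5% limit on missing values.
incomplete = missing_ratio[missing_ratio > 0.05]
print(incomplete)
```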
Relevance is an indicator of how well the data corresponds to the goals and tasks being solved. For example, the information on sales of paper books may be irrelevant to the electronic market. The data on customer preferences in one country may be completely different from the data about users who live in another country.
Objectivity is the confidence that the data is free from biased opinions or subjective assessments. Objectivity problems are common when data from surveys or customer feedback is analyzed, and the human factor is to blame: one user's assessment of the quality of the same service or product can differ radically from another person's opinion.
Validity is compliance with the attributes associated with a data element — such as type, accuracy, format, ranges of acceptable values and so on. For instance, a string in the email field must comply with the standard format of email addresses.
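As an illustration, below is a sketch of such a validity check in pandas. The simplified email pattern and the column names are assumptions; real-world email validation is considerably more involved.

```python
import re

import pandas as pd

# Deliberately simplified pattern for the example; full RFC-compliant validation is much stricter.
EMAIL_PATTERN = re.compile(r"^[\w.+-]+@[\w-]+\.[\w.-]+$")

customers = pd.DataFrame({
    "customer_id": [10, 11, 12],
    "email": ["alice@example.com", "bob@invalid", "carol@example.org"],
})

# Mark records whose email field does not match the expected format.
customers["email_valid"] = customers["email"].apply(
    lambda value: bool(EMAIL_PATTERN.match(value))
)
print(customers[~customers["email_valid"]])
```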
Precision is the detail of measuring and recording data. Depending on the specifics of the process and the purposes of the analysis, you might need to record the indicators with accuracy to the day, hour, minute or second. Alternatively, you might need to measure the weight of the goods with accuracy to the nearest ton or gram.
Timeliness is the delay between data collection and the moment the data becomes available for analysis. It should match the speed of the process being analyzed: adequate but outdated data is useless for making operational decisions.
Consistency
Uniqueness implies that no object exists in the dataset more than once. The presence of duplicates can lead to inconsistencies and contradictions due to the lack of a single version of the truth.
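A simple way to surface such duplicates is sketched below with pandas; the table and the choice of customer_id as the key are assumptions for the example.

```python
import pandas as pd

# Hypothetical customer directory; the key column is assumed for illustration.
customers = pd.DataFrame({
    "customer_id": [10, 11, 11, 12],
    "name": ["Alice", "Bob", "Bob", "Carol"],
})

# Records whose key occurs more than once violate the uniqueness criterion.
duplicates = customers[customers.duplicated(subset=["customer_id"], keep=False)]
print(duplicates)

# A common remediation is to keep a single version of each record.
deduplicated = customers.drop_duplicates(subset=["customer_id"], keep="first")
```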
Integrity is the presence of correct links between the data and their compliance with the established rules and restrictions. Referential integrity assumes that all references from data in one column of the table to data in another column of the same or another table are valid. As a result, there will be no situation in which an entry in the sales table refers to a buyer who is absent from the customer directory.
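The sketch below, again with assumed table and column names, shows how such a referential integrity check might look in pandas.

```python
import pandas as pd

# Hypothetical tables; names and keys are assumptions for the example.
customers = pd.DataFrame({"customer_id": [10, 11, 12]})
sales = pd.DataFrame({"order_id": [1, 2, 3], "customer_id": [10, 13, 12]})

# Sales records that refer to a buyer missing from the customer directory.
orphans = sales[~sales["customer_id"].isin(customers["customer_id"])]
print(orphans)  # order_id 2 points to the unknown customer_id 13
```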
Consistency is the correspondence of different pieces of data to each other and their logical non-contradiction. For example, it can be the correspondence of a person's gender to their name, or of their date of birth to their age. Inconsistent data may indicate errors or inaccuracies in how it was collected or processed.
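Below is a minimal sketch of a cross-field consistency check (date of birth versus age) in pandas; the column names, the reference date and the one-year tolerance are assumptions for the example.

```python
import pandas as pd

# Hypothetical records; the column names and values are assumed for illustration.
people = pd.DataFrame({
    "person_id": [1, 2],
    "date_of_birth": pd.to_datetime(["1990-05-01", "2001-03-15"]),
    "age": [34, 50],  # the second value contradicts the date of birth
})

reference_date = pd.Timestamp("2024-06-01")  # fixed date so the example is reproducible
derived_age = (reference_date - people["date_of_birth"]).dt.days // 365

# A one-year tolerance absorbs the rough day-based age calculation.
inconsistent = people[(people["age"] - derived_age).abs() > 1]
print(inconsistent)
```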
Coherence is consistency with other data sources and the logic of the process that they describe. For example, data on production costs should not contradict data on the number of products manufactured within the same period of time.
Reliability is the ability to repeatedly obtain the same results. If the measurement results differ depending on the conditions, the confidence in the decisions made based on these results decreases.
Convenience
Accessibility shows how easy it is for the user to find out which data is available to them and to access it; the same applies to the metadata that describes the analyzed information. For instance, it is quite difficult to quickly access and analyze data that is only available in printed form.
Usability characterizes how easy and simple it is to use data to study a particular problem. For example, information may be available, even in electronic form, yet using it for analysis may require complex preprocessing. This is especially true for unstructured data, such as images, audio and video.
Universality determines to what extent data can be used for different purposes and tasks. For instance, sales information is universal because it can come in handy for researching various issues — such as finance, logistics, marketing or production planning.
Traceability is the ability to control the quality and origin of data by revealing its sources, history of creation, modification, transformation, deletion, storage and transmission.
Portability is the ability to transfer data between different platforms or services without losing its integrity or facing other obstacles. The complexity of integrating, importing or exporting data significantly reduces its value.
Quality Assurance
Data quality management is not a one-off action but an ongoing process. It includes the stages of observation, analysis and improvement of information. The goal is proactive data quality control rather than the elimination of flaws only after they have been identified.
There are many methods and approaches for achieving this goal:
- Cleansing means deleting incorrect, incomplete, duplicate and irrelevant records.
- Standardization is the reduction of data to a uniform format and standard. It helps to improve the consistency, accuracy and completeness of data.
- Validation is the process of checking data for compliance with certain rules and requirements — such as formats, acceptable values, presence or absence of special characters, etc.
- Setting metrics means defining indicators that will be used to assess the quality of data — such as accuracy, completeness, relevance, consistency, etc. (see the sketch after this list).
- Data monitoring is an ongoing tracking process that enables you to identify problems in the data and solve them in real-time.
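As a rough sketch of how the metrics and monitoring items above might be put into practice, the function below computes a couple of illustrative quality indicators; the table, the column names and the choice of metrics are assumptions rather than a complete framework.

```python
import pandas as pd

def quality_report(df: pd.DataFrame, key: str) -> dict:
    """Compute a few illustrative data quality metrics for a single table."""
    return {
        "row_count": len(df),
        # Completeness: share of non-missing cells across the whole table.
        "completeness": float(df.notna().mean().mean()),
        # Uniqueness: share of rows whose key does not repeat an earlier row.
        "uniqueness": float(1 - df.duplicated(subset=[key]).mean()),
    }

# Hypothetical table used only for the demonstration.
sales = pd.DataFrame({
    "order_id": [1, 2, 2, 4],
    "amount": [99.9, None, 45.0, 12.5],
})

print(quality_report(sales, key="order_id"))
```

A report like this can be produced on a schedule and compared against agreed thresholds, which is one simple way to approach ongoing monitoring.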
Ultimately, it is necessary to implement a data management policy. This may include identifying responsible persons as well as the procedures and deadlines for updating data. Staff training is another important aspect of the struggle for data quality: employees should learn the rules for entering and processing information as well as the tools for checking and cleansing data.
There are many tools that allow you to automate the processes of checking, cleansing and updating data. Their use enables you to reduce the time and the amount of resources that you need to improve the quality of information as well as implement the data-driven approach to decision-making.