From Data to Knowledge: There and Back Again

Among the new capabilities of the Megaladata platform, we would like to highlight the two strategies of workflow design: "upwards", from data to models, and "downwards", from models to data. This combination enables a more flexible modeling process, even allowing for the creation of analytical workflows without requiring data uploads.

During the 1989 International Joint Conference on Artificial Intelligence (IJCAI-89) in Detroit, Michigan, Gregory Piatetsky-Shapiro introduced the term _"knowledge discovery in databases"_ at the inaugural workshop on the topic (KDD-1989). Today, _"data mining"_ is commonly used as a synonym for KDD. Both terms refer to the “non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data”.

Thus, the data mining paradigm initially saw the basis for building knowledge discovery models not in problem-specific conditions, but in problem-related data. In other words, analytical algorithms first discover hidden dependencies, regularities, and patterns. Then, subject domain experts and practitioners interpret them to generate knowledge in the form of rules, observations, and conclusions, which help solve analysis tasks and inform decision-making.

This paradigm, moving from data to knowledge, directed the development of the business analytics software market for three decades.

Modeling "Upwards"

In classic data mining systems, building data analysis models followed a procedure similar to the "upwards" modeling common in software development and manufacturing. The process starts with the more specific levels of description (of program functions or manufactured parts) and gradually moves to more abstract ones (e.g., from individual parts to a description of their assembly).

The question posed was not "What are the desirable results?" but rather "What can we accomplish with the data we have?"

Naturally, with such a problem statement, any analytical workflow starts with uploading source data. The next step is the first level of abstraction: the data is preprocessed, i.e., cleaned, grouped, aggregated, etc. At this level, the analyst operates not on the data itself but on metadata (field names, attributes, and processing parameters).

The next level of abstraction is applying various handlers to the results obtained at the previous level. This processing produces new data, even more abstracted from the original.

This process continues until the analyst decides that the goal was reached and the problem was solved. At the top levels of such a model the analyst works mostly with the conditions of the problem, almost entirely abstracted from the source data.
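
The levels described above can be sketched in plain Python. This is a minimal illustration, not Megaladata's own mechanism: the sample rows, the field names `region` and `amount`, and the three functions are all hypothetical. It only shows how each level works on the output of the previous one and moves further away from the source rows.

```python
from collections import defaultdict

# Hypothetical source data; a real workflow would load this from a file or database.
raw_rows = [
    {"region": "North", "amount": "120.5"},
    {"region": "North", "amount": ""},       # missing value, dropped during cleaning
    {"region": "South", "amount": "80"},
    {"region": "South", "amount": "20"},
]


def clean(rows):
    """Level 1: drop rows with a missing amount and convert amounts to numbers."""
    return [
        {"region": r["region"], "amount": float(r["amount"])}
        for r in rows
        if r["amount"].strip()
    ]


def aggregate(rows):
    """Level 2: total amount per region; individual rows are no longer visible."""
    totals = defaultdict(float)
    for r in rows:
        totals[r["region"]] += r["amount"]
    return dict(totals)


def top_region(totals):
    """Level 3: a conclusion stated in terms of the problem, not the source rows."""
    return max(totals, key=totals.get)


totals = aggregate(clean(raw_rows))
best = top_region(totals)
```

Note that every function here depends on the exact field names of the source, which is why a change in the data structure ripples through the whole chain.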

The "upwards" modeling has its advantages and disadvantages. Its advantages are:

  • A more transparent and comprehensible structure and business logic of workflows.
  • Simpler and faster model building.
  • Easier error search.

On the other hand, the disadvantages are:

  • Workflows are oriented toward one specific task only.
  • Difficulties in reusing models for similar tasks.
  • If a data structure changes, the whole workflow has to be changed.

Going "Downwards"

As intelligent data analysis techniques became more widespread in various fields of human activity, certain industries began to develop their own specific methods and approaches for decision support, grounded in these new technologies.

Consequently, a surge of innovative concepts for data mining applications began to materialize, often preceding the availability of concrete data to construct suitable models. In many cases, theoretical advancements and strategic visions for enhancing business efficiency outpaced the acquisition of real-world datasets necessary for empirical testing and implementation.

Such a situation is not an obstacle to creating data analysis models. It just requires a new paradigm, different from the one declared three decades ago by the pioneers of data mining: we can move not from data to knowledge but the other way around, from knowledge to data.

This is what constitutes the "downwards" modeling approach. The analyst begins by building the workflow levels that are most abstracted from any concrete data. Then, development moves "down" to more specific levels of operation. At the end of this development trajectory, the analyst understands what data is required and how to obtain it.
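
As a rough sketch of this trajectory, assume each workflow step is declared only by the fields it needs and the fields it produces. The `Step` class, the customer-retention example, and all field names below are hypothetical and are not part of Megaladata. Starting from the most abstract step, the data requirements fall out as the fields that no step produces.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    """A workflow step declared by what it needs and what it produces."""
    name: str
    needs: frozenset
    produces: frozenset


def required_source_fields(steps):
    """Return every field that some step needs but no step produces.

    These are the fields that must eventually come from source data.
    """
    produced = set()
    needed = set()
    for step in steps:
        produced |= step.produces
        needed |= step.needs
    return needed - produced


# Declared from the most abstract level down to the most specific one.
steps = [
    Step("decide_offer", frozenset({"churn_risk"}), frozenset({"retention_offer"})),
    Step(
        "score_churn",
        frozenset({"visits_per_month", "months_since_last_purchase"}),
        frozenset({"churn_risk"}),
    ),
    Step(
        "build_features",
        frozenset({"customer_id", "purchase_date", "visit_date"}),
        frozenset({"visits_per_month", "months_since_last_purchase"}),
    ),
]

required = required_source_fields(steps)
```

Here `required` contains only `customer_id`, `purchase_date`, and `visit_date`: the analyst learns which data to obtain without having loaded any.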

The benefits of the "downwards" approach are as follows:

  • The analysis model does what "needs to be done" and not what "the data allows".
  • Organizing the workflow development process is easier.
  • The specification is easier to formulate.
  • There are more reusability options.

However, "downwards" model building has its drawbacks:

  • The model developer needs to have some specific knowledge and ideas in the subject area prior to building the workflow.
  • Before real data is available, it is hard to model its cleaning and preprocessing correctly.

Building Workflows in Megaladata

Megaladata allows users to follow either of the two model development paradigms. That is, the first workflow node can be not only a data source but also an abstract model node, for which the analyst preconfigures input and output variables or outlines the structure of the input and output datasets. If the necessary data becomes available later, it can be used to train and apply the model (provided that the source metadata is synchronized with the model).
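
The synchronization condition can be illustrated with a short Python sketch. It does not use or reproduce the Megaladata API: `ModelNode`, `metadata_mismatches`, and the column names are hypothetical stand-ins. The idea is that the model is declared by its input and output columns only, and a data source that appears later is checked against that declaration.

```python
from dataclasses import dataclass


@dataclass
class ModelNode:
    """Hypothetical abstract model node: only the declared input and output
    columns (name -> Python type) are known at design time."""
    inputs: dict
    outputs: dict


def metadata_mismatches(node, source_columns):
    """Compare the declared inputs with the columns of a later data source.

    Returns a list of readable problems; an empty list means the source
    metadata matches the model's declared inputs.
    """
    problems = []
    for name, expected in node.inputs.items():
        if name not in source_columns:
            problems.append(f"missing column: {name}")
        elif source_columns[name] is not expected:
            actual = source_columns[name]
            problems.append(
                f"type mismatch for {name}: "
                f"expected {expected.__name__}, got {actual.__name__}"
            )
    return problems


# The model is declared first, with no data attached.
node = ModelNode(inputs={"age": int, "income": float}, outputs={"score": float})

# A source that arrives later, with one column stored as text.
late_source = {"age": int, "income": str}
problems = metadata_mismatches(node, late_source)
```

In this example `problems` reports a single type mismatch for `income`, which would have to be resolved before the data could be used to train the model.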

As Megaladata is an advanced analytical platform that supports both "upwards" and "downwards" modeling, you can choose the approach that provides a better solution for each specific problem.

"Upwards" development is preferable when the data is fully available for analysis and there is no plan to reuse the workflow to solve similar tasks in the future.

"Downwards" modeling works better if it is impossible to use the source data at the first stage of workflow construction (it may be unavailable, incomplete, insufficient, etc.). At the same time, the analyst has certain ideas and concepts regarding the source data structure as well as the goals and results of analytical processing. Building the workflow "downwards" allows for reusing the models effectively to solve related tasks.
