Data in the Corporate Information Factory

Data is one of a company's most valuable assets. However, for it to be useful, a special information architecture—a corporate information factory—is needed. Let's consider its structure, possible users, and the processes that take place in it.

What Is a Corporate Information Factory (CIF)?

Corporate Information Factory (CIF) is a term coined by Bill Inmon to describe an organization's integrated data management system.

CIF Structure

The structure of the CIF includes the following components:

External business environment: Enterprises and people from the external environment generate transactions that feed the CIF. These same entities are also consumers of the results provided by the CIF.

Applications: This is the family of systems from which the CIF collects raw, disaggregated business data. Applications support day-to-day, routine activities such as order processing, accounts payable, etc.

Operational data store (ODS): Stores domain-specific, integrated and updated granular data used to support operational decision-making within the organization.

Integration and transformation modules: Here, the data collected by applications is transformed to fit into the corporate data structure.

Data warehouse (DW): Stores a subject-oriented, integrated, time-variant and non-volatile collection of granular and aggregate data used to support strategic decision-making.

Data marts: DW subsets tailored to support the analytical requirements of a specific business unit.

Exploratory and analytical data warehouses: These components are used for data exploration and analysis, executing large and complex analytical queries. They are physically isolated from the main warehouse, so analytical activities associated with large loads on storage systems do not affect the performance of the entire factory.

Alternative data warehouse: Rather than supporting operational activities and decision-making, it is intended to store large arrays of disaggregated data accumulated in the CIF over long periods. Storage here costs several times less than in the main DW, so its size can be increased almost indefinitely.

Decision support systems: A suite of complex applications centered around the main DW, forming a distinct CIF component.

Processes and Users

Each user in the CIF has a specific role. In addition, the CIF implements many work processes, the main ones being:

Client communications
Query management
Information delivery
Configuration management
Data quality management
System administration

Managing users and processes within a CIF can be challenging, as it requires consideration of the company’s culture, policies, economics, geography, and other factors. For instance, a company with a traditionally centralized information system may face hurdles in implementing data marts for individual business areas, while a decentralized company might struggle with a centralized data warehouse.

Data Flow

Understanding the data flow is key to understanding how a CIF operates. Raw, unprocessed, granular data is refined by applications and then passed to the integration and transformation level, where the operational data is transformed into enterprise data.

Data flows from the integration and transformation level to both the ODS and the DW. The DW can receive data from both the ODS and the integration layer. Once loaded, the data is available for analysis and decision support.
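The flow described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `Stores`, `integrate`, and `load`, the record fields, and the in-memory containers standing in for the ODS and the DW are all hypothetical and do not come from any CIF specification.

```python
from dataclasses import dataclass, field


@dataclass
class Stores:
    """In-memory stand-ins for the two targets of the integration layer."""
    ods: dict = field(default_factory=dict)  # current, updatable values
    dw: list = field(default_factory=list)   # append-only history


def integrate(raw: dict, source: str) -> dict:
    """Transform an application record into a (hypothetical) enterprise format."""
    return {
        "customer_id": str(raw["cust"]).strip().upper(),
        "amount": round(float(raw["amt"]), 2),
        "source": source,
    }


def load(stores: Stores, record: dict) -> None:
    """Send one integrated record to both the ODS and the DW."""
    stores.ods[record["customer_id"]] = record  # ODS keeps the latest state
    stores.dw.append(record)                    # DW keeps every version


stores = Stores()
for raw in [{"cust": " c1 ", "amt": "10.5"}, {"cust": "C1", "amt": "7"}]:
    load(stores, integrate(raw, source="orders_app"))
```

After the loop, the ODS holds a single, updated entry for the customer, while the DW holds both versions, which reflects the "updated" versus "non-volatile" distinction between the two stores.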

This architecture mirrors a real factory. Raw materials and components arrive at the plant and are processed by specialists. Assembly lines then turn the raw materials into products. Some of them become fully finished products, others turn into semi-finished products from which other goods can be made later.

Components

Types of Data

Raw granular data

This data is typically collected from applications and loaded into the DW and ODS via the integration and transformation (I&T) layer. However, some of it can be collected and loaded directly into the ODS. This occurs when end users need access to data that is not currently managed by an application.

In this case, the ODS in essence becomes the 'authoritative' source system for the DW, because managing the data directly in the warehouse would be inefficient.

The DW is designed to support strategic decisions based on aggregated data and simply lacks the functionality to efficiently store, process, and access transactional information in real-time. Furthermore, making the DW the 'authoritative source' for raw data would likely require modifying it to support related operational activities, demanding performance and availability characteristics it wasn't initially designed for.

External data

A key source of information in the CIF is external data, that is, data generated outside of it by other organizations. It can be of almost any type and volume, structured or unstructured, individual or aggregated.

One of the fundamental differences between external and internal data is the ability to manage them. Whenever it is necessary to change internal data, you can adjust your registration and collection system. Since the sources of external data are outside the CIF, it is impossible to change such data from within it.

Thus, CIF architects have to either use external data as is or discard it altogether. The only change that can be made is to the keys of external data, which may be modified on entry into the CIF. This often happens when the company needs to match external data to an existing customer. The system attempts to compare the name and address associated with the external data with the name and address in the customer database. If a match is found, the external key is replaced with the internal customer identifier and the external data is accepted.

In many cases, external data's key structure is very different from that of the CIF. To integrate this data, these external keys typically need to be mapped or used to look up the corresponding internal keys (like surrogate keys or conformed business keys) established within the CIF.
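The matching step described above might look like the following sketch. It assumes exact matching on a normalized name and address, which is a simplification (production systems often use fuzzy matching). The field names (`ext_key`, `name`, `address`), the `customers` index, and both functions are hypothetical.

```python
from typing import Optional


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences do not block a match."""
    return " ".join(text.lower().split())


def match_external(record: dict, customers: dict) -> Optional[dict]:
    """Return the record re-keyed with the internal customer ID, or None if unmatched."""
    lookup_key = (normalize(record["name"]), normalize(record["address"]))
    internal_id = customers.get(lookup_key)
    if internal_id is None:
        return None  # no match: the caller decides whether to discard the record
    # Drop the external key and attach the internal identifier instead.
    accepted = {k: v for k, v in record.items() if k != "ext_key"}
    accepted["customer_id"] = internal_id
    return accepted


# Hypothetical customer index: (name, address) -> internal identifier.
customers = {("jane doe", "1 main st"): 1001}

external = {"ext_key": "X-77", "name": "Jane  Doe", "address": "1 Main St", "score": 0.8}
matched = match_external(external, customers)
unmatched = match_external({"ext_key": "X-78", "name": "Bob", "address": "?"}, customers)
```

Only the key changes: the payload of the external record (here, `score`) is accepted as is, in line with the constraint that external data cannot be altered from within the CIF.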

External data can be accessed by any component of the CIF. If it is to be used in several data marts, it is recommended to load it into the DW first and then distribute it to the data marts from there, which ensures that every mart receives the same version.

The component where external data plays the most significant role is the exploratory warehouse, which is used by analysts seeking to gain new business insights.

Reference data

Reference data represents a standardized set of values or codes for classifying and uniformly defining other (usually master) data. It gives master data context and meaning, enabling consistent and accurate interpretation and analysis of the information.

Unlike transactional data, which records specific business events and changes continuously over time, reference data remains constant or changes very slowly. It serves as a basis for interpreting and analyzing data across various applications, systems, and processes.

The key purposes of reference data are to:

Establish definitions, classifications, and relationships for business objects.
Ensure consistency and accuracy of presentation.
Improve data quality by streamlining data integration and simplifying data exchange within and between organizations.

In the financial sphere, examples of reference data are identifiers of securities and financial instruments, such as shares and bonds. In e-commerce, these are codes of goods and commodity lines. Examples in marketing are the addresses and phone numbers of clients.

Having separately stored arrays of reference data at your disposal allows you to quickly access them without overloading the system. It is their accuracy and relevance that guarantee the correctness of operational actions, such as issuing an invoice to a client.

The volume of reference data in the CIF is relatively small compared to transactional data. Because of this, it is often treated as secondary data. In addition, reference data is very stable, so managing it usually requires little effort.

Reference data usually belongs to the entire company and not to individual departments. As a result, special processing regulations are rarely established for it. Nevertheless, it deserves the same attention as any other type of data in the CIF.

There are at least three reasons why reference data plays an important role in the CIF:

When the reference data in the application and the storage coincide, integration is significantly simpler.
Regardless of whether the CIF includes data marts, exploratory warehouses, operational warehouses, etc., correctly formed and maintained reference data will help to ensure that the same object has the same interpretation in all parts of the system.
As reference data becomes outdated, the repository should store its chronology so that historical data can be linked to the corresponding reference data.
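The third point, keeping the chronology of reference data, can be illustrated with an effective-dated lookup. The sketch below is an assumption about one possible layout: `REGION_HISTORY`, its codes and dates, and the function `reference_as_of` are all invented for the example.

```python
from bisect import bisect_right
from datetime import date
from typing import Optional

# Hypothetical history for one reference code: (valid_from, description), sorted by date.
REGION_HISTORY = {
    "R1": [(date(2020, 1, 1), "North"), (date(2023, 1, 1), "North-East")],
}


def reference_as_of(code: str, on: date) -> Optional[str]:
    """Return the description of `code` that was in force on the given date."""
    versions = REGION_HISTORY.get(code, [])
    starts = [valid_from for valid_from, _ in versions]
    idx = bisect_right(starts, on) - 1  # last version starting on or before `on`
    return versions[idx][1] if idx >= 0 else None
```

With such a lookup, a transaction dated 2021 is interpreted with the description valid in 2021, even after the code has been redefined.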

Historical data

Even if the data was received by the CIF a few seconds ago, it is already considered historical in the sense that it reflects a business event that has already occurred.

Historical data provides primary context for business operations and often constitutes the greatest volume within the CIF. Separately, master data—which defines core business entities like customers and products and also provides vital context—requires careful management, often through dedicated Master Data Management (MDM) systems to ensure its quality and consistency across the organization.

Historical data in the CIF has the following features:

The longer the history, the greater the volume of data.
The more recent the data, the more accurately it reflects the current business situation.
The more current the data, the more likely it is to be used while disaggregated.
The older the data, the more likely it is to be used in an aggregated form.
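The last two features suggest a simple policy: keep recent records granular and roll older ones up. The sketch below shows one way this could be done. The 30-day cutoff, the monthly grain, the record fields, and the function name `split_by_age` are assumptions chosen for illustration.

```python
from collections import defaultdict
from datetime import date, timedelta


def split_by_age(records, today, keep_days=30):
    """Keep recent records as-is; roll older ones up into monthly totals."""
    cutoff = today - timedelta(days=keep_days)
    recent = [r for r in records if r["day"] >= cutoff]
    monthly = defaultdict(float)
    for r in records:
        if r["day"] < cutoff:
            monthly[(r["day"].year, r["day"].month)] += r["amount"]
    return recent, dict(monthly)


records = [
    {"day": date(2024, 1, 5), "amount": 10.0},
    {"day": date(2024, 1, 20), "amount": 5.0},
    {"day": date(2024, 3, 28), "amount": 7.0},
]
recent, monthly = split_by_age(records, today=date(2024, 4, 1))
```

Here the two January records collapse into a single monthly total, while the late-March record stays disaggregated.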

The CIF's application layer contains the most current data for a period of up to 30 days. Of course, the idea of "relevance" may vary depending on the operation area. In some industries, information may be stored for 30 days, and in others for a year.

In the ODS, the storage period is the same as at the application level. The only difference is that the ODS contains integrated data.

The main DW stores historical information that ranges in age from at least 24 hours to 5-10 years. In practice, this range also depends on the business area.

The largest volume of historical data is contained in the alternative data warehouse, where most of the information from the main DW is archived.

Historical data is also contained in exploratory and analytical warehouses, but its use in these environments is usually project-oriented, so its history is limited. Consequently, these components of the CIF do not require large volumes for long-term storage.

One of the problems associated with historical data is the overlap of its storage periods in various components of the CIF. The first overlap occurs between the application level and the ODS. However, applications contain granular data, whereas the ODS contains integrated data. Therefore, despite the overlap, there is no complete duplication.

The second overlap is between the DW, where the data history starts from 24 hours, and the application level, where the storage period is about 30 days. Here, duplication may be possible.
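The overlaps can be made explicit by comparing the age windows of the components. The sketch below uses the periods quoted in the text, with 10 years approximated as 3650 days. The `WINDOWS` table and the `overlap` function are hypothetical helpers, not part of any CIF tooling.

```python
from datetime import timedelta

# Age windows (youngest, oldest) per component; values taken from the text,
# with 10 years approximated as 3650 days.
WINDOWS = {
    "applications": (timedelta(0), timedelta(days=30)),
    "ods": (timedelta(0), timedelta(days=30)),
    "dw": (timedelta(hours=24), timedelta(days=3650)),
}


def overlap(a: str, b: str):
    """Return the (youngest, oldest) age range held by both components, or None."""
    start = max(WINDOWS[a][0], WINDOWS[b][0])
    end = min(WINDOWS[a][1], WINDOWS[b][1])
    return (start, end) if start < end else None
```

Under these assumptions, `overlap("applications", "dw")` covers data between 1 and 30 days old, which is the range where duplication may occur.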

Metadata

In addition to the master and reference data, the CIF contains another type of data that plays a key role in its operation: metadata. In the most general sense, metadata is "data about data". It contains information about the properties and structures of other data types (master data, reference data, etc.). Metadata is information about the attributes and properties of business objects, required for smooth and automatic management of large information flows.

The CIF mainly handles structured data organized into tables with typed fields (columns). Metadata defining this structure consists of details like table and field names, data types, constraints (such as character limits), and date/time formats. Even semistructured data, like text, has associated metadata, such as formatting information (footnotes, headers, etc.). This highlights that metadata is indispensable for effectively creating, processing, and understanding data in any form.
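To make this concrete, the sketch below stores such structural metadata as plain objects and uses it to check a row. The `FieldMeta` class, the `ORDERS` table definition, and the `violations` function are hypothetical and only illustrate the idea.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class FieldMeta:
    name: str
    dtype: type
    max_length: Optional[int] = None   # character limit, for text fields
    date_format: Optional[str] = None  # strptime pattern, for date fields


# Hypothetical metadata for an "orders" table.
ORDERS = [
    FieldMeta("order_id", int),
    FieldMeta("customer", str, max_length=20),
    FieldMeta("placed_on", str, date_format="%Y-%m-%d"),
]


def violations(row: dict, fields: list) -> list:
    """Return a list of human-readable problems found in one row."""
    problems = []
    for f in fields:
        value = row.get(f.name)
        if not isinstance(value, f.dtype):
            problems.append(f"{f.name}: expected {f.dtype.__name__}")
            continue
        if f.max_length is not None and len(value) > f.max_length:
            problems.append(f"{f.name}: longer than {f.max_length} characters")
        if f.date_format is not None:
            try:
                datetime.strptime(value, f.date_format)
            except ValueError:
                problems.append(f"{f.name}: does not match {f.date_format}")
    return problems
```

A row that respects the declared types, limits, and formats yields an empty list. Any other row yields one message per violated constraint.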

At first glance, it is easy to distinguish data from metadata: the former creates a context for business activities, while the latter serves to ensure the functioning of the system that stores and processes the main data. However, it is impossible to make an unambiguous distinction for the following reasons:

Some items can be both data and metadata. For example, a table's name is metadata when it serves as an identifier for accessing the table, and data when it creates a context for understanding the table's contents.
Data and metadata can swap roles. For example, a table field used as a key carries business information at some moments and performs a service function at others.
It is possible to create meta-meta-…-metadata. Since metadata is still a type of data, it is possible to create metadata for it.

Metadata can be formed manually or automatically. The former is more valuable, since it reflects the analyst's view of the problem. However, manual addition of metadata is only possible for small volumes, so in a CIF, where data flows in a continuous stream, it is not feasible.

Three main types of metadata that can be used in the CIF (classification by NISO):

Structural: This metadata describes the structure and components of compound data objects, and their relationships to one another. For example, the representation of numbers, time and date format in tables, field types, etc.
Administrative: Metadata that enables information processing in the CIF, for example, cleaning, transformation, or integration. This metadata may include data resource type, creation date, access permissions, etc.
Descriptive: This metadata describes the nature and characteristics of business objects. It is used for discovery and identification. For example, product codes or commodity line identifiers.

Some researchers distinguish technical (internal) metadata and business metadata. Technical metadata organizes the management and processing of data in the CIF, while business metadata creates a context for business data. For example, business metadata includes business rules, data quality requirements, acceptable values for reference data, etc.

Comparing the two terminologies, structural metadata corresponds entirely to technical metadata. Administrative metadata can be either technical or business metadata. Descriptive metadata is primarily business metadata.

Metadata acts as the 'glue' in the CIF, binding its components together and enabling them to interact. However, managing this metadata reveals a contradiction: it needs to be specific to individual CIF components, yet it also needs to be shared and standardized across them for integration.

For storing metadata within the CIF, two architectures are available: centralized and autonomous. The centralized approach uses a single metadata repository for the whole CIF, while the autonomous approach stores metadata directly within the associated CIF components.

On the one hand, a centralized repository increases the efficiency of metadata management in terms of maintaining consistency and quality. On the other hand, a centralized architecture limits the possibilities of using metadata by company departments and analysts. An autonomous architecture also has its drawbacks, since it does not guarantee uniform interpretation of metadata by all components of the CIF. Therefore, in practice, a hybrid architecture is often used, providing a certain balance.
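The text does not specify how a hybrid architecture resolves metadata, so the sketch below shows just one assumed policy: shared definitions always come from the central repository, and component-specific entries are looked up locally. The class `MetadataRegistry`, its methods, and the sample keys are hypothetical.

```python
class MetadataRegistry:
    """Hybrid lookup: shared definitions are central, the rest is component-local."""

    def __init__(self, central: dict):
        self.central = central  # key -> definition shared by the whole CIF
        self.local = {}         # component name -> {key: definition}

    def register_local(self, component: str, key: str, value: str) -> None:
        """Store a definition that is specific to one component."""
        self.local.setdefault(component, {})[key] = value

    def resolve(self, component: str, key: str):
        """Use the central definition when one exists, else the component's own."""
        if key in self.central:
            return self.central[key]
        return self.local.get(component, {}).get(key)


registry = MetadataRegistry(central={"customer_id": "Internal customer identifier"})
registry.register_local("sales_mart", "region", "Sales territory code")
registry.register_local("sales_mart", "customer_id", "Local override attempt")
```

Giving the central repository priority keeps the interpretation of shared terms uniform, while each component can still describe the fields that only it uses.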
