Data Analysis Methodology


The truth is, methods that are theoretically perfect often have little in common with the real world. Analysts frequently encounter situations where it's difficult to make clear assumptions about the problem. The underlying model is unknown, and the only source of information is a simple table of observed "input–output" data, where each row contains an object's input characteristics and the corresponding output values.
As a result, analysts are forced to rely on various heuristics or expert assumptions when choosing informative features, selecting a model class, and setting model parameters. These assumptions are based on their experience, intuition, and understanding of the analyzed process.
The conclusions this approach can provide rest on a simple but fundamental hypothesis about the continuity of the solution space: "Similar inputs lead to similar system outputs." This idea is intuitively clear and usually sufficient to produce practical, acceptable solutions.
This method sacrifices academic rigor for practicality, which is nothing new. If an approach to solving a problem contradicts reality, we tend to change the approach.
In data analysis and, more specifically, Machine Learning, there is another key point: The process of extracting knowledge from data follows the same pattern as determining physical laws. It involves collecting evidence-based data, organizing it into tables, and searching for a logical pattern that, first, makes the results obvious and, second, allows for the prediction of new facts.
At the same time, it’s understood that our knowledge of an analyzed process, like any physical phenomenon, is to some extent an approximation. Any system of reasoning about the real world involves various approximations. The very term "machine learning" acknowledges the physical approach, alongside the mathematical one, to solving data analysis problems. So, what exactly is a "physical approach"?
This approach suggests that the analyst is prepared for the process to be too complex for precise analysis with strict mathematical methods. However, it is still possible to get a good idea of its behavior by approaching the problem from different angles, guided by knowledge of the subject area, experience, intuition, and various heuristic methods. In this case, we move from a rough model toward an increasingly precise understanding of the process. To paraphrase Richard Feynman: you can study a system's characteristics perfectly well; you just have to stop obsessing over absolute accuracy.
A generalized data analysis flow follows an iterative pattern.
This approach implies that:
- You must rely on the experience of an expert.
- You need to look at the problem from multiple angles and combine approaches.
- You shouldn't strive for high accuracy right away; instead, you should move toward a solution using simpler, rougher models before moving to more complex and accurate ones.
- You should stop as soon as you get an acceptable result, without pursuing an ideal model.
- As time passes and new information becomes available, the cycle must be repeated, as the learning process is endless.
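The move from rough models toward more accurate ones, with an early stop at "good enough," can be sketched as a simple loop. This is only an illustration: the polynomial model family, the error threshold, and the toy data are assumptions made for the sketch, not part of the methodology itself.

```python
import numpy as np

def fit_until_acceptable(x, y, max_degree=5, acceptable_rmse=0.1):
    """Try increasingly complex models; stop at the first acceptable one."""
    for degree in range(1, max_degree + 1):
        coeffs = np.polyfit(x, y, degree)              # rough -> finer model
        rmse = np.sqrt(np.mean((np.polyval(coeffs, x) - y) ** 2))
        if rmse <= acceptable_rmse:                    # "good enough" beats "ideal"
            return degree, rmse
    return max_degree, rmse                            # settle for the best we got

# Toy data: a quadratic trend with mild noise
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 50)
y = 2.0 * x**2 + 0.01 * rng.standard_normal(50)

degree, rmse = fit_until_acceptable(x, y)
```

Here a straight line fails the threshold, a quadratic passes, and the loop stops there rather than chasing higher degrees, mirroring the "stop at an acceptable result" principle above.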
Example: Analyzing a real estate market
As an example, let's consider analyzing the real estate market in Mexico City to assess the investment potential of projects. The task is to build a pricing model for new housing: in other words, to find the quantitative relationship between housing prices and the factors that shape them. For typical housing, these include:
Location: Neighborhood prestige, infrastructure, proximity to undesirable areas (e.g., industrial sites, older buildings, or markets), and ecology (e.g., nearby parks).
Apartment location: Floor (first and last floors are usually cheaper), section (apartments in end sections are cheaper), orientation (north-facing apartments are often cheaper), and the view.
Building type: Refers to the architectural style or class of the building, such as a high-rise apartment block, a low-rise building, a townhouse, or a duplex.
Apartment size: The total area of the apartment, typically measured in square meters.
Amenities: Availability of elevators, balconies, etc.
Construction stage: The closer to completion, the higher the price per square meter.
Finishing: Rough, partial, or turnkey.
Transport access: Proximity to the subway, distance from major highways, ease of access, and availability of parking.
Seller: The investor or developer ("first-hand") versus an intermediary realtor.
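Before any model can use them, factors like these have to become numbers. Below is a minimal sketch of such an encoding; the field names, category values, and numeric scales are illustrative assumptions, not the article's actual schema.

```python
# Illustrative ordinal/binary encodings (values are assumptions, not from the article)
FINISHING = {"rough": 0.0, "partial": 0.5, "turnkey": 1.0}
FLOOR = {"first": 0.0, "middle": 1.0, "last": 0.0}     # first/last floors priced lower
SECTION = {"corner": 0.0, "regular": 1.0}              # corner sections priced lower

def encode_listing(listing):
    """Map one raw listing (a dict) to a numeric feature vector."""
    return [
        FINISHING[listing["finishing"]],
        FLOOR[listing["floor"]],
        SECTION[listing["section"]],
        float(listing["rooms"]),
        float(listing["area_m2"]),
        float(listing["construction_stage"]),  # 0.0 = groundwork .. 1.0 = completed
    ]

vec = encode_listing({"finishing": "turnkey", "floor": "middle",
                      "section": "regular", "rooms": 2,
                      "area_m2": 64.0, "construction_stage": 0.8})
# vec → [1.0, 1.0, 1.0, 2.0, 64.0, 0.8]
```

Each row of the "input–output" table becomes one such vector plus the observed price per square meter.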
This is where Feynman's statement about the ideal model and accuracy is very useful.
To begin, we can choose information about just one city district from all the available sales data. As input factors, we can also use a limited set of characteristics that experts believe influence the selling price of housing the most: building series, finishing, floor (first, last, or middle), readiness of the property, number of rooms, section (corner or regular), and square footage. The output will be the price per square meter at which the apartments were sold. This provides a clear table with a manageable number of input factors.
Then, we train a neural network on this data to build a rough model. Despite its approximate nature, it will have one significant advantage: It will correctly reflect the dependence of the price on the factors considered. For example, all other things being equal, an apartment in a corner section is cheaper than in a regular one, and the cost of apartments increases as the building nears completion. Later, the model can be improved and made more complete and accurate.
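A rough first model along these lines might look like the sketch below. The sales data here is synthetic and the architecture arbitrary (scikit-learn's `MLPRegressor` is a stand-in for whatever network the analyst actually uses); the point is only that even a crude model recovers the expected direction of each dependence, such as corner sections being cheaper.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler

# Synthetic "sales table": price per m² rises with readiness, drops for corner sections
rng = np.random.default_rng(42)
n = 300
area = rng.uniform(30.0, 120.0, n)             # total area, m²
stage = rng.uniform(0.0, 1.0, n)               # construction readiness, 0..1
corner = rng.integers(0, 2, n).astype(float)   # 1.0 = corner section
price = 900.0 + 300.0 * stage - 80.0 * corner + 0.5 * area + rng.normal(0.0, 10.0, n)

X = np.column_stack([area, stage, corner])
scaler = StandardScaler().fit(X)
y = (price - price.mean()) / price.std()       # standardized target helps convergence

model = MLPRegressor(hidden_layer_sizes=(16,), max_iter=3000, random_state=0)
model.fit(scaler.transform(X), y)

# Sanity check: all else being equal, a corner apartment should come out cheaper
regular = model.predict(scaler.transform([[70.0, 0.5, 0.0]]))[0]
corner_flat = model.predict(scaler.transform([[70.0, 0.5, 1.0]]))[0]
```

The same sanity checks (corner cheaper than regular, price rising with readiness) are exactly what the rough model is expected to get right before any refinement.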
At the next stage, we can add sales records from other areas of Mexico City to the training set. Correspondingly, characteristics such as the prestige of the area, local ecology, and the distance from the subway can be included as input factors. It's also a good idea to add the price for similar housing on the secondary market to the training set.
Specialists with real estate experience can freely experiment with improving the model by adding or excluding factors, since finding a better model simply means retraining the neural network on different datasets. The main thing is to recognize that this process never truly ends.
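Such add-or-exclude experiments amount to retraining on different feature sets and comparing holdout error. A minimal sketch follows, with synthetic data and a plain least-squares model standing in for the neural network purely for brevity:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 400
area = rng.uniform(30.0, 120.0, n)
stage = rng.uniform(0.0, 1.0, n)          # a genuinely price-forming factor
irrelevant = rng.uniform(0.0, 1.0, n)     # a candidate factor with no real effect
price = 900.0 + 0.5 * area + 300.0 * stage + rng.normal(0.0, 10.0, n)

def holdout_rmse(factors):
    """Train on the first 300 rows, report the error on the remaining 100."""
    X = np.column_stack(factors + [np.ones(n)])   # features plus an intercept column
    coeffs, *_ = np.linalg.lstsq(X[:300], price[:300], rcond=None)
    resid = X[300:] @ coeffs - price[300:]
    return np.sqrt(np.mean(resid ** 2))

with_stage = holdout_rmse([area, stage])
without_stage = holdout_rmse([area, irrelevant])
# Including the genuinely price-forming factor should cut the error sharply
```

Comparing holdout error across feature sets is exactly the loop the specialists run; a factor earns its place in the model when dropping it measurably hurts.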
This is an example of a highly effective approach to data analysis: Using a specialist's experience and intuition to systematically increase the accuracy of the process's model. The main requirement is the availability of sufficient, high-quality data, which is impossible without an automated data collection and storage system. This is a crucial point for anyone providing business information support.
Conclusion
The "physical approach" to data analysis described above allows for solving real-world problems with acceptable quality. While one can find many shortcomings, in reality, there's no true alternative unless you abandon analysis altogether. If physicists have successfully used such methods for centuries, why not adopt them in other fields?