Clustering algorithms in data mining
Introduction
Clustering, the process of combining similar objects into groups, is one of the fundamental tasks in data analysis and data mining. The range of areas where it can be applied is wide: image segmentation, marketing, anti-fraud procedures, impact analysis, text analysis, and so on. Today, clustering is often the first step in data analysis: after the data has been grouped, other methods are applied and a separate model is built for each group.
The problem of clustering, in one form or another, has been described in such scientific areas as statistics, pattern recognition, optimization, and machine learning. This explains the wide range of synonyms for the concept of a cluster, such as class and taxon.
At present, the number of methods for splitting sets of objects into clusters is quite large: several dozen algorithms and even more modifications of them. Here, however, we review clustering algorithms from the point of view of their application in data mining.
Clustering in Data Mining
Clustering in data mining becomes valuable when it acts as one of the stages of data analysis within a complete analytical solution. It is often easier for an analyst to identify groups of similar objects, study their features, and build a separate model for each group than to create a single general model for all the data. This technique is constantly used in marketing to identify groups of customers, buyers, and products and to develop a separate strategy for each of them.
Very often, the data used in data mining has the following important features:
- high dimensionality (thousands of fields) and large volume (hundreds of thousands or even millions of records) of database tables and data stores (ultra-large databases);
- the data sets contain a large number of numeric and categorical attributes.
All attributes or features of objects are divided into numeric and categorical. Numeric attributes are those whose values can be arranged in order in some space, while categorical attributes are those whose values cannot be ordered. For example, the "Age" attribute is numeric, and the "Color" attribute is categorical. Assigning values to attributes happens during measurement using a selected type of scale, which, generally speaking, is a separate task.
Most clustering algorithms compare objects with each other based on some measure of proximity (similarity). A measure of proximity is a value that has a limit and increases as the objects become closer to each other. Similarity measures are "constructed" according to special rules, and the choice of a specific measure depends on the task as well as on the measurement scale. Euclidean distance, calculated using the formula below, is often used as a proximity measure for numeric attributes:
D(x, y)=\sqrt{\sum_{i}{(x_i-y_i)^2}}
The Czekanowski-Sørensen and Jaccard similarity coefficients are commonly used for categorical attributes (for example, the Jaccard coefficient is \mid t_1 \cap t_2 \mid / \mid t_1 \cup t_2 \mid).
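As an illustration, here is a minimal sketch of both kinds of measures in Python; the objects and attribute values are hypothetical.

```python
import math

def euclidean_distance(x, y):
    """Euclidean distance between two numeric attribute vectors."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def jaccard_similarity(t1, t2):
    """|t1 & t2| / |t1 | t2| for two sets of categorical values."""
    t1, t2 = set(t1), set(t2)
    return len(t1 & t2) / len(t1 | t2)

# hypothetical objects with two numeric attributes
print(euclidean_distance([1.0, 2.0], [4.0, 6.0]))               # 5.0
# hypothetical objects described by sets of categorical values
print(jaccard_similarity({"red", "round"}, {"red", "square"}))  # 1/3
```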
The need to process large amounts of data in data mining has led to a set of requirements that, whenever possible, a clustering algorithm should meet. Let's consider them:
- The minimum possible number of passes through the database;
- Operation with a limited amount of RAM;
- Ability to interrupt the algorithm and save the intermediate results so that you can continue the calculations later;
- The algorithm should operate when the objects from the database can be retrieved only in the unidirectional cursor mode (i.e., in the record navigation mode).
An algorithm that meets these requirements (especially the second one) is called scalable. Scalability is the most important property of an algorithm, and it depends on the algorithm's computational complexity and on its software implementation. There is also a more rigorous definition: an algorithm is called scalable if, with a constant RAM capacity, its running time grows linearly with the number of records in the database.
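To make these requirements more concrete, the sketch below (not any particular published algorithm) processes records obtained from a unidirectional cursor in a single forward pass, keeps only a fixed-size state in RAM, and returns that state so it can be saved and the calculation resumed later.

```python
def single_pass_summary(cursor, state=None):
    """One forward pass over the records; only a fixed-size state is kept in RAM.

    `cursor` is any iterable of numeric records fetched one by one;
    `state` is an optional previously saved intermediate result, so the
    computation can be interrupted and continued later.
    """
    if state is None:
        state = {"n": 0, "sums": None}
    for record in cursor:
        state["n"] += 1
        if state["sums"] is None:
            state["sums"] = list(record)
        else:
            state["sums"] = [s + v for s, v in zip(state["sums"], record)]
    return state

# hypothetical usage: two chunks of the same table processed separately
part1 = single_pass_summary([[1.0, 2.0], [3.0, 4.0]])
part2 = single_pass_summary([[5.0, 6.0]], state=part1)  # resumed from saved state
print(part2)  # {'n': 3, 'sums': [9.0, 12.0]}
```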
However, it is not always necessary to process extremely large data sets, which is why, at the dawn of cluster analysis, almost no attention was paid to the scalability of algorithms. It was assumed that all the processed data would fit into RAM, so the main emphasis had always been on improving the quality of clustering.
It is difficult to maintain a balance between high clustering quality and scalability. Therefore, ideally, the data mining arsenal should include both efficient algorithms for clustering small data arrays and scalable algorithms for processing large databases.
Clustering Algorithms: Brilliance and Misery
So we can already divide clustering algorithms into scalable and non-scalable ones. Let's broaden the classification.
By the method of splitting into clusters, algorithms are of two types: hierarchical and non-hierarchical. Classical hierarchical algorithms work only with categorical attributes, and a complete tree of nested clusters is built. Agglomerative methods of building cluster hierarchies are common here: they perform sequential merging of the source objects with a corresponding reduction of the number of clusters. Hierarchical algorithms provide relatively high-quality clustering and do not require the number of clusters to be specified in advance. Most of them have a complexity of O(n^2).
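As a sketch of the agglomerative approach, the hierarchical clustering routines in SciPy can build the full tree of nested clusters from a precomputed proximity matrix and then cut it into the desired number of groups; the distance matrix below is hypothetical.

```python
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# hypothetical precomputed pairwise distances between 4 objects
# (any proximity measure can be used here, e.g. 1 - Jaccard for categorical data)
D = [[0.0, 0.2, 0.9, 0.8],
     [0.2, 0.0, 0.85, 0.9],
     [0.9, 0.85, 0.0, 0.1],
     [0.8, 0.9, 0.1, 0.0]]

Z = linkage(squareform(D), method="average")      # sequential merges, full tree
labels = fcluster(Z, t=2, criterion="maxclust")   # cut the tree into 2 clusters
print(labels)  # e.g. [1 1 2 2]
```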
Non-hierarchical algorithms are based on the optimization of a certain objective function that determines the optimal, in a certain sense, partitioning of a set of objects into clusters. This group includes the popular algorithms of the k-means family (k-means, fuzzy c-means, Gustafson-Kessel), which use the sum of squared weighted deviations of the object coordinates from the centers of the desired clusters as the objective function.
The clusters sought are either spherical or ellipsoidal in shape. In the canonical implementation, the function is minimized using the method of Lagrange multipliers, which allows only the nearest local minimum to be found. Using global search methods (genetic algorithms) would significantly increase the computational complexity of the algorithm.
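The sketch below is a minimal NumPy version of the usual alternating k-means iterations on hypothetical two-dimensional data; it only illustrates how the objective is driven toward a local minimum and is not a production implementation.

```python
import numpy as np

def kmeans(X, k, n_iter=20, seed=0):
    """Minimal k-means: alternate assignment and centroid update steps."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # each iteration computes all object-to-center distances: O(k*m*n)
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        new_centers = []
        for j in range(k):
            pts = X[labels == j]
            # keep the old center if a cluster happens to lose all its points
            new_centers.append(pts.mean(axis=0) if len(pts) else centers[j])
        centers = np.array(new_centers)
    return labels, centers

# hypothetical 2-D data forming two compact, well-separated groups
X = np.array([[0.0, 0.1], [0.2, 0.0], [0.1, 0.2],
              [5.0, 5.1], [5.2, 4.9], [4.9, 5.2]])
labels, centers = kmeans(X, k=2)
print(labels, centers, sep="\n")
```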
Among the non-hierarchical algorithms that are not based on distance, special mention should go to the EM algorithm (Expectation-Maximization). Instead of cluster centers, it assumes that each cluster has a probability density function with a corresponding expected value and variance. In a mixture of distributions (Fig. 2), their parameters (means and standard deviations) are sought using the maximum likelihood principle, and the EM algorithm is one implementation of this search. The problem is that, before the algorithm starts, a hypothesis must be put forward about the types of the distributions, which are difficult to evaluate in the data set as a whole.
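For example, scikit-learn's GaussianMixture class implements EM for a mixture of normal distributions; in the sketch below the data is synthetic, and the number of components and the Gaussian form of the distributions are assumed to be known in advance.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# hypothetical sample drawn from two overlapping normal distributions
rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(0.0, 1.0, 300),
                    rng.normal(5.0, 1.5, 200)]).reshape(-1, 1)

# the type and number of distributions must be hypothesized before fitting
gm = GaussianMixture(n_components=2, random_state=0).fit(X)
print(gm.means_.ravel())        # estimated expectations of the two components
print(gm.covariances_.ravel())  # estimated variances
print(gm.weights_)              # mixture proportions
labels = gm.predict(X)          # most likely component for each object
```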
Another problem occurs when the attributes of an object are mixed: one part is of a numeric type and the other part is of a categorical type. For example, suppose you want to calculate the distance between the following objects with the attributes (Age, Gender, Education):
{23, male, higher}, (1)
{25, female, secondary}, (2)
The first attribute is numeric, the others are categorical. If we want to use a classical hierarchical algorithm with some measure of similarity, we will have to somehow discretize the "Age" attribute, for example, as follows:
{under 30, male, higher}, (1)
{under 30, female, secondary}, (2)
At the same time, we will certainly lose some of the information. If we determine the distance in Euclidean space, questions arise regarding the categorical attributes. It is clear that the distance between "Gender: male" and "Gender: female" can only reflect a match or a mismatch (0 or 1), because the values of this attribute are measured on a name (nominal) scale. The "Education" attribute can be measured both on a name scale and on an order scale, by assigning a certain score to each value.
Which option should we choose? And what if the categorical attributes are more important than the numeric ones? Resolving these issues falls on the shoulders of the analyst. In addition, when using the k-means algorithm and similar ones, it is difficult to define cluster centers for categorical attributes and to set the number of clusters a priori.
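To make the dilemma concrete, the sketch below computes one possible distance between the two example objects, treating Gender on the name scale (match or mismatch), Education on the order scale with purely illustrative scores, and allowing the categorical attributes to be given a larger weight; all of these choices are assumptions, not prescriptions.

```python
import math

a = {"age": 23, "gender": "male", "education": "higher"}
b = {"age": 25, "gender": "female", "education": "secondary"}

def nominal_mismatch(x, y):
    """Name (nominal) scale: values either match (0) or mismatch (1)."""
    return 0.0 if x == y else 1.0

# order scale: illustrative scores assigned to Education levels
EDU_SCORE = {"secondary": 1, "higher": 2}

def mixed_distance(a, b, cat_weight=1.0):
    """Euclidean part for Age plus weighted categorical parts (illustrative)."""
    num_part = (a["age"] - b["age"]) ** 2
    gender_part = nominal_mismatch(a["gender"], b["gender"])
    edu_part = (EDU_SCORE[a["education"]] - EDU_SCORE[b["education"]]) ** 2
    return math.sqrt(num_part + cat_weight * (gender_part + edu_part))

print(mixed_distance(a, b))                # categorical and numeric treated equally
print(mixed_distance(a, b, cat_weight=4))  # categorical attributes weighted higher
```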
In distance-based non-hierarchical algorithms, the optimization of the objective function is iterative, and at each iteration the matrix of distances between objects has to be computed. With a large number of objects this is inefficient and requires significant computational power. The computational complexity of one iteration of the k-means algorithm is estimated as O(kmn), where k, m, and n are the number of clusters, attributes, and objects, respectively. But there can be many iterations! The algorithm has to make many passes over the data set.
The very approach of searching for spherical or ellipsoidal clusters is itself a major drawback of k-means. It works well when the data form compact clumps in space that are clearly separated from each other. But if the data have a nested shape, none of the algorithms of the k-means family will ever cope with the task. The algorithm also works poorly when one cluster is significantly larger than the others and they are close to each other: the effect of "splitting" the large cluster is observed (Fig. 3).
However, research on improving clustering algorithms is conducted constantly. Interesting extensions of the k-means algorithm have been developed for working with categorical attributes (k-modes) and with mixed attributes (k-prototypes). For example, in k-prototypes the distances between objects are calculated differently depending on the attribute type.
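A sketch of the k-prototypes idea is shown below: the dissimilarity is computed separately for numeric and categorical attributes and combined with a weighting factor (here called gamma and chosen arbitrarily for the illustration).

```python
def kprototypes_distance(x, q, numeric_idx, categorical_idx, gamma=1.0):
    """k-prototypes style dissimilarity between an object x and a prototype q:
    squared Euclidean distance on numeric attributes plus gamma times the
    number of mismatches on categorical attributes."""
    num = sum((x[j] - q[j]) ** 2 for j in numeric_idx)
    cat = sum(x[j] != q[j] for j in categorical_idx)
    return num + gamma * cat

# hypothetical objects: (Age, Gender, Education)
x = (23, "male", "higher")
q = (25, "female", "secondary")   # a cluster prototype
print(kprototypes_distance(x, q, numeric_idx=[0], categorical_idx=[1, 2], gamma=2.0))
# (23 - 25)^2 + 2.0 * 2 mismatches = 8.0
```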
Among scalable clustering algorithms, the main struggle is to eliminate every "extra" pass over the data set while the algorithm is running. Scalable analogs of k-means and EM (scalable k-means and scalable EM) have been developed, as well as scalable agglomerative methods (CURE, CACTUS). These modern algorithms require only a few (two to ten) scans of the database before producing the final clustering.
The development of scalable algorithms is based on the idea of abandoning a local optimization function. The pairwise comparisons in the k-means algorithm are nothing more than local optimization, since at each iteration the distance from the cluster centers to every object must be calculated. This leads to high computational costs.
When a global optimization function is used, adding a new point to a cluster does not require many calculations: the new value is computed from the old value, the new object, and the so-called cluster features. The specific cluster features depend on the particular algorithm. This is how BIRCH, LargeItem, CLOPE, and many other algorithms appeared.
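A sketch of such cluster features in the spirit of BIRCH is shown below: a cluster is summarized by the triple (number of points, per-attribute linear sum, sum of squares), and adding a new object or merging two sub-clusters only updates these aggregates, without revisiting the objects already absorbed.

```python
class ClusterFeature:
    """BIRCH-style summary of a cluster: (N, linear sum LS, square sum SS)."""
    def __init__(self, dim):
        self.n = 0
        self.ls = [0.0] * dim   # per-attribute linear sum
        self.ss = 0.0           # sum of squared values over all attributes

    def add(self, point):
        """Absorb one new object without revisiting earlier ones."""
        self.n += 1
        self.ls = [s + v for s, v in zip(self.ls, point)]
        self.ss += sum(v * v for v in point)

    def centroid(self):
        return [s / self.n for s in self.ls]

    def merge(self, other):
        """Two sub-clusters combine by simply adding their features."""
        self.n += other.n
        self.ls = [a + b for a, b in zip(self.ls, other.ls)]
        self.ss += other.ss

cf = ClusterFeature(dim=2)
for p in [[1.0, 2.0], [2.0, 2.0], [3.0, 4.0]]:   # hypothetical stream of objects
    cf.add(p)
print(cf.n, cf.centroid())   # 3 [2.0, 2.666...]
```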
Thus, there is no single universal clustering algorithm. When using any algorithm, it is important to understand its advantages and disadvantages, to take into account the nature of the data it works best with, and to consider its scalability.