Scalable CLOPE algorithm for clustering categorical data
Introduction and Main Concepts
Clustering large arrays of categorical data is very important for data analysis systems. Categorical data occur in every field: manufacturing, commerce, marketing, medicine, and so on. Categorical data also include so-called transactional data: supermarket receipts, logs of visits to web resources. The analysis and classification of text documents (Text Mining) falls into the same category.
Here and below, "categorical data" refers to qualitative attributes of objects measured on a nominal scale. Recall that a nominal scale only indicates whether objects are the same or different with respect to the measured feature.
Traditional clustering algorithms are inefficient, and often simply inapplicable, for objects with categorical features (for details, see the article "Clustering Algorithms in Data Mining"). The main difficulties stem from the high dimensionality and huge volume that often characterize such databases.
Algorithms based on pairwise distance calculation (k-means and its analogs) are effective mainly for numerical data, and their performance on arrays of records with many non-numeric attributes is unsatisfactory. The main obstacle is not even the difficulty of defining a metric over categorical attributes, but the fact that each iteration of the algorithm requires pairwise comparison of all objects, and there can be many iterations. Such algorithms are therefore inapplicable to tables with millions of records and thousands of fields.
Therefore, research continues on scalable clustering algorithms for categorical and transactional data, which must satisfy special requirements:
- minimum possible number of "scans" of the database table;
- operation with a limited amount of RAM;
- ability to interrupt the algorithm and save the intermediate results so that you can continue the calculations later;
- the algorithm should work even when objects can be retrieved from the database only in unidirectional cursor mode (i.e., in record navigation mode).
More than a dozen methods have been proposed for working with categorical data, for example, a family of hierarchical clustering algorithms; however, they do not always meet the requirements listed above. One of the most effective is the LargeItem algorithm, which optimizes a certain global criterion based on a support parameter (the terminology here has much in common with algorithms for generating association rules).
In general, a global criterion makes a clustering algorithm many times faster than a local criterion based on pairwise comparison of objects, so "globalization" of the evaluation function is one way to obtain scalable algorithms.
The CLOPE algorithm discussed in this article is very similar to LargeItem but faster and simpler to implement. CLOPE was proposed in 2002 by a group of Chinese scientists. At the same time, it provides higher performance and better clustering quality than LargeItem and many hierarchical algorithms.
We begin by describing the clustering problem for categorical data. Throughout, the presentation assumes a database of transactional data; at the end of the article we show how to use CLOPE to partition any categorical array into clusters by treating it as transactional data.
The term "transaction" here refers to some arbitrary set of objects, whether it is a list of keywords in an article, products purchased in a supermarket, a set of patient symptoms, outstanding image fragments, etc. The task of transactional data clustering is to split the entire set of transactions in such a way that similar transactions end up in the same cluster, and different transactions end up in different clusters.
The CLOPE clustering algorithm is based on maximizing a global cost function, which increases the proximity of transactions within clusters by increasing the height-to-width ratio of the cluster histograms. Let's consider a simple example of 5 transactions: \{ (a,b), (a,b,c), (a,c,d), (d,e), (d,e,f) \}. Suppose we want to compare the following two partitions into clusters:
\{ \{ ab, abc, acd \}, \{ de, def \} \}, (1)
\{ \{ ab, abc \}, \{ acd, de, def \} \}. (2)
For each partition, let's count the occurrences of every transaction item in each cluster and then compute the cluster's height H and width W. For example, the cluster \{ ab, abc, acd \} has occurrences a: 3, b: 2, c: 2, d: 1, with H=2 and W=4. To facilitate understanding, Figure 1 shows these results geometrically as histograms.
We can evaluate the quality of the two partitions by comparing their heights H and widths W. The clusters \{ de, def \} and \{ ab, abc \} have identical histograms, so they are equivalent. The histogram of the cluster \{ ab, abc, acd \} contains 4 different items and has an area of 8 blocks (H=2.0, H/W=0.5), while the cluster \{ acd, de, def \} contains 5 different items with the same area of 8 blocks (H=1.6, H/W=0.32). Obviously, partition (1) is better, since it provides greater overlap of transactions (and, accordingly, a higher H).
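To make this arithmetic easy to reproduce, here is a minimal Python sketch (ours, not from the CLOPE paper) that computes S, W, H, and the ratio H/W for the clusters of both partitions:

```python
from collections import Counter

def histogram_stats(cluster):
    # A cluster is a list of transactions, each transaction a set of items.
    occ = Counter(item for t in cluster for item in t)  # Occ(i, C)
    s = sum(occ.values())   # S: area (total number of item occurrences)
    w = len(occ)            # W: width (number of distinct items)
    h = s / w               # H: height
    return s, w, h, h / w

partition_1 = [[{"a","b"}, {"a","b","c"}, {"a","c","d"}],
               [{"d","e"}, {"d","e","f"}]]
partition_2 = [[{"a","b"}, {"a","b","c"}],
               [{"a","c","d"}, {"d","e"}, {"d","e","f"}]]

for cluster in partition_1 + partition_2:
    print(histogram_stats(cluster))
# {ab, abc, acd} -> (8, 4, 2.0, 0.5)
# {de, def}      -> (5, 3, 1.67, 0.56), identical to {ab, abc}
# {acd, de, def} -> (8, 5, 1.6, 0.32)
```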
The CLOPE algorithm (Clustering with sLOPE) is built on this obvious and simple idea of geometric histograms. Let's look at its formal description in more detail.
CLOPE algorithm
Let there be a database D of transactions \{ t_1, t_2, \dots, t_n \}, where each transaction is a subset of the set of objects \{ i_1, \dots, i_m \}. A set of clusters \{ C_1, \dots, C_k \} is a partition of the set \{ t_1, \dots, t_n \} such that C_1 \cup \dots \cup C_k = \{ t_1, \dots, t_n \} and C_i \neq \varnothing \wedge C_i \cap C_j = \varnothing for any 1 \le i \neq j \le k. Each C_i is called a cluster; n, m, and k denote the number of transactions, the number of distinct objects in the transaction database, and the number of clusters, respectively.
Each cluster C has the following characteristics:
- D(C) — the set of unique objects;
- Occ(i, C) — the number of occurrences (frequency) of object i in cluster C;
- S(C) = \sum_{i \in D(C)} Occ(i, C) = \sum_{t_i \in C} \mid t_i \mid — the area;
- W(C) = \mid D(C) \mid — the width;
- H(C) = S(C) / W(C) — the height.
The histogram of cluster C is a graphical representation of its current characteristics: the objects of the cluster are plotted along the horizontal axis in decreasing order of Occ(i, C), and Occ(i, C) is plotted along the vertical axis (Fig. 2).
In Figure 2, S(C)=8 corresponds to the area of the rectangle bounded by the coordinate axes and the dotted line. Obviously, the higher H, the more "similar" the transactions within the cluster; therefore, the algorithm should prefer partitions that maximize H.
However, taking the height H into account alone is not enough. Take a database of 2 transactions: \{ abc, def \}. They share no common objects, yet the partition \{ \{ abc, def \} \} and the partition \{ \{ abc \}, \{ def \} \} have the same height H=1, so both splitting options appear equivalent. But if we use the gradient G(C) = H(C)/W(C) = S(C)/W(C)^2 instead of H(C), the partition \{ \{ abc \}, \{ def \} \} becomes preferable (the gradient of each of its clusters is 1/3 versus 1/6 for the partition \{ \{ abc, def \} \}).
Summarizing the above, we can write a formula for the global criterion, the cost function Profit(C):
Profit (C) = \frac {\sum \limits_{i=1}^{k} G(C_i)\times\mid C_i\mid}{\sum \limits_{i=1}^{k} \mid C_i\mid}=\frac {\sum \limits_{i=1}^{k} \frac{S(C_i)}{W(C_i)^r}\times\mid C_i\mid}{\sum \limits_{i=1}^{k} \mid C_i\mid}
where:
- |C_i| — number of transactions in the i-th cluster;
- k — number of clusters
- r — positive real number greater than 1.
The parameter r, called the repulsion coefficient by the authors of CLOPE, regulates the level of similarity of transactions within a cluster and, as a result, the final number of clusters. This coefficient is selected by the user: the higher r, the more strongly transactions within a cluster are required to overlap, and the more clusters will be generated.
The formal statement of the clustering problem solved by the CLOPE algorithm is as follows: for given D and r, find a partition C such that Profit(C, r) → max.
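For illustration, the cost function can be sketched in Python under the same set-of-items representation of transactions as in the earlier example (the function name is ours):

```python
def profit(clusters, r):
    # Profit(C, r): sum of G(C_i) * |C_i| over all clusters, divided by
    # the total number of transactions; each cluster is a list of item sets.
    numerator, total = 0.0, 0
    for cluster in clusters:
        s = sum(len(t) for t in cluster)                # S(C_i)
        w = len({item for t in cluster for item in t})  # W(C_i)
        numerator += s / w ** r * len(cluster)          # G(C_i) * |C_i|
        total += len(cluster)
    return numerator / total

# With r = 2 the earlier partitions give:
# profit(partition_1, 2) ~= 0.52 and profit(partition_2, 2) ~= 0.41,
# so the criterion again prefers partition (1).
```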
Algorithm Implementation
Let's assume that the transactions are stored in a database table. The best partition is searched for during a sequential, iterative pass over the database records. Since the optimization criterion is global in nature and relies only on calculating H and W, the performance and speed of the algorithm are significantly higher than with pairwise comparison.
Implementing the algorithm requires a first pass through the transaction table to build the initial partition guided by the cost function Profit(C, r). After that, a small number (1-3) of additional table scans is required to improve the clustering quality and optimize the cost function. If nothing changes during the current pass through the table, the algorithm stops. The pseudocode looks as follows:
```
// Phase 1 — initialization
while not end of table do
begin
    read transaction [t, -] from the table;
    put t into the existing or a new cluster C_i that maximizes Profit(C, r);
    write [t, i] back to the table (the cluster number);
end;

// Phase 2 — iteration
repeat
    go to the beginning of the table;
    moved := false;
    while not end of table do
    begin
        read [t, i];
        put t into the existing or a new cluster C_j that maximizes Profit(C, r);
        if C_i <> C_j then
        begin
            write [t, j];
            moved := true;
        end;
    end;
until not moved;
delete all empty clusters;
```
As you can see, the CLOPE algorithm is scalable, since it can work within a limited amount of RAM. During operation, RAM holds only the current transaction and a small amount of information per cluster: the number of transactions N, the number of unique objects (the cluster width) W, a simple hash table for computing Occ(i, C), and the cluster area S. These are collectively called the cluster features (CF). For brevity, we denote them as properties of a cluster C; for example, C.Occ[i] is the number of occurrences of object i in cluster C. A quick estimate shows that storing the occurrence counts of 10 thousand objects across 1 thousand clusters takes about 40 MB of RAM.
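A minimal Python sketch of this per-cluster bookkeeping (the class name and layout are our assumptions, not the authors' code):

```python
from collections import defaultdict

class ClusterFeatures:
    # CF of one cluster: transaction count N, area S, width W, and the
    # occurrence table Occ(i, C), all updated incrementally.
    def __init__(self):
        self.n = 0                   # N: number of transactions
        self.s = 0                   # S: cluster area
        self.w = 0                   # W: cluster width, |D(C)|
        self.occ = defaultdict(int)  # Occ(i, C)

    def add(self, t):
        # Update the features when transaction t (a set of items) joins.
        for item in t:
            if self.occ[item] == 0:
                self.w += 1
            self.occ[item] += 1
        self.s += len(t)
        self.n += 1

    def remove(self, t):
        # Reverse of add(): update the features when t leaves the cluster.
        for item in t:
            self.occ[item] -= 1
            if self.occ[item] == 0:
                self.w -= 1
                del self.occ[item]
        self.s -= len(t)
        self.n -= 1
```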
To complete the implementation, we need two more functions that compute the increment of Profit(C, r) when a transaction is added to or removed from a cluster. This is easy to do knowing the S, W, and N values of each cluster:
```
function DeltaAdd(C, t, r): double;
begin
    S_new := C.S + t.ItemCount;
    W_new := C.W;
    for i := 0 to t.ItemCount - 1 do
        if C.Occ[t.Items[i]] = 0 then W_new := W_new + 1;
    result := S_new * (C.N + 1) / W_new^r - C.S * C.N / C.W^r;
end;
```
Here t.Items[i] is the i-th object of transaction t. Note that when t is added to a new (empty) cluster, DeltaAdd(C, t, r) equals S/W^r, where S and W are the area and width of the cluster consisting of the single transaction t being added.
The increment function for deleting a transaction is implemented similarly to DeltaAdd(C, t, r), so we omit its detailed code.
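Both increment functions can also be sketched in Python on top of the ClusterFeatures structure above; delta_add mirrors the pseudocode, while the body of delta_remove is not given in the article and is our derivation by symmetry:

```python
def delta_add(cf, t, r):
    # Gain in the cluster's contribution S*N/W^r if transaction t is added.
    # For an empty cluster this reduces to S/W^r, as noted above.
    s_new = cf.s + len(t)
    w_new = cf.w + sum(1 for item in t if cf.occ.get(item, 0) == 0)
    old = cf.s * cf.n / cf.w ** r if cf.n else 0.0
    return s_new * (cf.n + 1) / w_new ** r - old

def delta_remove(cf, t, r):
    # Symmetric loss in the contribution if transaction t is removed.
    s_new = cf.s - len(t)
    w_new = cf.w - sum(1 for item in t if cf.occ.get(item, 0) == 1)
    rest = s_new * (cf.n - 1) / w_new ** r if cf.n > 1 else 0.0
    return cf.s * cf.n / cf.w ** r - rest
```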
The following theorem guarantees that using DeltaAdd is correct.
Theorem. If DeltaAdd(C_i, t, r) is maximal over all clusters, then moving t to cluster C_i maximizes Profit(C, r).
Now we can estimate the computational complexity of the CLOPE algorithm. Let the average transaction length be A, the total number of transactions N, and the maximum possible number of clusters K. The time complexity of one iteration is O(N × K × A), so the running time grows linearly with the number of clusters and the size of the table. This makes the algorithm fast and efficient on large volumes of data.
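Putting the pieces together, here is a compact end-to-end sketch of both phases that reuses ClusterFeatures, delta_add, and delta_remove from the sketches above (an illustrative in-memory version; the original algorithm works against a database table):

```python
def clope(transactions, r, max_scans=10):
    # transactions: list of item sets; returns a cluster label per transaction.
    clusters, labels = [], [None] * len(transactions)

    def best_cluster(t):
        # Existing clusters plus one new empty cluster; by the theorem above,
        # the index maximizing delta_add also maximizes Profit(C, r).
        candidates = clusters + [ClusterFeatures()]
        return max(range(len(candidates)),
                   key=lambda j: delta_add(candidates[j], t, r))

    # Phase 1 - initialization: place each transaction greedily in one scan.
    for idx, t in enumerate(transactions):
        j = best_cluster(t)
        if j == len(clusters):
            clusters.append(ClusterFeatures())
        clusters[j].add(t)
        labels[idx] = j

    # Phase 2 - iteration: keep moving transactions while profit improves.
    for _ in range(max_scans):
        moved = False
        for idx, t in enumerate(transactions):
            i = labels[idx]
            clusters[i].remove(t)   # take t out, then re-place it
            j = best_cluster(t)
            if j == len(clusters):
                clusters.append(ClusterFeatures())
            clusters[j].add(t)
            labels[idx] = j
            moved = moved or (j != i)
        if not moved:
            break
    return labels
```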
Having discussed the implementation, we have said nothing yet about the kind of transaction table to which CLOPE can be applied. CLOPE solves clustering problems not only for transactional data but for any categorical data; the main requirement is that all attributes of the objects are measured on a nominal scale.
However, before running CLOPE, the data must be normalized. As in association rule algorithms, the data can be represented either as a binary matrix or via a one-to-one mapping between the set of unique objects \{ u_1, \dots, u_q \} of the table and the set of integers \{ 0, 1, 2, \dots, q-1 \}.
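A hypothetical Python sketch of this normalization: every attribute=value pair becomes one unique integer item, and missing values (denoted "?" in the mushroom data below) are skipped:

```python
def normalize(rows):
    # Map each (column, value) pair to an integer item 0..q-1 and turn
    # every record into a transaction (a set of integer items).
    item_ids, transactions = {}, []
    for row in rows:
        t = set()
        for col, value in enumerate(row):
            if value == "?":     # skip missing values
                continue
            t.add(item_ids.setdefault((col, value), len(item_ids)))
        transactions.append(t)
    return transactions, item_ids

# Two mushroom-like records with three attributes each:
rows = [["x", "s", "n"], ["x", "y", "?"]]
transactions, item_ids = normalize(rows)  # -> [{0, 1, 2}, {0, 3}]
```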
The Mushroom Dataset
The mushroom data set is a popular benchmark for evaluating algorithms that cluster categorical data (available from the UCI Machine Learning Repository). The sample contains 8124 instances described by 22 attributes of mushrooms of two classes: 4208 edible (e) and 3916 poisonous (p) mushrooms. The sample file looks as follows:
{p, x, s, n, t, p, f, c, n, k, e, e, s, s, w, w, p, w, o, p, k, s, u}
{e, x, s, y, t, a, f, c, b, k, e, c, s, s, w, w, p, w, o, p, n, n, g}
{e, b, s, w, t, l, f, c, b, n, e, c, s, s, w, w, p, w, o, p, n, n, m}
{p, x, y, w, t, p, f, c, n, n, e, e, s, s, w, w, p, w, o, p, k, s, u}
{e, x, s, g, f, n, f, w, b, k, t, e, s, s, w, w, p, w, o, e, n, a, g}
.........
The total number of unique attribute values is 116, and 2480 records have a missing value in one of the attributes. Data set description: https://archive.ics.uci.edu/ml/datasets/mushroom.
If this data set is converted to the normalized form described above, we get 8124 transactions, of which 2480 are 21 items long and the rest are 22 (the missing values are ignored). Now the CLOPE algorithm can be applied. Its result at r=2.6 for the mushroom problem after the 1st iteration (initialization phase) is shown in Table 1. The quality criterion is the number of "dirty" clusters, i.e., those containing both edible (e) and poisonous (p) mushrooms.
The fewer and smaller such clusters, the better. Table 1 shows that after the 1st iteration only one "dirty" cluster (#18) remained. A couple more scans of the database are needed to obtain the final clustering; it is clear, for example, that cluster 12 will disappear.
A detailed study of the CLOPE algorithm, conducted by its developers, showed high clustering quality compared to other algorithms, including hierarchical ones, while outperforming them several times in speed and performance.
CLUSTER | e | p |
---|---|---|
1 | 256 | |
2 | 512 | |
3 | 768 | |
4 | 96 | |
5 | 96 | |
6 | 192 | |
7 | 1296 | |
8 | 432 | |
9 | 149 | |
10 | 192 | |
11 | 1146 | |
12 | 1 | |
13 | 288 | |
14 | 192 | |
15 | 223 | |
16 | 48 | |
17 | 72 | |
18 | 48 | 32 |
19 | 8 | |
20 | 8 | |
21 | 1497 | |
22 | 192 | |
23 | 288 | |
24 | 32 | |
25 | 36 | |
26 | 8 | |
27 | 16 | |
Total | 4208 | 3916 |
Table 1: CLOPE clustering of the mushroom data set after the first pass (r = 2.6)
CLOPE Application Areas
CLOPE is designed to work with transactional data, but as we have seen, many data sets with categorical attributes either are transactional or can be reduced to transactional form. Questionnaire answers, the list of keywords in a document, a user's most visited web resources, a patient's symptoms, and the attributes of mushrooms are all nothing more than transactions. Therefore, the scope of CLOPE covers practically any array of categorical data.
In general, clustering of transactional data has much in common with association analysis: both data mining technologies reveal hidden dependencies in data sets. Still, there are differences. On the one hand, clustering gives a general view of a data set, while association analysis finds specific dependencies between attributes. On the other hand, association rules can be used immediately, whereas clustering most often serves as the first stage of analysis.
In conclusion, we emphasize the advantages of the CLOPE algorithm:
- High scalability, speed, and clustering quality, achieved by using a global optimization criterion based on maximizing the gradient of the cluster histogram height. This criterion is easy to compute and interpret. During operation, the algorithm stores only a small amount of information per cluster in RAM and requires a minimum number of scans of the data set, which allows it to cluster very large volumes of categorical data;
- CLOPE selects the number of clusters automatically; it is controlled by a single parameter, the repulsion coefficient.