The clustering-concept algorithm (CCA) described in the paper "Clusters, Concepts, and Pseudometrics" has been implemented in a Windows based system that includes a user interface with a variety of execution and display options, a database facility for storing and analyzing clusters, and a facility for conducting statistical trials on randomly generated contexts.
The implementation of the algorithm can be further optimized, but in spite of this fact, relatively large contexts (on the order of 200 x 400) can be processed in a few minutes on even a 200 MHz Pentium machine. In particular, the algorithm has been run on contexts generated from a variety of sources ranging from medical research to sporting events. Overall, these experiments indicate that the tool is useful for computing a tractable subset of potentially very large concept lattices.
Example 1 shows the results of running CCA on cancer data.
Example 2 shows the results of running CCA on randomly generated relations.
Example 3 shows the results of running CCA on football pool data.
Example 1
The CCA was run on a subset of the Wisconsin breast cancer data (ftp.cs.wisc.edu/math-prog/cpo-dataset/machine-learn/cancer1/) consisting of 200 rows and 390 columns of 0's and 1's (| O | = 200, | A | = 390). Each row represents one tumor and various sets consisting of 39 columns represent the discretization of 10 distinct attributes of the tumor such as mass and diameter. The rows represent equal percentages of benign and malignant tumors.
Running the algorithm on this context required 191 iterations and generated 389 extents. The algorithm found one extent of size 90 that is 88% malignant, but it failed to find any cluster of significant size that was over 50% benign. Running the algorithm on the 200 x 780 complemented extension of the context required 151 iterations and generated 409 extents. The algorithm found one extent of size 81 that is 85% malignant and one extent of size 95 that is 88% benign.
The following chart shows the distribution of ties for the complemented extension: on 86% of the iterations, there was no tie, and on 94% of the iterations there was at most one tie.
Number of iterations |
129 |
12 |
3 |
1 |
2 |
1 |
2 |
Number of ties |
0 |
1 |
2 |
3 |
4 |
6 |
13 |
Percent of iterations |
86 |
8 |
2 |
0.67 |
1.34 |
0.67 |
1.34 |
Example 2
The following tables are based on running CCA on 100 randomly generated contexts where the probability of a 1 in a row-column entry is 0.5. In Tables II(a) and II(b), the algorithm was run on the complemented extension of randomly generated contexts with one-half the number of attributes listed for the number of attributes.
For each row in Table I(a) and II(a), the average number of iterations on a given run is less than the number of objects and the average number of clusters per trial is approximately 2 times the number of objects.
A comparison of rows 1 and 2, and 2 and 3, respectively, in Table I(a), and rows 2 and 3 in Table II(a), shows that as the number of attributes increases, the average number of ties per iteration on a given run significantly decreases.
Table I(b) (Table II(b)) shows that for over 94% (84%) of the iterations on contexts of any size, the number of ties is at most 2. Furthermore, the two tables show that for over 99% of the iterations on contexts of any size, the number of labels is less than the number of vertices in the graph.
I(a) |
objects |
attributes |
average iterations |
max iterations |
average clusters |
max clusters |
average ties/iteration |
max ties/iteration |
1 |
25 |
25 |
20.09 |
23 |
49.97 |
55 |
0.56 |
1.20 |
2 |
25 |
50 |
21.96 |
24 |
48.65 |
51 |
0.31 |
0.79 |
3 |
50 |
50 |
40.23 |
47 |
98.98 |
105 |
0.52 |
0.89 |
4 |
50 |
100 |
44.02 |
48 |
96.92 |
102 |
0.35 |
0.70 |
5 |
100 |
100 |
81.71 |
90 |
196.05 |
201 |
0.56 |
0.88 |
I(b) |
objects |
attributes |
ties = 0 |
ties = 1 |
ties = 2 |
ties > 2 |
l_{k} - v_{k} < 0 |
l_{k} - v_{k} = 0 |
l_{k} - v_{k} > 0 |
1 |
25 |
25 |
71.50 |
16.82 |
6.44 |
5.24 |
99.74 |
0.26 |
0.00 |
2 |
25 |
50 |
83.83 |
10.11 |
3.44 |
2.62 |
100.0 |
0.00 |
0.00 |
3 |
50 |
50 |
78.49 |
14.70 |
2.78 |
4.03 |
99.95 |
0.05 |
0.00 |
4 |
50 |
100 |
89.40 |
7.51 |
0.63 |
2.46 |
99.98 |
0.02 |
0.00 |
5 |
100 |
100 |
83.11 |
12.42 |
2.46 |
2.01 |
100.0 |
0.00 |
0.00 |
II(a) |
objects |
attributes |
average iterations |
max iterations |
average clusters |
max clusters |
average ties/iteration |
max ties/iteration |
1 |
25 |
50 |
18.85 |
22 |
53.39 |
64 |
0.82 |
1.69 |
2 |
50 |
50 |
31.08 |
37 |
109.63 |
128 |
1.34 |
2.24 |
3 |
50 |
100 |
33.76 |
40 |
105.16 |
118 |
0.99 |
1.76 |
4 |
100 |
100 |
57.78 |
67 |
215.03 |
230 |
1.54 |
2.09 |
II(b) |
objects |
attributes |
ties = 0 |
ties = 1 |
ties = 2 |
ties > 2 |
l_{k} - v_{k} < 0 |
l_{k} - v_{k} = 0 |
l_{k} - v_{k} > 0 |
1 |
25 |
50 |
66.27 |
13.62 |
8.85 |
11.26 |
99.27 |
0.62 |
0.11 |
2 |
50 |
50 |
64.66 |
14.53 |
4.89 |
15.92 |
99.60 |
0.30 |
0.10 |
3 |
50 |
100 |
71.28 |
11.05 |
4.97 |
12.70 |
99.69 |
0.31 |
0.00 |
4 |
100 |
100 |
71.12 |
12.48 |
4.21 |
12.19 |
99.84 |
0.16 |
0.00 |
Example 3
CCA was run on football data consisting of 31 rows and 866 columns of 0's and 1's. Each row represents one of the teams listed below and each of the first 433 columns represents whether a given individual selected the team in the given row to win (1) or lose (0) on a given week. The remaining 433 columns are the complements of the first 433 columns with 0's and 1's interchanged.
List of Teams
1-Arizona, 2-Atlanta, 3-Baltimore, 4-Buffalo, 5-Carolina, 6-Chicago, 7-Cincinnati, 8-Cleveland, 9-Dallas, 10-Denver, 11-Detroit, 12-Green_Bay, 13-Indianapolis, 14-Jacksonville, 15-Kansas_City, 16-Miami, 17-Minnesota, 18-NY_Giants, 19-NY_Jets, 20-New_England, 21-New_Orleans, 22-Oakland, 23-Philadelphia, 24-Pittsburgh, 25-San_Diego, 26-San_Francisco, 27-Seattle, 28-St_Louis, 29-Tampa_Bay, 30-Tennessee, 31-Washington
The algorithm required K = 31 iterations and there was a unique nearest pair on each iteration: l_{k} = e_{k} = 1 and v_{k} = 2 for each k < K.
The number of concepts found was N = 61; these are listed below with the superscripts indicating the iteration on which the set first appeared:
{{k}: 0 < k < 32}^{1}, {7, 8}^{2}, {7, 8, 23}^{3}, {12, 17}^{4}, {10, 26}^{5}, {19, 22}^{6}, {7, 8, 23, 28}^{7}, {11, 21}^{8}, {6, 25}^{9}, {13, 31}^{10}, {10, 12, 17, 26}^{11}, {5, 27}^{12}, {9, 24}^{13}, {16, 30}^{14}, {1, 4}^{15}, {7, 8, 13, 23, 28, 31}^{16}, {20, 29}^{17}, {3, 18}^{18}, {6, 11, 21, 25}^{19}, {2, 14}^{20}, {5, 15, 27}^{21}, {2, 9, 14, 24}^{22}, {1, 4, 10, 12, 17, 26}^{23}, {3, 6, 11, 18, 21, 25}^{24}, {16, 19, 22, 30}^{25}, {1, 2, 4, 9, 10, 12, 14, 17, 24, 26}^{26}, {5, 15, 20, 27, 29}^{27}, {3, 5, 6, 11, 15, 18, 20, 21, 25, 27, 29}^{28}, {7, 8, 13, 16, 19, 22, 23, 28, 30, 31}^{29}, {1-6, 9-12, 14, 15, 17, 18, 20, 21, 24-27, 29}^{30}, {1-31}^{31}
The various sizes and numbers of clusters is summarized below:
Size / Number |
1 / 31 |
2 / 14 | 3 / 2 | 4 / 5 |
5 / 1 | 6 / 3 | 10 / 2 | 11 / 1 | 21 / 1 |
31 / 1 |
The following three clusters of size four were significant:
Good teams: {10, 12, 17, 26} [Denver, Green_Bay, Minnesota, San_Francisco]
Interesting teams: {2, 9, 14, 24} [Atlanta, Dallas, Jacksonville, San_Diego]
Weak teams: {7, 8, 23, 28} [Cincinnati, Cleveland, Philadelphia, St_Louis]
and the three larger clusters of sizes 10, 10, and 11 partitioned the teams into the following disjoint groups:
{1, 2, 4, 9, 10, 12, 14, 17, 24, 26} -- contains the good and interesting teams
{7, 8, 13, 16, 19, 22, 23, 28, 30, 31} -- contains the weak teams
{3, 5, 6, 11, 15, 18, 20, 21, 25, 27, 29}