Clusters, Concepts, and Pseudometrics

Michael D. Rice

Department of Mathematics & Computer Science
Wesleyan University
Middletown, CT 06459
mrice@wesleyan.edu
http://mrice.web.wesleyan.edu

Michael Siff

Faculty of Computer Science
Sarah Lawrence College
Bronxville, NY 10708
msiff@slc.edu

The fields of cluster analysis and concept analysis are both used to identify patterns in data. Concept analysis identifies similarities between sets of objects based on their attributes. Cluster analysis groups objects with related characteristics based on some notion of distance.

In this paper, we investigate connections between these two approaches. In particular, for each binary relation defined on a set of objects O and attributes A, we define distance functions r (on the power set of O) and d (on O).

We prove that r and d are pseudometrics and use them to

In particular, we characterize concepts in terms of r and characterize a family of concept lattices based on all subsets with a fixed cardinality bound in terms of d.

Our clustering algorithm differs from the classical algorithms since, first, the values of r, not d, determine which pairs of sets are combined at each level, and second, the clusters defined at each level in the algorithm are generally anti-chains and may not be partitions. Therefore, the analysis of the algorithm depends on the metric-geometry of r and is more involved than the analysis of the classical algorithms.

We have developed a software environment that permits the execution of the algorithm on finite relations and the storage and analysis of the resulting clusters. The algorithm has been run on relations generated from a variety of sources ranging from medical research to sporting events. Our results indicate that the number of iterations of the clustering algorithm is linear in the size of O and produces a linear number of clusters that are concepts. Hence the algorithm can be a useful tool for computing a tractable subset of a very large concept lattice.