## Clusters, Concepts, and Pseudometrics

### Michael D. Rice

Department of Mathematics & Computer Science

Wesleyan University

Middletown, CT 06459

*mrice@wesleyan.edu*

*http://mrice.web.wesleyan.edu*
### Michael Siff

Faculty of Computer Science

Sarah Lawrence College

Bronxville, NY 10708

*msiff@slc.edu*
The fields of cluster analysis and concept analysis are both used to identify patterns in data. Concept analysis identifies similarities between sets of objects based on their attributes. Cluster analysis groups objects with related characteristics based on some notion of distance.

In this paper, we investigate connections between these two approaches. In particular, for each binary relation defined on a set of objects *O* and attributes *A*, we define distance functions r (on the power set of *O*) and d (on *O*).

We prove that r and d are pseudometrics and use them to

- specify a
*clustering algorithm* that computes a subset of the concept lattice

- discover new
*interpretations* of basic notions in concept analysis.

In particular, we characterize concepts in terms of r and characterize a family of concept lattices based on all subsets with a fixed cardinality bound in terms of d.

Our clustering algorithm differs from the classical algorithms since, first, the values of r, not d, determine which pairs of sets are combined at each level, and second, the clusters defined at each level in the algorithm are generally anti-chains and may not be partitions. Therefore, the analysis of the algorithm depends on the metric-geometry of r and is more involved than the analysis of the classical algorithms.

We have developed a software environment that permits the execution of the algorithm on finite relations and the storage and analysis of the resulting clusters. The algorithm has been run on relations generated from a variety of sources ranging from medical research to sporting events. Our results indicate that the number of iterations of the clustering algorithm is *linear* in the size of *O* and produces a *linear* number of clusters that are concepts. Hence the algorithm can be a useful tool for computing a tractable subset of a very large concept lattice.