Gene Expression Clustering Analysis

BIOL 265 / COMP 113 Computer Laboratory

M. Weir / M. Rice

Large-scale gene expression data can be obtained from microarray or deep sequencing experiments.  In today’s lab we will explore data from microarrays.  Microarray data can provide important information regarding regulatory relationships between genes. If genes are regulated equivalently (e.g. similar cis regulation), then their expression profiles in different conditions (experiments) can correlate. Various methods for clustering analysis can reveal these correlations. These methods will be explored in this lab session.

Please be sure to record your results as you proceed through this lab - an effective approach is to record screen images and paste them into Microsoft Word, carefully annotating how results were obtained (which application, clustering method, genes, filters, transformations).  At the end of the lab, you will then be able to review your records and evaluate the approaches you have tried. Since microarray and deep sequencing data analyses are new fields, it is particularly important for us to critically assess different methods for analyzing these data.

1. Affymetrix Data

Go to the Integrative Genomic Sciences (IGS) home page and select IGS Databases -> Access IGS Microarray & Slide Database. Select one of the Affymetrix microchip data sets (e.g. the Tamayo et al. (1999a) data) and click the Access Data Set button. [We will consider slide data later.] Select a portion of the data set (e.g. expression values for the first 500 genes of the Tamayo (1999a) data for all 4 experiments): enter the appropriate gene range and click "Add gene range," then enter the experiment range values and click "Add experiment range" [data input image].  [Also, in the middle of this page, open the "Click here to see percentiles..." link (in the "Expression Levels (percentiles)" box [previous run]) in a new browser window for us to look at later.]  Next, click the Continue button at the bottom of the page. Once your chosen data set is displayed [previous image], click the Format Data button -- this will give you the option of outputting the data set in your format of choice. Click the Genesis button. This generates a tab-delimited text file which includes the gene identifier names in the first column of each line (row). Save the file in a convenient directory. [NOTE: in some browsers, before saving the file, it may be necessary to switch to "view source" using the right mouse button, and then save the file with a systematic name; the file must be a .txt file to be recognized by Genesis.]  [data input image, filters input, output from previous run]

2. Clustering Algorithms

Clustering methods use a distance measure (e.g. the Euclidean metric: the square root of the sum of squared differences) to compare the expression values of pairs of genes across experiments. When the distance between a pair of genes is small, the two genes might be clustered together. We will use the Genesis application to cluster the raw data, starting with the K-means algorithm. Open the Genesis program and import your data set. The option "Expression Images" allows you to visualize the data on a red (high expression) to green (low expression) scale. Each row represents a different gene, and each column a different experiment (microarray). Before running a clustering algorithm, use the "Distance" pull-down menu to choose a distance measure (e.g. Euclidean = sqrt(sum of squared differences)). Then choose a clustering algorithm (e.g. the K-means option from the "Analysis" pull-down menu) and select the number of clusters (K) into which you wish to partition your genes (e.g. 20). Choosing an appropriate number is an important issue.
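To make the distance calculation and the K-means procedure concrete, here is a minimal Python sketch. It is not the Genesis implementation: the function names, the random choice of starting centroids, the fixed iteration count and the toy expression values are all assumptions made for illustration.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance: square root of the sum of squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(profiles, k, iterations=50, seed=0):
    """Partition expression profiles (one list per gene) into k clusters."""
    rng = random.Random(seed)
    # Assumption: start from k randomly chosen gene profiles.
    centroids = [list(p) for p in rng.sample(profiles, k)]
    assignment = [0] * len(profiles)
    for _ in range(iterations):
        # Assign each gene to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: euclidean(p, centroids[c]))
            for p in profiles
        ]
        # Recompute each centroid as the mean profile of its member genes.
        for c in range(k):
            members = [p for p, a in zip(profiles, assignment) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids


# Toy data: 4 genes x 4 experiments (made-up values, not from any data set).
genes = [
    [100, 200, 300, 400],
    [110, 210, 290, 410],
    [400, 300, 200, 100],
    [390, 310, 210, 90],
]
labels, centers = kmeans(genes, k=2)
print(euclidean(genes[0], genes[1]))  # → 20.0
print(labels)  # genes 0 and 1 share one label, genes 2 and 3 the other
```

The centroids returned here correspond to the mean profiles that Genesis displays under "Centroid views".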

Your clustering results can be viewed in several ways. Take a look at the (mean) expression profiles of your clusters by looking at "Centroid views" -- first choose the "All clusters" option to see the profiles of all clusters with the numbers of genes in each cluster. Note that in order to see the profiles, you may need to "adjust to maximum" under the "View" drop-down menu in Genesis. To obtain a better indication of how the individual genes contribute to the clusters, choose "Expression views" to display the expression profiles of all genes in a cluster, with the centroid highlighted. To save files listing the genes (and their expression values) in individual clusters, use mouse right click "save cluster" (or click "save all clusters" to save a file for each cluster). You can also use the mouse right click to save images of the cluster profiles. [You might also save the screen image for the all-clusters image of centroid views.]

Notice that many of the centroid profiles are rather flat, and they often are not particularly representative of the profiles of the individual genes in the cluster. This indicates that we need to consider preprocessing the data set before running clustering algorithms. For example, when we preprocess the first 500 genes in the same way as Tamayo et al. (discussed below), the following profiles result from K-means clustering (K = 10).

[Figure: centroid profiles from K-means clustering (K = 10) of the first 500 Tamayo et al. genes after preprocessing; image file tamayoKmeans.png]

3. Preprocessing Data

In the middle of the Tamayo data set page (the page with filtering and transformation options), we selected earlier the "Click here to see percentiles..." link (in the "Expression Levels (percentiles)" box [previous run]). The entire data set of expression values from the microarray experiments is divided into 20 bins each representing 5 percent increments.

Notice that a large percentage of the data set values are negative. Processing of the microarrays includes estimating non-specific background signal, which is then subtracted from all expression values. The Affymetrix algorithms for calculating gene expression values compare the signals obtained with perfect-match and one-base-mismatch hybridization oligonucleotides on the microarrays (oligonucleotide "probe pairs"). Because the one-base-mismatch oligonucleotides can sometimes hybridize to other mRNAs, they do not always give a good representation of the non-specific background signal. Hence, apparent negative expression values for genes can result. Noise in the data can also produce negative values.

On Affymetrix microarrays, the expression of each gene is measured using several different spots (the oligonucleotide probe pairs) -- each probe pair corresponds to a different region of the gene or mRNA sequence. The "gene calls" of present (P) or absent (A) depend upon whether the signals for the different oligos are internally consistent. Also, expression values considered too low to be measurable are given a call of "A". Indeed, notice that many of the low expression values, and (virtually) all the negative values, are scored as "A". To assess this, scroll down the data set in the "Query Results" window (each column number represents a different microarray experiment -- in the case of Tamayo et al. (1999a), experiment 1, 2, 3 or 4).

For purposes of clustering analysis, we may wish to filter out all genes that do not have at least one "P" call in some experiment.

We might also wish to transform all expression values below some threshold cut-off value to that threshold value; e.g., Tamayo et al. transform all expression values less than 20 to 20 (round-up threshold). This technique removes all the negative values – which will be useful later.
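The two operations just described (keep only genes with at least one "P" call, then round low values up to a threshold) can be sketched in Python as follows. The gene names, values and calls are invented for illustration, and the helper names are hypothetical; the IGS database applies these steps through its own filter page.

```python
def has_present_call(calls):
    """True if the gene is called present ("P") in at least one experiment."""
    return "P" in calls


def round_up(values, threshold=20):
    """Replace every value below the threshold with the threshold itself."""
    return [max(v, threshold) for v in values]


# Made-up example: expression values and P/A calls for 4 experiments.
genes = {
    "gene_A": ([-12, 5, 150, 310], ["A", "A", "P", "P"]),
    "gene_B": ([-30, -8, 3, 11], ["A", "A", "A", "A"]),
}

filtered = {
    name: round_up(values)
    for name, (values, calls) in genes.items()
    if has_present_call(calls)
}
print(filtered)  # → {'gene_A': [20, 20, 150, 310]}
```

After this step no value is negative or zero, which is what later makes a log transformation possible.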

Starting again with the Tamayo data set, apply the preceding two operations to the data set. [It is best to press "Reset Page" at the bottom of the Filter and Transformations page so that you keep track of exactly which operations are activated.] Repeat K-means clustering in Genesis (with the same value of K). As before, in order to see the profiles, you may need to "adjust to maximum" under the "View" drop-down menu in Genesis. Has this filtering and transformation of the starting data affected the distribution of cluster expression (centroid) profiles? [Save a screen image of the cluster centroid profiles.] [input, output from previous run]

We are still not seeing many "interesting" profile patterns. Our problem is that the profiles of each gene are not being measured or compared effectively by our "distance measure" calculation in the clustering.

If we subtract the mean expression value for each gene from each of its expression values, then all the genes will be varying about the same mean (zero).

Apply this transformation and see how this improves the clustering results -- do you now see a more "interesting" range of profiles in different clusters? [input, output from previous run]
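As a small illustration of why subtracting each gene's mean helps, the Python sketch below centers two made-up genes that have the same profile shape but very different absolute levels. The function name and the values are assumptions for this example only.

```python
def mean_center(values):
    """Subtract the gene's mean from each of its expression values."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


# Two invented genes with the same shape at different expression levels.
high = [1000, 1100, 1200, 1100]
low = [100, 200, 300, 200]

print(mean_center(high))  # → [-100.0, 0.0, 100.0, 0.0]
print(mean_center(low))   # → [-100.0, 0.0, 100.0, 0.0]
```

On the raw values these two genes are far apart in Euclidean distance, but after centering their profiles coincide, so a distance-based method can group them.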

We still face a problem however. Genes whose raw expression values are high may tend to show high expression changes, whereas genes with lower (but measurable) expression values might show much smaller (but nevertheless meaningful) expression changes. We can try different approaches to address this problem. Here are three possibilities: [For each approach below, save screen images for "All-Cluster" centroid profiles.]

(a) Apply a log transformation to the expression values of each gene.

(b) Use the mean and standard deviation of the expression values for each gene to normalize each expression value.

(c) Filter out genes whose expression values do not vary significantly, e.g. genes with approximately the same maximum and minimum expression values.


Try (a), (b) and (c), or if time is limited, go straight to (c).

(a) On a log scale, a given fold change produces the same shift regardless of whether the gene has large or small expression values; for example, a doubling is always an increase of 1 on a log2 scale. Compute the log of each expression value, subtract the log of the mean for each gene, and repeat the clustering on the same data set. Since you are taking logs, you will first need to use a threshold transformation (e.g. round up to a minimum gene expression value of 20) to avoid problems with non-positive expression values. [Before this transformation, you may also wish to filter out genes whose expression values never exceed the threshold of 20.]

(b) Normalize each expression value by subtracting the mean from each expression value and then dividing the result by the standard deviation. This computes the z-score for each expression value. The standard deviation for each gene’s expression values will now be 1. You may first want to filter out genes with low standard deviations that are close to the noise in the system (experiment with different standard deviation cut-offs). As in (a), you may also wish to filter out genes whose expression values never exceed a certain threshold.

(c) A drawback of using a standard deviation cut-off is that rare larger changes may be missed due to the averaging effect of the standard deviation calculation. An alternative, used e.g. by Tamayo et al. (1999) and Coller et al. (2000), is to apply two filters: remove genes whose maximum and minimum expression values differ by less than 100, and remove genes whose fold-difference (ratio of maximum to minimum) is less than 3. Try applying these two filters, specifying a minimum expression threshold (e.g. 20) such that expression values below this threshold are not considered in the range or fold calculations. After using these filters, compute the z-scores (see (b)). Before applying the z-score transformation, expression values should be rounded up to the minimum expression threshold, e.g. 20, used in your range filters. [input, output from previous run]
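Approach (c) can be sketched as a three-step pipeline in Python: round up to the threshold, apply the range and fold filters, then compute z-scores. This is an illustration under stated assumptions, not the IGS database code: a gene is kept only if it passes both filters, the cut-offs are treated as inclusive (at least 100, at least 3-fold), the z-score uses the population standard deviation, and the gene names and values are invented.

```python
import math


def round_up(values, threshold=20):
    """Raise every value below the threshold to the threshold."""
    return [max(v, threshold) for v in values]


def passes_variation_filter(values, min_range=100, min_fold=3):
    """Keep a gene only if max - min and max / min are both large enough.

    Assumes values were already rounded up, so min(values) > 0.
    """
    hi, lo = max(values), min(values)
    return (hi - lo) >= min_range and (hi / lo) >= min_fold


def z_scores(values):
    """(value - mean) / standard deviation (population SD, an assumption)."""
    mean = sum(values) / len(values)
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / sd for v in values]  # sd > 0 for genes that passed


# Invented expression values for 4 experiments.
raw = {
    "gene_A": [-5, 40, 160, 400],    # range 380, 20-fold after rounding: kept
    "gene_B": [500, 520, 560, 610],  # range 110 but only 1.22-fold: dropped
    "gene_C": [10, 15, 30, 55],      # range 35 after rounding: dropped
}

processed = {}
for name, values in raw.items():
    rounded = round_up(values)
    if passes_variation_filter(rounded):
        processed[name] = z_scores(rounded)

print(list(processed))  # → ['gene_A']
```

Note that gene_B has a high expression level and a range above 100, yet it is removed because its fold change is small; this is the dependence on absolute expression level that distinguishes (c) from (b).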

How do the above approaches affect the results of clustering?

What is the difference between these approaches? How does your answer relate to the biology of the system? In (a), we focus on fold differences, but in (b) and (c), we focus on the profile of the changes, and include all profiles with changes above some threshold measure. In (b), this threshold measure depends on the standard deviation of each gene, and is not dependent on whether the gene has a high or low mean expression level. In (c) the threshold measure does depend upon the absolute expression levels (fold measure constraint) as well as the absolute size of the range.
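The contrast between (a) and (b) can be seen on two invented genes, one changing 1.2-fold and one changing 4-fold. The Python sketch below is only an illustration: the function names are hypothetical, log base 2 is chosen to match the "Log2(value) - Log2(mean)" option, and the z-score uses the population standard deviation as an assumption.

```python
import math


def log_centered(values):
    """Approach (a): log2 of each value minus log2 of the gene's mean."""
    mean = sum(values) / len(values)
    return [math.log2(v) - math.log2(mean) for v in values]


def z_scores(values):
    """Approach (b): (value - mean) / population standard deviation."""
    mean = sum(values) / len(values)
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / sd for v in values]


def spread(values):
    """Difference between the largest and smallest transformed value."""
    return max(values) - min(values)


small_change = [100, 110, 120]  # 1.2-fold change (made-up values)
large_change = [100, 200, 400]  # 4-fold change (made-up values)

for name, gene in [("small", small_change), ("large", large_change)]:
    print(name,
          "log spread:", round(spread(log_centered(gene)), 2),
          "z spread:", round(spread(z_scores(gene)), 2))
```

The log-transformed spread equals log2 of the fold change (about 0.26 versus 2), so (a) preserves the size of the change. The z-score spreads come out nearly equal (about 2.4 for both genes), so (b) keeps only the shape of the profile, which is why a filter on standard deviation or range is needed to exclude genes whose small changes are close to noise.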

Several different clustering algorithms are commonly used to analyze microarray data. In contrast to the K-means algorithm that divides the gene set into K discrete clusters, hierarchical clustering provides a "tree view" of the distances between gene expression profiles. Genes with similar expression profiles share close ancestral nodes; genes with dissimilar expression profiles have distant ancestral nodes.

Perform hierarchical clustering in Genesis using the first 500 genes of Tamayo et al. pre-processed using approach (c) above (filter for genes with > 3 fold and > 100 expression range (minimum 20), round up to 20, and z-transform).

Choose settings for hierarchical clustering in Genesis -- e.g. Euclidean distance, average linkage. Compare the clustering results tree with your clusters from the K-means algorithm (centroid views -- all clusters) -- it may help to use printouts of your data for comparison. Can you identify "K-means" clusters in the tree image? A useful property of hierarchical clustering is that nodes within the tree define clusters, and that the nodes within these clusters define "subclusters". However, when working with larger numbers of genes, it becomes harder to interpret the resulting trees.
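To show how a tree is built from Euclidean distances with average linkage, here is a minimal Python sketch. It is a simple illustration rather than the Genesis algorithm: the function names and the toy profiles are assumptions, and the brute-force search over all cluster pairs would be slow for large gene sets.

```python
import math


def euclidean(a, b):
    """Square root of the sum of squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def average_linkage(profiles):
    """Agglomerative clustering; returns merges as (id_a, id_b, distance).

    Leaves have ids 0..n-1; each merge creates a new node id n, n+1, ...
    """
    clusters = {i: [i] for i in range(len(profiles))}
    merges = []
    next_id = len(profiles)
    while len(clusters) > 1:
        best = None
        ids = sorted(clusters)
        for i, a in enumerate(ids):
            for b in ids[i + 1:]:
                # Average of all pairwise distances between the two clusters.
                total = sum(
                    euclidean(profiles[x], profiles[y])
                    for x in clusters[a]
                    for y in clusters[b]
                )
                d = total / (len(clusters[a]) * len(clusters[b]))
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        # Merge the closest pair into a new node of the tree.
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        merges.append((a, b, d))
        next_id += 1
    return merges


# Invented, already-normalized profiles: two rising genes, two falling genes.
profiles = [
    [-1.0, 0.0, 1.0],
    [-0.9, -0.1, 1.0],
    [1.0, 0.0, -1.0],
    [1.0, 0.2, -1.2],
]
for a, b, d in average_linkage(profiles):
    print(a, b, round(d, 3))
```

Each merge is a node of the tree: the first two merges join the similar gene pairs at small distances, and the last merge joins those two nodes at a much larger distance, which is how nodes define clusters and subclusters.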

4. Slide Data (optional)

An important difference between the Affymetrix and Slide microarray approaches is that with each slide, we compare the expression of two different mRNA or cDNA populations, each labeled with a different color of dye. This gives a ratio of expression values for each gene.

Typically, a log transformation is applied to the expression ratios so that we can compare the fold changes in the expression of genes in different populations. This notion applies to the Eisen (1998) data set in the IGS database -- i.e. the values in the database are logs of expression ratios.

Using the idea we discussed above, the mean log(ratio) values for each gene can be subtracted from the log(ratio) values so that all genes have the same mean (zero). Apply this transformation to the Eisen (1998) data set and then run the K-means algorithm on 500 or 1000 genes with the first 18 experiments. How do your clustering results compare to the clustering runs above? [Note: Since the data set is already stored as log ratios, use the "value - mean" transformation option, not "Log2(value) - Log2(mean)"].

5. Time Series (optional)

The first 18 experiments (columns) of the Eisen (1998) data represent wild type yeast cells with synchronized cell cycles. The cells were fixed at specific times in the cycle [e.g. in experiment 2 cells were fixed at 7 min (alpha 7)]. See which of your cluster profiles are periodic. Does the choice of K (in K-means) determine how many clusters have periodic profiles? What wavelengths (periods) do you see? Do different filtering or transformation choices reveal periodic expression of different genes?

You can also examine the annotations of genes that have periodic profiles. Do any of the annotations implicate them in cell cycle events? [Note: You can retrieve gene annotations by selecting the "Excel" output file format from the IGS database and in Genesis you can store the identities of genes in clusters by saving the cluster with a mouse right click.]

Another approach is to try clustering runs in which you select different ranges of experiments from the set of 18 (e.g. 1-5, 6-12). Is one cell cycle of expression data sufficient to give the same clusters? What happens if you use only a portion of each cycle?

The McDonald and Rosbash data set also contains data for timed experiments (Drosophila adults at different times of the day -- 0, 4, 8, 12 hours etc.). You might also analyze this Affymetrix data set to identify clusters of genes with periodic expression, using appropriate filters and transformations for your analysis.

Review the annotated results in your Word document. For each analysis, record your assessment of the implications of your results. At the end of the document, record your general conclusions regarding the different clustering approaches that you have explored.

6. Additional Issues:

In this lab, we have introduced a number of techniques for analyzing microarray data. There are a number of additional issues.

(a) How should the value of K be chosen for the K-means algorithm?

(b) What extra information does hierarchical clustering provide (compared to K-means)?

(c) What criteria would you use to decide on the clustering approach?


1. Explain briefly a significant reason why deep sequencing is now the preferred method for this kind of analysis (compared to microarray experiments).

2. Using the first 500 genes of the Tamayo (1999a) dataset perform K-means clustering for several choices of K and various filters and transformations. Use the resulting profiles to illustrate the analysis. Discuss how the results change for your various choices of parameters.

3. What additional output information would you add if you were re-designing the GENESIS implementations of K-means and hierarchical clustering?


Another common algorithm for clustering gene expression profiles makes use of Self Organizing Maps (SOM). Clusters are represented as an array of cells in two-dimensional space, and the expression vectors of each cell (cluster) are updated in each iteration of the SOM based on the expression values of the genes assigned to that cell, as well as genes in the neighboring cells -- but the influence of neighboring cells falls off with distance.
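The neighborhood idea can be illustrated with a single SOM update step in Python. This sketch is an assumption-laden illustration, not the Genesis implementation: it uses an online update for one gene at a time, a Gaussian fall-off with grid distance, and made-up parameter values and vectors.

```python
import math


def som_update(grid, profile, learning_rate=0.5, radius=1.0):
    """Move every cell's vector toward `profile`; nearer cells move more.

    `grid` maps (row, col) positions to expression vectors and is
    modified in place. Returns the position of the best-matching cell.
    """
    # The winning cell is the one whose vector is closest to the gene.
    winner = min(grid, key=lambda cell: math.dist(grid[cell], profile))
    for cell, vector in grid.items():
        grid_distance = math.dist(cell, winner)
        # Assumed Gaussian neighborhood: influence falls off with distance.
        influence = math.exp(-(grid_distance ** 2) / (2 * radius ** 2))
        grid[cell] = [
            w + learning_rate * influence * (x - w)
            for w, x in zip(vector, profile)
        ]
    return winner


# A 1 x 3 map with invented starting vectors (2 experiments per vector).
grid = {
    (0, 0): [0.0, 0.0],
    (0, 1): [1.0, 1.0],
    (0, 2): [2.0, 2.0],
}
winner = som_update(grid, [2.0, 0.0])
print(winner)        # → (0, 1)
print(grid[(0, 1)])  # → [1.5, 0.5]
```

The winning cell moves halfway toward the gene's profile, while its two neighbors move by a smaller fraction (scaled by exp(-0.5), about 0.61). This is why SOM centroids are influenced by genes assigned to neighboring clusters.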

In Genesis, try applying the SOM algorithm to the same data set (Tamayo et al. -- filtered and transformed) using the default settings. Compare the centroid values of clusters (which are influenced by neighboring clusters) with the expression of genes assigned to the clusters. Compare your results with those from K-means and hierarchical clustering.

Copyright 2019 Wesleyan University