Gene Expression Clustering Analysis
BIOL 265 / COMP 113 Computer Laboratory
M. Weir / M. Rice
Large-scale gene expression data can be obtained from microarray or deep sequencing experiments. In today’s lab we will explore data from microarrays. Microarray data can provide important information regarding regulatory relationships between genes. If genes are regulated equivalently (e.g. similar cis regulation), then their expression profiles in different conditions (experiments) can correlate. Various methods for clustering analysis can reveal these correlations. These methods will be explored in this lab session.
Please be sure to record your results as you proceed through this lab - an effective approach is to record screen images and paste them into Microsoft Word, carefully annotating how results were obtained (which application, clustering method, genes, filters, transformations). At the end of the lab, you will then be able to review your records and evaluate the approaches you have tried. Since microarray and deep sequencing data analyses are new fields, it is particularly important for us to critically assess different methods for analyzing these data.
1. Affymetrix Data
Go to the and select IGS Databases -> Access IGS Microarray & Slide Database. Select one of the Affymetrix microchip data sets (e.g. the Tamayo et al. (1999a) data) and click on the Access Data Set button. [We will consider slide data later.] Select a portion of the data set (e.g. expression values for the first 500 genes of the Tamayo (1999a) data for all 4 experiments) – do this by entering the appropriate gene range, click "Add gene range, " then enter the experiment range values and click "Add experiment range" [data input image]. [Also, in the middle of this page, select into a new browser window the "Click here to see percentiles..." link (in the "Expression Levels (percentiles)" box [previous run]) for us to look at later.] Next, click the Continue button at the bottom of the page. Once your chosen data set is displayed [previous image], click the Format Data button -- this will give you the option of outputting the data set in your format of choice. Click the Genesis button. This generates a tab-delimited text file which includes the gene identifier names in the first column of each line (row). Save the file in a convenient directory. [NOTE in some browsers, before saving the file, it may be necessary to switch to "view source" using the right mouse button, and then save using systematic labeling of the file – the file should be a .txt file to be recognized by Genesis]. [data input image, filters input, output from previous run]
2. Clustering Algorithms
Clustering methods use a distance measure (e.g. the Euclidean metric: sqrt(sum of squared differences)) to compare the expression values of pairs of genes across experiments. When the distance between a pair of genes is small, the two genes are candidates for the same cluster. We will use the Genesis application to cluster the raw data, starting with the K-means algorithm. Open the Genesis program, and import your data set into the program. The option "Expression Images" allows you to visualize the data on a red (high expression) to green (low expression) scale. Different genes are represented in different rows, and different experiments (microarrays) are found in each column. Before running a clustering algorithm, use the "Distance" pull-down menu to choose a distance measure (e.g. Euclidean = sqrt(sum of squared differences)). Then choose a clustering algorithm (e.g. the K-means option from the "Analysis" pull-down menu) and select the number of clusters (K) into which you wish to partition your genes (e.g. 20). Choosing an appropriate number is an important issue.
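To make the distance measure and the K-means procedure concrete, here is a minimal Python sketch. It is not the Genesis implementation; the function names, the toy profiles, and the choice of random starting centroids are assumptions made for illustration only.

```python
import math
import random

def euclidean(a, b):
    """Euclidean distance: square root of the sum of squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def centroid(profiles):
    """Mean expression profile of a group of genes (one mean per experiment)."""
    n = len(profiles)
    return [sum(column) / n for column in zip(*profiles)]

def kmeans(profiles, k, iterations=50, seed=0):
    """Partition profiles into k clusters; returns (labels, centroids)."""
    rng = random.Random(seed)
    # Assumption: start from k randomly chosen genes as initial centroids.
    centroids = [list(p) for p in rng.sample(profiles, k)]
    labels = None
    for _ in range(iterations):
        # Assignment step: each gene joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: euclidean(p, centroids[c]))
            for p in profiles
        ]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(profiles, new_labels) if lab == c]
            if members:
                centroids[c] = centroid(members)
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
    return labels, centroids

# Toy data: 4 genes (rows) measured in 4 experiments (columns).
toy_profiles = [
    [1.0, 2.0, 3.0, 4.0],
    [1.1, 2.1, 2.9, 4.2],
    [4.0, 3.0, 2.0, 1.0],
    [4.2, 2.9, 2.1, 0.9],
]
toy_labels, toy_centroids = kmeans(toy_profiles, k=2)
print(toy_labels)
```

The two rising genes end up in one cluster and the two falling genes in the other. On real data the result can depend on the starting centroids and on K, which is one reason the choice of K deserves attention.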
Your clustering results can be viewed in
several ways. Take a look at the (mean) expression profiles of your clusters by
looking at "Centroid views" -- first choose the "All
clusters" option to see the profiles of all clusters with the numbers of
genes in each cluster. Note that in order to see the profiles, you may need to
"adjust to maximum" under the "View" drop-down menu in
Genesis. To obtain a better indication of how the individual genes contribute
to the clusters, choose "Expression views" to display the expression
profiles of all genes in a cluster, with the centroid highlighted. To save
files listing the genes (and their expression values) in individual clusters,
use mouse right click "save cluster" (or click "save all
clusters" to save a file for each cluster). You can also use the mouse
right click to save images of the cluster profiles. [You might also save the
screen image for the all-clusters image of centroid views.]
Notice that many of the centroid profiles are rather flat, and they often are not particularly representative of the profiles of the individual genes in the cluster. This indicates that we need to consider preprocessing the data set before running clustering algorithms. For example, when we preprocess the first 500 genes in the same way as Tamayo et al. (discussed below), the following profiles result from K-means clustering (K = 10).
3. Preprocessing Data
In the middle of the Tamayo data set page
(the page with filtering and transformation options), we selected earlier the
"Click here to see percentiles..." link (in the "Expression Levels
(percentiles)" box [previous run]). The entire data set of
expression values from the microarray experiments is divided into 20 bins each
representing 5 percent increments.
Notice that a large percentage of the data set values are negative. Processing of the microarrays includes estimating non-specific background signal, which is then subtracted from all expression values. The Affymetrix algorithms for calculating gene expression values compare the signals obtained with perfect-match and one-base-mismatch hybridization oligonucleotides on the microarrays (oligonucleotide "probe pairs"). Because the one-base-mismatch oligonucleotides can sometimes hybridize to other mRNAs, they do not always give a good representation of non-specific background signal. Hence, apparent negative expression values for genes can result. Noise in the data can also produce negative values.
On Affymetrix microarrays, the expression of each gene is measured using several different spots (the oligonucleotide probe pairs) -- each probe pair corresponds to a different region of the gene or mRNA sequence. The "gene calls" of present (P) or absent (A) depend upon whether the signals for the different oligos are internally consistent. Also, expression values considered too low to be measurable are given a call of "A". Indeed, notice that many of the low expression values, and (virtually) all the negative values, are scored as "A". To assess this, scroll down the data set in the "Query Results" window (each column number represents a different microarray experiment -- in the case of Tamayo et al. (1999a), experiment 1, 2, 3 or 4).
For purposes of clustering analysis, we may wish to filter out all genes that do not have at least one "P" call in any of the experiments.
We might also wish to transform all expression values below some threshold cut-off value to that threshold value; e.g., Tamayo et al. transform all expression values less than 20 to 20 (round-up threshold). This technique removes all the negative values – which will be useful later.
Starting again with the Tamayo data set, apply the preceding two operations to the data set. [It is best to press "Reset Page" at the bottom of the Filter and Transformations page so that you keep track of exactly which operations are activated.] Repeat K-means clustering in Genesis (with the same value of K). As before, in order to see the profiles, you may need to "adjust to maximum" under the "View" drop-down menu in Genesis. Has this filtering and transformation of the starting data affected the distribution of cluster expression (centroid) profiles? [Save a screen image of the cluster centroid profiles.] [input, output from previous run]
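The two operations described above (keeping only genes with at least one "P" call, then rounding low values up to a threshold) can be sketched in Python as follows. The gene names, values, and calls are invented for illustration, and the function names are hypothetical; the database's own filter may differ in detail.

```python
def has_present_call(calls):
    """True if at least one experiment is called present ("P")."""
    return "P" in calls

def round_up(values, threshold=20):
    """Raise every expression value below the threshold to the threshold."""
    return [max(v, threshold) for v in values]

# Hypothetical genes: (expression values, P/A calls) for 4 experiments.
genes = {
    "gene_1": ([-12, 5, 150, 310], ["A", "A", "P", "P"]),
    "gene_2": ([-40, -3, 8, 15], ["A", "A", "A", "A"]),
    "gene_3": ([90, 75, 18, 60], ["P", "P", "A", "P"]),
}

# Drop genes with no "P" call, then apply the round-up threshold of 20.
filtered = {
    name: round_up(values)
    for name, (values, calls) in genes.items()
    if has_present_call(calls)
}
print(filtered)
```

Here gene_2 is removed because it is called absent in every experiment, and all remaining values are at least 20, so no negative values survive.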
We are still not seeing many
"interesting" profile patterns. Our problem is that the profiles of
each gene are not being measured or compared effectively by our "distance
measure" calculation in the clustering.
If we subtract the mean expression value for each gene from each of its expression values, then all the genes will be varying about the same mean (zero).
Apply this transformation and see how this improves the clustering results -- do you now see a more "interesting" range of profiles in different clusters? [input, output from previous run]
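A small worked example shows why subtracting each gene's mean helps the distance measure. The two genes below (invented values) have the same profile shape at very different expression levels; the helper names are assumptions for this sketch.

```python
import math

def euclidean(a, b):
    """Square root of the sum of squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean_center(values):
    """Subtract the gene's own mean so the profile varies about zero."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]

# Two hypothetical genes with the same shape but different baselines.
high_gene = [1000, 1100, 1200, 1100]
low_gene = [100, 200, 300, 200]

raw_distance = euclidean(high_gene, low_gene)
centered_distance = euclidean(mean_center(high_gene), mean_center(low_gene))
print(raw_distance, centered_distance)
```

Before centering, the distance is dominated by the difference in baseline (900 in every experiment); after centering, the two profiles coincide and the distance is zero.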
We still face a problem, however. Genes whose raw expression values are high may tend to show high expression changes, whereas genes with lower (but measurable) expression values might show much smaller (but nevertheless meaningful) expression changes. We can try different approaches to address this problem. Here are three possibilities: [For each approach below, save screen images for "All-Cluster" centroid views.]
(a) Apply a log transformation to the expression values of each gene.
(b) Use the mean and standard deviation of the expression values for each gene to normalize each expression value.
(c) Filter out genes whose expression values do not vary significantly, e.g. genes with approximately the same maximum and minimum expression values.
Try (a), (b) and (c), or if time is limited, go straight to (c).
(a) For example, if an expression value doubles, then the change on a log scale will be the same regardless of whether the gene has large or small expression values. Compute the log of each expression value, subtract the log of the mean for each gene, and repeat the clustering on the same data set. Since you are taking logs, you will first need to use a threshold transformation (e.g. round up to a minimum gene expression value of 20) to avoid problems with non-positive expression values. [Before this transformation, you may also wish to filter out genes whose expression values never exceed the threshold of 20.]
(b) Normalize each expression value by subtracting the mean from each expression value and then dividing the result by the standard deviation. This computes the z-score for each expression value. The standard deviation for each gene’s expression values will now be 1. You may first want to filter out genes with low standard deviations that are close to the noise in the system (experiment with different standard deviation cut-offs). As in (a), you may also wish to filter out genes whose expression values never exceed a certain threshold.
(c) A drawback of using a standard deviation cut-off is that rare larger changes may be missed due to the averaging effect of the standard deviation calculation. An alternative, used e.g. by Tamayo et al. (1999) and Coller et al. (2000), is to filter out genes whose maximum and minimum expression values differ by less than 100, or whose fold-difference (ratio) is less than 3, so that the remaining genes satisfy both criteria. Try applying these two filters where you specify a minimum expression threshold to be used (e.g. 20) such that expression values below this threshold are not considered in the range or fold calculations. After using these filters, compute the z-scores (see (b)). Before applying these transformations, expression values should be rounded up to the minimum expression threshold, e.g. 20, used in your range filters. [input, output from previous run]
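The three approaches (a), (b) and (c) can be sketched in Python as shown below. This is an illustration under stated assumptions, not the code used by the database or by Genesis: the function names are hypothetical, the z-score uses the population standard deviation (the tools may use the sample standard deviation instead), and the variation filter keeps a gene only when it meets both the range and the fold criteria, matching the "> 3 fold and > 100 expression range" description used for hierarchical clustering.

```python
import math
import statistics

def round_up(values, threshold=20):
    """Raise values below the threshold to the threshold."""
    return [max(v, threshold) for v in values]

def log_center(values, threshold=20):
    """(a) log2(value) - log2(mean), after rounding up to the threshold."""
    floored = round_up(values, threshold)
    log_mean = math.log2(sum(floored) / len(floored))
    return [math.log2(v) - log_mean for v in floored]

def z_scores(values):
    """(b) (value - mean) / standard deviation for one gene."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)  # assumption: population SD
    if sd == 0:
        return [0.0 for _ in values]  # flat profile: no variation to scale
    return [(v - mean) / sd for v in values]

def passes_variation_filter(values, threshold=20, min_range=100, min_fold=3):
    """(c) Keep a gene only if max - min >= min_range and max / min >= min_fold."""
    floored = round_up(values, threshold)
    highest, lowest = max(floored), min(floored)
    return (highest - lowest) >= min_range and (highest / lowest) >= min_fold

# (a) A doubling looks the same on a log scale at low and high expression.
low_doubling = log_center([50, 100])
high_doubling = log_center([500, 1000])

# (b) After the z-transform, each gene has standard deviation 1.
z = z_scores([100, 200, 300])

# (c) Invented examples: a strongly changing gene and a high but flat gene.
changing = passes_variation_filter([10, 50, 400, 90])
flat_high = passes_variation_filter([500, 520, 560, 610])
print(low_doubling, high_doubling, z, changing, flat_high)
```

Note that the flat but highly expressed gene fails the fold criterion even though its range exceeds 100, which is the sense in which (c) depends on absolute expression levels.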
How do the above approaches affect the
results of clustering?
What is the difference between these approaches? How does your answer relate to the biology of the system? In (a), we focus on fold differences, but in (b) and (c), we focus on the profile of the changes, and include all profiles with changes above some threshold measure. In (b), this threshold measure depends on the standard deviation of each gene, and is not dependent on whether the gene has a high or low mean expression level. In (c) the threshold measure does depend upon the absolute expression levels (fold measure constraint) as well as the absolute size of the range.
Several different clustering algorithms are commonly used to analyze microarray data. In contrast to the K-means algorithm that divides the gene set into K discrete clusters, hierarchical clustering provides a "tree view" of the distances between gene expression profiles. Genes with similar expression profiles share close ancestral nodes; genes with dissimilar expression profiles have distant ancestral nodes.
Perform hierarchical clustering in Genesis using the first 500 genes of Tamayo et al. pre-processed using approach (c) above (filter for genes with > 3 fold and > 100 expression range (minimum 20), round up to 20, and z-transform).
Choose settings for hierarchical clustering in Genesis -- e.g. Euclidean distance, average linkage. Compare the clustering results tree with your clusters from the K-means algorithm (centroid views -- all clusters) -- it may help to use printouts of your data for comparison. Can you identify "K-means" clusters in the tree image? A useful property of hierarchical clustering is that nodes within the tree define clusters, and that the nodes within these clusters define "subclusters". However, when working with larger numbers of genes, it becomes harder to interpret the resulting trees.
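The idea of building a tree by repeatedly merging the closest clusters can be sketched in Python with Euclidean distance and average linkage. This is a simple illustration, not the Genesis implementation: the function names and toy profiles are assumptions, and the brute-force search is only practical for small gene sets.

```python
import math
from itertools import combinations

def euclidean(a, b):
    """Square root of the sum of squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def average_linkage(profiles, cluster_a, cluster_b):
    """Mean pairwise distance between the members of two clusters."""
    total = sum(
        euclidean(profiles[i], profiles[j])
        for i in cluster_a
        for j in cluster_b
    )
    return total / (len(cluster_a) * len(cluster_b))

def hierarchical(profiles):
    """Agglomerative clustering; returns merges as (members_a, members_b, distance)."""
    clusters = [(i,) for i in range(len(profiles))]  # start: one gene per cluster
    merges = []
    while len(clusters) > 1:
        # Find the closest pair of clusters under average linkage.
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: average_linkage(profiles, pair[0], pair[1]),
        )
        merges.append((a, b, average_linkage(profiles, a, b)))
        # Replace the pair with a single merged cluster (a new tree node).
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges

# Toy data: genes 0 and 1 rise together, genes 2 and 3 fall together.
tree_profiles = [
    [1.0, 2.0, 3.0, 4.0],
    [1.1, 2.1, 2.9, 4.2],
    [4.0, 3.0, 2.0, 1.0],
    [4.2, 2.9, 2.1, 0.9],
]
merges = hierarchical(tree_profiles)
for members_a, members_b, distance in merges:
    print(members_a, members_b, round(distance, 3))
```

Each merge corresponds to a node in the tree. The early, short-distance merges define tight subclusters, and the final merge joins them at a much larger distance, which is how nodes within the tree define clusters and subclusters.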
4. Slide Data (optional)
An important difference between the Affymetrix and Slide microarray approaches is that with
each slide, we compare the expression of two different mRNA or cDNA populations, each labeled with a different color of
dye. This gives a ratio of expression values for each gene.
Typically, a log transformation is applied to the expression ratios so that we can compare the fold changes in the expression of genes in different populations. This notion applies to the Eisen (1998) data set in the -- i.e. the values in the database are logs of expression ratios.
Using the idea we discussed above, the mean log(ratio) values for each gene can be subtracted from the log(ratio) values so that all genes have the same mean (zero). Apply this transformation to the Eisen (1998) data set and then run the K-means algorithm on 500 or 1000 genes with the first 18 experiments. How do your clustering results compare to the clustering runs above? [Note: Since the data set is already stored as log ratios, use the "value - mean" transformation option, not "Log2(value) - Log2(mean)"].
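As a brief worked example of why logs of ratios are convenient, the Python sketch below uses invented ratios for one gene. A two-fold increase and a two-fold decrease become +1 and -1 on a log2 scale, and the "value - mean" transformation is then applied directly to the log values. The function name is an assumption for illustration.

```python
import math

def mean_center(values):
    """The "value - mean" transformation for one gene."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]

# Hypothetical expression ratios (one population relative to the other).
ratios = [2.0, 0.5, 1.0, 4.0]

# Log ratios: fold increases and decreases become symmetric about zero.
log_ratios = [math.log2(r) for r in ratios]

# The data are already logs, so subtract the mean without taking logs again.
centered = mean_center(log_ratios)
print(log_ratios, centered)
```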
5. Time Series (optional)
The first 18 experiments (columns) of the Eisen (1998) data represent wild type yeast cells with synchronized cell cycle. The cells were fixed at specific times in the cycle [e.g. in experiment 2 cells were fixed at 7 min (alpha 7)]. See which of your cluster profiles are periodic. Does the choice of K (in K-means) determine how many clusters have periodic profiles? What wavelengths do you see? Do different filtering or transformation choices reveal periodic expression of different genes?
You can also examine the annotations of genes that have periodic profiles. Do any of the annotations implicate them in cell cycle events? [Note: You can retrieve gene annotations by selecting the "Excel" output file format from the IGS database and in Genesis you can store the identities of genes in clusters by saving the cluster with a mouse right click.]
Another approach is to try clustering runs in which you select different ranges of experiments from the set of 18 (e.g. 1-5, 6-12). Is one cell cycle of expression data sufficient to give the same clusters? What happens if you use only a portion of each cycle?
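Periodicity is judged by eye in this lab, but one possible way to put a number on the "wavelength" of a centroid profile is autocorrelation: compare the profile with a copy of itself shifted by a given number of time points. The Python sketch below is an optional illustration and is not part of Genesis; the function names and the synthetic profile are assumptions, and with only 18 time points the estimate is rough, especially for noisy profiles.

```python
import math

def autocorrelation(values, lag):
    """Correlation of a profile with itself shifted by `lag` time points."""
    n = len(values)
    mean = sum(values) / n
    centered = [v - mean for v in values]
    variance = sum(c * c for c in centered) / n
    if variance == 0:
        return 0.0  # flat profile: no periodic signal
    overlap = n - lag
    covariance = sum(centered[t] * centered[t + lag] for t in range(overlap)) / overlap
    return covariance / variance

def dominant_period(values, min_lag=2):
    """Lag (in time points) with the highest autocorrelation, up to half the series."""
    lags = range(min_lag, len(values) // 2 + 1)
    return max(lags, key=lambda lag: autocorrelation(values, lag))

# Synthetic centroid: 18 time points covering two cycles of 9 points each.
synthetic_profile = [math.sin(2 * math.pi * t / 9) for t in range(18)]
period_in_points = dominant_period(synthetic_profile)
print(period_in_points)
```

The result is in units of time points; converting it to minutes requires the sampling interval of the experiments. Restricting the analysis to a subset of experiments shortens the series and limits the longest period that can be detected.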
The McDonald and Rosbash data set also contains data for timed experiments (Drosophila adults at different times of the day -- 0, 4, 8, 12 hours etc.) You might also analyze this Affymetrix data set to identify clusters of genes with periodic expression using appropriate filters and transformations for your analysis.
Review the annotated results in your Word document. For each analysis, record your assessment of the implications of your results. At the end of the document, record your general conclusions regarding the different clustering approaches that you have explored.
6. Additional Issues:
In this lab, we have introduced a number of techniques for analyzing microarray data. There are a number of additional issues to consider:
(a) How should the value of K be chosen for the K-means algorithm?
(b) What extra information does hierarchical clustering provide (compared to K-means)?
(c) What criteria would you use to decide on the clustering approach?
1. Explain briefly a significant reason why deep sequencing is now the preferred method for this kind of analysis (compared to microarray experiments).
2. Using the first 500 genes of the Tamayo (1999a) dataset, perform K-means clustering for several choices of K and various filters and transformations. Use the resulting profiles to illustrate the analysis. Discuss how the results change for your various choices of parameters.
3. What additional output information would you add if you were re-designing the GENESIS implementations of K-means and hierarchical clustering?
Appendix: Another common algorithm for clustering gene expression profiles makes use of Self Organizing Maps (SOM). Clusters are represented as an array of cells in two-dimensional space, and the expression vectors of each cell (cluster) are updated in each iteration of the SOM based on the expression values of the genes assigned to that cell, as well as genes in the neighboring cells -- but the influence of neighboring cells falls off with distance.
In Genesis, try applying the SOM algorithm to the same data set (Tamayo et al. -- filtered and transformed) using the default settings. Compare the centroid values of clusters (which are influenced by neighboring clusters) with the expression of genes assigned to the clusters. Compare your results with those from K-means and hierarchical clustering.
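The neighborhood idea behind SOMs can be sketched in Python. This sketch uses the common "online" update rule, in which one randomly chosen gene at a time pulls its best-matching cell and, more weakly, the neighboring cells toward its profile. The Genesis implementation and its default settings may differ, and the function names, grid size, learning rate, and Gaussian neighborhood are all assumptions made for illustration.

```python
import math
import random

def train_som(profiles, rows, cols, iterations=200, start_rate=0.5, seed=0):
    """Train a rows x cols map; returns {(row, col): expression vector}."""
    rng = random.Random(seed)
    cells = [(r, c) for r in range(rows) for c in range(cols)]
    # Assumption: initialize each cell with a randomly chosen gene profile.
    weights = {cell: list(rng.choice(profiles)) for cell in cells}
    start_radius = max(rows, cols) / 2
    for step in range(iterations):
        fraction_left = 1 - step / iterations
        rate = start_rate * fraction_left                    # learning rate decays
        radius = max(start_radius * fraction_left, 0.5)      # neighborhood shrinks
        gene = rng.choice(profiles)
        # Best-matching cell: the cell whose vector is closest to this gene.
        winner = min(cells, key=lambda cell: math.dist(gene, weights[cell]))
        for cell in cells:
            # Influence falls off with distance from the winner on the grid.
            grid_distance = math.dist(cell, winner)
            influence = math.exp(-(grid_distance ** 2) / (2 * radius ** 2))
            weights[cell] = [
                w + rate * influence * (g - w)
                for w, g in zip(weights[cell], gene)
            ]
    return weights

def assign_to_cells(profiles, weights):
    """Assign each gene to the cell with the closest expression vector."""
    return [
        min(weights, key=lambda cell: math.dist(p, weights[cell]))
        for p in profiles
    ]

# Toy data: two rising genes and two falling genes on a 1 x 2 map.
som_profiles = [
    [1.0, 2.0, 3.0, 4.0],
    [1.1, 2.1, 2.9, 4.2],
    [4.0, 3.0, 2.0, 1.0],
    [4.2, 2.9, 2.1, 0.9],
]
som_weights = train_som(som_profiles, rows=1, cols=2)
som_assignments = assign_to_cells(som_profiles, som_weights)
print(som_weights, som_assignments)
```

Because each cell's vector is also pulled by genes assigned to neighboring cells, it need not equal the mean profile of its own genes, which is the effect to look for when comparing SOM centroids with the genes assigned to them.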
Copyright 2019 Wesleyan University