CLUSTAL W and BLASTP
BIOL 265/COMP 113 Computer Laboratory
M. Weir / D. Krizanc / M. Rice
Introduction to sequence alignments: Sequence alignments are used extensively in biology. In this lab, we will explore sequence alignments from several perspectives. We will first carry out sequence alignments of fragments of several proteins. We will then discuss one of the major reasons why sequence alignments are so important -- the relationship between sequence alignment, evolutionary conservation, and gene function.
ALIGNING SEQUENCES USING CLUSTALW
Here are several fragments of protein sequences (in FASTA format) taken from different organisms. Let us use an alignment program, CLUSTALW, to see if there is sequence conservation between these sequences. The CLUSTALW program can be run from several web sites including Kyoto, EMBL, or EMBnet.
Paste the list of sequences into the sequence box and run CLUSTALW (e.g. at Kyoto). For now, do not worry about program options -- use the default options for protein sequences. The Kyoto site gives you results in several formats (here is a partial output from a previous submission): alignment scores for pairs of sequences, a multiple alignment, and alignment trees showing distances between sequences (the trees are linked at the bottom of the page).
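To make the FASTA format and the pairwise scores more concrete, here is a minimal Python sketch. It assumes the pairwise score can be read roughly as percent identity over the aligned, non-gap positions. The function names and the two toy sequences are made up for illustration; they are not the lab's sequences and this is not CLUSTALW's own code.

```python
def parse_fasta(text):
    """Return a dict mapping each FASTA header (without '>') to its sequence."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:]
            records[name] = ""
        elif name is not None:
            records[name] += line
    return records


def percent_identity(a, b):
    """Percent of compared columns (gaps skipped) where two aligned sequences match."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    compared = matches = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            continue  # a gap in either sequence is not counted
        compared += 1
        if x == y:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


# Toy, made-up aligned fragments (hypothetical; NOT the sequences used in this lab).
toy = """>seqA
MKTAYIAK
>seqB
MKTA-IAR
"""
records = parse_fasta(toy)
score = percent_identity(records["seqA"], records["seqB"])
print(f"{score:.1f}")  # prints 85.7 (6 matches out of 7 compared columns)
```

The sequences must already be aligned (same length, with "-" for gaps) before percent_identity is called; producing that alignment is the job of the alignment program.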
[If you would like a color block diagram of the alignment where similar amino acids have the same color, use the EMBL site and select the “show colors” option.]
Notice that these sequence fragments were chosen because they contain a conserved motif as discussed below. Different organisms have similar versions of the sequence (sequence conservation).
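One simple way to put a number on sequence conservation is to ask, for each column of a multiple alignment, what fraction of the sequences share the most common residue. The Python sketch below does this under stated assumptions: the function name is hypothetical, the four aligned strings are made-up toy data, and real tools use more refined measures.

```python
from collections import Counter


def column_conservation(aligned):
    """For each alignment column, return the fraction of sequences
    that carry the most common (non-gap) residue in that column."""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("all aligned sequences must have equal length")
    scores = []
    for column in zip(*aligned):
        residues = [c for c in column if c != "-"]
        if not residues:
            scores.append(0.0)  # column contains only gaps
            continue
        top_count = Counter(residues).most_common(1)[0][1]
        scores.append(top_count / len(column))
    return scores


# Toy, made-up alignment (hypothetical; not the lab's sequences).
toy_alignment = ["MKTAYI", "MKTAYI", "MKSAYL", "MR-AYI"]
conservation = column_conservation(toy_alignment)
print(conservation)  # prints [1.0, 0.75, 0.5, 1.0, 1.0, 0.75]
```

Columns scoring 1.0 are identical in every sequence; a run of high-scoring columns is the kind of block that stands out as a conserved motif.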
Notice that the Drosophila sequences are somewhat different from the vertebrate sequences, and that the nematode sequence is even less similar. [This pattern, in which more closely related organisms have more similar sequences, is used in the design of sequence alignment algorithms to be discussed in class.]
Also notice that a single organism, e.g., Drosophila, has different proteins with similar motifs. Indeed, the sequence conservation is seen in several other Drosophila proteins. For example, use CLUSTALW to align the following sequences. [previous run]
A CONSERVED DEVELOPMENTAL GENE
The examples above are artificial in that we pre-selected sequence fragments that show sequence conservation. Let us imagine instead that we were studying a protein involved in animal development.
Studies in Drosophila have identified homeotic and segmentation genes which play a central role in embryo development. When these genes are mutated, the resulting phenotypes tell us about the functions of these genes.
For example, mutation of the Ubx homeotic gene gives rise to fruit flies with four wings (RHS) instead of the normal two wings (LHS). Analysis of the sequence of the Ubx protein can provide insights into how the gene functions in development.
You may get the full Ubx protein sequence (fasta) (previous download) from NCBI. A fragment of this sequence was included in one of the sequence alignments above. However, we do not know a priori which regions of a protein sequence might be conserved. To find conserved regions of a protein, we can use a pattern searching program such as BLAST (Basic Local Alignment Search Tool) [Notice that this is a local, not a global, alignment].
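To illustrate what "local" means, here is a minimal Python sketch of local alignment by dynamic programming (the Smith-Waterman method). This is not how BLAST is implemented; BLAST uses a faster heuristic. The scoring values (match +2, mismatch -1, gap -2) and the toy sequences are assumptions chosen only for illustration.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best score, aligned piece of a, aligned piece of b)
    for the best-scoring local alignment of sequences a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]  # H[i][j]: best score ending at a[i-1], b[j-1]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # The 0 lets an alignment start anywhere: this is what makes it local.
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until the score drops to 0.
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        step = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + step:
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


# Toy, made-up sequences sharing one internal block (hypothetical data).
result = smith_waterman("PPPMKTAYIPPP", "GGMKTAYIGG")
print(result)  # prints (12, 'MKTAYI', 'MKTAYI')
```

A global alignment would be forced to pair the unrelated flanking residues as well; the local method reports only the shared block, which is why it suits searching for conserved regions inside otherwise dissimilar proteins.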
In NCBI BLAST, go to Standard protein-protein BLAST [BLASTP], paste the complete Ubx sequence (with amino acid numbers) into the input box, and run the BLASTP program (use the "nr" database). The result may take some time to appear. Once it does, mouse over the color-coded alignment box. The most strongly conserved proteins appear as red bars; less well conserved proteins appear in other colors (see color key). Click on one of the bars to see the actual sequence alignment. The input sequence is the query; the subject sequence is the similar protein sequence found in the database. Notice that parts of the input sequence are highly conserved in many other protein sequences.
You may want to restrict the dataset searched to, e.g., human sequences, using the general BLAST page -- this will show you the human sequences closest to the Drosophila Ubx protein. [For your search, use BLASTP, the curated protein database RefSeq, and limit the search to Homo sapiens.] [A previous blast search is available here.] Notice the block of very strong conservation towards the C-terminus of the Ubx protein.
Let us follow up on this apparent conservation: go to Search for conserved domains and test the Ubx protein. You will find that Ubx has a conserved motif called the homeodomain -- this is the conserved domain that we examined in the first section. Click on cd00086 homeodomain to see information on the motif. Homeodomain-containing proteins are transcription factors that bind to cis-regulatory DNA of the genes that they regulate. The proteins bind DNA through their homeodomain. Vertebrates have a large family of proteins with homeodomains; among them are the Hox proteins, which play a role in specifying the appropriate development of groups of cells.
If regions of proteins are conserved, this suggests that they are important for functioning of the protein -- in this case, DNA-binding function. We can gain insight into how conserved motifs function by looking at their structures.
The Cn3D program allows you to see sequence alignments for a conserved motif, and the structure of that motif based on X-ray crystallography or NMR studies. Choose one of the homeodomain family links; e.g., cd00086, Homeodomain. Listed at the bottom of the page is a group of aligned homeodomain-containing protein sequences -- similar to the homeodomain protein sequences we aligned above with CLUSTALW. Set the program to Cn3D, and press "show structure".
You can highlight residues on the first sequence of the alignment, and the corresponding residues will show up as yellow in the structure. For example, the amino acids at the C-terminal end (RHS) of the homeodomain form the third alpha helix of the homeodomain (try highlighting them) -- this alpha helix is called the "recognition" helix; it lies in the major groove of the DNA and makes specific DNA contacts. Try rotating the structure to get a better view (drag the mouse over the image while holding the button down). Although the structures of only a small number of homeodomains have been determined, it is likely that other homeodomains have similar structures given that their sequences are conserved.
This example illustrates how protein sequence motifs can be conserved because they have particular functions and corresponding structures associated with those functions. Hence, sequence alignments that identify conserved motifs can provide important insights into the possible functions and structures associated with a protein.
You can visit the page CSH: Homeodomains in development for more information on how homeodomains function in development. Notice that many of the core mechanisms identified in model organisms like Drosophila are likely to also operate in other organisms including humans.
Summarize briefly what is meant by "sequence conservation" and discuss its relationship to protein structure.
Propose different kinds of criteria that might be used to define permitted amino acid substitutions. Think about theoretical criteria based on amino acid properties, and practical criteria based on observed sequence alignments (consider how BLOSUM matrix values are computed).
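As a starting point for thinking about criteria based on observed alignments, here is a minimal Python sketch of a BLOSUM-style log-odds calculation: the score for a pair of residues compares how often the pair is observed in aligned columns with how often it would be expected by chance, in half-bit units. The function name and the tiny two-letter block are made up for illustration, and the sketch deliberately omits the step in which real BLOSUM matrices cluster similar sequences before counting.

```python
import math
from collections import Counter, defaultdict
from itertools import combinations


def log_odds_scores(block):
    """Compute BLOSUM-style scores, round(2 * log2(observed / expected)),
    from an ungapped block of aligned sequences."""
    # Count every residue pair seen within each column.
    pair_counts = Counter()
    for column in zip(*block):
        for x, y in combinations(column, 2):
            pair_counts[tuple(sorted((x, y)))] += 1
    total = sum(pair_counts.values())
    observed = {pair: n / total for pair, n in pair_counts.items()}

    # Background frequency of each residue, derived from the pair frequencies.
    background = defaultdict(float)
    for (x, y), freq in observed.items():
        if x == y:
            background[x] += freq
        else:
            background[x] += freq / 2
            background[y] += freq / 2

    scores = {}
    for (x, y), freq in observed.items():
        # A mixed pair can arise in two orders, hence the factor of 2.
        expected = background[x] * background[y] * (1 if x == y else 2)
        scores[(x, y)] = round(2 * math.log2(freq / expected))
    return scores


# Toy, made-up block using only two residue types (hypothetical data).
toy_block = ["ALLA", "ALLA", "ALLA", "ALLL"]
scores = log_odds_scores(toy_block)
print(scores)  # prints {('A', 'A'): 2, ('A', 'L'): -4, ('L', 'L'): 1}
```

In this toy block the A/L substitution is seen less often than chance would predict, so it receives a negative score, while identical pairs score positively. Pairs never observed in the block get no score at all here.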
Copyright 2019 Wesleyan University