Genomics Laboratory

BIOL/MBB 210

Michael Weir

 


Introduction

In this genomics laboratory, we are going to examine the human gene Aniridia and its Drosophila homolog, eyeless.  The two genes are described as homologs because their protein sequences are very similar.  The most similar Drosophila sequence to human Aniridia is Drosophila eyeless.

 

The human gene has several names:

Pax 6

Aniridia

Paired box gene 6

 

Aniridia is named after a human disease that affects development of the eye. The Drosophila eyeless mutant phenotype, as its name suggests, also affects eye development. This similarity of function is striking since vertebrate and insect eyes are morphologically very different. This is discussed on pages 6-7 of the Hartwell et al. (2004) text.

 


Drosophila eyeless

Let us start by looking at the Drosophila homolog, eyeless (ey).

Using Flybase, go to the eyeless gene of Drosophila melanogaster.

 

How many different transcripts are there? 

 

What is a likely explanation for there being more than one Drosophila ey transcript?

 

Compare transcript ey-RA with ey-RB:

a. Describe the splicing events that give rise to these two transcripts.

b. Identify the sequences of the mRNA segment(s) that differ.

c. Compare the protein sequences encoded by ey-RA and ey-RB (ey-PA and ey-PB).
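
One way to locate the segment(s) that differ between two transcripts is to compare the sequences programmatically. The sketch below is a minimal illustration in Python using the standard-library difflib module. The function name differing_segments and both toy sequences are hypothetical; they are not the real ey-RA or ey-RB sequences, which you would paste in from Flybase.

```python
from difflib import SequenceMatcher


def differing_segments(seq_a, seq_b):
    """Return (tag, a_start, a_end, b_start, b_end) for each block that differs."""
    # autojunk=False stops difflib from ignoring very common characters,
    # which matters for a four-letter nucleotide alphabet.
    matcher = SequenceMatcher(None, seq_a, seq_b, autojunk=False)
    return [op for op in matcher.get_opcodes() if op[0] != "equal"]


# Hypothetical toy sequences (NOT the real ey-RA / ey-RB transcripts).
# toy_b is toy_a with one extra segment inserted, as an extra exon might be.
toy_a = "ATGGCCATTGTA" + "ATGGGCCGCTGAAAGGGTGCCCGATAG"
toy_b = "ATGGCCATTGTA" + "CCTTCCTTC" + "ATGGGCCGCTGAAAGGGTGCCCGATAG"

for tag, a0, a1, b0, b1 in differing_segments(toy_a, toy_b):
    print(f"{tag}: A[{a0}:{a1}]={toy_a[a0:a1]!r}  B[{b0}:{b1}]={toy_b[b0:b1]!r}")
```

This kind of character-by-character comparison is only a quick check. For full-length transcripts, an alignment tool such as BLAST two sequences is the more reliable choice.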

 

You will find it useful to use several sequence analysis programs to address these questions.

 

--BLAST two sequences allows you to align two sequences

--ORF finder shows all possible protein open reading frames of a DNA sequence
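
To make the idea of "all possible open reading frames" concrete, here is a minimal Python sketch of a six-frame ORF scan. It is a simplification and not the NCBI ORF finder itself: it assumes the standard genetic code, requires an ATG start, and discards reading frames that run off the end of the sequence without a stop codon. The names find_orfs and reverse_complement, the min_codons parameter, and the toy DNA string are all hypothetical.

```python
from itertools import product

# Standard genetic code, with codons ordered T, C, A, G at each position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def reverse_complement(dna):
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_orfs(dna, min_codons=3):
    """Return (strand, frame, protein) for each ATG...stop ORF in six frames."""
    dna = dna.upper()
    orfs = []
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for frame in range(3):
            protein = []      # residues of the ORF currently being read
            in_orf = False
            for i in range(frame, len(seq) - 2, 3):
                aa = CODON_TABLE.get(seq[i:i + 3], "X")  # X = unknown codon
                if aa == "*":
                    # A stop codon closes the current ORF, if one is open.
                    if in_orf and len(protein) >= min_codons:
                        orfs.append((strand, frame + 1, "".join(protein)))
                    protein, in_orf = [], False
                elif in_orf or aa == "M":
                    in_orf = True
                    protein.append(aa)
    return orfs


# Hypothetical toy sequence, not taken from eyeless.
toy_dna = "AAATGGCCATTTAAGG"
for strand, frame, protein in find_orfs(toy_dna):
    print(strand, frame, protein)
```

With a real transcript you would typically raise min_codons so that only reading frames long enough to be plausible proteins are reported.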

 

It is also often convenient to paste sequences and screen images of your results into a document (e.g. in Microsoft Word) so that you can refer back to them.

 

As we make use of the information reported by the Drosophila Genome Project, it is worth considering issues of accuracy:

(i) What criteria might a genome project use when deciding whether to report multiple transcripts and proteins for a given gene?  Are these criteria based on known data or predictions?

(ii) How much confidence do you have in the accuracy of information reported on a genome project site?

We tend to assume that information reported in genome projects is accurate.  However, we should always keep in mind the basis for the reported information, and consider possible sources of error.  The information is often the best available at the time, but it is not always completely accurate, and it is frequently corrected over time.

 


Human Aniridia

Now let us move to the human gene.

Use ey-PA as input for a BLAST search at NCBI BLAST: choose protein-protein BLAST (blastp) and limit your search to human proteins.

 

The BLAST search immediately indicates that eyeless has two highly conserved motifs: the paired domain and the homeodomain.  Indeed, these regions of the protein products are where there is most sequence conservation when comparing the human Aniridia and Drosophila Eyeless proteins.  Both motifs confer DNA-binding activity (consistent with the proteins being transcription factors).  We will come back to these motifs at the end of this session.
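
To see how conservation can be concentrated in particular regions of a protein, you could compute percent identity in windows along a pairwise alignment. The Python sketch below assumes the two sequences have already been aligned to the same length, with "-" marking gaps. The function name window_identity and the toy strings are hypothetical; they are made-up sequences, not Eyeless or Aniridia.

```python
def window_identity(aligned_a, aligned_b, window=10):
    """Percent identity in consecutive, non-overlapping windows of an alignment."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    results = []
    for start in range(0, len(aligned_a), window):
        pairs = list(zip(aligned_a[start:start + window],
                         aligned_b[start:start + window]))
        # A column counts as identical only if both residues match
        # and the position is not a gap.
        matches = sum(1 for x, y in pairs if x == y and x != "-")
        results.append((start, 100.0 * matches / len(pairs)))
    return results


# Hypothetical toy alignment: conserved ends, divergent middle.
toy_a = "GHSGVNQLGG" + "TTPSAAMEQK" + "RLQRNRTSFT"
toy_b = "GHSGVNQLGG" + "A--DLLPNHS" + "RLQRNRTSFS"

for start, pct in window_identity(toy_a, toy_b):
    print(f"columns {start}-{start + 9}: {pct:.0f}% identical")
```

In a real comparison, windows with high identity would be expected to fall within the conserved motifs, and windows with low identity in the regions between them.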

 

Notice that the highest scoring matches from your BLAST search are:

--paired box gene 6 (aniridia, keratitis)

--Paired box protein Pax-6 (Oculorhombin) (Aniridia, type II protein)

 

At NCBI PubMed, look up the protein for Aniridia.  Initially, use the "gene" search option.

 

Notice that the coding region of paired box gene 6 has two isoforms: a and b.

 

Compare these two protein sequences using BLAST two sequences.

[Try the comparison with and without the filter -- how does this affect the output?]

Provide a likely explanation for the difference between the two protein isoforms.
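
As background for the filter option mentioned above: assuming the filter refers to low-complexity masking, it replaces stretches of biased composition (for example, long runs of a single amino acid) with X before the comparison is made. The Python sketch below illustrates the idea with a simple sliding-window Shannon entropy test. It is not the algorithm BLAST actually uses, and the function names, window size, and threshold are arbitrary assumptions chosen for illustration.

```python
import math
from collections import Counter


def shannon_entropy(segment):
    """Shannon entropy (in bits) of the residue composition of a segment."""
    total = len(segment)
    counts = Counter(segment)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def mask_low_complexity(protein, window=12, threshold=2.2):
    """Replace every window whose entropy falls below threshold with X."""
    masked = list(protein)
    for start in range(len(protein) - window + 1):
        if shannon_entropy(protein[start:start + window]) < threshold:
            masked[start:start + window] = "X" * window
    return "".join(masked)


# Hypothetical toy protein with a serine-rich run in the middle.
toy_protein = "MKTAYIAKQRQISFVK" + "SSSSSSSSSSSS" + "HFSRQLEERLGLIEVQ"
print(toy_protein)
print(mask_low_complexity(toy_protein))
```

Masked residues cannot contribute to an alignment, so turning the filter on or off can change which regions appear aligned and how the match is scored.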

 

If you use the "protein" search at NCBI PubMed and enter "Aniridia", you will see multiple individual entries  -- Aniridia proteins in several organisms.  Indeed, we should note that NCBI PubMed has a wealth of information available from a large number of different sources.  For example, the "PubMed" search links to publications in the scientific literature; the "books" search links to a set of online books in molecular biology and related areas.

 


Conserved motifs are important for function

Go back to the initial BLAST results page ("Formatting BLAST" page). We will now discuss how to follow the links for each conserved motif and view the conserved protein structures using Cn3D.

Click on the red PAX box or blue Homeodomain in the BLAST results page; these links give you listings of the conserved motifs: the paired box gnl|CDD|16534 and the homeobox gnl|CDD|16525.  The paired box structure is that of the Drosophila protein called Paired; the homeobox structure is that of yeast MAT-alpha2.  Both structures were determined by X-ray crystallography.

 

You can click on "Show structure" and view the structures in Cn3D [you may wish to save the structure file as a text file on your hard drive before importing it into Cn3D].  You can change the rendering style to "worms"; this shows the helix-turn-helix structure of the homeodomain.  The identity of the ninth residue of the third helix (known as the "recognition" helix) determines in part the DNA-binding specificity of the homeodomain.  In 1AKHB (the MAT-alpha2 homeodomain), this residue is a serine, the 11th amino acid from the end of the homeodomain.  You can view this residue by highlighting it in the sequence.

 

How might you design experimental tests of the statement above that the ninth residue of the recognition helix plays a crucial role in the functional specificity of homeodomains?  Consider what kinds of experiments might be helpful.  For example, would it be helpful to make altered versions of the homeodomain and test them by making transgenic lines?  Can you imagine the steps to do this?

 

[The image below is taken from Alberts et al. (the book is available online) and shows a homeodomain and a POU domain, another protein domain often associated with homeodomains.]

 

We started this session describing Human Aniridia and Drosophila eyeless as homologs, and we noted that most of the sequence conservation between the Aniridia and Eyeless proteins is in the paired domain and homeodomain regions.  These proteins are transcription factors (they regulate the transcription of other genes) and the specificity of their DNA-binding activities depends upon these domains.  Hence, it is not surprising that these are the most highly conserved regions of the proteins. 

 

Sequence conservation between different proteins can provide important insights into shared functions.  Hence, analysis of sequence conservation is an important part of our analysis of the sequences made available by the genome projects.  Sometimes, testing the functions of conserved protein motifs is easier in certain model systems than in others (e.g. in Drosophila than in humans).  But often, the insights gained in one system apply to the equivalent motifs in other systems.


Copyright 2005 Wesleyan University