EST Sequence Assembly

BIOL 265/COMP 113 Computer Laboratory

M. Weir / M. Rice / D. Krizanc

Assembly of long DNA or RNA sequences from overlapping shorter sequences is a computational challenge for biologists. For example, an important component of the genome projects is to assemble mRNA sequence information for all genes. One approach towards this goal is to systematically sequence large numbers of cDNAs from cDNA libraries (cDNAs are DNA copies of mRNAs).

High quality sequence runs are typically about 500 bp, whereas mRNAs are typically longer.

Therefore it is necessary to perform computational analysis of cDNA sequences to identify overlaps and thereby predict larger sequence fragments of mRNAs. This analysis can be complicated by issues including the possibility of alternative splicing and sequence duplications.

We have assembled four Drosophila sequence segments (S1, S2, S3 and S4) with overlapping sequence, an example of the kind of data that might emerge from this type of analysis.









  1. Use the "BLAST 2 sequences" server to determine the regions of overlap between these four sequences (use FASTA format, entering >sequencename on the line above each sequence).  Record (e.g. with screenshots) the BLAST alignments of the overlap regions. (Notice that each alignment is spread over several lines because the sequences are quite long.)
  2. Draw a line diagram that illustrates the regions of overlap between the four sequences (S1, S2, S3, S4) showing the overlap coordinates for both sequences of each overlap.  This is analogous to constructing a contig from the four sequences.
  3. Based on your alignments in questions 1 and 2, define a set of string slices of S1, S2, S3 and S4 which when concatenated together (in the correct order) create a contig representing the composite cDNA segment (remember Python string indices start at 0 whereas BLAST outputs start counting at 1). Be sure that the overlapping sequences are only included once. Using Python, run your concatenation to create the contig sequence.
  4. Notice that the overlap between two of the sequences contains some mismatches: what are three possible explanations for this? Think about how the sequence information is obtained as well as biological processes that might be involved. (Might the overlap with mismatches be misleading us?)
  5. To resolve this issue, and assess whether the composite cDNA represents a real mRNA, it is useful to compare the composite cDNA with Drosophila genomic sequence. Go to the Drosophila FlyBase BLAST server. Use your composite cDNA as a query against the whole euchromatic genome sequence (i.e. choose the Genome Section "Genome Assembly (NT)"). Use the program Blastn nt->NT.
  6. Does your output allow you to distinguish between the possible explanations for the mismatches (step 4 above)? Discuss the orientations of your cDNA fragments. (Assume that all the cDNA sequences correspond to the mRNA single strand sequences, not the antisense sequences.) Develop a model that explains all the results of your BLAST search.
  7. To test your model, perform a BLAST search with the composite cDNA as the query against the dataset of predicted Drosophila genes (using Database "Annotated Genes (NT)" on the FlyBase BLAST server). Does the BLAST search confirm your model?
  8. Use the BLAST result to link to matching predicted gene(s).
  9. View the genes using the "Map(GBrowse)" link (on the left-hand side of the gene report). This will help you assess your model. You may find it useful to zoom the map out (e.g. change to "show 100 kbp") in order to see neighboring genes. Notice that the Gene Region Map can show several maps based on your choices in the "select tracks" tab:
    • DNA sequence map (we are looking at 7M on chromosome 2R)
    • cytologic map showing chromosome band names ("cytologic band"; we are in the region of band 47F17)
    • mutation map ("point_mutation")
    • gene model map ("Gene span"; notice the genes en and inv)
    • predicted gene map (e.g. "Genescan prediction")
    • mRNA map ("mRNA")
    • protein coding sequence map ("CDS")
    • DNA maps referring to DNA clones used in the sequence assembly ("Tiling BAC")
    • sequenced cDNA clones ("cDNA and other aligned sequences")
    • microarray probes ("Affymetrix v1 or v2")
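
For step 3, the conversion from BLAST coordinates to Python slices can be sketched as follows. This is only an illustration under stated assumptions: the helper name blast_slice and the toy sequences toy_a and toy_b are hypothetical and are not the lab's S1-S4, and the overlap coordinates are made up for the example. BLAST positions are 1-based and include both ends, whereas a Python slice seq[i:j] is 0-based and excludes position j, so BLAST positions start..end correspond to seq[start - 1:end].

```python
def blast_slice(seq, start, end):
    """Return the bases at BLAST positions start..end (1-based, inclusive)."""
    if not 1 <= start <= end <= len(seq):
        raise ValueError("coordinates out of range for this sequence")
    # Shift only the start: the slice end is exclusive, so `end` already
    # points one past the last base we want in 0-based terms.
    return seq[start - 1:end]


# Toy sequences for illustration only (hypothetical, not S1-S4).
toy_a = "ACGTACGTTGCA"
toy_b = "TGCAGGCTTAAC"

# Assume BLAST reported: toy_a positions 9-12 align with toy_b positions 1-4.
# Keep all of toy_a, then append toy_b from the first base after the overlap,
# so the overlapping bases are included only once.
contig = blast_slice(toy_a, 1, len(toy_a)) + blast_slice(toy_b, 5, len(toy_b))
print(contig)
```

The same pattern extends to four sequences: one slice per sequence, each starting just past the end of the previous overlap.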

This analysis illustrates the kinds of issues that arise during the assembly of both genomic and mRNA sequences. It is wise to confirm interpretations using independent data; in this case, by comparing cDNA and genomic sequences.


Answer questions 3, 4, 6 and 7 above.

Additional challenges (not part of assignment):

  1. Using artificial sequence constructs (using S1, S2, S3, S4), determine a way to deduce ALL overlaps between these four sequences using a SINGLE call of the "BLAST 2 sequences" server. Provide your input and output.
  2. Using Python, consider how you would design a prefix-suffix overlap detection function that takes two strings as input and, if the two strings overlap, outputs the inferred combined sequence. You may consider an exact-match version or an imprecise-match version (without gaps).

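As a starting point for challenge 2, here is one possible sketch of a suffix-prefix overlap function. It is one approach among several rather than a model answer. The function name merge_overlap and the parameters min_overlap and max_mismatches are assumptions introduced for illustration. The sketch tries the longest candidate overlap first, allows mismatches but no gaps, and keeps the bases of the left sequence in the overlap region when the two disagree.

```python
def merge_overlap(left, right, min_overlap=3, max_mismatches=0):
    """Merge two strings if a suffix of `left` matches a prefix of `right`.

    Returns the combined sequence, or None if no overlap of at least
    `min_overlap` bases has `max_mismatches` or fewer mismatches.
    """
    if min_overlap < 1:
        raise ValueError("min_overlap must be at least 1")
    longest = min(len(left), len(right))
    # Try the longest candidate overlap first, then shorter ones.
    for k in range(longest, min_overlap - 1, -1):
        suffix, prefix = left[-k:], right[:k]
        mismatches = sum(a != b for a, b in zip(suffix, prefix))
        if mismatches <= max_mismatches:
            # Overlap bases come from `left`; append the rest of `right`.
            return left + right[k:]
    return None


# Toy inputs for illustration only (hypothetical, not S1-S4).
print(merge_overlap("ACGTACGTTGCA", "TGCAGGCTTAAC"))
print(merge_overlap("ACGTACGTTGCA", "TGGAGGCTTAAC", max_mismatches=1))
print(merge_overlap("AAAA", "CCCC"))
```

Very short overlaps occur by chance, so min_overlap would need to be much larger for real cDNA data. Note also that allowing mismatches can make the function accept a longer, spurious overlap in preference to a shorter, exact one.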
Copyright 2018 Wesleyan University