Information Theoretic Analysis of Sequences:
Drosophila Splice Site Database

**BIOL 265/COMP 113
Computer Laboratory**

**M. Weir / D. Krizanc / M. Rice**

To analyze how biological machines such as the spliceosome recognize their target DNA or RNA sequences, it is useful to investigate large numbers of aligned target sites. Information-theoretic analysis of the target sequences can provide important insights.

The
Weir research group has constructed a relational database containing nucleotide
sequences in the vicinity of 11,161 donor and acceptor splice sites in 3,375
Drosophila cDNAs (Weir and Rice, 2004). All the donors, and separately, the acceptors,
are aligned, and the nucleotides at positions -32 to 32 (where the splice site
position is 0) are stored in the database.

Several stored procedures are available for analyzing the data set through the web interface at http://igs.wesleyan.edu/. To reach them, click "Databases & Tools" in the left-hand menu, choose the option "Use WesQL to run stored procedures on public IGS Splice Site and RNA Databases", and then click the stored procedures tab to see the list of stored procedures. (Outputs from previous runs are also provided below.)

1. Select the stored procedure Compute Splice Site Information and click the Continue button. You will see a number of pre-set parameters, including the cDNA Table (Wesleyan Known cDNA), Intron Table (Wesleyan Known Introns), Splice Site Table (Wesleyan Known Splice Sites), and Minimum Splice Element Length (20).

Change
the default parameters by setting the Type of Site to Donor Sites and the Start
and Finish positions to -4 and 8.

Click
the Execute Stored Procedure button.
Executing the procedure generates several HTML tables that contain the
following entries:

-SpliceElements - number of introns

-Sitetype - donor (D) or acceptor (A)

-Nuclposition - nucleotide position with respect to aligned donor or acceptor sites (with splice sites at position 0)

-Information - the information (in bits) at nucleotide position *j*, calculated using the formula

$$\mathit{Info}(s_j) = 2 - \left[ -f_A \log_2 f_A - f_C \log_2 f_C - f_G \log_2 f_G - f_T \log_2 f_T \right] - g$$

where the quantity in brackets is the uncertainty (entropy) at position *j* based on the frequencies of occurrence $f_A, \ldots, f_T$ of the nucleotides A, ..., T, and the correction factor $g$ depends on the number of splice sites that are being aligned.

-nA, nC, nG, nT - the number of occurrences of each nucleotide at position *j*

-pA, pC, pG, pT - the probability of each nucleotide at position *j*

[Note: The T in the cDNA
corresponds to the U in the RNA.]
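The Information entry can be reproduced by hand from the nA, nC, nG, nT counts. The following is a minimal sketch, not the stored procedure itself. The function name `position_information`, the flag `correct_small_sample`, and the example counts are all hypothetical. The handout does not give a formula for the correction factor g, so the sketch assumes a commonly used small-sample approximation, 3 / (2 ln(2) n) for n aligned sites; the database may use a different correction.

```python
import math

def position_information(counts, correct_small_sample=True):
    """Information (bits) at one aligned position.

    counts: dict mapping 'A', 'C', 'G', 'T' to the number of occurrences.
    """
    n = sum(counts.get(base, 0) for base in "ACGT")
    if n == 0:
        raise ValueError("no nucleotides counted at this position")

    # Uncertainty (entropy): H = -sum f * log2(f); a base with f = 0 contributes 0.
    entropy = 0.0
    for base in "ACGT":
        f = counts.get(base, 0) / n
        if f > 0:
            entropy -= f * math.log2(f)

    # ASSUMPTION: g approximated as (4 - 1) / (2 * ln(2) * n).
    # The handout only says that g depends on the number of aligned sites.
    g = 3.0 / (2.0 * math.log(2) * n) if correct_small_sample else 0.0
    return 2.0 - entropy - g

# Placeholder counts for illustration only (not database output).
example_counts = {"A": 10, "C": 5, "G": 980, "T": 5}
print(round(position_information(example_counts), 3))
```

A position where all four nucleotides are equally frequent gives about 0 bits, and a position with a single invariant nucleotide approaches the maximum of 2 bits.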

(a)
How many cDNAs were used? (3,090)

(b)
How many introns were analyzed? (10,057)

(c)
What are the consensus nucleotide values for positions 1 and 2 at the donor
sites (D+1, D+2)? (GT)

(d)
What are the percentages of occurrence of the predominant nucleotides at these
positions? (99.8%, 99.2%)

Store (copy and paste) the HTML tables in Microsoft Excel worksheets; we will use these data in part 3.

2. Repeat part 1 without any restriction on the lengths of the introns and exons (i.e. without restricting the Minimum Splice Element Length to 20).

(a)
What are the percentages of occurrence of the predominant nucleotides at positions D+1 and D+2?

(b)
What are some possible reasons why the GT consensus is not as well represented as in part 1? [Consider the algorithm used to compute the splice sites; see Weir and Rice (2004).]

(c)
For the set of cDNAs that contain either an intron or an exon with length less than 20, calculate the frequency of the canonical G at position D+1 and of the canonical T at position D+2 [Hint: compare the nucleotide counts from 1(d) and 2(a), using Excel to subtract corresponding counts].
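The subtraction in the hint for 2(c) can also be sketched outside Excel. In the sketch below, the function name `excluded_subset_frequencies` and all counts are hypothetical placeholders; substitute the nA, nC, nG, nT values from your own part 1 and part 2 tables for one position at a time.

```python
def excluded_subset_frequencies(counts_unrestricted, counts_restricted):
    """Nucleotide frequencies among the sites removed by the length restriction.

    counts_unrestricted: counts at one position from part 2 (no restriction).
    counts_restricted:   counts at the same position from part 1 (minimum 20).
    """
    diff = {b: counts_unrestricted[b] - counts_restricted[b] for b in "ACGT"}
    if any(v < 0 for v in diff.values()):
        raise ValueError("restricted counts exceed unrestricted counts")
    total = sum(diff.values())
    if total == 0:
        raise ValueError("the two data sets contain the same sites")
    return {b: diff[b] / total for b in "ACGT"}

# Placeholder counts for position D+1 (illustration only, not database output).
part2_counts = {"A": 30, "C": 20, "G": 1000, "T": 50}
part1_counts = {"A": 1, "C": 1, "G": 900, "T": 0}
freqs = excluded_subset_frequencies(part2_counts, part1_counts)
print({b: round(f, 3) for b, f in freqs.items()})
```

Run the same calculation once for D+1 and once for D+2, then read off the frequency of G and T respectively.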

3. Using the higher quality data set from
part 1, identify a consensus nucleotide sequence at the nucleotide positions
with greater than 0.5 bits of information.

The sequence [3' UCCAUUCA 5'] in the Drosophila U1 snRNA is thought to base-pair with the consensus sequence near the donor site.

Draw
a diagram of the predicted RNA/RNA base pairing.
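A text version of such a diagram can be generated with a short script. This is only a sketch under stated assumptions: the function name `best_pairing` is hypothetical, only Watson-Crick pairs are counted (G-U wobble pairs are ignored), and the target used in the example is the donor sequence from part 5(d) converted to RNA, chosen purely for illustration. Replace it with the consensus you identify from your own data.

```python
# Watson-Crick pairs only; G-U wobble pairs are not counted (assumption).
WATSON_CRICK = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}

def best_pairing(target_5to3, guide_3to5):
    """Slide the guide along the target; return (offset, pairs, diagram).

    The target is written 5'->3' and the guide 3'->5', so antiparallel
    pairing can be checked position by position.
    """
    k = len(guide_3to5)
    if len(target_5to3) < k:
        raise ValueError("target is shorter than the guide")
    best_offset, best_count = 0, -1
    for offset in range(len(target_5to3) - k + 1):
        window = target_5to3[offset:offset + k]
        count = sum((x, y) in WATSON_CRICK for x, y in zip(window, guide_3to5))
        if count > best_count:
            best_offset, best_count = offset, count
    window = target_5to3[best_offset:best_offset + k]
    bars = "".join("|" if (x, y) in WATSON_CRICK else " "
                   for x, y in zip(window, guide_3to5))
    pad = " " * best_offset
    diagram = "\n".join([
        "5' " + target_5to3 + " 3'",
        "   " + pad + bars,
        "3' " + pad + guide_3to5 + " 5'",
    ])
    return best_offset, best_count, diagram

donor_rna = "AAAGGTAAGTAT".replace("T", "U")  # donor sequence from part 5(d)
u1_3to5 = "UCCAUUCA"                          # U1 snRNA segment, written 3'->5'
offset, pairs, diagram = best_pairing(donor_rna, u1_3to5)
print(diagram)
```

Each `|` marks a predicted base pair; the offset tells you which donor positions the U1 segment covers.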

4. To study the effects of intron length on
information values, first set a range of small intron lengths (e.g. 64-80 using
the Minimum and Maximum Intron Length parameters; also reset minimum splice
element length to 20) and compute the information at both donor and acceptor
sites for nucleotide positions -10 to 10. Store the output tables in Excel.

Next,
compute the information for positions -10 to 10 using a Minimum Intron Length
of 8192. (Note - you will also need to set the Maximum Intron Length to 0.)

(a)
Compare the total information (summed over nucleotide positions -10 to 10) at the donor and acceptor sites for the sets of
longer and shorter introns. You will find it useful to also graph the information values at EACH nucleotide position in the site.
You can graph these values for the longer and shorter introns side-by-side in the same graph.

(b)
Compare the amounts of information at each nucleotide position for the sets of
longer and shorter introns.

(c)
For the positions with large information differences, compare the nucleotide
content. How does your result
relate to the U1 snRNA binding sequence discussed in
part 3?

(d)
Do you notice any other differences between the longer and shorter intron data
sets? For example, compare the A
content in each data set. What
conclusions can you draw?

See
Weir and Rice (2004) for a more complete analysis of the two datasets.
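The comparisons in 4(a) and 4(b) amount to summing and differencing two columns of information values. A minimal sketch follows; the function name `compare_information` is hypothetical and the numbers are placeholders, not database output. Paste in the Information values you stored for the short and long intron sets.

```python
def compare_information(info_short, info_long):
    """Compare per-position information (bits) for two intron sets.

    Both arguments map nucleotide position -> information value.
    Returns (total_short, total_long, differences, positions_ranked),
    where differences are long minus short and positions are ranked by
    the absolute size of the difference, largest first.
    """
    positions = sorted(set(info_short) & set(info_long))
    total_short = sum(info_short[p] for p in positions)
    total_long = sum(info_long[p] for p in positions)
    diffs = {p: info_long[p] - info_short[p] for p in positions}
    ranked = sorted(positions, key=lambda p: abs(diffs[p]), reverse=True)
    return total_short, total_long, diffs, ranked

# Placeholder values for three positions (illustration only).
short_introns = {-1: 0.5, 1: 2.0, 2: 1.75}
long_introns = {-1: 0.75, 1: 2.0, 2: 1.0}
total_s, total_l, diffs, ranked = compare_information(short_introns, long_introns)
print(total_s, total_l, ranked)
```

The ranked list points to the positions whose nucleotide content is worth inspecting in 4(c).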

5.
In addition to using information to measure the degree of conservation in aligned sequences, we can also measure the *individual information* of each sequence to see how well it conforms to the conserved pattern.

Suppose $t_1, t_2, \ldots, t_n$ is a set of DNA or RNA sequences, each of length *m*. For each base *a* and each position *j* between 1 and *m*, define the *weight*

$$w(a, j) = 2 + \log_2(f_{aj}) = 2 - (-\log_2(f_{aj}))$$

where $f_{aj}$ denotes the frequency of base *a* at position *j*.

The *individual information* score for a sequence $t_i$ is

$$\mathrm{infoscore}(t_i) = \sum_{j=1}^{m} w(t_{ij}, j)$$

In other words, the score of a sequence $t_i$ is the sum of the bits contributed by the symbols $t_{ij}$ found at each of the positions *j* in $t_i$.
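The weight and score definitions above can be sketched in a few lines. This is an illustration under stated assumptions, not the database's stored procedure: the function names `weight_matrix` and `infoscore` are hypothetical, no small-sample correction is applied, a base never seen at a position is given a weight of negative infinity, and the four two-letter sequences are a toy alignment.

```python
import math
from collections import Counter

def weight_matrix(sequences):
    """Return one dict per position mapping base -> w(a, j) = 2 + log2(f_aj)."""
    m = len(sequences[0])
    if any(len(s) != m for s in sequences):
        raise ValueError("all sequences must have the same length")
    n = len(sequences)
    weights = []
    for j in range(m):
        counts = Counter(s[j] for s in sequences)
        weights.append({a: 2.0 + math.log2(c / n) for a, c in counts.items()})
    return weights

def infoscore(seq, weights):
    """Sum the weights of the bases found at each position of seq."""
    if len(seq) != len(weights):
        raise ValueError("sequence length does not match the weight matrix")
    # ASSUMPTION: a base absent from the reference set (f = 0) scores -infinity.
    return sum(w_j.get(a, float("-inf")) for a, w_j in zip(seq, weights))

# Toy reference alignment (illustration only).
reference = ["GT", "GT", "GT", "GC"]
weights = weight_matrix(reference)
for s in sorted(set(reference)):
    print(s, round(infoscore(s, weights), 3))
```

When the scored sequences are the same ones used to build the weights, the mean individual score equals the total information of the site without the correction factor $g$, which links this measure to the per-position information in part 1.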

(a)
You can compute individual information scores using the stored procedure Compute Splice Site Information Matrix in the database *dbDrosophilaSplice2*. [You need to switch to the new database on the starting WesQL page.]

Compute
the individual information scores for the set of Drosophila introns of length
16000-32000 using the default parameters and the following settings:

-Start position = -4

-Finish position = 8

-Site type = Donor

-Number of reference sequences = 250

-Reference set: Minimum intron length = 16000

-Reference set: Maximum intron length = 32000

-Test set: Minimum intron length = 16000

-Test set: Maximum intron length = 32000

Store
your output results in Excel. You
can use the histogram output to look at the distribution of individual
information scores.

(b)
Compute the distribution of scores for introns of length 64 by using the same
Reference set as in part (a) and setting Test set Minimum and Maximum intron
length = 64. How does this
distribution compare with the one in part (a)? Why is there some spread in both of
these distributions?

(c)
The calculation of the individual information scores ($\mathrm{infoscore}(t_i)$) uses a matrix of the weights defined above. To see this weight matrix, follow the link

IGS Database Tools > Submit sequence alignments for Information Theoretic Analysis

Paste into the sequence window
the 160 donor sequences from your output of reference sequences in part (a)
(which you stored in Excel). Adjust
the format setting to flat and click the Submit Alignment button.

You
can display the weight matrix for the reference sequences by selecting the
following options in the View Information Profile & Weight Matrix area:

- Show residue counts
- Show information weights
- Show residue frequencies

and clicking on the View Information Profile button.

(d) Working in Excel, use the weight matrix to calculate the individual information of the sequence AAAGGTAAGTAT (sum the "infoWeight" values corresponding to the nucleotide found at each position). In this sequence, the nucleotide at each position is the one with the highest frequency, so it has the highest possible individual information score. [Compare this value with the scores in the distribution in part (a); notice that this "perfect" sequence is rarely if ever seen.]
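The same sum can be checked programmatically once the weight matrix has been copied out of the web page. The sketch below is hypothetical throughout: the function name `best_sequence` and the two-position weight matrix are placeholders, and ties between equally weighted bases are broken arbitrarily by taking the first one encountered.

```python
def best_sequence(weights):
    """Return the highest-scoring sequence and its individual information.

    weights: list with one dict per position mapping base -> infoWeight.
    """
    # At each position, pick the base with the largest weight.
    best = [max(w_j, key=w_j.get) for w_j in weights]
    score = sum(w_j[a] for a, w_j in zip(best, weights))
    return "".join(best), score

# Placeholder weights for two positions (illustration only).
toy_weights = [
    {"A": -1.0, "C": -2.0, "G": 1.5, "T": -2.0},
    {"A": -3.0, "C": 0.0, "G": -3.0, "T": 1.75},
]
sequence, score = best_sequence(toy_weights)
print(sequence, score)
```

Because the weight is an increasing function of frequency, choosing the largest weight at each position is the same as choosing the most frequent nucleotide.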

A.
Answer the questions in parts 3, 4(a), and 5(d).

B. State at least two general conclusions that you can draw from our analysis of splice sites. What possible molecular-based hypotheses are suggested by the analysis?

Stephens, R.M. and Schneider, T.D. 1992. Features of spliceosome evolution and function inferred from an analysis of the information at human splice sites. J Mol Biol 228:1124-36.

Weir, M.P. and Rice, M.D. 2004. Ordered Partitioning Reveals Extended Splice Site Consensus Information. Genome Research 14:67-78.

Weir, M., Eaton, M. and Rice, M. 2006. Challenging the spliceosome machine. Genome Biology 7:R3.

Copyright
Wesleyan University 2019