Signal Search Analysis Tutorial
1. Introduction:
Signal search analysis is a long-established method (published
in 1984) for analysing sequence motifs that occur
at characteristic distances upstream or downstream
from a functional site in a nucleic acid sequence.
Note that this problem is different from the
standard motif search problem addressed by many
other algorithms. There, one wants to find a motif
present in all or a statistically significant
proportion of input sequences. The location of
the motif within the sequences is irrelevant.
In signal search analysis, the input is a list
of experimentally defined functional sites, for
instance transcription initiation sites, given
as pointers to positions in nucleotide database
sequences (the sequences themselves are stored elsewhere
on the computer). The user specifies on the fly
the sequence range around the site she/he wants
to consider.
In signal search analysis, not only the structure,
but also the location relative to the functional
site as well as the distance flexibility are of
interest.
Let's have a look at a few examples.
A)
A sizeable fraction of
eukaryotic promoters contain
a so-called TATA-box upstream of the initiation site.
Let's suppose that we already know the approximate structure
of this element and that we are primarily interested in its
location relative to the initiation sites.
To answer this question, go to the
OProf
page (OProf stands for occurrence profile) and follow the
instructions below:
- On the left side under the header Sequence input make sure that
the checkbox "Transcription initiation sites" is activated.
- Select "EPD-107 non-redundant promoter set".
- On the right side under the header Signal description make sure that
the checkbox "Consensus sequence" is activated.
- Erase the contents of the large text window.
- Enter the sequence "TATAAA".
- Under "Cut-off value" activate the checkbox "mismatches" and fill in the value "1".
You are now ready to
submit the job.
As you can see, there are a number of parameters you can
play with in order to make a graphically appealing
signal occurrence profile.
- Change the window size and/or the number of mismatches
allowed.
- Look at a larger range, for instance -499 to 500.
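The idea behind a signal occurrence profile can be sketched in a few lines of Python. This is only an illustration, not the code run by the OProf server: the function names, the `shift` parameter, the convention that relative position p maps to string index `site_index + p`, and the choice to report the fraction of sequences with at least one match per window are all assumptions made for this sketch.

```python
def count_mismatches(a, b):
    """Number of positions at which two equal-length words differ."""
    return sum(x != y for x, y in zip(a, b))


def occurrence_profile(seqs, site_index, motif, max_mismatches,
                       first, last, window, shift=1):
    """Return a list of (window centre, fraction of sequences with a match).

    seqs        -- sequences aligned on the functional site
    site_index  -- 0-based index of the functional site in every sequence
                   (assumed convention: relative position p -> site_index + p)
    first, last -- relative range to scan
    window      -- number of motif start positions per window
    shift       -- hypothetical step between successive windows
    """
    k = len(motif)
    profile = []
    for w in range(first, last - window + 2, shift):
        hits = 0
        for s in seqs:
            for p in range(w, w + window):
                i = site_index + p
                if i < 0 or i + k > len(s):
                    continue  # motif would run off the sequence
                if count_mismatches(s[i:i + k], motif) <= max_mismatches:
                    hits += 1
                    break  # count each sequence at most once per window
        profile.append((w + (window - 1) / 2, hits / len(seqs)))
    return profile


# Toy data: two of three sequences carry a TATA-like word starting at -6.
toy_seqs = [
    "GGGGTATAAAGGGGGGGGGG",
    "GGGGTATATAGGGGGGGGGG",
    "GGGGGGGGGGGGGGGGGGGG",
]
profile = occurrence_profile(toy_seqs, 10, "TATAAA", 1, -10, 3, 4, shift=4)
for centre, fraction in profile:
    print(centre, round(fraction, 3))
```

With a small window the profile is noisy and with a large one the peak is smeared out, which is why the window size is worth varying.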
The CCAAT-box is another signal reported to occur
frequently in the upstream region of eukaryotic
promoters. Unlike the TATA-box, it appears to
function in either orientation. To confirm these
claims:
- Produce a signal occurrence profile for the
pentamer sequence CCAAT. Optimize the window
size by trial and error.
- Analyse also the positional distribution
of the complementary pentamer ATTGG
- Use the Switch Search model/bidirectional to combine
the two profiles
- Analyse the positional distributions of the CCAAT-box
like signals in the non-redundant insect and plant
promoter subsets.
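A bidirectional search amounts to looking for the motif and for its reverse complement on the same strand. The sketch below illustrates this for exact matches. The function names are hypothetical, and it is an assumption that the server's bidirectional mode combines the two orientations in this way.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of a DNA word (A<->T, C<->G, read backwards)."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_sites(seq, motif, bidirectional=False):
    """Start indices of exact motif matches on the given strand.

    With bidirectional=True, matches of the reverse complement are
    reported as well (the motif sits on the opposite strand there).
    """
    motifs = {motif}
    if bidirectional:
        motifs.add(reverse_complement(motif))
    k = len(motif)
    return [i for i in range(len(seq) - k + 1) if seq[i:i + k] in motifs]


toy = "GGCCAATGGATTGGCC"
print(reverse_complement("CCAAT"))                   # ATTGG
print(find_sites(toy, "CCAAT"))                      # [2]
print(find_sites(toy, "CCAAT", bidirectional=True))  # [2, 9]
```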
B)
Let's now look at
bacterial translation initiation sites.
Prokaryotic messenger RNAs contain a so-called Shine-Dalgarno
interaction region upstream of the initiation codon, containing
a sequence motif that is complementary to the highly conserved
3' terminus of the 16S ribosomal RNA
(see for instance
A14565).
To analyse this motif,
choose an oligonucleotide near the 3' end of the E. coli 16S ribosomal
RNA,
e.g.
CCUCCU, and produce a signal occurrence profile
for its complement (AGGAGG in the proposed example) for
translation start site regions of several bacterial species.
Note that you can select several species at once (up to four)
in order to combine several signal occurrence profiles in
one graph. To start this analysis, we propose to analyse
and compare the Shine-Dalgarno interaction regions of the
extensively studied species E. coli and
B. subtilis.
Bacteria do not only use ATG as translation initiation codon,
but at lower frequencies also GTG, CTG, and TTG. Determine the
frequencies at which these codons are used in various prokaryotic
species using the OProf service.
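Both exercises reduce to simple string operations, sketched below. The first function derives the DNA word to search for from an rRNA oligonucleotide. The second tabulates the codons found at the annotated start position. The function names and the toy sequences are invented for illustration; the OProf service obtains the same kind of numbers from its own FPS files.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def dna_complement_of_rna(oligo):
    """DNA word that can base-pair with an RNA oligonucleotide.

    The RNA is rewritten in the DNA alphabet (U -> T) and then
    reverse-complemented, because the two strands are antiparallel.
    """
    return oligo.upper().replace("U", "T").translate(_COMPLEMENT)[::-1]


def start_codon_frequencies(seqs, site_index):
    """Relative frequency of each codon beginning at the 0-based site_index."""
    codons = Counter(s[site_index:site_index + 3]
                     for s in seqs if len(s) >= site_index + 3)
    total = sum(codons.values())
    return {codon: n / total for codon, n in codons.items()}


print(dna_complement_of_rna("CCUCCU"))  # AGGAGG

# Toy start-site regions; the start codon begins at index 2 in each.
toy = ["AAATGC", "CCGTGA", "TTATGG", "GGATGT"]
print(start_codon_frequencies(toy, 2))
```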
2. FPS-dependent sequence retrieval
As mentioned before, SSA programs typically do not
use sequences as input but lists of computer-readable
pointers to sequence positions in a database.
Such a list of pointers is called a functional position set,
or FPS. Each pointer contains a sequence id, a position, and
two flags, one indicating the strand (+ or -), the
other one the topology (1=linear, 0=circular). The Eukaryotic
Promoter Database is an example of a functional position set. To further
illustrate this concept, let's now go for a short moment
to the EPD pages:
EPD
Display an individual promoter entry in text format, for instance
HS_MYC_1 doc
The computer-readable pointer to the sequence position is contained
in the line starting with the line code
FP.
Try to identify
the four crucial elements: sequence id, topology, orientation, and
position.
The FPS files used by the Signal Search Analysis server can also be viewed,
for instance:
epd_nr.fps
Bacillus_subtilis.fps
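To make the four elements of a pointer concrete, here is a minimal sketch of how such a pointer could be represented and used to cut out a sequence range. The class and function names are hypothetical, the layout of the real FP line is deliberately not reproduced, and treating the site itself as relative position 0 is an assumption of this sketch.

```python
from dataclasses import dataclass

_COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}


@dataclass(frozen=True)
class FunctionalPosition:
    seq_id: str    # identifier of the database sequence
    position: int  # 1-based position of the functional site
    strand: str    # "+" or "-"
    linear: bool   # topology flag: True for 1 (linear), False for 0 (circular)


def extract_region(fp, sequence, first, last):
    """Bases from relative position first to last, read on the site's strand.

    Positions beyond the ends of a linear sequence are returned as N;
    on a circular sequence they wrap around.
    """
    n = len(sequence)
    out = []
    for rel in range(first, last + 1):
        # On the minus strand, downstream means decreasing coordinates.
        pos = fp.position + rel if fp.strand == "+" else fp.position - rel
        idx = pos - 1
        if not fp.linear:
            idx %= n
        base = sequence[idx] if 0 <= idx < n else "N"
        out.append(base if fp.strand == "+" else _COMPLEMENT.get(base, "N"))
    return "".join(out)


seq = "ACGTTGCA"
print(extract_region(FunctionalPosition("toy", 4, "+", True), seq, -2, 2))
print(extract_region(FunctionalPosition("toy", 4, "-", True), seq, -2, 2))
print(extract_region(FunctionalPosition("toy", 1, "+", False), seq, -2, 1))
```

Because only pointers are stored, the same FPS can be reused with any sequence range the user chooses later.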
Now, from the EPD home page, on the left menu under "Access EPDnew", follow the link:
Select/Download. This
page allows you to extract promoter sequence segments around transcription initiation sites.
You can use this tool to select all promoters (leaving all 'Optional criterias' blank) or to select
them based on all or some of their genomic contexts (such as the presence of core promoter elements) or expression levels.
After selection, you can download them in various formats or use them to perform further analyses, such as motif enrichment/search with the signal search analysis server, or chromatin status analysis.
You can try to reproduce one of the signal occurrence profiles you made before by uploading
the promoter sequence file containing the non-redundant subset of all
promoter sequences. Note that you have to indicate the relative internal position
of the functional site on the OProf form (500 in this case). You can also specify a name for
the sequence set (e.g. epd_nr) and a description of the site type
(e.g. "Transcription start site") on the form. The contents of these fields
will appear in the graphical output produced by the signal search analysis server.
3. Constraint profiles.
In the examples studied so far, we already had some idea of what the
signal we were interested in looks like. But how should we proceed
if we know absolutely nothing in the beginning? The program
CPR (for constraint profile) can be
of some help in such situations. A constraint profile is a plot
of sequence non-randomness as a function of the location relative
to a functional site. For instance, eukaryotic promoter sequences
show high non-randomness about 30 bp upstream of the transcription
start site because of the frequent presence of a TATA-box motif
in this region.
Input to a constraint analysis is a functional position set (FPS) and a
so-called "signal sequence collection". The latter may consist of
a complete set of oligonucleotides of particular length. Like in
OProf, the sequences extracted with the FPS are scanned with a sliding
window. The frequencies of the elements of the signal sequence collection
are determined for each window. This gives rise to a two-dimensional
array of numbers called "signal search data". In windows with high
sequence constraints, a few oligonucleotides may occur at very high
frequencies while most others occur at frequencies slightly below
expectation. This would lead to a relatively high variance of
"signal frequencies" (original jargon). The constraint index
displayed in a constraint profile is in fact based on the
variance of the signal frequencies.
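The relation between sequence constraint and the variance of signal frequencies can be illustrated with a short sketch. It uses the plain population variance of the relative oligonucleotide frequencies in each window. The exact constraint index computed by CPR is not given in this tutorial, so this formula, the function names, and the non-overlapping windows are assumptions.

```python
from collections import Counter
from itertools import product
from statistics import pvariance


def signal_search_data(seqs, site_index, k, first, last, window):
    """Count every oligonucleotide of length k in consecutive windows.

    Returns (oligos, table), where table is a list of (window start, counts)
    and counts follows the order of oligos. Relative position p is assumed
    to map to string index site_index + p.
    """
    oligos = ["".join(p) for p in product("ACGT", repeat=k)]
    table = []
    for w in range(first, last - window + 2, window):
        counts = Counter()
        for s in seqs:
            for p in range(w, w + window):
                i = site_index + p
                if i >= 0 and i + k <= len(s):
                    counts[s[i:i + k]] += 1
        table.append((w, [counts[o] for o in oligos]))
    return oligos, table


def constraint_profile(table):
    """Variance of the relative signal frequencies for each window."""
    profile = []
    for w, counts in table:
        total = sum(counts)
        freqs = [c / total for c in counts] if total else [0.0] * len(counts)
        profile.append((w, pvariance(freqs)))
    return profile


# Toy set: the first half of each sequence varies, the second half is fixed.
toy = ["ACGTTATA", "CATGTATA", "GTCATATA", "TGACTATA"]
oligos, table = signal_search_data(toy, 0, 2, 0, 7, 4)
for start, index in constraint_profile(table):
    print(start, round(index, 4))
```

In the toy set, the window covering the conserved TATA stretch gets the higher variance, because two dinucleotides account for all counts there.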
Let's look at an example:
- On the left side under the header "Sequence input" activate
the checkbox "Upload the sequence (in Fasta Format)".
- Select the file containing the previously downloaded non-redundant
promoter subset (all promoters), indicate that the functional site is
at the internal position 500 and supply appropriate names for the sequence
set and the functional site type.
- Instead of using the above mentioned "previously downloaded non-redundant promoter subset" the sequence files can be found at:
These files can also be used for the next paragraph.
- On the right side under the header "Signal collection" make sure that
the checkbox "complete" is activated, leave other parameters unchanged.
Instead of a complete signal search collection, one can also use a random
subset of oligonucleotides of a particular length, for instance 200 hexamers.
This allows one to use longer signals without exponentially increasing the
computing time. Special collections allow usage of so-called "gapped oligonucleotides".
A gapped oligonucleotide is a motif consisting of real bases and unspecific positions
represented by the wild-card character N. For instance ANA is a gapped dinucleotide.
A certain type of gapped oligonucleotide is specified by a string consisting of
the letters X and N, where X stands for a real base and is automatically expanded
to all four bases of the DNA alphabet. For instance XNX is expanded to:
XNX -> ANA,ANC,ANG,ANT,CNA,CNC,CNG,CNT,GNA,GNC,GNG,GNT,TNA,TNC,TNG,TNT
Different types can be combined in one collection but they all have to
be of the same length.
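The expansion rule is easy to reproduce. The sketch below is an illustration with hypothetical function names, not the server's own code. It expands a type string into its gapped oligonucleotides and shows how one of them can be matched against a sequence word.

```python
from itertools import product


def expand_pattern(pattern):
    """Expand a type string of X and N into all gapped oligonucleotides.

    Every X is replaced by each of the four bases in turn; N stays as
    the wild-card character.
    """
    if set(pattern) - {"X", "N"}:
        raise ValueError("pattern may contain only X and N")
    expanded = []
    for bases in product("ACGT", repeat=pattern.count("X")):
        it = iter(bases)
        expanded.append("".join(next(it) if ch == "X" else "N"
                                for ch in pattern))
    return expanded


def matches(gapped, word):
    """True if word fits the gapped oligonucleotide (N matches any base)."""
    return len(gapped) == len(word) and all(
        g == "N" or g == b for g, b in zip(gapped, word))


print(",".join(expand_pattern("XNX")))
print(matches("ANA", "AGA"), matches("ANA", "AGC"))  # True False
```

A collection built from k X's always has 4^k members regardless of the number of N's, which is what keeps longer gapped signals affordable.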
Further suggestions:
- Generate constraint profiles for the non-redundant eukaryotic promoter set with different
parameters.
- Analyse the vertebrate, arthropod, and plant subsets as well.
- Apply constraint analysis to bacterial translation start site regions.
- To specifically illustrate the use of gapped dinucleotides, generate constraint
profiles for B. subtilis translation start sites with gapped oligonucleotides of length
3 and 5 (XNX, XNNNX).
4. Using signal lists to analyse the contents of constraint regions
The program
SList (for Signal List) is used
to analyse the contents of a constraint region. The input and data processing steps are
largely the same as for the constraint analysis. Both programs generate so-called
signal search data (lists of oligonucleotide frequencies determined in a sliding
window). What is different is the output. SList produces a list of locally over- or
under-represented "signals" (oligonucleotide motifs). Over- and under-representation can
be assessed in two different ways. "Calculation mode" 1 uses the mean of all signal
frequencies in the corresponding window as the reference, mode 2 uses the mean of the frequencies
of the corresponding signal in all windows as the reference. The selection mode refers
to local and global maxima along a particular signal occurrence profile.
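The difference between the two calculation modes is only the choice of reference value. The sketch below expresses this on a small table of signal search data (rows are windows, columns are signals). Reporting the result as an observed/reference ratio is an assumption of this illustration, as are the function name and the toy numbers; the tutorial does not say how SList scores the deviation.

```python
def enrichment(table, mode):
    """Observed/reference ratios for a table of signal frequencies.

    table -- list of rows, one per window, each with one value per signal
    mode 1 -- reference is the mean of all signals in the same window
    mode 2 -- reference is the mean of the same signal over all windows
    """
    n_windows = len(table)
    n_signals = len(table[0])
    if mode == 1:
        refs = [[sum(row) / n_signals] * n_signals for row in table]
    elif mode == 2:
        signal_means = [sum(row[j] for row in table) / n_windows
                        for j in range(n_signals)]
        refs = [signal_means for _ in table]
    else:
        raise ValueError("mode must be 1 or 2")
    return [[obs / ref if ref else float("nan")
             for obs, ref in zip(row, ref_row)]
            for row, ref_row in zip(table, refs)]


# Toy data: two windows, three signals.
toy = [[6.0, 2.0, 1.0],
       [2.0, 2.0, 2.0]]
print(enrichment(toy, 1))
print(enrichment(toy, 2))
```

Mode 1 highlights signals that dominate a window, whereas mode 2 highlights windows in which a given signal is more frequent than elsewhere along the sequence range.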
Use SList to further investigate the signals corresponding to the constraint
regions found in eukaryotic promoters and bacterial translation start
regions.
5. Optimizing a weight matrix for a locally over-represented sequence motif
Consensus sequences are not always appropriate descriptors of
regulatory sequence motifs. In particular, they cannot
distinguish between easily tolerated and severe mismatches. Note that
a weight matrix can be viewed as a generalization of a consensus
sequence. For instance, the motif TATAAA (1 mismatch) can be represented
by the following weight matrix:
0 0 0 1
1 0 0 0
0 0 0 1
1 0 0 0
1 0 0 0
1 0 0 0
Cut-off value: 5
Convince yourself of this equivalence by generating the same
signal occurrence profile with this motif for eukaryotic promoters,
once with a consensus sequence and once with a weight matrix.
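The equivalence can also be checked exhaustively offline. The sketch below scores every possible hexamer with the matrix and compares the result with the consensus rule. It assumes that the matrix columns are in the order A, C, G, T, which is the order consistent with the consensus TATAAA; the function names are hypothetical.

```python
from itertools import product

BASES = "ACGT"  # assumed column order of the weight matrix

TATA_MATRIX = [
    [0, 0, 0, 1],  # T
    [1, 0, 0, 0],  # A
    [0, 0, 0, 1],  # T
    [1, 0, 0, 0],  # A
    [1, 0, 0, 0],  # A
    [1, 0, 0, 0],  # A
]
CUT_OFF = 5


def matrix_score(matrix, word):
    """Sum of the weights selected by the bases of the word."""
    return sum(row[BASES.index(base)] for row, base in zip(matrix, word))


def consensus_match(consensus, word, max_mismatches):
    """True if word differs from the consensus in at most max_mismatches."""
    return sum(c != b for c, b in zip(consensus, word)) <= max_mismatches


# Compare the two descriptions on all 4**6 = 4096 hexamers.
disagreements = [
    "".join(w) for w in product(BASES, repeat=6)
    if (matrix_score(TATA_MATRIX, "".join(w)) >= CUT_OFF)
    != consensus_match("TATAAA", "".join(w), 1)
]
print(len(disagreements))  # 0
```

With 0/1 weights the score is simply the number of matching positions, so a cut-off of 5 out of 6 is the same as allowing one mismatch. Non-integer weights are what let a matrix treat some mismatches as milder than others.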
It is not a trivial task to find an optimal weight matrix
description for a motif like the TATA-box.
The program
PatOp (for pattern optimization)
implements an iterative procedure which successively optimizes
the weight matrix, the cut-off value, and the borders of the
preferred region of occurrence, keeping two of these three
components constant at a time. PatOp has the capability of extending
the matrix to the left and right side if additional consensus is
observed, or to drop positions in the opposite case.
Use this program to produce a weight matrix description of the TATA-box
motif for the
non-redundant insect and plant promoter sets (they are relatively
small and thus do not take too much time). Use default parameters
for this purpose (a detailed understanding of the parameters
of the PatOp algorithm is beyond the scope of this tutorial). Start from
the consensus sequence motif TATAAA (one mismatch).
PatOp uses a heuristic algorithm converging to a local
optimum. To test convergence, start the iterative refinement
process from another initial motif, for instance TATAAT.
Try also to derive a weight matrix for the Shine-Dalgarno
interaction region of a completely sequenced
bacterium.
Are the weights of the matrix found to be compatible with
the assumption that G:U pairs can also be formed between the
mRNA leader and the 3' end of the 16S rRNA?
6. Analyse a collection of yeast splice acceptor sites
It has been said that the introns from budding yeast
contain a special signal near the branchpoint. During
the splicing reaction, the 5' end of the intron is
covalently linked to the 2' OH group of an internal nucleotide
(called the branchpoint), leading to a so-called lariat
structure. The branchpoint is difficult to determine
experimentally but it is known to be located
within a limited distance range from the
3' end of the intron, also called the splice acceptor
site.
A sequence set of yeast splice acceptor sites can be
found here:
/ssa/data/src/yeast_ag.seq
The sequences extend from -200 to +100
relative to the 3' end of the intron.
Use the skills you learned during the previous exercises to characterize
the branchpoint consensus sequence of budding
yeast.