Signals are short DNA sequences that tell the genomic apparatus where to initiate the processes of genomic decoding (transcription, translation). Several signal sequences have been described in terms of their consensus content and positional distribution. Examples are the TATA box, the CAAT box and the cap signal in eukaryotes. The Signal Search Analysis (SSA) server can be used to study the characteristics of these control or signal sequences.
The server hosts two main programs for this type of analysis:
These programs can be used to analyse DNA sequences that are positionally correlated, and thereby to look for known or unknown genetic signals.
A typical signal search analysis pipeline involves the following steps:
The input sequences come from libraries of positionally correlated DNA
sequences, which were constructed as described below.
EPD non-redundant dataset
The Eukaryotic Promoter Database (EPD) is an annotated non-redundant collection
of eukaryotic POL II promoters, for which the transcription start site (TSS) has
been determined experimentally. In EPD, a promoter is defined by a position
reference to a GenBank/EMBL sequence that defines the position of the TSS in the
corresponding entry. Each promoter is categorized as "single site" (class S),
"multiple sites" (class M) or "region" (class R) according to the degree of TSS
scattering observed in promoter mapping experiments. For promoters of the M and
R classes, the position reference points to the center of the sequence interval
in which transcription initiation events are observed. A new class "unknown" (U)
was introduced for promoters from the PRESTA database, for which the
information necessary for proper classification is not available. Recently,
entries based on data from DBTSS and from the MGC program have also been
incorporated.
Human promoters from paper
This set consists of manually curated promoters from the literature, for which
the transcription start site has been determined experimentally.
Human promoters from DBTSS
This set is based on data from DBTSS (Suzuki et al., 2002), which provides cDNA
5' end profiles derived from cDNA libraries obtained with the oligo-capping
method. DBTSS is a gene-centered resource in which one entry corresponds to one
gene, whereas promoter-centered databases like EPD contain multiple entries
for the same gene if there are alternative promoters. The DBTSS database was
converted to an EPD-like database using the 5' end genomic coordinate of
each DBTSS clone.
Human promoters from PRESTA
PRESTA is the name of a program that extracts promoter sequences from
the nucleotide sequence databases. It scans the annotation parts of the
GenBank/EMBL sequence entries for feature keys indicative of a TSS. For each
putative promoter collected in the first pass, the program attempts to find EST
5' end sequences matching the corresponding genomic regions. EST 5' ends
mapping closely to the annotated TSS are considered positive evidence.
We first downloaded a FASTA-formatted library containing all human promoters
from the PRESTA web site. The TSS positions were taken from the header lines of
this file. Stringent BLAST searches were performed to match the retrieved
sequences to genomic sequence entries from GenBank/EMBL and to an RNA
sequence from RefSeq. The HUGO-approved gene symbol was extracted from the
RefSeq entry. All promoters were attributed to the newly introduced promoter
class "unknown" (U).
Human promoters from MGC
The goal of the MGC program is to generate a reference collection of
full-length human and mouse cDNA clones. For this purpose, the 5' and 3' ends
of a large number of cDNA clones from full-length enriched libraries were
sequenced and deposited as ESTs in GenBank/EMBL. The corresponding chromatograms
are available, and we used them to extract the sequences.
The beginning of the cDNA insert was mapped to genomic contigs via RefSeq
identifiers, and the genomic sequences were extracted and compiled.
E. coli translation initiation sites
This is a library that was extracted from the E. coli genome sequences.
B. subtilis translation initiation sites
This is a library that was extracted from the B. subtilis genome sequences.
3' and 5' borders
The left (5') and right (3') borders are integers which define the
horizontal extent, and hence the length, of the DNA segment to be extracted
from a data library.
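As an illustration of how two integer borders can delimit the extracted segment, here is a minimal Python sketch. It is not the server's implementation. It assumes that the borders are offsets relative to a reference position (for example the TSS), that negative values point upstream, and that offset 0 is the reference base itself; the function name `extract_segment` and its arguments are hypothetical.

```python
from typing import Optional


def extract_segment(sequence: str, ref_index: int,
                    left: int, right: int) -> Optional[str]:
    """Return the bases from ref_index+left to ref_index+right (inclusive).

    Assumption: `left` and `right` are offsets relative to the reference
    base at `ref_index` (0-based string index). Returns None when the
    requested range does not fit inside `sequence`.
    """
    start = ref_index + left
    end = ref_index + right + 1  # slice end is exclusive
    if start < 0 or end > len(sequence) or start >= end:
        return None
    return sequence[start:end]


# Toy example: reference base is the first G (index 8).
segment = extract_segment("AAAACCCCGGGGTTTT", 8, -4, 3)
print(segment)  # CCCCGGGG
```

Under these assumptions every extracted segment has the same length, right - left + 1, which is what allows the segments to be stacked and compared column by column.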
Window size and shift
The fixed-length sequences (as determined by the 5' and 3' borders) extracted from the data library are organised into a matrix of oligonucleotides. This matrix is divided into cross-sections, or windows. Two parameters define this subdivision: the window size and the window shift.
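The subdivision into windows can be pictured with a short Python sketch. It is only an illustration under stated assumptions, not the server's code: windows are taken to be column ranges of the aligned segments, a window counts as a hit for a sequence when the motif lies entirely inside it, and the names `window_starts` and `window_frequencies` are hypothetical.

```python
from typing import List


def window_starts(segment_length: int, window_size: int, shift: int) -> List[int]:
    """Start columns of all windows that fit completely in the segment."""
    if window_size <= 0 or shift <= 0:
        raise ValueError("window_size and shift must be positive")
    return list(range(0, segment_length - window_size + 1, shift))


def window_frequencies(segments: List[str], motif: str,
                       window_size: int, shift: int) -> List[float]:
    """Fraction of segments containing `motif` fully inside each window."""
    length = len(segments[0])
    if any(len(s) != length for s in segments):
        raise ValueError("all segments must have the same length")
    freqs = []
    for start in window_starts(length, window_size, shift):
        # Count the segments in which this window contains the motif.
        hits = sum(motif in s[start:start + window_size] for s in segments)
        freqs.append(hits / len(segments))
    return freqs


segments = ["TATAAACCGG", "CCTATAAAGG", "GGGGGGGGGG"]
print(window_starts(10, 4, 2))                     # [0, 2, 4, 6]
print(window_frequencies(segments, "TATA", 4, 2))
```

A shift smaller than the window size gives overlapping windows and a smoother profile; a shift equal to the window size gives adjacent, non-overlapping windows.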
This parameter specifies the strand of DNA to be searched. If a
signal is specified as bidirectional, the complementary strands of
the window segments are also considered.
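A bidirectional search of this kind can be sketched in Python as follows. This is a hedged illustration rather than the server's algorithm: it assumes a signal given as a plain string and checks the window and, when requested, its reverse complement. The helper names `reverse_complement` and `window_has_signal` are hypothetical.

```python
# Translation table for complementing unambiguous DNA bases.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def window_has_signal(window: str, signal: str,
                      bidirectional: bool = False) -> bool:
    """True if `signal` occurs in `window`; with bidirectional=True the
    complementary strand of the window is searched as well."""
    if signal in window:
        return True
    return bidirectional and signal in reverse_complement(window)


window = "GGATTGGCC"  # carries CCAAT on the complementary strand only
print(window_has_signal(window, "CCAAT"))                      # False
print(window_has_signal(window, "CCAAT", bidirectional=True))  # True
```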
Eukaryotic promoters are known to occur only in the forward direction or
in both orientations. A promoter occurring in both orientations can direct
the expression of two genes, one on each side of the promoter. For example, CCAAT and
GC-boxes are bidirectional elements whereas the TATA-box and Initiator are
unidirectional.
The signals can be described in two different formats: as a consensus sequence or as a position weight matrix (PWM).
A signal library has also been provided. The following promoters are available in the signal library:
Reference Position
The reference position indicates the nucleotide position within the consensus signal or the PWM that describes the motif.
The weight matrix can be used to compute a score for any subsequence of the corresponding length in a promoter region. The cut-off value determines which subsequences represent a match to the motif.
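The scoring step can be sketched in Python as below. The matrix is a made-up toy example, not one of the matrices in the signal library, and the rule that a subsequence matches when its score is greater than or equal to the cut-off is an assumption; the names `score_subsequence` and `find_matches` are hypothetical.

```python
from typing import Dict, List, Tuple

# One dict of base weights per motif position (toy values, TATA-like).
Pwm = List[Dict[str, int]]

TOY_PWM: Pwm = [
    {"A": -2, "C": -2, "G": -2, "T": 3},
    {"A": 3, "C": -2, "G": -2, "T": -2},
    {"A": 0, "C": -2, "G": -2, "T": 3},
    {"A": 3, "C": -2, "G": -2, "T": -2},
]


def score_subsequence(pwm: Pwm, subseq: str) -> int:
    """Sum the weight of the observed base at each motif position."""
    return sum(column[base] for column, base in zip(pwm, subseq))


def find_matches(pwm: Pwm, sequence: str, cutoff: int) -> List[Tuple[int, int]]:
    """Return (start, score) for every subsequence scoring >= cutoff."""
    width = len(pwm)
    matches = []
    for start in range(len(sequence) - width + 1):
        score = score_subsequence(pwm, sequence[start:start + width])
        if score >= cutoff:  # assumed match rule
            matches.append((start, score))
    return matches


print(find_matches(TOY_PWM, "GGTATAAAGG", cutoff=9))  # [(2, 12), (4, 9)]
```

Raising the cut-off to 12 in this toy example keeps only the perfect TATA at position 2, which shows how the cut-off trades sensitivity against specificity.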
The OUTPUT consists of two parts:
It can be seen that the EPD is enriched in TATA-box containing promoters.
Sequence Input
The input sequences come from libraries of positionally correlated DNA sequences. The available datasets are the same as those described above for OProf.
Sequence Parameters and Search parameters
These parameters are the same as described above for OProf.
Five different signal selection criteria have to be provided by the user.
A signal sequence collection is required to look for over- or under-represented
signals in the data set.
The program offers three different ways of choosing a signal sequence collection.
The OUTPUT of the program consists of two parts: