EUKARYOTIC PROMOTER DATABASE USER MANUAL
Written by: Philipp Bucher, Rouaida Cavin Périer, Viviane Praz and Christoph Schmid
EPFL School of Life Sciences - SV
and Swiss Institute for Experimental Cancer Research - ISREC
Computational Cancer Genomics Group
EPFL SV ISREC GR-BUCHER Station 15
CH-1015 Lausanne
Switzerland
Electronic mail:
This manual and the database it accompanies may be copied and
redistributed freely, without advance permission, provided that this
statement is reproduced with each copy.
Published research assisted by the Eukaryotic Promoter Database should
cite:
EPD in its twentieth year: towards complete promoter coverage of
selected model organisms
Schmid, C.D., Perier, R., Praz, V. and Bucher, P. (2006) Nucleic Acids
Res, 34, D82-85.
The Eukaryotic Promoter Database EPD was designed and developed at the
Weizmann Institute of Science in Rehovot (Israel) and is currently
maintained
at ISREC in Epalinges s/Lausanne (Switzerland). EPD is a specialized
annotation
database of the EMBL Data Library. It provides information about
eukaryotic
promoters available in the EMBL Data Library and is intended to assist
experimental researchers, as well as computer analysts, in the
investigation
of eukaryotic transcription signals. The present version originated
from
a previous compilation published in an article (1)
and is organized as a hierarchically ordered and documented "functional
position set" (2)
pointing to transcription initiation sites. All information is either
directly
extracted from scientific literature or, starting from release 73,
compiled
by a new in silico primer
extension method (16). Thus promoter information
in EPD is independent of the EMBL sequence entry descriptions. As a
consequence,
many of the initiation sites referred to in EPD do not appear in
corresponding
EMBL feature tables. A coordinated updating procedure has been set up by
the two laboratories that will ensure future compatibility between the
position references in EPD and the sequence data in the main data
library.
Investigators who access EMBL via publicly available programs should be
aware of the fact that software producers occasionally modify the
sequence
data in ways that render position references inaccurate. EPD is
generally
not compatible with sequence data of another release because EMBL
sequence
entries are not designed as stable data units. The completeness and
accuracy of EPD benefit greatly from user feedback. Any report of
mistakes or omissions would be very much appreciated. Direct
communication of newly published transcript mapping or gene expression
data is also welcome. Please forward all correspondence to the address
given at the top of this document. Use electronic mail if possible.
2 PROMOTER SELECTION
EPD is a rigorously selected database. In order to be included in EPD,
a promoter must be:
recognized by eukaryotic RNA POL II,
active in a higher eukaryote,
experimentally defined, or homologous and sufficiently similar to
an
experimentally
defined promoter,
biologically functional,
available in the current EMBL release,
distinct from other promoters in the database.
Explanations:
Transcription by RNA POL II is bona fide assumed for protein
coding
genes
but must be supported by alpha-amanitin data if the end product is an
RNA.
All eukaryotes except phycophyta, fungi, myxomycetes, and
protozoa are
considered higher eukaryotes. Note that the expression "active in" does
not always refer to the source organism of the promoter (e.g. in
viruses).
EPD currently contains promoter sequences from 139 different species.
A promoter is experimentally determined if a corresponding
transcription
initiation site is mapped with a precision of +/- 5 bp or higher. Any
technique
that characterizes the 5' terminus of an in vivo or in vitro generated
RNA
is acceptable. Single nuclease-protection or primer-extension data must
be accompanied by additional evidence unless the gene's intron-exon
organization
is well established. Similarity is considered "sufficient" if percent
identity
(as defined in Section 6) is >=60% between -79 and +20 or >=75%
between
-49 and +10.
A promoter is biologically functional if it contributes to the
source
organism's
survival and/or reproduction. This is bona fide assumed except for
promoters
of pseudogenes, minor transcription initiation sites (<20% of total
gene transcripts), promoters giving rise to an unstable RNA product,
and mutant promoters.
The minimum sequence requirement is 45 bp between -49 and +10.
Promoters are considered distinct if they originate from
different gene
loci or different species. Identity is assumed if two promoters from
the
same species exhibit >95% similarity between -79 and +20 while their
genetic
relationship is unknown. Multiple isolates of viruses or transposable
elements
are considered distinct if at least one promoter region fails to
fulfill
the above similarity criterion.
3 ASSIGNMENT OF TRANSCRIPTION INITIATION SITE
A eukaryotic promoter is defined as a DNA sequence around a
transcription
initiation site. The position reference to the initiation site is
therefore
the central part of a promoter entry. Its assignment is based directly
on experimental data shown in an article, proposed adjustments
originating
from consensus sequence considerations being ignored.
In the case of minor discrepancies between different publications,
averaged positions are given. Position references are subject to
permanent re-evaluation.
A transcription initiation site may be reassigned upon publication of
new
data. Position references are replaced if longer upstream sequences of
the same promoter become available in a new EMBL sequence entry.
Several initiation sites preceding the same gene appear as alternative
promoters if they are clearly separated from each other or
differentially
regulated. The minimum distance required between two alternative
initiation
sites is 20 bp. Otherwise, they are considered a single promoter
region.
Four types of promoters are distinguished by one-letter codes in order
to account for the variety of transcription initiation patterns in
eukaryotes:
S: Single initiation site: >90% of all reported transcripts
initiate
within
10 bp (the experimental data usually do not allow distinction between a
single cap-site and small mRNA 5' heterogeneity).
M: Multiple initiation sites: >75% of all reported transcripts
initiate
within 20 bp.
R: Initiation region: >75% of all
reported
transcripts
initiate within 100 bp.
U: Undefined transcription initiation
pattern, exclusively
in 'preliminary' entries in epd_bulk.dat (see next section).
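The thresholds above can be illustrated with a minimal sketch. It is not the procedure used by the EPD curators: the function names are hypothetical, "within N bp" is read here as "inside a window of N consecutive positions", and returning "U" when no threshold is met is an assumption made only for illustration.

```python
from collections import Counter


def max_fraction_within(positions, width):
    """Largest fraction of transcripts whose 5' ends fall inside any
    window of `width` consecutive base pairs (assumed reading of the
    manual's "within N bp")."""
    counts = Counter(positions)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    best = 0
    # An optimal window can always be chosen to start at an observed site.
    for start in counts:
        inside = sum(n for p, n in counts.items() if start <= p < start + width)
        best = max(best, inside)
    return best / total


def classify_initiation(positions):
    """Return 'S', 'M', 'R' or 'U' for a list of mapped 5' end positions,
    one list element per reported transcript."""
    if not positions:
        return "U"
    if max_fraction_within(positions, 10) > 0.90:
        return "S"
    if max_fraction_within(positions, 20) > 0.75:
        return "M"
    if max_fraction_within(positions, 100) > 0.75:
        return "R"
    return "U"  # assumption: anything else is treated as undefined


print(classify_initiation([100] * 19 + [130]))
```

The criteria are tested from the strictest to the loosest, so a promoter that satisfies the "S" rule is never reported as "M" or "R".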
Note that in addition to true alternative
promoter
activity, variability in the position of the transcription initiation
site
might also be due to experimental constraints, biological variability
in the activity of RNA polymerase II, or the presence of highly similar
(pseudo-)genes with distinct transcription initiation sites.
In sequence entries that contain a complete RNA or DNA genome of a
retrovirus or a retrovirus-like transposable element, the position
reference points to the U3/R boundary of the 3' terminal LTR.
4 FORMAT CONVENTIONS
EPD is distributed as two ASCII flatfiles (epd.dat, epd_bulk.dat) in
essentially
identical format. Differences in the format of 'preliminary' entries in
'epd_bulk.dat' are described in Section 4.4. EPD files contain a title
line followed by a number of promoter entries. Interspersed are group
headings whose function and format are described in the next section.
The title line and parts of the promoter
entries are rigidly formatted so that the entire database conforms to
the
standards of an FPS file (functional position set) of our current
signal
search analysis (1,2)
software.
4.1. The title line
The title line of EPD is shown below:
TI EPD83 Eukaryotic Promoter Database / Release 83 EP
The TI line contains the following fields:
columns   data type
1- 2      "TI"
3- 5      (blank)
6-15      FPS name
16-70     title
71-72     FPS code
Explanations:
FPS name and FPS code are used by our data extraction software to
generate
default names for output files.
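As an illustration of the column layout given above, the following minimal sketch slices a TI line by fixed columns. The function name is hypothetical, and the example line is rebuilt with explicit padding because the column alignment of the printed example is not preserved in this copy of the manual.

```python
def parse_title_line(line):
    """Split a TI line into its fixed-column fields.

    Columns (1-based, inclusive) follow the table in Section 4.1:
    1-2 "TI", 3-5 blank, 6-15 FPS name, 16-70 title, 71-72 FPS code.
    """
    line = line.rstrip("\n").ljust(72)
    if line[0:2] != "TI":
        raise ValueError("not a TI line")
    return {
        "fps_name": line[5:15].strip(),
        "title": line[15:70].strip(),
        "fps_code": line[70:72].strip(),
    }


# Example line padded to the documented column positions (assumed layout).
example = (
    "TI"
    + " " * 3
    + "EPD83".ljust(10)
    + "Eukaryotic Promoter Database / Release 83".ljust(55)
    + "EP"
)
print(parse_title_line(example))
```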
4.2. Promoter entries
An EPD entry contains the following types of information:
Promoter identification and description.
Machine-readable pointers to the transcription initiation site in
corresponding
sequence entries.
Description of the experimental evidence defining the
transcription
start
site.
Various kinds of promoter classifications useful for extraction
of
biologically
meaningful promoter subsets.
Information on regulatory properties.
Cross-references to other databases.
Bibliographic references.
Promoter entries are presented in a similar format as EMBL and
SWISS-PROT
sequence entries. Each line starts with a line code identifying the
type
of information presented. The current line types and line codes and the
order in which they appear in an entry, are shown below:
ID - IDentification.
AC - ACcession number(s).
DT - DaTe.
DE - DEscription.
OS - Organism Species.
HG - Homology Group.
AP - Alternative Promoter.
NP - Neighbouring Promoter.
DR - Database cross-References.
RN - Reference Number.
RX - Reference cross-references.
RA - Reference Authors.
RT - Reference Title.
RL - Reference Location.
ME - MEthods.
SE - SEquence.
FL - Full Length.
IF - Initiation Frequency.
TX - TaXonomy.
KW - KeyWords.
FP - Functional Position.
DO - DOcumentation.
RF - literature ReFerence.
// - Termination line.
Spacer lines (XX) are inserted in order to make the promoter database
easier
to read by eye. Some line types occur many times in a single entry.
Each
entry must begin with an identification line (ID) and end with a
terminator
line (//). Text does not exceed column 72. Below is an example of a
promoter
entry:
ID HS_MYC_2 standard; single; VRT.
XX
AC EP11148;
XX
DT ??-APR-1987 (Rel. 11, created)
DT 07-MAR-2005 (Rel. 82, Last annotation update).
XX
DE c-myc (cellular homologue of myelocytomatosis virus 29 oncogene),
DE promoter 2.
OS Homo sapiens (human)
XX
HG Homology group 53; Mammalian c-myc proto-oncogene, promoter 2
AP Alternative promoter #2 of 2; exon 1; site 2; major promoter.
NP none.
XX
DR GENOME; NT_008046.15; NT_008046; [-41966656, 15188617].
DR EPD; EP11146; HS_MYC_1; alternative promoter; [-162; +].
DR CLEANEX; HS_MYC.
DR EMBL; AC103819.3; [-87815, 60206].
DR EMBL; X00364.2; [-2489, 8507].
DR EMBL; D10493.1; [-2487, 5569].
DR EMBL; K01910.1; [-2451, 49].
DR EMBL; M16261.1; [-1843, 1048].
DR EMBL; J03253.1; [-1759, 461].
DR EMBL; L00057.1; [-810, 2795].
DR EMBL; K03015.1; [-555, 458].
DR EMBL; X00196.1; [-532, 2792].
DR EMBL; M12026.1; [-511, 678].
DR EMBL; K01708.1; [-410, 500].
DR EMBL; K00559.1; [-345, 1020].
DR EMBL; K02280.1; [-302, 178].
DR EMBL; K01909.1; [-266, 1365].
DR EMBL; S65124.1; [-266, 1023].
DR EMBL; M14206.1; [-266, 446].
DR EMBL; M20013.1; [-240, 982].
DR EMBL; AF111270.1; [-142, 264].
DR EMBL; K02275.1; [-96, 780].
DR EMBL; X00675.1; [-96, 404].
DR EMBL; K02277.1; [-96, 157].
DR SWISS-PROT; P01106; MYC_HUMAN.
DR TRANSFAC; R01157; HS$CMYC_01; [-211, -189]; by position.
DR TRANSFAC; R01158; HS$CMYC_02; [-168, -145]; by position.
DR TRANSFAC; R01804; HS$CMYC_04; [-300, -283]; by position.
DR TRANSFAC; R01851; HS$CMYC_05; [-65, -57]; by position.
DR TRANSFAC; R01852; HS$CMYC_06; [-42, -34]; by position.
DR TRANSFAC; R04076; HS$CMYC_12; [-251, -228]; by position.
DR TRANSFAC; R04076; HS$CMYC_12; [-252, -229]; by position.
DR TRANSFAC; R04076; HS$CMYC_12; [-253, -230]; by position.
DR TRANSFAC; R04621; HS$CMYC_17; [-313, -262]; by position.
DR TRANSFAC; R08503; HS$CMYC_18; [-50, -41]; by position.
DR TRANSFAC; R16688; HS$CMYC_24; [-7, 41]; by position.
DR TRANSFAC; R16689; HS$CMYC_25; [-7, 41]; by position.
DR TRANSFAC; R17051; HS$CMYC_30; [-510, -480]; by position.
DR TRANSFAC; R18503; HS$CMYC_31; [-185, -170]; by position.
DR TRANSFAC; R18504; HS$CMYC_32; [-153, -168]; by position.
DR RefSeq; NM_002467.
DR MIM; 190080.
XX
RN [1]
RX MEDLINE; 84026482.
RA Battey J., Moulding C., Taub R., Murphy W., Stewart T., Potter H.,
RA Lenoir G., Leder P.;
RT "The human c-myc oncogene: structural consequences of
RT translocation into the IgH locus in Burkitt lymphoma";
RL Cell 34:779-787(1983).
RN [2]
RX MEDLINE; 84131953.
RA Bernard O.D., Cory S., Gerondakis S., Webb E., Adams J.M.;
RT "Sequence of the murine and human cellular myc oncogenes and two
RT modes of myc transcription resulting from chromosome translocation
RT in B lymphoid tumours";
RL EMBO J. 2:2375-2383(1983).
RN [3]
RX MEDLINE; 87257828.
RA Lipp M., Schilling R., Wiest S., Laux G., Bornkamm G.W.;
RT "Target sequences for cis-acting regulation within the dual
RT promoter of the human c-myc gene.";
RL Mol. Cell. Biol. 7:1393-1400(1987).
RN [4]
RX MEDLINE; 88038843.
RA Broome H.E., Reed J.C., Godillot E.P., Hoover R.G.;
RT "Differential promoter utilization by the c-myc gene in mitogen-
RT and interleukin-2-stimulated human lymphocytes.";
RL Mol. Cell. Biol. 7:2988-2993(1987).
XX
ME Nuclease protection [1,4].
ME Nuclease protection; transfected or transformed cells [3].
ME Length measurement of an RNA product; low-precision data [1].
XX
SE agggagggatcgcgctgagtataaaagccggttttcggggctttatctaACTCGCTGTAG
XX
TX 6. Vertebrate promoters
TX 6.1. Chromosomal genes
TX 6.1.5. Hormones, growth factors, regulatory proteins
TX 6.1.5.16. Various cellular protooncogenes
XX
KW Proto-oncogene, Nuclear protein, DNA-binding, Glycoprotein,
KW Transcription regulation.
XX
FP Hs c-myc P2+:+S EU:NC_000008.9 1+ 128817660; 11148.053 010*2
XX
DO Experimental evidence: 4,4#,2l
DO Expression/Regulation: +mitogen
RF Cell34:779 EMBOJ2:2375 MCB7:1393 MCB7:2988
//
A detailed description of each line type is given below.
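Before the individual line types are described, the general entry structure (ID line first, "//" last, XX spacers in between, two-character line codes) can be sketched as a small reader. This is a minimal illustration rather than the official EPD extraction software; the function names are hypothetical, and the code tolerates any amount of white space after the line code.

```python
def read_entries(lines):
    """Yield each entry as a list of (line_code, text) pairs.

    Assumes entries start at an ID line and end at a '//' line. XX spacer
    lines, the TI title line and group headings between entries are skipped.
    """
    current = []
    for raw in lines:
        line = raw.rstrip("\n")
        code, text = line[:2], line[2:].strip()
        if code == "//":
            if current:
                yield current
            current = []
        elif code == "XX" or not line.strip():
            continue
        elif code == "ID" or current:
            current.append((code, text))


def group_by_code(entry):
    """Collect the text of repeated line types under their line code."""
    grouped = {}
    for code, text in entry:
        grouped.setdefault(code, []).append(text)
    return grouped


sample = [
    "TI EPD83 Eukaryotic Promoter Database / Release 83 EP",
    "ID HS_MYC_2 standard; single; VRT.",
    "XX",
    "AC EP11148;",
    "XX",
    "DE c-myc (cellular homologue of myelocytomatosis virus 29 oncogene),",
    "DE promoter 2.",
    "//",
]
for entry in read_entries(sample):
    print(group_by_code(entry)["AC"])
```

Keeping the lines of one type in a list preserves their order, which matters for multi-line fields such as DE, RA and RT.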
4.2.1. The ID line
The identification line is always the first line of an entry. The
general
form of the ID line is:
ID ENTRY_NAME data class; initiation site type; TAXONOMIC DIVISION.
ENTRY_NAME is a unique entry identifier (e.g. "HS_MYC_2") which obeys
rigorous naming conventions. It contains 2 or 3 fields. The first is the
species identification code, at most 4 alphanumeric characters
representing the biological source of the promoter. The second field
identifies the gene, using the protein code of the SWISS-PROT ID (if
available). For human EPD entries, the official gene symbol approved by
the HUGO nomenclature committee (if available) is used instead of the
SWISS-PROT ID. The third field is optional; it is either a number which
represents alternative promoters or a letter for promoters of duplicated
genes. The `_' sign serves as a separator.
The data class field relates to the quality of the information:
"standard" means that the information is complete and correct according
to the standards laid down in this document; "preliminary" means that the
entry has not yet undergone all quality checks necessary for being
classified as "standard".
The initiation site type is either "single", "multiple" or "region",
as defined in Section 3.
The TAXONOMIC DIVISION codes are:
PLN for plants
NEM for nematodes
ART for arthropods
MLS for molluscs
ECH for echinoderms
VRT for vertebrates.
Note that these codes relate to the organism in which the promoter is
expressed,
not to the source organism in which the promoter is replicated as
defined
on the OS line.
The ID line is terminated by a period.
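The structure of the ID line lends itself to a short parsing sketch. The regular expression and the function name below are illustrative assumptions, not part of the EPD software; in particular, the code assumes that gene identifiers never contain the `_' separator themselves.

```python
import re

# ID ENTRY_NAME data class; initiation site type; TAXONOMIC DIVISION.
ID_RE = re.compile(r"^ID\s+(\S+)\s+(\w+);\s*(\w+);\s*(\w+)\.\s*$")


def parse_id_line(line):
    """Split an ID line into entry name parts, data class, site type
    and taxonomic division."""
    match = ID_RE.match(line)
    if match is None:
        raise ValueError("malformed ID line: %r" % line)
    entry_name, data_class, site_type, division = match.groups()
    fields = entry_name.split("_")
    if len(fields) not in (2, 3):
        raise ValueError("entry name must have 2 or 3 fields: %r" % entry_name)
    return {
        "entry_name": entry_name,
        "species_code": fields[0],
        "gene": fields[1],
        # Optional third field: alternative promoter number or gene letter.
        "suffix": fields[2] if len(fields) == 3 else None,
        "data_class": data_class,
        "site_type": site_type,
        "division": division,
    }


print(parse_id_line("ID HS_MYC_2 standard; single; VRT."))
```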
4.2.2. The AC line
AC EP11148;
The accession number consists of the character string "EP" followed
by 5 digits representing the EMBL release number followed by the EPD
entry order. Most EPD entries currently have only one accession number.
If necessary, more than one AC will be used, separated by semicolons; the
list is terminated by a semicolon.
4.2.3. The DT line
The date lines show the date of entry or last modification of the
entry.
DT DD-MMM-YEAR (Rel. XX, Comment)
where `DD' is the day, `MMM' the month, `YEAR' the year, and `XX' the
EPD
release number. The comment portion of the line indicates the action
taken
on that date.
The first DT line indicates when the entry first appeared in the
database.
The second DT line indicates when the promoter data was last
modified.
It is terminated by a period.
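A DT line in the form shown above can be picked apart with a regular expression. The sketch below is only an illustration under stated assumptions: the function name is hypothetical, and an unknown day written as "??" (as in the example entry) is returned as None.

```python
import re

# DT DD-MMM-YEAR (Rel. XX, Comment)
DT_RE = re.compile(
    r"^DT\s+(\d{2}|\?\?)-([A-Z]{3})-(\d{4})\s+\(Rel\.\s*(\d+),\s*(.+?)\)\.?\s*$"
)


def parse_dt_line(line):
    """Return day, month, year, release number and comment of a DT line."""
    match = DT_RE.match(line)
    if match is None:
        raise ValueError("malformed DT line: %r" % line)
    day, month, year, release, comment = match.groups()
    return {
        "day": None if day == "??" else int(day),
        "month": month,
        "year": int(year),
        "release": int(release),
        "comment": comment,
    }


print(parse_dt_line("DT 07-MAR-2005 (Rel. 82, Last annotation update)."))
```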
4.2.4. The DE line
DE c-myc (cellular homologue of myelocytomatosis virus 29 oncogene),
DE promoter 2.
The description lines contain general descriptive information about the
promoter. The description is given in ordinary English and is
free-format.
It contains the SWISS-PROT gene names when known. In some cases, more
than
one DE line is required; in this case, the text is divided only between
words. The last DE line is terminated by a period.
4.2.5. The OS line
OS Mus musculus (house mouse)
The species line specifies the source organism(s) of the promoter. The
species names are based on NCBI's taxonomy and thus can be
automatically
hyperlinked to the NCBI's taxonomy web pages.
4.2.6. The HG line
HG Homology group 53; Mammalian c-myc proto-oncogene, promoter 2
The homology group line is optional; it contains 2 fields: a homology
group number that allows identification of all sequence-wise similar
promoters in EPD, and a homology group name.
4.2.7. The AP line
AP Alternative promoter #2 of 2; 5' exon 1; site 2; major promoter.
The AP line is optional and provides information on alternative
promoters of the same gene (for more details, see Section 4.3.1.).
It contains 3 or 4 fields, separated by semicolons. The descriptive
text in these fields is followed by numbers providing the following
types of information:
Two numbers indicating, respectively, the promoter's relative position
along the gene and the total number of alternative promoters of the
gene. Promoters are numbered in the 5' to 3' direction starting with one.
A number referring to the exon preceded by the promoter. Note that
multiple promoters may be associated with the same (3'-coterminal) exon
or with different exons. Known exons are numbered in the 5' to 3'
direction starting with one. Note that the nomenclature of 5'-exons in
EPD may differ from the usage in the literature.
A number indicating the promoter's relative position among the subset
of promoters preceding the same exon.
4.2.8. The NP line
The NP line is optional and provides information on promoters which are
physically closer to each other than 1000 bp. It contains 3 fields,
separated by semicolons, providing the following types of information:
The EPD accession number of the neighbouring promoter.
The EPD identifier of the neighbouring promoter.
The last field indicates, respectively, the position and the
direction
of the neighbouring promoter relative to the transcription initiation
site
given in the promoter entry.
Negative numbers indicate the upstream region of this entry and
positive
ones indicate the downstream region.
The sign indicates the transcription direction of the
neighbouring
promoter
relative to the promoter entry:
"+" means same direction
"-" means opposite direction
4.2.9. The DR line
The DR lines contain cross-references to other EPD entries (if there are
alternative promoters of the same gene), or to entries from other
databases. So far, we have incorporated links to CLEANEX, EMBL (3),
GenBank (4), DDBJ (5), SWISS-PROT (6), TRANSFAC (7), Flybase (8), MIM (9)
and MGD (10). The precise format of these lines depends on the target
database. Note that some cross-references include numbers enclosed in
square brackets indicating the relative position of a linked sequence
object, or keywords characterising the nature of the relationship
between the entries. For instance, the ranges associated with
cross-references to EMBL entries define the extensions of the EMBL
sequences relative to the initiation site described by the EPD entry.
The multiplicity of EMBL cross-references in some entries mirrors the
redundancy of the sequence database. The first of these references
corresponds to the longest promoter region, except when the sequences
have been cancelled from the EMBL database but still exist in GenBank or
DDBJ. The format of the DR line is shown by the following example lines:
DR GENOME; NT_037436.1; NT_037436; [-14139754, 9212459].
DR EPD; EP11146; HS_MYC_1; alternative promoter; [-162; +].
DR EMBL; J00120.1; [-2489, 8507].
DR SWISS-PROT; P01106; MYC_HUMAN.
DR SPTREMBL; Q8IQL1.
DR FLYBASE; FBgn0013718; nuf.
DR TRANSFAC; R01804; HS$CMYC_04; [-300, -283]; by position.
DR MIM; 190080.
DR RefSeq; NM_003529.
DR MGD; MGI:88468; Cola2.
DR ENSEMBL; CG32140.
DR TRANSCRIPTOME; DMe000571.
Explanations (for detailed information, see the Guidelines):
The first item on the DR line is the abbreviated name of the data
collection to which reference is made. The currently defined data bank
identifiers are the following:
GENOME         NCBI Reference Sequence (RefSeq) of genomic sequence
               contigs
EPD            Eukaryotic Promoter Database: alternative promoters of
               the same gene
CLEANEX        Gene expression database for human EPD promoters
EMBL           Nucleotide sequence database of the EMBL
SWISS-PROT     Protein sequence database
SPTREMBL       Subset of protein sequence database TrEMBL. It contains
               the entries which should be eventually incorporated into
               SWISS-PROT. SWISS-PROT accession numbers have been
               assigned for all SP-TrEMBL entries
FLYBASE        Drosophila genome database
TRANSFAC       Transcription factor (TF) database
MIM            Mendelian Inheritance in Man Database
RefSeq         Reference Sequence Database
MGD            Mouse Genome Database
ENSEMBL        Metazoan genome annotation
TRANSCRIPTOME  Catalog of transcripts and their mapping onto the genome
               (LICR Lausanne branch)
TIGR           'gene identifiers' from the 'Rice Genome Annotation'
               project at TIGR
The second item is the primary accession number (or an equivalent unique
identifier of another data bank) of the entry to which reference is
made.
The third item (if it exists) is a secondary identifier or name for the
cross-referenced database entry.
The fourth item for EMBL and TRANSFAC indicates the location and
extension of the sequences given in these entries relative to the
transcription initiation site given in the promoter entry. Negative
numbers indicate the region upstream of this site and positive ones
indicate the downstream part.
The fifth item:
in the EPD line, indicates the position and the direction of the
alternative promoter, as defined for the neighbouring promoter in the
last field of the NP line;
in the TRANSFAC line, designates the criterion used to collect the TF
entry:
- by position: the TF binding site is situated between -500 and +100,
+1 being the transcription initiation site;
- by function: the TF binding site is known to regulate the
corresponding promoter.
NB: TRANSFAC cross-reference lines should not exceed the real number
of binding sites found in the "TRANSFAC Site Table". Thus the position
given in this DR line is related to the longest EMBL entry common to both
the EPD and TRANSFAC (version 6.3) databases.
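The DR conventions described in this section can be sketched in code as follows. This is a minimal illustration, not the official EPD parser: all names are hypothetical, and the only assumptions are that items are separated by semicolons outside square brackets, that a sequence range is written as "[start, end]", and that a relative promoter position is written as "[offset; sign]".

```python
import re

RANGE_RE = re.compile(r"^\[\s*([+-]?\d+)\s*,\s*([+-]?\d+)\s*\]$")
RELATIVE_RE = re.compile(r"^\[\s*([+-]?\d+)\s*;\s*([+-])\s*\]$")


def split_fields(body):
    """Split on semicolons that are not inside square brackets."""
    fields, current, depth = [], [], 0
    for char in body:
        if char == "[":
            depth += 1
        elif char == "]":
            depth -= 1
        if char == ";" and depth == 0:
            fields.append("".join(current).strip())
            current = []
        else:
            current.append(char)
    tail = "".join(current).strip()
    if tail:
        fields.append(tail)
    return fields


def parse_dr_line(line):
    """Return database, accession, optional range, optional relative
    position (offset, same_direction) and any remaining items."""
    if not line.startswith("DR"):
        raise ValueError("not a DR line")
    body = line[2:].strip()
    if body.endswith("."):
        body = body[:-1]  # drop the terminating period
    fields = split_fields(body)
    if len(fields) < 2:
        raise ValueError("DR line needs a database and an accession")
    result = {"database": fields[0], "accession": fields[1],
              "range": None, "relative": None, "other": []}
    for item in fields[2:]:
        span = RANGE_RE.match(item)
        rel = RELATIVE_RE.match(item)
        if span:
            result["range"] = (int(span.group(1)), int(span.group(2)))
        elif rel:
            # "+" means same transcription direction, "-" opposite.
            result["relative"] = (int(rel.group(1)), rel.group(2) == "+")
        else:
            result["other"].append(item)
    return result


print(parse_dr_line("DR EMBL; X00364.2; [-2489, 8507]."))
```

Splitting with a bracket counter rather than a plain `split(";")` keeps the "[-162; +]" item of EPD cross-references in one piece.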
4.2.10. The RN, RX, RA, RT and RL lines
These lines comprise the literature citations within EPD. The citations
indicate the papers from which the data has been abstracted. The
reference
lines for a given citation occur in a block, and are always in the
order
RN, RX, RA, RT, RL. Within each such reference block the RN line occurs
once, the RX lines occurs zero or more times, and the RA, RT and RL
lines
each occur one or more times. If several references are given, there
will
be a reference block for each. An example of a complete reference is:
RN [1]
RX MEDLINE; 84026482.
RA Battey J., Moulding C., Taub R., Murphy W., Stewart T., Potter H.,
RA Lenoir G., Leder P.;
RT "The human c-myc oncogene: structural consequences of
RT translocation into the IgH locus in Burkitt lymphoma";
RL Cell 34:779-787(1983).
The formats of the individual lines are explained below.
4.2.10.1. The RN line
The RN line gives a sequential number to each reference citation in an
entry. This number is used to indicate the reference in the ME lines.
4.2.10.2. The RX line
The RX line is an optional line which is used to indicate the identifier
assigned to a specific reference in PubMed (PMID, from the National
Library of Medicine (NLM)).
4.2.10.3. The RA line
The RA lines list the authors of the paper (or other work) cited. The
authors are listed in the order given in the paper. The names are listed
surname
first followed by a blank followed by initial(s) with periods. The
authors'
names are separated by commas and terminated by a semicolon. Author
names
are not split between lines.
4.2.10.4. The RT line
The RT lines contain the title of the reference citation.
4.2.10.5. The RL line
The RL lines contain the conventional citation information for the
reference.
In general, the RL lines alone are sufficient to find the paper in
question.
They include the journal abbreviation, the volume number, the page
range,
and the year. Journal names are abbreviated according to the
conventions
used by the National Library of Medicine (NLM) and are based on the
existing
ISO and ANSI standards.
4.2.11. The ME line
The method lines describe experiments defining the transcription
initiation
site. The format of the ME line is as follows:
ME Method_description [; Qualifier...] [n,...].
A complete list of method descriptions is given in Section 4.3.2.
Qualifiers may indicate that an experimental gene transcription system
was used, that data are of low precision (error > +/- 5 bp), or that the
experiments were done with a closely related gene. The number(s)
enclosed in square brackets link the method descriptions to the
bibliographic references included in the promoter entry. The method
lines from the example are:
ME Nuclease protection [1,4].
ME Nuclease protection; transfected or transformed cells [3].
ME Length measurement of an RNA product; low-precision data [1].
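The ME format given above (method description, optional qualifiers after semicolons, reference numbers in square brackets) can be parsed with a short sketch such as the following. The function name and the regular expression are illustrative assumptions only.

```python
import re

# ME Method_description [; Qualifier...] [n,...].
ME_RE = re.compile(r"^ME\s+(.+?)\s*\[([\d,\s]+)\]\.?\s*$")


def parse_me_line(line):
    """Return the method, its qualifiers and the cited reference numbers."""
    match = ME_RE.match(line)
    if match is None:
        raise ValueError("malformed ME line: %r" % line)
    parts = [part.strip() for part in match.group(1).split(";")]
    references = [int(n) for n in match.group(2).split(",") if n.strip()]
    return {
        "method": parts[0],
        "qualifiers": parts[1:],
        "references": references,  # RN numbers of the same entry
    }


print(parse_me_line("ME Nuclease protection; transfected or transformed cells [3]."))
```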
4.2.12. The SE line
The sequence line shows a short sequence segment corresponding to the
-49
to +10 region of the promoter. Transcribed and untranscribed
nucleotides
are represented by upper and lower case characters, respectively. This
line type is not meant to provide sequence data but serves as a control
string for sequence extraction.
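Because the case of the letters marks the boundary between untranscribed and transcribed nucleotides, the control string can be split at the first upper-case character. The sketch below shows one way to do this; the function name is hypothetical and the short test sequence is a toy example, not EPD data.

```python
def split_control_sequence(sequence):
    """Split an SE control string into (untranscribed, transcribed) parts.

    Lower-case letters are taken as upstream (untranscribed) nucleotides,
    upper-case letters as transcribed ones, as described in the manual.
    """
    for index, base in enumerate(sequence):
        if base.isupper():
            upstream, transcribed = sequence[:index], sequence[index:]
            break
    else:
        raise ValueError("no transcribed (upper-case) nucleotides found")
    if transcribed != transcribed.upper():
        raise ValueError("lower-case letter found after the initiation site")
    return upstream, transcribed


upstream, transcribed = split_control_sequence("acgtACG")
print(len(upstream), len(transcribed))
```

The length of the first part gives the offset of the transcription initiation site within the control string, which a sequence extraction program could compare with the sequence retrieved through the position reference.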
4.2.13. The FL line
The Full Length line designates the large-scale cDNA sequencing
projects: NEDO (11), MGC (12) and BDGP (15).
4.2.13. The IF line
The Initiation Frequency lines reflect the frequency at which each
nucleotide within the initiation region is found at the 5' end of bona
fide full-length cDNA clone inserts.
4.2.14. The TX line
The TX (TaXonomy) lines define a promoter's location within EPD's
hierarchical
classification system (see Section 5). Note that
starting from release 72, the classification
system is no longer maintained.
4.2.15. The KW line
The KW lines define a number of keywords
describing
an entry.
4.2.16. The FP, DO and RF lines
These lines pertain to the EPD old format, see next Section.
4.2.17. The // line
The // (terminator) line contains no data or comments. It designates
the
end of an entry.
4.3. Line types retained from the old format
The last six lines of an entry present essential information in the more
concise, old format. The original description of the old format follows:
Each entry starts with an FP line that contains a position reference to
a transcription initiation site, and ends with a terminator (//). Below
is an example of a promoter entry:
FP Hs c-myc P2+:+S EU:NC_000008.9 1+ 128817660; 11148.053 010*2
XX
DO Experimental evidence: 4,4#,2l
DO Expression/Regulation: +mitogen
RF Cell34:779 EMBOJ2:2375 MCB7:1393 MCB7:2988
//
4.3.1. The FP line
The FP line contains the following fields and subfields:
The promoter name begins with a species code usually followed by
a gene
locus or gene product name. Species codes consist of the initials of
genus
and species name. Occasionally, three characters are required to
generate
unique codes. Standard abbreviations identify viruses. The full names
of
the organisms are given in appendix B.1. Subspecies or strains are
specified
in parentheses. Chromosomal locations (genetic
or
cytogenetic loci, genomic map units, etc.) may appear in square
brackets
immediately following species codes. Many gene products are referred to
by abbreviations explained in appendix B.3. Alternative promoters are
identified
by right-justified "P" and a digit indicating the corresponding
initiation
site numbered sequentially from 5' to 3'. An optional "E" and digit
refers
to the corresponding 5'exons, if known. Identical numbers indicate
3'co-terminal
exons. The strongest initiation site is marked by a trailing "+" if known
(see also the List of alternative promoters).
The genome db codes currently used are 'EM' for the EMBL database, and
'EU' for genome contigs or chromosomal genome assemblies of the RefSeq
database.
The EMBL accession number always relates to the first EMBL
cross-reference.
This one is usually the longest promoter region except when the entry
is
cancelled from the EMBL database, but still present in GenBank or DDBJ.
The sequence type indicates whether the sequence is circular or
linear.
A sequence comprising exactly one repeat unit of a tandem repeat
cluster
is also considered circular. Note that the annotation as circular or
linear
sequences in EPD is not always in agreement with the corresponding
annotation
in EMBL.
The entry code is a five-digit number which is the only part of a
promoter
entry that is stable from release to release.
Alternative promoter identification code: Genes represented by multiple
promoter entries in EPD are assigned a promoter group number. The
corresponding initiation sites are numbered sequentially from 5' to 3'.
4.3.2. DO lines: Documentation
Documentation of promoter entries is presented on lines starting with
"DO".
They are essentially free format and so far not processed by specific
programs.
In the present release, there are two DO lines per entry, the first
referring to the transcript mapping experiments that define the
promoter, the second giving information about expression and regulation.
The various experimental techniques are identified by number codes. The
"Medline number" and/or "example" in brackets are linked, respectively,
to the abstract and/or to the full text article describing the related
experiment.
DNA sequencing of a full-length processed pseudogene (3584116)
8
Reverse direction primer extension with homologous sequence ladder:
length measurement of an in vitro synthesised DNA primed upstream of
the initiation site and blocked by the 5' end of the RNA hybridized to
the template (2451027)
Special characters appended to the number codes designate an
experimental
gene expression system where the RNA for the corresponding experiments
was synthesized.
*    RNA POL II in vitro system
o    injected amphibian oocytes
#    transfected or transformed cells, injected neurons
!    transgenic organisms
r    experiments performed with closely related gene
h    homologous sequence ladder used for length measurement of nuclease
     protection or primer extension product
l    low-precision data (error > +/- 5 bp)
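The evidence codes on the first DO line combine a method number with the special characters listed above, for example "4,4#,2l" in the c-myc entry. The following minimal sketch decodes such a string; the function name is hypothetical, and the method numbers are returned as plain integers because the full list of method codes is not reproduced in this section.

```python
import re

# Special characters appended to method number codes (Section 4.3.2).
QUALIFIERS = {
    "*": "RNA POL II in vitro system",
    "o": "injected amphibian oocytes",
    "#": "transfected or transformed cells, injected neurons",
    "!": "transgenic organisms",
    "r": "experiments performed with closely related gene",
    "h": "homologous sequence ladder used for length measurement",
    "l": "low-precision data (error > +/- 5 bp)",
}

CODE_RE = re.compile(r"^(\d+)(.*)$")


def parse_evidence(field):
    """Decode a comma-separated evidence string into a list of
    (method_number, [qualifier descriptions]) tuples."""
    decoded = []
    for token in field.split(","):
        match = CODE_RE.match(token.strip())
        if match is None:
            raise ValueError("evidence code must start with a number: %r" % token)
        unknown = [char for char in match.group(2) if char not in QUALIFIERS]
        if unknown:
            raise ValueError("unknown qualifier(s) %r in %r" % (unknown, token))
        qualifiers = [QUALIFIERS[char] for char in match.group(2)]
        decoded.append((int(match.group(1)), qualifiers))
    return decoded


for method, qualifiers in parse_evidence("4,4#,2l"):
    print(method, qualifiers)
```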
Explanations and additional conventions:
The full-length assumption of a cDNA clone or a processed pseudogene is
based on consistency with accompanying nuclease-protection or primer
extension data or, alternatively, the existence of multiple 5' coterminal
clones or pseudogenes.
The information on expression/regulation may include indication of
developmental
stages, tissues, cell types, cell cycle stages, and various regulatory
features. Conventions:
Semicolon delimits the two fields : expression and regulation.
Comma delimits alternative keywords (e.g. liver, kidney)
"+" means "induced by" or "strongly expressed in".
"-" means "repressed by" or "weakly expressed in".
"~" means "modulated by".
Cell cycle stages are given in square brackets.
4.3.3. RF line: Literature references
The first four references from the RN, RX, RA, RT and RL lines are
repeated in a highly condensed form. Each reference is spaced by 15
letters and indicates journal, volume, and starting page of the referred
article (maximal 14 letters). The journal codes are explained in
Appendix B.2.
They primarily point to the articles where the experimental promoter
evidence is presented. Additional potential subjects are homology to
other
promoters, gene expression and regulation, nomenclature. Papers
containing
only sequence data are usually not referred to because they are easy to
find via the corresponding EMBL sequence entry descriptions.
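Because each condensed reference occupies a fixed 15-character field, the references can be recovered by slicing. The Python sketch below assumes that the caller has already removed the line code and any leading padding, so that the first reference starts at offset 0 of the payload. The function name and the placeholder strings in the usage note are invented for illustration and are not real EPD journal codes.

```python
def split_rf_fields(payload, width=15, max_refs=4):
    """Cut an RF payload into fixed-width reference fields.

    payload  -- text of the RF line after the line code (assumed)
    width    -- field width in characters (15 according to the manual)
    max_refs -- at most four references are repeated on the RF line
    """
    fields = []
    for start in range(0, width * max_refs, width):
        field = payload[start:start + width].strip()
        if field:
            fields.append(field)
    return fields
```

With a payload built from two padded placeholders, "JRNL1 12:345" and "JRNL2 6:78", the function returns those two strings in order.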
4.3.4. Miscellaneous
Greek letters are sometimes represented by the corresponding Latin
letters followed by an apostrophe:
a' = alpha
b' = beta
g' = gamma
d' = delta
e' = epsilon
z' = zeta
h' = eta
th' = theta
k' = kappa
l' = lambda
n' = nu
r' = rho
Sub- and superscripts are sometimes indicated by preceding "_"
and "^",
respectively.
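The Greek-letter convention above can be reversed with a small substitution table. The Python sketch below is a minimal illustration: it only expands a code that starts at a word boundary, so that possessives such as "gene's" or notations such as "5'end" are left alone, and it tries "th'" before "h'". The function name and the example strings are assumptions; codes embedded in the middle of a word are deliberately not expanded.

```python
import re

# Latin code -> Greek letter name, as listed above.
GREEK = {
    "a": "alpha", "b": "beta", "g": "gamma", "d": "delta",
    "e": "epsilon", "z": "zeta", "h": "eta", "th": "theta",
    "k": "kappa", "l": "lambda", "n": "nu", "r": "rho",
}

# "th" is listed first so that "th'" is not read as "t" + "h'".
_GREEK_CODE = re.compile(r"\b(th|[abgdezhklnr])'")


def expand_greek(text):
    """Replace codes such as a' or th' by the Greek letter name."""
    return _GREEK_CODE.sub(lambda m: GREEK[m.group(1)], text)
```

For example, expand_greek("a'-globin") gives "alpha-globin", while "5'end" is returned unchanged.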
4.4. Distinct format of 'preliminary' entries in epd_bulk.dat
4.4.1. The title line:
TI epd83 Bulk Section Eukaryotic Promoter Database / Release 83 EP
4.4.2. The ID line
The identification line is always the first line of an entry. The form
of the ID line in 'epd_bulk.dat' is:
ID OS_bAAAA preliminary; undefined; TAXONOMIC DIVISION.
A unique entry identifier "OS_bAAAA" is constructed from the species
identification code ('OS'), of at most 4 alphanumeric characters
representing the biological source of the promoter, and a 'b' (for
bulk) followed by an arbitrary 4-letter code.
The "preliminary" data class field indicates that the entry has not
(yet) undergone all quality checks necessary for being classified as
"standard".
"undefined" is given as the initiation site type because the data are
insufficient to define transcription initiation patterns (Section 3).
The TAXONOMIC DIVISION codes are:
PLN for plants
NEM for nematodes
ART for arthropods
MLS for molluscs
ECH for echinoderms
VRT for vertebrates.
Note that these codes relate to the organism in which the promoter is
expressed,
not to the source organism in which the promoter is replicated as
defined
on the OS line.
The ID line is terminated by a period.
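The ID line layout described above lends itself to a regular-expression check. The Python sketch below is one possible reading of the rules: a species code of 1 to 4 alphanumeric characters, "_b", four letters, the fixed words "preliminary" and "undefined", one of the six division codes, and a closing period. The amount of whitespace between fields is not specified in this section and is therefore matched loosely; the function name and the example line are invented for illustration.

```python
import re

# Loose whitespace handling is an assumption; the field contents follow
# the description of the 'epd_bulk.dat' ID line.
_BULK_ID = re.compile(
    r"^ID\s+(?P<species>[A-Za-z0-9]{1,4})_b(?P<code>[A-Za-z]{4})\s+"
    r"preliminary;\s*undefined;\s*"
    r"(?P<division>PLN|NEM|ART|MLS|ECH|VRT)\.$"
)


def parse_bulk_id(line):
    """Return the ID line fields as a dict, or None if the line is invalid."""
    match = _BULK_ID.match(line.strip())
    return match.groupdict() if match else None
```

A line such as "ID   HS_bAAAA   preliminary; undefined; VRT." is accepted; the same line without the final period is rejected.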
4.4.3. The AC line
AC EP00001;
The accession number consists of the character string "EP"
followed
by 5 digits. Previously the first two digits of the AC designated the
release
number of initial appearance of the specific entry followed by the EPD
entry order. AC numbers in 'epd_bulk.dat' are continuous numbers,
excluding
ACs already used for entries in the main file 'epd.dat'.
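A matching check for the AC line is short. This Python sketch accepts the line code "AC", the string "EP" followed by exactly 5 digits, and the terminating semicolon shown in the example above; the whitespace handling and the function name are assumptions.

```python
import re

_AC_LINE = re.compile(r"^AC\s+(EP\d{5});$")


def parse_ac(line):
    """Return the accession number (e.g. 'EP00001') or None if malformed."""
    match = _AC_LINE.match(line.strip())
    return match.group(1) if match else None
```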
5 CLASSIFICATION
Starting from release 72, the classification system is no longer
maintained. New entries are presently added by default to an
'?Unclassified' category. The classification system might still provide
valuable information for entries added before release 72. However, for
any category, consider the possible existence of additional,
potentially corresponding EPD entries in the default categories.
The entries of the Eukaryotic Promoter Database are embedded in a
hierarchical classification system. A promoter's taxonomic location is
made clear by interspersed group headings. The example shown below is
taken from the top of the database. A contrasting format has been
chosen to emphasize the very different nature of this information.
A group heading consists of a series of node numbers and a title. The
highest classification level distinguishes between promoters active in
major eukaryotic taxa (phyla). Further below, grouping considers
replicon type and functional properties of gene products. On the lowest
level, homology (as defined in section 6) is the criterion. A survey of
the upper part of the classification pyramid is presented in
appendix A.
The proposed classification system has a highly tentative character as
it is often unclear how a new promoter should be classified, especially
if the gene product is a multifunctional protein. Users should
therefore not be surprised or discouraged if they don't find a promoter
at the initially expected place.
6 HOMOLOGOUS PROMOTERS
Homology is defined as sequence similarity due to common phylogenetic
origin.
In EPD, two promoters are considered homologous if they exhibit
>=50% sequence
similarity between -79 and +20. Similarity is calculated from optimal
alignments
generated with the aid of the UWGCG subroutine ShiftAlign (13)
using the following symbol comparison table:
       A     C     G     N     T
A     1.0   0.0   0.0   0.5   0.0
C           1.0   0.0   0.5   0.0
G                 1.0   0.5   0.0
N                       0.5   0.5
T                             1.0
Gap weight and gap length weight are specified as 3 and 0,
respectively.
Terminal gaps are ignored. Percent similarity is understood as
alignment
score divided by segment length, times 100. Groups of homologous
promoters
are identified by homology group numbers (see 4.2.1.). Definition of
these
groups is based on similarity scores as defined above and a tree
generation
method called UPGMA (14).
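To make the scoring rules above concrete, the following Python sketch computes the percent similarity of two sequences that have already been aligned (gaps written as "-"). It does not search for the optimal alignment and is not the ShiftAlign routine. It assumes the usual UWGCG convention that each gap costs gap weight plus gap length weight times the gap length, and it takes the segment length as an explicit argument. The function name and the example alignments are invented for illustration.

```python
# Upper triangle of the symbol comparison table given above.
_UPPER = [
    ("A", "A", 1.0), ("A", "C", 0.0), ("A", "G", 0.0), ("A", "N", 0.5),
    ("A", "T", 0.0), ("C", "C", 1.0), ("C", "G", 0.0), ("C", "N", 0.5),
    ("C", "T", 0.0), ("G", "G", 1.0), ("G", "N", 0.5), ("G", "T", 0.0),
    ("N", "N", 0.5), ("N", "T", 0.5), ("T", "T", 1.0),
]
SCORE = {}
for x, y, value in _UPPER:
    SCORE[(x, y)] = value
    SCORE[(y, x)] = value  # the table is symmetric


def percent_similarity(a, b, segment_length,
                       gap_weight=3.0, gap_length_weight=0.0):
    """Alignment score divided by segment length, times 100."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    columns = list(zip(a.upper(), b.upper()))
    # Terminal gaps are ignored: trim gapped columns at both ends.
    start, end = 0, len(columns)
    while start < end and "-" in columns[start]:
        start += 1
    while end > start and "-" in columns[end - 1]:
        end -= 1
    score = 0.0
    open_side = None  # which sequence currently carries an open gap
    for x, y in columns[start:end]:
        if x == "-" and y == "-":
            continue
        if x == "-" or y == "-":
            side = 0 if x == "-" else 1
            if side != open_side:
                score -= gap_weight  # charged once per gap
                open_side = side
            score -= gap_length_weight  # charged per gap position
        else:
            score += SCORE[(x, y)]
            open_side = None
    return 100.0 * score / segment_length
```

Under the rule stated in this section, two promoters would be considered homologous when the value computed over the compared segment is at least 50.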
In a few cases, similarities between 50% and 56% were ignored
if the protein sequences of the corresponding genes were not related.
Similarities
were also ignored between alternative promoter sequences that are
spaced
by less than 50 bp. A subset of "independent" promoters is
marked
by "+" in column 27 of the FP line. This set contains only one member
per
homology group (usually, the promoter with the longest upstream
sequence
available) and is intended to be used for statistical analysis of
functional
patterns where it is important to avoid bias by multiples of closely
related
sequences.
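Selecting the "independent" subset amounts to testing one character of each FP line. The Python sketch below assumes that columns are counted from 1, so that column 27 is index 26 of the line, and that FP lines start with the line code "FP". The function names are invented, and the lines in the usage note are synthetic placeholders rather than real EPD records.

```python
def is_independent(fp_line):
    """True if column 27 (1-based, assumed) of an FP line holds '+'."""
    return len(fp_line) >= 27 and fp_line[26] == "+"


def independent_subset(lines):
    """Keep only FP lines that are marked as independent promoters."""
    return [
        line for line in lines
        if line.startswith("FP") and is_independent(line)
    ]
```

For example, a synthetic line built as "FP".ljust(26) + "+" is kept, whereas "FP".ljust(26) + " " is dropped.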
7 PROMOTER SEQUENCE RETRIEVAL
Promoter sequence listings have not been incorporated into EPD for two
reasons: (i) to avoid duplication of data already existing elsewhere in
the EMBL data library, and (ii) to encourage usage of FPS-dependent
sequence retrieval programs, which enable users to specify suitable
5' and 3' boundaries of the requested sequence segments themselves.
Effort is under way to motivate producers of standard nucleotide
sequence analysis packages to provide such tools in the future. In the
meantime, users with some programming experience will find it easy to
write their own routines. Our local sequence extraction programs run in
a UWGCG environment (13) and have been implemented at several sites in
Europe and the United States. They are documented and freely available
on request.
8 REFERENCES
Bucher, P. & Trifonov, E.N., Compilation and analysis of eukaryotic
    POL II promoter sequences, Nucl. Acids Res. 14, 10009-10026
    (1986). (3808945)
Bucher, P. & Bryan, B., Signal search analysis: a new method to
    localize and characterize functionally important DNA sequences,
    Nucl. Acids Res. 12, 287-305 (1984). (6546421)
Stoesser, G., Tuli, M.A., Lopez, R. and Sterk, P., The EMBL nucleotide
    sequence database, Nucleic Acids Res., 27, 18-24 (1999). (9847133)
Sugawara, H., Miyazaki, S., Gojobori, T. and Tateno, Y., DNA Data Bank
    of Japan dealing with large-scale data submission, Nucleic Acids
    Res., 27, 25-28 (1999). (9847134)
Bairoch, A. and Apweiler, R., The SWISS-PROT protein sequence data bank
    and its supplement TrEMBL in 1999, Nucleic Acids Res., 27, 49-54
    (1999). (9847139)
Heinemeyer, T., Chen, X., Karas, H., Kel, A.E., Kel, O.V., Liebich, I.,
    Meinhardt, T., Reuter, I., Schacherer, F. and Wingender, E.,
    Expanding the TRANSFAC database towards an expert system of
    regulatory molecular mechanisms, Nucleic Acids Res., 27, 318-322
    (1999). (9847216)
The FlyBase consortium, The FlyBase database of the Drosophila genome
    projects and community literature, Nucleic Acids Res., 27, 85-88
    (1999). (9847148)
Pearson, P., Francomano, C., Foster, P., Bocchini, C., Li, P. and
    McKusick, V., The status of online Mendelian inheritance in man
    (OMIM) medio 1994, Nucleic Acids Res., 22, 3470-3473 (1994).
    (7937048)
Blake, J.A., Richardson, J.E., Davisson, M.T., Eppig, J.T. and the
    Mouse Genome Database Group, The Mouse Genome Database (MGD):
    genetic and genomic information about the laboratory mouse,
    Nucleic Acids Res., 27, 95-98 (1999). (9847150)
Suzuki, Y., Yamashita, R., Nakai, K. and Sugano, S., DBTSS: database of
    human transcriptional start sites and full-length cDNAs, Nucleic
    Acids Res., 30(1), 328-331 (2002). (11752328)
Devereux, J., Haeberli, P. & Smithies, O., A comprehensive set of
    sequence analysis programs for the VAX, Nucl. Acids Res. 12,
    387-395 (1984). (6546423)
Sneath, P.H.A. & Sokal, R.R., Numerical taxonomy, W.H. Freeman,
    San Francisco, London (1973).
Stapleton, M., Liao, G.C., Brokstein, P., Hong, L., Carninci, P.,
    Shiraki, T., Hayashizaki, Y., Champe, M., Pacleb, J., Wan, K.,
    Yu, C., Carlson, J., George, R., Celniker, S. and Rubin, G.M.,
    The Drosophila Gene Collection: Identification of Putative
    Full-Length cDNAs for 70% of D. melanogaster Genes, Genome Res.,
    12, 1294-1300 (2002). (12176937)
Schmid, C.D., Praz, V., Delorenzi, M., Périer, R. and Bucher, P., The
    Eukaryotic Promoter Database EPD: the impact of in silico primer
    extension, Nucleic Acids Res. 32, D82-5 (2004). (14681364)