Philipp Bucher and Vidhya Jagannathan,
Swiss Institute of Bioinformatics
and Swiss Institute for Experimental Cancer Research
Ch. des Boveresses 155
CH-1066 Epalinges s/Lausanne
Switzerland
This manual and the database it accompanies may be copied and
redistributed freely, without advance permission, provided that this statement
is reproduced with each copy.
Published research assisted by the HTPSELEX database should cite:
Jagannathan V, Roulet E, Delorenzi M, Bucher P.
HTPSELEX--a database of high-throughput SELEX libraries for transcription factor binding sites.
Nucleic Acids Res. Jan 1;34(Database issue):D90-4 (2006) (PMID:16381982)
The HTPSELEX database contains sets of in vitro selected transcription factor
binding site sequences obtained with the high-throughput SELEX (HTPSELEX) method described in Roulet et al. (Nat Biotechnol, 2002, 20:831).
In addition, the database contains binding sites obtained with the conventional SELEX method (Tuerk and Gold, Science, 1990, 249:505-510).
2. DATA DESCRIPTION
A complete SELEX experiment starts with a purified nucleic acid binding protein and terminates with a computational model of its binding specificity.
Each entry in the database corresponds to one HTP SELEX or conventional SELEX experiment.
For each HTPSELEX and SELEX experiment, the following details are recorded (where available for SELEX entries):
A detailed description of the experiment: nature of protein material, input DNA library, binding conditions
the original sequencing chromatograms
Phred/Phrap analysed clone sequences
the extracted binding site sequences
A hidden Markov model of the derived binding specificity, in two formats recognized by the programs decodeanhmm (developed by A. Krogh) and mamot (developed by M. Delorenzi).
3. FORMAT CONVENTIONS
HTPSELEX entries are presented in a format similar to that of EMBL and SWISS-PROT sequence entries.
3.1 Entry types and identifiers
The HTPSELEX database is distributed as three main flat files from our FTP server, each containing a collection of a particular entry type:
htpselex.doc contains experiment entries
htpselex.dat contains clone sequence entries
htpselex.seq contains extracted binding site sequences.
The trace files and binding models are available as compressed archives.
HTPSELEX entries have composite identifiers reflecting the hierarchical relationships between them. The components are alphanumeric strings separated by underscore characters. An experiment entry is identified by a short alphanumeric string, e.g. "NF1" for the CTF/NF1 experiment.
The clone sequence entries contain either a complete insert sequence or a partial sequence from the left or right. The latter occurs when the complete sequence of the insert could not be assembled from the sequencing reads. The clone sequence identifiers consist of the experiment ID, the cycle number, the clone number and optionally the sequencing direction (e.g. NF1_3_00001, NF1_3_0500_F).
The tag identifier consists of the experiment ID, cycle number, clone number, and tag serial number (e.g. NF1_3_00001_1).
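The identifier scheme above can be split mechanically into its components. The following is a minimal sketch, not part of the database distribution. It assumes that an optional fourth component made of digits is a tag serial number and one made of letters is a sequencing direction; only the direction "F" appears in this manual, and the names SelexId and parse_selex_id are illustrative.

```python
import re
from typing import NamedTuple, Optional


class SelexId(NamedTuple):
    experiment: str              # e.g. "NF1"
    cycle: int                   # SELEX cycle number
    clone: str                   # kept as text to preserve zero padding
    direction: Optional[str]     # sequencing direction of a partial clone sequence
    tag_serial: Optional[int]    # set only for tag identifiers


# Assumption: the optional fourth component is either all digits (tag serial
# number) or all letters (sequencing direction such as "F").
_ID_RE = re.compile(r"^([A-Za-z0-9]+)_(\d+)_(\d+)(?:_([A-Za-z]+|\d+))?$")


def parse_selex_id(identifier: str) -> SelexId:
    """Split a clone or tag identifier into its underscore-separated parts."""
    m = _ID_RE.match(identifier)
    if m is None:
        raise ValueError(f"not a clone or tag identifier: {identifier!r}")
    experiment, cycle, clone, suffix = m.groups()
    direction = suffix if suffix and suffix.isalpha() else None
    tag_serial = int(suffix) if suffix and suffix.isdigit() else None
    return SelexId(experiment, int(cycle), clone, direction, tag_serial)
```

For example, parse_selex_id("NF1_3_00001_1") yields experiment "NF1", cycle 3, clone "00001" and tag serial number 1.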
3.2. Experiments
Each line of an experiment entry begins with a two-character line code indicating the type of information contained in the line. The entry description is based on 28 fields. The current line types and line codes, in the order in which they appear in an entry, are shown below:
ID - Identification
EN - Entry name
DT - Date of creation
DE - Description
FN - Factor name
FC - Factor complex
FS - Factor source
RN - Reference number
RX - Reference hyperlink
RA - Reference authors
RT - Reference title
RL - Reference link
EX - Nature of DNA-protein binding experiment
NS - Sequence notation for input library, vector clip left, vector clip right, tag unit
SQ - Sequence
SX - SELEX library descriptor
XR - Database cross references
// - Termination line
Spacer lines (XX) are inserted in order to make the database easier to read by eye. Some line types occur many times in a single entry. Each entry must begin with an identification line (ID) and end with a terminator line (//). Text does not exceed column 72. Below is an example of an entry:
ID NF1; HTS; version 1.
XX
EN CTF/NF1
XX
DT 09-Aug-2005
XX
DE HTP SELEX for transcription factor CTF/NF1, 4 cycles
XX
FN transcription factor CTF/NF1
FC A2
FS recombinant protein; vaccinia system
XX
RN [1]
RX PUBMED; 12101405
RA Roulet E, Busso S, Camargo AA, Simpson AJ, Mermod N, Bucher P.
RT High-throughput SELEX SAGE method for quantitative modeling of
RT transcription-factor binding sites.
RL Nat Biotechnol. 2002 Aug;20(8):831-5.
XX
EX HTPSELEX
XX
NS input; 78 bp
SQ tccatctctt ctgtatgtcg agatctannn nnnnnnnnnn nnnnnnnnnn nntagatctc
SQ ctaaccgact ccgttaatt
NS vector left; 62 bp;
SQ ggccgccagt gtgatggata tctgcagaat tccagcacac tggcggccgt tactagtgga
NS tag unit; 33 bp;
SQ tctannnnnn nnnnnnnnnn nnnnnnnnnt aga
NS vector right; 62 bp;
SQ tccgagctcg gtaccaagct tgatgcatag cttgagtatt ctatagtgtc acctaaatag
XX
SX Cycle 0; R25_0; 467 traces; 425 clones; 854 hq-tags
SX Cycle 1; NF1_1; 479 traces; 402 clones; 955 hq-tags
SX Cycle 2; NF1_2; 467 traces; 367 clones; 1203 hq-tags
SX Cycle 3; NF1_3; 1924 traces; 1425 clones; 5579 hq-tags
SX Cycle 4; NF1_4; 315 traces; 253 clones; 309 hq-tags
XX
XR gene A; HGNC; 7784; NF1A.
XR protein A; Uniprot/Swissprot; Q12857[1 ..399]; NF1A_HUMAN
XR input; HTPSELEX:R25
XR restriction endonuclease; REBASE:261; BglII (5' A|GATCT 3' TCTAG|A).
XR sequencing vector; pZERO-2T;EMBL:Y10545; ECY10545
XX
//
3.2.1 The ID line
The identification line is always the first line of an entry. The general form of the ID line is:
ID exp_type; version no.
exp_type defines whether the library was obtained using the HTP SELEX (HTS) or the conventional SELEX (SEL) method.
The ID line is terminated by a period.
3.2.2 The DE line
DE HTP SELEX for transcription factor CTF/NF1, 4 cycles
The description line is free format and gives general information about the entry.
3.2.3 The FN line
FN transcription factor CTF/NF1
The line describes the transcription factor used for the SELEX experiment.
3.2.4 The FC line
FC A2
The FC line gives the factor complex, which describes the composition of the protein complex that binds the DNA.
For example, if the protein binds to the DNA as a dimer, the factor complex is given as A2; if it binds as a monomer, it is given as A;
if it binds as a heterodimer, it is given as AB.
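The notation can be interpreted programmatically. The sketch below is an assumption-laden illustration: the manual defines only A, A2 and AB, so the generalisation to other letters and counts, and the function names, are hypothetical.

```python
import re
from collections import Counter


def parse_factor_complex(fc: str) -> Counter:
    """Return subunit counts, e.g. 'A2' -> {'A': 2}, 'AB' -> {'A': 1, 'B': 1}."""
    # Assumption: one upper-case letter per subunit type, optionally
    # followed by a copy number.
    if not re.fullmatch(r"(?:[A-Z]\d*)+", fc):
        raise ValueError(f"unrecognised factor complex: {fc!r}")
    counts: Counter = Counter()
    for letter, number in re.findall(r"([A-Z])(\d*)", fc):
        counts[letter] += int(number) if number else 1
    return counts


def describe_complex(fc: str) -> str:
    """Name the complex described by an FC value."""
    counts = parse_factor_complex(fc)
    total = sum(counts.values())
    if total == 1:
        return "monomer"
    prefix = "homo" if len(counts) == 1 else "hetero"
    names = {2: "dimer", 3: "trimer", 4: "tetramer"}
    return prefix + names.get(total, "multimer")
```

With the values documented above, describe_complex returns "monomer" for A, "homodimer" for A2 and "heterodimer" for AB.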
3.2.5 The FS line
The FS line describes the source of the factor. The format of the line is as follows:
FS protein_type; production system
FS recombinant protein; vaccinia system
3.2.6 The NS line
Multiple NS lines describe features of the SELEX cycle.
The general features given in this block are:
length of the input random sequence, in base pairs.
length of the left clip of the read, in base pairs, based on the vector sequence.
length of the tag unit, where a tag unit is a single binding site.
length of the right clip of the read, in base pairs, based on the vector sequence.
3.2.7 The SQ line blocks
Each SQ line block contains the sequence of the feature described in the preceding NS line.
The sequences are given in EMBL format: 60 nucleotides per line, in blocks of 10 nucleotides separated by blanks.
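The layout of 60 nucleotides per line in blocks of 10 is easy to undo and to reproduce. The helper names below are hypothetical; the functions operate on the line contents after the SQ line code has been removed.

```python
from typing import Iterable, List


def join_sq_lines(sq_lines: Iterable[str]) -> str:
    """Concatenate SQ line contents into one sequence without blanks."""
    return "".join("".join(line.split()) for line in sq_lines)


def format_sq(sequence: str, line_len: int = 60, block_len: int = 10) -> List[str]:
    """Lay a sequence out as lines of line_len bases in blocks of block_len."""
    lines = []
    for start in range(0, len(sequence), line_len):
        chunk = sequence[start:start + line_len]
        blocks = [chunk[i:i + block_len] for i in range(0, len(chunk), block_len)]
        lines.append(" ".join(blocks))
    return lines
```

Applied to the tag unit of the NF1 entry, join_sq_lines(["tctannnnnn nnnnnnnnnn nnnnnnnnnt aga"]) returns a 33-character string, matching the length stated on the corresponding NS line, and format_sq restores the original line.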
3.2.8 The SX lines
The SX lines give a complete description of each cycle of the HTP SELEX procedure.
Each line gives the number of traces, clones and high-quality tags (hq-tags) that are available
for one cycle, e.g.
SX Cycle 3; NF1_3; 1924 traces; 1425 clones; 5579 hq-tags
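The per-cycle counts can be extracted with a regular expression. This is a sketch modelled on the SX lines of the NF1 example entry; the names CycleStats and parse_sx are hypothetical, and the function expects the line content without the SX line code.

```python
import re
from typing import NamedTuple


class CycleStats(NamedTuple):
    cycle: int
    library: str   # library identifier, e.g. "NF1_3"
    traces: int
    clones: int
    hq_tags: int


_SX_RE = re.compile(
    r"Cycle\s+(\d+);\s*([^;\s]+);\s*(\d+)\s+traces;"
    r"\s*(\d+)\s+clones;\s*(\d+)\s+hq-tags"
)


def parse_sx(content: str) -> CycleStats:
    """Parse the content of one SX line into its five fields."""
    m = _SX_RE.fullmatch(content.strip())
    if m is None:
        raise ValueError(f"unrecognised SX line: {content!r}")
    cycle, library, traces, clones, hq_tags = m.groups()
    return CycleStats(int(cycle), library, int(traces), int(clones), int(hq_tags))
```

For instance, parse_sx("Cycle 3; NF1_3; 1924 traces; 1425 clones; 5579 hq-tags") gives cycle 3 of library NF1_3 with 1924 traces, 1425 clones and 5579 hq-tags.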
The XR lines are cross-references to other databases. Links to Uniprot/Swiss-prot, EMBL, HGNC and REBASE are currently provided. The format of each line depends on the target database.
The sequence data is represented in an EMBL-like format. Each sequence entry starts with an identifier line ("ID") followed by further annotation lines.
3.3.1 Clone insert sequences
The clone insert sequences are obtained after Phred/Phrap analysis of the trace files.
The start of the sequence is marked by a line starting with "SQ" and the end of the sequence is marked by two slashes ("//").
3.3.1.1. The ID line
The ID line has the format
ID cloneID standard; DNA; UNC; sequence length BP.
cloneID: stable identifier, consisting of alphanumeric characters, describing the transcription factor used
in the HTP SELEX experiment. All letters are in upper case.
standard: entries which are complete to the standards described in this manual.
UNC: unclassified division, according to the EMBL database divisions.
sequence length: the total number of bases in the sequence.
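A reader for this ID line might look like the sketch below. The manual gives no example of a clone sequence ID line, so the line used in the usage note, including its length of 612, is a made-up illustration that follows the stated format, and the function name is hypothetical.

```python
import re
from typing import Tuple

# Fixed fields "standard", "DNA" and "UNC" are taken from the format above.
_CLONE_ID_RE = re.compile(
    r"^ID\s+(\S+)\s+standard;\s*DNA;\s*UNC;\s*(\d+)\s+BP\.$"
)


def parse_clone_id_line(line: str) -> Tuple[str, int]:
    """Return (cloneID, sequence length) from a clone sequence ID line."""
    m = _CLONE_ID_RE.match(line.strip())
    if m is None:
        raise ValueError(f"unrecognised ID line: {line!r}")
    return m.group(1), int(m.group(2))
```

Under these assumptions, parse_clone_id_line("ID NF1_3_00002 standard; DNA; UNC; 612 BP.") returns the clone identifier "NF1_3_00002" and the length 612.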
3.3.1.2. Database cross-references
The DR line gives cross-references to the trace files used in the construction of the sequence entries.
The cross-links are currently internal; in the future they will point to the NCBI Trace Archive.
3.3.1.3. Features Header
The FH (Feature Header) lines are present only to improve readability of an entry when it is printed or displayed on a terminal screen. The lines contain no data and may be ignored by computer programs. The format of these lines is always the same:
FH Key Location/Qualifiers
FH
The first line provides column headings for the feature table, and the second line serves as a spacer. If an entry contains no feature table (i.e. no FT lines - see below), the FH lines will not appear.
3.3.1.4. Features Table
The Feature table gives information about the tags in the insert sites.
Example
FT misc_binding 486..519
FT /bound_moiety ="CTF/NF1"
FT /label="NF1_3_00002_1"
FT /note="Base quality score is 3.2218e-08"
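The feature table can be read into simple records. The following sketch handles the layout shown in the example (a key and location line followed by qualifier lines, each qualifier on a single line); it is not a full EMBL feature table parser, and the function name is hypothetical.

```python
import re
from typing import Any, Dict, Iterable, List


def parse_features(ft_lines: Iterable[str]) -> List[Dict[str, Any]]:
    """Collect features with key, start, end and a dict of qualifiers."""
    features: List[Dict[str, Any]] = []
    for line in ft_lines:
        content = line[2:].strip()  # drop the "FT" line code
        if content.startswith("/"):
            # Qualifier line; blanks around "=" are tolerated.
            m = re.match(r'/(\w+)\s*=\s*"(.*)"$', content)
            if m and features:
                features[-1]["qualifiers"][m.group(1)] = m.group(2)
        elif content:
            parts = content.split(None, 1)
            if len(parts) != 2:
                raise ValueError(f"unrecognised feature line: {line!r}")
            start, end = (int(x) for x in parts[1].strip().split(".."))
            features.append(
                {"key": parts[0], "start": start, "end": end, "qualifiers": {}}
            )
    return features
```

For the example above this produces one misc_binding feature spanning positions 486 to 519, with the qualifiers bound_moiety, label and note. The tag identifier is then available as the label qualifier.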
3.3.1.5. Tag sequences
The sequence line gives the total number of bases as well as the count of each base.
Jagannathan V, Roulet E, Delorenzi M, Bucher P.
HTPSELEX--a database of high-throughput SELEX libraries for transcription factor binding sites.
Nucleic Acids Res. Jan 1;34(Database issue):D90-4 (2006) (PMID: 16381982).
Roulet E, Busso S, Camargo AA, Simpson AJ, Mermod N, Bucher P.
High-throughput SELEX SAGE method for quantitative modeling of transcription-factor binding sites.
Nat Biotechnol. 20, 831-835 (2002) (PMID: 12101405).