HTPSELEX DATABASE

USER MANUAL

Release 1, Aug 2005

Written by

Philipp Bucher and Vidhya Jagannathan, Swiss Institute of Bioinformatics
and Swiss Institute for Experimental Cancer Research
Ch. des Boveresses 155
CH-1066 Epalinges s/Lausanne
Switzerland

Electronic mail via:

Home page /htpselex/

This manual and the database it accompanies may be copied and redistributed freely, without advance permission, provided that this statement is reproduced with each copy.

Published research assisted by the HTPSELEX database should cite:
Jagannathan V, Roulet E, Delorenzi M, Bucher P. HTPSELEX--a database of high-throughput SELEX libraries for transcription factor binding sites. Nucleic Acids Res. Jan 1;34(Database issue):D90-4 (2006) (PMID:16381982)


CONTENTS


1. INTRODUCTION
2. DATA DESCRIPTION
3. FORMAT CONVENTIONS
4. REFERENCES


1. INTRODUCTION

The HTPSELEX database contains sets of in vitro selected transcription factor binding site sequences obtained with the high-throughput SELEX (HTPSELEX) method described in Roulet et al. (Nat Biotechnol, 2002, 20:831). The database also contains binding sites obtained with the conventional SELEX method (Tuerk and Gold, Science, 1990, 249:505-510).


2. DATA DESCRIPTION

A complete SELEX experiment starts with a purified nucleic acid binding protein and terminates with a computational model of its binding specificity. Each entry in the database corresponds to one HTP SELEX or conventional SELEX experiment. For each HTPSELEX and SELEX experiment the following details are recorded (if available, in the case of SELEX entries):

3. FORMAT CONVENTIONS


HTPSELEX entries are presented in a format similar to that of EMBL and SWISS-PROT sequence entries.

3.1 Entry types and identifiers

The HTPSELEX database is distributed as three main flat files from our FTP server, each containing a collection of a particular entry type. The trace files and binding models are available as compressed archives.

HTPSELEX entries have composite identifiers reflecting the hierarchical relationships between them. The components are alphanumeric strings separated by underscore characters. An experiment entry is identified by a short alphanumeric string, e.g. "NF1" for the CTF/NF1 experiment.

The clone sequence entries contain either a complete insert sequence or a partial sequence from the left or right. The latter occurs when the complete sequence of the insert could not be assembled from the sequencing reads. The clone sequence identifiers consist of the experiment ID, the cycle number, the clone number and optionally the sequencing direction (e.g. NF1_3_00001, NF1_3_0500_F). The tag identifier consists of the experiment ID, cycle number, clone number, and tag serial number (e.g. NF1_3_00001_1).
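The identifier scheme described above can be illustrated with a short Python sketch. It is not part of the database distribution. The names SelexId and parse_identifier are hypothetical, and the code assumes that experiment IDs never contain an underscore and that a numeric fourth component is a tag serial number while a non-numeric one (such as F) is a sequencing direction. Accepting a two-component string such as NF1_3 (experiment plus cycle) is likewise an assumption, modelled on the library names in the SX lines.

```python
from typing import NamedTuple, Optional


class SelexId(NamedTuple):
    """Components of a composite HTPSELEX identifier (hypothetical helper)."""
    experiment: str
    cycle: Optional[int] = None
    clone: Optional[str] = None      # kept as text to preserve leading zeros
    direction: Optional[str] = None  # e.g. "F"
    tag: Optional[int] = None


def parse_identifier(identifier: str) -> SelexId:
    """Split an identifier such as NF1_3_00001_1 on underscores."""
    parts = identifier.split("_")
    if not 1 <= len(parts) <= 4 or not all(parts):
        raise ValueError(f"unexpected identifier: {identifier!r}")
    experiment = parts[0]
    cycle = int(parts[1]) if len(parts) > 1 else None
    clone = parts[2] if len(parts) > 2 else None
    direction = tag = None
    if len(parts) == 4:
        # Assumption: digits mean a tag serial number,
        # anything else is a sequencing direction.
        if parts[3].isdigit():
            tag = int(parts[3])
        else:
            direction = parts[3]
    return SelexId(experiment, cycle, clone, direction, tag)
```

For instance, parse_identifier("NF1_3_00001_1") yields experiment NF1, cycle 3, clone 00001 and tag 1, so that a tag can be traced back to its clone and experiment.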

3.2. Experiments

Each line of an experiment entry begins with a two-character line code indicating the type of information contained in the line. The entry description is based on 28 fields. The current line types and line codes, and the order in which they appear in an entry, are shown below:

ID  - Identification
EN  - Entry name
DT  - Date of creation
DE  - Description
FN  - Factor name
FC  - Factor complex
FS  - Factor source
RN  - Reference number
RX  - Reference hyperlink
RA  - Reference authors
RT  - Reference title
RL  - Reference link
EX  - Nature of DNA-protein binding experiment
NS  - Sequence notation for input library, vector clip left,
      vector clip right, tag unit
SQ  - Sequence
SX  - SELEX library descriptor
XR  - Database cross references
//  - Termination line

Spacer lines (XX) are inserted in order to make the database easier to read by eye. Some line types occur many times in a single entry. Each entry must begin with an identification line (ID) and end with a terminator line (//). Text does not exceed column 72. Below is an example of an entry:

ID   NF1; HTS; version 1.
XX
EN   CTF/NF1
XX
DT   09-Aug-2005
XX
DE   HTP SELEX for transcription factor CTF/NF1, 4 cycles
XX
FN   transcription factor CTF/NF1
FC   A2
FS   recombinant protein; vaccinia system
XX
RN   [1]
RX   PUBMED; 12101405
RA   Roulet E, Busso S, Camargo AA, Simpson AJ, Mermod N, Bucher P.
RT   High-throughput SELEX SAGE method for quantitative modeling of
RT   transcription-factor binding sites.
RL   Nat Biotechnol. 2002 Aug;20(8):831-5.
XX
EX   HTPSELEX
XX
NS   input; 78 bp
SQ   tccatctctt ctgtatgtcg agatctannn nnnnnnnnnn nnnnnnnnnn nntagatctc
SQ   ctaaccgact ccgttaatt
NS   vector left; 62 bp;
SQ   ggccgccagt gtgatggata tctgcagaat tccagcacac tggcggccgt tactagtgga
NS   tag unit; 33 bp;
SQ   tctannnnnn nnnnnnnnnn nnnnnnnnnt aga
NS   vector right; 62 bp;
SQ   tccgagctcg gtaccaagct tgatgcatag cttgagtatt ctatagtgtc acctaaatag
XX
SX   Cycle 0; R25_0; 467 traces; 425 clones; 854 hq-tags
SX   Cycle 1; NF1_1; 479 traces; 402 clones; 955 hq-tags
SX   Cycle 2; NF1_2; 467 traces; 367 clones; 1203 hq-tags
SX   Cycle 3; NF1_3; 1924 traces; 1425 clones; 5579 hq-tags
SX   Cycle 4; NF1_4; 315 traces; 253 clones; 309 hq-tags
XX
XR   gene A; HGNC; 7784; NF1A.
XR   protein A; Uniprot/Swissprot; Q12857[1 ..399]; NF1A_HUMAN
XR   input; HTPSELEX:R25
XR   restriction endonuclease; REBASE:261; BglII (5' A|GATCT 3' TCTAG|A).
XR   sequencing vector; pZERO-2T;EMBL:Y10545; ECY10545
XX
//
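
As an illustration of the line-code layout, the following Python sketch reads entries from a flat file and groups the line contents by line code. It is a minimal reader written for this manual, not a tool shipped with the database, and the function name read_entries is hypothetical. It skips XX spacer lines and treats // as the end of an entry. Note that grouping by code discards the interleaving of NS and SQ lines; a fuller parser would keep those in order.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List


def read_entries(lines: Iterable[str]) -> Iterator[Dict[str, List[str]]]:
    """Yield one dict per entry, mapping line code -> list of line contents."""
    entry: Dict[str, List[str]] = defaultdict(list)
    for raw in lines:
        line = raw.rstrip("\n")
        code, content = line[:2], line[2:].strip()
        if code == "//":              # terminator: emit the finished entry
            yield dict(entry)
            entry = defaultdict(list)
        elif code == "XX" or not line.strip():
            continue                  # spacer or blank line
        else:
            entry[code].append(content)


# Abridged copy of the example entry above.
sample = """\
ID   NF1; HTS; version 1.
XX
EN   CTF/NF1
XX
RT   High-throughput SELEX SAGE method for quantitative modeling of
RT   transcription-factor binding sites.
XX
SX   Cycle 0; R25_0; 467 traces; 425 clones; 854 hq-tags
SX   Cycle 1; NF1_1; 479 traces; 402 clones; 955 hq-tags
//
""".splitlines()

entries = list(read_entries(sample))
```

Line types that occur several times, such as RT or SX, end up as lists with one item per line, so a multi-line title can be rebuilt with " ".join(entries[0]["RT"]).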

3.2.1 The ID line

The identification line is always the first line of an entry. The general form of the ID line is:
  ID   entry_ID; exp_type; version no.
The ID line is terminated by a period. In the example entry above, the ID line reads "ID   NF1; HTS; version 1."

3.2.2 The DE line

DE   HTP SELEX for transcription factor CTF/NF1, 4 cycles
The description line is free format and gives general information about the entry.

3.2.3 The FN line

FN   transcription factor CTF/NF1
The line describes the transcription factor used for the SELEX experiment.

3.2.4 The FC line

FC   A2
The FC line gives the factor complex, which describes the composition of the protein complex bound to the DNA. For example, if the protein binds to the DNA as a dimer, the factor complex is given as A2; if it binds as a monomer, as A; and if it binds as a heterodimer, as AB.
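
The notation can be read mechanically, as the following Python sketch suggests. The function name parse_factor_complex is hypothetical, and the code assumes that every subunit is a single upper-case letter optionally followed by a copy number. Only A, A2 and AB are documented above; anything beyond those three forms is an extrapolation.

```python
import re
from typing import Dict


def parse_factor_complex(fc: str) -> Dict[str, int]:
    """Map each subunit letter to its copy number, e.g. "A2" -> {"A": 2}."""
    # Assumption: the value is a run of upper-case letters,
    # each optionally followed by digits.
    if not re.fullmatch(r"(?:[A-Z]\d*)+", fc):
        raise ValueError(f"unexpected factor complex: {fc!r}")
    counts: Dict[str, int] = {}
    for letter, number in re.findall(r"([A-Z])(\d*)", fc):
        counts[letter] = counts.get(letter, 0) + int(number or 1)
    return counts
```

With this reading, A2 gives one subunit type present twice, and AB gives two subunit types present once each.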

3.2.5 The FS line

The FS line describes the source of the factor. The format of the line is as follows:
FS   protein_type; production system
FS   recombinant protein; vaccinia system

3.2.6 The NS line

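Multiple NS lines give a description of features of the SELEX cycle. Each NS line names a feature and gives its length in bp. The general features given in this block are the input library, the left vector clip, the tag unit and the right vector clip.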

3.2.7 The SQ line blocks

The SQ line blocks contain the sequence of the feature described in the preceding NS line.

NS   input; 78 bp
SQ   tccatctctt ctgtatgtcg agatctannn nnnnnnnnnn nnnnnnnnnn nntagatctc
The sequences are given in EMBL format, with 60 nucleotides per line in blocks of 10 nucleotides.
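
The layout of 60 nucleotides per line in blocks of 10 can be sketched in Python as follows. Both helper names (join_sq_lines and format_sq_lines) are hypothetical, and the five-character "SQ" prefix followed by three spaces is copied from the example entry. The tag unit sequence from that entry serves as test data.

```python
def join_sq_lines(sq_lines):
    """Concatenate the contents of SQ lines into one ungapped sequence."""
    # Drop the leading "SQ" token, then remove the spaces between blocks.
    return "".join("".join(line.split()[1:]) for line in sq_lines)


def format_sq_lines(sequence, per_line=60, block=10):
    """Render a sequence as SQ lines: 60 bases per line in blocks of 10."""
    lines = []
    for start in range(0, len(sequence), per_line):
        chunk = sequence[start:start + per_line]
        blocks = [chunk[i:i + block] for i in range(0, len(chunk), block)]
        lines.append("SQ   " + " ".join(blocks))
    return lines


# Tag unit from the example entry (declared as 33 bp in its NS line).
tag_unit = join_sq_lines(["SQ   tctannnnnn nnnnnnnnnn nnnnnnnnnt aga"])
```

Joining the SQ lines first makes it straightforward to compare the actual sequence length with the length declared in the NS line.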

3.2.8 The SX lines

The SX lines describe each cycle of the HTP SELEX procedure. Each line gives the number of traces, clones and high-quality tags (hq-tags) available for the cycle. For example:

SX   Cycle 1; NF1_1; 479 traces; 402 clones; 955 hq-tags
The line also hyperlinks to actual data.
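
Because the SX line has a fixed field order, it can be parsed with a single regular expression. The Python sketch below is illustrative only: the names CycleStats, SX_PATTERN and parse_sx_line are hypothetical, and the pattern assumes exactly the wording and punctuation of the example line.

```python
import re
from typing import NamedTuple


class CycleStats(NamedTuple):
    cycle: int
    library: str
    traces: int
    clones: int
    hq_tags: int


# Assumption: every SX line follows the layout of the example exactly.
SX_PATTERN = re.compile(
    r"SX\s+Cycle (\d+); (\S+); (\d+) traces; (\d+) clones; (\d+) hq-tags"
)


def parse_sx_line(line: str) -> CycleStats:
    """Extract cycle number, library ID and counts from one SX line."""
    match = SX_PATTERN.fullmatch(line.strip())
    if match is None:
        raise ValueError(f"not an SX line: {line!r}")
    cycle, library, traces, clones, hq_tags = match.groups()
    return CycleStats(int(cycle), library, int(traces), int(clones), int(hq_tags))


stats = parse_sx_line("SX   Cycle 1; NF1_1; 479 traces; 402 clones; 955 hq-tags")
```

The resulting tuple gives direct access to the counts, for example stats.hq_tags for the number of high-quality tags in the cycle.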

3.2.9 The XR lines

The XR lines are cross-links to other databases. We have incorporated links to Uniprot/Swiss-Prot, EMBL, HGNC and REBASE. The format of the lines depends on the target database.

XR   gene A; HGNC; 7784; NF1A.
XR   protein A; Uniprot/Swissprot; Q12857[1 ..399]; NF1A_HUMAN
XR   input; HTPSELEX:R25
XR   restriction endonuclease; REBASE:261; BglII (5' A|GATCT 3' TCTAG|A).
XR   sequencing vector; pZERO-2T;EMBL:Y10545; ECY10545

3.3. Sequence data

The sequence data are represented in an EMBL-like format. Each sequence entry starts with an identifier line ("ID") followed by further annotation lines.

3.3.1 Clone insert sequences

The clone insert sequences are obtained by Phred/Phrap analysis of the trace files. The start of the sequence is marked by a line starting with "SQ" and the end of the sequence is marked by two slashes ("//").

3.3.1.1. The ID line

The ID line has the format:

ID   cloneID   standard; DNA; UNC; sequence length BP.
  • cloneID: stable identifier, consisting of alphanumeric characters, describing the transcription factor used in the HTP SELEX experiment. All letters are in upper case.
  • standard: marks entries that are complete to the standards described in this manual.
  • UNC: the Unclassified division, following the EMBL database divisions.
  • sequence length: the total number of bases in the sequence.

3.3.1.2. Database cross-references

The DR line gives cross-references to the trace files used in the construction of the sequence entries. For now these cross-links are internal; in future they will point to the NCBI Trace Archive.

3.3.1.3. Features Header

The FH (Feature Header) lines are present only to improve readability of an entry when it is printed or displayed on a terminal screen. The lines contain no data and may be ignored by computer programs. The format of these lines is always the same:

FH   Key             Location/Qualifiers
FH
The first line provides column headings for the feature table, and the second line serves as a spacer. If an entry contains no feature table (i.e. no FT lines - see below), the FH lines will not appear.

3.3.1.4. Features Table

The feature table gives information about the tags in the insert sites.
Example:

FT   misc_binding   486..519
FT                  /bound_moiety ="CTF/NF1"
FT                  /label="NF1_3_00002_1"
FT                  /note="Base quality score is 3.2218e-08"
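
The feature lines above can be turned into a small record, as in this Python sketch. The function name parse_feature is hypothetical. The code assumes a simple start..end location and qualifiers of the form /name="value" on a single line each, which is all that the example shows.

```python
import re


def parse_feature(ft_lines):
    """Parse the FT lines of one feature into key, location and qualifiers."""
    # First line: "FT", the feature key, then the location.
    key, location = ft_lines[0].split()[1:3]
    # Assumption: the location is always a plain start..end range.
    start, end = (int(x) for x in location.split(".."))
    qualifiers = {}
    for line in ft_lines[1:]:
        # Tolerates optional spaces around "=", as in the bound_moiety line.
        match = re.match(r'FT\s+/(\w+)\s*=\s*"(.*)"\s*$', line)
        if match:
            qualifiers[match.group(1)] = match.group(2)
    return {"key": key, "start": start, "end": end, "qualifiers": qualifiers}


feature = parse_feature([
    'FT   misc_binding   486..519',
    'FT                  /bound_moiety ="CTF/NF1"',
    'FT                  /label="NF1_3_00002_1"',
    'FT                  /note="Base quality score is 3.2218e-08"',
])
```

The label qualifier carries the tag identifier, so feature["qualifiers"]["label"] links the feature back to the tag entry.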
    
    

3.3.1.5. Tag sequences

The sequence line gives the total number of bases as well as the count of each base.
Example:

SQ   Sequence 762 BP;    210 A; 170 C; 187 G; 195 T; 0 other;
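
The counts on this line should add up to the declared length, which gives a simple consistency check. The Python sketch below reads the example line; the function name parse_sq_header is hypothetical and the pattern assumes the field wording shown in the example.

```python
import re


def parse_sq_header(line):
    """Return (declared_length, composition) from an SQ summary line."""
    length = int(re.search(r"Sequence (\d+) BP;", line).group(1))
    # Each count is followed by a base letter or the word "other".
    composition = {
        base: int(count)
        for count, base in re.findall(r"(\d+) (A|C|G|T|other);", line)
    }
    return length, composition


length, composition = parse_sq_header(
    "SQ   Sequence 762 BP;    210 A; 170 C; 187 G; 195 T; 0 other;"
)
```

For the example line, the four base counts and the "other" count sum to the declared 762 bases.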

4. REFERENCES

1. Jagannathan V, Roulet E, Delorenzi M, Bucher P. HTPSELEX--a database of high-throughput SELEX libraries for transcription factor binding sites. Nucleic Acids Res. 2006 Jan 1;34(Database issue):D90-4 (PMID: 16381982).

2. Roulet E, Busso S, Camargo AA, Simpson AJ, Mermod N, Bucher P. High-throughput SELEX SAGE method for quantitative modeling of transcription-factor binding sites. Nat Biotechnol. 2002 Aug;20(8):831-5 (PMID: 12101405).

Last update by Rouayda Cavin Périer, October 2012