This document provides a technical description of the transcription start site assembly pipeline that was used to generate EPDnew version 002 for A. thaliana.
Promoter collection:
Name | Genome Assembly | Promoters | Genes | PMID | Access data | ||
---|---|---|---|---|---|---|---|
TAIR10 Genes | Feb 2011 TAIR10/araTha1 | 31615 | 27149 | 26201819 | SOURCE | DOC | DATA |
Experimental data:
Name | Type | Samples | Tags | PMID | Access data | ||
---|---|---|---|---|---|---|---|
Morton et al., 2014 | PEAT | 1 | 22,578,668 | 25035402 | SOURCE | DOC | DATA |
Cumbie et al., 2015 | NanoCAGE-XL | 3 | 226,177,104 | 26268438 | SOURCE | DOC | DATA |
Primary annotation data was downloaded from TAIR on 06-02-2015.
Gene annotations downloaded from TAIR did not contain direct links to RefSeq IDs. For this reason, RefSeq IDs were parsed from NCBI RefSeq files.
A total of 31615 promoters were selected. The TAIR10 TSS collection is stored as a tab-delimited text file conforming to the SGA format under the name:
Data was imported from the lab web page or GEO in BAM or SRA file formats. Please refer to the "Source Data" table at the beginning of this document for the links to raw data archives. A detailed guide on how to import, map and convert these samples can be found in the corresponding "MGA doc" files.
For the 4 samples present, peak calling was carried out using the ChIP-Peak on-line tool with the following parameters:
The 4 samples were then processed individually in a pipeline that validates transcription start sites with mRNA peaks. A TAIR10 TSS was considered experimentally confirmed if a CAGE peak lay within a 100 bp window around it. The validated TSS was then shifted to the nearest base with the highest tag density.
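The validation rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the pipeline's actual code: the function name `validate_tss`, the representation of peaks as a position-to-tag-count mapping, the choice to centre the 100 bp window on the annotated TSS, and the tie-breaking rule are all hypothetical.

```python
from typing import Dict, Optional


def validate_tss(tss: int, peaks: Dict[int, int], window: int = 100) -> Optional[int]:
    """Return the shifted TSS position, or None if no peak supports the TSS.

    `peaks` maps a genomic position (same chromosome and strand as the TSS)
    to its tag count. Assumption: the window is centred on the annotated TSS,
    so a peak qualifies if it lies within window // 2 bases of it.
    """
    half = window // 2
    candidates = [
        (pos, count) for pos, count in peaks.items() if abs(pos - tss) <= half
    ]
    if not candidates:
        return None  # TSS is not experimentally confirmed in this sample

    # Pick the highest tag count; break ties by distance to the annotated
    # TSS, then by lower coordinate (tie-breaking is an assumption).
    best_pos, _ = min(candidates, key=lambda pc: (-pc[1], abs(pc[0] - tss), pc[0]))
    return best_pos


# Made-up example: two peaks tie on tag count, the nearer one is chosen.
example_peaks = {1000: 5, 1010: 40, 1030: 40, 1200: 90}
shifted = validate_tss(1005, example_peaks)
```

In this made-up example the peak at position 1200 has the most tags but falls outside the window, so it is ignored.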
Each sample in the dataset was used to generate a separate promoter collection. Potentially, the same transcript could be validated by multiple samples and it could have different start sites in different samples. To avoid redundancy, the individual collections were used as input for an additional step in the analysis (Assembly pipeline part B).
The total number of TSSs that were not experimentally validated (summed over all samples) was around 7000.
The 4 promoter collections were merged into a single file and analysed further. A transcript validated by multiple samples could have its TSS spread over a broader region rather than assigned to a single position. To avoid such inconsistency, for each transcript we selected as the true TSS the position that was validated by the largest number of samples.
Transcription start sites that mapped close to other TSSs belonging to the same gene (100 bp window) were merged into a single promoter following the same rule: the promoter validated by the highest number of samples was kept.
The 15000 experimentally validated promoters were stored in the EPDnew database, which can be downloaded from our ftp site. Scientists are welcome to use our other tools, ChIP-Seq (for correlation analysis) and SSA (for motif analysis around promoters), to analyse the EPDnew database.
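A minimal sketch of this per-transcript selection is shown below. It assumes each per-sample collection is a mapping from transcript ID to validated TSS position; the function name `consensus_tss`, the transcript IDs, and the rule of taking the lowest coordinate on ties are hypothetical and not taken from the pipeline.

```python
from collections import Counter
from typing import Dict, List


def consensus_tss(per_sample: List[Dict[str, int]]) -> Dict[str, int]:
    """For each transcript, keep the position validated by the most samples."""
    votes: Dict[str, Counter] = {}
    for collection in per_sample:
        for transcript, position in collection.items():
            votes.setdefault(transcript, Counter())[position] += 1

    consensus: Dict[str, int] = {}
    for transcript, counter in votes.items():
        # Most votes wins; ties go to the lowest coordinate (an assumption).
        consensus[transcript] = min(counter, key=lambda p: (-counter[p], p))
    return consensus


# Made-up collections from four samples.
samples = [
    {"tx1": 3631, "tx2": 8737},
    {"tx1": 3631},
    {"tx1": 3640, "tx2": 8740},
    {"tx1": 3631},
]
merged = consensus_tss(samples)
```

Here `tx1` is assigned position 3631 because three of the four samples agree on it, while `tx2` has one vote per position and falls back to the tie-breaking rule.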
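One way to read this merging step is as a greedy filter: visit the TSSs from best to least supported and drop any that fall within the window of an already kept TSS of the same gene. The sketch below follows that reading. The `Tss` record, the function name `merge_close_tss`, and the interpretation of the 100 bp window as "less than 100 bases apart" are assumptions made for illustration.

```python
from typing import List, NamedTuple


class Tss(NamedTuple):
    gene: str
    position: int
    n_samples: int  # number of samples that validated this TSS


def merge_close_tss(tss_list: List[Tss], window: int = 100) -> List[Tss]:
    """Keep, within each cluster of nearby same-gene TSSs, the best supported one."""
    kept: List[Tss] = []
    # Visit the best-supported TSSs first so they win over their neighbours.
    ordered = sorted(tss_list, key=lambda t: (-t.n_samples, t.gene, t.position))
    for tss in ordered:
        clash = any(
            k.gene == tss.gene and abs(k.position - tss.position) < window
            for k in kept
        )
        if not clash:
            kept.append(tss)
    return sorted(kept, key=lambda t: (t.gene, t.position))


# Made-up input: two nearby TSSs of geneA collapse into the better supported one.
promoters = merge_close_tss([
    Tss("geneA", 1000, 2),
    Tss("geneA", 1040, 4),
    Tss("geneA", 1500, 1),
    Tss("geneB", 1050, 1),
])
```

The TSS of `geneB` at position 1050 is retained even though it lies close to a `geneA` TSS, because merging only applies within the same gene.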