TSS assembly pipeline for Am_EPDnew_001
Introduction
This document provides a technical description
of the transcription start site assembly pipeline that was used to
generate EPDnew version 001 for
A. mellifera.
Source Data
Promoter collection:
Name | Genome Assembly | Promoters | Genes | PMID | Access data
RefSeq Genes | Apr 2011 Amel_4.5/amel5 | 17735 | 10727 | 22121212 | SOURCE, DOC, DATA
Experimental data:
Name | Type | Samples | Tags | PMID | Access data
Khamis et al., 2015 | CAGEscan | 16 | 70,802,351 | 26073445 | SOURCE, DOC, DATA
Assembly pipeline overview
Description of procedures and intermediate data files
1. Download annotated TSS
Data were downloaded from
RefSeq on 20 July 2016.
Transcripts were filtered according to the following rules:
- Transcripts of protein coding genes only
- Transcripts have a non-empty description field
Gene names were taken from the field "Locus ID". Since the
EPD format does not allow gene names longer than 18 characters,
we checked whether the names respected this limitation.
A total of 17735 promoters were selected.
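The selection rules above can be sketched as follows. This is a minimal illustration, not the code used to build the collection: the `Transcript` record, its `biotype` label and the sample entries are hypothetical, and only the two filtering rules and the 18-character limit come from the text.

```python
from typing import NamedTuple

MAX_GENE_NAME_LENGTH = 18  # EPD format limit stated in the text


class Transcript(NamedTuple):
    locus_id: str      # "Locus ID" field, used as the gene name
    biotype: str       # hypothetical label, e.g. "protein_coding"
    description: str   # free-text description field


def select_transcripts(transcripts):
    """Keep protein-coding transcripts that have a non-empty description."""
    return [
        t for t in transcripts
        if t.biotype == "protein_coding" and t.description.strip()
    ]


def names_too_long(transcripts):
    """Return the gene names that exceed the EPD length limit."""
    return sorted(
        {t.locus_id for t in transcripts if len(t.locus_id) > MAX_GENE_NAME_LENGTH}
    )


# Made-up records, for illustration only.
records = [
    Transcript("GeneA", "protein_coding", "hypothetical protein A"),
    Transcript("GeneB", "ncRNA", "hypothetical non-coding RNA"),
    Transcript("GeneC", "protein_coding", ""),
]
selected = select_transcripts(records)
too_long = names_too_long(selected)
```

Here only `GeneA` passes both rules, and `too_long` is empty because no selected name exceeds 18 characters.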
2. RefSeq TSS collection
The RefSeq TSS collection is stored as a tab-delimited text file
conforming to the SGA format under the name:
The six fields contain the following information:
- NCBI/RefSeq chromosome id
- "TSS"
- position
- strand ("+" or "-")
- "1"
- Locus ID
Note that the second and fifth fields are invariant.
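As an illustration of the six-field layout listed above, the following sketch parses one line of such a file. The `SgaRecord` type, the function name and the example line (including its chromosome id and gene name) are made up for this example; only the field order and meaning are taken from the list.

```python
from typing import NamedTuple


class SgaRecord(NamedTuple):
    chrom: str      # NCBI/RefSeq chromosome id
    feature: str    # "TSS" in this collection
    position: int
    strand: str     # "+" or "-"
    count: int      # "1" in this collection
    locus_id: str


def parse_sga_line(line):
    """Parse one tab-delimited line with the six fields described above."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 6:
        raise ValueError(f"expected 6 fields, got {len(fields)}")
    chrom, feature, position, strand, count, locus_id = fields
    if strand not in ("+", "-"):
        raise ValueError(f"unexpected strand: {strand!r}")
    return SgaRecord(chrom, feature, int(position), strand, int(count), locus_id)


# Made-up example line; the chromosome id and gene name are placeholders.
record = parse_sga_line("NC_000000.0\tTSS\t12345\t+\t1\tGeneA\n")
```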
3. Import CAGE data
Data were imported from GEO in SRA format. Raw sequence files were
mapped to the Amel_4.5 genome assembly using Bowtie. The resulting BAM files were
converted to SGA format using
ChIP-Convert.
5. mRNA 5' tags peak calling
For each of the 16 individual samples, peak calling was
carried out using the
ChIP-Peak
on-line tool with the following parameters:
- Window width = 200
- Vicinity range = 200
- Peak refine = Y
- Count cutoff = 9999999
- Threshold = 5
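To give an intuition for how parameters of this kind interact, here is a deliberately simplified sketch of window-based peak calling. It is not the ChIP-Peak implementation: the function name, the input layout (a dict of position to tag count for one strand of one chromosome), the tie-breaking rule and the quadratic-time search are all assumptions made for illustration, and the peak refinement step is not reproduced.

```python
def call_peaks(tags, window=200, vicinity=200, threshold=5, cutoff=9999999):
    """Simplified window-based peak calling (illustrative only).

    tags: dict mapping position -> tag count (one strand, one chromosome).
    Returns a list of (position, windowed_count) tuples sorted by position.
    """
    half = window // 2
    # Cap the per-position count at the count cutoff.
    capped = {p: min(c, cutoff) for p, c in tags.items()}
    # Tag count in a window centred on each tagged position.
    score = {
        p: sum(c for q, c in capped.items() if abs(q - p) <= half)
        for p in capped
    }
    peaks = []
    for p in sorted(score):
        if score[p] < threshold:
            continue
        # Keep p only if it beats every other position in the vicinity range;
        # ties go to the leftmost position (an arbitrary choice made here).
        rivals = [q for q in score if q != p and abs(q - p) <= vicinity]
        if all((score[q], -q) < (score[p], -p) for q in rivals):
            peaks.append((p, score[p]))
    return peaks


# Made-up tag counts, for illustration only.
example_tags = {90: 1, 150: 4, 230: 2, 1000: 2, 5000: 6}
example_peaks = call_peaks(example_tags)
```

With these made-up counts, the cluster around position 150 yields one peak, the isolated position 5000 yields another, and position 1000 falls below the threshold.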
6. TSS validation and shifting
Each sample in the collection (mRNA peaks and RefSeq TSS) was then
processed separately in a pipeline that validates transcription
start sites with mRNA peaks. A RefSeq TSS was considered experimentally confirmed
if an mRNA peak lay within a 300 bp window around it. The validated
TSS was then shifted to the nearest base with the highest tag
density.
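The validation and shifting rule can be sketched as below. Several points are assumptions rather than statements about the actual pipeline: the 300 bp window is read as 150 bp on either side of the annotated TSS, peaks are given as a dict of position to tag count, and ties in tag count are broken by distance to the annotated TSS.

```python
def validate_and_shift(tss_pos, peaks, window=300):
    """Return the shifted TSS position, or None if the TSS is not validated.

    peaks: dict mapping peak position -> tag count (same chromosome and strand).
    The window is assumed to be centred on the annotated TSS.
    """
    half = window // 2
    nearby = {p: c for p, c in peaks.items() if abs(p - tss_pos) <= half}
    if not nearby:
        return None  # no supporting peak: TSS is not validated in this sample
    # Highest tag count first; ties go to the position closest to the TSS.
    return min(nearby, key=lambda p: (-nearby[p], abs(p - tss_pos), p))


# Made-up peak positions and counts, for illustration only.
sample_peaks = {1000: 12, 1100: 30, 1400: 50}
shifted = validate_and_shift(1050, sample_peaks)
unvalidated = validate_and_shift(2000, sample_peaks)
```

In this example the annotated TSS at 1050 is moved to 1100, since the stronger peak at 1400 lies outside the window, while the TSS at 2000 has no supporting peak.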
7. RefSeq not-validated TSS
The total number of TSSs that were not experimentally validated (summed over all samples) was around 10000.
8. Promoter collection for each sample
Each sample in the dataset was used to generate a separate
promoter collection. Potentially, the same transcript could be
validated by multiple samples and it could have different start
sites in different samples. To avoid redundancy, the individual
collections were used as input for an additional step in the
analysis (Assembly pipeline part B).
9. Merging collections and second TSS selection
The 16 promoter collections were merged into a single file and
further analysed. Transcripts validated by multiple samples could
potentially have their TSS spread over a broader region rather than set at a
single position. To avoid such inconsistency, for each transcript
we selected the position that was validated by the largest number
of samples as the true TSS.
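A minimal sketch of this selection rule, assuming each transcript comes with the list of TSS positions reported by the samples that validated it. The function name and the tie-breaking rule (smallest coordinate) are assumptions made for the example.

```python
from collections import Counter


def consensus_tss(positions):
    """Pick the position reported by the largest number of samples.

    positions: one validated TSS position per sample for a given transcript.
    Ties are broken by the smallest coordinate (an assumption made here).
    """
    counts = Counter(positions)
    return min(counts, key=lambda p: (-counts[p], p))


# Made-up positions from six samples, for illustration only.
chosen = consensus_tss([500, 502, 500, 498, 500, 502])
```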
Different TSSs that belong to the same gene were classified according
to their global expression level. The primary TSS of a gene (marked with an '_1' at the end of the ID) always has the highest expression level, followed by all the others
in decreasing order of expression (marked with '_2', '_3', etc.).
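The naming scheme can be illustrated with a short sketch. The input layout (a dict of TSS position to expression level for one gene) and the function name are hypothetical; only the ordering by decreasing expression and the '_1', '_2', ... suffixes come from the text.

```python
def rank_promoters(gene, tss_expression):
    """Assign promoter IDs to the TSSs of one gene by decreasing expression.

    tss_expression: dict mapping TSS position -> global expression level.
    Returns a list of (promoter_id, position) pairs.
    """
    ordered = sorted(tss_expression.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(f"{gene}_{rank}", pos) for rank, (pos, _) in enumerate(ordered, start=1)]


# Made-up gene name and expression values, for illustration only.
ranked = rank_promoters("GeneA", {100: 40, 900: 250, 1500: 7})
```

Here the TSS at position 900 has the highest expression and therefore becomes `GeneA_1`.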
10. Filtering
TSSs that mapped close to other TSSs belonging to the same gene
(500-bp window) were merged into a single promoter following the same rule:
the promoter that was validated by the highest number of samples was kept.
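One way to express this filter is a greedy pass over the TSSs of a gene, from the best supported to the least supported. The sketch below assumes that "500-bp window" means a distance of at most 500 bp between two TSSs, and the input layout and function name are hypothetical; the actual pipeline may define the window differently.

```python
def filter_close_tss(tss_support, window=500):
    """Keep the best-supported TSS among those closer than the window.

    tss_support: dict mapping TSS position -> number of validating samples
    (all positions belong to the same gene).
    Returns the retained positions in ascending order.
    """
    kept = []
    # Visit positions from most to least supported; ties by coordinate.
    for pos in sorted(tss_support, key=lambda p: (-tss_support[p], p)):
        if all(abs(pos - k) > window for k in kept):
            kept.append(pos)
    return sorted(kept)


# Made-up positions and sample counts, for illustration only.
retained = filter_close_tss({1000: 12, 1200: 5, 1800: 9, 3000: 2})
```

In this example the TSS at 1200 is dropped because it lies within 500 bp of the better supported TSS at 1000.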
11. Final EPDnew collection
The 6493
experimentally validated promoters were stored in the
EPDnew database, which can be downloaded from our ftp
site. Scientists are welcome to use our other tools
ChIP-Seq
(for correlation analysis) and
SSA
(for motif analysis around promoters) to analyze the
EPDnew database.