TSS assembly pipeline for Pf_EPDnew_001
Introduction
This document provides a technical description
of the transcription start site assembly pipeline that was used to
generate the EPDnew version 001 for
P. falciparum.
Source Data
Promoter collection:
Name | Genome Assembly | Promoters | Genes | PMID | Access data
EnsemblProtists40 genes | Apr 2016 ASM276v2/pfa2 | 5286 | 4994 | 29092050 | SOURCE, DOC, DATA
Experimental data:
Assembly pipeline overview
Description of procedures and intermediate data files
1. ENSEMBL Download
Data were downloaded from
EnsemblProtists release 40 in October 2018. Only transcripts from
protein-coding genes were selected.
5362 of 5693 genes originally present in the ENSEMBL database were
retained after this filtering. Transcripts with the same TSS position
were then merged under a common ID, yielding 5286 uniquely mapped
promoters in the TSS collection.
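The merging step can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: the tuple layout, the function name `merge_by_tss`, and the choice to keep the gene ID in the grouping key are assumptions made for the example.

```python
from collections import defaultdict


def merge_by_tss(transcripts):
    """Group transcripts that share chromosome, strand, TSS position and gene.

    `transcripts` is an iterable of
    (transcript_id, gene_id, chrom, strand, tss) tuples (hypothetical layout).
    Returns a dict mapping (chrom, strand, tss, gene_id) to the sorted list
    of transcript IDs that start at that position.
    """
    merged = defaultdict(list)
    for transcript_id, gene_id, chrom, strand, tss in transcripts:
        merged[(chrom, strand, tss, gene_id)].append(transcript_id)
    return {key: sorted(ids) for key, ids in merged.items()}


# Toy input: T1 and T2 start at the same position and collapse into one entry.
example = [
    ("T1", "G1", "chr1", "+", 1000),
    ("T2", "G1", "chr1", "+", 1000),
    ("T3", "G1", "chr1", "+", 1450),
    ("T4", "G2", "chr1", "-", 9000),
]
promoters = merge_by_tss(example)
```

Each key of the resulting dictionary corresponds to one uniquely mapped promoter.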
2. ENSEMBL TSS collection
The ENSEMBL TSS collection is stored as a tab-delimited text file
conforming to the SGA format under the name:
Pf_ensembl40_tss_pfa2.sga
The six fields contain the following information:
- NCBI/RefSeq chromosome id
- "TSS"
- position
- strand ("+" or "-")
- "1"
- ENSEMBLGeneID..geneName
Note that the second and fifth fields are invariant.
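A line of this file could be read as in the sketch below. The record type `SgaRecord`, the function `parse_sga_line`, and the sample line (with placeholder chromosome and gene identifiers) are illustrative assumptions and are not part of the EPD software.

```python
from typing import NamedTuple


class SgaRecord(NamedTuple):
    chrom: str        # NCBI/RefSeq chromosome id
    feature: str      # "TSS" in this collection
    position: int
    strand: str       # "+" or "-"
    count: int        # "1" in this collection
    description: str  # ENSEMBLGeneID..geneName


def parse_sga_line(line):
    """Split one tab-delimited six-field line into a typed record."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 6:
        raise ValueError(f"expected 6 fields, got {len(fields)}")
    chrom, feature, position, strand, count, description = fields
    if strand not in ("+", "-"):
        raise ValueError(f"unexpected strand: {strand!r}")
    return SgaRecord(chrom, feature, int(position), strand, int(count), description)


# Placeholder identifiers, not real records from the collection.
record = parse_sga_line("CHROM_ID\tTSS\t12345\t+\t1\tGENE_ID..geneName")
```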
3. Data import from FTP at GEO
BEDGRAPH files for the 12 CAGE samples were downloaded from the NCBI GEO FTP site
(see link above). Files were then converted into SGA format using in-house software.
Individual SGA files can be downloaded from our FTP site (see link above).
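The in-house converter is not described here, so the following is only a rough sketch of what such a conversion might look like. It assumes one BEDGRAPH file per strand (the strand is passed in by the caller), 0-based half-open BEDGRAPH intervals, a hypothetical feature name "CAGE", and five-field output lines without a description field.

```python
def bedgraph_to_sga(lines, strand, feature="CAGE"):
    """Turn BEDGRAPH lines into tab-delimited SGA-style lines.

    Assumptions (not taken from the pipeline): intervals are 0-based and
    half-open, the value column holds a tag count, and every covered base
    becomes one output line with a 1-based position.
    """
    out = []
    for line in lines:
        # Skip blank lines and BEDGRAPH header lines.
        if not line.strip() or line.startswith(("track", "browser", "#")):
            continue
        chrom, start, end, value = line.split()[:4]
        count = int(round(abs(float(value))))
        if count == 0:
            continue
        for pos in range(int(start) + 1, int(end) + 1):
            out.append(f"{chrom}\t{feature}\t{pos}\t{strand}\t{count}")
    return out


sga_lines = bedgraph_to_sga(
    ["track type=bedGraph", "chr1\t99\t101\t3", "chr1\t200\t201\t0"],
    strand="+",
)
```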
4. CAGE tag peak calling
A first selection of strong promoters was made from a merged file containing
tags from all time points and replicates, using the
ChIP-Peak
and
ChIP-Cor online tools.
ChIP-Peak was used with the following parameters:
- window width = 1
- vicinity range = 150
- refine peak positions = Y
- count cutoff = 9999999
- threshold = 10
The peaks were used as reference features in ChIP-Cor, and all CAGE tags from
the merged file as target features. The 10000 most highly expressed promoters were
then selected based on the total tag count in a 150-bp centered window.
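The ranking step can be illustrated with a small sketch. It does not reproduce ChIP-Cor; it only shows one way to sum tags in a window centered on each peak and keep the top entries. The function names and the exact window boundaries (75 bp on the left inclusive, 75 bp on the right exclusive) are assumptions.

```python
from bisect import bisect_left
from itertools import accumulate


def window_counts(peaks, tag_positions, tag_counts, width=150):
    """Sum tag counts in [peak - width/2, peak + width/2) for each peak.

    `tag_positions` must be sorted and refer to a single chromosome and
    strand; `tag_counts` holds the tag count at each position.
    """
    prefix = [0] + list(accumulate(tag_counts))
    half = width // 2
    totals = []
    for peak in peaks:
        lo = bisect_left(tag_positions, peak - half)
        hi = bisect_left(tag_positions, peak + half)
        totals.append(prefix[hi] - prefix[lo])
    return totals


def top_n(peaks, totals, n):
    """Return the n (peak, total) pairs with the highest totals."""
    ranked = sorted(zip(peaks, totals), key=lambda pt: pt[1], reverse=True)
    return ranked[:n]


tag_positions = [100, 120, 180, 500, 510]
tag_counts = [5, 10, 2, 7, 1]
peaks = [120, 505]
totals = window_counts(peaks, tag_positions, tag_counts)
best = top_n(peaks, totals, 1)
```

In the pipeline the cutoff is 10000 promoters; the toy call above keeps a single peak only to keep the example short.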
5. TSS validation and shifting
Each of the TSSs in the ENSEMBL collection was then validated and shifted
if at least one peak from the previous step was found between 1000 bp upstream
and the TSS position. This yielded a preliminary collection of 4015 promoters.
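The validation rule can be sketched as below. The text does not say which peak is chosen when several fall in the window, so picking the peak with the most tags is an assumption, as are the function name `shift_tss` and the treatment of the minus strand (upstream taken as higher coordinates).

```python
def shift_tss(tss, strand, peaks, max_upstream=1000):
    """Return the shifted TSS position, or None if the TSS is not validated.

    `peaks` is a list of (position, tag_count) pairs on the same chromosome
    and strand as the annotated TSS. A peak validates the TSS if it lies
    between `max_upstream` bp upstream and the TSS itself.
    """
    if strand == "+":
        lo, hi = tss - max_upstream, tss
    else:
        # On the minus strand, upstream means higher coordinates.
        lo, hi = tss, tss + max_upstream
    candidates = [(pos, n) for pos, n in peaks if lo <= pos <= hi]
    if not candidates:
        return None
    # Assumption: keep the candidate peak with the highest tag count.
    return max(candidates, key=lambda c: c[1])[0]


plus_shift = shift_tss(5000, "+", [(4200, 30), (4900, 80), (5100, 500)])
minus_shift = shift_tss(5000, "-", [(4900, 80), (5100, 500)])
unmatched = shift_tss(5000, "+", [(3000, 10)])
```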
6. CAGE tag peak calling (second round)
A second selection was made using ChIP-Peak on merged files of the duplicates from each
time point, with the same parameters as above but a lower threshold. After merging,
peaks too close to one another were removed using ChIP-Peak with the same parameters
as above but a threshold of 0. TSSs with < 50 tags in a 150-bp centered window were
excluded.
7. TSS validation and shifting (second round)
Unmatched ENSEMBL TSSs from step 5 were validated and shifted as above using
the peaks obtained in step 6, yielding 1580 promoters, which were added
to the preliminary collection.
8. Addition of differentially expressed promoters
Low-coverage TSSs from step 6 (< 50 tags) that had at least 30 tags within
75 bp around the TSS and that showed differential expression across time points
(defined as more than 50% of tags at a single time point) were validated against
ENSEMBL TSSs as above, yielding 134 additional promoters.
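The selection criterion can be written as a short predicate. This is a sketch under assumptions: the 30-tag minimum is taken to apply to the total over all time points, the input is a list of per-time-point tag counts already restricted to the window around the TSS, and the function name is hypothetical.

```python
def is_differentially_expressed(tags_per_timepoint, min_total=30, min_fraction=0.5):
    """Check the two conditions described above for one candidate TSS.

    `tags_per_timepoint` lists the tag count at each time point within the
    window around the TSS.
    """
    total = sum(tags_per_timepoint)
    if total < min_total:
        return False
    # More than 50% of all tags must come from a single time point.
    return max(tags_per_timepoint) > min_fraction * total


peaked = is_differentially_expressed([2, 25, 3, 4])
flat = is_differentially_expressed([10, 10, 10, 10])
too_weak = is_differentially_expressed([1, 20, 2])
```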
9. Filtering by relative expression
For genes with several potential promoters, we filtered out those
representing < 5% of the tags from all promoters associated with the
corresponding gene.
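A minimal sketch of this filter follows. The input layout, a list of (gene, promoter ID, tag count) tuples, and the function name are assumptions made for illustration.

```python
from collections import defaultdict


def filter_weak_promoters(promoters, min_percent=5):
    """Drop promoters carrying less than `min_percent` % of their gene's tags.

    `promoters` is a list of (gene_id, promoter_id, tag_count) tuples.
    """
    gene_totals = defaultdict(int)
    for gene_id, _, tags in promoters:
        gene_totals[gene_id] += tags
    kept = []
    for gene_id, promoter_id, tags in promoters:
        # Integer comparison avoids floating-point rounding at the cutoff.
        if tags * 100 >= min_percent * gene_totals[gene_id]:
            kept.append((gene_id, promoter_id, tags))
    return kept


candidates = [
    ("G1", "G1_1", 900),
    ("G1", "G1_2", 80),
    ("G1", "G1_3", 20),   # 2% of G1 tags, removed
    ("G2", "G2_1", 10),   # only promoter of G2, kept
]
kept = filter_weak_promoters(candidates)
```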
10. Final EPDnew collection
The 5597
experimentally validated promoters were stored in the
EPDnew database, which can be downloaded from our ftp
site. Scientists are welcome to use our other tools
ChIP-Seq
(for correlation analysis) and
SSA
(for motif analysis around promoters) to analyze the
EPDnew database.