TSS assembly pipeline for Dm_EPDnew_004

Introduction

This document provides a technical description of the transcription start site assembly pipeline that was used to generate EPDnew version 004 for D. melanogaster.

Source Data

Promoter collection:

Name	Genome Assembly	Promoters	Genes	PMID	Access data
ENSEMBL86	Aug 2014 BDGP Rel6 + ISO1 MT/dm6	18409	13660	19420058	SOURCE	DOC	DATA

Experimental data:

Name	Type	Samples	Tags	PMID	Access data
MachiBase	OligoCap	7	24,984,353	18842623	SOURCE	DOC	DATA
Hoskins et al., 2012	CAGE	1	17,979,809	21177961	SOURCE	DOC	DATA
modENCODE	CAGE	49	596,317,845	24985915	SOURCE	DOC	DATA
Ni et al., 2010	CAGE	2	27,173,616	20495556	SOURCE	DOC	DATA

Assembly pipeline overview

Description of procedures and intermediate data files

1. Biomart Download

Data was downloaded from BioMart selecting the following attributes:

Ensembl Gene ID
Ensembl Transcript ID
Chromosome Name
Strand
Transcript Start (bp)
Transcript End (bp)
Gene Start (bp)
Gene End (bp)
Status (transcript)
Status (gene)
Associated Gene Name

Transcripts were then filtered according to the following rules:

Transcripts of protein coding genes only
Transcript length > 0 [Transcript Start different from Transcript End]
Transcript lies on full chromosomes
Gene must have a 5' UTR [Transcript Start different from Gene Start]
Genes must be annotated [Associated Gene Name present]
Gene and transcript status known
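
The rules above can be sketched as a single predicate in Python. This is a minimal illustration, not the code used by the pipeline. The dictionary keys, the "biotype" field (not among the listed BioMart attributes), the chromosome names, and the status value "KNOWN" are all assumptions about how the export might be represented.

```python
# Hypothetical names of the full chromosomes; the real export may label them differently.
FULL_CHROMOSOMES = {"2L", "2R", "3L", "3R", "4", "X", "Y"}


def keep_transcript(t):
    """Return True if a transcript record (a dict) passes all filtering rules."""
    return (
        t["biotype"] == "protein_coding"                  # protein coding genes only (assumed field)
        and t["transcript_start"] != t["transcript_end"]  # transcript length > 0
        and t["chromosome"] in FULL_CHROMOSOMES           # lies on a full chromosome
        and t["transcript_start"] != t["gene_start"]      # 5' UTR rule, as stated in the text
        and bool(t["gene_name"])                          # gene is annotated
        and t["transcript_status"] == "KNOWN"             # assumed status value
        and t["gene_status"] == "KNOWN"                   # assumed status value
    )


example = {
    "biotype": "protein_coding",
    "chromosome": "2L",
    "transcript_start": 1200,
    "transcript_end": 5000,
    "gene_start": 1000,
    "gene_name": "example_gene",
    "transcript_status": "KNOWN",
    "gene_status": "KNOWN",
}
print(keep_transcript(example))
```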

Gene names were taken from the field 'Associated Gene Name'. Since the EPD format does not allow gene names longer than 18 characters, we checked that all names respected this limit.
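
A check of this kind could look like the following sketch. The function name is hypothetical; only the 18-character limit comes from the text.

```python
MAX_GENE_NAME_LENGTH = 18  # limit imposed by the EPD format, as stated above


def names_exceeding_limit(gene_names, limit=MAX_GENE_NAME_LENGTH):
    """Return the gene names that are longer than the allowed limit."""
    return [name for name in gene_names if len(name) > limit]


too_long = names_exceeding_limit(["Adh", "x" * 19])
print(too_long)
```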

Transcripts with the same TSS position were merged under a common ID. As a consequence, of the 23850 transcripts originally present in the ENSEMBL database, 5953 were merged, leaving 17897 uniquely mapped promoters in the input list.
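
The merging step can be illustrated with a short Python sketch. It assumes that the TSS is the transcript start on the "+" strand and the transcript end on the "-" strand, and that strands are encoded as "+" and "-"; the dictionary keys are hypothetical. The counts in the text are consistent with each other: 23850 - 5953 = 17897.

```python
from collections import defaultdict


def tss_position(t):
    """TSS is the 5' end: transcript start on '+', transcript end on '-'."""
    return t["transcript_start"] if t["strand"] == "+" else t["transcript_end"]


def merge_by_tss(transcripts):
    """Group transcript IDs that share chromosome, strand and TSS position."""
    groups = defaultdict(list)
    for t in transcripts:
        key = (t["chromosome"], t["strand"], tss_position(t))
        groups[key].append(t["transcript_id"])
    return dict(groups)


transcripts = [
    {"transcript_id": "T1", "chromosome": "2L", "strand": "+",
     "transcript_start": 100, "transcript_end": 900},
    {"transcript_id": "T2", "chromosome": "2L", "strand": "+",
     "transcript_start": 100, "transcript_end": 1500},
    {"transcript_id": "T3", "chromosome": "2L", "strand": "-",
     "transcript_start": 100, "transcript_end": 900},
]
merged = merge_by_tss(transcripts)
print(len(merged))
```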

2. ENSEMBL TSS collection

The ENSEMBL TSS collection is stored as a tab-delimited text file conforming to the SGA format under the name:

Dm_ENSEMBL70.sga

The six fields contain the following kinds of information:

NCBI/RefSeq chromosome id
"TSS"
position
strand ("+" or "-")
"1"
gene name.

Note that the second and fifth fields are invariant.
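
As an illustration of the six-field layout, the following Python sketch reads and writes one line of such a file. The function names, the chromosome identifier, and the gene name in the example are hypothetical placeholders; the two constant fields ("TSS" and "1") are taken from the list above.

```python
def parse_sga_line(line):
    """Split one tab-delimited SGA line into its six fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 6:
        raise ValueError(f"expected 6 fields, got {len(fields)}")
    chrom, feature, position, strand, count, gene = fields
    if strand not in ("+", "-"):
        raise ValueError(f"unexpected strand: {strand!r}")
    return {
        "chrom": chrom,
        "feature": feature,
        "position": int(position),
        "strand": strand,
        "count": int(count),
        "gene": gene,
    }


def format_tss_line(chrom, position, strand, gene):
    """Write a TSS record; the feature and count fields are constant."""
    return "\t".join([chrom, "TSS", str(position), strand, "1", gene])


line = format_tss_line("chrom_id_example", 12345, "+", "example_gene")
record = parse_sga_line(line)
print(record)
```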

3. Data import from MachiBase

MachiBase data were generated with the oligo-capping technology. The source data were downloaded from:

http://download.utgenome.org/pub/machibase/tssExp.tar.gz

According to the readme file included in the tar archive, the 5' end tags were mapped to the Drosophila genome using BLAT as the alignment tool, allowing up to three mismatches.

4. Oligocap tags

The compressed version of this file is available from the MGA archive (see above) under the name:

all_oligocap.sga.gz

5. Data import from Genome Research

Mapped sequence tags were extracted from Supplementary Data File 1 available from Genome Research at:

http://genome.cshlp.org/content/21/2/182/suppl/DC1

The downloaded source file is in SAM format and was generated with the tag mapping program StatMap as described in the article cited above. We extracted all tags with mapping quality scores greater than or equal to 30.
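
The quality filter can be sketched in Python as follows. This is not the script used for the pipeline; it relies only on the standard SAM layout (header lines start with "@", the FLAG is the second column, the mapping quality is the fifth column, and flag bit 0x4 marks unmapped reads). The function name and the example records are made up for illustration.

```python
def filter_sam_lines(lines, min_mapq=30):
    """Yield SAM alignment lines with mapping quality >= min_mapq."""
    for line in lines:
        if line.startswith("@"):
            continue  # skip header lines
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])
        mapq = int(fields[4])
        if flag & 0x4:
            continue  # read is unmapped
        if mapq >= min_mapq:
            yield line


# Made-up records for illustration only.
sam_lines = [
    "@HD\tVN:1.0",
    "tag1\t0\t2L\t100\t30\t5M\t*\t0\t0\tACGTA\tIIIII",
    "tag2\t16\t2L\t200\t29\t5M\t*\t0\t0\tACGTA\tIIIII",
    "tag3\t4\t*\t0\t255\t*\t*\t0\t0\tACGTA\tIIIII",
    "tag4\t16\t2L\t300\t42\t5M\t*\t0\t0\tACGTA\tIIIII",
]
kept = [line.split("\t")[0] for line in filter_sam_lines(sam_lines)]
print(kept)
```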

6. CAGE tags

The compressed version of this file is available from our ftp site (see the link above) under the name:

embryo_cage.sga.gz

7. Data import from SRA

BAM files for the SRA accessions SRP001602 and SRX018832 were downloaded from the SRA site and converted into SGA files using in-house software.

8. TSS validation and shifting

Each sample in the collection (mRNA peaks and ENSEMBL TSS) was then processed in a pipeline that validates transcription start sites with mRNA peaks. An Ensembl TSS was considered experimentally confirmed if a CAGE peak lay in a window of 200 bp around it and had a maximum height of at least 3 tags. The validated TSS was then shifted to the nearest base with the highest tag density.
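
The validation and shifting rule can be sketched in Python. This is only an illustration under stated assumptions: the 200 bp window is interpreted as 100 bp on each side of the annotated TSS, tag counts are given as a position-to-count dictionary for one chromosome and strand, and ties between equally high positions are resolved in favour of the one closest to the annotated TSS. The function and parameter names are hypothetical.

```python
def validate_and_shift(tss, tag_counts, half_window=100, min_height=3):
    """Return the shifted TSS position, or None if the TSS is not validated.

    tag_counts maps genomic positions to tag counts (same chromosome and strand).
    """
    best_pos, best_count = None, 0
    for pos in range(tss - half_window, tss + half_window + 1):
        count = tag_counts.get(pos, 0)
        closer_tie = (
            count == best_count
            and count > 0
            and abs(pos - tss) < abs(best_pos - tss)
        )
        if count > best_count or closer_tie:
            best_pos, best_count = pos, count
    if best_count < min_height:
        return None  # no peak of sufficient height in the window
    return best_pos


shifted = validate_and_shift(1000, {995: 2, 1010: 5, 1200: 50})
print(shifted)
```

In this example the position 1200 carries the most tags but lies outside the assumed window, so the TSS is moved to position 1010.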

9. Promoter collection for each sample

Each sample in the dataset was used to generate a separate promoter collection. Potentially, the same transcript could be validated by multiple samples and it could have different start sites in different samples. To avoid redundancy, the individual collections were used as input for an additional step in the analysis (Assembly pipeline part B).

10. Merging collections and second TSS selection

All promoter collections were merged into a single file and analysed further. The promoter of a transcript was maintained in the list only if it was validated by at least two samples. A transcript validated by multiple samples could have its TSS spread over a broader region rather than set to a single position. To avoid such inconsistency, for each transcript we selected the position that was validated by the largest number of samples as the true TSS.
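
A minimal Python sketch of this selection step is given below. The input layout (one dictionary of transcript ID to TSS position per sample) and the function name are assumptions, and so is the tie-breaking rule, which picks the smallest coordinate when two positions are supported by the same number of samples; the text does not say how ties were handled.

```python
from collections import Counter, defaultdict


def select_consensus_tss(collections, min_samples=2):
    """Pick, for each transcript, the TSS position supported by most samples.

    collections maps sample name -> {transcript_id: validated TSS position}.
    Transcripts validated by fewer than min_samples samples are dropped.
    """
    votes = defaultdict(Counter)
    for sample_tss in collections.values():
        for transcript_id, pos in sample_tss.items():
            votes[transcript_id][pos] += 1
    consensus = {}
    for transcript_id, counter in votes.items():
        if sum(counter.values()) < min_samples:
            continue
        # Highest support first; ties go to the smallest coordinate (assumption).
        consensus[transcript_id] = max(counter, key=lambda p: (counter[p], -p))
    return consensus


collections = {
    "sample1": {"T1": 100, "T2": 500},
    "sample2": {"T1": 100},
    "sample3": {"T1": 102},
}
consensus = select_consensus_tss(collections)
print(consensus)
```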

11. Filtering

Transcription start sites that mapped close to other TSSs belonging to the same gene (200 bp window) were merged into a single promoter following the same rule: the promoter that was validated by the highest number of samples was kept.
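
One way to implement this filter is a greedy pass over the promoters in order of decreasing sample support, sketched below in Python. The greedy strategy, the record layout, and the interpretation of the window as a distance of at most 200 bp are assumptions made for illustration, not a description of the actual implementation.

```python
def filter_close_promoters(promoters, window=200):
    """Keep the best-supported promoter among nearby promoters of the same gene.

    Each promoter is a dict with the keys 'gene', 'pos' and 'n_samples'.
    """
    kept = []
    # Visit promoters from the most to the least supported.
    for p in sorted(promoters, key=lambda p: p["n_samples"], reverse=True):
        clash = any(
            k["gene"] == p["gene"] and abs(k["pos"] - p["pos"]) <= window
            for k in kept
        )
        if not clash:
            kept.append(p)
    return kept


promoters = [
    {"gene": "g1", "pos": 1000, "n_samples": 5},
    {"gene": "g1", "pos": 1100, "n_samples": 9},
    {"gene": "g1", "pos": 2000, "n_samples": 2},
    {"gene": "g2", "pos": 1050, "n_samples": 1},
]
result = [(p["gene"], p["pos"]) for p in filter_close_promoters(promoters)]
print(result)
```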

12. EPDnew collection

The experimentally validated promoters were stored in the EPDnew database, which can be downloaded from our ftp site. Scientists are welcome to use our other tools, ChIP-Seq (for correlation analysis) and SSA (for motif analysis around promoters), to analyse the EPDnew database.

Last update October 2019

