TSS assembly pipeline for Cf_EPDnew_001

Introduction

This document provides a technical description of the transcription start site assembly pipeline that was used to generate EPDnew version 001 for C. familiaris.

Source Data

Promoter collection:

Name             Genome Assembly              Promoters  Genes  PMID      Access data
ENSEMBL92 genes  Sep 2011 CanFam3.1/canFam3   23855      19971  29155950  SOURCE DOC DATA

Experimental data:

Name     Type  Samples  Tags         PMID      Access data
lizio17  CAGE  12       448,927,614  29182598  SOURCE DOC DATA

Assembly pipeline overview

Description of procedures and intermediate data files

1. UCSC Download

Data was downloaded from the UCSC table browser in June 2018. Transcripts were then filtered according to the following rules:
  1. Transcripts of protein coding genes only
  2. Transcript lies on full chromosomes
  3. Genes must be annotated [Associated Gene Name present]
  4. Gene and transcript status known
Gene names were taken from the field "Associated Gene Name". Since the EPD format doesn't allow gene names longer than 18 characters, we checked whether the names respected this limitation.
Transcripts with the same TSS position were merged under a common ID. As a consequence, of the 24860 transcripts originally present in the database, 23855 uniquely mapped promoters were kept in the input list, covering 19971 genes.
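
The merging step can be illustrated with a minimal Python sketch. It is not the software used for EPDnew: the record layout (keys 'id', 'gene', 'chrom', 'strand', 'tss') and the function name are hypothetical, and grouping by gene in addition to position is an assumption made for the example.

```python
from collections import defaultdict

MAX_GENE_NAME_LEN = 18  # gene name limit of the EPD format, as stated above


def merge_transcripts_by_tss(transcripts):
    """Merge transcripts that share the same TSS into one promoter entry.

    `transcripts` is an iterable of dicts with the hypothetical keys
    'id', 'gene', 'chrom', 'strand' and 'tss'.
    """
    groups = defaultdict(list)
    for t in transcripts:
        if len(t["gene"]) > MAX_GENE_NAME_LEN:
            raise ValueError(f"gene name too long for EPD format: {t['gene']}")
        # Assumption: transcripts are merged only within the same gene.
        key = (t["gene"], t["chrom"], t["strand"], t["tss"])
        groups[key].append(t["id"])

    promoters = []
    for (gene, chrom, strand, tss), ids in sorted(groups.items()):
        promoters.append(
            {"gene": gene, "chrom": chrom, "strand": strand,
             "tss": tss, "transcripts": sorted(ids)}
        )
    return promoters
```

Each returned entry stands for one uniquely mapped promoter and lists the transcript IDs that were merged into it.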

2. UCSC TSS collection

If present, the UCSC TSS collection is stored as a tab-delimited text file conforming to the SGA format under the name (for dog):
    Cf_ensembl92_tss_canFam3.sga
The six fields contain the following information:
  • NCBI/RefSeq chromosome id
  • "TSS"
  • position
  • strand ("+" or "-")
  • "1"
  • ENSEMBLGeneID .. geneName
Note that the second and fifth fields are invariant.
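
As an illustration of the six-field layout, the following Python sketch parses one SGA line. The class and function names are hypothetical, and the sixth field is kept as an opaque string because its exact separator is not specified here.

```python
from typing import NamedTuple


class SgaRecord(NamedTuple):
    chrom: str        # NCBI/RefSeq chromosome id
    feature: str      # "TSS" in this collection
    position: int
    strand: str       # "+" or "-"
    count: int        # "1" in this collection
    description: str  # gene ID and gene name, kept verbatim


def parse_sga_line(line):
    """Split one tab-delimited SGA line into its six fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 6:
        raise ValueError(f"expected 6 tab-separated fields, got {len(fields)}")
    chrom, feature, position, strand, count, description = fields
    if strand not in ("+", "-"):
        raise ValueError(f"unexpected strand: {strand!r}")
    return SgaRecord(chrom, feature, int(position), strand, int(count), description)
```

A line would be parsed with `parse_sga_line(line)`, after which fields are available by name, for example `record.position`.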

3. Data import from FANTOM5

BAM files for high quality CAGE samples (hCAGE) were downloaded from the FANTOM5 FTP site (see link above). Files were then converted into SGA format using in-house software. This collection contains 12 samples. Individual SGA files can be downloaded from our FTP site (link above).

5. mRNA 5' tags peak calling

For each individual sample, peak calling for the merged file was carried out using the ChIP-Peak on-line tool with the following parameters:
  • Window width = 1
  • Vicinity range = 200
  • Peak refine = Y
  • Count cutoff = 9999999
  • Threshold = 5
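
To give an intuition for how these parameters interact, here is a simplified local-maximum peak caller in Python. It is a sketch only and not ChIP-Peak's implementation: the function name, the input layout (a dict of position to tag count for one chromosome strand) and the tie-breaking rule are assumptions, and the "Peak refine" option is not modelled.

```python
def call_peaks(counts, window=1, vicinity=200, threshold=5, count_cutoff=9999999):
    """Return (position, score) pairs for local maxima of tag density.

    `counts` maps a position to its tag count on one chromosome strand.
    """
    # Cap the per-position count (with 9999999 the cap has no practical effect).
    capped = {pos: min(c, count_cutoff) for pos, c in counts.items()}
    half = window // 2

    def window_sum(pos):
        # With window=1 this is simply the count at `pos`.
        return sum(capped.get(q, 0) for q in range(pos - half, pos + half + 1))

    scores = {pos: window_sum(pos) for pos in capped}
    peaks = []
    for pos in sorted(scores):
        score = scores[pos]
        if score < threshold:
            continue
        # Keep the position only if nothing within `vicinity` scores higher;
        # ties go to the leftmost position (an assumption of this sketch).
        is_local_max = all(
            score > scores[q] or (score == scores[q] and pos < q)
            for q in scores
            if q != pos and abs(q - pos) <= vicinity
        )
        if is_local_max:
            peaks.append((pos, score))
    return peaks
```

The quadratic neighbour scan keeps the example short; a production tool would use a sorted sweep instead.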

6. TSS validation and shifting

Each sample in the collection (mRNA peaks and UCSC TSS) was then separately processed in a pipeline aiming at validating transcription start sites with mRNA peaks. A UCSC TSS was considered experimentally confirmed if an mRNA peak lay within a 200-bp window around it or mapped within the 5' UTR region. The validated TSS was then shifted to the nearest base with the highest tag density.
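
The validation and shifting rule can be sketched in Python as follows. This is an illustration under stated assumptions, not the pipeline's code: the function name is hypothetical, the 200-bp window is interpreted as 100 bp on either side of the annotated TSS, and the 5' UTR criterion is left out.

```python
def validate_and_shift(tss_pos, peaks, half_window=100):
    """Return the shifted TSS position, or None if the TSS is not validated.

    `peaks` is a list of (position, tag_count) pairs on the same
    chromosome and strand as the annotated TSS.
    """
    # Assumption: "a window of 200 bp around it" means +/- 100 bp.
    candidates = [
        (pos, count) for pos, count in peaks if abs(pos - tss_pos) <= half_window
    ]
    if not candidates:
        return None
    # Highest tag count wins; among equal counts, the peak nearest to the
    # annotated TSS is chosen.
    best_pos, _ = max(candidates, key=lambda pc: (pc[1], -abs(pc[0] - tss_pos)))
    return best_pos
```

A return value of `None` means that no peak supports the annotated TSS in that sample.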

8. Promoter collection for each sample

Each sample in the dataset was used to generate a separate promoter collection. Potentially, the same transcript could be validated by multiple samples and it could have different start sites in different samples. To avoid redundancy, the individual collections were used as input for an additional step in the analysis (Assembly pipeline part B).

9. Merging collections and second TSS selection

The 12 promoter collections were merged into a single file and further analysed. The promoter of a transcript was maintained in the list only if validated by at least two samples. A transcript validated by multiple samples could potentially have its TSS spread over a broader region rather than set to a single position. To avoid such inconsistency, for each transcript we selected the position that was validated by the largest number of samples as the true TSS.
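
A minimal Python sketch of this selection rule is given below. The data layout (one dict per sample, mapping a transcript ID to its validated TSS position), the function name and the tie-breaking by smallest coordinate are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def select_tss(collections, min_samples=2):
    """Pick, for each transcript, the TSS position supported by most samples.

    `collections` is a list with one dict per sample, mapping a transcript
    ID to the TSS position validated in that sample.
    Returns {transcript_id: (position, number_of_supporting_samples)}.
    """
    positions = defaultdict(Counter)
    for sample in collections:
        for transcript_id, pos in sample.items():
            positions[transcript_id][pos] += 1

    selected = {}
    for transcript_id, counter in positions.items():
        # Number of samples that validated this transcript at any position.
        if sum(counter.values()) < min_samples:
            continue
        # Most supported position; ties go to the smallest coordinate
        # (an assumption of this sketch).
        best_pos, support = min(counter.items(), key=lambda kv: (-kv[1], kv[0]))
        selected[transcript_id] = (best_pos, support)
    return selected
```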

10. Filtering

TSSs that mapped close to other TSSs belonging to the same gene (500-bp window) were merged into a single promoter following the same rule: the promoter that was validated by the highest number of samples was kept.
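
One way to realise this filter is a greedy pass over each gene's promoters, ordered by sample support. The Python sketch below is an assumption-laden illustration: the tuple layout, the function name and the reading of "500-bp window" as a maximum distance of 500 bp between two TSSs are not taken from the pipeline itself.

```python
from collections import defaultdict


def filter_close_promoters(promoters, window=500):
    """Keep, among nearby promoters of one gene, the best supported one.

    `promoters` is a list of (gene, chrom, strand, position, support)
    tuples, where `support` is the number of validating samples.
    """
    by_gene = defaultdict(list)
    for p in promoters:
        by_gene[(p[0], p[1], p[2])].append(p)

    kept = []
    for group in by_gene.values():
        group_kept = []
        # Visit promoters from most to least supported.
        for cand in sorted(group, key=lambda p: (-p[4], p[3])):
            # Drop the candidate if a better supported promoter of the
            # same gene has already been kept within `window` bp.
            if all(abs(cand[3] - k[3]) > window for k in group_kept):
                group_kept.append(cand)
        kept.extend(group_kept)
    return sorted(kept, key=lambda p: (p[1], p[3]))
```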

11. Final EPDnew collection

The 7545 experimentally validated promoters were stored in the EPDnew database, which can be downloaded from our FTP site. Scientists are welcome to use our other tools, ChIP-Seq (for correlation analysis) and SSA (for motif analysis around promoters), to analyze the EPDnew database.

Last update October 2019