TSS assembly pipeline for Hs_EPDnew_005

Introduction

This document provides a technical description of the transcription start site assembly pipeline that was used to generate EPDnew version 005 for H. sapiens.

Source Data

Promoter collection:

Name     Genome Assembly       Promoters  Genes  PMID      Access data
Gencode  Dec 2013 GRCh38/hg38  31931      18960  27250503  SOURCE DOC DATA

Experimental data:

Name     Type  Samples  Tags            PMID      Access data
FANTOM5  CAGE  941      18,244,201,540  24670764  SOURCE DOC DATA
ENCODE   CAGE  145      7,134,200,060   22955620  SOURCE DOC DATA

Assembly pipeline overview

Description of procedures and intermediate data files

1. UCSC Download

Data were downloaded from the UCSC Table Browser (30-06-2017), selecting the following attributes:
  1. hg19.knownGene
  2. hg19.ensemblSource
  3. hg19.kgXref
  4. hg19.knownToEnsembl
  5. hg19.refSeqStatus
  6. hg19.spMrna
Transcripts were then filtered according to the following rules:
  1. Transcripts of protein coding genes only (Ensembl annotation)
  2. Transcripts must have a RefSeq protein ID
Gene names were taken from the field "Associated Gene Name". Since the EPD format doesn't allow gene names longer than 18 characters, we checked whether the names respected this limitation.
Transcripts with the same TSS position were merged under a common ID. After merging, the list contained 28210 TSSs.
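
The merging step above can be sketched as follows. This is a minimal illustration, not the code used to build EPDnew: the function name merge_by_tss and the tuple layout of the input are assumptions made for the example; only the 18-character limit and the rule "same TSS position, common ID" come from the text.

```python
from collections import defaultdict

MAX_GENE_NAME_LEN = 18  # EPD format limit stated in the text


def merge_by_tss(transcripts):
    """Group transcript IDs that share chromosome, strand and TSS position.

    `transcripts` is an iterable of
    (transcript_id, gene_name, chrom, strand, tss) tuples.
    This input layout is hypothetical.
    """
    groups = defaultdict(list)
    for tx_id, gene, chrom, strand, tss in transcripts:
        # Check the gene-name length limit before accepting the record.
        if len(gene) > MAX_GENE_NAME_LEN:
            raise ValueError(f"gene name longer than 18 characters: {gene}")
        groups[(chrom, strand, tss)].append(tx_id)
    # One entry per distinct TSS; the sorted ID list represents the merged group.
    return {key: sorted(ids) for key, ids in groups.items()}
```

The number of keys in the returned dictionary corresponds to the number of distinct TSSs after merging.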

2. Gencode TSS collection

The Gencode TSS collection is stored as a tab-delimited text file conforming to the SGA format. The six fields in the file contain the following kinds of information:
  • NCBI/RefSeq chromosome id
  • "ENSEMBL"
  • position
  • strand ("+" or "-")
  • "1"
  • TranscriptID..GeneName.
Note that the second and fifth fields are invariant.
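
As an illustration of the six-field layout listed above, the following sketch parses one SGA line into a named record. The class name SgaRecord, the function name parse_sga_line and the field names are assumptions chosen for the example; the field order follows the list above.

```python
from typing import NamedTuple


class SgaRecord(NamedTuple):
    chrom: str        # field 1: NCBI/RefSeq chromosome id
    feature: str      # field 2: "ENSEMBL" in this collection
    position: int     # field 3: position
    strand: str       # field 4: "+" or "-"
    count: int        # field 5: "1" in this collection
    description: str  # field 6: TranscriptID..GeneName.


def parse_sga_line(line):
    """Split one tab-delimited SGA line into its six fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 6:
        raise ValueError(f"expected 6 fields, found {len(fields)}")
    chrom, feature, position, strand, count, description = fields
    if strand not in ("+", "-"):
        raise ValueError(f"unexpected strand: {strand!r}")
    return SgaRecord(chrom, feature, int(position), strand, int(count), description)
```

For example, a line such as "NC_000001.11<TAB>ENSEMBL<TAB>12345<TAB>+<TAB>1<TAB>TRANSCRIPT_X..GENE_X." (position and identifiers made up for illustration) yields a record whose position is the integer 12345.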

3. Data import from ENCODE and FANTOM5 CAGE

CAGE tag data were downloaded from the UCSC ftp site and the FANTOM5 http site (see links above). The source files are in BAM format, mapped on the hg19 genome assembly. Samples were lifted over to the hg38 genome assembly using the liftOver tool. The complete list of files can be found here for ENCODE and here for FANTOM5. BAM files were converted into BED files with the bamToBed program. Files were kept and analysed individually.
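
A possible next step after the BAM-to-BED conversion is to reduce each tag to its 5' end, since that is the base relevant for TSS mapping. The sketch below is an assumption about how this could be done and is not taken from the pipeline itself. It relies on two facts about the BED format: coordinates are 0-based and half-open, and the six-column output of bamToBed carries the strand in the sixth column. The function name count_five_prime_ends is hypothetical.

```python
from collections import Counter


def count_five_prime_ends(bed_lines):
    """Count tags per (chrom, strand, 1-based 5'-end position).

    Assumes six-column BED lines: chrom, start, end, name, score, strand.
    """
    counts = Counter()
    for line in bed_lines:
        # Skip blank lines and optional BED header lines.
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        fields = line.rstrip("\n").split("\t")
        chrom, start, end, strand = fields[0], int(fields[1]), int(fields[2]), fields[5]
        # BED is 0-based, half-open: the first base is start + 1, the last is end.
        pos = start + 1 if strand == "+" else end
        counts[(chrom, strand, pos)] += 1
    return counts
```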

4. CAGE tags

The compressed versions of these files are available from the MGA archive (see links above).

5. mRNA 5' tags peak calling

Peak calling for each individual CAGE data file was carried out using the ChIP-Peak on-line tool with the following parameters:
  • Window width = 1
  • Vicinity range = 200
  • Peak refine = N
  • Count cutoff = 9999999
  • Threshold = 5
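
To give an intuition for what the window width, vicinity range and threshold parameters control, here is a simplified local-maximum peak caller. It is not the ChIP-Peak implementation. It assumes that a position is reported as a peak when its tag count reaches the threshold and no position within the vicinity range on either side has a higher count, with ties resolved in favour of the leftmost position. The count cutoff and peak refinement options are not modelled, and the function name call_peaks is hypothetical.

```python
def call_peaks(counts, vicinity=200, threshold=5):
    """Return [(position, count)] for local maxima on one chromosome strand.

    `counts` maps a position to its tag count (window width = 1, so each
    position is scored on its own).
    """
    peaks = []
    for pos, n in sorted(counts.items()):
        if n < threshold:
            continue
        # A peak must not be exceeded by any position within +/- vicinity;
        # among equal counts the leftmost position wins (assumption).
        is_local_max = all(
            n > m or (n == m and pos <= other)
            for other, m in counts.items()
            if abs(other - pos) <= vicinity
        )
        if is_local_max:
            peaks.append((pos, n))
    return peaks
```

The quadratic scan keeps the sketch short; a production implementation would use a sliding window over sorted positions.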

6. TSS validation and shifting

Each sample in the collection (mRNA peaks and Gencode TSS) was then processed in a pipeline aimed at validating transcription start sites with mRNA peaks. A Gencode TSS was considered experimentally confirmed if a CAGE peak lay within a 300 bp window around it, or mapped in the 5' UTR, and the peak had a maximum height of at least 5 tags (50 tags for peaks in the 5' UTR). The validated TSS was then shifted to the nearest base with the highest tag density.
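
The validation rule can be sketched as below, under several stated assumptions. The 300 bp window is taken to mean 150 bp on either side of the annotated TSS. The separate 5' UTR rule (50 tags) is left out because it requires transcript structure. "Shifting" is interpreted as moving to the highest qualifying peak, with ties resolved in favour of the peak closest to the annotated position. The function name validate_tss is hypothetical.

```python
def validate_tss(tss, peaks, window=300, min_height=5):
    """Return the shifted TSS position, or None if the TSS is not validated.

    `peaks` is a list of (position, height) tuples on the same chromosome
    and strand as the annotated TSS.
    """
    half = window // 2  # assumption: window is centred on the TSS
    candidates = [
        (pos, height)
        for pos, height in peaks
        if abs(pos - tss) <= half and height >= min_height
    ]
    if not candidates:
        return None
    # Prefer the highest peak; among equal heights prefer the nearest one.
    best_pos, _ = max(candidates, key=lambda ph: (ph[1], -abs(ph[0] - tss)))
    return best_pos
```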

7. Gencode not-validated TSS

The total number of TSSs that were not experimentally validated (summing up all samples) was around 3000.

8. Promoter collection for each sample

Each sample in the dataset was used to generate a separate promoter collection. Potentially, the same transcript could be validated by multiple samples and it could have different start sites in different samples. To avoid redundancy, the individual collections were used as input for an additional step in the analysis (Assembly pipeline part B).

9. Quality controls of sample-specific promoter collections

The quality of the promoter collection derived from each sample was tested to exclude low-quality samples from the final collection. To achieve this, each promoter collection was scored according to the frequency of the TATA-box and Inr motifs at their expected positions (-29 bp from the TSS and at the TSS, respectively). Samples with very low motif frequencies (Inr frequency < 10% and TATA-box < 5%) were discarded from further analyses (9 samples in total).
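
The discard rule can be written as a small filter. The sketch assumes that motif frequencies have already been computed per sample and are given as fractions between 0 and 1, and it reads the "and" in the rule literally: a sample is discarded only when both frequencies fall below their cutoffs. The function names and the sample names in the usage note are hypothetical.

```python
def passes_motif_qc(inr_freq, tata_freq, min_inr=0.10, min_tata=0.05):
    """Return False when both motif frequencies are below their cutoffs."""
    return not (inr_freq < min_inr and tata_freq < min_tata)


def filter_samples(motif_freqs):
    """Keep samples that pass QC.

    `motif_freqs` maps a sample name to (inr_frequency, tata_frequency).
    """
    return {
        name: freqs
        for name, freqs in motif_freqs.items()
        if passes_motif_qc(*freqs)
    }
```

With the input {"sample_a": (0.30, 0.12), "sample_b": (0.04, 0.01)}, only sample_a is kept.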

10. Merging collections and second TSS selection

The good-quality promoter collections were merged into a single file and further analysed. The promoter of a transcript was retained in the list only if it was validated by at least two samples. A transcript validated by multiple samples could have its TSS spread over a broader region rather than set at a single position. To avoid this inconsistency, for each transcript we selected as the true TSS the position that was validated by the largest number of samples.
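
The selection rule amounts to a vote across samples. A minimal sketch follows, assuming each sample-specific collection is available as a mapping from transcript ID to validated TSS position; this layout and the function name select_tss are hypothetical. When two positions receive the same number of votes, the sketch picks the smaller coordinate, which is an arbitrary choice not specified in the text.

```python
from collections import Counter, defaultdict


def select_tss(sample_collections, min_samples=2):
    """Pick one TSS per transcript from per-sample collections.

    `sample_collections` is a list of dicts, one per sample, mapping
    transcript ID to the TSS position validated in that sample.
    Returns {transcript_id: (position, number_of_supporting_samples)}.
    """
    votes = defaultdict(Counter)
    for collection in sample_collections:
        for tx_id, pos in collection.items():
            votes[tx_id][pos] += 1

    selected = {}
    for tx_id, counter in votes.items():
        # Require validation by at least `min_samples` samples overall.
        if sum(counter.values()) < min_samples:
            continue
        # Most supported position; ties go to the smaller coordinate.
        pos, n = min(counter.items(), key=lambda kv: (-kv[1], kv[0]))
        selected[tx_id] = (pos, n)
    return selected
```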

11. Filtering

Transcription start sites that mapped close to other TSSs of the same gene (within a 500 bp window) were merged into a single promoter following the same rule: the promoter that was validated by the highest number of samples was kept.
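
One way to implement this filter is a greedy pass that visits promoters from the best supported to the least supported and drops any promoter lying close to one already kept for the same gene. The sketch assumes that "500 bp window" means a distance of at most 500 bp between the two positions; the tuple layout and the function name merge_close_tss are hypothetical.

```python
def merge_close_tss(promoters, window=500):
    """Keep the best-supported promoter among nearby TSSs of the same gene.

    `promoters` is a list of (gene, chrom, strand, position, n_samples).
    """
    kept = []
    # Visit well-supported promoters first so they claim their neighbourhood.
    for prom in sorted(promoters, key=lambda p: (-p[4], p[3])):
        gene, chrom, strand, pos, _ = prom
        clash = any(
            k[0] == gene
            and k[1] == chrom
            and k[2] == strand
            and abs(k[3] - pos) <= window
            for k in kept
        )
        if not clash:
            kept.append(prom)
    # Return in genomic order for readability.
    return sorted(kept, key=lambda p: (p[1], p[3]))
```

Promoters of different genes are never merged, even when they lie within the window of each other.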

12. Final EPDnew collection

The 25503 experimentally validated promoters were stored in the EPDnew database, which can be downloaded from our ftp site. Scientists are welcome to use our other tools, ChIP-Seq (for correlation analysis) and SSA (for motif analysis around promoters), to analyse the EPDnew database.

Last update October 2019