Promoter collection:
Name | Genome Assembly | Promoters | Genes | PMID | Access data | ||
---|---|---|---|---|---|---|---|
HGNC | Dec 2013 GRCh38/hg38 | 8685 | 3496 | 30304474 | SOURCE | DOC | DATA |
Experimental data:
Name | Type | Samples | Tags | PMID | Access data | ||
---|---|---|---|---|---|---|---|
FANTOM5 | CAGE | 941 | 18,244,201,540 | 24670764 | SOURCE | DOC | DATA |
ENCODE | CAGE | 145 | 7,134,200,060 | 22955620 | SOURCE | DOC | DATA |
ENCODE | RAMPAGE | 225 | 13,540,041,874 | 22936248 | SOURCE | DOC | DATA |
A list of non-coding RNA genes was downloaded from HGNC BioMart. Genes were kept if the locus type was either antisense or long intergenic non-coding RNA. As the coordinates could not be retrieved directly from HGNC, we used Ensembl BioMart and the Ensembl Gene ID to obtain them wherever possible, and RefSeq otherwise. Genes for which no coordinates could be obtained were discarded. All transcripts associated with a given gene were considered, but those sharing the same TSS position were merged under a common transcript ID. The resulting list contained 8685 TSSs covering 3496 genes.
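The merging of transcripts that share a TSS can be illustrated with a minimal sketch. The record layout, the function name `merge_transcripts_by_tss`, and the rule for choosing the representative transcript ID (alphabetically first) are assumptions made for illustration; the source does not state how the common ID was chosen.

```python
from collections import defaultdict


def merge_transcripts_by_tss(records):
    """Collapse transcripts of the same gene that share a TSS position.

    `records` is an iterable of (gene, transcript_id, chrom, strand, start, end)
    tuples with 1-based inclusive coordinates (hypothetical layout).
    Returns a dict mapping (gene, chrom, strand, tss) to one transcript ID.
    """
    groups = defaultdict(list)
    for gene, transcript_id, chrom, strand, start, end in records:
        # The TSS is the 5' end of the transcript: start on "+", end on "-".
        tss = start if strand == "+" else end
        groups[(gene, chrom, strand, tss)].append(transcript_id)
    # Assumption: the alphabetically first ID represents each merged group.
    return {key: sorted(ids)[0] for key, ids in groups.items()}
```

Each key of the returned dictionary corresponds to one distinct TSS, so the number of keys is the number of TSSs in the collection and the number of distinct genes among the keys is the number of genes covered.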
The HGNC TSS collection is stored as a tab-delimited text file conforming to the SGA format. The six fields in the file contain the following information:
TSS mapping data (CAGE and RAMPAGE) were downloaded from the UCSC FTP site and the FANTOM5 HTTP site (see links above). The source files are in BAM format, mapped to the hg19 genome assembly. Samples were lifted over to the hg38 genome assembly using the liftOver tool. The complete list of files can be found here for ENCODE (or here for RAMPAGE) and here for FANTOM5. BAM files were converted into BED files with the bamToBed program. Files were kept and analyzed individually.
The compressed versions of these files are available from the MGA archive (see links above).
Each sample was processed in a pipeline designed to validate TSSs with CAGE/RAMPAGE peaks. An HGNC TSS was considered experimentally confirmed if a peak lay within a window of 50 bp around it. The validated TSS was then shifted to the nearest base with the highest tag density.
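The validation step can be sketched roughly as follows. This is not the pipeline's actual code: the function name `validate_tss` is hypothetical, a "peak" is simplified to any position carrying at least `min_tags` tags, and the 50 bp window is interpreted as the total width centred on the annotated TSS (25 bp on each side), which the text does not specify.

```python
def validate_tss(tss, tag_counts, window=50, min_tags=1):
    """Return the shifted TSS position for one sample, or None if unconfirmed.

    `tag_counts` maps positions (same chromosome and strand as the TSS)
    to the number of CAGE/RAMPAGE tags starting there.
    """
    half = window // 2  # assumption: window is the total width around the TSS
    best_pos, best_count = None, -1
    for pos in range(tss - half, tss + half + 1):
        count = tag_counts.get(pos, 0)
        if count < min_tags:
            continue
        # Prefer the highest tag density; break ties by proximity to the TSS.
        closer = best_pos is not None and abs(pos - tss) < abs(best_pos - tss)
        if count > best_count or (count == best_count and closer):
            best_pos, best_count = pos, count
    return best_pos
```

For example, with tags at positions 95 (3 tags), 104 (7 tags) and 110 (7 tags), a TSS annotated at position 100 would be shifted to 104, the closer of the two positions with the highest count.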
Each sample in the dataset was used to generate a separate promoter collection. The same promoter could be validated in multiple samples and could have different start sites in different samples. To avoid redundancy, the individual collections were used as input for an additional step in the analysis (see part B in the figure above).
All sample-specific promoter collections were merged into a single file and further analyzed. A promoter was retained in the list only if it was validated by at least 3 samples. Promoters validated by multiple samples may have their start site spread over a broader region rather than set at a single position. For each transcript, we therefore selected the position validated by the largest number of samples as the "true" TSS.
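A minimal sketch of this consensus step is given below, assuming each sample-specific collection is available as a mapping from transcript ID to validated TSS position. The function name `consensus_tss` and the tie-breaking rule (lowest coordinate) are illustrative assumptions, not part of the described pipeline.

```python
from collections import Counter, defaultdict


def consensus_tss(sample_collections, min_samples=3):
    """Pick one TSS per transcript from several sample-specific collections.

    `sample_collections` is a list of dicts (one per sample) mapping
    transcript ID -> validated TSS position in that sample.
    """
    votes = defaultdict(Counter)
    for collection in sample_collections:
        for transcript_id, position in collection.items():
            votes[transcript_id][position] += 1

    consensus = {}
    for transcript_id, counter in votes.items():
        # Discard promoters validated by fewer than `min_samples` samples.
        if sum(counter.values()) < min_samples:
            continue
        # Most frequently validated position; ties go to the lowest coordinate
        # (an arbitrary choice made for this sketch).
        position, _ = min(counter.items(), key=lambda item: (-item[1], item[0]))
        consensus[transcript_id] = position
    return consensus
```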
We used our ChIP-Peak online tool as above but with a vicinity of 150 and a threshold of 1, in order to retain the single most expressed promoter in each promoter "cluster".
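The effect of this step, keeping one promoter per cluster, can be approximated with a simple greedy selection. The sketch below is a stand-in for illustration and not the ChIP-Peak algorithm itself; the function name, the tuple layout, and the assumption that the vicinity is measured in base pairs on the same chromosome and strand are all hypothetical.

```python
def keep_strongest_per_cluster(promoters, vicinity=150):
    """Greedily keep the most expressed promoter within each neighbourhood.

    `promoters` is a list of (chrom, strand, position, expression) tuples.
    A promoter is dropped if an already kept, more expressed promoter lies
    within `vicinity` bases on the same chromosome and strand.
    """
    kept = []
    # Visit promoters from most to least expressed.
    for prom in sorted(promoters, key=lambda p: (-p[3], p[0], p[1], p[2])):
        chrom, strand, position, _ = prom
        clash = any(
            k[0] == chrom and k[1] == strand and abs(k[2] - position) <= vicinity
            for k in kept
        )
        if not clash:
            kept.append(prom)
    # Return the survivors in genomic order.
    return sorted(kept, key=lambda p: (p[0], p[1], p[2]))
```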
Finally, we applied an additional filter on relative expression, keeping only promoters whose expression represents at least 10% of the associated gene's total expression. We also limited the number of promoters per gene to 5.
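The final filter could look roughly like the sketch below. Two points are assumptions rather than statements from the text: a gene's total expression is taken to be the sum over its candidate promoters, and when more than 5 promoters pass the 10% threshold the 5 most expressed are kept. The function name and input layout are hypothetical.

```python
from collections import defaultdict


def filter_by_relative_expression(promoters, min_fraction=0.10, max_per_gene=5):
    """Keep promoters contributing at least `min_fraction` of a gene's expression.

    `promoters` is a list of (gene, promoter_id, expression) tuples.
    Returns a dict mapping gene -> list of retained promoter IDs,
    ordered from most to least expressed.
    """
    by_gene = defaultdict(list)
    for gene, promoter_id, expression in promoters:
        by_gene[gene].append((promoter_id, expression))

    selected = {}
    for gene, entries in by_gene.items():
        # Assumption: total expression is the sum over the gene's promoters.
        total = sum(expression for _, expression in entries)
        if total <= 0:
            continue
        passing = [e for e in entries if e[1] / total >= min_fraction]
        # Assumption: if more than `max_per_gene` pass, keep the strongest ones.
        passing.sort(key=lambda e: (-e[1], e[0]))
        selected[gene] = [promoter_id for promoter_id, _ in passing[:max_per_gene]]
    return selected
```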