March 2012 GRCm38/mm10
Data was downloaded from BioMart. Genes were kept if the gene type was long intergenic non-coding RNA. All transcripts associated with any given gene were considered, but those with the same TSS position were merged under a common transcript ID. The resulting list contained 8685 TSSs covering 3496 genes.
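The merging step can be illustrated with a minimal sketch. The function name, the tuple layout and the choice of the alphabetically first transcript ID as the common ID are assumptions made for illustration; the text does not say how the common ID was chosen.

```python
from collections import defaultdict


def merge_by_tss(transcripts):
    """Collapse transcripts of the same gene that share a TSS position.

    transcripts: iterable of (gene_id, transcript_id, chrom, strand, tss).
    Returns one (gene_id, transcript_id, chrom, strand, tss) per unique TSS.
    """
    groups = defaultdict(list)
    for gene_id, tx_id, chrom, strand, tss in transcripts:
        groups[(gene_id, chrom, strand, tss)].append(tx_id)

    merged = []
    for (gene_id, chrom, strand, tss), tx_ids in sorted(groups.items()):
        # Assumption: the alphabetically first ID represents the merged group.
        merged.append((gene_id, min(tx_ids), chrom, strand, tss))
    return merged
```

With this layout, two transcripts of one gene starting at the same base yield a single entry, while the same coordinate in a different gene stays separate.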
The Ensembl TSS collection is stored as a tab-delimited text file conforming to the SGA format. The six fields in the file contain the following information:
TSS mapping data were downloaded from the FANTOM5 HTTP site (see link above). The source files are in BAM format, mapped to the mm10 genome assembly. The complete list of files can be found here. BAM files were converted into BED files with the bamToBed program. Files were kept and analyzed individually.
The compressed versions of these files are available from the MGA archive (see links above).
Each sample was processed in a pipeline that validates TSSs with CAGE peaks. An Ensembl TSS was considered experimentally confirmed if a peak lay within a window of 50 bp around it. The validated TSS was then shifted to the nearest base with the highest tag density.
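The validation and shifting logic can be sketched as follows. This is a simplified illustration, not the pipeline itself: it treats any position carrying at least one tag as evidence of a peak, reads the 50 bp window as centered on the TSS (25 bp on each side), and breaks ties by distance to the annotated TSS. All three choices, as well as the function name and input layout, are assumptions.

```python
def validate_tss(tss, tag_counts, window=50):
    """Return the shifted TSS position, or None if the TSS is not confirmed.

    tss: annotated TSS coordinate.
    tag_counts: dict mapping position -> CAGE tag count
                (same chromosome and strand as the TSS).
    window: total window width in bp, assumed centered on the TSS.
    """
    half = window // 2
    candidates = [
        (pos, count)
        for pos, count in tag_counts.items()
        if count > 0 and abs(pos - tss) <= half
    ]
    if not candidates:
        return None  # no CAGE signal in the window: TSS not validated

    # Highest tag count first; ties go to the base nearest the annotated TSS,
    # then to the lower coordinate so the result is deterministic.
    best_pos, _ = min(candidates, key=lambda c: (-c[1], abs(c[0] - tss), c[0]))
    return best_pos
```

For example, with an annotated TSS at 1000 and equal counts at 1005 and 1010, the sketch returns 1005, the closer of the two.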
Each sample in the dataset was used to generate a separate promoter collection. The same promoter could be validated in multiple samples and could have different start sites in different samples. To avoid redundancy, the individual collections were used as input for an additional step in the analysis (see part B in the figure above).
All sample-specific promoter collections were merged into a single file and further analyzed. A promoter was retained in the list only if it was validated by at least 3 samples. Promoters validated by multiple samples may have start sites spread over a broader region rather than fixed at a single position. For each transcript, we thus selected the position validated by the largest number of samples as the "true" TSS.
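A minimal sketch of this consensus step for a single transcript is given below. The function name and the input layout (one validated position per sample) are hypothetical, and the tie-breaking rule (lowest coordinate) is an assumption, since the text does not state one.

```python
from collections import Counter


def consensus_tss(per_sample_positions, min_samples=3):
    """Pick the TSS position supported by the largest number of samples.

    per_sample_positions: dict mapping sample name -> validated TSS position
                          for one transcript (unvalidated samples are absent).
    Returns the chosen position, or None if fewer than min_samples
    samples validated the promoter.
    """
    if len(per_sample_positions) < min_samples:
        return None

    support = Counter(per_sample_positions.values())
    # Most supporting samples first; ties resolved by the lower coordinate
    # (an assumption made only to keep the output deterministic).
    best_pos, _ = min(support.items(), key=lambda kv: (-kv[1], kv[0]))
    return best_pos
```

A promoter seen at position 102 in two samples and at 100 and 101 in one sample each would be assigned 102.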
We used our ChIP-Peak online tool as above but with a vicinity of 150 and a threshold of 1, in order to retain the single most expressed promoter in each promoter "cluster".
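The effect of such a vicinity filter can be approximated with a simple greedy procedure. The sketch below is a stand-in for illustration and is not the ChIP-Peak algorithm: it visits promoters from most to least expressed and keeps one only if no already-kept promoter lies within the vicinity distance. The function name, the input layout and the treatment of a distance exactly equal to the vicinity as "within" are assumptions.

```python
def keep_cluster_maxima(promoters, vicinity=150):
    """Greedy filter keeping the most expressed promoter in each neighborhood.

    promoters: list of (position, expression) on one chromosome and strand.
    Returns the retained (position, expression) pairs sorted by position.
    """
    kept = []
    # Visit the strongest promoters first; ties go to the lower coordinate.
    for pos, expr in sorted(promoters, key=lambda p: (-p[1], p[0])):
        if all(abs(pos - kept_pos) > vicinity for kept_pos, _ in kept):
            kept.append((pos, expr))
    return sorted(kept)
```

Note that a greedy pass of this kind can retain a weak promoter that lies just outside the vicinity of a kept one, so its output may differ from that of a true local-maximum search.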
Finally, we applied an additional filter on relative expression, keeping only promoters whose expression represents at least 10% of the associated gene's total expression. We also limited the number of promoters per gene to 5.
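The relative-expression filter can be sketched for one gene as below. Two points are assumptions not stated in the text: the gene's total expression is taken as the sum over its candidate promoters, and when more than 5 promoters pass, the 5 most expressed are kept. The function name and input layout are hypothetical.

```python
def filter_by_relative_expression(promoters, min_fraction=0.10, max_per_gene=5):
    """Keep promoters contributing at least min_fraction of a gene's expression.

    promoters: list of (promoter_id, expression) for a single gene.
    Returns at most max_per_gene pairs, most expressed first.
    """
    # Assumption: gene total is the sum over the gene's candidate promoters.
    total = sum(expr for _, expr in promoters)
    if total <= 0:
        return []

    passing = [(pid, expr) for pid, expr in promoters if expr / total >= min_fraction]
    # Assumption: when capping, prefer the most expressed promoters.
    passing.sort(key=lambda p: (-p[1], p[0]))
    return passing[:max_per_gene]
```

For a gene with promoters expressed at 50, 30, 15 and 5, the last one accounts for 5% of the total and is dropped.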