TSS assembly pipeline for Hs_EPDnew_005
  Introduction
  This document provides a technical description of the transcription
  start site assembly pipeline that was used to generate EPDnew
  version 005 for H. sapiens.
  
  Source Data
Promoter collection:
  
    | Name | Genome Assembly | Promoters | Genes | PMID | Access data |
    | Gencode | Dec 2013 GRCh38/hg38 | 31931 | 18960 | 27250503 | SOURCE, DOC, DATA |
  
Experimental data:
  Assembly pipeline overview
  
    
      
    
    
    Description of procedures and intermediate data files
    1. UCSC Download
    Data were downloaded from the UCSC Table Browser (30-06-2017),
    selecting the following attributes:
    
      - hg19.knownGene
 
      - hg19.ensemblSource
 
      - hg19.kgXref
 
      - hg19.knownToEnsembl
 
      - hg19.refSeqStatus
 
      - hg19.spMrna
 
    
    Then, transcripts were filtered according to the following rules:
    
      - Transcripts of protein coding genes only (Ensembl
      annotation)
 
      - Transcripts must have a RefSeq protein ID
 
    
    Gene names were taken from the field "Associated Gene Name". Since
    the EPD format doesn't allow gene names longer than 18 characters,
    we checked whether the names respected this limitation.
    Transcripts with the same TSS position were merged under a common
    ID. As a result, the total number of TSS in the list was 28210.
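
    The merging and name-length check described above can be sketched
    as follows. This is a minimal illustration, not the pipeline's own
    code: the tuple layout, the function names and the example records
    are assumptions made for the example.

```python
from collections import defaultdict

MAX_GENE_NAME_LEN = 18  # gene-name limit of the EPD format, as stated in the text


def merge_by_tss(transcripts):
    """Group transcript IDs that share chromosome, strand and TSS position.

    `transcripts` is an iterable of
    (transcript_id, gene_name, chrom, strand, tss) tuples (hypothetical layout).
    """
    groups = defaultdict(list)
    for tid, _gene, chrom, strand, tss in transcripts:
        groups[(chrom, strand, tss)].append(tid)
    return dict(groups)


def names_too_long(transcripts):
    """Return the sorted gene names that exceed the EPD length limit."""
    return sorted({gene for _, gene, *_ in transcripts
                   if len(gene) > MAX_GENE_NAME_LEN})


# Invented example records: T1 and T2 share a TSS and are merged.
example = [
    ("T1", "GENE_A", "chr1", "+", 1000),
    ("T2", "GENE_A", "chr1", "+", 1000),
    ("T3", "GENE_A", "chr1", "+", 1500),
]
merged = merge_by_tss(example)
print(len(merged), names_too_long(example))
```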
    
2. Gencode TSS collection
    The Gencode TSS collection is stored as a tab-delimited text file
    conforming to the SGA format.
    The six fields in the file contain the following kinds of
    information:
    
      - NCBI/RefSeq chromosome id
 
      - "ENSEMBL"
 
      - position
 
      - strand ("+" or "-")
 
      - "1"
 
      - TranscriptID..GeneName.
 
    
    Note that the second and fifth fields are invariant.
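
    As an illustration of the six-field layout listed above, a line of
    such a file could be read as sketched here. The record type, its
    attribute names and the example line are assumptions made for the
    example; only the field order comes from the list.

```python
from typing import NamedTuple


class SgaRecord(NamedTuple):
    # Attribute names are illustrative; the field order follows the list above.
    chrom: str
    feature: str
    position: int
    strand: str
    count: int
    description: str


def parse_sga_line(line):
    """Split one tab-delimited SGA line into its six fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 6:
        raise ValueError(f"expected 6 fields, got {len(fields)}")
    chrom, feature, position, strand, count, description = fields
    if strand not in ("+", "-"):
        raise ValueError(f"unexpected strand: {strand!r}")
    return SgaRecord(chrom, feature, int(position), strand, int(count), description)


# Invented example line (position and last field are placeholders).
record = parse_sga_line("NC_000001.11\tENSEMBL\t12345\t-\t1\tEXAMPLE_ID\n")
print(record.position, record.strand)
```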
    
3. Data import from ENCODE and FANTOM5 CAGE
    CAGE tag data were downloaded from the UCSC ftp site and the FANTOM5
    http site (see links above). The source files are in BAM format,
    mapped to the hg19 genome assembly. Samples were lifted over to the
    hg38 genome assembly using the liftOver tool. The complete list of
    files can be found here for ENCODE and here for FANTOM5. BAM files
    were converted into BED files with the bamToBed program. Files were
    kept and analysed individually.
    
4. CAGE tags
    The compressed versions of these files are available from the MGA
    archive (see links above).
    
5. mRNA 5' tags peak calling
    Peak calling for each individual CAGE data file was carried
    out using the ChIP-Peak on-line tool with the following parameters:
    
      - Window width = 1
 
      - Vicinity range = 200
 
      - Peak refine = N
 
      - Count cutoff = 9999999
 
      - Threshold = 5
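
    To give an intuition for how these parameters interact, here is a
    simplified stand-in for a peak caller. It is not the ChIP-Peak
    implementation. It assumes that, with a window width of 1, each
    position is scored by its own tag count, that a peak must reach the
    threshold, and that it must be the local maximum within the
    vicinity range. The count cutoff and peak refinement are left out,
    and the tie-breaking rule is an arbitrary choice.

```python
def call_peaks(counts, vicinity=200, threshold=5):
    """Return (position, tag_count) peaks for one chromosome strand.

    `counts` maps a position to its tag count (hypothetical input layout).
    A position is reported when its count reaches `threshold` and no
    position within `vicinity` bp has a higher count; on ties the
    leftmost position wins (arbitrary choice for this sketch).
    """
    peaks = []
    for pos in sorted(counts):
        n = counts[pos]
        if n < threshold:
            continue
        is_local_max = all(
            n > counts[q] or (n == counts[q] and pos <= q)
            for q in counts
            if abs(q - pos) <= vicinity
        )
        if is_local_max:
            peaks.append((pos, n))
    return peaks


# Invented tag counts: position 300 is suppressed by the taller peak at 150.
example_counts = {100: 3, 150: 12, 300: 7, 600: 6, 601: 2}
print(call_peaks(example_counts))
```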
 
    
    6. TSS validation and shifting
    Each sample in the collection (mRNA peaks and Gencode TSS) was then
    processed in a pipeline aimed at validating transcription start
    sites with mRNA peaks. A Gencode TSS was considered experimentally
    confirmed if a CAGE peak lay within a 300 bp window around it, or
    mapped in the 5'UTR region, and had a maximum height of at least 5
    tags (50 tags for peaks in the 5'UTR). The validated TSS was then
    shifted to the nearest base with the highest tag density.
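
    One possible reading of this rule is sketched below. Several
    details are assumptions: the 300 bp window is taken as 150 bp on
    each side of the annotated TSS, the 5'UTR is passed as a
    (start, end) interval, strand is ignored, and "shifting" selects
    the tallest qualifying peak, with ties resolved in favour of the
    peak closest to the annotated position.

```python
def validate_and_shift(tss, peaks, half_window=150, min_tags=5,
                       utr=None, min_tags_utr=50):
    """Return the shifted TSS position, or None if the TSS is not validated.

    `peaks` is a list of (position, height) pairs; `utr` is an optional
    (start, end) interval. Both layouts are hypothetical.
    """
    candidates = []
    for pos, height in peaks:
        near_tss = abs(pos - tss) <= half_window and height >= min_tags
        in_utr = (utr is not None and utr[0] <= pos <= utr[1]
                  and height >= min_tags_utr)
        if near_tss or in_utr:
            candidates.append((pos, height))
    if not candidates:
        return None
    # Tallest peak first; among equal heights, prefer the nearest to the TSS.
    best = max(candidates, key=lambda p: (p[1], -abs(p[0] - tss)))
    return best[0]


# Invented peaks around an annotated TSS at position 1000.
example_peaks = [(990, 8), (1100, 20), (1400, 30)]
print(validate_and_shift(1000, example_peaks))
```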
    
7. Gencode not-validated TSS
    The total number (summing up all samples) of TSS that were not
    experimentally validated was around 3000.
    
8. Promoter collection for each sample
    Each sample in the dataset was used to generate a separate
    promoter collection. Potentially, the same transcript could be
    validated by multiple samples and it could have different start
    sites in different samples. To avoid redundancy, the individual
    collections were used as input for an additional step in the
    analysis (Assembly pipeline part B).
    
9. Quality controls of sample-specific promoter collections
    The quality of promoter collections derived from each sample was
    tested to exclude low-quality samples from the final
    collection. To achieve this, each promoter collection was scored
    according to the distribution of the TATA-box and Inr motifs at
    their expected positions (-29 bp from the TSS and at the TSS,
    respectively). Samples with very low motif frequencies (Inr
    frequency < 10% and TATA-box < 5%) were discarded (9 samples in
    total) from further analyses.
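
    The discard rule can be written as a small predicate. This sketch
    reads the "and" literally, so a sample is dropped only when both
    frequencies fall below their thresholds; that reading, the function
    names and the example values are assumptions. How the motif
    frequencies themselves are computed is not covered here.

```python
def passes_motif_qc(inr_freq, tata_freq, min_inr=0.10, min_tata=0.05):
    """Return False for samples where both motif frequencies are very low.

    Frequencies are fractions of promoters carrying the motif at its
    expected position (0.10 means 10%).
    """
    return not (inr_freq < min_inr and tata_freq < min_tata)


def filter_samples(sample_scores):
    """Keep the names of samples passing QC.

    `sample_scores` maps a sample name to (inr_freq, tata_freq);
    this layout is hypothetical.
    """
    return [name for name, (inr, tata) in sample_scores.items()
            if passes_motif_qc(inr, tata)]


# Invented scores: only sample_b fails both thresholds.
example_scores = {
    "sample_a": (0.25, 0.08),
    "sample_b": (0.06, 0.02),
    "sample_c": (0.09, 0.07),
}
print(filter_samples(example_scores))
```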
    
10. Merging collections and second TSS selection
    The good-quality promoter collections were merged into a single
    file and further analysed. The promoter of a transcript was
    maintained in the list only if validated by at least two
    samples. Transcripts validated by multiple samples could
    potentially have the TSS spread over a broader region rather than
    set on a single position. To avoid such inconsistency, for each
    transcript we selected the position that was validated by the
    largest number of samples as the true TSS.
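
    A minimal sketch of this selection step is given below. It assumes
    that each sample-specific collection is a mapping from transcript
    ID to validated TSS position, and that ties between equally
    supported positions go to the smallest coordinate. Both the data
    layout and the tie-breaking rule are assumptions for illustration.

```python
from collections import Counter, defaultdict


def select_tss(sample_collections, min_samples=2):
    """Pick, per transcript, the TSS position supported by the most samples.

    `sample_collections` is a list with one dict per sample, mapping
    transcript ID to TSS position (hypothetical layout). Transcripts
    validated by fewer than `min_samples` samples are dropped.
    Returns {transcript_id: (position, supporting_samples)}.
    """
    votes = defaultdict(Counter)
    for collection in sample_collections:
        for tid, pos in collection.items():
            votes[tid][pos] += 1

    selected = {}
    for tid, counter in votes.items():
        if sum(counter.values()) < min_samples:
            continue
        # Most supported position; ties go to the smallest coordinate.
        pos, support = min(counter.items(), key=lambda kv: (-kv[1], kv[0]))
        selected[tid] = (pos, support)
    return selected


# Invented collections: T1 is seen in three samples, T2 in only one.
example_samples = [{"T1": 100, "T2": 500}, {"T1": 100}, {"T1": 103}]
print(select_tss(example_samples))
```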
    
11. Filtering
    Transcription start sites that mapped close to other TSS
    belonging to the same gene (500 bp window) were merged into a
    single promoter following the same rule: the promoter that was
    validated by the highest number of samples was kept.
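
    This filtering can be approximated with a greedy pass, sketched
    here under stated assumptions: promoters are (gene, position,
    supporting samples) tuples on the same chromosome and strand, "500
    bp window" is read as a distance of at most 500 bp, and promoters
    are visited from the best supported to the least supported.

```python
def merge_close_promoters(promoters, window=500):
    """Keep, among same-gene promoters closer than `window`, the best supported.

    `promoters` is a list of (gene, position, support) tuples
    (hypothetical layout). Returns the kept tuples sorted by gene and
    position.
    """
    kept = []
    # Visit the best supported promoters first so they suppress weaker neighbours.
    for gene, pos, support in sorted(promoters, key=lambda p: (-p[2], p[1])):
        clash = any(g == gene and abs(pos - p) <= window for g, p, _ in kept)
        if not clash:
            kept.append((gene, pos, support))
    return sorted(kept, key=lambda p: (p[0], p[1]))


# Invented promoters: G1 at 1000 is absorbed by the stronger G1 at 1200.
example_promoters = [
    ("G1", 1000, 5),
    ("G1", 1200, 9),
    ("G1", 2000, 3),
    ("G2", 1100, 2),
]
print(merge_close_promoters(example_promoters))
```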
    
12. Final EPDnew collection
    The 25503 experimentally validated promoters were stored in the
    EPDnew database, which can be downloaded from our ftp
    site. Scientists are welcome to use our other tools, ChIP-Seq (for
    correlation analysis) and SSA (for motif analysis
    around promoters), to analyse the EPDnew database.