TSS assembly pipeline for Ce_EPDnew_001

Introduction

This document provides a technical description of the transcription start site assembly pipeline that was used to generate EPDnew version 001 for C. elegans.

Source Data

Promoter collection:

Name        Genome Assembly     Promoters  Genes  PMID      Access data
UCSC Genes  May 2008 WS190/ce6  20531      11786  26590259  SOURCE DOC DATA

Experimental data:

Name                 Type     Samples  Tags         PMID      Access data
Kruesi et al., 2013  GRO-cap  9        236,210,104  23795297  SOURCE DOC DATA

1. Download annotated TSS

Data were downloaded from the UCSC Table Browser. Transcripts were filtered according to the following rules:
  1. Transcripts of protein coding genes only
  2. Transcript lies on full chromosomes
  3. Genes must be annotated [Associated Gene Name present]
  4. Gene and transcript status known
Gene names were taken from the field "Associated Gene Name". Since the EPD format does not allow gene names longer than 18 characters, we checked that all names respected this limit.
A total of 20531 promoters were selected.
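
The filtering rules above can be sketched in Python. This is only an illustration under stated assumptions, not the script used for EPDnew: the record fields (`biotype`, `chrom`, `gene_name`, `status`), the `FULL_CHROMOSOMES` set and the function names are hypothetical, and the example records are made up.

```python
from typing import Iterable, List, NamedTuple

MAX_GENE_NAME_LEN = 18  # EPD format limit on gene name length

# Assumption: what counts as a "full chromosome" is given by this
# hypothetical set of UCSC-style names.
FULL_CHROMOSOMES = {"chrI", "chrII", "chrIII", "chrIV", "chrV", "chrX"}


class Transcript(NamedTuple):
    # Hypothetical record layout; real exports use other column names.
    transcript_id: str
    chrom: str
    biotype: str
    gene_name: str
    status: str


def keep_transcript(t: Transcript) -> bool:
    """Apply the four selection rules to one transcript."""
    return (
        t.biotype == "protein_coding"      # rule 1
        and t.chrom in FULL_CHROMOSOMES    # rule 2
        and t.gene_name != ""              # rule 3
        and t.status == "KNOWN"            # rule 4
    )


def names_too_long(transcripts: Iterable[Transcript]) -> List[str]:
    """Return gene names that exceed the EPD length limit."""
    return sorted(
        {t.gene_name for t in transcripts if len(t.gene_name) > MAX_GENE_NAME_LEN}
    )


# Made-up example records.
example = [
    Transcript("T1", "chrI", "protein_coding", "gene-a", "KNOWN"),
    Transcript("T2", "chrI", "ncRNA", "gene-b", "KNOWN"),
    Transcript("T3", "chrUn", "protein_coding", "gene-c", "KNOWN"),
    Transcript("T4", "chrII", "protein_coding", "", "KNOWN"),
]
selected = [t for t in example if keep_transcript(t)]
```

Only the first record passes all four rules; `names_too_long(selected)` can then be used to flag names that would not fit the EPD format.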

2. UCSC TSS collection

The UCSC TSS collection is stored as a tab-delimited text file conforming to the SGA format under the name:
    ucsc_promoter_list.sga
The six fields contain the following kinds of information:
  • NCBI/RefSeq chromosome id
  • "TSS"
  • position
  • strand ("+" or "-")
  • "1"
  • gene name / ID
Note that the second and fifth fields are invariant.
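
As an illustration of the six-field layout described above, the following minimal Python sketch parses and writes one SGA line. The names `SgaRecord`, `parse_sga_line` and `format_sga_line` are hypothetical, and the sequence id and gene name in the example line are placeholders, not real data.

```python
from typing import NamedTuple


class SgaRecord(NamedTuple):
    seq_id: str    # NCBI/RefSeq chromosome id
    feature: str   # constant "TSS" in this collection
    position: int  # TSS position
    strand: str    # "+" or "-"
    count: int     # constant 1 in this collection
    gene: str      # gene name / ID


def parse_sga_line(line: str) -> SgaRecord:
    """Split one tab-delimited SGA line into typed fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 6:
        raise ValueError(f"expected 6 fields, got {len(fields)}")
    seq_id, feature, position, strand, count, gene = fields
    if strand not in ("+", "-"):
        raise ValueError(f"unexpected strand: {strand!r}")
    return SgaRecord(seq_id, feature, int(position), strand, int(count), gene)


def format_sga_line(rec: SgaRecord) -> str:
    """Write a record back as a tab-delimited SGA line."""
    return "\t".join(
        [rec.seq_id, rec.feature, str(rec.position), rec.strand, str(rec.count), rec.gene]
    )


# Placeholder line, for illustration only.
line = "SEQ_ID_PLACEHOLDER\tTSS\t12345\t+\t1\tgene-a"
record = parse_sga_line(line)
```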

3. Import GRO-cap data

Data were imported from GEO in SRA format. Raw sequence files were mapped to the ce6 genome using Bowtie. The resulting BAM files were converted to SGA format using ChIP-Convert.
A step-by-step guide on how to import, map and convert these samples can be found here

4. Download annotated TSS file from eLife

The list of promoters published by Kruesi et al. was downloaded from eLife. The XLS file was converted to a tab-delimited flat file using OpenOffice and then to a BED file using in-house scripts. (Note that one line in the input data file contains up to 4 TSS coordinates.)
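
The in-house scripts are not part of this document. The Python sketch below only illustrates how a row carrying up to 4 TSS coordinates could be expanded into one BED line per TSS. The column order (chromosome, strand, gene, then up to four TSS columns), the 1-based input coordinates and the `_1` to `_4` name suffixes are all assumptions made for this example.

```python
from typing import List


def row_to_bed(row: str) -> List[str]:
    """Expand one flat-file row into single-base BED lines.

    Assumed (hypothetical) layout: chrom, strand, gene, tss1..tss4,
    with empty cells where a gene has fewer than four TSSs.
    """
    chrom, strand, gene, *tss_cols = row.rstrip("\n").split("\t")
    bed_lines = []
    for i, value in enumerate(tss_cols[:4], start=1):
        value = value.strip()
        if not value:
            continue  # this TSS slot is empty
        pos = int(value)  # assumed 1-based coordinate
        # BED intervals are 0-based, half-open: [pos - 1, pos)
        bed_lines.append(
            "\t".join([chrom, str(pos - 1), str(pos), f"{gene}_{i}", "0", strand])
        )
    return bed_lines


# Made-up row with two of the four TSS slots filled.
bed = row_to_bed("chrI\t+\tgene-a\t100\t\t250\t")
```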

5. LiftOver ce10 to ce6 and generate an SGA file

The Kruesi et al. promoter list was lifted over from ce10 to ce6 using the liftOver tool from UCSC Genome Browser.
The resulting BED file was converted to SGA using ChIP-Convert.

6. Annotate kruesi13 SGA file with GRO-cap counts

The published promoter collection was annotated using the GRO-cap raw data. This step was done to get the total number of GRO-cap reads that mapped at the annotated TSSs.
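
The counting step can be illustrated with a short Python sketch that sums the tags falling within a window around each annotated TSS on the same strand. The window size is not given in this document, so the default of 10 bp is purely an assumption, as are the function names and the tuple layout of the input.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict
from itertools import accumulate


def build_index(tags):
    """Index tags by (seq_id, strand) for fast range sums.

    tags: iterable of (seq_id, strand, position, count) tuples.
    """
    grouped = defaultdict(list)
    for seq_id, strand, pos, count in tags:
        grouped[(seq_id, strand)].append((pos, count))
    index = {}
    for key, items in grouped.items():
        items.sort()
        positions = [p for p, _ in items]
        # cumulative[i] is the sum of counts of the first i tags
        cumulative = [0] + list(accumulate(c for _, c in items))
        index[key] = (positions, cumulative)
    return index


def count_tags(index, seq_id, strand, tss, window=10):
    """Sum tag counts in [tss - window, tss + window] on one strand.

    The default window of 10 bp is an assumption for this sketch.
    """
    positions, cumulative = index.get((seq_id, strand), ([], [0]))
    lo = bisect_left(positions, tss - window)
    hi = bisect_right(positions, tss + window)
    return cumulative[hi] - cumulative[lo]


# Made-up tags for illustration.
tags = [
    ("chrI", "+", 95, 3),
    ("chrI", "+", 100, 5),
    ("chrI", "+", 111, 7),
    ("chrI", "-", 100, 2),
]
index = build_index(tags)
```

Sorting once and using cumulative sums keeps each lookup logarithmic in the number of tags, which matters when hundreds of millions of tags are annotated against a few thousand TSSs.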

7. Select TSS with maximal GRO-cap

Promoters that belong to the same gene were merged if their distance was shorter than 100 bp. The site with the highest tag count was then selected as the EPD promoter.
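
A minimal Python sketch of this selection step follows. It assumes that merging works by chaining neighbouring sites of the same gene and strand (each site joins the current cluster if it lies less than 100 bp from the previous site) and that ties in tag count go to the site with the lowest position. Neither detail is stated in this document, and the function and tuple layout are hypothetical.

```python
from collections import defaultdict

MERGE_DISTANCE = 100  # bp, from the rule above


def select_promoters(sites, max_distance=MERGE_DISTANCE):
    """Pick the site with the highest tag count in each cluster.

    sites: iterable of (gene, seq_id, strand, position, tag_count).
    Clustering by chaining adjacent sites is an assumption.
    """
    by_gene = defaultdict(list)
    for gene, seq_id, strand, pos, count in sites:
        by_gene[(gene, seq_id, strand)].append((pos, count))

    selected = []
    for key, items in sorted(by_gene.items()):
        items.sort()
        cluster = [items[0]]
        for pos, count in items[1:]:
            if pos - cluster[-1][0] < max_distance:
                cluster.append((pos, count))
            else:
                # close the current cluster and start a new one
                selected.append(key + max(cluster, key=lambda s: s[1]))
                cluster = [(pos, count)]
        selected.append(key + max(cluster, key=lambda s: s[1]))
    return selected


# Made-up sites for illustration.
sites = [
    ("gene-a", "chrI", "+", 1000, 5),
    ("gene-a", "chrI", "+", 1050, 9),
    ("gene-a", "chrI", "+", 1300, 2),
    ("gene-b", "chrI", "-", 500, 4),
]
promoters = select_promoters(sites)
```

In this example the two `gene-a` sites at 1000 and 1050 are merged and the one at 1050 is kept, while the site at 1300 is far enough away to remain a separate promoter.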
Last update October 2019