TSS assembly pipeline for Ce_EPDnew_001
Introduction
This document provides a technical description of the transcription
start site assembly pipeline that was used to generate EPDnew
version 001 for C. elegans.
Source Data
Promoter collection:
Name | Genome Assembly | Promoters | Genes | PMID | Access data
UCSC Genes | May 2008 WS190/ce6 | 20531 | 11786 | 26590259 | SOURCE, DOC, DATA
Experimental data:
Name | Type | Samples | Tags | PMID | Access data
Kruesi et al., 2013 | GRO-cap | 9 | 236,210,104 | 23795297 | SOURCE, DOC, DATA
1. Download annotated TSS
Data was downloaded from the UCSC Table Browser.
Transcripts were filtered according to the following rules:
- Transcripts of protein coding genes only
- Transcript lies on a full chromosome
- Genes must be annotated [Associated Gene Name present]
- Gene and transcript status known
Gene names were taken from the field "Associated Gene Name". Since the
EPD format doesn't allow gene names longer than 18 characters,
we checked whether the names respected this limitation.
A total of 20531 promoters were selected.
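The name-length check described above could be sketched as follows. This is a minimal illustration, not the pipeline's actual script: the function name `split_by_name_length` and the example gene names are hypothetical, and only the 18-character limit comes from the text.

```python
MAX_NAME_LEN = 18  # EPD gene-name limit stated in the text


def split_by_name_length(names, max_len=MAX_NAME_LEN):
    """Split gene names into those that fit the limit and those that do not.

    Hypothetical helper: the original pipeline's implementation is not shown.
    """
    ok = [name for name in names if len(name) <= max_len]
    too_long = [name for name in names if len(name) > max_len]
    return ok, too_long


# Illustrative input only; these are not names taken from the collection.
ok, too_long = split_by_name_length(["unc-54", "a_very_long_hypothetical_name"])
```

Any name that ends up in `too_long` would need to be inspected before the promoter can be stored in EPD format.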
2. UCSC TSS collection
The UCSC TSS collection is stored as a tab-delimited text file
conforming to the SGA format under the name:
The six fields contain the following kinds of information:
- NCBI/RefSeq chromosome id
- "TSS"
- position
- strand ("+" or "-")
- "1"
- gene name / ID
Note that the second and fifth fields are invariant.
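As an illustration of the six-field layout listed above, a line of such a file could be parsed and written back as in the sketch below. The record type `SgaRecord`, the function names, and the sequence id `NC_000000.0` are hypothetical placeholders, not part of the EPD tools.

```python
from typing import NamedTuple


class SgaRecord(NamedTuple):
    chrom: str      # NCBI/RefSeq chromosome id
    feature: str    # "TSS" in this collection
    position: int
    strand: str     # "+" or "-"
    count: int      # "1" in this collection
    gene: str       # gene name / ID


def parse_sga_line(line: str) -> SgaRecord:
    """Parse one tab-delimited SGA line with exactly six fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 6:
        raise ValueError(f"expected 6 fields, got {len(fields)}")
    chrom, feature, position, strand, count, gene = fields
    if strand not in ("+", "-"):
        raise ValueError(f"unexpected strand: {strand!r}")
    return SgaRecord(chrom, feature, int(position), strand, int(count), gene)


def format_sga_line(record: SgaRecord) -> str:
    """Write a record back as a tab-delimited line (no trailing newline)."""
    return "\t".join(str(field) for field in record)


# Placeholder chromosome id and gene name, for illustration only.
record = parse_sga_line("NC_000000.0\tTSS\t12345\t+\t1\tgene-1\n")
```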
3. Import GRO-cap data
Data was imported from GEO in SRA format. Raw sequence files were
mapped to the ce6 genome using Bowtie. The resulting BAM files were
converted to SGA format using ChIP-Convert.
A step-by-step guide on how to import, map and convert these samples
can be found here.
4. Download annotated TSS file from eLife
The list of promoters published by Kruesi et al. was downloaded from
eLife. The XLS file was converted to a tab-delimited flat file using
OpenOffice and then to a BED file using in-house scripts. (Note that one
line in the input data file contains up to 4 TSS coordinates.)
5. LiftOver ce10 to ce6 and generate an SGA file
The Kruesi et al. promoter list was lifted over from ce10 to ce6 using
the liftOver tool from the UCSC Genome Browser. The resulting BED file
was converted to SGA using ChIP-Convert.
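To illustrate what a BED-to-SGA conversion involves, here is a minimal sketch. It is not ChIP-Convert's implementation. It assumes a six-column BED line (chrom, start, end, name, score, strand) with 0-based, half-open coordinates, and it assumes the TSS is the first base of the interval on the "+" strand and the last base on the "-" strand. The function name `bed_to_sga` is hypothetical, and the chromosome name is passed through unchanged rather than mapped to an NCBI/RefSeq id.

```python
def bed_to_sga(bed_line: str, feature: str = "TSS") -> str:
    """Convert one six-column BED line into a six-field SGA line.

    Assumption: BED is 0-based half-open, SGA positions are 1-based.
    """
    fields = bed_line.rstrip("\n").split("\t")
    if len(fields) < 6:
        raise ValueError("expected at least 6 BED columns")
    chrom, start, end, name, strand = (
        fields[0], int(fields[1]), int(fields[2]), fields[3], fields[5]
    )
    if strand == "+":
        position = start + 1   # first base of the interval, 1-based
    elif strand == "-":
        position = end         # last base of the interval, 1-based
    else:
        raise ValueError(f"unexpected strand: {strand!r}")
    return "\t".join([chrom, feature, str(position), strand, "1", name])


# Illustrative coordinates and names only.
plus_line = bed_to_sga("chrI\t99\t100\tg1\t0\t+")
minus_line = bed_to_sga("chrI\t150\t200\tg2\t0\t-")
```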
6. Annotate kruesi13 SGA file with GRO-cap counts
The published promoter collection was annotated using the GRO-cap raw
data. This step was done to get the total number of GRO-cap reads that
mapped at the annotated TSSs.
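The annotation step can be pictured as summing, for each annotated TSS, the GRO-cap tags on the same strand at that position. The sketch below is one possible way to do this and is not the tool used in the pipeline. The function name `annotate_counts`, the tuple layouts, and the `window` parameter (default 0, meaning the exact position only) are assumptions; the text does not state a window size.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict


def annotate_counts(tss_list, tags, window=0):
    """Attach a summed tag count to each TSS.

    tss_list: iterable of (chrom, strand, position, name)
    tags:     iterable of (chrom, strand, position, count)
    window:   hypothetical half-width in bp around the TSS
    """
    by_key = defaultdict(list)
    for chrom, strand, pos, count in tags:
        by_key[(chrom, strand)].append((pos, count))

    # For each chromosome/strand, keep sorted positions and prefix sums
    # so that each window sum is a difference of two prefix values.
    index = {}
    for key, items in by_key.items():
        items.sort()
        positions = [pos for pos, _ in items]
        prefix = [0]
        for _, count in items:
            prefix.append(prefix[-1] + count)
        index[key] = (positions, prefix)

    annotated = []
    for chrom, strand, pos, name in tss_list:
        positions, prefix = index.get((chrom, strand), ([], [0]))
        lo = bisect_left(positions, pos - window)
        hi = bisect_right(positions, pos + window)
        annotated.append((chrom, strand, pos, name, prefix[hi] - prefix[lo]))
    return annotated


# Illustrative data only.
example_tags = [("chrI", "+", 100, 5), ("chrI", "+", 101, 2),
                ("chrI", "-", 100, 7), ("chrI", "+", 150, 1)]
example_tss = [("chrI", "+", 100, "g1"), ("chrI", "-", 100, "g2"),
               ("chrII", "+", 5, "g3")]
exact = annotate_counts(example_tss, example_tags)
widened = annotate_counts(example_tss, example_tags, window=1)
```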
7. Select TSS with maximal GRO-cap
Promoters that belong to the same gene were merged if their distance
was shorter than 100 bp. The site with the highest tag count was then
selected as the EPD promoter.
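The merging and selection rule can be sketched as below. This is an illustration under stated assumptions, not the pipeline's code: sites are grouped by gene, chromosome and strand; a site joins the current cluster when it lies less than 100 bp from the previous site in sorted order (single-linkage clustering, which the text does not specify); and ties in tag count go to the lowest position. The function names and the tuple layout are hypothetical.

```python
from collections import defaultdict


def _best_site(gene, chrom, strand, cluster):
    """Return the site with the highest count; ties go to the lowest position."""
    pos, count = max(cluster, key=lambda site: site[1])
    return (gene, chrom, strand, pos, count)


def select_promoters(sites, max_dist=100):
    """Merge nearby sites of the same gene and keep the strongest per cluster.

    sites: iterable of (gene, chrom, strand, position, count)
    """
    groups = defaultdict(list)
    for gene, chrom, strand, pos, count in sites:
        groups[(gene, chrom, strand)].append((pos, count))

    selected = []
    for (gene, chrom, strand), items in groups.items():
        items.sort()
        cluster = [items[0]]
        for pos, count in items[1:]:
            if pos - cluster[-1][0] < max_dist:
                cluster.append((pos, count))      # close enough: same cluster
            else:
                selected.append(_best_site(gene, chrom, strand, cluster))
                cluster = [(pos, count)]          # start a new cluster
        selected.append(_best_site(gene, chrom, strand, cluster))
    return selected


# Illustrative data only.
example_sites = [("g1", "chrI", "+", 1000, 3), ("g1", "chrI", "+", 1050, 10),
                 ("g1", "chrI", "+", 1500, 4), ("g2", "chrI", "-", 200, 1)]
promoters = select_promoters(example_sites)
```

In this example the two g1 sites at 1000 and 1050 fall into one cluster and the site at 1050 is kept, while the site at 1500 is far enough away to remain a separate promoter.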