ENSEMBL52, TSS collection
Description
Transcription start sites (TSS) from the ENSEMBL52 database, downloaded from BioMart.
Source
Data have been downloaded from BioMart
Input file format: Tab-delimited TXT
Samples
From H. sapiens (March 2006 NCBI36/hg18).
Technical Notes
The following attributes have been selected:
- Ensembl Transcript ID
- Chromosome Name
- Strand
- Transcript Start (bp)
- Transcript End (bp)
- Gene Start (bp)
- Gene End (bp)
- Status (transcript)
- Status (gene)
- Associated Gene Name
Then, transcripts have been filtered according to the following rules:
- Transcript length > 0 [Transcript Start different from Transcript End]
- Transcript lies on a full chromosome [numbered chromosomes, X or Y]
- Gene must have a 5' UTR [Transcript Start different from Gene Start; on the reverse strand, Transcript End different from Gene End]
- Gene must be annotated [Associated Gene Name present]
- Gene and transcript status must both be known
This can be achieved using the following awk command:
awk -F '\t' '
# Forward strand: the TSS is the transcript start (column 4).
$2 ~ "^[0-9][0-9]?|^[XY]" && $3 == "1" && $4 != $5 && $4 != $6 && $10 != "" && $8 == "KNOWN" && $9 == "KNOWN" {print "chr" $2 "\tTSS\t" $4 "\t+\t" 1 "\t" $10}
# Reverse strand: the TSS is the transcript end (column 5).
$2 ~ "^[0-9][0-9]?|^[XY]" && $3 == "-1" && $4 != $5 && $5 != $7 && $10 != "" && $8 == "KNOWN" && $9 == "KNOWN" {print "chr" $2 "\tTSS\t" $5 "\t-\t" 1 "\t" $10}
' biomart_output.txt | sort -s -k1,1 -k3,3n -k4,4 | compact_sga.pl > ENSEMBL.sga
The SGA file can then be transformed into an FPS file using sga2fps.pl.
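The filtering rules can be checked on a few made-up records before running the full pipeline. The sketch below is illustrative only: the transcript IDs, coordinates, gene names and the file names example_biomart.txt and example.sga are invented, the status value is assumed to be spelled KNOWN in the BioMart export, and the "chr" prefix is assumed to be wanted on both strands. The compact_sga.pl step is left out so that the example runs with standard tools only.

```shell
# Four hypothetical records, tab-delimited, 10 columns in the attribute order listed above.
printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
  ENST_A 1  1 1500 9000 1000 9000 KNOWN KNOWN GENE1 \
  ENST_B X -1 2000 7000 2000 8000 KNOWN KNOWN GENE2 \
  ENST_C 1  1 1000 9000 1000 9000 KNOWN KNOWN GENE1 \
  ENST_D 2  1 3000 4000 2500 4000 NOVEL KNOWN GENE3 \
  > example_biomart.txt

awk -F '\t' '
# Forward strand: report the transcript start as the TSS.
$2 ~ "^[0-9][0-9]?|^[XY]" && $3 == "1" && $4 != $5 && $4 != $6 && $10 != "" && $8 == "KNOWN" && $9 == "KNOWN" {print "chr" $2 "\tTSS\t" $4 "\t+\t" 1 "\t" $10}
# Reverse strand: report the transcript end as the TSS.
$2 ~ "^[0-9][0-9]?|^[XY]" && $3 == "-1" && $4 != $5 && $5 != $7 && $10 != "" && $8 == "KNOWN" && $9 == "KNOWN" {print "chr" $2 "\tTSS\t" $5 "\t-\t" 1 "\t" $10}
' example_biomart.txt | LC_ALL=C sort -s -k1,1 -k3,3n -k4,4 > example.sga

cat example.sga
```

With these records, ENST_A and ENST_B pass the filter. ENST_C is dropped because its transcript start equals the gene start, and ENST_D is dropped because its transcript status is not KNOWN.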
References
- Haider S, Ballester B, Smedley D, Zhang J, Rice P, Kasprzyk A.
BioMart Central Portal--unified access to biological data.
Nucleic Acids Res. 2009; 37:W23-7. PMID: 19420058