ENSEMBL52, TSS collection

Description

Transcription start sites (TSS) from the ENSEMBL52 database, downloaded from BioMart.

Source

Data were downloaded from BioMart.
Input file format: Tab-delimited TXT

Samples

From H. sapiens (March 2006 NCBI36/hg18).

   Filename           Description         Feature  GEO-ID
1  Hum_ENSEMBL52.sga  TSS from ENSEMBL52  TSS      -

Technical Notes

The following attributes have been selected:

  1. Ensembl Transcript ID
  2. Chromosome Name
  3. Strand
  4. Transcript Start (bp)
  5. Transcript End (bp)
  6. Gene Start (bp)
  7. Gene End (bp)
  8. Status (transcript)
  9. Status (gene)
  10. Associated Gene Name
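
Because the filtering command further down addresses these attributes by column position ($2 to $10), it can help to confirm that every row of the export really has ten tab-separated fields. The following is a minimal sketch, not part of the original procedure; bad_rows is a hypothetical helper name, and the input is assumed to be a plain tab-delimited file such as biomart_output.txt.

```shell
# Hypothetical helper: list every row whose field count differs from the
# 10 attributes selected above. Prints nothing when the file is consistent.
bad_rows() {
  awk -F '\t' 'NF != 10 {print NR ": " NF " fields"}' "$1"
}

# Usage (assumed file name): bad_rows biomart_output.txt
```

A row with an empty Associated Gene Name still counts as ten fields, so it is not reported here; such rows are dropped later by the filter itself.
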
Then, transcripts have been filtered according to the following rules:
  1. Transcript length > 0 [Transcript Start different from Transcript End]
  2. Transcript lies on full chromosomes
  3. Gene must have a 5' UTR [Transcript Start different from Gene Start; on the minus strand, Transcript End different from Gene End]
  4. Genes must be annotated [Associated Gene Name present]
  5. Gene and transcript status must both be known
This can be achieved using the following awk command:

awk -F '\t' '
# Plus strand: the TSS is the transcript start ($4)
$2 ~ "^[0-9][0-9]?|^[XY]" && $3 == "1" && $4 != $5 && $4 != $6 && $10 != "" && $8 == "KNOWN" && $9 == "KNOWN" {print "chr" $2 "\tTSS\t" $4 "\t+\t" 1 "\t" $10}
# Minus strand: the TSS is the transcript end ($5)
$2 ~ "^[0-9][0-9]?|^[XY]" && $3 == "-1" && $4 != $5 && $5 != $7 && $10 != "" && $8 == "KNOWN" && $9 == "KNOWN" {print "chr" $2 "\tTSS\t" $5 "\t-\t" 1 "\t" $10}
' biomart_output.txt | sort -s -k1,1 -k3,3n -k4,4 | compact_sga.pl > ENSEMBL.sga
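
To see what the filter keeps and drops, the sketch below runs the same rules on a small made-up input. It is an illustration under stated assumptions, not the original pipeline: all transcript IDs, coordinates and gene names are invented, the status columns are assumed to hold the value KNOWN, the "chr" prefix is written for both strands, and compact_sga.pl is left out so that only awk and sort are needed.

```shell
# Hypothetical four-row sample in the 10-column layout listed above;
# all IDs, coordinates and gene names are invented for illustration.
sample() {
  printf 'ENST_A\t1\t1\t1000\t5000\t900\t5000\tKNOWN\tKNOWN\tGENE_A\n'
  printf 'ENST_B\tX\t-1\t2000\t8000\t2000\t9000\tKNOWN\tKNOWN\tGENE_B\n'
  printf 'ENST_C\t1\t1\t300\t700\t300\t700\tKNOWN\tKNOWN\tGENE_C\n'
  printf 'ENST_D\t2\t1\t100\t400\t50\t400\tNOVEL\tKNOWN\tGENE_D\n'
}

# Apply the filtering rules and the sort step (chromosome, position, strand).
sga=$(sample | awk -F '\t' '
$2 ~ "^[0-9][0-9]?|^[XY]" && $3 == "1" && $4 != $5 && $4 != $6 && $10 != "" && $8 == "KNOWN" && $9 == "KNOWN" {print "chr" $2 "\tTSS\t" $4 "\t+\t" 1 "\t" $10}
$2 ~ "^[0-9][0-9]?|^[XY]" && $3 == "-1" && $4 != $5 && $5 != $7 && $10 != "" && $8 == "KNOWN" && $9 == "KNOWN" {print "chr" $2 "\tTSS\t" $5 "\t-\t" 1 "\t" $10}
' | LC_ALL=C sort -s -k1,1 -k3,3n -k4,4)

printf '%s\n' "$sga"
```

Two rows survive: GENE_A on chr1 at position 1000 (plus strand, transcript start) and GENE_B on chrX at position 8000 (minus strand, transcript end). ENST_C is dropped because its transcript start equals the gene start, and ENST_D because its transcript status is not KNOWN.
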

The SGA file can then be transformed into an FPS file using sga2fps.pl.

References

  1. Haider S, Ballester B, Smedley D, Zhang J, Rice P, Kasprzyk A.
    BioMart Central Portal--unified access to biological data. Nucleic Acids Res. 2009;37:W23-7. PMID: 19420058