From C. elegans (May 2008 WS190/ce6).
Data were downloaded from BioMart with the following
attributes selected:
- Ensembl Transcript ID
- Chromosome Name
- Strand
- Transcript Start (bp)
- Transcript End (bp)
- Gene Start (bp)
- Gene End (bp)
- Status (transcript)
- Status (gene)
- Associated Gene Name
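
Before filtering, it can help to confirm that every line of the export
has ten tab-separated fields in the attribute order listed above. The
sketch below is only an illustration: the file name sample_biomart.txt
and its two records are made up, and the second record is deliberately
truncated so that the check has something to report. If the export
contains a header line, it would need to be skipped first.

```shell
# Hypothetical two-line sample standing in for the real BioMart export.
printf 'T1.1\tI\t1\t100\t500\t90\t500\tKNOWN\tKNOWN\tgene-a\n'  > sample_biomart.txt
# Truncated record: the Associated Gene Name column is missing entirely.
printf 'T2.1\tX\t-1\t200\t900\t200\t950\tKNOWN\tKNOWN\n'       >> sample_biomart.txt

# Count the lines whose number of tab-separated fields is not 10.
bad_lines=$(awk -F '\t' 'NF != 10 {n++} END {print n + 0}' sample_biomart.txt)
echo "lines with unexpected column count: $bad_lines"
```

A count other than zero on the real file suggests that an attribute was
missed or added when configuring the download.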
Then, transcripts were filtered according to the following
rules:
- Transcript length > 0 [Transcript Start different from
Transcript End]
- Transcript lies on full chromosomes
- Gene must have a 5' UTR [Transcript Start different
from Gene Start; on the minus strand, Transcript End
different from Gene End]
- Genes must be annotated [Associated Gene Name
present]
- Gene and transcript status known
This can be achieved using the following awk command:
awk -F \\t '
# Chromosome names I-V and X are assumed here (Ensembl naming for C. elegans).
# Plus strand: the TSS is the Transcript Start.
$2 ~ "^(I|II|III|IV|V|X)$" && $3 == "1" && $4 != $5 && $4 != $6 && $10 != "" && $8 == "KNOWN" && $9 == "KNOWN" {print "chr" $2 "\tTSS\t" $4 "\t+\t" 1 "\t" $10}
# Minus strand: the TSS is the Transcript End.
$2 ~ "^(I|II|III|IV|V|X)$" && $3 == "-1" && $4 != $5 && $5 != $7 && $10 != "" && $8 == "KNOWN" && $9 == "KNOWN" {print "chr" $2 "\tTSS\t" $5 "\t-\t" 1 "\t" $10}
' biomart_output.txt | sort -s -k1,1 -k3,3n -k4,4 | compact_sga.pl > ENSEMBL.sga
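
The filter can be tried on a few made-up records without the downstream
Perl tools. The sketch below is an illustration only, not the pipeline
itself: the file mock_biomart.txt and all transcript and gene names are
invented, chromosomes are assumed to be named I to V and X, the status
value is assumed to be KNOWN, and compact_sga.pl is left out because it
is not a standard tool.

```shell
# Made-up records in the attribute order listed above (10 tab-separated columns).
{
  printf 'T1.1\tII\t1\t1000\t2000\t900\t2000\tKNOWN\tKNOWN\tabc-1\n'    # kept, plus strand
  printf 'T2.1\tX\t-1\t5000\t6000\t5000\t6200\tKNOWN\tKNOWN\txyz-2\n'  # kept, minus strand
  printf 'T3.1\tI\t1\t300\t800\t300\t800\tKNOWN\tKNOWN\tdef-3\n'       # dropped: start equals gene start
  printf 'T4.1\tMtDNA\t1\t10\t90\t5\t90\tKNOWN\tKNOWN\tmt-4\n'         # dropped: not a full chromosome
} > mock_biomart.txt

# Apply the filtering rules and emit one TSS line per retained transcript,
# then sort by sequence name, position and strand as in the full pipeline.
tss=$(awk -F '\t' '
$2 ~ /^(I|II|III|IV|V|X)$/ && $4 != $5 && $10 != "" && $8 == "KNOWN" && $9 == "KNOWN" {
    if ($3 == "1"  && $4 != $6) print "chr" $2 "\tTSS\t" $4 "\t+\t1\t" $10
    if ($3 == "-1" && $5 != $7) print "chr" $2 "\tTSS\t" $5 "\t-\t1\t" $10
}' mock_biomart.txt | sort -s -k1,1 -k3,3n -k4,4)
printf '%s\n' "$tss"
```

Only the first two records survive. The plus-strand transcript reports
its Transcript Start (1000) as the TSS, while the minus-strand
transcript reports its Transcript End (6000).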
The SGA file can then be transformed into an FPS file using
sga2fps.pl