From D. melanogaster (Aug 2014 BDGP Rel6 + ISO1 MT/dm6).
The following attributes have been selected:
- Ensembl Transcript ID
- Chromosome Name
- Strand
- Transcript Start (bp)
- Transcript End (bp)
- Gene Start (bp)
- Gene End (bp)
- Status (transcript)
- Status (gene)
- Associated Gene Name
The transcripts were then filtered according to the following rules:
- Transcript length > 0 [Transcript Start different from
Transcript End]
- Transcript lies on full chromosomes
- Gene must have a 5' UTR [Transcript Start different from
Gene Start; on the minus strand, Transcript End different
from Gene End]
- Genes must be annotated [Associated Gene Name present]
- Gene and transcript status are both known
This can be achieved using the following awk command:
awk -F \\t '
# Plus strand: the TSS is the transcript start.
$2 ~ "^[0-9][0-9]?|^[XY]" && $3 == "1" && $4 != $5 && $4 != $6 &&
$10 != "" && $8 == "KNOWN" && $9 == "KNOWN" {
print "chr" $2 "\tTSS\t" $4 "\t+\t" 1 "\t" $10}
# Minus strand: the TSS is the transcript end.
$2 ~ "^[0-9][0-9]?|^[XY]" && $3 == "-1" && $4 != $5 && $5 != $7 &&
$10 != "" && $8 == "KNOWN" && $9 == "KNOWN" {
print "chr" $2 "\tTSS\t" $5 "\t-\t" 1 "\t" $10}
' biomart_output.txt |
sort -s -k1,1 -k3,3n -k4,4 | compact_sga.pl > ENSEMBL.sga
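
The filtering step can be tried on a few made-up rows before running it on a full export. The sketch below is illustrative only: the function names filter_tss and sample, the transcript IDs, coordinates and gene names are all hypothetical, the status value is assumed to be spelled KNOWN, and the sort and compact_sga.pl steps are left out.

```shell
# filter_tss (hypothetical name): reads tab-separated BioMart rows on stdin
# and prints one SGA-like line per transcript that passes the filters.
filter_tss() {
  awk -F '\t' '
    # Plus strand: report the transcript start as the TSS.
    $2 ~ "^[0-9][0-9]?|^[XY]" && $3 == "1" && $4 != $5 && $4 != $6 &&
    $10 != "" && $8 == "KNOWN" && $9 == "KNOWN" {
      print "chr" $2 "\tTSS\t" $4 "\t+\t" 1 "\t" $10
    }
    # Minus strand: report the transcript end as the TSS.
    $2 ~ "^[0-9][0-9]?|^[XY]" && $3 == "-1" && $4 != $5 && $5 != $7 &&
    $10 != "" && $8 == "KNOWN" && $9 == "KNOWN" {
      print "chr" $2 "\tTSS\t" $5 "\t-\t" 1 "\t" $10
    }'
}

# sample (hypothetical name): four invented rows in the attribute order
# listed above (ID, chromosome, strand, transcript start/end, gene start/end,
# transcript status, gene status, gene name).
sample() {
  printf 'FBtr0000001\t2L\t1\t1000\t2000\t900\t2500\tKNOWN\tKNOWN\tgeneA\n'
  printf 'FBtr0000002\t3R\t-1\t5000\t6000\t4800\t6500\tKNOWN\tKNOWN\tgeneB\n'
  # Dropped: transcript start equals gene start.
  printf 'FBtr0000003\tX\t1\t300\t800\t300\t900\tKNOWN\tKNOWN\tgeneC\n'
  # Dropped: gene name is empty.
  printf 'FBtr0000004\t2R\t1\t100\t400\t50\t700\tKNOWN\tKNOWN\t\n'
}

sample | filter_tss
```

With these rows, only the geneA line (chr2L, position 1000, strand +) and the geneB line (chr3R, position 6000, strand -) are printed; the other two fail the 5' UTR and gene-name checks respectively.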