ENSEMBL61, 5' end collection.

Data was downloaded from BioMart selecting the following attributes:

Ensembl Transcript ID
Chromosome Name
Strand
Transcript Start (bp)
Transcript End (bp)
Gene Start (bp)
Gene End (bp)
Status (transcript)
Status (gene)
Associated Gene Name

The transcripts were then filtered according to the following rules:

Transcript length > 0 [Transcript Start different from Transcript End]
Transcript lies on full chromosomes
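Gene must have a 5' UTR [Transcript Start different from Gene Start; on the reverse strand, Transcript End different from Gene End]
Gene must be annotated [Associated Gene Name present]
Gene and transcript status must both be known [Status (gene) and Status (transcript) equal to KNOWN]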

This can be achieved using the following awk command:

awk -F \\t '
# Forward strand: the 5-prime end is Transcript Start ($4)
$2 ~ "^[0-9][0-9]?|^[XY]" && $3 == "1" && $4 != $5 && $4 != $6 && $10 != "" && $8 == "KNOWN" && $9 == "KNOWN" {print "chr" $2 "\tTSS\t" $4 "\t+\t" 1 "\t" $10}
# Reverse strand: the 5-prime end is Transcript End ($5)
$2 ~ "^[0-9][0-9]?|^[XY]" && $3 == "-1" && $4 != $5 && $5 != $7 && $10 != "" && $8 == "KNOWN" && $9 == "KNOWN" {print "chr" $2 "\tTSS\t" $5 "\t-\t" 1 "\t" $10}
' biomart_output.txt | sort -s -k1,1 -k3,3n -k4,4 | compact_sga.pl > ENSEMBL.sga
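
The strand-dependent choice of the 5' end in the command above can be checked on a toy input. What follows is only a minimal sketch and not part of the original pipeline: the file names toy_biomart.txt and toy_tss.txt, the transcript IDs, the coordinates and the gene names are all invented for illustration. It assumes only the column order of the attribute list above, and it leaves out the filtering conditions, the sort and compact_sga.pl.

```shell
# Build a two-line toy table in the BioMart column order
# (all IDs, coordinates and gene names below are made up).
printf 'T_PLUS\t1\t1\t1000\t2000\t900\t2100\tKNOWN\tKNOWN\tGENEA\n'  > toy_biomart.txt
printf 'T_MINUS\tX\t-1\t5000\t6000\t4900\t6100\tKNOWN\tKNOWN\tGENEB\n' >> toy_biomart.txt

# Forward strand (Strand = 1):  position is Transcript Start ($4).
# Reverse strand (Strand = -1): position is Transcript End ($5).
awk -F '\t' '
$3 == "1"  {print "chr" $2 "\tTSS\t" $4 "\t+\t" 1 "\t" $10}
$3 == "-1" {print "chr" $2 "\tTSS\t" $5 "\t-\t" 1 "\t" $10}
' toy_biomart.txt > toy_tss.txt

cat toy_tss.txt
```

The forward-strand transcript should be reported at position 1000 and the reverse-strand transcript at position 6000, each with a count of 1 and its gene name in the last column.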

The SGA file can then be converted into an FPS file using sga2fps.pl.

	Filename	Description	Feature	GEO-ID
1	Cen_ENSEMBL61.sga	5p-end from ENSEMBL61	5END	-
