Cumbie 2015, Nano-CAGE data of Arabidopsis roots.

Description

A new CAGE technique (NanoCAGE-XL) has been developed that promises the identification of high-confidence transcription start sites. This is a proof-of-concept dataset.

Source

Raw data was downloaded from: PRJNA270670
Input file format: SRA

Samples

From A. thaliana (Feb 2011 TAIR10/araTha1).

Transcription Profiling data:

	Filename	Description	Feature	GEO-ID
1	SRX815832.sga	9 day-old roots\|CAGE\|\|exp1	CAGE	-
2	SRX1097403.sga	9 day-old roots\|CAGE\|\|exp2	CAGE	-
3	SRX1097494.sga	9 day-old roots\|CAGE\|\|exp3	CAGE	-

Technical Notes

Following the guidelines given in the publication, each experiment was processed differently:

SRX815832 (experiment 1): reads were 101 bp long and contained a run of three Gs at the 5' end, introduced by the enzymatic reaction during library preparation. We trimmed them and noticed that additional Gs often followed. An analysis of the Inr motif and of the read distribution around EPDnew promoters confirmed that the additional Gs were mostly artefacts, so we removed them as well. Reads were further trimmed at the 3' end, resulting in 50 bp sequences (similar in length to the other samples).
SRX1097403 (experiment 2): read length 51 bp. The manuscript describes the presence in this library of 3 different barcodes and 3 linkers (total length of 16 bp). We could not easily find them (even allowing 2 mismatches) and instead trimmed the first 16 bp from each read. This simple procedure gave good results in terms of motif distribution and read distribution around EPDnew promoters.
SRX1097494 (experiment 3): read length 51 bp. Reads contained 6 different barcodes (total length of 9 bp) at the 5' end, which could be identified (2 mismatches allowed) using an in-house Perl script. After trimming, the mapped read positions were shifted 1 bp upstream of the expected position (Inr motif), so the reads were trimmed by one additional base.
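
The three trimming strategies described above can be sketched as follows. This is an illustration only, not the in-house Perl script: the function names are hypothetical, the barcode sequences must be supplied by the user (none are given here), and the sketch assumes that each barcode is 9 bp long and that unmatched reads in experiment 3 are discarded.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def trim_exp1(seq, qual, final_len=50):
    """Experiment 1 style: drop every leading G, then cut at the 3' end."""
    n_g = len(seq) - len(seq.lstrip("G"))
    seq, qual = seq[n_g:], qual[n_g:]
    return seq[:final_len], qual[:final_len]


def trim_exp2(seq, qual, n=16):
    """Experiment 2 style: blindly remove the first n bases."""
    return seq[n:], qual[n:]


def trim_exp3(seq, qual, barcodes, max_mm=2, extra=1):
    """Experiment 3 style: find a 5' barcode with at most max_mm mismatches,
    then remove the barcode plus `extra` additional base(s).
    Returns None when no barcode matches (assumption: such reads are dropped)."""
    for bc in barcodes:
        k = len(bc)
        if len(seq) >= k and hamming(seq[:k], bc) <= max_mm:
            cut = k + extra
            return seq[cut:], qual[cut:]
    return None
```

Each function takes the sequence and quality strings of one FASTQ record and returns the trimmed pair, so it can be applied record by record while reading the output of fastq-dump.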

FASTQ files were extracted from SRA files using fastq-dump (SRA toolkit v2.5.0). After trimming, reads were mapped to the TAIR10 genome using Bowtie v0.12.8. SAM files were then converted to BAM using samtools v0.1.14 and to BED using bamToBed v2.12.0 (bedtools). SGA conversion was carried out using bed2sga.pl (ChIP-Seq v. 1.5.3).
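
To illustrate the last step, the sketch below converts six-column BED lines (as written by bamToBed) into SGA-style lines with five tab-separated fields: sequence name, feature, position, strand and count. It is not bed2sga.pl and may differ from it in detail. The function name is hypothetical, and the sketch assumes that the 5' end of each read is the position of interest (start + 1 on the plus strand, end on the minus strand, since BED starts are 0-based), that reads at the same position are summed into one line, and that chromosome names are passed through unchanged.

```python
from collections import Counter


def bed_to_sga(bed_lines, feature="CAGE"):
    """Collapse BED reads to 1-based 5'-end positions with read counts."""
    counts = Counter()
    for line in bed_lines:
        # Skip blank lines and optional BED header lines.
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        fields = line.rstrip("\n").split("\t")
        chrom, start, end, strand = fields[0], int(fields[1]), int(fields[2]), fields[5]
        # BED is 0-based, half-open: the 5' end is start+1 (+) or end (-).
        pos = start + 1 if strand == "+" else end
        counts[(chrom, pos, strand)] += 1
    # Sort by sequence name, position and strand.
    return [
        f"{chrom}\t{feature}\t{pos}\t{strand}\t{n}"
        for (chrom, pos, strand), n in sorted(counts.items())
    ]
```

For example, two plus-strand reads starting at BED coordinate 99 on the same chromosome would yield a single line with position 100 and count 2.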

References

Cumbie JS, Ivanchenko MG, Megraw M
NanoCAGE-XL and CapFilter: an approach to genome wide identification of high confidence transcription start sites. BMC Genomics. 2015 Aug 13;16:597. doi: 10.1186/s12864-015-1670-6. 26268438

Last update: 1 Oct 2018