ChIP-Seq Tutorial 2

Part B: Reformatting and uploading your own data
(Philipp Bucher)

This tutorial explains how to use the ChIP-Seq server at:

An article describing the server and the back-end tools can be found at:

/chipseq/doc/Ambrosini_chipseq_2011.pdf

Links to additional documentation, including other hands-on tutorials, can be found on the home page indicated above. Note that some of the analysis steps described in this tutorial rely on programs from the Signal Search Analysis (SSA) server at:

/ssa/

This part of the tutorial explains how to prepare your data for upload to the ChIP-Seq server. To understand the reformatting procedure, you need to know something about the internal data storage format, which is called SGA (Simple Genome Annotation). SGA is a compact tab-delimited text format with five obligatory fields containing the following information:

chromosome name (e.g. chr1, NC_000067.5)
feature name (an experiment identifier)
position (5' end of mapped sequence tag, peak center position)
strand +, −, or 0
count (# of tags mapping to the same position)
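As an illustration of the five fields listed above, the following minimal Python sketch reads one SGA line into a named record. The names SgaRecord and parse_sga_line, the feature name ES_Nanog, and the sample values are hypothetical and chosen only for this example; the sketch also assumes that the strand is written with the plain ASCII characters +, - and 0.

```python
from typing import NamedTuple


class SgaRecord(NamedTuple):
    """One SGA line: the five obligatory fields, in file order."""
    chrom: str      # chromosome name, e.g. chr1 or NC_000067.5
    feature: str    # feature name (experiment identifier)
    position: int   # 5' end of the mapped tag, or peak center
    strand: str     # "+", "-" or "0" (unoriented)
    count: int      # number of tags mapping to this position


def parse_sga_line(line: str) -> SgaRecord:
    """Split a tab-delimited SGA line and convert the numeric fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 5:
        raise ValueError(f"expected at least 5 tab-separated fields, got {len(fields)}")
    chrom, feature, position, strand, count = fields[:5]
    if strand not in {"+", "-", "0"}:
        raise ValueError(f"unexpected strand value: {strand!r}")
    return SgaRecord(chrom, feature, int(position), strand, int(count))


# Hypothetical example line (values invented for illustration).
record = parse_sga_line("chr1\tES_Nanog\t3053033\t+\t2")
print(record.chrom, record.position, record.count)
```

Any fields beyond the first five are ignored by this sketch.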

Chromosomes are internally identified by RefSeq chromosome accession numbers. This prevents confusion between different assemblies of the same genome. For uploaded data, chromosome names used by the UCSC genome browser are also acceptable, but in this case the corresponding genome assembly needs to be specified on input.

The "feature name" identifies the experiment. This makes it possible to merge data from different experiments into one SGA file. Fields 3-5 are more or less self-explanatory. Note, however, that genomic features in SGA format may be assigned orientation zero, which means "unoriented". This is appropriate for features that have no defined orientation, such as peaks derived from ChIP-Seq data.

Very importantly, SGA files need to be sorted by chromosome name, position and strand, in this order of priority. This enables fast processing by the ChIP-Seq programs at the back-end of the server. For uploaded data, sorting can be delegated to the server. For details, see the ChIP-Seq Technical Document.
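If you prefer to sort a file yourself before upload, the ordering described above can be sketched in a few lines of Python. This is only an illustration: the function name sort_sga_lines is hypothetical, and the sketch assumes that chromosome names are compared as plain strings and that strands are ordered by character code ("+" before "-" before "0"). The server's own sorting may differ in these details.

```python
def sort_sga_lines(lines):
    """Return SGA lines sorted by chromosome name, position, then strand.

    Assumptions: chromosome names compare as plain strings, positions as
    integers, and strands by character code ("+" < "-" < "0").
    """
    def sort_key(line):
        fields = line.rstrip("\n").split("\t")
        return (fields[0], int(fields[2]), fields[3])

    return sorted(lines, key=sort_key)


# Hypothetical unsorted input (values invented for illustration).
unsorted_lines = [
    "chr2\tES_Nanog\t100\t+\t1",
    "chr1\tES_Nanog\t500\t-\t1",
    "chr1\tES_Nanog\t500\t+\t3",
    "chr1\tES_Nanog\t20\t0\t1",
]
for line in sort_sga_lines(unsorted_lines):
    print(line)
```

Positions are compared as integers rather than as text, so that position 20 sorts before position 500.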

If you want to analyze your own data, you first have to convert them into SGA format. The server offers a utility for automatic conversion of the widely used formats BED and BAM into SGA. Note further that the ChIP-Seq server offers a large collection of public ChIP-Seq data for the human and mouse genomes. All current data are mapped to the genome assemblies NCBI36/hg18 and NCBI37/mm9. You can carry out some basic tasks with data mapped to another genome assembly, but the most powerful features will not work. We strongly recommend that users who have human or mouse data mapped to another assembly remap their data to hg18 or mm9 using the "liftOver" utility provided by UCSC (see below).

Step-by-step procedure: We illustrate the procedure with a file containing mapped sequence reads from a ChIP-Seq experiment in BED format that can be downloaded from GEO:

https://ftp.ncbi.nlm.nih.gov/pub/geo/DATA/supplementary/samples/GSM288nnn/GSM288345/GSM288345_Nanog.bed.gz

Download the file. To save time, you may also copy over the file from here:

./GSM288345_Nanog.bed.gz

http://ccg.vital-it.ch/var/inserm212/GSM288345_Nanog.bed

The file contains data from an experiment carried out to map Nanog binding sites in mouse embryonic stem cells (see Chen et al. 2008). The genomic coordinates relate to genome assembly NCBI36/mm8 (see the note in the corresponding GEO entry GSM288345). The tags thus need to be remapped to mm9.

Uncompress the BED file (on a UNIX machine: gunzip ./GSM288345_Nanog.bed.gz). Then go to the UCSC genome browser LiftOver page at

http://genome.ucsc.edu/cgi-bin/hgLiftOver

Select:

   Original genome/assembly = Mouse/Feb. 2006 (NCBI36/mm8)
   New genome/assembly = Mouse/July 2007 (NCBI37/mm9)

Then upload the file GSM288345_Nanog.bed, submit, and save the output file under the name GSM288345_Nanog_mm9.bed. (The output file can be accessed via the hyperlink "View Conversions".) Compress this file with gzip or zip. This job may take some time because the UCSC liftOver page does not accept compressed files, so the full uncompressed file has to be uploaded. For regular use, it is recommended to download and run the liftOver program on a local computer. You can find the remapped and gzipped BED file here:

./GSM288345_Nanog_mm9.bed.gz

http://ccg.vital-it.ch/var/inserm212/GSM288345_Nanog_mm9.bed.gz
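For local use, the uncompress, remap and compress steps described above could be scripted roughly as follows. This is a sketch under several assumptions: the liftOver program from UCSC is installed and on the PATH, a chain file for mm8 to mm9 has been downloaded from UCSC (the file name mm8ToMm9.over.chain.gz used in the usage note is an assumption), and the function names are hypothetical. The command line follows the argument order printed by the program's usage message (oldFile map.chain newFile unMapped).

```python
import gzip
import shutil
import subprocess
from pathlib import Path


def gunzip(path):
    """Write an uncompressed copy of a .gz file next to it and return its path."""
    src = Path(path)
    if src.suffix != ".gz":
        raise ValueError(f"expected a .gz file, got {src.name}")
    dst = src.with_suffix("")  # e.g. reads.bed.gz -> reads.bed
    with gzip.open(src, "rb") as fin, open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    return dst


def gzip_file(path):
    """Write a gzip-compressed copy of a file next to it and return its path."""
    src = Path(path)
    dst = src.with_name(src.name + ".gz")
    with open(src, "rb") as fin, gzip.open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    return dst


def liftover_command(bed_in, chain, bed_out, unmapped):
    """Build the argument list: liftOver oldFile map.chain newFile unMapped."""
    return ["liftOver", str(bed_in), str(chain), str(bed_out), str(unmapped)]


def remap_bed(bed_gz, chain, out_bed):
    """Uncompress a BED file, remap it with a local liftOver, and gzip the result."""
    bed = gunzip(bed_gz)
    unmapped = Path(str(out_bed) + ".unmapped")  # records that could not be remapped
    subprocess.run(liftover_command(bed, chain, out_bed, unmapped), check=True)
    return gzip_file(out_bed)
```

A call such as remap_bed("GSM288345_Nanog.bed.gz", "mm8ToMm9.over.chain.gz", "GSM288345_Nanog_mm9.bed") would then produce GSM288345_Nanog_mm9.bed.gz, provided that liftOver and the chain file are available locally. Unlike the gunzip command, the helper functions keep the original files.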

In order to convert this file into SGA format, go to the ChIP-Seq server and choose the ChIP-Converter page from the main menu. The direct link is:

/chipseq/format_convert.php

Under "Select Conversion", select "BED-to-SGA". (New menu items will appear.) Then upload the file GSM288345_Nanog_mm9.bed.gz. (New menu items will appear.) At the bottom of the form, activate the checkbox "Genome" and select "M. musculus (July 2007 NCBI37/mm9)". Leave all other text fields and checkboxes blank and press the "Run" button. Save the output under the name ES_Nanog.sga.gz. This file can be found here:

./ES_Nanog.sga.gz

http://ccg.vital-it.ch/var/inserm212/ES_Nanog.sga.gz

The GEO entry GSM288345 also provides a peak list in a tab-delimited text file. Download this file from GEO. You may use the direct link below for this purpose.

https://ftp.ncbi.nlm.nih.gov/pub/geo/DATA/supplementary/samples/GSM288nnn/GSM288345/GSM288345_ES_Nanog.txt.gz

Uncompress the file (gunzip on a UNIX machine). Then upload it to the UCSC genome browser LiftOver page and remap the data to mm9. The liftOver page accepts the input but returns the remapped genome coordinates in a format with lines of the following type:

   chr1:3053032-3053034

In order to convert this format into valid BED, replace all ":" and "-" with tabs using a text editor of your choice and save the output under the name ES_Nanog_peaks.bed. The file resulting from this procedure can be found here:

./ES_Nanog_peaks.bed
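Instead of a text editor, the replacement of ":" and "-" by tabs can also be scripted. The following Python sketch is one possible way to do it; the function name region_to_bed is hypothetical, and the pattern assumes that every line has exactly the form name:start-end shown above.

```python
import re

# Matches lines such as "chr1:3053032-3053034" (surrounding whitespace is stripped first).
REGION = re.compile(r"^(\S+):(\d+)-(\d+)$")


def region_to_bed(line):
    """Convert 'chr:start-end' into a tab-separated three-column BED line."""
    match = REGION.match(line.strip())
    if match is None:
        raise ValueError(f"not a chr:start-end line: {line!r}")
    return "\t".join(match.groups())


print(region_to_bed("   chr1:3053032-3053034"))
```

Lines that do not match the expected form raise an error rather than being passed through silently, so that header or comment lines in the input are noticed.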

To convert this file into SGA, go back to the ChIP-Converter page:

/chipseq/format_convert.php

Choose the same options as before, but this time activate the checkbox "Centered SGA", which has the following effects:

Rather than taking the beginning or end position depending on the strand, the midpoint between the start and end positions is written to the SGA file.
The strand field will be set to 0, which indicates that the feature has no orientation.
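The difference between the two conversion modes can be illustrated with a short Python sketch. This is not the server's implementation. It assumes standard BED conventions (0-based start, end not included), SGA positions counted from 1, a count of 1 per BED line, a "+" strand when the BED line has no strand column, and a midpoint that is rounded down. The function name bed_to_sga and the feature names are hypothetical.

```python
def bed_to_sga(bed_line, feature, centered=False):
    """Convert one BED line into one SGA line (illustrative sketch only)."""
    fields = bed_line.rstrip("\n").split("\t")
    chrom, start, end = fields[0], int(fields[1]), int(fields[2])
    # Assumption: a missing strand column is treated as "+".
    strand = fields[5] if len(fields) > 5 else "+"

    # Convert the 0-based, half-open BED interval to 1-based inclusive positions.
    first, last = start + 1, end

    if centered:
        # Centered mode: midpoint of the interval (rounded down), no orientation.
        position, strand = (first + last) // 2, "0"
    elif strand == "-":
        # The 5' end of a tag on the minus strand is the end of the interval.
        position = last
    else:
        # The 5' end of a tag on the plus strand is the beginning of the interval.
        position = first

    return f"{chrom}\t{feature}\t{position}\t{strand}\t1"


# A peak converted in centered mode, and a minus-strand read in default mode.
print(bed_to_sga("chr1\t3053032\t3053034", "ES_Nanog_peaks", centered=True))
print(bed_to_sga("chr1\t100\t125\tread1\t0\t-", "ES_Nanog"))
```

The sketch emits one SGA line per BED line. Merging tags that map to the same position into a single line with a higher count is a separate step that is not shown here.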

Push the "Run" button and save the output file under the name ES_Nanog_peaks.sga.gz. As this file is small, you may uncompress it. An uncompressed version can be found here:

./ES_Nanog_peaks.sga

The two SGA files you have generated in this exercise correspond to the samples "ES Nanog" and "ES Nanog peaks" from the Chen08 data series used in part A of this tutorial. Try to repeat some of the analyses proposed there by uploading these data rather than selecting them from the menu of available data sets. In order to do so, activate the radio button "Upload custom Data" and select your file via the "Browse..." button. Don't forget to select format SGA and genome Mus musculus (July 2007 NCBI37/mm9).
