ChIP-Seq Tutorial 2
Part B: Reformatting and uploading your own data
(Philipp Bucher)
This tutorial explains how to use the ChIP-Seq server at:
An article describing the server and the back-end tools
can be found at:
Links to additional documentations including other hands-on
tutorials can be found on the home page indicated above.
Note that some of the analysis steps described in this tutorial rely
on programs from the Signal Search Analysis (SSA) server at:
This part of the tutorial explains how to make your data ready for
upload to the ChIP-Seq server. To understand the reformatting
procedure, it is necessary to know something about the
internal data storage format, which is called SGA (Simple Genome Annotation).
SGA is a compact tab-delimited text file format with five obligatory
fields containing the following information:
- chromosome name (e.g. chr1, NC_000067.5)
- feature name (an experiment identifier)
- position (5' end of mapped sequence tag, peak center position)
- strand +, −, or 0
- count (# of tags mapping to the same position)
Chromosomes are internally identified by RefSeq chromosome accession
numbers. This prevents confusion between different assemblies of the
same genome. For uploaded data, chromosome names used by the UCSC
genome browser are also acceptable, but in this case the corresponding
genome assembly needs to be specified on input.
The "feature name" identifies the experiment. This makes it possible to
merge data from different experiments into one SGA file. Fields 3-5
are more or less self-explanatory. Note however that genomic features
in SGA format may be assigned orientation zero, which means "unoriented".
This is appropriate for features that have no defined orientation such
as peaks derived from ChIP-Seq data.
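The five fields described above can be read into a small record type. The following is a minimal Python sketch, not the server's own parser. The names SgaRecord, parse_sga_line and format_sga_line are hypothetical, the example line is made up for illustration, and the strand is assumed to be written with the plain ASCII characters "+", "-" and "0".

```python
from typing import NamedTuple


class SgaRecord(NamedTuple):
    """One SGA line; the class and field names are illustrative only."""
    chrom: str    # chromosome name or RefSeq accession
    feature: str  # experiment identifier
    pos: int      # 5' end of a tag, or peak center
    strand: str   # "+", "-", or "0" (unoriented)
    count: int    # number of tags at this position


def parse_sga_line(line: str) -> SgaRecord:
    """Split a tab-delimited SGA line into its five obligatory fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 5:
        raise ValueError(f"expected at least 5 fields, got {len(fields)}")
    chrom, feature, pos, strand, count = fields[:5]
    if strand not in ("+", "-", "0"):
        raise ValueError(f"unexpected strand: {strand!r}")
    return SgaRecord(chrom, feature, int(pos), strand, int(count))


def format_sga_line(rec: SgaRecord) -> str:
    """Write a record back as a tab-delimited SGA line."""
    return "\t".join(str(value) for value in rec)


# Made-up example line: an unoriented peak with a count of 1.
example = "NC_000067.5\tES_Nanog_peaks\t3053033\t0\t1"
record = parse_sga_line(example)
print(record)
```

Any fields beyond the fifth are ignored by this sketch.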
Very importantly, SGA files need to be sorted by chromosome name, position and strand,
in this order of priority. This enables fast processing by the ChIP-Seq programs
at the back-end of the server. For uploaded data, sorting can be delegated to
the server.
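If you prefer to sort a small file yourself, the required order can be expressed as a sort key. The sketch below is a hedged Python illustration: the function name sga_sort_key and the example lines are hypothetical, and it assumes that chromosome names and strand symbols are compared as plain strings while positions are compared numerically. The server's own sorting may differ in detail.

```python
def sga_sort_key(line: str):
    """Sort key: chromosome name, then numeric position, then strand."""
    fields = line.rstrip("\n").split("\t")
    return (fields[0], int(fields[2]), fields[3])


# Made-up, unsorted SGA lines for illustration.
lines = [
    "chr2\tES_Nanog\t500\t+\t1",
    "chr1\tES_Nanog\t2000\t-\t2",
    "chr1\tES_Nanog\t2000\t+\t1",
    "chr1\tES_Nanog\t150\t0\t3",
]

sorted_lines = sorted(lines, key=sga_sort_key)
for line in sorted_lines:
    print(line)
```

Converting the position to an integer matters here: compared as text, "2000" would sort before "500".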
For details, see the ChIP-Seq Technical Document.
If you want to analyze your own data, you first have to convert them into
SGA format. The server offers a utility for automatic conversion
of the widely used formats BED and BAM into SGA.
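To give an idea of what such a conversion involves, here is a minimal Python sketch for stranded BED reads. It is not the server's implementation. It assumes standard BED conventions (0-based start, end not included, strand in column 6) and takes the 5' end of each read as the SGA position: start + 1 on the + strand and end on the - strand. The function name bed_to_sga and the example reads are hypothetical.

```python
from collections import Counter


def bed_to_sga(bed_lines, feature):
    """Convert stranded BED reads to sorted SGA lines (illustrative only)."""
    counts = Counter()
    for line in bed_lines:
        # Skip empty lines and BED header/comment lines.
        if not line.strip() or line.startswith(("track", "browser", "#")):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 6 or fields[5] not in ("+", "-"):
            raise ValueError(f"stranded BED line expected: {line!r}")
        chrom, start, end, strand = fields[0], int(fields[1]), int(fields[2]), fields[5]
        # 5' end in 1-based coordinates (assumption, see lead-in).
        pos = start + 1 if strand == "+" else end
        # Reads with the same chromosome, position and strand are merged.
        counts[(chrom, pos, strand)] += 1
    return [
        f"{chrom}\t{feature}\t{pos}\t{strand}\t{n}"
        for (chrom, pos, strand), n in sorted(counts.items())
    ]


# Made-up reads: two identical + strand reads and one - strand read.
bed = [
    "chr1\t100\t136\tread1\t0\t+",
    "chr1\t100\t136\tread2\t0\t+",
    "chr1\t200\t236\tread3\t0\t-",
]
sga = bed_to_sga(bed, "ES_Nanog")
for line in sga:
    print(line)
```

The two identical reads end up as a single SGA line with count 2, which is what makes the format compact.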
Note further that the ChIP-Seq server offers a large collection of public ChIP-Seq data
for the human and mouse genomes. All current data are mapped to the
genome assemblies NCBI36/hg18 and NCBI37/mm9. You can carry out some basic
tasks with data mapped to other genome assemblies, but the most
powerful features will not work. We strongly recommend that users who have
human or mouse data mapped to another assembly remap their data
to hg18 or mm9 using the "liftOver" utility provided by UCSC (see below).
Step-by-step procedure: We illustrate the procedure with a file containing
mapped sequence reads from a ChIP-Seq experiment in BED format that can
be downloaded from GEO:
Download the file. To save time, you may also copy over the file
from here:
The file contains data from an experiment carried out to map Nanog binding
sites in mouse embryonic stem cells, see
Chen et al. 2008. The genomic coordinates relate to genome assembly
NCBI36/mm8, see note in corresponding GEO entry
GSM288345. The tags thus need to be remapped to mm9.
Uncompress the bed file (on a UNIX machine: gunzip ./GSM288345_Nanog.bed.gz).
Go to the UCSC genome browser LiftOver page at
Select:
Original genome/assembly = Mouse/Feb. 2006 (NCBI36/mm8)
New genome/assembly = Mouse/July 2007 (NCBI37/mm9)
Then upload the file GSM288345_Nanog.bed, submit, and save
the output file under the name GSM288345_Nanog_mm9.bed. (The
output file can be accessed via the hyper-link "View Conversions".)
Compress this file with gzip or zip. This job may take
some time because the UCSC liftOver page doesn't accept compressed
files. For regular use, it is
recommended to download and run the liftOver program on a local
computer. You can find the remapped and gzipped bed file here:
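For a local run, the liftOver program takes four arguments: the input file, a chain file describing the mapping between assemblies, the output file, and a file for records that could not be mapped. The Python sketch below only assembles and prints that command. The chain file name mm8ToMm9.over.chain.gz and the name of the unmapped file are assumptions, and the helper names are hypothetical. Check the UCSC download area for the chain file that matches your assemblies.

```python
import subprocess


def liftover_command(bed_in, chain_file, bed_out, unmapped_out):
    """Build the argument list: liftOver oldFile map.chain newFile unMapped."""
    return ["liftOver", bed_in, chain_file, bed_out, unmapped_out]


def run_liftover(cmd):
    """Run the command; requires a locally installed liftOver binary."""
    subprocess.run(cmd, check=True)


cmd = liftover_command(
    "GSM288345_Nanog.bed",
    "mm8ToMm9.over.chain.gz",         # assumed chain file name
    "GSM288345_Nanog_mm9.bed",
    "GSM288345_Nanog_unmapped.bed",   # hypothetical name for unmapped reads
)
print(" ".join(cmd))
# Call run_liftover(cmd) once liftOver and the chain file are available.
```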
In order to convert this file into SGA format, go to the ChIP-Seq server and
choose the ChIP-Converter page from the main menu. The direct link is:
Under "Select Conversion" select "BED-to-SGA".
(New menu items will appear). Then upload the file
GSM288345_Nanog_mm9.bed.gz. (New menu items will appear).
At the bottom of the form, activate checkbox "Genome" and select
"M. musculus (July 2007 NCBI37/mm9)". Leave all other text fields or
checkboxes blank and press the "Run" button. Save the output
under the name ES_Nanog.sga.gz. This file can be found here:
The GEO entry GSM288345 also provides a peak list in a tab-delimited text
file. Download this file from GEO. You may use the direct link below
for this purpose.
Uncompress the file (gunzip on a UNIX machine). Then upload to the UCSC genome
browser and liftOver the data to mm9. The liftOver page accepts the input but returns
the remapped genome coordinates in a format with lines of the following type:
chr1:3053032-3053034
In order to convert this format into legal BED, replace all ":" and "-" by tabs with
a text editor of your choice and save the output under the name ES_Nanog_peaks.bed.
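If you would rather script this replacement than do it in a text editor, a regular expression is sufficient. The following Python sketch mirrors the replacement described above and makes no other change to the coordinates. The function name position_to_bed is hypothetical.

```python
import re

# Matches lines such as "chr1:3053032-3053034".
_POSITION = re.compile(r"^(\S+):(\d+)-(\d+)$")


def position_to_bed(line: str) -> str:
    """Turn 'chrom:start-end' into a tab-delimited 'chrom start end' line."""
    match = _POSITION.match(line.strip())
    if match is None:
        raise ValueError(f"not in chrom:start-end format: {line!r}")
    return "\t".join(match.groups())


bed_line = position_to_bed("chr1:3053032-3053034")
print(bed_line)
```

Using a pattern rather than a blanket replacement of ":" and "-" also catches malformed lines, which raise an error instead of passing through silently.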
The file resulting from this procedure can be found here:
To convert this file into SGA, go again to the ChIP-converter
page:
Choose the same options as before, but this time activate the
checkbox "Centered SGA", which has the following
effects:
- Rather than taking the beginning or end position depending
on the strand, the midpoint between the positions will be
put in the SGA file.
- The strand field will be set to 0, which indicates that
the feature has no orientation.
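The two effects listed above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the server's code: the BED interval is taken as 0-based with the end not included, the midpoint is computed in 1-based coordinates and rounded down, and the count is set to 1. The server may round differently. The function name bed_to_centered_sga is hypothetical.

```python
def bed_to_centered_sga(bed_line: str, feature: str) -> str:
    """Convert one BED interval to an unoriented SGA line at its midpoint."""
    chrom, start, end = bed_line.rstrip("\n").split("\t")[:3]
    first = int(start) + 1           # first base of the interval, 1-based
    last = int(end)                  # last base of the interval, 1-based
    midpoint = (first + last) // 2   # rounded down (assumption)
    # Strand "0" marks the feature as unoriented.
    return f"{chrom}\t{feature}\t{midpoint}\t0\t1"


centered = bed_to_centered_sga("chr1\t3053032\t3053034", "ES_Nanog_peaks")
print(centered)
```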
Push the "Run" button and save the output file under the
name ES_Nanog_peaks.sga.gz. As this file is small, you may
uncompress it. An uncompressed version can be found here:
The two SGA files you have generated in this exercise correspond
to the samples "ES Nanog" and "ES Nanog peaks" from the Chen08 data
series used in part A of this tutorial. Try to repeat some of the
analyses proposed there by uploading these data rather than selecting
them from the menu of available data sets. In order to do so, activate the
radio button "Upload custom Data" and select your file via the
"Browse..." button. Don't forget to select format SGA and
Genome Mus musculus (July 2007 NCBI37/mm9).