ChIP-Seq Analysis on Amazon cloud

Running ChIP-Seq analysis tools on the Amazon Web Services (AWS) cloud

Overview

A public AMI (Amazon Machine Image) containing our ChIP-Seq analysis tools has been created on the AWS cloud. The latest versions of all the tools have been installed and configured. This page describes how to use the tools on the AWS cloud. It is a three-step process (if you are using the AWS cloud for the first time).

STEP I: Installing and configuring Amazon Web Services command line interface (AWS-CLI)

Create an account on the AWS Console and sign in.
Open the IAM (Identity and Access Management) console.
From the navigation menu, click 'Users'. Then create or select a user name.
Click 'User Actions', then click 'Manage Access Keys' and create an access key (this will be required to configure AWS-CLI on your system).
Click 'Download Credentials', and store the keys in a secure location.
Follow the instructions here to install AWS-CLI. You can install it locally or as root.

Configure AWS-CLI (this requires the AWS access key created in IAM).

Type the following in your terminal and answer the prompts:


	$ aws configure
	AWS Access Key ID: AK***************SQ
	AWS Secret Access Key: AO**************************************Ek
	Default region name [us-east-1]: us-east-1
	Default output format [json]: json

STEP II: Start Elastic Compute Cloud (EC2) instance from AWS console

Please check the instance types offered by AWS (here) and choose one that suits your compute requirements and budget.

Sign in to the AWS console and select the region "US West (Oregon)".
Click 'EC2' (Virtual servers in the cloud).
Under 'NETWORK & SECURITY', select 'Key Pairs' and click 'Create Key Pair'. Download the key file and save it in a secure location.
Click 'Instances' in the navigation menu and then select 'Launch Instance'.
Select 'Community AMIs', search for 'ami-a9c3dc99' or 'ChIP-Seq-Tools_SIB', and select the image.
Follow the on-screen instructions to launch an instance. Copy the Public DNS/IP; it will be used in the next step.

Optionally, you can also start the EC2 instance via the CLI. Please check here for more information.
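As a rough sketch of the CLI route, the commands below launch the AMI named in this step with 'aws ec2 run-instances' and then look up the Public DNS with 'aws ec2 describe-instances'. The instance type (t2.micro) and the key pair name (keypair) are assumptions for illustration; substitute your own. The region is passed explicitly as us-west-2 (Oregon) because AMI IDs are region-specific and the default region configured in STEP I may differ. The LAUNCH switch is a hypothetical safety guard added here, not an AWS-CLI feature.

```shell
#!/bin/sh
# Sketch only: adjust the assumed values before use.
REGION="us-west-2"          # US West (Oregon), the region selected in this step
AMI_ID="ami-a9c3dc99"       # AMI ID given in this step
INSTANCE_TYPE="t2.micro"    # assumption: choose a type that fits your data and budget
KEY_NAME="keypair"          # assumption: name of the key pair created in this step

# Start one instance from the ChIP-Seq AMI.
launch_instance() {
    aws ec2 run-instances \
        --region "$REGION" \
        --image-id "$AMI_ID" \
        --instance-type "$INSTANCE_TYPE" \
        --key-name "$KEY_NAME" \
        --count 1
}

# Print the Public DNS of running instances started from this AMI.
# Call it once the instance state has changed from "pending" to "running".
show_public_dns() {
    aws ec2 describe-instances \
        --region "$REGION" \
        --filters "Name=image-id,Values=$AMI_ID" \
                  "Name=instance-state-name,Values=running" \
        --query 'Reservations[].Instances[].PublicDnsName' \
        --output text
}

# A running instance may incur charges, so nothing is launched unless LAUNCH=1 is set.
if [ "${LAUNCH:-0}" = "1" ]; then
    launch_instance
else
    echo "Set LAUNCH=1 to start $AMI_ID ($INSTANCE_TYPE) in $REGION"
fi
```

Run the script with LAUNCH=1 to start the instance, then call show_public_dns a minute or so later to obtain the address used in the next step.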

You can try it for free with the AWS Free Tier.

STEP III: Use ChIP-Seq tools on AWS

Type the following in your terminal to connect to the instance and start using the ChIP-Seq analysis tools:


chmod 400 keypair.pem
ssh -i keypair.pem ubuntu@IP # replace IP with the Public DNS/IP copied in the previous step

Tools organization

The ChIP-Seq tools are installed in /usr/bin. The main programs are the following:

chipcor: Positional correlation tool;
chipextract: Correlation and feature extraction tool;
chippeak: Peak calling tool;
chippart: Segmentation tool;
chipcenter: Tag centering tool;
chipscore: Feature selection based on tag coverage.

The software package can be found in /home/ubuntu/chipseq. The main subdirectories are the following:

data: this directory contains a few data sets that can be used to run some tests (you must unzip the data files before using them);
doc: this directory includes the user's guide (ChipSeq_Tools-UsersGuide.pdf);
tools: this directory includes a series of Perl tools that can be used to perform format conversion tasks.

The main ChIP-Seq programs read their input in a simplified GFF format called SGA (Simplified Genome Annotation), which is sorted by sequence name and position. In a data analysis pipeline, the SGA file is typically generated from a variety of richer formats, such as Solexa genome mapping files, BED files, or the FPS (Functional Position Set) files used by the Signal Search Analysis (SSA) programs at SIB.

SGA is a line-oriented, tab-delimited format with the following five obligatory fields:

1. Sequence name/ID (Char String)
2. Feature (Char String)
3. Sequence Position (Integer)
4. Strand (+/- or 0)
5. Tag Counts (Integer)
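
The field list above can be turned into a quick sanity check. The following is a minimal sketch, not one of the ChIP-Seq tools: a hypothetical shell function, check_sga, uses awk to test the five obligatory fields, and sort then orders the file by sequence name and position (using the strand as a tie-breaker is a choice made here). The sample rows are made-up values for illustration only.

```shell
#!/bin/sh
# Sketch only: check_sga is a hypothetical helper, not part of the ChIP-Seq package.

tab=$(printf '\t')
sga_file=$(mktemp)
sorted_file=$(mktemp)

# Made-up example rows: sequence name, feature, position, strand, tag count.
printf '%s\t%s\t%s\t%s\t%s\n' \
    chr2 H3K4me3 500  - 1 \
    chr1 H3K4me3 1200 + 2 \
    chr1 H3K4me3 300  + 1 > "$sga_file"

# Report malformed lines; the exit status is non-zero if any line fails.
check_sga() {
    awk -F '\t' '
        NF < 5           { printf "line %d: expected 5 fields, got %d\n", NR, NF; bad = 1; next }
        $3 !~ /^[0-9]+$/ { printf "line %d: position is not an integer\n", NR; bad = 1 }
        $4 != "+" && $4 != "-" && $4 != "0" {
                           printf "line %d: strand must be +, - or 0\n", NR; bad = 1 }
        $5 !~ /^[0-9]+$/ { printf "line %d: tag count is not an integer\n", NR; bad = 1 }
        END              { exit bad }
    ' "$1"
}

if check_sga "$sga_file"; then
    # Order by sequence name, then numerically by position; strand breaks ties.
    LC_ALL=C sort -s -t "$tab" -k1,1 -k3,3n -k4,4 "$sga_file" > "$sorted_file"
    cat "$sorted_file"
fi
```

Setting LC_ALL=C keeps the ordering of sequence names independent of the locale configured on the machine.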

The following is an example of how to use the chipcor program (feature correlation tool):

chipcor -A "H3K4me3 +" -B "H3K4me3 -" -b -1000 -e 1000 -w 1 -c 20 -n 1 H3K4me3.sga > H3K4me3_fc_n1.out

Here, 'H3K4me3.sga' is the file containing the list of ChIP-Seq tags for the H3K4me3 histone modification. The '-c' option specifies the cut-off on input tag counts. Tags on the positive strand (option '-A "H3K4me3 +"') are correlated with tags for the same histone modification on the opposite strand (option '-B "H3K4me3 -"'), and their relative distances are collected in a histogram over the range [-1000; +1000] (options '-b -1000' and '-e 1000') with a window width of 1 (option '-w 1'). The output file (H3K4me3_fc_n1.out) contains all histogram entries in simple text format. Histogram entries show count density values (option '-n 1') of the target feature (H3K4me3 tags on the negative strand) at relative distances to the reference feature (H3K4me3 tags on the positive strand). 'Count density' means the number of tags per base pair.
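
To illustrate how such a histogram might be read, the sketch below finds the relative distance with the highest count density. It assumes that the output file holds two whitespace-separated columns (relative distance, value) and that any line starting with '#' is a comment; please check this against your own H3K4me3_fc_n1.out. The numbers in the sample file are invented for illustration and are not real results. For a correlation between the two strands such as the one above, the position of the maximum is commonly read as a rough estimate of the average fragment length.

```shell
#!/bin/sh
# Sketch only: the two-column layout of the histogram file is an assumption.

hist_file=$(mktemp)

# Invented example values: relative distance, count density.
cat > "$hist_file" <<'EOF'
-2 0.010
-1 0.012
0 0.015
1 0.031
2 0.018
EOF

# Keep the row whose second column is the largest seen so far.
peak=$(awk '
    /^#/ || NF < 2 { next }    # skip comment, blank, and short lines
    !seen || $2 + 0 > max { max = $2 + 0; pos = $1; val = $2; seen = 1 }
    END { if (seen) print pos, val }
' "$hist_file")

echo "maximum (distance density): $peak"
```

To use it on real output, point hist_file at H3K4me3_fc_n1.out instead of the temporary sample file.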

Other useful tools installed on AWS include:

Bowtie
Samtools/vcftools
R

Last update June 2017

SIB Swiss Institute of Bioinformatics | Computational Cancer Genomics |
