Application Examples

Scan the hg19 human genome with the chen10 CTCF Position Weight Matrix (PWM)

Important remarks:

For the following application examples we will use the hg19 human assembly files that can be found in the genomedb sub-directory of the PWMScan software package.

The PWMScan programs rely on a few rules to make output conversion to BED format easier, namely FASTA header and file naming conventions.

The final output format is BEDdetail, which is an extension of BED format that is used to enhance the track display page. For PWMScan, we use BEDdetail format to include the name of the PWM as well as the p-value associated to the sites identified by PWMScan.

The BEDdetail format provides the following obligatory fields:

FASTA headers

All genome sequence files have a FASTA header that is formatted as follows:

            >chr|NC_000001|NC_000001.10 some text
        

The sequence identifier includes three words concatenated via a UNIX pipe '|': the word 'chr' followed by the NCBI RefSeq partial and full accession identifiers. The accession records indicate sequence identity. The sequence identifier is parsed by most of the PWMScan programs and utilities.

For each genome assembly, we provide a table for NCBI RefSeq identifier to chromosome number conversion (chr_NC_gi) as well as a table for FASTA sequence identifier to chromosome number conversion (chr_hdr). The chr_NC_gi file has been downloaded from NCBI whereas the chr_hdr file has been locally-generated. By consequence, the chr_hdr table can be changed to adapt to other FASTA header format conventions.

Here is an example for the human genome assembly hg19:

            chr_NC_gi
            #Chr    Accession.ver   gi
            1       NC_000001.10    224384768
            2       NC_000002.11    224384767
            3       NC_000003.11    224384766
            4       NC_000004.11    224384765
            5       NC_000005.9     224384764
            6       NC_000006.11    224384763
            7       NC_000007.13    224384762
            8       NC_000008.10    224384761
            9       NC_000009.11    224384760
            10      NC_000010.10    224384759
            11      NC_000011.9     224384758
            12      NC_000012.11    224384757
            13      NC_000013.10    224384756
            14      NC_000014.8     224384755
            15      NC_000015.9     224384754
            16      NC_000016.9     224384753
            17      NC_000017.10    224384752
            18      NC_000018.9     224384751
            19      NC_000019.9     224384750
            20      NC_000020.10    224384749
            21      NC_000021.8     224384748
            22      NC_000022.10    224384747
            X       NC_000023.10    224384746
            Y       NC_000024.9     224384745
            M       NC_012920.1     251831106
        
            chr_hdr
            #Chr    Sequence Header                 Assembly
            1       chr|NC_000001|NC_000001.10      hg19
            2       chr|NC_000002|NC_000002.11      hg19
            3       chr|NC_000003|NC_000003.11      hg19
            4       chr|NC_000004|NC_000004.11      hg19
            5       chr|NC_000005|NC_000005.9       hg19
            6       chr|NC_000006|NC_000006.11      hg19
            7       chr|NC_000007|NC_000007.13      hg19
            8       chr|NC_000008|NC_000008.10      hg19
            9       chr|NC_000009|NC_000009.11      hg19
            10      chr|NC_000010|NC_000010.10      hg19
            11      chr|NC_000011|NC_000011.9       hg19
            12      chr|NC_000012|NC_000012.11      hg19
            13      chr|NC_000013|NC_000013.10      hg19
            14      chr|NC_000014|NC_000014.8       hg19
            15      chr|NC_000015|NC_000015.9       hg19
            16      chr|NC_000016|NC_000016.9       hg19
            17      chr|NC_000017|NC_000017.10      hg19
            18      chr|NC_000018|NC_000018.9       hg19
            19      chr|NC_000019|NC_000019.9       hg19
            20      chr|NC_000020|NC_000020.10      hg19
            21      chr|NC_000021|NC_000021.8       hg19
            22      chr|NC_000022|NC_000022.10      hg19
            X       chr|NC_000023|NC_000023.10      hg19
            Y       chr|NC_000024|NC_000024.9       hg19
            M       chr|NC_012920|NC_012920.1       hg19
        

Note that the second field of the chr_hdr file corresponds to the chromosome sequence identifier in the FASTA files. Given that the Bowtie indices are generated starting from the chromosome FASTA files, the Bowtie output will include the sequence identifier as specified in the second field of the chr_hdr file. To convert to BED format, which uses chromosome numbers, programs such as bowtie2bed need to read the chr_hdr table.

If Bowtie indices are generated from chromosome FASTA files with different headers (or sequence identifiers), the second field of the chr_hdr table has to be changed accordingly.

Chromosome files downloaded from UCSC are named chr*.fa. If this is the case, the adapted wrapper scripts pwm_scan_ucsc and pwm_mscan_wrapper_ucsc should be used instead of the deafult ones pwm_scan and pwm_mscan_wrapper (see the Wrapper Scripts Section). The pwm_bowtie_wrapper script should in principle work with genome index files built from UCSC genome assemblies, provided the chr_hdr file is changed according to the UCSC chromosome header convention, which is the following:

            >chr1
        

An example of such header files for the mouse genome mm10 downloaded from UCSC would then be the following:

            chr_NC_gi
             #Chr    Accession.ver   gi      Assembly
             1       chr1    372099109       mm10
             2       chr2    372099108       mm10
             3       chr3    372099107       mm10
             4       chr4    372099106       mm10
             5       chr5    372099105       mm10
             6       chr6    372099104       mm10
             7       chr7    372099103       mm10
             8       chr8    372099102       mm10
             9       chr9    372099101       mm10
             10      chr10   372099100       mm10
             11      chr11   372099099       mm10
             12      chr12   372099098       mm10
             13      chr13   372099097       mm10
             14      chr14   372099096       mm10
             15      chr15   372099095       mm10
             16      chr16   372099094       mm10
             17      chr17   372099093       mm10
             18      chr18   372099092       mm10
             19      chr19   372099091       mm10
             X       chrX    372099090       mm10
             Y       chrY    372099089       mm10
             M       chrM    34538597        mm10
        
            chr_hdr
             #Chr    Sequence Header  Assembly
             1       chr1             mm10
             2       chr2             mm10
             3       chr3             mm10
             4       chr4             mm10
             5       chr5             mm10
             6       chr6             mm10
             7       chr7             mm10
             8       chr8             mm10
             9       chr9             mm10
             10      chr10            mm10
             11      chr11            mm10
             12      chr12            mm10
             13      chr13            mm10
             14      chr14            mm10
             15      chr15            mm10
             16      chr16            mm10
             17      chr17            mm10
             18      chr18            mm10
             19      chr19            mm10
             X       chrX             mm10
             Y       chrY             mm10
             M       chrM             mm10
        

Files and paths

The tar files genomes_chr_NC_gi.tar.gz and genomes_chr_hdr.tar.gz include a list of chr_NC_gi and chr_hdr tables for several common model organisms such as human, mouse, fruit fly, worm, zebrafish, yeast, and more. As mentioned above, this tables are used to convert match lists to BED format.

To check the content, type:

tar -tvzf genomes_chr_NC_gi.tar.gz
tar -tvzf genomes_chr_hdr.tar.gz

To extract the files, execute the following commands:

cd genomedb
tar -xvzf genomes_chr_NC_gi.tar.gz
tar -xvzf genomes_chr_hdr.tar.gz

The default genome root location (variable genome_root_dir) for conversion programs and scripts is the following:

genome_root_dir=../genomedb

which corresponds to the root directory of the entire genome assembly data, and is the genome directory used for the following application examples..

The genome root directory can be changed via the command option -d <genome-root-dir> in all bash wrapper scripts (as described below).

The PWM for the CTCF transcription factor that will be used in our example is the chen10_ctcf.mat file in the examples sub-directory..

Upon installation, all PWMScan-specific binary files are located in the bin sub-directory, so we define:

bin_dir=../bin

Analysis Pipeline using two different scanning methods

Shell (bash) Wrappers to execute the analysis pipeline

These shell wrapper scripts have been written to make it easier to run the whole pwmscan pipeline, irrespecitve of what method one chooses.
They require to specify the matrix file containing the integer PWM, the p-value, the path to the assembly or index files, and the UCSC assembly name for the species (e.g. hg19, hg38, mm9 etc.). Optional parameters are the overlapping matches flag (-o), the forward scanning (-f) option, the background base composition (-b), the parallel execution (-p) and write to file (-w) flags.
If the '-o' option is specified, overlapping matches are retained. The '-f' option activates forward scanning. If the '-p' option is set, the matrix_scan program execution is done by parallel dispatch. If the -w option is set, the results are written to file, else output goes to STDOUT.
The output file is a match list in BEDdetail format.

The pwm_scan script scans a genome with a PWM using either Bowtie or matrix_scan depending on both the p-value and the matrix length. For high p-values and long motifs, the Bowtie-based strategy becomes inefficient and the matrix_scan-based approach is the fastest one.
The pwm_bowtie_wrapper script implements the Bowtie-based pipeline whereas the pwm_mscan_wrapper script uses the matrix_scan-based approach.
The pwm_scan_ucsc and pwm_mscan_wrapper_ucsc scripts are just adaptations of the original ones in order to deal with chromosome files downloaded from UCSC.

The pwmlib_scan script scans a genome with a collection of PWMs coming from a database.
Accepted motif database formats are the MEME libraries as well as the motif libraries (both log-odds and letter-probability formats) provided by the PWMScan Web interface.

NOTE: The path to all the PWMScan binaries and scripts is defined by the bin_dir variable and is hard-coded in all bash scripts, and is changed upon installation via the Makefile, so that all installed scripts in bin_dir have the correct binary pathname.

A few remarks on the general pwm_scan script.

The <genome-root-dir> input argument is the root directory of the genome files (the Bowtie index files, and the FASTA chromosome files used by matrix_scan, and the assembly tables used to convert the PWMScan pipeline output to BED format).
<genome-root-dir> is supposed to have two sub-directories types, the /bowtie sub-directory where the Bowtie index files are stored, and the /<assembly> sub-directory for storing assembly-specific chromosome files.

In our setting, <genome-root-dir> is set to ../genomedb.
The chrNC_dir variable (used for locating the the chr_NC_gi/chr_hdr files) is set to chrNC_dir=<genome-root-dir>.

Shell (bash) Wrapper for matrix conversion

The pwm_convert bash script is used to convert JASPAR, TRANSFAC, PFM, LPM and real PWM formats to integer log-odds format.

How to compute the background base composition of a set of DNA sequences

How to build Bowtie index files