Background
Input is a set of genomic or RNA sequences in FASTA format.
The PWM format used is the letter-probability matrix.
For each sequence, the program calculates a sum occupancy score: the sum over all motif matches in the sequence, each weighted by its match score. Sequences are then ranked by this occupancy score.
For probe sequence s and matrix PWM of length k, the sum occupancy score is computed as follows:
(1) |
where pi is the nucleotide prior (or background) probability. PWM frequencies are normalized by the background sequence composition.
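As an illustration, the Python sketch below assumes that the score of one match is the product, over the k motif positions, of the matrix probability divided by the background prior of the observed base, and that the sum occupancy score adds these products over every start position. This form is an assumption consistent with the description above, not a confirmed transcription of equation (1). The names `window_score`, `sum_occupancy`, `rank_sequences`, the example matrix, and the uniform background are all hypothetical, and only the forward strand of uppercase A/C/G/T sequences is handled.

```python
from math import prod

INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}

# Hypothetical 3-position letter-probability matrix (rows = motif positions,
# columns = A, C, G, T) and an assumed uniform background.
EXAMPLE_LPM = [
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.1, 0.7, 0.1],
    [0.1, 0.1, 0.1, 0.7],
]
UNIFORM = [0.25, 0.25, 0.25, 0.25]


def window_score(window, lpm, background):
    """Product over motif positions of probability / background prior."""
    return prod(
        lpm[j][INDEX[base]] / background[INDEX[base]]
        for j, base in enumerate(window)
    )


def sum_occupancy(seq, lpm, background):
    """Sum of window scores over every start position (forward strand only)."""
    k = len(lpm)
    return sum(
        window_score(seq[t:t + k], lpm, background)
        for t in range(len(seq) - k + 1)
    )


def rank_sequences(seqs, lpm, background):
    """Return sequence names ordered from highest to lowest score."""
    return sorted(
        seqs,
        key=lambda name: sum_occupancy(seqs[name], lpm, background),
        reverse=True,
    )
```

A sequence shorter than the motif has no windows and scores 0 in this sketch. Ambiguity codes such as N would need explicit handling before use on real FASTA input.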
The background sequence composition can be estimated by the following three methods:
If the best match scoring option is set, the program reports the best single match score within the sequence instead of computing the sum occupancy score. The sequence ranking is therefore based on best single match scores.
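The difference from the sum occupancy score can be sketched in Python as follows. The per-window score used here (a product of probability-to-prior ratios) is an assumption, as are the function names `match_scores` and `best_match`, the two-position example motif, and the uniform background. The point of the sketch is only that the best match option keeps the maximum window score instead of the sum.

```python
INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}


def match_scores(seq, lpm, background):
    """Score every k-mer window as a product of probability / prior ratios."""
    k = len(lpm)
    scores = []
    for start in range(len(seq) - k + 1):
        score = 1.0
        for j, base in enumerate(seq[start:start + k]):
            col = INDEX[base]
            score *= lpm[j][col] / background[col]
        scores.append(score)
    return scores


def best_match(seq, lpm, background):
    """Return (score, start) of the best window, or None if seq is too short."""
    scores = match_scores(seq, lpm, background)
    if not scores:
        return None
    start = max(range(len(scores)), key=scores.__getitem__)
    return scores[start], start


# Hypothetical 2-position motif preferring "GC", with a uniform background.
lpm = [[0.1, 0.1, 0.7, 0.1], [0.1, 0.7, 0.1, 0.1]]
background = [0.25, 0.25, 0.25, 0.25]
score, start = best_match("ATGCAT", lpm, background)
```

Ranking by `best_match` favours sequences with one strong site, whereas a sum over windows also rewards several weaker sites.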
The motif libraries have been downloaded from the MEME Motif Database, and include the following sets: JASPAR CORE 2018, Human and Mouse HT-SELEX motifs (Jolma 2013), UniPROBE Mouse, and HOCOMOCO (version 11).
1.1 Conversion of position frequency or count matrix (PFM, JASPAR, TRANSFAC) to letter probability matrix (LPM)
Base counts are converted to corrected frequencies (f_ib), that is, relative frequencies corrected by pseudo-count fractions distributed according to the residue priors, using the following formula:
(2) |
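A widely used form of this correction is f_ib = (n_ib + c * pi_b) / (N_i + c), where n_ib is the count of base b at position i, N_i is the row total, pi_b is the residue prior, and c is the total pseudo-count added per position. The Python sketch below assumes that form; it is not confirmed to be identical to equation (2), and the function name `counts_to_frequencies`, the parameter `pseudo_count`, and its default value are hypothetical.

```python
def counts_to_frequencies(counts, priors, pseudo_count=1.0):
    """Convert rows of base counts to pseudo-count corrected frequencies.

    Assumed form: f_ib = (n_ib + c * pi_b) / (N_i + c), where N_i is the
    row total and c is the total pseudo-count added per position.
    """
    freqs = []
    for row in counts:
        total = sum(row) + pseudo_count
        freqs.append(
            [(n + pseudo_count * p) / total for n, p in zip(row, priors)]
        )
    return freqs


# Count rows 01 and 03 of the TRANSFAC-style example under Matrix Formats
# (A, C, G, T order), with an assumed uniform prior.
counts = [[0, 0, 0, 14], [2, 12, 0, 0]]
priors = [0.25, 0.25, 0.25, 0.25]
freqs = counts_to_frequencies(counts, priors)
```

Because the priors sum to 1, every corrected row still sums to 1, and no entry is exactly zero, which keeps later ratio or logarithm computations well defined.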
1.2 Conversion of position weight matrix (PWM) to letter probability matrix (LPM)
A position weight matrix can be converted to a letter probability matrix in two ways:
(3) |
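For orientation only, the Python sketch below shows one generic conversion, under the assumption that the weights are log-odds of the form w_ib = log2(p_ib / pi_b). In that case p_ib = pi_b * 2^(w_ib), followed by a renormalization of each row. This is not claimed to be either of the two conversions referred to above or to reproduce equation (3); the function name `log_odds_to_lpm` and its `base` parameter are hypothetical.

```python
import math


def log_odds_to_lpm(weights, priors, base=2.0):
    """Invert assumed log-odds weights w = log_base(p / prior), row by row."""
    rows = []
    for row in weights:
        raw = [p * base ** w for w, p in zip(row, priors)]
        total = sum(raw)  # renormalize so each row sums to 1
        rows.append([x / total for x in raw])
    return rows


# Round trip on a hypothetical matrix: probabilities -> log-odds -> probabilities.
priors = [0.25, 0.25, 0.25, 0.25]
original = [[0.7, 0.1, 0.1, 0.1], [0.25, 0.25, 0.25, 0.25]]
weights = [[math.log2(p / q) for p, q in zip(row, priors)] for row in original]
recovered = log_odds_to_lpm(weights, priors)
```

The renormalization step matters when the weights were rounded or were built with pseudo-counts, since the inverted values then no longer sum exactly to 1.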
Matrix Formats
TRANSFAC-style matrices look like this:
AC  M00223
XX
ID  V$STAT_01
XX
DT  29.11.1995 (created); ewi.
DT  11.03.2003 (updated); dtc.
CO  Copyright (C), Biobase GmbH.
XX
NA  STATx
XX
DE  signal transducers and activators of transcription
XX
BF  T01575 STAT1alpha; Species: mouse, Mus musculus.
BF  T01492 STAT1alpha; Species: human, Homo sapiens.
BF  T01573 STAT1beta; Species: human, Homo sapiens.
XX
PO     A     C     G     T
01     0     0     0    14     T
02     0     0     0    14     T
03     2    12     0     0     C
04     0    12     0     2     C
05     0     9     2     3     C
06     4     2     8     0     G
07     0     0     8     6     K
08    13     1     0     0     A
09    14     0     0     0     A
XX
BA  14 genomic sequences of 14 different genes
XX
CC  compiled sequences
XX
RN  [1]; RE0003481.
RX  PUBMED: 7774815.
RA  Horvath C. M., Wen Z., Darnell jr J. E.
RT  A STAT protein domain that determines DNA sequence recognition suggests a novel DNA-binding domain
RL  Genes Dev. 9:984-994 (1995).
XX
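A minimal Python sketch for pulling the count rows out of such a record is given below. It assumes one field-tagged line per record line, with the matrix starting at the `PO` line and ending at the next non-numeric tag. The function name `parse_transfac_counts` is hypothetical, and the embedded text is an excerpt of the example above.

```python
TRANSFAC_TEXT = """\
AC  M00223
XX
ID  V$STAT_01
XX
PO     A     C     G     T
01     0     0     0    14     T
02     0     0     0    14     T
03     2    12     0     0     C
04     0    12     0     2     C
05     0     9     2     3     C
06     4     2     8     0     G
07     0     0     8     6     K
08    13     1     0     0     A
09    14     0     0     0     A
XX
"""


def parse_transfac_counts(text):
    """Return one [A, C, G, T] count row per motif position."""
    rows = []
    in_matrix = False
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] in ("PO", "P0"):
            in_matrix = True  # header line naming the four columns
        elif in_matrix and fields[0].isdigit():
            # fields[1:5] are the counts; a trailing consensus letter is ignored
            rows.append([float(x) for x in fields[1:5]])
        elif in_matrix:
            break  # first non-numeric tag after the matrix ends it
    return rows


rows = parse_transfac_counts(TRANSFAC_TEXT)
```

The resulting rows are in position order with columns A, C, G, T, which is the same orientation as the PFM-style and LPM-style examples.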
JASPAR-style matrices look like this:
>MA0137.2 STAT1
A [ 208  859  251   10    8  106   23  528  696   53 1900 2030  954  336  417 ]
C [1076  496  574   22   14 1921 1900  762   31   30  124   17  263  760  804 ]
G [ 415  279  144   11   38   14    7  115 1292 1700   29   23  552  270  425 ]
T [ 378  446 1112 2038 2023   44  155  680   66  302   32   15  315  714  431 ]
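Unlike the other formats shown here, the JASPAR-style layout has one line per base and one column per motif position, so it has to be transposed before it can be used row by row. The Python sketch below shows one way to do that. It assumes one bracketed line per base, and the names `parse_jaspar` and `ROW` are hypothetical.

```python
import re

JASPAR_TEXT = """\
>MA0137.2 STAT1
A [ 208 859 251 10 8 106 23 528 696 53 1900 2030 954 336 417 ]
C [1076 496 574 22 14 1921 1900 762 31 30 124 17 263 760 804 ]
G [ 415 279 144 11 38 14 7 115 1292 1700 29 23 552 270 425 ]
T [ 378 446 1112 2038 2023 44 155 680 66 302 32 15 315 714 431 ]
"""

# One base letter, optional spaces, then the bracketed list of counts.
ROW = re.compile(r"^\s*([ACGT])\s*\[([^\]]*)\]")


def parse_jaspar(text):
    """Return (header, rows); rows[i] holds the A, C, G, T counts at position i."""
    header = None
    by_base = {}
    for line in text.splitlines():
        if line.startswith(">"):
            header = line[1:].strip()
            continue
        m = ROW.match(line)
        if m:
            by_base[m.group(1)] = [float(x) for x in m.group(2).split()]
    # Transpose from one list per base to one row per motif position.
    rows = [list(col) for col in zip(*(by_base[b] for b in "ACGT"))]
    return header, rows


header, rows = parse_jaspar(JASPAR_TEXT)
```

The regular expression tolerates the missing space after the bracket in lines such as `C [1076 ...`.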
PFM-style matrices look like this:
> M00224 STAT1
8.0 18.0 41.9 32.1
23.2 9.7 37.2 29.8
13.9 31.3 19.8 35.2
17.5 37.9 38.1 6.4
59.6 13.5 15.1 11.9
31.2 26.3 9.4 33.2
0 0 0 100
0 0 0 100
3.9 90.9 2 3.2
0 100 0 0
0 32.5 67.5 0
0 0 100 0
3.8 0 94.9 1.2
100 0 0 0
100 0 0 0
36.3 8.8 29.5 25.5
11.1 11.1 11.1 66.7
19.8 10.5 69.6 0
24.8 12.7 46.6 15.9
12.5 36.7 49.2 1.5
31.8 12.6 39.7 16
LPM-style matrices look like this:
> letter-probability matrix V_STAT1_01: alength= 4 w= 21 nsites= 100.1 E= 0
0.080000 0.180000 0.419000 0.321000
0.232232 0.097097 0.372372 0.298298
0.138723 0.312375 0.197605 0.351297
0.175175 0.379379 0.381381 0.064064
0.595405 0.134865 0.150849 0.118881
0.311688 0.262737 0.093906 0.331668
0.000000 0.000000 0.000000 1.000000
0.000000 0.000000 0.000000 1.000000
0.039000 0.909000 0.020000 0.032000
0.000000 1.000000 0.000000 0.000000
0.000000 0.325000 0.675000 0.000000
0.000000 0.000000 1.000000 0.000000
0.038038 0.000000 0.949950 0.012012
1.000000 0.000000 0.000000 0.000000
1.000000 0.000000 0.000000 0.000000
0.362637 0.087912 0.294705 0.254745
0.111000 0.111000 0.111000 0.667000
0.198198 0.105105 0.696697 0.000000
0.248000 0.127000 0.466000 0.159000
0.125125 0.367367 0.492492 0.015015
0.317682 0.125874 0.396603 0.159840
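The PFM-style and LPM-style examples above appear to describe the same STAT1 matrix: dividing each PFM row by its own total reproduces the corresponding LPM row to six decimals. The Python sketch below checks this for the first three rows. Whether the tool itself performs exactly this step, without pseudo-counts, is an assumption, and the function name `normalize_rows` is hypothetical.

```python
def normalize_rows(pfm, digits=6):
    """Divide every entry by its row total and round to `digits` decimals."""
    return [[round(v / sum(row), digits) for v in row] for row in pfm]


# First three rows of the PFM-style example (A, C, G, T order).
pfm_rows = [
    [8.0, 18.0, 41.9, 32.1],
    [23.2, 9.7, 37.2, 29.8],
    [13.9, 31.3, 19.8, 35.2],
]

# First three rows of the LPM-style example.
expected_lpm_rows = [
    [0.080000, 0.180000, 0.419000, 0.321000],
    [0.232232, 0.097097, 0.372372, 0.298298],
    [0.138723, 0.312375, 0.197605, 0.351297],
]

lpm_rows = normalize_rows(pfm_rows)
```

Note that the PFM rows do not all sum to 100 (the second sums to 99.9 and the third to 100.2), which is why a per-row total is used rather than a fixed divisor.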