Does a positive score imply a larger PWM score in the reference allele?
The scorediff column reports the difference in PWM scores between alternate and reference allele. Therefore, a positive score implies a larger PWM score in the alternate allele.
Are score differences comparable across PWMs?
No, large score differences for a PWM do not necessarily imply large effects on TF binding.
One could use ChIP-seq data to compare score differences and effects of binding site occupancy, but this is not trivial.
Currently, we haven't implemented an easy way to link score differences with effects on real binding.
How is the TF enrichment calculation done? And what does it represent?
The TF enrichment values are calculated as the ratio of the observed SNP hits over the expected ones (for each TF).
For each TF, the enrichment value is computed on the bases of the following numbers:
So, the expected number of SNP hits is calculated as the product of p (Number 5) and the number of SNPs in the uploaded set that are present in the database
The p probability is, in turn, computed as the ratio of the total number of SNPs affecting genome-wide the given TF (Number 2) over the total number of SNPs in the database (Number 1): p=Number 2/Number 1.
- The total number of SNPs in our database (~3'500'000);
- The number of SNPs in the database that affect the TF of interest (column #2 of the TF enrichment statistics output file);
- The number of observed SNPs hits for the TF of interest (column #3 of the TF enrichment statistics output file);
- The number of SNPs in the uploaded set that are present in the database (Number of Matching variants shown in the SNPSelect output page);
- The probability (p) of observing a SNP hit in the database for the TF of interest.
As a demonstration, we consider Example1 on the Web entry page of SNPSelect. In this example, we use the diabetes-related set of variants from the NHGRI-EBI catalog.
SNPSelect extracts 188 variants matching SNP2TFBS (Number 3) out of the initial set of 817 variants.
The TF-enrichment plot shows the SRF TF as the top-ranked factor with an enrichment value of about 8.9.
By clicking the TF Enrichment statistics link at the bottom of the output page, the results file from SNPSelect is displayed. For the SRF TF, the number of observed SNP hits is 5 (3rd field) out of 10323 SNPs genome-wide (second field). The probability p of getting a hit by chance (success) is given by:
where 3'500'000 is the total number of SNPs in our database.
- p = 10323/3500000 ~ 0.003
The expected number of SNP hits for SRF is therefore p*188=0.564, and the enrichment value is obtained by the following calculation:
- TF-enrich(SRF) = 5/0.564 = 8.9
How do we compute the P-value for TF enrichment?
To compute the p-values, we use a binomial-based test and different color codes to display different levels of significance.
We use the R function pbinom, the cumulative distribution function for the binomial distribution with parameters size and prob.
It is conventionally interpreted as the number of successes in size trials:
where q is the number of observations, prob the vector of probabilities, and size the number of trials (zero or more). With lower.tail=TRUE, probabilities are P[X ≤ x], otherwise P[X > x].
- pbinom(q, size, prob, lower.tail = TRUE, log.p = FALSE)
So, taking the example above, if the number of observed SNP hits is 5, the selected variants (size) are 188, and the probability of getting a hit by chance is prob=0.003 as shown before, the p-value is computed as follows:
The p-value represents the probability that more than 4 hits are observed by chance, therefore q=5-1. Since p-value<0.005, the color code of the corresponding enrichment value is displayed as a red dot.
- p-value = pbinom((5-1),188, 0.003, lower.tail=F) = 0.0002859739
No correction for multiple testing is applied.
Why is the score difference between reported SNPs sometines zero?
The scorediff column reports the difference in PWM scores between alternate and reference allele. A positive score implies a larger PWM score in the alternate allele.
The scoring system based on PWMs (integer log-odds position weight matrices) has, of course, its own limitations.
First of all, given that we use PWM raw binding scores, which are computed as the sum of the position-specific weights over all bases of the binding site, these numbers have no absolute meaning and are not comparable across different PWMs (or TFs).
Secondly, predicted TF binding sites match the PWM with a p-value threshold of 10-5. The P-value of PWM raw score x is defined as the probability that a random k-mer sequence has a binding score ≥x given the base composition of the human genome. This is a quite stringent cut-off, and depending on the base composition and length of the PWM model, it may result in a very restricted raw score space. As a matter of fact, for several TFs, such as BRCA1, ARID2A and CREB1, and many others, above the 10-5 cut-off all raw scores have the same value. If you have a look at the cumulative PWM score distribution function, you will see that the explored score space is actually the flat right-end tail of the curve. We are thinking of lowering the low cut-off threshold or adding new PWM score-based filtering options on the Web interface so to have a more suitable scoring system.
With regard more specifically to this question, when comparing the score difference alt-ref for each overlapping variant, if the PWM match is missing in one of the two genomes (ref and alt), we use the raw low score threshold to score the missing hit. It might so happen that for those PWMs that assign the same raw score to all matches above the p-value threshold of 10-5, we get a score difference of 0 for variants that appear to disrupt or create a new binding site.
For this reason, in such cases, we indicate the score difference as '0' if the SNP disrupts an existing binding site or '>0' if it creates a new binding site.
This is not very intuitive and should be changed as well.