Combined P-Value Methods
Statistical methods for aggregating evidence across datasets in SSPsyGene
Overview & Pipeline
Each gene in the SSPsyGene database can appear in multiple datasets and, within a single dataset, can have multiple associated p-values (e.g., one per perturbation experiment that affected it). Our goal is to combine these p-values into a single summary statistic per gene that captures total evidence across all experiments. This is exactly what p-value combination methods like Fisher's, CCT, and HMP are designed for. They take multiple p-values and produce a single aggregate p-value.
Are p-values from different assays comparable? Yes. P-values have no “unit”. Regardless of the assay (RNA-seq, behavioral testing, GWAS, etc.), a p-value is always a number between 0 and 1 with the same statistical meaning. Under the null hypothesis, every p-value is uniformly distributed on [0, 1], no matter where it came from. This makes them inherently comparable and valid to combine across assay types. Because all of our datasets fall within the broad domain of neuropsychiatric research, the combined rankings answer the question: which genes show consistent evidence of association across a range of neuropsychiatric diseases and assay types? Reassuringly, the top-ranked genes are well-known players in brain development, many of which already cause known Mendelian neurodevelopmental disorders, giving us confidence that the method is working as intended.
The pipeline for each gene proceeds as follows:
- Collect raw p-values from every dataset table that declares a pvalue_column. A single gene may contribute multiple p-values per table and across many tables.
Note: this step uses the nominal (unadjusted) p-value column declared by each dataset, not the per-dataset FDR / padjcolumn. That is why combined values can be far smaller than any single study's adjusted p-value.
- Pre-collapse(for Fisher only): reduce each table's p-values for that gene down to a single per-table p-value using min(p) × n, capped at 1.0.
- Combine using three methods: Fisher operates on the collapsed per-table p-values; CCT and HMP operate directly on all raw p-values.
Fisher requires at least 2 collapsed table p-values (both < 1.0) to produce a result. CCT and HMP can operate on any number of p-values ≥ 1. All statistical computations are performed in R using reference implementations.
Pre-Collapse: Bonferroni Within-Table Correction
Problem: A gene may appear in multiple rows of the same data table. For instance, in a perturbation screen, gene G might be a differentially expressed target in experiments where 5 different risk genes were knocked down. That gives us 5 p-values for G from a single table. These p-values are not independent; they all come from the same assay measuring the same gene, and we should not feed them individually into Fisher as though they were independent studies.
Solution: For each gene-table combination, we compute a single representative p-value:
where n is the number of rows for that gene in that table. This is the Bonferroni correction applied to the minimum: we take the best p-value but penalize it by the number of looks. This is conservative but guarantees we do not inflate significance from within-table multiplicity.
Who uses it:Fisher's method. The CCT and HMP, being robust to correlation, operate on the full set of raw p-values directly.
Fisher's Method
Fisher's method (1932) is the oldest and most widely used p-value combination technique. Under the null hypothesis, each p-value is Uniform(0, 1), so −2 ln(p) is distributed as χ²(2). Sums of independent chi-squared variables are themselves chi-squared.
Test statistic:
where k is the number of tables (after pre-collapse).
Null distribution:
The combined p-value is P(χ²(2k) ≥ X²).
Why it works:
- If p ~ Uniform(0,1), then −ln(p) ~ Exponential(1).
- An Exponential(1) variable equals Gamma(1,1), and 2×Exponential(1) = χ²(2).
- Therefore −2 ln(p) ~ χ²(2) for each p.
- Sums of independent χ² variables: χ²(d1) + χ²(d2) = χ²(d1+d2).
- Hence X² ~ χ²(2k).
Independence assumption:Step 4 requires the p-values to be independent. When p-values are positively correlated, Fisher's method tends to be anti-conservative. This is why we use the pre-collapse step to reduce inputs to one per table.
Computed using poolr::fisher().
Fisher, R.A. (1932). Statistical Methods for Research Workers, 4th ed.
Cauchy Combination Test (CCT)
The CCT (Liu & Xie, 2019) was designed for settings where input p-values may be correlated. It exploits a special property of the Cauchy distribution.
Test statistic:
where L is the total number of raw p-values and wi = 1/L (equal weights summing to 1).
The key transform: Uniform to Cauchy:
- p ~ Uniform(0,1).
- 0.5 − p~ Uniform(−0.5, 0.5).
- (0.5 − p) · π ~ Uniform(−π/2, π/2).
- tan((0.5 − p) · π) ~ Cauchy(0, 1).
Step 4 is a classical result: if U~ Uniform(−π/2, π/2), then tan(U) follows a standard Cauchy distribution.
Why the Cauchy distribution is special: Any weighted sum of independent Cauchy random variables is again Cauchy. With our weights summing to 1, T ~ Cauchy(0,1) under independence. More importantly, even under dependency, Liu & Xie proved (Theorem 1) that the tail behavior of Tis well-approximated by Cauchy(0,1). Formally, the theorem requires that the underlying test statistics follow bivariate normal distributions for each pair (Condition C.1), but permits arbitrary correlation matrices. Simulations show the approximation is robust well beyond this assumption. The heavy tails of the Cauchy “absorb” the effect of correlation.
Combined p-value:
For very small p-values (< 10−15), the transform tan((0.5 − p) · π) is replaced by its asymptotic equivalent 1/(p · π) for numerical stability. Computed using the reference implementation from the method's authors (ACAT::ACAT).
Liu, Y. & Xie, J. (2019). Cauchy combination test: a powerful test with analytic p-value calculation under arbitrary dependency structures. Journal of the American Statistical Association, 115(529), 393–402. doi:10.1080/01621459.2018.1554485
Harmonic Mean P-Value (HMP)
The HMP (Wilson, 2019) is a dependency-robust method that uses the harmonic mean. It was developed for combining p-values from genome-wide studies where correlation structures are complex and unknown.
Definition:
with equal weights wi = 1/L. This is the harmonic mean of the p-values, strongly influenced by small values, which is exactly the behavior we want for combining p-values.
Landau distribution calibration:
Under H0, Wilson (2019) showed that 1/HMP follows a Landau distribution(a heavy-tailed, positively-skewed stable distribution with characteristic exponent α = 1). Rather than using the raw harmonic mean directly as a p-value, we use R's harmonicmeanp::p.hmp() function, which calibrates the HMP against the Landau distribution to obtain an exact p-value. This accounts for the finite-sample behavior of the harmonic mean and provides better calibration than the asymptotic approximation, especially for moderate p-values.
Robustness to dependency:
Wilson's Theorem 1 shows that the HMP is an asymptotically valid p-value under arbitrary dependency when weights are normalized. The proof leverages the fact that 1/p has a Pareto(1) distribution (heavy-tailed, infinite mean), and sums of such variables converge to a stable law whose tail behavior is controlled regardless of the dependency structure, analogous to why the CCT works.
Wilson, D.J. (2019). The harmonic mean p-value for combining dependent tests. Proceedings of the National Academy of Sciences, 116(4), 1195–1200. doi:10.1073/pnas.1814092116
A Note on the Uniform(0,1) Assumption
Our p-value combiners (Fisher, CCT, HMP) assume input p-values are U[0, 1] under the null. Many of our input tables do not fully satisfy this: some are stored as DEG-only (only rows surviving an FDR threshold are reported), some have a large mass at p ≈ 1, and some are meta-analysis outputs. Where input p-values are censored on the right, combined-p magnitudes are biased downward.
A practical question is whether the censoring also moves the top-of-list gene rankingsshown in the UI. We cannot test this directly, as the rows the original authors filtered out aren't available to us. Instead we test the converse: take the input tables that are not censored, artificially impose a uniform DEG-only filter (p ≤ T) on each, and measure top-KJaccard overlap and Spearman's ρ against the unfiltered baseline. If imposing this kind of censoring on otherwise-clean tables does not move the top rankings, that is evidence the censoring already present in some of the production tables does not move them either. All three ranking methods (HMP, CCT, Fisher) are very stable at the top of the list:
| T | method | top-50 Jaccard | top-100 Jaccard | top-50 ρ | top-100 ρ |
|---|---|---|---|---|---|
| 0.05 | HMP | 0.98 | 1.00 | 1.00 | 1.00 |
| 0.05 | CCT | 0.98 | 0.97 | 1.00 | 1.00 |
| 0.05 | Fisher | 0.96 | 0.94 | 1.00 | 0.99 |
| 0.10 | HMP | 0.98 | 0.99 | 1.00 | 1.00 |
| 0.10 | CCT | 0.98 | 0.97 | 1.00 | 1.00 |
| 0.10 | Fisher | 0.98 | 0.95 | 1.00 | 0.99 |
| 0.20 | HMP | 0.98 | 1.00 | 1.00 | 1.00 |
| 0.20 | CCT | 0.98 | 0.97 | 1.00 | 1.00 |
| 0.20 | Fisher | 0.96 | 0.96 | 1.00 | 0.99 |
Top-50 Jaccard ≥ 0.96 and top-50 ρ = 1.00 across all three methods means the user-visible ranking at the top of the list is essentially unchanged whether or not the full non-significant tail is preserved. We therefore keep all input tables in the production combine, including the censored ones.
Stouffer's method, which earlier versions of this page also exposed, did notshare this stability (top-50 Jaccard 0.74–0.78 in the same audit, with ρ falling to ~0.56) — qnorm gives moderate-prows real weight, so the censored bulk drives Stouffer's rankings. Stouffer is therefore no longer offered.
Bottom line: read combined-p as a ranking primitive, not a calibrated probability. Top-of-list ordering is robust; small-magnitude differences between near-tied genes are not.
Why These Three Methods?
We compute all three combination methods because they have complementary strengths and the “true” dependency structure among our p-values is unknown:
| Method | Input | Dependency | Sensitivity |
|---|---|---|---|
| HMP | All raw p-values | Robust to arbitrary dependency | Driven by small p-values (harmonic mean) |
| CCT | All raw p-values | Robust to arbitrary dependency | Tail-driven (heavy-tail property) |
| Fisher | Collapsed (per-table) | Requires independence | Driven by strongest single signal |
HMPis the default. It's easy to explain (a harmonic mean of p-values, with a Landau-distribution calibration step), uses all raw p-values directly, and is empirically the most rank-stable of the three methods on this dataset.
CCT uses the same raw-p-value input set as HMP but combines via the heavy-tailed Cauchy transform. It is also robust to arbitrary dependency between studies. In practice CCT and HMP tend to produce nearly identical rankings at the top of the list.
Fisher is canonical and well-understood, and is the only method here that requires independence between inputs — we approximate that with the per-table pre-collapse step. Fisher is more sensitive than CCT/HMP to a single very small p-value from one table, which can be useful for surfacing genes whose evidence is concentrated in one dataset.
In practice, all three methods tend to produce similar gene rankings, especially at the top. When they diverge, examining which method ranks a gene differently can provide insight: for instance, a gene significant under Fisher but not HMP may be driven by a single very small p-value from one table.
Johannes Birgmeier, using Claude Opus 4.6, March 3rd, 2026
