
Combined P-Value Methods

Statistical methods for aggregating evidence across datasets in SSPsyGene

Overview & Pipeline

Each gene in the SSPsyGene database can appear in multiple datasets and, within a single dataset, can have multiple associated p-values (e.g., one per perturbation experiment that affected it). Our goal is to combine these p-values into a single summary statistic per gene that captures total evidence across all experiments. This is exactly what p-value combination methods like Fisher's, Stouffer's, CCT, and HMP are designed for. They take multiple p-values and produce a single aggregate p-value.

Are p-values from different assays comparable? Yes. P-values have no “unit”. Regardless of the assay (RNA-seq, behavioral testing, GWAS, etc.), a p-value is always a number between 0 and 1 with the same statistical meaning. Under the null hypothesis, every p-value is uniformly distributed on [0, 1], no matter where it came from. This makes them inherently comparable and valid to combine across assay types. Because all of our datasets fall within the broad domain of neuropsychiatric research, the combined rankings answer the question: which genes show consistent evidence of association across a range of neuropsychiatric diseases and assay types? Reassuringly, the top-ranked genes are well-known players in brain development, many of which already cause known Mendelian neurodevelopmental disorders, giving us confidence that the method is working as intended.

The pipeline for each gene proceeds as follows:

  1. Collect raw p-values from every dataset table that declares a pvalue_column. A single gene may contribute multiple p-values per table and across many tables.
  2. Pre-collapse (for Fisher/Stouffer only): reduce each table's p-values for that gene down to a single per-table p-value using min(p) × n, capped at 1.0.
  3. Combine using four methods: Fisher and Stouffer operate on the collapsed per-table p-values; CCT and HMP operate directly on all raw p-values.

Fisher and Stouffer require at least two collapsed table p-values (both < 1.0) to produce a result. CCT and HMP can operate on any number of p-values, including a single one. All statistical computations are performed in R using reference implementations.
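As an illustrative sketch (in Python; the actual pipeline runs in R, and the data layout and function names here are assumptions, not the pipeline's code), the routing of inputs to the two method families might look like:

```python
def route_inputs(tables):
    """tables: {table_name: [raw p-values for one gene]} (illustrative layout).
    Returns (fisher_stouffer_input, cct_hmp_input)."""
    # Step 2: one Bonferroni-corrected p-value per table (Fisher/Stouffer only).
    collapsed = [min(min(ps) * len(ps), 1.0) for ps in tables.values()]
    # Fisher/Stouffer need at least two collapsed p-values strictly below 1.0.
    fisher_stouffer_input = [p for p in collapsed if p < 1.0]
    if len(fisher_stouffer_input) < 2:
        fisher_stouffer_input = None  # not enough tables to combine
    # Step 3: CCT/HMP take every raw p-value directly, no collapse.
    cct_hmp_input = [p for ps in tables.values() for p in ps]
    return fisher_stouffer_input, cct_hmp_input

fs, raw = route_inputs({"screen_A": [0.004, 0.02, 0.8], "gwas_B": [0.01]})
```

Here `screen_A` collapses to 0.004 × 3 = 0.012 and `gwas_B` to 0.01, so Fisher/Stouffer see two values while CCT/HMP see all four raw ones.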

Pre-Collapse: Bonferroni Within-Table Correction

Problem: A gene may appear in multiple rows of the same data table. For instance, in a perturbation screen, gene G might be a differentially expressed target in experiments where 5 different risk genes were knocked down. That gives us 5 p-values for G from a single table. These p-values are not independent; they all come from the same assay measuring the same gene, and we should not feed them individually into Fisher or Stouffer as though they were independent studies.

Solution: For each gene-table combination, we compute a single representative p-value:

p_table = min( n · min(p₁, …, pₙ), 1.0 )

where n is the number of rows for that gene in that table. This is the Bonferroni correction applied to the minimum: we take the best p-value but penalize it by the number of looks. This is conservative but guarantees we do not inflate significance from within-table multiplicity. Pre-collapse uses arbitrary-precision arithmetic (mpmath) to avoid precision loss with very small p-values.
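A stdlib sketch of the per-table collapse; the pipeline uses mpmath, but `decimal.Decimal` illustrates the same arbitrary-precision idea (the function name and the string-input convention are assumptions):

```python
from decimal import Decimal

def collapse_table(pvals):
    """Per-table pre-collapse: Bonferroni correction applied to the minimum.
    P-values arrive as strings so extreme values keep full precision
    (the real pipeline uses mpmath; Decimal is a stdlib stand-in)."""
    n = len(pvals)
    best = min(Decimal(p) for p in pvals)
    return min(best * n, Decimal(1))  # cap at 1.0

# A subnormal-range p-value survives the multiplication by n:
print(collapse_table(["1e-320", "0.5", "0.9"]))  # 3E-320
# The cap keeps weak evidence from exceeding 1:
print(collapse_table(["0.8", "0.9"]))            # 1
```

The cap matters: with two rows at p = 0.8 and 0.9, the uncapped value 0.8 × 2 = 1.6 is not a valid p-value, so it is clamped to 1.0.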

Who uses it: Fisher's method and Stouffer's method. The CCT and HMP, being robust to correlation, operate on the full set of raw p-values directly.

Fisher's Method

Fisher's method (1932) is the oldest and most widely used p-value combination technique. Under the null hypothesis, each p-value is Uniform(0, 1), so −2 ln(p) is distributed as χ²(2). Sums of independent chi-squared variables are themselves chi-squared.

Test statistic:

X² = −2 ∑ᵢ₌₁ᵏ ln(pᵢ)

where k is the number of tables (after pre-collapse).

Null distribution:

X² ~ χ²(2k)

The combined p-value is P(χ²(2k) ≥ X²).
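Because the degrees of freedom 2k are always even, the χ² survival function has a closed form (a Poisson tail sum), so Fisher's method can be sketched with the standard library alone. This is illustrative only; the pipeline uses poolr::fisher() in R:

```python
from math import exp, factorial, log

def fisher_combine(pvals):
    """Fisher's method. For even df = 2k, the chi-square survival function is
    P(X >= x) = exp(-x/2) * sum_{j=0}^{k-1} (x/2)^j / j!, so no scipy is needed."""
    k = len(pvals)
    x = -2.0 * sum(log(p) for p in pvals)
    half = x / 2.0
    return exp(-half) * sum(half**j / factorial(j) for j in range(k))

# With a single input, the method returns the p-value unchanged:
print(fisher_combine([0.05]))
# Two moderate p-values combine to stronger evidence than either alone:
print(fisher_combine([0.05, 0.05]))
```

The second call gives roughly 0.017, the textbook result for combining two p-values of 0.05 under Fisher's method.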

Why it works:

  1. If p ~ Uniform(0,1), then −ln(p) ~ Exponential(1).
  2. An Exponential(1) variable equals Gamma(1,1), and 2×Exponential(1) = χ²(2).
  3. Therefore −2 ln(p) ~ χ²(2) for each p.
  4. Sums of independent χ² variables add their degrees of freedom: χ²(d₁) + χ²(d₂) ~ χ²(d₁ + d₂).
  5. Hence X² ~ χ²(2k).

Independence assumption: Step 4 requires the p-values to be independent. When p-values are positively correlated, Fisher's method tends to be anti-conservative. This is why we use the pre-collapse step to reduce inputs to one per table.

Computed using poolr::fisher().

Fisher, R.A. (1932). Statistical Methods for Research Workers, 4th ed.

Stouffer's Method

Stouffer's method (1949) converts each p-value to a Z-score via the inverse normal CDF, then sums and normalizes.

Test statistic:

Z = ( ∑ᵢ₌₁ᵏ Φ⁻¹(1 − pᵢ) ) / √k

Under H₀ with independent p-values, Z ~ Normal(0, 1). The combined p-value is the upper tail P(Normal(0, 1) ≥ Z).
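A stdlib sketch of the formula using `statistics.NormalDist` for Φ and Φ⁻¹ (illustrative only; the pipeline uses poolr::stouffer() in R):

```python
from math import sqrt
from statistics import NormalDist

_std = NormalDist()  # standard normal, provides cdf() and inv_cdf()

def stouffer_combine(pvals):
    """Stouffer's method: map each p-value to a Z-score, sum, and
    rescale by sqrt(k) so the sum is again standard normal under H0."""
    k = len(pvals)
    z = sum(_std.inv_cdf(1.0 - p) for p in pvals) / sqrt(k)
    return 1.0 - _std.cdf(z)  # upper-tail probability of the observed Z

# A single p-value round-trips unchanged:
print(stouffer_combine([0.2]))
# Two p-values of 0.05 combine to approximately 0.01:
print(stouffer_combine([0.05, 0.05]))
```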

Comparison with Fisher: Fisher is more sensitive to one very small p-value; Stouffer responds more evenly to moderate signals across many studies.

Computed using poolr::stouffer().

Stouffer, S.A. et al. (1949). The American Soldier, Vol. 1.

Cauchy Combination Test (CCT)

The CCT (Liu & Xie, 2019) was designed for settings where input p-values may be correlated. It exploits a special property of the Cauchy distribution.

Test statistic:

T = ∑ᵢ₌₁ᴸ wᵢ · tan((0.5 − pᵢ) · π)

where L is the total number of raw p-values and wᵢ = 1/L (equal weights summing to 1).

The key transform: Uniform to Cauchy:

  1. p ~ Uniform(0,1).
  2. 0.5 − p ~ Uniform(−0.5, 0.5).
  3. (0.5 − p) · π ~ Uniform(−π/2, π/2).
  4. tan((0.5 − p) · π) ~ Cauchy(0, 1).

Step 4 is a classical result: if U ~ Uniform(−π/2, π/2), then tan(U) follows a standard Cauchy distribution.
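The four steps can be checked numerically: the composite transform is exactly the standard Cauchy quantile function evaluated at 1 − p, so applying the Cauchy CDF to the transformed value recovers 1 − p. A small stdlib sketch:

```python
from math import atan, pi, tan

def uniform_to_cauchy(p):
    """Steps 1-4 above: map a Uniform(0,1) p-value to a Cauchy(0,1) variate."""
    return tan((0.5 - p) * pi)

def cauchy_cdf(x):
    """CDF of the standard Cauchy distribution."""
    return 0.5 + atan(x) / pi

# Round trip: the transform is the Cauchy quantile at 1 - p,
# so the Cauchy CDF maps it straight back.
for p in (0.01, 0.25, 0.5, 0.9):
    assert abs(cauchy_cdf(uniform_to_cauchy(p)) - (1 - p)) < 1e-12
print("round trip OK")
```

For instance p = 0.25 maps to tan(π/4) = 1, and the Cauchy CDF at 1 is 0.75 = 1 − 0.25.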

Why the Cauchy distribution is special: Any weighted sum of independent Cauchy random variables is again Cauchy. With our weights summing to 1, T ~ Cauchy(0,1) under independence. More importantly, even under dependency, Liu & Xie proved (Theorem 1) that the tail behavior of T is well-approximated by Cauchy(0,1). Formally, the theorem requires that the underlying test statistics follow bivariate normal distributions for each pair (Condition C.1), but permits arbitrary correlation matrices. Simulations show the approximation is robust well beyond this assumption. The heavy tails of the Cauchy “absorb” the effect of correlation.

Combined p-value:

p_combined = P(Cauchy(0,1) > T) = 1/2 − arctan(T)/π

For very small p-values (p < 10⁻¹⁵), the transform tan((0.5 − p) · π) is replaced by its asymptotic equivalent 1/(p · π) for numerical stability. Computed using the reference implementation from the method's authors (ACAT::ACAT).
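A minimal sketch of the test, with the input-side guard from above plus an output-side tail approximation for huge T, P(Cauchy > T) ≈ 1/(T·π). The function name and the output-guard threshold are assumptions; the pipeline itself calls ACAT::ACAT in R:

```python
from math import atan, pi, tan

def cct_combine(pvals, weights=None):
    """Cauchy combination test (sketch of the published formula)."""
    L = len(pvals)
    w = weights if weights is not None else [1.0 / L] * L
    # Input guard: for p < 1e-15, tan((0.5 - p)*pi) ~ 1/(p*pi) avoids the pole.
    t = sum(
        wi / (p * pi) if p < 1e-15 else wi * tan((0.5 - p) * pi)
        for wi, p in zip(w, pvals)
    )
    # Output guard: for huge T, atan(T) rounds to pi/2 and the combined
    # p-value would underflow to 0; the Cauchy tail ~ 1/(T*pi) is exact enough.
    if t > 1e15:
        return 1.0 / (t * pi)
    return 0.5 - atan(t) / pi

print(cct_combine([0.05, 0.05]))   # equal inputs return the same value
print(cct_combine([1e-300, 0.5]))  # a tiny input survives without underflow
```

Note that two identical p-values combine to that same p-value: with equal weights, the average of two equal Cauchy quantiles is the quantile itself.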

Liu, Y. & Xie, J. (2019). Cauchy combination test: a powerful test with analytic p-value calculation under arbitrary dependency structures. Journal of the American Statistical Association, 115(529), 393–402. doi:10.1080/01621459.2018.1554485

Harmonic Mean P-Value (HMP)

The HMP (Wilson, 2019) is a dependency-robust method that uses the harmonic mean. It was developed for combining p-values from genome-wide studies where correlation structures are complex and unknown.

Definition:

HMP = ( ∑ᵢ wᵢ ) / ( ∑ᵢ wᵢ / pᵢ ) = L / ( ∑ᵢ 1/pᵢ )

with equal weights wᵢ = 1/L. This is the harmonic mean of the p-values, strongly influenced by small values, which is exactly the behavior we want for combining p-values.

Landau distribution calibration:

Under H0, Wilson (2019) showed that 1/HMP follows a Landau distribution (a heavy-tailed, positively-skewed stable distribution with characteristic exponent α = 1). Rather than using the raw harmonic mean directly as a p-value, we use R's harmonicmeanp::p.hmp() function, which calibrates the HMP against the Landau distribution to obtain an exact p-value. This accounts for the finite-sample behavior of the harmonic mean and provides better calibration than the asymptotic approximation, especially for moderate p-values.
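For intuition, the raw (uncalibrated) HMP statistic is a one-liner; the helper name is illustrative, and the real pipeline obtains the Landau-calibrated p-value from harmonicmeanp::p.hmp() in R rather than using this raw value directly:

```python
def harmonic_mean_p(pvals):
    """Raw harmonic mean of p-values with equal weights w_i = 1/L.
    Sketch only: this is the HMP statistic before Landau calibration."""
    L = len(pvals)
    return L / sum(1.0 / p for p in pvals)

# Dominated by the small values: 0.9 barely moves the result.
print(harmonic_mean_p([0.01, 0.02, 0.9]))  # ~0.0199
```

Note how the harmonic mean sits near the small inputs (≈ 0.02) rather than near their arithmetic mean (≈ 0.31): reciprocals of small p-values dominate the denominator.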

Robustness to dependency:

Wilson's Theorem 1 shows that the HMP is an asymptotically valid p-value under arbitrary dependency when weights are normalized. The proof leverages the fact that 1/p has a Pareto(1) distribution (heavy-tailed, infinite mean), and sums of such variables converge to a stable law whose tail behavior is controlled regardless of the dependency structure, analogous to why the CCT works.

Wilson, D.J. (2019). The harmonic mean p-value for combining dependent tests. Proceedings of the National Academy of Sciences, 116(4), 1195–1200. doi:10.1073/pnas.1814092116

Why All Four Methods?

We compute all four combination methods because they have complementary strengths and the “true” dependency structure among our p-values is unknown:

| Method   | Input                 | Dependency                     | Sensitivity                              |
|----------|-----------------------|--------------------------------|------------------------------------------|
| Fisher   | Collapsed (per-table) | Requires independence          | Driven by strongest single signal        |
| Stouffer | Collapsed (per-table) | Requires independence          | Responds evenly to moderate signals      |
| CCT      | All raw p-values      | Robust to arbitrary dependency | Tail-driven (heavy-tail property)        |
| HMP      | All raw p-values      | Robust to arbitrary dependency | Driven by small p-values (harmonic mean) |

Fisher and Stouffer are canonical and well-understood. By using pre-collapsed per-table p-values, we approximate independence. However, subtle dependencies may still exist across datasets.

CCT and HMP are newer methods designed for unknown or complex dependency structures. They use all raw p-values and do not require pre-collapse. The trade-off is that they are asymptotically valid (accurate for small combined p-values) rather than exactly valid at all significance levels.

In practice, all four methods tend to produce similar gene rankings, especially at the top. When they diverge, examining which method ranks a gene differently can provide insight: for instance, a gene significant under Fisher but not HMP may be driven by a single very small p-value from one table.

Johannes Birgmeier, using Claude Opus 4.6, March 3rd, 2026