Combined P-Value Methods

Statistical methods for aggregating evidence across datasets in SSPsyGene

Overview & Pipeline

Each gene in the SSPsyGene database can appear in multiple datasets and, within a single dataset, can have multiple associated p-values (e.g., one per perturbation experiment that affected it). Our goal is to combine these p-values into a single summary statistic per gene that captures total evidence across all experiments. This is exactly what p-value combination methods like Fisher's, CCT, and HMP are designed for. They take multiple p-values and produce a single aggregate p-value.

What is combined? The cross-study ranking on the Most Significant Genes page combines p-values only from differential-expression (DEG) studies: RNA-seq differential expression (assay expression) and perturbation-based differential expression such as Perturb-seq, ECCITE-seq, and bulk RNA-seq after CRISPRi/a (assay perturbation_deg). Assays that report a different kind of statistic — regulatory-network edge weights, cell-type proportion shifts, behavioral readouts, or curated database memberships — are excluded from the combination.

Why restrict to DEG studies?P-values are unitless — under the null hypothesis every p-value is uniformly distributed on [0, 1], so they are mathematically combinable regardless of origin. But a DEG p-value and, say, a regulatory-network edge p-value are testing different null hypotheses, so combining them doesn't have a clean interpretation, and because every combined value depends on every input, adding datasets of a new assay type would reshuffle the whole ranking. Restricting the combination to differential-expression assays — where each p-value answers the same question (“is this gene's expression changed?”) — keeps the ranking interpretable and stable as the database grows. The set of DEG assays is configured in globals.yaml under metaAnalysisAssays. Reassuringly, the top-ranked genes are well-known players in brain development, many of which already cause Mendelian neurodevelopmental conditions, giving us confidence that the method is working as intended.

The pipeline for each gene proceeds as follows:

Collect raw p-values from every differential-expression dataset table that declares a pvalue_column (i.e. whose assay is one of the configured metaAnalysisAssays). A single gene may contribute multiple p-values per table and across many tables.
Note: this step uses the nominal (unadjusted) p-value column declared by each dataset, not the per-dataset FDR / padj column. That is why combined values can be far smaller than any single study's adjusted p-value.
Pre-collapse(for Fisher only): reduce each table's p-values for that gene down to a single per-table p-value using min(p) × n, capped at 1.0.
Combine using three methods: Fisher operates on the collapsed per-table p-values; CCT and HMP operate directly on all raw p-values.

Fisher requires at least 2 collapsed table p-values (both < 1.0) to produce a result. CCT and HMP can operate on any number of p-values ≥ 1. All statistical computations are performed in R using reference implementations.

Pre-Collapse: Bonferroni Within-Table Correction

Problem: A gene may appear in multiple rows of the same data table. For instance, in a perturbation screen, gene G might be a differentially expressed target in experiments where 5 different risk genes were knocked down. That gives us 5 p-values for G from a single table. These p-values are not independent; they all come from the same assay measuring the same gene, and we should not feed them individually into Fisher as though they were independent studies.

Solution: For each gene-table combination, we compute a single representative p-value:

p_table = min( min(p₁, …, p_n) × n, 1.0 )

where n is the number of rows for that gene in that table. This is the Bonferroni correction applied to the minimum: we take the best p-value but penalize it by the number of looks. This is conservative but guarantees we do not inflate significance from within-table multiplicity.

Who uses it: Fisher's method. The CCT and HMP, being robust to correlation, operate on the full set of raw p-values directly.

Fisher's Method

Fisher's method (1932) is the oldest and most widely used p-value combination technique. Under the null hypothesis, each p-value is Uniform(0, 1), so −2 ln(p) is distributed as χ²(2). Sums of independent chi-squared variables are themselves chi-squared.

Test statistic:

X² = −2 ∑_i=1^k ln(p_i)

where k is the number of tables (after pre-collapse).

Null distribution:

X² ~ χ²(2k)

The combined p-value is P(χ²(2k) ≥ X²).

Why it works:

If p ~ Uniform(0,1), then −ln(p) ~ Exponential(1).
An Exponential(1) variable equals Gamma(1,1), and 2×Exponential(1) = χ²(2).
Therefore −2 ln(p) ~ χ²(2) for each p.
Sums of independent χ² variables: χ²(d₁) + χ²(d₂) = χ²(d₁+d₂).
Hence X² ~ χ²(2k).

Independence assumption: Step 4 requires the p-values to be independent. When p-values are positively correlated, Fisher's method tends to be anti-conservative. This is why we use the pre-collapse step to reduce inputs to one per table.

Computed using poolr::fisher().

Fisher, R.A. (1932). Statistical Methods for Research Workers, 4th ed.

Cauchy Combination Test (CCT)

The CCT (Liu & Xie, 2019) was designed for settings where input p-values may be correlated. It exploits a special property of the Cauchy distribution.

Test statistic:

T= ∑_i=1^L w_i · tan((0.5 − p_i) · π)

where L is the total number of raw p-values and w_i = 1/L (equal weights summing to 1).

The key transform: Uniform to Cauchy:

p ~ Uniform(0,1).
0.5 − p~ Uniform(−0.5, 0.5).
(0.5 − p) · π ~ Uniform(−π/2, π/2).
tan((0.5 − p) · π) ~ Cauchy(0, 1).

Step 4 is a classical result: if U~ Uniform(−π/2, π/2), then tan(U) follows a standard Cauchy distribution.

Why the Cauchy distribution is special: Any weighted sum of independent Cauchy random variables is again Cauchy. With our weights summing to 1, T ~ Cauchy(0,1) under independence. More importantly, even under dependency, Liu & Xie proved (Theorem 1) that the tail behavior of T is well-approximated by Cauchy(0,1). Formally, the theorem requires that the underlying test statistics follow bivariate normal distributions for each pair (Condition C.1), but permits arbitrary correlation matrices. Simulations show the approximation is robust well beyond this assumption. The heavy tails of the Cauchy “absorb” the effect of correlation.

Combined p-value:

p_combined = P(Cauchy(0,1) > T) = 12 − arctan(T)π

For very small p-values (< 10⁻¹⁵), the transform tan((0.5 − p) · π) is replaced by its asymptotic equivalent 1/(p · π) for numerical stability. Computed using the reference implementation from the method's authors (ACAT::ACAT).

Liu, Y. & Xie, J. (2019). Cauchy combination test: a powerful test with analytic p-value calculation under arbitrary dependency structures. Journal of the American Statistical Association, 115(529), 393–402. doi:10.1080/01621459.2018.1554485

Harmonic Mean P-Value (HMP)

The HMP (Wilson, 2019) is a dependency-robust method that uses the harmonic mean. It was developed for combining p-values from genome-wide studies where correlation structures are complex and unknown.

Definition:

HMP = ∑_i w_i∑_i w_i / p_i = L∑_i 1/p_i

with equal weights w_i = 1/L. This is the harmonic mean of the p-values, strongly influenced by small values, which is exactly the behavior we want for combining p-values.

Landau distribution calibration:

Under H₀, Wilson (2019) showed that 1/HMP follows a Landau distribution(a heavy-tailed, positively-skewed stable distribution with characteristic exponent α = 1). Rather than using the raw harmonic mean directly as a p-value, we use R's harmonicmeanp::p.hmp() function, which calibrates the HMP against the Landau distribution to obtain an exact p-value. This accounts for the finite-sample behavior of the harmonic mean and provides better calibration than the asymptotic approximation, especially for moderate p-values.

Robustness to dependency:

Wilson's Theorem 1 shows that the HMP is an asymptotically valid p-value under arbitrary dependency when weights are normalized. The proof leverages the fact that 1/p has a Pareto(1) distribution (heavy-tailed, infinite mean), and sums of such variables converge to a stable law whose tail behavior is controlled regardless of the dependency structure, analogous to why the CCT works.

Wilson, D.J. (2019). The harmonic mean p-value for combining dependent tests. Proceedings of the National Academy of Sciences, 116(4), 1195–1200. doi:10.1073/pnas.1814092116

A Note on the Uniform(0,1) Assumption

Our p-value combiners (Fisher, CCT, HMP) assume input p-values are U[0, 1] under the null. Many of our input tables do not fully satisfy this: some are stored as DEG-only (only rows surviving an FDR threshold are reported), some have a large mass at p ≈ 1, and some are meta-analysis outputs. Where input p-values are censored on the right, combined-p magnitudes are biased downward.

A practical question is whether the censoring also moves the top-of-list gene rankings shown in the UI. We cannot test this directly, as the rows the original authors filtered out aren't available to us. Instead we test the converse: take the input tables that are not censored, artificially impose a uniform DEG-only filter (p ≤ T) on each, and measure top-K Jaccard overlap and Spearman's ρ against the unfiltered baseline. If imposing this kind of censoring on otherwise-clean tables does not move the top rankings, that is evidence the censoring already present in some of the production tables does not move them either. All three ranking methods (HMP, CCT, Fisher) are very stable at the top of the list:

T	method	top-50 Jaccard	top-100 Jaccard	top-50 ρ	top-100 ρ
0.05	HMP	0.98	1.00	1.00	1.00
0.05	CCT	0.98	0.97	1.00	1.00
0.05	Fisher	0.96	0.94	1.00	0.99
0.10	HMP	0.98	0.99	1.00	1.00
0.10	CCT	0.98	0.97	1.00	1.00
0.10	Fisher	0.98	0.95	1.00	0.99
0.20	HMP	0.98	1.00	1.00	1.00
0.20	CCT	0.98	0.97	1.00	1.00
0.20	Fisher	0.96	0.96	1.00	0.99

Top-50 Jaccard ≥ 0.96 and top-50 ρ = 1.00 across all three methods means the user-visible ranking at the top of the list is essentially unchanged whether or not the full non-significant tail is preserved. We therefore keep all input tables in the production combine, including the censored ones.

Stouffer's method, which earlier versions of this page also exposed, did not share this stability (top-50 Jaccard 0.74–0.78 in the same audit, with ρ falling to ~0.56) — qnorm gives moderate-p rows real weight, so the censored bulk drives Stouffer's rankings. Stouffer is therefore no longer offered.

Bottom line: read combined-p as a ranking primitive, not a calibrated probability. Top-of-list ordering is robust; small-magnitude differences between near-tied genes are not.

Why These Three Methods?

We compute all three combination methods because they have complementary strengths and the “true” dependency structure among our p-values is unknown:

Method	Input	Dependency	Sensitivity
HMP	All raw p-values	Robust to arbitrary dependency	Driven by small p-values (harmonic mean)
CCT	All raw p-values	Robust to arbitrary dependency	Tail-driven (heavy-tail property)
Fisher	Collapsed (per-table)	Requires independence	Driven by strongest single signal

HMP is the default. It's easy to explain (a harmonic mean of p-values, with a Landau-distribution calibration step), uses all raw p-values directly, and is empirically the most rank-stable of the three methods on this dataset.

CCT uses the same raw-p-value input set as HMP but combines via the heavy-tailed Cauchy transform. It is also robust to arbitrary dependency between studies. In practice CCT and HMP tend to produce nearly identical rankings at the top of the list.

Fisher is canonical and well-understood, and is the only method here that requires independence between inputs — we approximate that with the per-table pre-collapse step. Fisher is more sensitive than CCT/HMP to a single very small p-value from one table, which can be useful for surfacing genes whose evidence is concentrated in one dataset.

In practice, all three methods tend to produce similar gene rankings, especially at the top. When they diverge, examining which method ranks a gene differently can provide insight: for instance, a gene significant under Fisher but not HMP may be driven by a single very small p-value from one table.

Volcano Plot Sampling

The effect-distribution (volcano) chart on each dataset table plots effect size against −log₁₀(p). Tables can hold many thousands of rows, so the grey background is a sample rather than every row, drawn from rows with a non-null effect size and p-value.

The sample is a union of two sets:

Top by p-value: the most-significant rows (smallest p) are alwaysincluded, so the volcano's tails are visible regardless of the random draw.
Deterministic pseudo-random sample:the remaining background is filled by ordering rows on Knuth's multiplicative hash of the row id and taking the first ~1,000. This is reproducible per database build and breaks any correlation between insertion order and effect/p-value — it is not a uniform random draw redrawn on each request.

By default, rows whose perturbed-gene link points at a control sentinel (e.g. NonTarget1, SafeTarget, GFP) are excluded so the distribution is not biased toward “no biological perturbation.” When a specific perturbed gene is selected, the background is further restricted to that perturbation's rows. The queried gene's own rows are always drawn at full opacity on top of the sample.

Johannes Birgmeier, using Claude Opus 4.6, March 3rd, 2026