What is a p-value?

The p-value is the probability of obtaining a test statistic as extreme as (or more extreme than) the observed value, assuming the null hypothesis is true. It does NOT measure the probability that the null hypothesis is true; it measures only the probability of data at least this extreme given the null.

When is a result statistically significant?

A result is typically considered statistically significant when the p-value is less than the significance level (commonly 0.05). However, statistical significance is increasingly criticized as arbitrary. Always report effect size and confidence interval alongside p-value for proper interpretation.

What does a p-value of 0.03 mean?

P = 0.03 means: IF the null hypothesis is true, there is a 3% probability of seeing data this extreme by chance alone. It does NOT mean 97% probability the alternative is true, or that the effect is large. It just means the data is somewhat inconsistent with no effect.

Should I use one-tailed or two-tailed test?

Use two-tailed in most cases — it tests for any difference (positive or negative). Use one-tailed only when you have strong prior reason to expect direction. Two-tailed is more conservative (requires more extreme data); one-tailed has higher power but tests only one direction.

How is p-value calculated from a z-score?

Two-tailed: p = 2 × (1 - Φ(|z|)), where Φ is the standard normal CDF. Right-tailed: p = 1 - Φ(z). Left-tailed: p = Φ(z). In Excel: =2*(1-NORM.S.DIST(ABS(z), TRUE)) for two-tailed. For a t-statistic, use the t-distribution functions with the appropriate degrees of freedom instead, for example =T.DIST.2T(ABS(t), df) for two-tailed.
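
The three formulas above can be sketched in Python using only the standard library. This is a minimal illustration, not the calculator's actual implementation; the function name `p_value_from_z` and the tail labels `"two"`, `"right"`, and `"left"` are assumptions made for this example.

```python
from statistics import NormalDist


def p_value_from_z(z: float, tail: str = "two") -> float:
    """Return the p-value for a z-score under the standard normal distribution.

    tail is "two", "right", or "left" (labels chosen for this sketch).
    """
    phi = NormalDist().cdf  # standard normal CDF, written as Φ in the text
    if tail == "two":
        return 2 * (1 - phi(abs(z)))
    if tail == "right":
        return 1 - phi(z)
    if tail == "left":
        return phi(z)
    raise ValueError(f"unknown tail type: {tail!r}")


# Two-tailed p-value for z = 1.96, rounded to four decimals.
print(round(p_value_from_z(1.96, "two"), 4))
```

For t-statistics the same structure applies, but the CDF must come from a t-distribution with the right degrees of freedom, which the Python standard library does not provide.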

Is p < 0.05 always meaningful?

No. P-value alone doesn't indicate practical importance. Small effects with large samples can show p < 0.05 but be trivial. Always report effect size and confidence interval alongside p-value. The American Statistical Association statement (2016) emphasizes this.

What is multiple testing correction?

When running multiple statistical tests, the chance of at least one false positive increases. To control this: Bonferroni (α/k for k tests; conservative), Benjamini-Hochberg (controls false discovery rate; less conservative), or permutation tests. Required for genomics (thousands of tests) and other multi-test scenarios.

P-Value Calculator

Enter a test statistic (z-score or t-score) and select one-tailed or two-tailed to calculate the p-value. Useful for determining statistical significance in hypothesis testing.

The p-value is the most-cited but most-misunderstood number in modern statistics. It's the probability of observing a test statistic at least as extreme as the one calculated, **assuming the null hypothesis is true**. A small p-value means the observed data would be unusual if there's no effect; a large p-value means the data is consistent with no effect. By convention, p < 0.05 is the threshold for "statistical significance" — though this threshold is increasingly criticized as arbitrary and overused.

This calculator returns the p-value from a z-score (or t-statistic) for one-tailed or two-tailed tests. Use it to assess significance after running a hypothesis test, or to understand the relationship between test statistic and p-value. For a two-tailed test, the smaller |z| is, the larger the p-value; the larger |z| is, the smaller the p-value.

Crucially: a p-value tells you whether observed data is consistent with the null hypothesis, NOT whether the null hypothesis is true or how large any effect is. Statistical significance ≠ practical significance. A study with millions of participants can produce p < 0.001 for a tiny, meaningless effect, while a study with 30 participants might miss a clinically important effect with p > 0.05.
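
To make the sample-size point concrete, here is a hedged sketch for the simplest case, a one-sample z-test with known standard deviation, where z equals the effect (in standard deviation units) times √n. The function name `two_tailed_p` and the effect sizes and sample sizes below are hypothetical values chosen for illustration, not data from any study.

```python
from math import sqrt
from statistics import NormalDist


def two_tailed_p(effect_in_sd: float, n: int) -> float:
    """Two-tailed p-value for a one-sample z-test with known SD.

    Assumes z = effect_in_sd * sqrt(n), i.e. the sample mean differs from
    the null value by `effect_in_sd` standard deviations.
    """
    z = effect_in_sd * sqrt(n)
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Hypothetical: a negligible effect measured on a million participants.
tiny_effect_huge_sample = two_tailed_p(0.005, 1_000_000)  # z = 5

# Hypothetical: a much larger effect measured on 30 participants.
larger_effect_small_sample = two_tailed_p(0.3, 30)  # z is about 1.64

print(f"{tiny_effect_huge_sample:.2e}", f"{larger_effect_small_sample:.3f}")
```

Under these assumptions the negligible effect comes out far below 0.001 while the sixty-times-larger effect stays above 0.05, which is why effect size has to be reported separately.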

Inputs

Test Statistic (z or t)

Tail Type

Significance Level (alpha)

Results

P-Value

0.049996

Significant?

Yes

Interpretation

Statistically significant (p < 0.05). Reject the null hypothesis.

Confidence Level

95%

Last updated: May 29, 2026

Formula

**Two-tailed p-value (from z-score):**

p = 2 × P(Z > |z|) = 2 × (1 - Φ(|z|))

where Φ is the cumulative standard normal distribution function.

**One-tailed p-values:**

- **Right-tailed (positive z)**: p = P(Z > z) = 1 - Φ(z)
- **Left-tailed (negative z)**: p = P(Z < z) = Φ(z)

**Worked example: z = 1.96, two-tailed**

p = 2 × P(Z > 1.96) = 2 × (1 - 0.975) = 2 × 0.025 = **0.05**

This matches the conventional significance threshold to three decimal places (the unrounded value is 0.049996).

**Common z and p relationships:**

| Z-score | Two-tailed p | One-tailed p |
|---|---|---|
| 0.0 | 1.000 | 0.500 |
| 1.0 | 0.317 | 0.159 |
| 1.5 | 0.134 | 0.067 |
| 1.645 | 0.100 | 0.050 (α = 0.05, one-tail) |
| 1.96 | 0.050 | 0.025 (α = 0.05, two-tail) |
| 2.0 | 0.046 | 0.023 |
| 2.576 | 0.010 | 0.005 (α = 0.01, two-tail) |
| 3.0 | 0.003 | 0.001 |
| 3.291 | 0.001 | 0.0005 (α = 0.001, two-tail) |

**Hypothesis testing framework:**

1. **State null hypothesis (H₀)**: usually "no effect" or "no difference."
2. **State alternative hypothesis (H₁)**: what you're testing for.
3. **Choose significance level (α)**: commonly 0.05.
4. **Calculate test statistic** (z, t, F, χ², etc.).
5. **Compute p-value** from test statistic.
6. **Compare to α**: if p < α, reject H₀; if p ≥ α, fail to reject H₀.
7. **Interpret in context**: significance, effect size, practical meaning.

**Critical z-values for two-tailed α:**

| Confidence level | α | Critical z |
|---|---|---|
| 90% | 0.10 | ±1.645 |
| 95% | 0.05 | ±1.960 |
| 99% | 0.01 | ±2.576 |
| 99.9% | 0.001 | ±3.291 |

**One-tailed vs two-tailed:**

- **Two-tailed**: testing if effect is different (in either direction).
- **One-tailed**: testing if effect is in a specific direction.
- Two-tailed is more conservative (requires more extreme data).
- One-tailed has more power but tests only one direction.

When in doubt, use two-tailed.

**Common misconceptions about p-value:**

- ❌ "P = 0.04 means 4% chance the null is true." ✓ P = 0.04 means: assuming the null is true, there's a 4% chance of seeing data at least as extreme as observed.
- ❌ "P > 0.05 means no effect exists." ✓ P > 0.05 means the data isn't strong enough to reject the null hypothesis. An effect may exist but go undetected.
- ❌ "P < 0.05 means a real, important effect." ✓ Could be a trivial effect with a large sample, a true effect, or a chance occurrence.
- ❌ "Lower p means larger effect." ✓ Effect size and p-value are separate. p depends on both effect size and sample size.

**Effect size (separate from p-value):**

Always report effect size alongside p-value:

- **Cohen's d** (continuous): small=0.2, medium=0.5, large=0.8.
- **Pearson's r** (correlation): small=0.1, medium=0.3, large=0.5.
- **Odds ratio**: 1=no effect, 2=double the odds.

**Significance thresholds:**

- **0.05**: traditional, very common, increasingly criticized.
- **0.01**: more conservative; medical and high-stakes research.
- **0.001**: very conservative; replication thresholds.
- **About 0.0000003** (3 × 10⁻⁷, one-tailed): particle-physics discovery standard (5-sigma).
- **No fixed threshold**: emerging best practice is to report p with context.

**P-value criticism:**

The American Statistical Association statement (2016) on p-values:

1. P-values don't measure the probability of the null hypothesis being true.
2. Significance doesn't imply causation or practical importance.
3. Don't draw conclusions from p < threshold alone.
4. Always report effect sizes and confidence intervals.

**Pre-registration and HARKing:**

- **Pre-registration**: declare hypotheses before collecting data; prevents post-hoc p-hacking.
- **HARKing** (Hypothesizing After Results are Known): bad practice; inflates Type I errors.
- **P-hacking**: trying many tests until one is significant; inflates the actual Type I error rate above the nominal alpha.

**Family-wise error rate:**

When testing multiple hypotheses, individual p-values aren't sufficient. The family-wise error rate is the probability of at least one Type I error across all the tests. To control error rates, use:

- **Bonferroni correction**: compare each p-value to α / k for k tests.
- **Benjamini-Hochberg**: control the false discovery rate.
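
Going the other direction, from a significance level to a critical z-value, uses the inverse of the standard normal CDF. The sketch below is one way to reproduce the critical values listed in this section with Python's standard library; the function name `critical_z` is an assumption for this example.

```python
from statistics import NormalDist


def critical_z(alpha: float, two_tailed: bool = True) -> float:
    """Return the positive critical z-value for a given significance level.

    A two-tailed test splits alpha evenly between the two tails; a
    one-tailed test puts all of alpha in a single tail.
    """
    tail_area = alpha / 2 if two_tailed else alpha
    return NormalDist().inv_cdf(1 - tail_area)


for alpha in (0.10, 0.05, 0.01, 0.001):
    print(f"alpha = {alpha}: critical z = ±{critical_z(alpha):.3f}")
```

Calling `critical_z(0.05, two_tailed=False)` gives the one-tailed cutoff of roughly 1.645, which is why a one-tailed test needs less extreme data to reach the same α.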

How to use this calculator

Enter your test statistic (z or t-score).
Select tail type: two-tailed (most common), left or right.
Set significance level (α; usually 0.05).
Calculator returns p-value.
Compare to α: if p < α, reject null hypothesis.
Always report effect size alongside p-value.

Worked examples

New drug effectiveness test

**Scenario:** New drug tested vs placebo. Z = 2.3 for difference in cure rates. Two-tailed test, α = 0.05.

**Calculation:** Two-tailed p = 2 × (1 - 0.9893) = 2 × 0.0107 = 0.021. Compare to α = 0.05. Since p < α: reject null hypothesis.

**Result:** Drug shows statistically significant effect (p = 0.021 < 0.05). But also report: effect size (cure rate difference), confidence interval, and sample size for full interpretation.

Quality control measurement

**Scenario:** Sample of 50 parts has mean weight 0.5 grams below specification. Z = -2.1. Testing if production is off-target (two-tailed). α = 0.05.

**Calculation:** |z| = 2.1. Two-tailed p = 2 × (1 - 0.982) = 2 × 0.018 = 0.036.

**Result:** P = 0.036 < α = 0.05. Reject the null hypothesis that production is on target. Conclude: the mean weight is statistically significantly below specification. Action: adjust process.

Survey result interpretation

**Scenario:** Pre-election poll: candidate A leads B by 5 points. Z = 1.65 for one-tailed test of A leading. α = 0.05.

**Calculation:** One-tailed p = 1 - 0.951 = 0.049. Just below significance threshold.

**Result:** P = 0.049 just barely meets significance. Lead is statistically significant at 5% level. But: just-barely-significant is unstable. Election outcome uncertain. Report confidence interval and margin of error.

When to use this calculator

**Use p-values for:**

- **Hypothesis testing**: standard statistical inference.
- **Clinical trials**: efficacy testing of treatments.
- **A/B testing**: comparing two versions.
- **Quality control**: process monitoring.
- **Research papers**: standard reporting practice.
- **Scientific evidence**: framework for evaluating findings.

**P-value interpretation:**

| P-value | Convention |
|---|---|
| > 0.10 | Insufficient evidence to reject null |
| 0.05-0.10 | Marginal evidence; often called "trend" |
| 0.01-0.05 | Statistically significant |
| 0.001-0.01 | Highly significant |
| < 0.001 | Very strong evidence |

**Effect size vs p-value:**

A p-value answers "is there an effect?" but not "how big?" Effect size measures magnitude:

- **Small p, small effect**: real but trivial.
- **Small p, large effect**: meaningful and unlikely to be due to chance.
- **Large p, small effect**: possibly real but not detected.
- **Large p, large effect**: insufficient evidence; may need more data.

**Statistical power:**

Power = 1 - β = probability of detecting a true effect.

- High power → small p when effect exists.
- Low power → likely to miss real effects.
- Increase power: larger sample, larger true effect, more sensitive test.
- Target power: 80% conventional.
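
As a rough illustration of how sample size and effect size drive power, the sketch below computes the power of a two-tailed, one-sample z-test with known standard deviation. That test design is an assumption made to keep the formula simple; the function name `z_test_power` and the example inputs are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def z_test_power(effect_in_sd: float, n: int, alpha: float = 0.05) -> float:
    """Power of a two-tailed one-sample z-test with known SD.

    Under the alternative, the z statistic is normal with mean
    effect_in_sd * sqrt(n) and unit variance; power is the probability
    that it falls beyond either critical value.
    """
    norm = NormalDist()
    z_crit = norm.inv_cdf(1 - alpha / 2)
    shift = effect_in_sd * sqrt(n)
    upper_tail = 1 - norm.cdf(z_crit - shift)
    lower_tail = norm.cdf(-z_crit - shift)
    return upper_tail + lower_tail


# Hypothetical design: a medium effect (0.5 SD) at two sample sizes.
for n in (16, 32, 64):
    print(n, round(z_test_power(0.5, n), 3))
```

With no true effect the function returns α itself, since the only rejections are Type I errors. Other test types (t-tests, proportions) need their own power formulas.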

**Type I and Type II errors:**

- **Type I (α)**: rejecting null when it's true. Controlled by significance level.
- **Type II (β)**: failing to reject null when it's false. Controlled by power.
- **Trade-off**: reducing α increases β; balance based on consequences.

**Multiple testing:**

When running multiple tests, family-wise alpha inflates:

- Single test at α = 0.05: 5% Type I error rate.
- 10 tests at α = 0.05: ~40% chance of at least one false positive.
- 100 tests: ~99% chance of at least one false positive.
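
These percentages follow from 1 - (1 - α)^k when the k tests are independent and every null hypothesis is true. Independence is an assumption of this sketch; correlated tests inflate the rate more slowly. The function name `family_wise_error_rate` is chosen for illustration.

```python
def family_wise_error_rate(alpha: float, k: int) -> float:
    """Probability of at least one false positive across k independent tests.

    Assumes every null hypothesis is true and the tests are independent,
    so each test avoids a false positive with probability (1 - alpha).
    """
    return 1 - (1 - alpha) ** k


for k in (1, 10, 100):
    print(f"{k} tests: {family_wise_error_rate(0.05, k):.1%}")
```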

Solutions:

- **Bonferroni**: α / k (very conservative).
- **Benjamini-Hochberg**: controls false discovery rate.
- **Permutation tests**: robust alternative.
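
The first two corrections can be sketched in a few lines of plain Python. This is an illustrative implementation under stated assumptions, not a substitute for a vetted statistics library: the function names are made up for this example, the p-values are hypothetical, and the Benjamini-Hochberg version is the standard step-up procedure, which assumes independent or positively dependent tests.

```python
def bonferroni_reject(p_values: list[float], alpha: float = 0.05) -> list[bool]:
    """Reject each hypothesis whose p-value is at most alpha / k."""
    k = len(p_values)
    return [p <= alpha / k for p in p_values]


def benjamini_hochberg_reject(p_values: list[float], q: float = 0.05) -> list[bool]:
    """Benjamini-Hochberg step-up procedure at false discovery rate q."""
    k = len(p_values)
    # Indices of the p-values, sorted from smallest to largest.
    order = sorted(range(k), key=lambda i: p_values[i])
    # Find the largest rank r (1-based) with p_(r) <= (r / k) * q.
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / k * q:
            max_rank = rank
    # Reject every hypothesis ranked at or below that rank.
    reject = [False] * k
    for idx in order[:max_rank]:
        reject[idx] = True
    return reject


# Hypothetical p-values from eight tests.
p_vals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
print(bonferroni_reject(p_vals))
print(benjamini_hochberg_reject(p_vals))
```

On these hypothetical values Bonferroni rejects only the smallest p-value, while Benjamini-Hochberg rejects the two smallest, which illustrates why it is described as less conservative.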

**Bayesian alternatives:**

Bayesian methods report:

- **Posterior probability**: P(hypothesis | data).
- **Bayes factor**: ratio of evidence for/against hypotheses.
- **Credible intervals**: probability the parameter is in a range.

These avoid some misinterpretations of frequentist p-values.

**Best practices in reporting:**

1. **Always report p-value** with exact value (not just "p < 0.05").
2. **Include effect size** (Cohen's d, Pearson r, odds ratio).
3. **Include confidence interval** for the effect.
4. **State sample size**.
5. **Note one-tailed vs two-tailed**.
6. **Don't claim significance** without effect size context.

**Avoiding p-hacking:**

- **Pre-register hypotheses**.
- **Pre-specify analysis methods**.
- **Don't selectively report tests** that worked.
- **Avoid "garden of forking paths"** (changing methods based on results).
- **Reproduce in replication study**.

**Modern statistical practice:**

- Move away from binary significance tests.
- Emphasize effect sizes and confidence intervals.
- Use cross-validation and replication.
- Bayesian approaches gaining acceptance.
- Pre-registration becoming standard.

**Common pitfalls:**

- Treating non-significant as "no effect."
- Treating significant as "important effect."
- Comparing p-values directly (different sample sizes change interpretation).
- Multiple testing without correction.
- Stopping data collection at significance.
- HARKing (hypothesizing after results).

Common mistakes to avoid

Treating p-value as probability null hypothesis is true. It's conditional probability of seeing such data IF null is true.
Treating p = 0.05 as universally meaningful. Threshold is arbitrary; effect size matters more.
Confusing statistical significance with practical importance.
Using one-tailed when two-tailed is appropriate. Conservative default is two-tailed.
Multiple testing without correction. Increases Type I error rate.
Stopping data collection as soon as p drops below 0.05. This is a form of p-hacking; set the sample size in advance.
Comparing p-values across studies with different sample sizes. Direct comparison is misleading.
