P-Value Calculator
Enter a test statistic (z-score or t-score) and select one-tailed or two-tailed to calculate the p-value. Useful for determining statistical significance in hypothesis testing.
The p-value is the most-cited but most-misunderstood number in modern statistics. It's the probability of observing a test statistic at least as extreme as the one calculated, **assuming the null hypothesis is true**. A small p-value means the observed data would be unusual if there's no effect; a large p-value means the data is consistent with no effect. By convention, p < 0.05 is the threshold for "statistical significance" — though this threshold is increasingly criticized as arbitrary and overused.
This calculator returns the p-value from a z-score (or t-statistic) for one-tailed or two-tailed tests. Use it to assess significance after running a hypothesis test, or to understand the relationship between a test statistic and its p-value. The smaller |z| is, the larger the p-value; the larger |z| is, the smaller the p-value.
Crucially: a p-value tells you whether observed data is consistent with the null hypothesis, NOT whether the null hypothesis is true or how large any effect is. Statistical significance ≠ practical significance. A study with millions of participants can produce p < 0.001 for a tiny, meaningless effect, while a study with 30 participants might miss a clinically important effect with p > 0.05.
Results
P-Value
0.049996
Significant?
Yes
Interpretation
Statistically significant (p < 0.05). Reject the null hypothesis.
Confidence Level
95%
Formula
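The formula itself is not reproduced on this page. Assuming the calculator treats the statistic as standard normal (a z-score), the usual definitions would be as follows, where $\Phi$ denotes the standard normal CDF:

```latex
\begin{aligned}
p_{\text{left}}  &= \Phi(z) \\
p_{\text{right}} &= 1 - \Phi(z) \\
p_{\text{two}}   &= 2\,\bigl(1 - \Phi(|z|)\bigr)
\end{aligned}
```

For a t-statistic, $\Phi$ would be replaced by the CDF of the t distribution with the appropriate degrees of freedom. As a check, z = 1.96 gives a two-tailed p of about 0.049996, which is consistent with the result displayed above.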
How to use this calculator
- Enter your test statistic (z or t-score).
- Select tail type: two-tailed (most common), left or right.
- Set significance level (α; usually 0.05).
- Calculator returns p-value.
- Compare to α: if p < α, reject null hypothesis.
- Always report effect size alongside p-value.
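The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the calculator's actual implementation: it assumes the statistic is a z-score (standard normal), and the function names and the tail labels `"two"`, `"left"`, `"right"` are hypothetical.

```python
import math


def normal_cdf(z: float) -> float:
    """Standard normal CDF, computed from the complementary error function."""
    return 0.5 * math.erfc(-z / math.sqrt(2))


def p_value(z: float, tail: str = "two") -> float:
    """P-value for a z-score; `tail` is "two", "left", or "right" (assumed labels)."""
    if tail == "two":
        # erfc(|z| / sqrt(2)) equals 2 * (1 - Phi(|z|))
        return math.erfc(abs(z) / math.sqrt(2))
    if tail == "left":
        return normal_cdf(z)
    if tail == "right":
        # Phi(-z) equals 1 - Phi(z) and is more accurate for large z
        return normal_cdf(-z)
    raise ValueError(f"unknown tail type: {tail!r}")


def is_significant(p: float, alpha: float = 0.05) -> bool:
    """Decision rule: reject the null hypothesis when p < alpha."""
    return p < alpha


p = p_value(1.96, tail="two")
print(f"p = {p:.6f}, significant at 0.05: {is_significant(p)}")
```

A t-statistic would need the t distribution's CDF and the degrees of freedom, which this sketch does not cover.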
Worked examples
New drug effectiveness test
**Scenario:** New drug tested vs placebo. Z = 2.3 for difference in cure rates. Two-tailed test, α = 0.05. **Calculation:** Two-tailed p = 2 × (1 - 0.9893) = 2 × 0.0107 = 0.021. Compare to α = 0.05. Since p < α: reject null hypothesis. **Result:** Drug shows statistically significant effect (p = 0.021 < 0.05). But also report: effect size (cure rate difference), confidence interval, and sample size for full interpretation.
Quality control measurement
**Scenario:** Sample of 50 parts has mean weight 0.5 grams below specification. Z = -2.1. Testing if production is off-target (two-tailed). α = 0.05. **Calculation:** |z| = 2.1. Two-tailed p = 2 × (1 - 0.982) = 2 × 0.018 = 0.036. **Result:** P = 0.036 < α = 0.05. Reject the null hypothesis that production is on target. Conclude: mean weight differs significantly from specification, and the sample mean is below it. Action: adjust process.
Survey result interpretation
**Scenario:** Pre-election poll: candidate A leads B by 5 points. Z = 1.65 for one-tailed test of A leading. α = 0.05. **Calculation:** One-tailed p = 1 - 0.9505 = 0.0495. Just below significance threshold. **Result:** P = 0.0495 just barely meets significance. Lead is statistically significant at 5% level. But: a just-barely-significant result is unstable, since a slightly different sample could land on the other side of the threshold. Election outcome remains uncertain. Report confidence interval and margin of error.
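The three worked examples can be checked with Python's standard library. This is a small sketch that assumes each statistic is a z-score; the helper names are illustrative only.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1


def two_tailed(z: float) -> float:
    """Two-tailed p-value: 2 * (1 - Phi(|z|))."""
    return 2 * (1 - std_normal.cdf(abs(z)))


def right_tailed(z: float) -> float:
    """Right-tailed p-value: 1 - Phi(z)."""
    return 1 - std_normal.cdf(z)


examples = {
    "drug trial (z = 2.3, two-tailed)": two_tailed(2.3),
    "quality control (z = -2.1, two-tailed)": two_tailed(-2.1),
    "poll (z = 1.65, right-tailed)": right_tailed(1.65),
}

for label, p in examples.items():
    verdict = "reject null" if p < 0.05 else "fail to reject null"
    print(f"{label}: p = {p:.4f} -> {verdict}")
```

The results come out near 0.021, 0.036, and 0.0495 respectively. Hand calculations may differ in the last digit because table lookups round Φ(z) to three or four decimals.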
When to use this calculator
**Use p-values for:**
- **Hypothesis testing**: standard statistical inference.
- **Clinical trials**: efficacy testing of treatments.
- **A/B testing**: comparing two versions.
- **Quality control**: process monitoring.
- **Research papers**: standard reporting practice.
- **Scientific evidence**: framework for evaluating findings.
**P-value interpretation:**
| P-value | Convention |
|---|---|
| > 0.10 | Insufficient evidence to reject null |
| 0.05-0.10 | Marginal evidence; often called "trend" |
| 0.01-0.05 | Statistically significant |
| 0.001-0.01 | Highly significant |
| < 0.001 | Very strong evidence |
**Effect size vs p-value:**
A p-value addresses whether the data are compatible with no effect, not how big the effect is. Effect size measures magnitude:
- **Small p, small effect**: likely real but possibly trivial.
- **Small p, large effect**: meaningful and unlikely to be due to chance.
- **Large p, small effect**: possibly real but not detected.
- **Large p, large effect**: insufficient evidence; may need more data.
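To see how sample size and effect size interact, consider a rough sketch under a simplifying assumption: a one-sample z-test with known unit variance, so that z = d × √n for a standardized effect d. The function name and the effect sizes below are made up for illustration.

```python
import math
from statistics import NormalDist


def two_tailed_p_from_effect(d: float, n: int) -> float:
    """Two-tailed p for a one-sample z-test with standardized effect d.

    Assumes known unit variance, so the test statistic is z = d * sqrt(n).
    """
    z = d * math.sqrt(n)
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Tiny effect, huge sample: z = 0.005 * 1000 = 5
p_tiny_effect = two_tailed_p_from_effect(d=0.005, n=1_000_000)

# Moderate effect, small sample: z = 0.3 * sqrt(30), roughly 1.64
p_small_sample = two_tailed_p_from_effect(d=0.3, n=30)

print(f"d = 0.005, n = 1,000,000: p = {p_tiny_effect:.2e}")
print(f"d = 0.3,   n = 30:        p = {p_small_sample:.3f}")
```

Under these assumptions the trivial effect is highly significant while the sixty-times-larger effect is not, which is why effect size should be reported alongside the p-value.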
**Statistical power:**
Power = 1 - β = probability of detecting a true effect.
- High power → small p is likely when an effect exists.
- Low power → likely to miss real effects.
- Power increases with: larger sample, larger true effect, more sensitive test.
- Target power: 80% by convention.
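As a rough illustration of how power depends on sample size, the sketch below computes power for a two-sided one-sample z-test. It assumes known unit variance and a standardized effect `d`; the function names are hypothetical, and real studies typically use t-tests and dedicated power-analysis tools.

```python
from math import sqrt
from statistics import NormalDist


def z_test_power(d: float, n: int, alpha: float = 0.05) -> float:
    """Power of a two-sided one-sample z-test for standardized effect d."""
    norm = NormalDist()
    z_crit = norm.inv_cdf(1 - alpha / 2)  # e.g. about 1.96 for alpha = 0.05
    shift = d * sqrt(n)                   # mean of the z statistic under the alternative
    # Probability that the statistic lands in either rejection region
    return norm.cdf(shift - z_crit) + norm.cdf(-shift - z_crit)


def sample_size_for_power(d: float, target: float = 0.80, alpha: float = 0.05) -> int:
    """Smallest n reaching the target power (simple linear search; d must be nonzero)."""
    n = 2
    while z_test_power(d, n, alpha) < target:
        n += 1
    return n


print(f"power at d = 0.5, n = 20: {z_test_power(0.5, 20):.2f}")
print(f"n for 80% power at d = 0.5: {sample_size_for_power(0.5)}")
```

When `d` is 0 the function returns α, which matches the definition of the Type I error rate.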
**Type I and Type II errors:**
- **Type I (α)**: rejecting null when it's true. Controlled by significance level.
- **Type II (β)**: failing to reject null when it's false. Reduced by increasing power (power = 1 - β).
- **Trade-off**: at a fixed sample size, reducing α increases β; balance based on consequences.
**Multiple testing:**
When running multiple tests, the family-wise error rate (the chance of at least one false positive) inflates. For independent tests:
- Single test at α = 0.05: 5% Type I error rate.
- 10 tests at α = 0.05: ~40% chance of at least one false positive.
- 100 tests: ~99% chance of at least one false positive.
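These percentages follow from 1 - (1 - α)^k, which holds when the k tests are independent and every null hypothesis is true. A short sketch (the function name is illustrative):

```python
def familywise_error_rate(k: int, alpha: float = 0.05) -> float:
    """Chance of at least one false positive across k independent tests."""
    return 1 - (1 - alpha) ** k


for k in (1, 10, 100):
    print(f"{k:>3} tests: {familywise_error_rate(k):.1%}")
```

With correlated tests the inflation is usually smaller, so this formula is best read as an upper-end illustration.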
Solutions:
- **Bonferroni**: α / k (very conservative).
- **Benjamini-Hochberg**: controls false discovery rate.
- **Permutation tests**: robust alternative.
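The first two corrections are simple enough to sketch directly. The code below is a minimal illustration, and the p-values in it are invented for the example rather than taken from any study.

```python
def bonferroni(p_values: list[float], alpha: float = 0.05) -> list[bool]:
    """Reject each hypothesis whose p-value is at most alpha / k."""
    k = len(p_values)
    return [p <= alpha / k for p in p_values]


def benjamini_hochberg(p_values: list[float], alpha: float = 0.05) -> list[bool]:
    """Benjamini-Hochberg step-up procedure controlling the false discovery rate."""
    k = len(p_values)
    order = sorted(range(k), key=lambda i: p_values[i])  # indices by ascending p
    # Find the largest rank r such that p_(r) <= (r / k) * alpha
    max_rank = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / k * alpha:
            max_rank = rank
    # Reject every hypothesis ranked at or below that cutoff
    reject = [False] * k
    for i in order[:max_rank]:
        reject[i] = True
    return reject


# Made-up p-values from ten hypothetical tests
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]

print("uncorrected:", sum(p < 0.05 for p in p_values), "rejections")
print("Bonferroni: ", sum(bonferroni(p_values)), "rejections")
print("BH:         ", sum(benjamini_hochberg(p_values)), "rejections")
```

On this invented data the uncorrected approach rejects five hypotheses, Bonferroni one, and Benjamini-Hochberg two, reflecting how conservative each method is.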
**Bayesian alternatives:**
Bayesian methods report:
- **Posterior probability**: P(hypothesis | data).
- **Bayes factor**: ratio of evidence for/against hypotheses.
- **Credible intervals**: probability the parameter is in a range.
These avoid some misinterpretations of frequentist p-values.
**Best practices in reporting:**
1. **Always report p-value** with exact value (not just "p < 0.05").
2. **Include effect size** (Cohen's d, Pearson r, odds ratio).
3. **Include confidence interval** for the effect.
4. **State sample size**.
5. **Note one-tailed vs two-tailed**.
6. **Don't claim significance** without effect size context.
**Avoiding p-hacking:**
- **Pre-register hypotheses**.
- **Pre-specify analysis methods**.
- **Don't selectively report tests** that worked.
- **Avoid "garden of forking paths"** (changing methods based on results).
- **Reproduce in replication study**.
**Modern statistical practice:**
- Move away from binary significance tests.
- Emphasize effect sizes and confidence intervals.
- Use cross-validation and replication.
- Bayesian approaches gaining acceptance.
- Pre-registration becoming standard.
**Common pitfalls:**
- Treating non-significant as "no effect."
- Treating significant as "important effect."
- Comparing p-values directly (different sample sizes change interpretation).
- Multiple testing without correction.
- Stopping data collection at significance.
- HARKing (hypothesizing after results).
Common mistakes to avoid
- Treating the p-value as the probability that the null hypothesis is true. It is the conditional probability of seeing data at least this extreme IF the null is true.
- Treating p = 0.05 as universally meaningful. Threshold is arbitrary; effect size matters more.
- Confusing statistical significance with practical importance.
- Using one-tailed when two-tailed is appropriate. Conservative default is two-tailed.
- Multiple testing without correction. Increases Type I error rate.
- Stopping data collection as soon as p < 0.05. This is a form of p-hacking; set the sample size in advance.
- Comparing p-values across studies with different sample sizes. Direct comparison is misleading.
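The cost of stopping at the first significant result can be illustrated with a small simulation. In this sketch the null hypothesis is true by construction (data drawn from a standard normal), and the "peek every 10 observations" schedule, the sample cap of 100, and the function name are all assumptions chosen for the example.

```python
import math
import random


def false_positive_rate(peek_every: int, max_n: int = 100,
                        n_sims: int = 2000, seed: int = 0) -> float:
    """Fraction of null experiments declared significant under a peeking schedule."""
    rng = random.Random(seed)
    z_crit = 1.959964  # two-tailed critical value for alpha = 0.05
    hits = 0
    for _ in range(n_sims):
        total = 0.0
        for n in range(1, max_n + 1):
            total += rng.gauss(0.0, 1.0)  # null is true: mean 0, sd 1
            # z statistic for the running mean with known sd = 1
            if n % peek_every == 0 and abs(total / math.sqrt(n)) > z_crit:
                hits += 1
                break  # stop collecting data at the first "significant" look
    return hits / n_sims


fixed = false_positive_rate(peek_every=100)   # single look at n = 100
peeking = false_positive_rate(peek_every=10)  # ten looks, stop at first p < 0.05

print(f"single look: {fixed:.3f}")
print(f"ten looks:   {peeking:.3f}")
```

The single-look rate should sit near the nominal 5%, while repeated peeking pushes it well above that, even though no real effect exists.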