Correlation Calculator

Calculate the Pearson correlation coefficient between two variables. Includes R-squared, statistical significance testing, and detailed interpretation.

r = sum[(x-xbar)(y-ybar)] / sqrt[sum(x-xbar)^2 * sum(y-ybar)^2]; R^2 = r^2; t = r*sqrt(n-2)/sqrt(1-r^2) with df=n-2
X: 1,2,3,4,5 and Y: 2,4,5,4,6 -> r=0.853 (strong positive correlation), R^2=72.7%, t=2.83, p~=0.066 (not significant at the 0.05 level with only n=5)
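The worked example can be checked in a few lines, assuming SciPy is available (`scipy.stats.pearsonr` returns both r and the two-sided p-value):

```python
from scipy import stats

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 6]

# Pearson r and the two-sided p-value for H0: rho = 0
r, p = stats.pearsonr(x, y)
print(f"r = {r:.3f}, R^2 = {r**2:.3f}, p = {p:.3f}")
```

Note that with only five data points even a strong r does not reach significance at the 0.05 level.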

What is correlation and what does the correlation coefficient mean?

Correlation measures the strength and direction of the linear relationship between two variables. Pearson's r ranges from -1 to +1. r = +1: perfect positive (as X increases, Y increases proportionally). r = -1: perfect negative (as X increases, Y decreases). r = 0: no linear relationship. Interpretation: |r| 0.0-0.3 weak, 0.3-0.7 moderate, 0.7-1.0 strong. Example: Height and weight r=0.7 (strong positive). Temperature and heating bills r=-0.8 (strong negative). Shoe size and IQ r≈0 (no relationship).

What is the difference between correlation and causation?

CRITICAL: Correlation does NOT imply causation! Example: Ice cream sales and drowning deaths are correlated (r=0.9), but ice cream doesn't cause drowning - both increase in summer (confounding variable). Famous examples: (1) Nicolas Cage movies and pool drownings, (2) Cheese consumption and deaths by bedsheet tangling. To establish causation need: (1) Correlation, (2) Temporal precedence (cause before effect), (3) Control for confounds, (4) Mechanism. Correlation is necessary but not sufficient for causation. Use experiments, not just observation.

How do I interpret the p-value for correlation?

P-value tests if correlation is statistically different from zero. p < 0.05 (5%) = statistically significant correlation. Example: r=0.4, n=50, p≈0.004 → significant correlation. But r=0.4, n=10, p≈0.25 → not significant (too small sample). Important: Statistical significance ≠ practical significance. With huge samples, r=0.1 can be "significant" but meaningless. With tiny samples, r=0.8 might not be "significant." Always report both r and p-value. Effect size (r) matters more than p-value!
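The dependence of the p-value on n can be computed directly from the t-statistic formula given above; a small sketch (the helper name `corr_pvalue` is ours, and SciPy is assumed):

```python
import math
from scipy import stats

def corr_pvalue(r, n):
    """Two-sided p-value for H0: rho = 0, given observed r and sample size n."""
    t = r * math.sqrt(n - 2) / math.sqrt(1 - r**2)
    return 2 * stats.t.sf(abs(t), df=n - 2)

# Same r = 0.4, very different conclusions depending on n
p_large = corr_pvalue(0.4, 50)
p_small = corr_pvalue(0.4, 10)
print(f"r=0.4, n=50: p = {p_large:.4f}")
print(f"r=0.4, n=10: p = {p_small:.4f}")
```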

What are the assumptions of Pearson correlation?

Pearson's r requires: (1) LINEAR relationship (not curved), (2) Both variables continuous, (3) No extreme outliers, (4) Bivariate normal distribution (for significance tests). Violations: Use Spearman's rho for: ordinal data, non-linear monotonic relationships, outliers. Use Kendall's tau for: small samples, many tied ranks. Example: Income and happiness might be logarithmic → use Spearman. Class rank (ordinal) → use Spearman. Always plot data first - scatter plot reveals non-linearity, outliers, patterns.
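The "plot first, then pick the coefficient" advice can be illustrated with a monotonic but non-linear relationship; here a made-up exponential curve (SciPy and NumPy assumed):

```python
import numpy as np
from scipy import stats

x = np.arange(1, 21, dtype=float)
y = np.exp(x / 4.0)  # strictly increasing, but strongly non-linear

r_pearson, _ = stats.pearsonr(x, y)   # penalized by the curvature
rho, _ = stats.spearmanr(x, y)        # rank-based: sees a perfect monotonic trend
print(f"Pearson r = {r_pearson:.3f}, Spearman rho = {rho:.3f}")
```

Spearman's rho is exactly 1 because the ranks line up perfectly, while Pearson's r is pulled below 1 by the non-linearity.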

How does sample size affect correlation?

Larger samples give more reliable correlations and lower p-values. Same r with different n: r=0.3, n=20, p=0.20 (not significant). r=0.3, n=100, p=0.002 (significant). r=0.3, n=1000, p<0.001 (highly significant). Minimum samples: n>=30 for basic analysis, n>=50 for reliable estimates, n>=100 for publication. Small samples: unstable correlations, wide confidence intervals, low power. Large samples: detect tiny correlations (which may not matter practically). Rule: need ~85 samples to detect r=0.3 with 80% power.

What is R-squared (coefficient of determination)?

R^2 = r^2 = proportion of variance in Y explained by X. r=0.8 → R^2=0.64 → 64% of variance explained. Example: Height and weight r=0.7 → R^2=0.49. Height explains 49% of weight variation; 51% due to other factors (diet, muscle mass, etc.). Interpretation: R^2=0.25 (25% explained), R^2=0.50 (50% explained), R^2=0.75 (75% explained). Used in regression more than simple correlation. Higher R^2 = better prediction. But low R^2 doesn't mean no relationship - could be non-linear or multifactorial.

How do outliers affect correlation?

Outliers can drastically change correlation. Example: 20 points with r=0.1. Add one outlier → r jumps to 0.8 (misleading!). Types: (1) Increase correlation (extends trend), (2) Decrease correlation (breaks pattern), (3) Create correlation where none exists. Detection: scatter plot, standardized residuals >3, Cook's distance. Solutions: (1) Remove if data error, (2) Transform variables (log, sqrt), (3) Use robust methods (Spearman's rho), (4) Report with/without outlier. Never silently remove outliers - always justify and report impact.
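The outlier effect is easy to reproduce with simulated data (the dataset here is made up; NumPy and SciPy assumed):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
x = rng.normal(size=30)
y = rng.normal(size=30)  # unrelated noise: r should be near 0

r_before, _ = stats.pearsonr(x, y)

# A single extreme point manufactures an apparent relationship
x_out = np.append(x, 10.0)
y_out = np.append(y, 10.0)
r_after, _ = stats.pearsonr(x_out, y_out)
rho_after, _ = stats.spearmanr(x_out, y_out)  # rank-based, far less affected

print(f"Pearson r without outlier:  {r_before:.2f}")
print(f"Pearson r with outlier:     {r_after:.2f}")
print(f"Spearman rho with outlier:  {rho_after:.2f}")
```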

What is the difference between Pearson, Spearman, and Kendall correlation?

PEARSON (r): Linear relationship, continuous data, sensitive to outliers. Most common. Example: height vs weight. SPEARMAN (ρ): Monotonic relationship, ordinal or continuous, robust to outliers. Uses ranks instead of values. Example: class rank vs test score. KENDALL (τ): Similar to Spearman but better for small samples and tied ranks. More conservative. Relationship: Pearson measures linear, Spearman/Kendall measure monotonic (can be non-linear but consistently increasing/decreasing). If all three similar → robust relationship. If Pearson differs → check for non-linearity or outliers.

How do I test if a correlation is statistically significant?

Test H₀: ρ=0 (no correlation) vs H₁: ρ≠0. Test statistic: t = r*sqrt(n-2)/sqrt(1-r^2) with df=n-2. Compare to t-table or calculate p-value. Example: r=0.5, n=30 → t = 0.5*sqrt(28)/sqrt(0.75) = 3.06, df=28, p=0.005 (significant). Critical values at alpha=0.05: n=10 (r>=0.632), n=20 (r>=0.444), n=30 (r>=0.361), n=50 (r>=0.279), n=100 (r>=0.197). Confidence interval: CI = tanh(arctanh(r) +/- z/sqrt(n-3)). Always report: r, p-value, n, and confidence interval for publication.
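Both the t-test and the Fisher-z confidence interval can be sketched in one helper (the function name `corr_test` is ours; SciPy assumed):

```python
import math
from scipy import stats

def corr_test(r, n, conf=0.95):
    """t-test for H0: rho = 0, plus a Fisher-z confidence interval for rho."""
    t = r * math.sqrt(n - 2) / math.sqrt(1 - r**2)
    p = 2 * stats.t.sf(abs(t), df=n - 2)
    z = stats.norm.ppf(0.5 + conf / 2)          # 1.96 for a 95% interval
    lo = math.tanh(math.atanh(r) - z / math.sqrt(n - 3))
    hi = math.tanh(math.atanh(r) + z / math.sqrt(n - 3))
    return t, p, (lo, hi)

t, p, ci = corr_test(0.5, 30)
print(f"t = {t:.2f}, p = {p:.4f}, 95% CI = ({ci[0]:.2f}, {ci[1]:.2f})")
```

Note how wide the interval is at n=30: the point estimate r=0.5 is compatible with anything from a weak to a strong correlation.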

Can correlation be used for prediction?

Correlation shows association but regression is better for prediction. Correlation: symmetric (r_xy = r_yx), no distinction between predictor/outcome. Regression: asymmetric, predicts Y from X. Example: Height and weight r=0.7. Correlation tells you the relationship is strong, but to actually predict weight from height use regression: Weight = a + b*Height. Correlation gives strength of relationship, regression gives prediction equation. Both related: regression slope b = r * (SD_y/SD_x). Use correlation for: exploring relationships, preliminary analysis. Use regression for: prediction, controlling variables, effect sizes.
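The identity b = r * (SD_y/SD_x) can be verified on simulated height/weight data (the dataset and its parameters are made up; NumPy and SciPy assumed):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
height = rng.normal(170, 10, size=200)                  # cm
weight = 0.9 * height + rng.normal(0, 8, size=200)      # kg, with noise

r, _ = stats.pearsonr(height, weight)
res = stats.linregress(height, weight)

# Regression slope equals r scaled by the ratio of standard deviations
slope_from_r = r * weight.std(ddof=1) / height.std(ddof=1)
print(f"r = {r:.3f}, slope = {res.slope:.3f}, r*SDy/SDx = {slope_from_r:.3f}")
```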

What are common mistakes in interpreting correlation?

MISTAKE 1: Assuming causation (correlation ≠ causation!). MISTAKE 2: Ignoring non-linearity (plot data first!). MISTAKE 3: Extrapolating beyond data range. MISTAKE 4: Confusing correlation with agreement (use ICC or kappa). MISTAKE 5: Comparing correlations without testing difference. MISTAKE 6: Using Pearson for ordinal data (use Spearman). MISTAKE 7: Ignoring restricted range (correlation weaker in homogeneous groups). MISTAKE 8: Multiple testing without correction (20 tests at p<0.05 → expect 1 false positive by chance).

How do I compare two correlation coefficients?

To test if r₁ ≠ r₂, use Fisher's Z transformation. Transform: Z = 0.5*ln[(1+r)/(1-r)]. Test: z = (Z₁-Z₂)/sqrt(1/(n₁-3) + 1/(n₂-3)). Example: Group A (n=50, r=0.6), Group B (n=60, r=0.3). Z_A=0.693, Z_B=0.310. z = (0.693-0.310)/sqrt(1/47+1/57) = 1.95, p≈0.052 (just misses significance at 0.05). Applications: (1) Compare correlations across groups (men vs women), (2) Test if correlation changed over time, (3) Meta-analysis combining correlations. Online calculators available for complex comparisons.
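The two-group comparison above can be sketched directly from the formulas (the helper name `compare_correlations` is ours; SciPy assumed):

```python
import math
from scipy import stats

def compare_correlations(r1, n1, r2, n2):
    """Two-sided test of H0: rho1 = rho2 via Fisher's z transformation."""
    z1, z2 = math.atanh(r1), math.atanh(r2)   # atanh(r) == 0.5*ln[(1+r)/(1-r)]
    se = math.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
    z = (z1 - z2) / se
    p = 2 * stats.norm.sf(abs(z))
    return z, p

# Group A (n=50, r=0.6) vs Group B (n=60, r=0.3)
z, p = compare_correlations(0.6, 50, 0.3, 60)
print(f"z = {z:.2f}, p = {p:.3f}")
```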