Non-parametric Alternatives
Rank-based group comparison tests available in Licklider: Mann-Whitney U, Wilcoxon signed-rank, Kruskal-Wallis, and Friedman. Includes the design-to-test map, what each test actually tests, practical caveats, and the current implementation boundary, including the design errors Licklider cannot detect from the table alone.
What this method is
Licklider's group comparison runtime includes four rank-based tests: Mann-Whitney U, Wilcoxon signed-rank, Kruskal-Wallis, and Friedman. When Shapiro-Wilk flags non-normal data, the auto-selection logic routes to these tests instead of their parametric counterparts. Each test has an active API route, a Python engine implementation, and panel display in the figure canvas.
This page covers the design-to-test map, what each test actually tests, practical caveats from the statistical literature, and the current implementation boundary.
These routes automate common decisions, but they do not certify that your study design is correct. Licklider can route between rank-based tests from the table structure and assumption checks, yet it cannot infer the true observation unit, cannot tell whether a pairing variable is scientifically valid, and cannot detect hidden clustering when rows appear independent in the dataset.
Reference map: design to rank-based test
The first fork is independent groups versus paired/repeated measures; the second is two groups versus three or more conditions. Licklider's auto-selection follows this same map. SciPy separates these tests the same way (for example `mannwhitneyu` for independent two-sample, `wilcoxon` for paired two conditions, `kruskal` for independent multi-group, and `friedmanchisquare` for repeated measures across three or more conditions).
| Study design | Rank-based test | Licklider panel |
|---|---|---|
| Two independent groups, non-normal | Mann–Whitney U | TTestPanel (U statistic, rank-biserial r) |
| Two paired conditions, non-normal | Wilcoxon signed-rank | TTestPanel (W statistic, rank-biserial r) |
| Three or more independent groups, non-normal | Kruskal–Wallis | SignificancePanel (H statistic, df, epsilon<sup>2</sup>) |
| Three or more repeated conditions, non-normal | Friedman test | SignificancePanel (chi<sup>2</sup>, df, Kendall's W) |
Note: Kruskal–Wallis is defined for two or more independent samples, but with exactly two groups the Mann–Whitney formulation is the usual pairwise choice; the table above follows Licklider's current selection policy.
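As a concrete illustration of this map, the sketch below routes a design to the matching SciPy call. It is a simplified stand-in for Licklider's selection policy, not its actual code, and the sample data are invented:

```python
from scipy import stats

def select_rank_test(groups, paired):
    """Route a non-normal design to its rank-based test.

    groups: list of 1-D samples; paired: True for repeated measures.
    Illustrative sketch of the design-to-test map above.
    """
    if len(groups) == 2:
        if paired:
            return "wilcoxon", stats.wilcoxon(groups[0], groups[1])
        return "mannwhitneyu", stats.mannwhitneyu(groups[0], groups[1])
    if paired:
        return "friedman", stats.friedmanchisquare(*groups)
    return "kruskal", stats.kruskal(*groups)

a = [1.2, 3.4, 2.2, 5.1, 4.0]
b = [2.0, 4.8, 3.9, 6.2, 5.5]
c = [0.9, 1.8, 2.5, 3.1, 2.7]

name, result = select_rank_test([a, b, c], paired=False)
print(name, result.statistic, result.pvalue)
```

Each SciPy function returns an object with `statistic` and `pvalue` attributes, which is all the routing layer needs to hand off to a display panel.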
What each test is actually asking
Rank-based tests are not interchangeable labels for "mean difference without normality." Under the hood they target different null hypotheses:
- Mann–Whitney U — Tests whether two independent samples appear to come from the same distribution. It is often interpreted as a location or stochastic-ordering question, but that interpretation needs extra assumptions; see the SciPy discussion in the `mannwhitneyu` notes.
- Wilcoxon signed-rank — For paired data, works on the distribution of within-pair differences. SciPy states it tests whether the distribution of those differences is symmetric about zero.
- Kruskal–Wallis — Omnibus test across independent groups that at least one distribution differs. By itself it does not say which groups differ.
- Friedman — For repeated measurements on the same subjects or blocks across three or more conditions. It tests whether those related samples plausibly come from the same distribution.
Friedman's test is the non-parametric counterpart to repeated measures ANOVA. When all groups pass the normality check, Licklider selects repeated measures ANOVA; when any group fails, Friedman's test is selected instead. See Repeated Measures ANOVA.
For mixed designs that combine one between-subjects factor with one within-subjects factor, Licklider currently has no standard non-parametric mixed-design fallback. When that design is declared, the product stays on the Mixed ANOVA path rather than switching to a rank-based omnibus test.
Practical caveats
These points mirror SciPy's own notes and the current runtime behavior.
Important: Licklider does not automatically detect pseudoreplication or hidden non-independence in rank-based analyses. If multiple rows come from the same animal, plate, well, cage, litter, or technical replicate set, the software may still run a test even though the effective sample size is smaller than the row count suggests, which can make uncertainty look smaller than it really is.
- Kruskal–Wallis — The reported p-value uses a chi-square approximation for H. SciPy notes that each group should not be too small; a typical rule is at least 5 observations per group.
- Friedman — SciPy notes that the chi-square approximation is most reliable with larger complete-block designs. Small repeated-measures datasets should be interpreted cautiously.
- Mann–Whitney U — Choice of exact vs asymptotic inference matters for very small samples, and ties complicate the null distribution. Licklider reports the tie count in the panel output.
- Wilcoxon signed-rank — Zero differences and tied absolute differences affect the null distribution. Licklider reports zero-difference counts and uses pair-column based alignment for paired rows.
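The tie and zero-difference caveats can be made concrete with SciPy. The counting logic below is illustrative (a simple duplicate count, not necessarily how Licklider computes its panel fields), and the data are invented:

```python
import numpy as np
from scipy import stats

x = np.array([3.1, 2.4, 2.4, 5.0, 4.2, 3.1])
y = np.array([3.1, 2.9, 2.4, 4.1, 4.2, 2.8])

# Ties across the pooled samples complicate the Mann-Whitney null
# distribution; count duplicated values as a rough tie indicator.
pooled = np.concatenate([x, y])
tie_count = len(pooled) - len(np.unique(pooled))
u = stats.mannwhitneyu(x, y)  # method='auto' falls back to the
                              # tie-corrected asymptotic p when ties exist

# Zero within-pair differences are dropped by Wilcoxon's "wilcox"
# zero_method, shrinking the effective sample size.
diffs = x - y
zero_count = int(np.sum(diffs == 0))
w = stats.wilcoxon(x, y, zero_method="wilcox")

print(tie_count, zero_count, u.pvalue, w.pvalue)
```

With three of six pairs tied at zero here, only three differences remain for the signed-rank test, which is why small paired datasets with many exact ties deserve caution.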
When to use or avoid
Use a non-parametric test when the parametric path is hard to defend for your data:
- Use when the data are difficult to summarize with a mean-based comparison alone.
- Use when the sample size is small enough that assumption diagnostics need extra caution.
- Use when the reader's main question is about order or rank rather than a mean difference.
When Licklider's auto-selection routes to a non-parametric test, it is because Shapiro-Wilk flagged non-normality in the data. You can still override the selection if you have a defensible reason to prefer a parametric test.
Auto-selection should be read as a conservative routing rule, not as proof that the non-parametric path is always the scientifically best one. In larger samples, or when the main question is about a mean difference under unequal variances, a parametric method such as Welch's t-test or Welch-style ANOVA may still be the more defensible analysis path [1, 9].
Required inputs
Each test requires:
- A value column (continuous measurement)
- A group column (categorical factor)
- For paired tests (Wilcoxon, Friedman): a pair column linking the same subject or block across groups
For paired tests, Licklider can use a pair_column if you provide one, but it cannot verify that the pairing is scientifically correct. A column that happens to match rows is not enough; the pairing must reflect the real repeated-measures or matched-block design.
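As an illustration of how complete pairs might be assembled from long-format data before a paired test, the sketch below pivots on a pairing column and drops incomplete rows. The column names and data are invented, and this is not Licklider's alignment code:

```python
import pandas as pd
from scipy import stats

# Long-format data: one row per measurement; "subject" plays the role
# of the pair column linking conditions.
df = pd.DataFrame({
    "subject": ["s1", "s1", "s2", "s2", "s3", "s3", "s4"],  # s4 unpaired
    "condition": ["pre", "post", "pre", "post", "pre", "post", "pre"],
    "value": [4.1, 5.0, 3.2, 3.9, 6.0, 5.8, 2.5],
})

# Pivot to one row per subject, then keep only complete pairs.
wide = df.pivot(index="subject", columns="condition", values="value").dropna()
res = stats.wilcoxon(wide["pre"], wide["post"])
print(len(wide), res.pvalue)  # 3 complete pairs
```

Note what this alignment cannot check: the pivot will happily pair rows on any column with matching labels, which is exactly why the scientific validity of the pairing remains the analyst's responsibility.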
The current stats-meta path samples up to 20,000 rows for these routes before dispatching to the engine.
Outputs
Licklider reports the following for each test:
| Test | Panel | Statistic | Effect size | Additional fields |
|---|---|---|---|---|
| Mann–Whitney U | TTestPanel | U | Rank-biserial r | p-value, tie count, notes |
| Wilcoxon signed-rank | TTestPanel | W | Rank-biserial r | p-value, zero-difference count, pair-column notes |
| Kruskal–Wallis | SignificancePanel | H | epsilon<sup>2</sup> | df, p-value, pairwise comparisons with Holm default or Bonferroni override |
| Friedman | SignificancePanel | chi<sup>2</sup> | Kendall's W | df, p-value, pairwise Wilcoxon follow-up with Holm default or Bonferroni override |
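The rank-biserial values in the table can be reproduced from SciPy output with Kerby's simple difference formula [4]. The sketch below uses invented data; the signed-rank variant uses the standard matched-pairs rank-biserial definition:

```python
import numpy as np
from scipy import stats

x = [2.0, 3.1, 4.5, 1.8, 3.9]
y = [5.0, 4.2, 6.1, 3.8, 5.6]

# Mann-Whitney rank-biserial: (U1 - U2) / (n1 * n2), equivalently
# 2*U1/(n1*n2) - 1 where U1 is SciPy's statistic for the first sample.
U, _ = stats.mannwhitneyu(x, y, alternative="two-sided")
r_mw = 2 * U / (len(x) * len(y)) - 1

# Matched-pairs rank-biserial: difference of positive and negative
# signed-rank sums over the total rank sum.
d = np.array(x) - np.array(y)
ranks = stats.rankdata(np.abs(d))
t_plus = ranks[d > 0].sum()
t_minus = ranks[d < 0].sum()
r_w = (t_plus - t_minus) / (t_plus + t_minus)

print(round(r_mw, 3), round(r_w, 3))  # -0.76 -1.0
```

Here every paired difference is negative, so the matched-pairs rank-biserial hits its lower bound of -1; the sign carries the direction of the effect.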
A few implementation details matter here:
- Mann-Whitney and Wilcoxon currently return `ci_low` and `ci_high` as `null`.
- Kruskal-Wallis pairwise comparisons are computed from Dunn-style rank comparisons, with pairwise effect size filled from the corresponding Mann-Whitney rank-biserial estimate.
- Friedman pairwise comparisons are computed as pairwise Wilcoxon comparisons across complete blocks.
Kruskal-Wallis and Friedman test results appear in the Statistical Results Table with their test statistics, degrees of freedom, p-values, and effect sizes (epsilon-squared for Kruskal-Wallis, Kendall's W for Friedman).
When the omnibus test is significant, post hoc pairwise comparisons are shown in the Pairwise Comparison Table: Dunn-style comparisons for Kruskal-Wallis and pairwise Wilcoxon comparisons on within-block differences for Friedman, both with Holm correction by default (Bonferroni optional).
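Holm's step-down adjustment is simple enough to sketch directly. The function below is an illustrative implementation of the procedure, not Licklider's code:

```python
import numpy as np

def holm_adjust(pvals):
    """Holm step-down adjustment: multiply the i-th smallest p-value
    by (m - i), enforce monotonicity along the sort order, cap at 1."""
    p = np.asarray(pvals, dtype=float)
    m = len(p)
    order = np.argsort(p)
    adjusted = np.empty(m)
    running_max = 0.0
    for rank, idx in enumerate(order):
        val = min(1.0, (m - rank) * p[idx])
        running_max = max(running_max, val)  # step-down monotonicity
        adjusted[idx] = running_max
    return adjusted

raw = [0.010, 0.040, 0.030]
print(holm_adjust(raw))  # [0.03 0.06 0.06]
```

Compare with Bonferroni, which would multiply every raw p-value by 3: Holm only applies the full factor to the smallest p-value, which is the source of its power advantage at the same familywise error rate.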
Effect size interpretation
Effect size for Kruskal-Wallis (epsilon-squared):
| Value | Interpretation |
|---|---|
| < 0.01 | Negligible |
| 0.01 – 0.06 | Small |
| 0.06 – 0.14 | Medium |
| ≥ 0.14 | Large |
Effect size for Friedman (Kendall's W):
| Value | Interpretation |
|---|---|
| < 0.1 | Negligible |
| 0.1 – 0.3 | Small |
| 0.3 – 0.5 | Medium |
| ≥ 0.5 | Large |
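Both effect sizes follow standard formulas from the cited literature: epsilon-squared as H(n + 1)/(n² - 1) [5], and Kendall's W as chi²/(n_blocks · (k - 1)) [6]. A sketch with invented data:

```python
from scipy import stats

g1 = [2.1, 3.4, 1.8, 2.9, 3.3]
g2 = [4.0, 5.2, 4.8, 3.9, 5.5]
g3 = [1.0, 1.5, 2.0, 0.8, 1.2]

# Kruskal-Wallis epsilon-squared: H * (n + 1) / (n^2 - 1),
# where n is the total number of observations.
H, _ = stats.kruskal(g1, g2, g3)
n = len(g1) + len(g2) + len(g3)
epsilon_sq = H * (n + 1) / (n ** 2 - 1)

# Friedman Kendall's W: chi2 / (n_blocks * (k - 1)),
# where k is the number of conditions.
chi2, _ = stats.friedmanchisquare(g1, g2, g3)
n_blocks, k = len(g1), 3
kendalls_w = chi2 / (n_blocks * (k - 1))

print(round(epsilon_sq, 3), round(kendalls_w, 3))
```

With these well-separated groups both values land in the "Large" band of the tables above; both statistics are bounded by 1 at perfect separation or perfect concordance.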
Related checks
- Normality and Homoscedasticity Checks feed into auto-selection. When Shapiro-Wilk flags non-normality, auto-selection routes to the corresponding rank-based test.
- Observation Unit Declaration helps you define what each row represents before choosing an independent or paired rank-based test.
- Paired vs Unpaired Guard helps you review whether a paired design is actually justified.
- For the broader support boundary, read Known Limitations.
Recommended figures
- If you need readers to see the actual observations rather than a summary alone, consider a Strip Plot.
- Point-level figures can help explain why a simple parametric summary may deserve a second look.
Design Rationale & References
Licklider's design choices
Licklider routes to rank-based tests when Shapiro-Wilk flags non-normality because the central limit theorem cannot be assumed to apply in small samples, and parametric tests carry stronger assumptions about the data-generating process than rank-based tests require [1]. The routing follows the standard design-to-test map: independent two-group designs use Mann-Whitney U, paired two-condition designs use Wilcoxon signed-rank, independent multi-group designs use Kruskal-Wallis, and repeated-measures multi-condition designs use Friedman [1].
Mann-Whitney U is described throughout Licklider as a test of distributional equality - not a test of medians - because interpreting U as a median comparison requires the additional assumption that the two distributions differ only in location, which is often unverifiable in practice [2, 3].
Effect sizes are reported alongside every rank-based test: rank-biserial r for two-group tests [4], epsilon-squared for Kruskal-Wallis [5], and Kendall's W for Friedman [6]. These are the standard measures for their respective tests and carry the same motivation as effect size reporting for parametric tests - p-values alone do not convey the magnitude or practical relevance of a finding.
Confidence intervals on rank-biserial r are not yet reported - the `ci_low` and `ci_high` fields in the current output are null for Mann-Whitney U and Wilcoxon signed-rank.
For Kruskal-Wallis post hoc comparisons, Licklider defaults to Dunn-style rank comparisons with Holm correction [7, 8]. For Friedman post hoc comparisons, Licklider uses pairwise Wilcoxon comparisons across complete blocks, also with Holm correction as the default [8]. Holm correction provides better power than Bonferroni while maintaining familywise error control [8].
One important limitation applies to all rank-based tests: they are not universally safer than their parametric counterparts. When group variances are substantially unequal, Mann-Whitney U can produce higher Type I error rates than Welch's t-test [9]. Licklider flags unequal variances in the Assumptions panel regardless of which test is selected.
Methodological foundations
Fagerland, M. W. (2012). t-tests, non-parametric tests, and large studies - a paradox of statistical practice? BMC Medical Research Methodology, 12(1), 78. -> In large samples, t-tests are robust to non-normality via the central limit theorem, while non-parametric tests may test more than just location - the basis for Licklider's recommendation to consider parametric alternatives when samples are large.
Divine, G. W., Norton, H. J., Baron, A. E., & Juarez-Colunga, E. (2018). The many faces of the Wilcoxon-Mann-Whitney test. The American Statistician, 72(2), 191-198. -> Establishes that the Mann-Whitney null hypothesis is distributional equality, not equality of medians, and clarifies when a location-shift interpretation is and is not justified.
Hart, A. (2001). Mann-Whitney test is not just a test of medians. BMJ, 323(7310), 450. -> Demonstrates that when group variances differ, Mann-Whitney U does not correctly test for median differences - the clinical-audience case for Licklider's distributional framing.
Kerby, D. S. (2014). The simple difference formula: An approach to teaching nonparametric correlation. Comprehensive Psychology, 3, Article 11.IT.3.1. -> Proposes the simple difference formula for rank-biserial r as a directly interpretable effect size for Mann-Whitney U and Wilcoxon signed-rank tests; the basis for Licklider's effect size calculation for two-group rank tests.
Tomczak, M., & Tomczak, E. (2014). The need to report effect size estimates revisited. Trends in Sport Sciences, 1(21), 19-25. -> Provides the epsilon-squared formula and interpretation benchmarks for Kruskal-Wallis, establishing it as the rank-test analogue of eta-squared.
Kendall, M. G., & Smith, B. B. (1939). The problem of m rankings. The Annals of Mathematical Statistics, 10(3), 275-287. -> Original derivation of the coefficient of concordance W, the effect size reported by Licklider for Friedman tests.
Dinno, A. (2015). Nonparametric pairwise multiple comparisons in independent groups using Dunn's test. The Stata Journal, 15(1), 292-300. -> Establishes Dunn's test as the rank-consistent post hoc procedure for Kruskal-Wallis and recommends multiple comparison adjustment; the direct basis for Licklider's default post hoc selection.
Holm, S. (1979). A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics, 6(2), 65-70. -> Establishes the Holm step-down procedure, which controls familywise error rate with greater power than Bonferroni - Licklider's default correction method for all non-parametric post hoc comparisons.
Known limitations
- Zimmerman, D. W. (1998). Invalidation of parametric and nonparametric statistical tests by modification of sample variance. The Journal of Experimental Education, 67(1), 55-68. -> Shows that Mann-Whitney U can produce higher Type I error rates than Welch's t-test when group variances are substantially unequal - the basis for Licklider's variance assumption check even when a non-parametric test is selected.
Current support boundary
- All four rank-based tests (Mann-Whitney U, Wilcoxon signed-rank, Kruskal-Wallis, Friedman) have active API routes, Python engine implementations, and panel display.
- Auto-selection currently routes to these tests when Shapiro-Wilk flags non-normal data: Mann-Whitney U for 2 unpaired groups, Wilcoxon for 2 paired groups, Kruskal-Wallis for 3+ unpaired groups, and Friedman for 3+ paired groups.
- The stats-meta path currently samples up to 20,000 rows before dispatching these requests.
- Wilcoxon and Friedman require a pair column and skip when complete pairs or complete blocks cannot be formed. Friedman additionally requires at least 2 complete blocks.
- Kruskal-Wallis pairwise follow-up uses Dunn-style rank comparisons with Holm default or Bonferroni override. Friedman follow-up uses pairwise Wilcoxon comparisons with the same correction choices.
- Licklider does not automatically detect whether rows that look independent are actually clustered or repeated measures from the same underlying unit.
- Licklider does not automatically verify that a provided `pair_column` represents the real scientific pairing rather than an after-the-fact convenience match.
- Rank-based routing does not guarantee that the non-parametric path is safer than the parametric one for every dataset; review the variance flags and design assumptions before treating the auto-selected route as final.