Outliers and Researcher Degrees of Freedom

How repeated decisions about outliers, subgroups, and methods accumulate into researcher degrees of freedom, how Licklider tracks and discloses this pattern, and where the current support boundary sits.

Every decision a researcher makes about the data — which observations to include, which subgroups to examine, which statistical method to use — is a degree of freedom. When these decisions are made after seeing the data, and especially when they are made repeatedly until a desired result appears, the false positive rate of the analysis is no longer controlled by the nominal alpha level.

This is researcher degrees of freedom, sometimes called p-hacking or the garden of forking paths. It is not always intentional — researchers often explore their data legitimately — but it produces results that overstate their own reliability.


What Licklider tracks

Licklider records exploratory operations that expand researcher degrees of freedom. It groups these into three categories:

Outlier exclusion and row removal: Applying an outlier criterion, changing the criterion, winsorizing values, or removing rows from the analysis. Each time an exclusion is applied or changed, it is counted.

Subgroup selection and filtering: Restricting the analysis to a subset of the data, examining different subgroups, or changing the group definition.

Retest and method changes: Changing the statistical test after seeing the results, switching between parametric and non-parametric methods, or adjusting test parameters.
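
The exact event schema is internal to Licklider and not documented on this page. As a rough illustration of what "recording an exploratory operation" can mean, the sketch below tags each logged step with one of the three categories. All names (ExplorationCategory, ExploratoryOp, log_operation) are illustrative, not Licklider's API.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum

class ExplorationCategory(Enum):
    # The three categories described above.
    OUTLIER_EXCLUSION = "outlier_exclusion_and_row_removal"
    SUBGROUP_SELECTION = "subgroup_selection_and_filtering"
    RETEST_OR_METHOD_CHANGE = "retest_and_method_changes"

@dataclass
class ExploratoryOp:
    category: ExplorationCategory
    description: str  # e.g. "changed outlier rule from 2 SD to 3 SD"
    timestamp: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

# A session-level log: each applied or changed exclusion, subgroup, or retest adds one entry.
operations: list[ExploratoryOp] = []

def log_operation(category: ExplorationCategory, description: str) -> None:
    operations.append(ExploratoryOp(category, description))
```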

Licklider can only track decisions that occur inside the recorded analysis workflow. If a researcher tries several approaches outside the system, inspects results in another tool, or makes manual decisions before uploading the final table, those choices are not available to this tracker.


How severity is assessed

As exploratory operations accumulate, Licklider evaluates the severity based on the total number of operations and how many categories are involved:

  • Low: fewer than 3 operations and fewer than 2 categories
  • Medium: 3 or more operations, or 2 or more categories
  • High: 5 or more operations, or all 3 categories represented
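
Expressed as code, the banding amounts to a pair of threshold checks. A minimal sketch, assuming the operation and category counts are already available (classify_severity is an illustrative name, not Licklider's API):

```python
def classify_severity(n_operations: int, n_categories: int) -> str:
    """Map operation and category counts to the risk bands described above."""
    if n_operations >= 5 or n_categories >= 3:
        return "High"
    if n_operations >= 3 or n_categories >= 2:
        return "Medium"
    return "Low"

# Example: 4 operations spread across 2 categories falls in the Medium band.
assert classify_severity(4, 2) == "Medium"
```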

Licklider also estimates the approximate alpha inflation — how much the true false positive rate may have increased above the nominal level — using the formula: 1 − 0.95^n, where n is the number of independent exploratory decisions.

This estimate is an approximation. The actual inflation depends on whether the decisions were truly independent and on what was done at each step.
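
As a worked example under the independence assumption: three decisions give 1 − 0.95^3 ≈ 0.143 and five decisions give 1 − 0.95^5 ≈ 0.226, so the effective false-positive rate can be more than four times the nominal 0.05. A minimal sketch of the calculation:

```python
def approx_alpha_inflation(n_decisions: int, alpha: float = 0.05) -> float:
    """Probability of at least one false positive across n independent decisions."""
    return 1.0 - (1.0 - alpha) ** n_decisions

# 1 decision -> 0.05, 3 decisions -> ~0.143, 5 decisions -> ~0.226
print(round(approx_alpha_inflation(5), 3))  # 0.226
```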

The severity thresholds are meant as practical risk signals, not as formal statistical corrections. They help distinguish a small amount of routine exploration from repeated result-shaping behavior, but they do not replace a principled multiplicity adjustment or a pre-specified analysis plan.

Important: Licklider does not automatically know whether the tracked decisions are statistically independent, whether some branches were abandoned before being recorded, or whether the exploration started outside the current session. In those cases, the displayed alpha inflation can understate the real researcher degrees of freedom.


Where to find it

The exploratory pattern summary is visible in the Inspector alongside the figure's other quality check results. It shows:

  • The number of analysis variants explored
  • The number of exclusion and retest cycles
  • The number of subgroups explored
  • The estimated alpha inflation
  • The overall risk level
  • A timeline of the exploratory operations
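
If you read these values programmatically rather than in the Inspector (for example from an exported manifest), the summary might be shaped roughly like the dictionary below. The field names and values are illustrative, not a documented schema.

```python
exploration_summary = {
    "analysis_variants_explored": 3,
    "exclusion_and_retest_cycles": 1,
    "subgroups_explored": 2,
    "estimated_alpha_inflation": 0.143,  # 1 - 0.95**3 under the independence assumption
    "risk_level": "Medium",              # 3 operations across 2 categories
    "timeline": [
        {"step": 1, "category": "outlier_exclusion_and_row_removal"},
        {"step": 2, "category": "subgroup_selection_and_filtering"},
        {"step": 3, "category": "subgroup_selection_and_filtering"},
    ],
}
```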

Taken together, these outputs tell you not only that exploration happened, but also what kind of exploration happened and how concentrated it was.


Effect on export

The effect depends on the severity and the analysis intent:

Exploratory analyses: Low and medium severity are disclosed automatically; the disclosure is included in the figure's export output and no confirmation is required. High severity also produces an automatic disclosure.

Confirmatory and publication-ready analyses: High severity requires confirmation before the figure can be used in a claim-bearing export. This means acknowledging that the exploratory operations occurred and documenting how the reported result was selected.

This stricter behavior is intentional. Exploration is normal, but once a figure is presented as a claim-bearing result, undocumented analysis flexibility becomes part of the evidential problem rather than just part of the workflow history.
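
A compact way to see the gating rule, assuming the severity bands above and a simple intent flag (the names and string values are illustrative, not Licklider's API):

```python
def export_requires_confirmation(severity: str, intent: str) -> bool:
    """Exploratory figures are disclosed automatically; claim-bearing figures
    with high-severity exploration need an explicit confirmation."""
    claim_bearing = intent in {"confirmatory", "publication_ready"}
    return claim_bearing and severity == "High"
```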


What this means for your analysis

A high severity rating does not mean the result is wrong. It means that the result was reached through a process that inflates the false positive rate, and that the inflation has not been accounted for.

The appropriate response depends on what was done:

  • If the exploratory decisions were pre-specified or driven by domain knowledge rather than by the data, document this in the methods text
  • If multiple analyses were run and only one is being reported, consider whether the selection was based on significance — and if so, whether a correction for multiple testing is appropriate (a sketch of one standard correction follows this list)
  • If the result is intended for confirmatory use, ensure that the analysis plan was specified before the exploratory operations began
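
For the multiple-testing point above, one standard option is to adjust the full set of p-values that were examined, not only the one being reported. A minimal sketch using Holm's method from statsmodels; the p-values shown are placeholders:

```python
from statsmodels.stats.multitest import multipletests

# All p-values examined across the explored variants, not just the reported one.
p_values = [0.012, 0.034, 0.210, 0.049]

reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method="holm")
print(p_adjusted)  # Holm-adjusted p-values; compare these, not the raw values, against 0.05
```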

Design rationale and references

Licklider tracks outlier exclusions, subgroup changes, and retest cycles together because these are common ways that analysis flexibility accumulates. Each individual step may look harmless, but repeated branching after seeing the data can substantially increase the chance of a false-positive result [1, 2].

The severity bands are intentionally simple. Their job is to communicate escalating risk to non-specialist readers, not to pretend that one exact threshold marks the boundary between valid and invalid inference.

The alpha-inflation display is also intentionally approximate. It gives readers an intuition for why repeated decision-making matters, while still making clear that a heuristic display cannot substitute for formal multiplicity handling or strict preregistration [1, 3].

Methodological foundations

  1. Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22(11), 1359-1366. -> Canonical demonstration that repeated analytic flexibility inflates false-positive findings.

  2. Gelman, A., & Loken, E. (2014). The statistical crisis in science: Data-dependent analysis - a "garden of forking paths" - explains why many statistically significant comparisons don't hold up. American Scientist, 102(6), 460-465. -> Explains why even non-malicious analytic branching can invalidate nominal error rates.

  3. Wasserstein, R. L., Schirm, A. L., & Lazar, N. A. (2019). Moving to a world beyond "p < 0.05". The American Statistician, 73(sup1), 1-19. -> Supports the emphasis on transparent analysis process and disclosure rather than over-trusting nominal thresholds.

Current support boundary

  • Licklider only tracks exploratory operations that happen inside the recorded workflow; it does not automatically recover unlogged decisions made in notebooks, spreadsheets, scripts, or prior sessions.
  • Licklider does not automatically know whether the tracked decisions are statistically independent, so the displayed alpha inflation is a heuristic rather than an exact corrected error rate.
  • The Low / Medium / High severity thresholds are communication-oriented risk bands, not formal hypothesis-testing cutoffs.
  • A high severity rating does not prove that the final result is false, and a low severity rating does not prove that all relevant analytic flexibility has been captured.
  • This page describes within-analysis flexibility tracking. Project-level multiplicity across several claim-bearing figures is handled separately on Multiplicity and Analysis Families.

What this page does not cover