Why We Built Licklider

Research has a statistics problem.

Not a shortage of statistical tools — there are dozens. Not a shortage of guidance — every textbook covers the basics. The problem is that none of the existing tools make a verifiable commitment that the analysis in front of you is correct. They make it easy. They make it fast. But correctness has always been left as an exercise for the researcher.

That gap is what Licklider is built to close.


The problem with "convenient"

GraphPad Prism, R, Python, SPSS — these are capable tools. Researchers have built careers using them, and many excellent papers have come out the other side. But ask any one of these tools whether the t-test you just ran was the right choice, and it will say nothing. Ask whether your outlier handling changed the conclusion. Ask whether your error bars actually mean what you think they mean. Silence.

The tools that exist today were designed to be fast and approachable. That was the right problem to solve in the 1990s, when the alternative was writing statistical code from scratch. But approachable and correct are not the same thing, and the research reproducibility crisis has made the cost of that difference visible.

An estimated half of published preclinical findings do not replicate. The causes are varied, but a consistent thread runs through them: statistical decisions made without a record, assumptions untested, analyses repeated until the p-value cooperated. None of these failures required dishonesty. They required only the absence of a system that would catch them.


Why AI does not solve this

It would be reasonable to assume that large language models change this picture. If a model can write code, draft a paper, and explain a concept, surely it can check a statistical analysis.

It cannot — not in the way that matters.

LLMs generate plausible outputs. They are trained to produce text that looks correct, and in many domains that is good enough. But statistics is one of the few fields where "looks correct" and "is correct" are meaningfully different. A Shapiro–Wilk test either passed or it did not. A variance assumption either held or it did not. Welch's t-test either should have been used or it should not have been. These are not matters of judgment that a language model can approximate. They are computations with right and wrong answers.
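To make that concrete, here is a minimal sketch in Python using SciPy. The data and the alpha threshold are invented for illustration and say nothing about Licklider's internals; the point is that each check returns a definite answer, and the same inputs return it again tomorrow.

```python
import numpy as np
from scipy import stats

# Invented samples; a fixed seed means rerunning gives identical results.
rng = np.random.default_rng(0)
a = rng.normal(loc=10.0, scale=1.0, size=12)
b = rng.normal(loc=11.0, scale=3.0, size=12)

# Shapiro-Wilk normality check: a computed p-value, not an opinion.
_, p_norm_a = stats.shapiro(a)
_, p_norm_b = stats.shapiro(b)

# Levene's test for the equal-variance assumption behind Student's t-test.
_, p_var = stats.levene(a, b)

# Against a stated threshold, each check either passes or it does not.
alpha = 0.05
print(f"normality ok:       a={p_norm_a > alpha}, b={p_norm_b > alpha}")
print(f"equal variances ok: {p_var > alpha}")
```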

More importantly: an LLM cannot produce a verifiable record of what it did. It cannot show you that the test was selected for documented reasons. It cannot demonstrate that the outlier sensitivity was computed, not assumed. It cannot guarantee that running the same analysis tomorrow on the same data will produce the same result.

This is not a criticism of LLMs. It is a structural observation about what they are built to do. Licklider is built to do something different.


What we are building

Licklider is statistical infrastructure for research.

Every figure it produces comes with a complete audit trail: which normality test ran, what it found, which test was selected as a result, whether any outliers were detected and whether removing them changed the conclusion, which multiple comparison correction was applied and why. These are not optional disclosures. They are the product.
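As a hedged illustration of the shape such a record could take, here is one possibility in Python. Every key and value below is invented for this sketch; it is not Licklider's actual schema.

```python
# Hypothetical audit-trail record; all field names and values are illustrative.
audit_record = {
    "normality": {
        "test": "shapiro_wilk",
        "p_values": {"control": 0.41, "treated": 0.08},
        "passed": True,
    },
    "test_selection": {
        "chosen": "welch_t_test",
        "reason": "two independent groups; equal variances not assumed",
    },
    "outlier_sensitivity": {
        "flagged_rows": [17],
        "p_with_outlier": 0.012,
        "p_without_outlier": 0.019,
        "conclusion_changed": False,
    },
    "multiple_comparisons": {
        "correction": "holm",
        "reason": "three pairwise comparisons share one familywise alpha",
    },
}
```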

The goal is a world where a researcher can say "Licklider ran this analysis" and a reviewer can verify exactly what that means — not because they trust the tool, but because they can see the record.

That is what statistical integrity means in practice. Not a promise. A proof.


How we think about the design

Some of our design decisions are worth explaining, because they reflect a particular view of what research tooling should do.

Welch's t-test is the default, not Student's. Student's t-test assumes equal variance across groups. Most experimental data does not satisfy this assumption reliably, and testing for it before choosing adds its own error. Welch's t-test is valid whether or not variances are equal. Defaulting to the safer choice and documenting why is more honest than asking the researcher to make a choice they may not be equipped to evaluate.
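In SciPy terms, for example, the distinction is a single argument. A minimal sketch with invented data where the spreads clearly differ:

```python
import numpy as np
from scipy import stats

a = np.array([4.1, 4.8, 5.2, 4.9, 5.0, 4.6])   # tight spread
b = np.array([5.9, 7.4, 6.1, 8.0, 5.5, 7.2])   # visibly wider spread

# Student's t-test assumes equal variances across the two groups.
_, p_student = stats.ttest_ind(a, b, equal_var=True)

# Welch's t-test drops that assumption and remains valid either way.
_, p_welch = stats.ttest_ind(a, b, equal_var=False)

print(f"Student: p={p_student:.4f}")
print(f"Welch:   p={p_welch:.4f}")
```

When the variances happen to be equal, the two tests give nearly identical results, so the safer default costs little.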

Outlier sensitivity is computed, not acknowledged. When an outlier is detected, Licklider does not ask whether to remove it. It computes the analysis both ways, compares the p-value, the effect size, and the confidence interval, and records the difference. If the conclusion changes, that is disclosed. If it does not, that is also disclosed. The figure reflects the full dataset unless the researcher explicitly decides otherwise.
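A minimal sketch of that both-ways computation. The IQR detection rule and the sample data are stand-ins chosen for illustration; this post does not specify which detection method Licklider actually uses.

```python
import numpy as np
from scipy import stats

def iqr_outliers(x, k=1.5):
    """Flag points beyond k * IQR from the quartiles (an illustrative rule)."""
    q1, q3 = np.percentile(x, [25, 75])
    lo, hi = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return (x < lo) | (x > hi)

a = np.array([5.1, 4.9, 5.3, 5.0, 5.2, 9.8])   # 9.8 is a likely outlier
b = np.array([6.0, 6.3, 5.9, 6.2, 6.1, 6.4])

# Run the same Welch's t-test both ways and record the difference.
_, p_with = stats.ttest_ind(a, b, equal_var=False)
_, p_without = stats.ttest_ind(a[~iqr_outliers(a)], b, equal_var=False)

changed = (p_with < 0.05) != (p_without < 0.05)
print(f"p with outlier:    {p_with:.4f}")
print(f"p without outlier: {p_without:.4f}")
print(f"conclusion changed at alpha=0.05: {changed}")
```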

Export is blocked when semantics are unresolved. An error bar with no declared meaning is not an error bar. It is a shape. Licklider will not produce a claim-bearing export if the error bar type has not been confirmed. This feels restrictive until a reviewer asks what your error bars represent and you realize you are not certain.
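A sketch of the guard idea in Python. `ErrorBarType`, `UnresolvedSemanticsError`, and `export_figure` are hypothetical names invented here, not Licklider's API.

```python
from __future__ import annotations
from enum import Enum

class ErrorBarType(Enum):
    SD = "standard deviation"
    SEM = "standard error of the mean"
    CI95 = "95% confidence interval"

class UnresolvedSemanticsError(Exception):
    """Raised when a claim-bearing export lacks a declared error bar meaning."""

def export_figure(figure, error_bars: ErrorBarType | None) -> None:
    """Hypothetical export entry point: refuses to run until semantics resolve."""
    if error_bars is None:
        raise UnresolvedSemanticsError(
            "Error bar type not confirmed; declare SD, SEM, or CI95 before export."
        )
    # From here the export can embed the declared meaning in the caption,
    # e.g. f"Error bars: {error_bars.value}."
    ...
```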

These decisions make Licklider slower than a tool that asks nothing. That is the point.


What we cannot do

Licklider checks the statistical validity of analyses within its supported method set. It does not validate the scientific validity of a research question. It cannot detect pseudoreplication when no subject ID column exists in the data. It does not verify that a stated pre-registration matches what was actually analyzed. It cannot determine causation from association.

We document these limits in detail in the Limitations and Guard Interpretation section of the reference documentation, because a tool that hides its limits is not a trustworthy tool.


Who we are

I am Tasuku Kimura. I started building Licklider because I kept running into the same gap: researchers who wanted to do rigorous work but had no tooling that met them there. The existing options were either too blunt or too specialized, and none of them treated correctness as a first-class output.

Licklider is still early. The method coverage is not complete. Some features are partial. The public audit trail that will let reviewers verify analyses directly is planned but not yet shipped.

I am building this in the open. The reference documentation lists what is supported, what is partial, and what is not yet available. The limitations page lists what the system cannot catch. If something is missing or wrong, I want to know.


Where this goes

Research deserves infrastructure that makes statistical integrity the default, not the exception. That is the only goal.

Try Licklider → Subscribe to RSS → hello@licklider.ai