Two-Way ANOVA
Use this page to decide when two-way ANOVA fits your design, what outputs Licklider returns, how to read the interaction-first workflow, and where the current implementation still has limits.
When to use two-way ANOVA
Use two-way ANOVA when you have:
- A continuous outcome variable
- Two independent categorical factors (for example, treatment and dose, genotype and timepoint, sex and drug)
- A question about whether the two factors interact, that is, whether the effect of one factor changes depending on the level of the other
Common two-way designs in life science research:
- Dose (0 / 10 / 50 uM) x Genotype (wild-type / knockout)
- Treatment (control / treated) x Timepoint (day 1 / day 3 / day 7)
- Sex (male / female) x Drug (vehicle / compound A / compound B)
If you have only one categorical factor, use One-Way ANOVA. If your data are ordinal or clearly non-normal with small samples, see Non-Parametric Alternatives for rank-based alternatives. Repeated-measures designs (the same subjects measured at each timepoint) require repeated-measures ANOVA, which is coming soon.
Assumptions
Licklider checks the following automatically for two-way ANOVA when they can be evaluated from the submitted cell values and factor labels. Results for those automatic checks appear in the Assumptions panel.
Not every important design risk can be inferred from the uploaded table alone. In particular, independence problems such as repeated measures, pseudoreplication, clustering by litter or plate, or rows that already average over technical replicates are not reliably auto-detectable unless the study design is declared explicitly upstream. Read the design-structure notes below as limits, not as checks that Licklider can guarantee for you.
Normality
Licklider runs Shapiro-Wilk on each cell, that is, each unique combination of Factor A and Factor B levels. For example, a 2 x 3 design has six cells, each checked independently. Cells with n < 3 or n > 5,000 are skipped.
Two-way ANOVA is moderately robust to normality violations when cell sizes are sufficient (roughly n ≥ 10 per cell). With small cells, violations matter more. If multiple cells are flagged, consider whether a transformation or a non-parametric alternative is appropriate.
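The per-cell check described above can be sketched outside Licklider as follows. This is a minimal illustration, not Licklider's implementation: the function name `shapiro_by_cell`, the column names, the demo data, and the `alpha=0.05` flagging threshold are all assumptions made for the example; only the skip rule (n < 3 or n > 5,000) comes from the text.

```python
import numpy as np
import pandas as pd
from scipy import stats


def shapiro_by_cell(df, outcome, factor_a, factor_b, alpha=0.05):
    """Run Shapiro-Wilk separately in every Factor A x Factor B cell."""
    rows = []
    for (a, b), cell in df.groupby([factor_a, factor_b]):
        values = cell[outcome].dropna().to_numpy()
        n = len(values)
        if n < 3 or n > 5000:
            # Skip rule taken from the text; no test is run for this cell
            rows.append({factor_a: a, factor_b: b, "n": n,
                         "W": np.nan, "p": np.nan, "status": "skipped"})
            continue
        w, p = stats.shapiro(values)
        # alpha is an assumed threshold for this sketch
        status = "flagged" if p < alpha else "ok"
        rows.append({factor_a: a, factor_b: b, "n": n,
                     "W": w, "p": p, "status": status})
    return pd.DataFrame(rows)


# Hypothetical 2 x 3 design with n = 8 per cell (simulated values)
rng = np.random.default_rng(0)
demo = pd.DataFrame(
    [
        {"treatment": t, "dose": d, "viability": v}
        for t in ["control", "treated"]
        for d in [0, 10, 50]
        for v in rng.normal(90.0, 5.0, size=8)
    ]
)
report = shapiro_by_cell(demo, "viability", "treatment", "dose")
print(report)
```

The report has one row per cell, so a 2 x 3 design yields six rows. With n = 8 per cell the test has limited power, which is one reason a clean report should not be read as proof of normality.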
Variance homogeneity
Licklider runs Levene's test across all cells. A flag appears when p < 0.05. Unequal variances across cells are more consequential in two-way designs than in one-way designs because the error term is shared across all comparisons.
Post hoc procedures such as Tukey HSD are typically derived under a common (equal) variance assumption across the groups being compared. When Levene suggests heterogeneity, treat the omnibus result and pairwise post hoc output as more uncertain: a non-significant Levene test does not prove equal variances, and a significant Levene test does not specify which cells differ. For pairwise contrasts under unequal variances, references such as statsmodels document Games-Howell as the unequal-variance counterpart to Tukey; Licklider's available post hoc menu and defaults match the One-Way ANOVA path, so check that page for which methods are offered and any planned extensions.
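As a rough illustration of a Levene check across all cells, the sketch below uses `scipy.stats.levene`, whose default centres on the median (the Brown-Forsythe variant). Which centring Licklider uses is not stated here, so treat that choice, the function name `levene_across_cells`, and the made-up data as assumptions.

```python
import pandas as pd
from scipy import stats


def levene_across_cells(df, outcome, factor_a, factor_b, alpha=0.05):
    """Levene's test with every Factor A x Factor B cell as one group."""
    groups = [cell[outcome].to_numpy()
              for _, cell in df.groupby([factor_a, factor_b])]
    # SciPy default is center="median"; assumed, not confirmed for Licklider
    stat, p = stats.levene(*groups)
    return {"statistic": float(stat), "p": float(p), "flagged": bool(p < alpha)}


# Made-up data: five tight cells and one cell with a much larger spread
tight = [10.0, 10.1, 9.9, 10.05, 9.95, 10.0, 10.1, 9.9]
wide = [0.0, 20.0, 5.0, 15.0, 2.0, 18.0, 8.0, 12.0]
records = []
for t in ["control", "treated"]:
    for d in [0, 10, 50]:
        values = wide if (t == "treated" and d == 50) else tight
        records += [{"treatment": t, "dose": d, "y": v} for v in values]
demo = pd.DataFrame(records)

result = levene_across_cells(demo, "y", "treatment", "dose")
print(result)
```

A flag from this check says only that the cell variances differ somewhere; inspecting per-cell standard deviations is still needed to see which cell is responsible.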
Cell balance
Two-way ANOVA is sensitive to severely unequal cell sizes. For fixed-effects factorial analyses with unbalanced counts, Licklider defaults to Type II sum of squares because it aligns with hypotheses that are often prioritized when interactions are absent or weak [1, 2]; large imbalances (largest cell more than twice the smallest) still reduce power and can distort estimates. A warning appears when substantial imbalance is detected.
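The "largest cell more than twice the smallest" rule of thumb can be checked with a few lines of standard-library Python. This is a sketch under assumptions: the helper name `cell_balance` is hypothetical, and the exact threshold Licklider uses to raise its warning is not stated in this page.

```python
from collections import Counter


def cell_balance(cell_labels, ratio_threshold=2.0):
    """Summarise cell sizes from a list of (factor_a_level, factor_b_level) labels."""
    counts = Counter(cell_labels)
    largest = max(counts.values())
    smallest = min(counts.values())
    ratio = largest / smallest
    # Note: a completely empty cell never appears in `counts`,
    # so empty cells need a separate check against the intended design.
    return {"counts": dict(counts), "ratio": ratio,
            "imbalanced": ratio > ratio_threshold}


# Hypothetical 2 x 2 layout where one cell lost most of its samples
labels = (
    [("control", 0)] * 8
    + [("control", 10)] * 8
    + [("treated", 0)] * 3
    + [("treated", 10)] * 8
)
summary = cell_balance(labels)
print(summary["ratio"], summary["imbalanced"])
```

Here the ratio is 8 / 3, which exceeds 2, so the layout would count as substantially imbalanced under the rule of thumb quoted above.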
Independence
Each observation must come from a different subject. If the same subjects appear at multiple levels of either factor, use a repeated-measures or mixed design.
Licklider cannot reliably infer from the analysis table alone whether two rows came from the same animal, donor, litter, plate, or other higher-level unit if that structure is not encoded in the study setup. That means Licklider cannot automatically detect many independence violations, including repeated-measures data entered as if they were independent rows and pseudoreplication caused by treating technical replicates as separate biological observations. If that happens, the standard errors can be too small and both main-effect and interaction p-values can look more convincing than they should.
Within-cell replication
Ordinary two-way ANOVA estimates within-cell error from replication: you need more than one independent observation in at least some cells. If each cell contains only a single value (for example one well per condition after plate-level summarisation, or one aggregate per factor combination), there is no within-cell residual in the usual sense and you cannot fit or test the interaction in the standard two-way ANOVA framework. In that situation, redesign the data collection to retain replicates, use a model appropriate to the measurement scale, or seek a design-specific analysis rather than forcing a full factorial ANOVA on aggregated summaries.
Licklider can warn about the visible end state of a no-replication table, but it cannot always tell whether the table has already collapsed over technical replicates or nested structure before upload. If the uploaded rows are summaries rather than independent observational units, the ANOVA result can appear cleaner and more stable than the underlying experiment really supports.
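The replication requirement is a matter of degrees of freedom. For a full factorial model with the interaction term and every cell observed, the residual degrees of freedom equal the total number of observations minus the number of cells. The sketch below (hypothetical helper name `residual_df`) shows that one value per cell leaves nothing to estimate within-cell error.

```python
from collections import Counter


def residual_df(cell_labels):
    """Residual df for a full two-way model with interaction: N - number of cells."""
    counts = Counter(cell_labels)
    n_total = sum(counts.values())
    n_cells = len(counts)
    return n_total - n_cells


cells = [(t, d) for t in ["control", "treated"] for d in [0, 10, 50]]

# 2 x 3 design with 8 independent observations per cell
replicated = [cell for cell in cells for _ in range(8)]
# Same design after collapsing to one summary value per cell
collapsed = list(cells)

print(residual_df(replicated))  # 48 observations - 6 cells
print(residual_df(collapsed))   # 6 observations - 6 cells
```

With 8 observations per cell the residual df is 42; with one value per cell it is 0, so the interaction cannot be tested against a within-cell error term.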
Reading the ANOVA table
The two-way ANOVA table has three inferential rows, one per source of variation. Always read the interaction row first.
| Source | F | df | p | Partial eta^2 | Generalized eta^2 |
|---|---|---|---|---|---|
| Factor A | | | | | |
| Factor B | | | | | |
| A x B (interaction) | | | | | |
Step 1 - Check the interaction (A x B)
The interaction F-test asks: does the effect of Factor A differ depending on which level of Factor B you are in?
If A x B is statistically significant:
- The effect of each factor is not uniform across the levels of the other.
- Main effects of A and B exist in the table, but interpreting them as simple summaries is misleading. Proceed to simple effects analysis.
If A x B is not statistically significant:
- You have not detected an interaction at the chosen alpha in this sample. That is not the same as proving that effects are identical across levels of the other factor: it may reflect low power, small cell sizes, or a real but small interaction. A large p-value is not evidence that there is no interaction. If the interaction term is not statistically significant, you may proceed to main-effect follow-ups, but keep the usual caution about over-interpreting null results.
Warning: When the interaction is statistically significant, Licklider displays a warning in the panel. Do not interpret the main effect F-tests as standalone summaries. They describe marginal (averaged) effects that may mask opposite-direction effects within subgroups.
Step 2a - When interaction is statistically significant: Simple effects
Simple effects analysis tests the effect of Factor A separately at each level of Factor B (and optionally the reverse). This is the correct follow-up to a statistically significant interaction.
Example: If treatment x dose is statistically significant, simple effects would test:
- Effect of treatment at dose = 0 uM
- Effect of treatment at dose = 10 uM
- Effect of treatment at dose = 50 uM
Each simple effect is a separate F-test that uses the pooled MS_within from the full two-way model as its error term, so the tests share one error estimate and are not statistically independent of one another. Licklider runs simple effects automatically when the interaction is statistically significant and displays them as a separate table below the main ANOVA output.
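A simple effect with a pooled error term can be sketched as follows: the between-group sum of squares is computed within one level of the other factor, then divided by MS_within from all cells of the full design. The function name `simple_effects`, the simulated data, and the absence of any multiplicity adjustment are assumptions of this sketch; whether Licklider adjusts simple-effect p-values is not stated here.

```python
import numpy as np
import pandas as pd
from scipy import stats


def simple_effects(df, outcome, factor, at):
    """F-test for `factor` at each level of `at`, using pooled MS_within."""
    cells = df.groupby([factor, at])[outcome]
    # Pooled within-cell error from every cell of the two-way layout
    ss_within = ((df[outcome] - cells.transform("mean")) ** 2).sum()
    df_within = len(df) - cells.ngroups
    ms_within = ss_within / df_within

    rows = []
    for level, sub in df.groupby(at):
        grp = sub.groupby(factor)[outcome]
        means, ns = grp.mean(), grp.size()
        # Between-group SS for `factor` inside this level of `at`
        ss_effect = (ns * (means - sub[outcome].mean()) ** 2).sum()
        df_effect = len(means) - 1
        f_value = (ss_effect / df_effect) / ms_within
        p_value = stats.f.sf(f_value, df_effect, df_within)
        rows.append({at: level, "F": f_value, "df_effect": df_effect,
                     "df_error": df_within, "p": p_value})
    return pd.DataFrame(rows)


# Simulated data: treatment matters only at 50 uM (assumed pattern)
rng = np.random.default_rng(1)
demo = pd.DataFrame(
    [
        {"treatment": t, "dose": d, "viability": v}
        for t in ["control", "treated"]
        for d in [0, 10, 50]
        for v in rng.normal(65.0 if (t == "treated" and d == 50) else 90.0, 5.0, size=8)
    ]
)
result = simple_effects(demo, "viability", factor="treatment", at="dose")
print(result)
```

Each row uses the same error degrees of freedom (42 in a 2 x 3 design with n = 8), which is what distinguishes a simple effect from a separate t-test or one-way ANOVA run on a subset of the data.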
Licklider also displays post hoc comparisons for the main effects of Factor A and Factor B. When the interaction is statistically significant, these comparisons reflect marginal (averaged) means across all levels of the other factor — they are shown with a warning and should be read alongside simple effects, not as standalone results. They answer "is there an overall difference between treatments, averaging across all doses?" rather than "is there a difference between treatments specifically at 0 uM?"
Pairwise comparisons within each simple effect — identifying which specific pairs differ at each level of Factor B — are coming soon.
Step 2b - When interaction is not statistically significant: Main effects
Interpret Factor A and Factor B F-tests as you would in separate one-way ANOVAs, with one advantage: the shared error term (MS_within) from the full model increases power relative to running two independent one-way ANOVAs.
Proceed to post hoc tests for any statistically significant main effect with three or more levels.
Interaction plot
Licklider generates an interaction plot automatically: a line plot with Factor B levels on the x-axis, a separate line per level of Factor A, and group means on the y-axis (with standard error bars). Roughly parallel lines suggest little or no interaction; crossing or clearly diverging lines suggest one. Because the lines are drawn from sample means, read the plot together with the interaction F-test rather than as a test in itself. The plot is included in the export bundle.
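The numbers behind such a plot are just cell means and standard errors arranged by factor level. The sketch below computes them with pandas on a small made-up dataset (two values per cell, chosen so the arithmetic is easy to follow) and then measures how far the lines are from parallel. Column names and values are assumptions for illustration.

```python
import pandas as pd

# Made-up values, two per cell, for a 2 x 3 design
cells = {
    ("control", 0): [88, 92], ("control", 10): [86, 90], ("control", 50): [84, 88],
    ("treated", 0): [87, 91], ("treated", 10): [80, 84], ("treated", 50): [60, 64],
}
df = pd.DataFrame(
    [
        {"treatment": t, "dose": d, "viability": float(v)}
        for (t, d), values in cells.items()
        for v in values
    ]
)

# One row per dose (x-axis), one column per treatment (one line each)
summary = df.groupby(["dose", "treatment"])["viability"].agg(["mean", "sem"])
means = summary["mean"].unstack("treatment")
sems = summary["sem"].unstack("treatment")
print(means)

# Parallel lines would give a constant gap between treatments at every dose
gap = means["treated"] - means["control"]
print(gap)
```

In this made-up example the gap widens from -1 at 0 uM to -24 at 50 uM, which is the diverging pattern an interaction plot would show. The `sems` table holds the values that would be drawn as error bars.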
Effect size
Licklider reports two effect size measures for each term in the ANOVA table.
Partial eta^2
The proportion of variance in the outcome attributable to each term, after removing the variance explained by all other terms. Partial eta^2 values for individual terms in a two-way model do not sum to the total R^2.
Partial eta^2 is the most commonly reported effect size for two-way ANOVA and enables within-study comparisons between terms. However, it is not directly comparable across studies with different designs, because the value depends on which other factors are in the model [3, 4].
| Partial eta^2 | Conventional label |
|---|---|
| 0.01 | Small |
| 0.06 | Medium |
| 0.14 | Large |
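Partial eta^2 can be recovered from a reported F statistic and its degrees of freedom, because SS_effect / (SS_effect + SS_error) equals (F x df_effect) / (F x df_effect + df_error). The sketch below applies that identity; the helper names are hypothetical, and treating the table values as lower bounds for each label is an assumption of this example.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Partial eta^2 from F and its degrees of freedom."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)


def conventional_label(eta_sq):
    """Map a value to the conventional labels, read as lower bounds (assumption)."""
    if eta_sq >= 0.14:
        return "Large"
    if eta_sq >= 0.06:
        return "Medium"
    if eta_sq >= 0.01:
        return "Small"
    return None  # below the smallest conventional benchmark


# Example: F(2, 42) = 5.14
value = partial_eta_squared(5.14, 2, 42)
print(round(value, 2), conventional_label(value))
```

The same function is a quick way to sanity-check a published ANOVA table when only F and df are reported.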
Generalized eta^2 (eta_G^2), recommended for cross-study comparison
Generalized eta^2 estimates the proportion of variance in the outcome attributable to a factor across studies, regardless of whether other factors are manipulated or measured. When comparing effect sizes across publications with different factorial structures, use generalized eta^2 [3, 4].
Licklider displays both measures. For reporting within a single study, either is acceptable. For meta-analyses or comparisons across labs, report generalized eta^2.
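For between-subjects designs, the generalized eta^2 of Olejnik and Algina can be written as SS_effect divided by (delta x SS_effect + the sum of SS for all terms involving a measured factor + SS_error), where delta is 1 for an effect built only from manipulated factors and 0 otherwise. The sketch below implements that formula. How Licklider classifies each factor as manipulated or measured is not described on this page, so the `measured` argument, the function name, and the sums of squares are assumptions for illustration.

```python
def generalized_eta_squared(ss, effect, measured=frozenset()):
    """Generalized eta^2 for a between-subjects factorial design.

    ss       : dict mapping each term (tuple of factor names) to its SS,
               plus the key "error" for the within-cell SS
    effect   : the term of interest, e.g. ("A",) or ("A", "B")
    measured : names of factors that are measured rather than manipulated
    """
    def involves_measured(term):
        return any(name in measured for name in term)

    terms = [t for t in ss if t != "error"]
    delta = 0 if involves_measured(effect) else 1
    ss_measured = sum(ss[t] for t in terms if involves_measured(t))
    return ss[effect] / (delta * ss[effect] + ss_measured + ss["error"])


# Made-up sums of squares for a two-way design
ss = {("A",): 10.0, ("B",): 20.0, ("A", "B"): 5.0, "error": 65.0}

# Both factors manipulated: identical to partial eta^2
print(generalized_eta_squared(ss, ("A",)))
# Factor B measured (for example sex or genotype): denominator grows
print(generalized_eta_squared(ss, ("A",), measured={"B"}))
```

When every factor is manipulated, the result equals partial eta^2 (10 / 75 here). Declaring B as measured adds the B and A x B variation to the denominator, giving 10 / 100 for the same effect.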
Post hoc tests
Post hoc tests compare the marginal means of each factor — that is, the mean of each level averaged across all levels of the other factor.
In the current product, two-way ANOVA returns three output types: the main ANOVA table, an interaction plot included in the export bundle, and post hoc comparisons for the main effects of Factor A and Factor B. When the interaction is statistically significant, Licklider also adds a separate simple-effects table. Pairwise comparisons inside each simple effect are not yet returned in the current panel.
When the interaction is not statistically significant, marginal means are a valid summary and post hoc comparisons are directly interpretable. Run post hoc tests for any statistically significant main effect with three or more levels.
When the interaction is statistically significant, Licklider still runs post hoc tests on marginal means and displays them with a warning. These results can indicate the direction of an overall trend, but they do not replace simple effects analysis and should not be the basis for specific pairwise claims.
Post hoc method options are the same as in One-Way ANOVA. Methods such as Tukey HSD assume roughly equal variances among the means being compared; when Levene flags heterogeneity, read those contrasts together with the variance-homogeneity section above rather than treating them as automatically interchangeable with unequal-variance alternatives.
Post hoc comparison table (main effects)
| Column | What it means |
|---|---|
| Factor | Which main effect this comparison belongs to (A or B). |
| Group A / Group B | The two levels being compared. |
| Mean difference | Marginal mean of group A minus marginal mean of group B. |
| 95% CI | Confidence interval on the mean difference. |
| p (adjusted) | p-value after correction for multiple comparisons. |
| Cohen's d | Standardised effect size for this pair. |
When the interaction is statistically significant, Licklider displays a warning above this table. Marginal means collapse across one factor and may obscure opposite-direction effects in subgroups. Read these comparisons alongside the simple effects table, not as a substitute for it.
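As a worked example of the Cohen's d column, the sketch below standardises a mean difference by the pooled standard deviation of the two groups. This page does not say which standardiser Licklider uses (the two-group pooled SD or the residual SD from the full model), so the pooled-SD choice, the function name, and the data are assumptions.

```python
import statistics


def cohens_d(group_a, group_b):
    """Cohen's d using the pooled SD of the two groups (assumed standardiser)."""
    n_a, n_b = len(group_a), len(group_b)
    var_a = statistics.variance(group_a)  # sample variance, n - 1 denominator
    var_b = statistics.variance(group_b)
    pooled_var = ((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)
    mean_diff = statistics.mean(group_a) - statistics.mean(group_b)
    return mean_diff / pooled_var ** 0.5


# Made-up values: means 12 and 9, each group with sample variance 4
d = cohens_d([10.0, 12.0, 14.0], [7.0, 9.0, 11.0])
print(d)
```

The sign follows the "Group A minus Group B" convention of the mean difference column, so swapping the groups flips the sign of d.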
A note on Type II sum of squares
Licklider uses Type II sum of squares for all two-way ANOVA analyses. That default targets common fixed-effects factorial analyses in life science, where main effects are often read in settings where the interaction is absent or not the primary focus. IBM SPSS Statistics and GraphPad Prism instead default to Type III sums of squares for factorial models; your results will not match those packages unless you align the SS type and contrast coding across tools.
Type II SS tests each main effect after accounting for the other main effect, but not the interaction. Under unbalanced counts, Langsrud and related work argue that Type II can be preferable to Type III in some conditions because of power and the hypotheses being tested [1, 2]. That does not make Type II universally "more correct": Type III is widely used when researchers want each effect adjusted for every other term in the model, and pre-specified analysis plans or journal or regulator expectations may require Type III.
If your analysis plan pre-specifies Type III SS (for example, to match a regulatory submission format), note that Licklider's results will differ from software that defaults to Type III.
Example - interaction present
Scenario
A researcher measures cell viability (%) in a 2 x 3 design: Treatment (control / treated) x Dose (0 uM / 10 uM / 50 uM). n = 8 per cell.
Result (hypothetical)
Two-way ANOVA (Type II SS):
| Factor | F | df | p | Partial eta^2 | Generalized eta^2 |
|---|---|---|---|---|---|
| Treatment | 4.21 | 1, 42 | .046 | 0.09 | 0.07 |
| Dose | 6.83 | 2, 42 | .003 | 0.25 | 0.18 |
| Treatment x Dose | 5.14 | 2, 42 | .010 | 0.20 | 0.14 |
Statistically significant interaction detected (p = .010). Main effect comparisons should be interpreted with caution.
Interpretation
The statistically significant interaction (F(2, 42) = 5.14, p = .010, generalized eta^2 = 0.14) indicates that the effect of treatment on viability depends on dose. Simple effects analysis showed that the treatment effect was statistically detectable at 50 uM (p = .002) but not at 0 uM (p = .41) or 10 uM (p = .18).
Licklider also displays post hoc comparisons for the main effects of treatment and dose (marginal means). These are shown with a warning because averaging across dose levels obscures the dose-dependent nature of the treatment effect. The main effect F-tests and their marginal comparisons describe the average pattern across all conditions and should not be reported as evidence that treatment has a uniform effect.
For factors with three or more levels, identifying which specific pairs differ within a given level of the other factor requires pairwise comparisons within simple effects, which will be available in a coming update.
Example - no statistically significant interaction
Scenario
Same design. Interaction p = .437.
Result (hypothetical)
Two-way ANOVA (Type II SS):
| Factor | F | df | p | Partial eta^2 | Generalized eta^2 |
|---|---|---|---|---|---|
| Treatment | 9.72 | 1, 42 | .003 | 0.19 | 0.14 |
| Dose | 7.44 | 2, 42 | .002 | 0.26 | 0.19 |
| Treatment x Dose | 0.84 | 2, 42 | .437 | 0.04 | 0.03 |
Interpretation
In this sample, the interaction was not statistically significant (F(2, 42) = 0.84, p = .437, generalized eta^2 = 0.03): the data did not provide strong evidence of a non-additive treatment-by-dose pattern at conventional alpha, but that does not establish that effects are identical at every dose (non-significance is not evidence of no interaction). Both main effects were statistically significant: treatment (F(1, 42) = 9.72, p = .003, generalized eta^2 = 0.14) and dose (F(2, 42) = 7.44, p = .002, generalized eta^2 = 0.19). Post hoc testing on the dose main effect (Tukey HSD) showed that viability differed between 0 uM and 50 uM (p = .003, d = 0.91) but not between adjacent doses.
Design Rationale & References
Licklider's design choices
Licklider defaults to Type II sum of squares for two-way ANOVA in fixed-effects factorial analyses common in life science, emphasising main-effect questions when interactions are absent or secondary [1, 2]. That choice trades off against Type III, which many packages default to and which partials out other terms differently; neither is universally appropriate for every design or reporting standard. Generalized eta^2 is reported alongside partial eta^2 because partial eta^2 is not directly comparable across studies with different factorial structures; generalized eta^2 remains interpretable regardless of how many factors are manipulated [3, 4]. The interaction term is always included in the model: running a main-effects-only two-way ANOVA would obscure a potentially meaningful biological signal. Exact p-values are reported throughout rather than replacing inference with dichotomous labels alone.
Methodological foundations
Shaw, R. G., & Mitchell-Olds, T. (1993). ANOVA for unbalanced data: An overview. Ecology, 74(6), 1638-1645.
→ A foundational primer for biologists on the consequences of unequal cell sizes and when Type I, II, and III SS address biologically appropriate hypotheses.
Langsrud, O. (2003). ANOVA for unbalanced data: Use Type II instead of Type III sums of squares. Statistics and Computing, 13(2), 163-167.
→ Demonstrates mathematically that Type II SS yields greater power and tests more realistic hypotheses than Type III when interactions are absent, the direct basis for Licklider's SS default.
Olejnik, S., & Algina, J. (2003). Generalized eta and omega squared statistics: Measures of effect size for some common research designs. Psychological Methods, 8(4), 434-447.
→ Introduces generalized eta^2 to solve the cross-study comparability problem of partial eta^2; the theoretical basis for Licklider's generalized eta^2 reporting.
Bakeman, R. (2005). Recommended effect size statistics for repeated measures designs. Behavior Research Methods, 37(3), 379-384.
→ Applied companion to Olejnik and Algina, explicitly recommending generalized eta^2 over partial eta^2 for multi-factor designs to ensure comparability across the literature.
Known limitations
Rosnow, R. L., & Rosenthal, R. (1989). Definition and interpretation of interaction effects. Psychological Bulletin, 105(1), 143-146.
→ Exposes the pervasive error of interpreting main effects when the interaction is statistically significant, directly motivating Licklider's interaction warning and simple effects follow-up.
McClelland, G. H., & Judd, C. M. (1993). Statistical difficulties of detecting interactions and moderator effects. Psychological Bulletin, 114(2), 376-390.
→ Demonstrates that detecting a two-way interaction requires substantially larger samples than detecting main effects; most life science experiments are underpowered for interaction detection.
Paradigm shifts worth knowing
Sommet, N., Weissman, D. L., Chentsova-Dutton, Y. E., & Elliot, A. J. (2023). How many participants do I need to test an interaction? Advances in Methods and Practices in Psychological Science, 6(3).
→ Establishes that detecting an interaction reliably requires 4 to 16 times the sample size needed for main effects, a result with direct implications for how two-way ANOVA results should be interpreted when cell sizes are small.
Implementation boundaries
- The current page describes the implemented two-way ANOVA path: an ANOVA table with F, df, p, partial eta^2, and generalized eta^2; an interaction plot in the export bundle; a simple-effects table when the interaction is statistically significant; and post hoc comparisons for the main effects of Factor A and Factor B.
- Pairwise comparisons within each simple effect are not yet available in the current panel, so a statistically significant interaction does not yet expand into dose-specific or subgroup-specific pairwise tables automatically.
- Repeated-measures factorial designs are not supported in this route. If the same subject appears across timepoints or other within-subject factor levels, use a repeated-measures or mixed-model path instead.
- Licklider does not automatically detect every structural misuse of two-way ANOVA. In particular, pseudoreplication, nested data entered as flat rows, and independence violations that depend on study design metadata rather than observed values can pass through unless the observation unit and replicate structure are declared correctly upstream.
- Automatic assumption checks help with distributional and balance diagnostics, but they do not certify that the design itself is appropriate. Treat the warnings as decision support, not as a guarantee that the model choice is safe.
See also
- One-Way ANOVA - single categorical factor
- t-Test - exactly two groups
- Non-Parametric Alternatives - rank-based alternatives for non-normal data
- Group Comparison overview - test selection guide