Skip to content
Documentation GitHub
Database Issues

SQLite GROUP BY/HAVING on Non-Aggregate Columns Is Undefined

SQLite GROUP BY/HAVING on Non-Aggregate Columns Is Undefined

Problem

A point-in-time page history reconstruction query used GROUP BY field HAVING changed_at = MAX(changed_at) to select the most recent value of each field from a shadow table. The query appeared to work in tests but returned indeterminate results because the non-aggregate column new_value in the SELECT is not guaranteed to come from the row with MAX(changed_at).

Symptoms:

  • Query returns plausible-looking data, making the bug invisible in simple tests
  • Under certain row orderings or SQLite optimizer decisions, wrong new_value is returned
  • Only detectable when the “latest” value differs from adjacent rows for the same field

Investigation

Steps Tried

  1. Initial implementation used GROUP BY/HAVING pattern because it’s syntactically compact
  2. Code review flagged it as undefined SQL behavior — standard SQL requires all SELECT columns to be either aggregated or in the GROUP BY clause
  3. Verified that SQLite documents this as a “bare column” extension with non-deterministic behavior unless MIN/MAX optimization applies to a specific column pattern

Root Cause

SQLite has a bare columns in aggregate queries extension that sometimes returns values from the row containing MIN/MAX — but this is not guaranteed and depends on internal optimizer behavior. The query:

SELECT field, new_value FROM page_history
WHERE page_id = ?1 AND changed_at <= ?2
GROUP BY field
HAVING changed_at = MAX(changed_at)

Has two problems:

  1. new_value is a bare column — SQLite may return it from any row in the group
  2. The HAVING clause compares changed_at (bare) to MAX(changed_at) — the bare changed_at is itself indeterminate

Solution

Replace with a window function that deterministically selects the latest row per field:

-- Before (WRONG — undefined behavior)
SELECT field, new_value FROM page_history
WHERE page_id = ?1 AND changed_at <= ?2
GROUP BY field
HAVING changed_at = MAX(changed_at)
-- After (CORRECT — deterministic)
SELECT field, new_value FROM (
SELECT field, new_value,
ROW_NUMBER() OVER (PARTITION BY field ORDER BY changed_at DESC) AS rn
FROM page_history
WHERE page_id = ?1 AND changed_at <= ?2
) WHERE rn = 1

Why This Works

  • ROW_NUMBER() OVER (PARTITION BY field ORDER BY changed_at DESC) assigns rank 1 to the most recent entry per field
  • The outer WHERE rn = 1 selects exactly that row
  • new_value comes from the same row as the max changed_at — guaranteed

Alternative Approaches

  • Correlated subquery: WHERE changed_at = (SELECT MAX(changed_at) FROM page_history WHERE ...) — works but less efficient for multiple fields
  • Self-join on MAX: Possible but verbose for this pattern
  • Window function is the cleanest and most performant approach

Prevention

Best Practices

  • Never use bare (non-aggregated) columns in GROUP BY queries, even in SQLite
  • Use window functions (ROW_NUMBER, RANK, DENSE_RANK) for “latest per category” queries
  • SQLite supports window functions since 3.25.0 (2018) — safe to use everywhere

Warning Signs

  • Any query with GROUP BY where SELECT includes columns not in the GROUP BY and not wrapped in an aggregate function
  • Queries that “happen to work” in tests but aren’t deterministic by SQL standard

Pattern Template

For any “get the latest/first row per group” query:

SELECT <columns> FROM (
SELECT <columns>,
ROW_NUMBER() OVER (PARTITION BY <group_col> ORDER BY <sort_col> DESC) AS rn
FROM <table>
WHERE <conditions>
) WHERE rn = 1

References

  • Commit: e496bb1 — fix(checkpoint): address P1/P2/P3 review findings
  • SQLite docs: Bare columns in aggregate queries
  • File: crates/infrastructure/sqlite/src/checkpoint/checkpoint_repository.rs (removed — checkpoint system eliminated per ADR-013)

Was this page helpful?