Database Issues

SQLite GROUP BY/HAVING on Non-Aggregate Columns Is Undefined

Problem

A point-in-time page history reconstruction query used GROUP BY field HAVING changed_at = MAX(changed_at) to select the most recent value of each field from a shadow table. The query appeared to work in tests but returned indeterminate results because the non-aggregate column new_value in the SELECT is not guaranteed to come from the row with MAX(changed_at).

Symptoms:

Query returns plausible-looking data, making the bug invisible in simple tests
Under certain row orderings or SQLite optimizer decisions, wrong new_value is returned
Only detectable when the “latest” value differs from adjacent rows for the same field

Investigation

Steps Tried

Initial implementation used GROUP BY/HAVING pattern because it’s syntactically compact
Code review flagged it as undefined SQL behavior — standard SQL requires all SELECT columns to be either aggregated or in the GROUP BY clause
Verified that SQLite documents this as a “bare column” extension with non-deterministic behavior unless MIN/MAX optimization applies to a specific column pattern

Root Cause

SQLite has a bare columns in aggregate queries extension that sometimes returns values from the row containing MIN/MAX — but this is not guaranteed and depends on internal optimizer behavior. The query:

SELECT field, new_value FROM page_history
WHERE page_id = ?1 AND changed_at <= ?2
GROUP BY field
HAVING changed_at = MAX(changed_at)

Has two problems:

new_value is a bare column — SQLite may return it from any row in the group
The HAVING clause compares changed_at (bare) to MAX(changed_at) — the bare changed_at is itself indeterminate

Solution

Replace with a window function that deterministically selects the latest row per field:

-- Before (WRONG — undefined behavior)
SELECT field, new_value FROM page_history
WHERE page_id = ?1 AND changed_at <= ?2
GROUP BY field
HAVING changed_at = MAX(changed_at)

-- After (CORRECT — deterministic)
SELECT field, new_value FROM (
    SELECT field, new_value,
           ROW_NUMBER() OVER (PARTITION BY field ORDER BY changed_at DESC) AS rn
    FROM page_history
    WHERE page_id = ?1 AND changed_at <= ?2
) WHERE rn = 1

Why This Works

ROW_NUMBER() OVER (PARTITION BY field ORDER BY changed_at DESC) assigns rank 1 to the most recent entry per field
The outer WHERE rn = 1 selects exactly that row
new_value comes from the same row as the max changed_at — guaranteed

Alternative Approaches

Correlated subquery: WHERE changed_at = (SELECT MAX(changed_at) FROM page_history WHERE ...) — works but less efficient for multiple fields
Self-join on MAX: Possible but verbose for this pattern
Window function is the cleanest and most performant approach

Prevention

Best Practices

Never use bare (non-aggregated) columns in GROUP BY queries, even in SQLite
Use window functions (ROW_NUMBER, RANK, DENSE_RANK) for “latest per category” queries
SQLite supports window functions since 3.25.0 (2018) — safe to use everywhere

Warning Signs

Any query with GROUP BY where SELECT includes columns not in the GROUP BY and not wrapped in an aggregate function
Queries that “happen to work” in tests but aren’t deterministic by SQL standard

Pattern Template

For any “get the latest/first row per group” query:

SELECT <columns> FROM (
    SELECT <columns>,
           ROW_NUMBER() OVER (PARTITION BY <group_col> ORDER BY <sort_col> DESC) AS rn
    FROM <table>
    WHERE <conditions>
) WHERE rn = 1

References

Commit: e496bb1 — fix(checkpoint): address P1/P2/P3 review findings
SQLite docs: Bare columns in aggregate queries
File: crates/infrastructure/sqlite/src/checkpoint/checkpoint_repository.rs (removed — checkpoint system eliminated per ADR-013)

Previous
SQL Migration Column Evolution — Silent Data Loss on INSERT Next
SQLite Performance Query Silently Zeroes Fields

Was this page helpful?