SQLite GROUP BY/HAVING on Non-Aggregate Columns Is Undefined
SQLite GROUP BY/HAVING on Non-Aggregate Columns Is Undefined
Problem
A point-in-time page history reconstruction query used GROUP BY field HAVING changed_at = MAX(changed_at) to select
the most recent value of each field from a shadow table. The query appeared to work in tests but returned
indeterminate results because the non-aggregate column new_value in the SELECT is not guaranteed to come from the
row with MAX(changed_at).
Symptoms:
- Query returns plausible-looking data, making the bug invisible in simple tests
- Under certain row orderings or SQLite optimizer decisions, wrong
new_valueis returned - Only detectable when the “latest” value differs from adjacent rows for the same field
Investigation
Steps Tried
- Initial implementation used GROUP BY/HAVING pattern because it’s syntactically compact
- Code review flagged it as undefined SQL behavior — standard SQL requires all SELECT columns to be either aggregated or in the GROUP BY clause
- Verified that SQLite documents this as a “bare column” extension with non-deterministic behavior unless
MIN/MAXoptimization applies to a specific column pattern
Root Cause
SQLite has a bare columns in aggregate queries extension that sometimes returns values from the row containing MIN/MAX — but this is not guaranteed and depends on internal optimizer behavior. The query:
SELECT field, new_value FROM page_historyWHERE page_id = ?1 AND changed_at <= ?2GROUP BY fieldHAVING changed_at = MAX(changed_at)Has two problems:
new_valueis a bare column — SQLite may return it from any row in the group- The
HAVINGclause compareschanged_at(bare) toMAX(changed_at)— the barechanged_atis itself indeterminate
Solution
Replace with a window function that deterministically selects the latest row per field:
-- Before (WRONG — undefined behavior)SELECT field, new_value FROM page_historyWHERE page_id = ?1 AND changed_at <= ?2GROUP BY fieldHAVING changed_at = MAX(changed_at)
-- After (CORRECT — deterministic)SELECT field, new_value FROM ( SELECT field, new_value, ROW_NUMBER() OVER (PARTITION BY field ORDER BY changed_at DESC) AS rn FROM page_history WHERE page_id = ?1 AND changed_at <= ?2) WHERE rn = 1Why This Works
ROW_NUMBER() OVER (PARTITION BY field ORDER BY changed_at DESC)assigns rank 1 to the most recent entry per field- The outer
WHERE rn = 1selects exactly that row new_valuecomes from the same row as the maxchanged_at— guaranteed
Alternative Approaches
- Correlated subquery:
WHERE changed_at = (SELECT MAX(changed_at) FROM page_history WHERE ...)— works but less efficient for multiple fields - Self-join on MAX: Possible but verbose for this pattern
- Window function is the cleanest and most performant approach
Prevention
Best Practices
- Never use bare (non-aggregated) columns in GROUP BY queries, even in SQLite
- Use window functions (
ROW_NUMBER,RANK,DENSE_RANK) for “latest per category” queries - SQLite supports window functions since 3.25.0 (2018) — safe to use everywhere
Warning Signs
- Any query with GROUP BY where SELECT includes columns not in the GROUP BY and not wrapped in an aggregate function
- Queries that “happen to work” in tests but aren’t deterministic by SQL standard
Pattern Template
For any “get the latest/first row per group” query:
SELECT <columns> FROM ( SELECT <columns>, ROW_NUMBER() OVER (PARTITION BY <group_col> ORDER BY <sort_col> DESC) AS rn FROM <table> WHERE <conditions>) WHERE rn = 1References
- Commit:
e496bb1— fix(checkpoint): address P1/P2/P3 review findings - SQLite docs: Bare columns in aggregate queries
- File:
crates/infrastructure/sqlite/src/checkpoint/checkpoint_repository.rs(removed — checkpoint system eliminated per ADR-013)
SQL Migration Column Evolution — Silent Data Loss on INSERT Next
SQLite Performance Query Silently Zeroes Fields
Was this page helpful?
Thanks for your feedback!