FTS5 unicode61 Tokenizer Implicitly Indexes Custom Syntax
Problem
When introducing custom inline syntax like {{property_name:current_value}} into page content, the question arises: do
we need a preprocessing step to strip the syntax delimiters before feeding text to FTS5, or does the tokenizer handle it
naturally?
Symptoms:
- Uncertainty about whether `{{age:34}}` is searchable by “34” or “age”
- Concern that braces/colons might not be treated as token separators
- Potential need for a regex strip in the `raw_markdown` generation pipeline
Investigation
Steps Tried
- Reviewed FTS5 unicode61 tokenizer documentation — unicode61 treats all non-alphanumeric, non-underscore characters as token separators by default
- Wrote integration tests against real SQLite — verified actual tokenization behavior at runtime
- Tested edge cases — bare `{{` is not a searchable token (all separators), and phrase matching works
Root Cause
Not a bug — this is an architectural decision. The unicode61 tokenizer (FTS5’s default) categorizes characters by
Unicode category. Characters like {, }, and : fall into punctuation categories and are treated as token
separators. This means {{age:34}} is tokenized into:
```
["age", "34"]
```
Both tokens are independently searchable. No preprocessing is required.
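This tokenization behavior can be checked directly. The sketch below uses Python's stdlib `sqlite3` module for a self-contained demo (the project's real tests are in Rust); it assumes the bundled SQLite was compiled with FTS5, which is true for most modern Python builds. The table name `pages_fts` mirrors the one in the article but is otherwise arbitrary.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# unicode61 is the default FTS5 tokenizer, so no tokenize option is needed.
conn.execute("CREATE VIRTUAL TABLE pages_fts USING fts5(body)")
conn.execute(
    "INSERT INTO pages_fts(body) VALUES (?)",
    ("The hero is {{age:34}} years old",),
)

# '{', '}' and ':' are separators, so "age" and "34" are indexed as
# independent tokens and each matches the row.
for term in ("age", "34"):
    hits = conn.execute(
        "SELECT count(*) FROM pages_fts WHERE pages_fts MATCH ?", (term,)
    ).fetchone()[0]
    print(term, hits)
```

Running this prints a hit count of 1 for both terms, confirming that no preprocessing step is needed before indexing.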
Solution
Do nothing. The existing FTS5 pipeline with unicode61 tokenizer handles the {{property:value}} syntax correctly
without any changes. The format was deliberately chosen to be compatible.
Verification Tests
Integration tests in tests/core/tests/property_ref_fts5.rs prove this:
```rust
// Test 1: Repository search finds page by resolved value
let results = ws.page_repo.search(&ws.path, "34", 10)?;
assert!(!results.is_empty()); // Finds page with "{{age:34}}"

// Test 2: Raw FTS5 MATCH finds individual tokens
let age_results = fts_token_search(&conn, "age");
assert!(!age_results.is_empty()); // "age" is a standalone token

let num_results = fts_token_search(&conn, "34");
assert!(!num_results.is_empty()); // "34" is a standalone token

// Test 3: Bare braces are NOT searchable tokens
let brace_results = conn.query_row(
    r#"SELECT COUNT(*) FROM pages_fts ... WHERE pages_fts MATCH '"{{" '"#,
    [],
    |row| row.get::<_, i64>(0),
).unwrap_or(0); // FTS5 may error on all-separator queries
assert_eq!(brace_results, 0);
```
Key Detail: `unwrap_or(0)` for Empty-Token Queries
FTS5 may raise an error (rather than returning 0 rows) when a MATCH query contains only separator characters. The
unwrap_or(0) defensive pattern handles both behaviors:
```rust
.unwrap_or(0); // FTS5 may raise an error for empty-token queries — treat as 0
```
Implementation Notes
- The `search()` repository method wraps queries in double quotes for phrase matching — this still works for single tokens like `"34"`
- Raw FTS5 MATCH queries (without phrase quotes) also work for individual token lookup
- Property names and values in surrounding prose are all independently searchable: “The hero is `{{age:34}}` years old” → tokens include “hero”, “age”, “34”, “years”, “old”
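The phrase-quoting and empty-token behaviors described above can be sketched with Python's stdlib `sqlite3` module (the project's real tests are in Rust with rusqlite); this assumes an FTS5-enabled SQLite build. The `try`/`except` around the separator-only query is the Python analogue of the Rust `unwrap_or(0)` pattern.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE pages_fts USING fts5(body)")
conn.execute(
    "INSERT INTO pages_fts(body) VALUES ('The hero is {{age:34}} years old')"
)

# Phrase quoting (as the repository's search() does) is harmless for a
# single token: "34" behaves the same as a bare 34 query.
phrase_hits = conn.execute(
    "SELECT count(*) FROM pages_fts WHERE pages_fts MATCH ?", ('"34"',)
).fetchone()[0]

# A separator-only query tokenizes to nothing. Depending on the SQLite
# version, FTS5 either raises an error or returns no rows, so both
# outcomes are normalized to 0 — the analogue of unwrap_or(0).
try:
    brace_hits = conn.execute(
        "SELECT count(*) FROM pages_fts WHERE pages_fts MATCH ?", ('"{{"',)
    ).fetchone()[0]
except sqlite3.OperationalError:
    brace_hits = 0

print(phrase_hits, brace_hits)
```

The phrase query finds the row while the brace query resolves to zero either way, which is exactly the pair of behaviors the defensive pattern exists to cover.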
Prevention
Design Principle
When designing custom inline syntax for content stored in FTS5-indexed fields, prefer delimiter characters that are
naturally treated as token separators by unicode61:
Safe delimiters (token separators in unicode61):
- `{`, `}`, `[`, `]`, `(`, `)` — punctuation
- `:`, `;`, `,`, `.` — punctuation
- `<`, `>`, `|`, `/` — symbols
Unsafe delimiters (would merge with adjacent tokens):
- `_` — treated as a word character in unicode61
- Alphanumeric characters — obviously part of tokens
When This Assumption Breaks
- If `unicode61` is replaced with a custom tokenizer that treats braces differently
- If `tokenchars` or `separators` options are added to the FTS5 table definition
- If the syntax changes to use underscores or alphanumeric delimiters
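The second break condition can be demonstrated directly. This sketch (stdlib `sqlite3`, FTS5-enabled build assumed; table names are illustrative) compares the default tokenizer against a hypothetical table that adds `tokenchars` promoting `{`, `}`, and `:` to word characters:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Default unicode61: braces and colons are separators.
conn.execute("CREATE VIRTUAL TABLE default_fts USING fts5(body)")
# Hypothetical reconfiguration: { } : become token characters, so the
# whole {{age:34}} run is indexed as a single opaque token.
conn.execute(
    """CREATE VIRTUAL TABLE custom_fts USING fts5(body, tokenize = "unicode61 tokenchars '{}:'")"""
)

for table in ("default_fts", "custom_fts"):
    conn.execute(f"INSERT INTO {table}(body) VALUES ('{{{{age:34}}}}')")
    hits = conn.execute(
        f"SELECT count(*) FROM {table} WHERE {table} MATCH '34'"
    ).fetchone()[0]
    print(table, hits)
```

The default table matches “34” while the `tokenchars` table does not — exactly the regression the integration tests would surface if the FTS5 table definition ever changed.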
Test as Documentation
The integration tests serve as executable documentation of the tokenizer assumption. If FTS5 configuration changes, these tests will fail and surface the incompatibility immediately.
References
- INK-253: Property Value Insertion in Content Blocks (INK-296 sub-issue)
- `tests/core/tests/property_ref_fts5.rs` — verification tests
- SQLite FTS5 unicode61 tokenizer: https://www.sqlite.org/fts5.html#unicode61_tokenizer