Architecture

FTS5 unicode61 Tokenizer Implicitly Indexes Custom Syntax

Problem

When introducing custom inline syntax like {{property_name:current_value}} into page content, the question arises: do we need a preprocessing step to strip the syntax delimiters before feeding text to FTS5, or does the tokenizer handle it naturally?

Symptoms:

Uncertainty about whether {{age:34}} is searchable by “34” or “age”
Concern that braces/colons might not be treated as token separators
Potential need for a regex strip step in the raw_markdown generation pipeline

Investigation

Steps Tried

Reviewed FTS5 unicode61 tokenizer documentation: by default, unicode61 treats every character that is not a Unicode letter, number, or private-use codepoint as a token separator
Wrote integration tests against real SQLite — verified actual tokenization behavior at runtime
Tested edge cases — bare {{ is not a searchable token (all separators), phrase matching works

Root Cause

Not a bug — this is an architectural decision. The unicode61 tokenizer (FTS5’s default) categorizes characters by Unicode category. Characters like {, }, and : fall into punctuation categories and are treated as token separators. This means {{age:34}} is tokenized into:

["age", "34"]

Both tokens are independently searchable. No preprocessing required.
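
The split can be approximated in plain Rust without touching SQLite. This is a rough sketch under stated assumptions, not FTS5's tokenizer: the name approx_unicode61_tokens is hypothetical, char::is_alphanumeric only approximates the character classes unicode61 uses, and diacritic removal is ignored.

```rust
/// Rough stand-in for unicode61's default behavior (an approximation, not
/// the real tokenizer): runs of alphanumeric characters become tokens and
/// every other character acts as a separator.
fn approx_unicode61_tokens(text: &str) -> Vec<String> {
    text.split(|c: char| !c.is_alphanumeric())
        // Adjacent separators such as "{{" produce empty pieces; drop them.
        .filter(|piece| !piece.is_empty())
        // unicode61 folds case, so lowercase each token for comparison.
        .map(|piece| piece.to_lowercase())
        .collect()
}
```

Under these assumptions, approx_unicode61_tokens("{{age:34}}") returns ["age", "34"], while approx_unicode61_tokens("{{") returns an empty vector, which mirrors why a bare {{ cannot be matched.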

Solution

Do nothing. The existing FTS5 pipeline with unicode61 tokenizer handles the {{property:value}} syntax correctly without any changes. The format was deliberately chosen to be compatible.

Verification Tests

Integration tests in tests/core/tests/property_ref_fts5.rs prove this:

// Test 1: Repository search finds page by resolved value
let results = ws.page_repo.search(&ws.path, "34", 10)?;
assert!(!results.is_empty()); // Finds page with "{{age:34}}"

// Test 2: Raw FTS5 MATCH finds individual tokens
let age_results = fts_token_search(&conn, "age");
assert!(!age_results.is_empty()); // "age" is a standalone token

let num_results = fts_token_search(&conn, "34");
assert!(!num_results.is_empty()); // "34" is a standalone token

// Test 3: Bare braces are NOT searchable tokens
let brace_results = conn.query_row(
    r#"SELECT COUNT(*) FROM pages_fts ... WHERE pages_fts MATCH '"{{" '"#,
    [], |row| row.get::<_, i64>(0),
).unwrap_or(0); // FTS5 may error on all-separator queries
assert_eq!(brace_results, 0);

Key Detail: `unwrap_or(0)` for Empty-Token Queries

FTS5 may raise an error (rather than returning 0 rows) when a MATCH query contains only separator characters. The unwrap_or(0) defensive pattern handles both behaviors:

.unwrap_or(0); // FTS5 may raise an error for empty-token queries — treat as 0
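
The effect of the pattern can be shown in isolation. In this sketch the CountResult alias and both helper names are hypothetical stand-ins for the raw COUNT(*) query result; they are not part of the test file. The second helper is one possible variant, assuming a caller wants to keep the error message for logging.

```rust
/// Hypothetical stand-in for the raw COUNT(*) query: `Err` models FTS5
/// rejecting the query, `Ok(n)` models a successful count.
type CountResult = Result<i64, String>;

/// The pattern from the test: an error and an empty result both become 0.
fn count_or_zero(outcome: CountResult) -> i64 {
    outcome.unwrap_or(0)
}

/// A stricter variant: errors still map to 0, but the message is handed
/// back so a caller can log it instead of losing it silently.
fn count_with_reason(outcome: CountResult) -> (i64, Option<String>) {
    match outcome {
        Ok(n) => (n, None),
        Err(msg) => (0, Some(msg)),
    }
}
```

One trade-off worth keeping in mind: unwrap_or(0) also maps unrelated failures (for example a missing table) to 0, so it is best reserved for assertions where "no match" and "query rejected" are equally acceptable.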

Implementation Notes

The search() repository method wraps queries in double quotes for phrase matching — this still works for single tokens like "34"
Raw FTS5 MATCH queries (without phrase quotes) also work for individual token lookup
Property names and values in surrounding prose are all independently searchable: “The hero is {{age:34}} years old” → tokens include “hero”, “age”, “34”, “years”, “old”
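
The phrase wrapping mentioned in the first note can be sketched as follows. The actual search() implementation is not shown in this document, so to_fts5_phrase is a hypothetical helper; the only FTS5 rule it relies on is that a double quote inside a quoted string is escaped by doubling it.

```rust
/// Wrap user input in double quotes so FTS5 parses it as a single string
/// (phrase) rather than as query syntax. Embedded double quotes are
/// doubled, which is how FTS5 strings escape them.
fn to_fts5_phrase(user_query: &str) -> String {
    format!("\"{}\"", user_query.replace('"', "\"\""))
}
```

With this helper, the input 34 becomes "34" (quotes included), and the tokenizer then reduces the phrase to the single token 34, so single-token lookups behave the same as unquoted ones.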

Prevention

Design Principle

When designing custom inline syntax for content stored in FTS5-indexed fields, prefer delimiter characters that are naturally treated as token separators by unicode61:

Safe delimiters (token separators in unicode61):

Punctuation: {, }, [, ], (, )
Punctuation: :, ;, ,, ., /
Symbols: <, >, |

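Unsafe delimiters (would merge with adjacent tokens):

Alphanumeric characters, which are always part of tokens
Any character added through the tokenchars option. The underscore is the usual example: default unicode61 treats _ as a separator (Unicode category Pc, connector punctuation), but a table declared with tokenchars '_' treats it as a word character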

When This Assumption Breaks

If unicode61 is replaced with a custom tokenizer that treats braces differently
If tokenchars or separators options are added to the FTS5 table definition
If the syntax changes to use alphanumeric delimiters, or any character that the table's tokenchars option marks as a token character
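
The tokenchars case can be illustrated with a small model. This is a sketch under assumptions, not FTS5 itself: approx_tokens is a hypothetical function, its second parameter imitates the tokenchars option, and char::is_alphanumeric only approximates unicode61's default token characters.

```rust
/// Approximate unicode61 with an optional set of extra token characters,
/// imitating what the `tokenchars` option does to the separator set.
fn approx_tokens(text: &str, extra_tokenchars: &str) -> Vec<String> {
    text.split(|c: char| !(c.is_alphanumeric() || extra_tokenchars.contains(c)))
        // Runs of separators yield empty pieces; skip them.
        .filter(|piece| !piece.is_empty())
        .map(str::to_lowercase)
        .collect()
}
```

With no extra token characters, approx_tokens("{{age:34}}", "") yields ["age", "34"]. Passing ":" as the extra set yields the single token "age:34", so a search for 34 alone would no longer match. That is the failure mode the verification tests are meant to catch.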

Test as Documentation

The integration tests serve as executable documentation of the tokenizer assumption. If FTS5 configuration changes, these tests will fail and surface the incompatibility immediately.

References

INK-253: Property Value Insertion in Content Blocks (INK-296 sub-issue)
tests/core/tests/property_ref_fts5.rs — verification tests
SQLite FTS5 unicode61 tokenizer: https://www.sqlite.org/fts5.html#unicode61_tokenizer
