
FTS5 unicode61 Tokenizer Implicitly Indexes Custom Syntax

Problem

When introducing custom inline syntax like {{property_name:current_value}} into page content, the question arises: do we need a preprocessing step to strip the syntax delimiters before feeding text to FTS5, or does the tokenizer handle it naturally?

Symptoms:

  • Uncertainty about whether {{age:34}} is searchable by “34” or “age”
  • Concern that braces/colons might not be treated as token separators
  • Potential need for a regex strip in raw_markdown generation pipeline

Investigation

Steps Tried

  1. Reviewed the FTS5 unicode61 tokenizer documentation: by default, unicode61 treats Unicode letters, numbers, and private-use characters as token characters, and every other character (whitespace, punctuation, and symbols, including the underscore) as a token separator
  2. Wrote integration tests against real SQLite to verify the actual tokenization behavior at runtime
  3. Tested edge cases: a bare {{ is not a searchable token (it consists only of separators), and phrase matching works

Root Cause

Not a bug — this is an architectural decision. The unicode61 tokenizer (FTS5’s default) categorizes characters by Unicode category. Characters like {, }, and : fall into punctuation categories and are treated as token separators. This means {{age:34}} is tokenized into:

["age", "34"]

Both tokens are independently searchable. No preprocessing required.
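
To make the splitting rule concrete, here is a minimal sketch that approximates it in plain Rust. It is not FTS5's implementation: the function name approx_unicode61_tokens is hypothetical, and char::is_alphanumeric is only assumed to be close to the Unicode categories unicode61 uses. Lowercasing stands in for the tokenizer's case folding.

```rust
/// Rough stand-in for unicode61's default rule: letters and digits are
/// token characters, and every other character separates tokens.
/// Assumption: `char::is_alphanumeric` is close to, but not identical to,
/// the Unicode general categories that FTS5 consults.
fn approx_unicode61_tokens(text: &str) -> Vec<String> {
    text.split(|c: char| !c.is_alphanumeric())
        // Consecutive separators produce empty pieces; drop them.
        .filter(|piece| !piece.is_empty())
        // Approximate case folding with simple lowercasing.
        .map(|piece| piece.to_lowercase())
        .collect()
}
```

Under this approximation, approx_unicode61_tokens("{{age:34}}") yields the two tokens "age" and "34", and a bare "{{" yields no tokens at all.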

Solution

Do nothing. The existing FTS5 pipeline with unicode61 tokenizer handles the {{property:value}} syntax correctly without any changes. The format was deliberately chosen to be compatible.

Verification Tests

Integration tests in tests/core/tests/property_ref_fts5.rs prove this:

// Test 1: Repository search finds page by resolved value
let results = ws.page_repo.search(&ws.path, "34", 10)?;
assert!(!results.is_empty()); // Finds page with "{{age:34}}"

// Test 2: Raw FTS5 MATCH finds individual tokens
let age_results = fts_token_search(&conn, "age");
assert!(!age_results.is_empty()); // "age" is a standalone token
let num_results = fts_token_search(&conn, "34");
assert!(!num_results.is_empty()); // "34" is a standalone token

// Test 3: Bare braces are NOT searchable tokens
let brace_results = conn
    .query_row(
        r#"SELECT COUNT(*) FROM pages_fts ... WHERE pages_fts MATCH '"{{" '"#,
        [],
        |row| row.get::<_, i64>(0),
    )
    .unwrap_or(0); // FTS5 may error on all-separator queries
assert_eq!(brace_results, 0);

Key Detail: unwrap_or(0) for Empty-Token Queries

FTS5 may raise an error (rather than returning 0 rows) when a MATCH query contains only separator characters. The unwrap_or(0) defensive pattern handles both behaviors:

.unwrap_or(0); // FTS5 may raise an error for empty-token queries — treat as 0
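
Because unwrap_or(0) discards every error, not only the empty-token one, a caller may prefer to detect an all-separator query before running MATCH. The sketch below shows one way to do that. The helper name has_searchable_token is hypothetical, and the check relies on char::is_alphanumeric as an approximation of unicode61's token characters.

```rust
/// Returns true if the query contains at least one character that the
/// default unicode61 rule would (approximately) treat as part of a token.
/// Assumption: `char::is_alphanumeric` approximates FTS5's token characters.
fn has_searchable_token(query: &str) -> bool {
    query.chars().any(|c| c.is_alphanumeric())
}
```

A caller could return an empty result set directly when has_searchable_token(query) is false, and reserve error handling for genuine database failures.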

Implementation Notes

  • The search() repository method wraps queries in double quotes for phrase matching — this still works for single tokens like "34"
  • Raw FTS5 MATCH queries (without phrase quotes) also work for individual token lookup
  • Property names and values in surrounding prose are all independently searchable: “The hero is {{age:34}} years old” → tokens include “hero”, “age”, “34”, “years”, “old”
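
The phrase-quoting step mentioned in the first note can be sketched as follows. This is an illustration, not the actual search() implementation: to_fts5_phrase is a hypothetical helper. It relies on the FTS5 query syntax rule that a double quote inside a quoted string is escaped by doubling it.

```rust
/// Wraps user input in double quotes so FTS5 parses it as a single phrase.
/// Embedded double quotes are doubled, following FTS5 string syntax.
/// Hypothetical helper; the real search() method may differ.
fn to_fts5_phrase(user_query: &str) -> String {
    format!("\"{}\"", user_query.replace('"', "\"\""))
}
```

The returned string is intended to be bound as the MATCH parameter rather than concatenated into the SQL text.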

Prevention

Design Principle

When designing custom inline syntax for content stored in FTS5-indexed fields, prefer delimiter characters that are naturally treated as token separators by unicode61:

Safe delimiters (token separators in unicode61):

  • Punctuation: {, }, [, ], (, )
  • Punctuation: :, ;, ,, ., /
  • Symbols: <, >, |

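Unsafe delimiters (would merge with adjacent tokens):

  • Alphanumeric characters, which are always part of tokens
  • Any character declared as a token character through the tokenchars option. For example, _ and - are separators by default, but they are common tokenchars additions and would then merge with adjacent tokens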

When This Assumption Breaks

  • If unicode61 is replaced with a custom tokenizer that treats braces differently
  • If tokenchars or separators options are added to the FTS5 table definition
  • If the syntax changes to use alphanumeric delimiters, or characters that the table's tokenizer configuration declares as token characters
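
The effect of a tokenchars change can be illustrated with a small model. This is a sketch under assumptions, not FTS5 code: approx_tokens_with is a hypothetical function, extra_token_chars stands in for the tokenchars option, and char::is_alphanumeric approximates the default token characters.

```rust
/// Approximates unicode61 tokenization when extra characters are declared
/// as token characters (modelling the `tokenchars` option).
/// Assumption: `char::is_alphanumeric` approximates the default rule.
fn approx_tokens_with(text: &str, extra_token_chars: &str) -> Vec<String> {
    // A character belongs to a token if it is alphanumeric or explicitly listed.
    let is_token_char = |c: char| c.is_alphanumeric() || extra_token_chars.contains(c);
    text.split(|c: char| !is_token_char(c))
        .filter(|piece| !piece.is_empty())
        .map(str::to_lowercase)
        .collect()
}
```

With no extra token characters, "{{age:34}}" splits into "age" and "34". If ":" were declared a token character, the model produces the single token "age:34", so a search for "34" alone would no longer match.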

Test as Documentation

The integration tests serve as executable documentation of the tokenizer assumption. If a change to the FTS5 configuration alters how these delimiters are tokenized, the tests fail and surface the incompatibility immediately.
