Search System
Status: Implemented
Depends On: Page System, Embedding System
Overview
The Search System provides a unified search experience combining SQLite FTS5 keyword search with ONNX-powered semantic search. Users interact with a single search input; the system transparently classifies query intent, dispatches to the appropriate backends, and merges results via Reciprocal Rank Fusion (RRF).
The SearchRouter is the single entry point for all page search. It is cached in AppState as search_router and
shared with the MCP server.
Architecture
```
Framework (Tauri)
  └── search_pages command          apps/desktop/src-tauri/src/commands/
        Delegates to SearchRouter

Application
  └── SearchRouter                  crates/application/src/search/search_router.rs
        ├── classify_intent()       Rule-based query classification
        ├── search()                Unified dispatch + merge
        └── merge_rrf()             Reciprocal Rank Fusion

Infrastructure
  ├── SqlitePageRepository          crates/infrastructure/sqlite/src/workspace/page_repository.rs
  │     FTS5 search (pages_fts virtual table, bm25 ranking)
  ├── SqliteEmbeddingRepository     crates/infrastructure/sqlite/src/workspace/embedding_repository.rs
  │     Brute-force cosine similarity over stored embeddings
  └── OnnxEmbeddingProvider         crates/infrastructure/onnx/src/provider.rs
        Query embedding generation (snowflake-arctic-embed-m-v2.0)
```

Dependencies flow inward: Framework -> Application (SearchRouter) -> Infrastructure (FTS5 + embedding repos).
Search Flow
```
SearchRouter::search(workspace_path, query, limit)
  │
  ├─ classify_intent(query)
  │    Keyword  → FTS5 only
  │    Semantic → FTS5 + semantic + RRF merge
  │
  ├─ Always: PageRepository::search() (FTS5)
  │
  └─ If Semantic AND provider present:
       EmbeddingProvider::embed(query) → query vector
       EmbeddingRepository::query_similar() → Vec<SimilarPage>
       merge_rrf(fts_results, semantic_results, limit)
```

Every search always runs FTS5. Semantic search is additive: it only activates when intent is classified as semantic
AND an embedding provider is available. If the embedding provider is None (model not downloaded), or if
embedding/similarity fails at runtime, the router falls back to FTS5 results with a logged warning.
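The dispatch and fallback rules above can be sketched in a few lines. This is a simplified illustration, not the actual SearchRouter code: the Outcome enum, the dispatch function, and the closure standing in for the embed + query_similar step are all hypothetical names, and results are reduced to page IDs.

```rust
/// Hypothetical result of the dispatch step: either FTS5 results alone,
/// or both lists ready to be merged with RRF.
#[derive(Debug, PartialEq)]
pub enum Outcome {
    FtsOnly(Vec<String>),
    Merge { fts: Vec<String>, semantic: Vec<String> },
}

/// Sketch of the dispatch rules. `fts` is always computed by the caller.
/// `semantic` is `None` when no embedding provider is available; otherwise
/// it is a closure standing in for "embed the query, then run similarity".
pub fn dispatch<F>(semantic_intent: bool, fts: Vec<String>, semantic: Option<F>) -> Outcome
where
    F: FnOnce() -> Result<Vec<String>, String>,
{
    match semantic {
        // Semantic search only runs for semantic intent with a provider present.
        Some(run) if semantic_intent => match run() {
            Ok(semantic) => Outcome::Merge { fts, semantic },
            Err(err) => {
                // Runtime failure: log a warning and keep the FTS5 results.
                eprintln!("semantic search failed, using FTS5 results only: {}", err);
                Outcome::FtsOnly(fts)
            }
        },
        // Keyword intent, or no provider: FTS5 only.
        _ => Outcome::FtsOnly(fts),
    }
}
```

Passing the semantic step as a closure keeps it lazy, so a keyword query never pays for embedding generation.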
Intent Classification
SearchRouter::classify_intent() uses rule-based heuristics (no ML model):
| Condition | Intent | Rationale |
|---|---|---|
| Empty or whitespace | Keyword | Nothing to embed |
| Surrounded by quotes ("..." or '...') | Keyword | Exact match intent |
| Contains FTS5 operators (AND, OR, NOT, NEAR) | Keyword | User wants boolean logic |
| Contains a date pattern (YYYY-MM-DD or YYYY/MM/DD) | Keyword | Structured lookup |
| 1-2 words | Keyword | Too short for meaningful embedding |
| Single hyphenated lowercase token (my-page-slug) | Keyword | Slug-like lookup |
| 3+ words, no special patterns | Semantic | Natural language query |
Known limitation: Natural-language use of AND/OR/NOT (e.g., “Pros AND Cons”) is classified as Keyword because the detector treats uppercase AND/OR/NOT/NEAR as FTS5 boolean operators.
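As an illustration of the rules in the table, here is a minimal sketch of such a classifier. It is an assumption about how the heuristics could be written, not the real classify_intent: the Intent enum, the contains_date helper, and the exact operator matching (whole uppercase tokens, plus a NEAR( prefix) are hypothetical.

```rust
/// Query intent, mirroring the two outcomes in the table above.
#[derive(Debug, PartialEq, Eq)]
pub enum Intent {
    Keyword,
    Semantic,
}

/// True if `s` contains a YYYY-MM-DD or YYYY/MM/DD pattern.
fn contains_date(s: &str) -> bool {
    s.as_bytes().windows(10).any(|w| {
        let sep = w[4];
        (sep == b'-' || sep == b'/')
            && w[7] == sep
            && w[..4].iter().all(u8::is_ascii_digit)
            && w[5..7].iter().all(u8::is_ascii_digit)
            && w[8..].iter().all(u8::is_ascii_digit)
    })
}

/// Rule-based classification: any Keyword rule wins, otherwise Semantic.
pub fn classify_intent(query: &str) -> Intent {
    let q = query.trim();
    if q.is_empty() {
        return Intent::Keyword; // nothing to embed
    }
    // Whole query wrapped in matching double or single quotes.
    let quoted = q.len() >= 2
        && ((q.starts_with('"') && q.ends_with('"'))
            || (q.starts_with('\'') && q.ends_with('\'')));
    let words: Vec<&str> = q.split_whitespace().collect();
    // Uppercase boolean operators are treated as FTS5 syntax.
    let has_operator = words
        .iter()
        .any(|w| matches!(*w, "AND" | "OR" | "NOT" | "NEAR") || w.starts_with("NEAR("));
    // The slug rule (single hyphenated token) is covered by the 1-2 word rule here.
    if quoted || has_operator || contains_date(q) || words.len() <= 2 {
        return Intent::Keyword;
    }
    Intent::Semantic
}
```

With this sketch, "Pros AND Cons" is classified as Keyword, which reproduces the known limitation described above, while "how do I configure sync" is classified as Semantic.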
FTS5 Configuration
Virtual Table
The pages_fts virtual table is a 3-column contentless FTS5 index:

```sql
CREATE VIRTUAL TABLE IF NOT EXISTS pages_fts USING fts5(
    title,
    content,
    tags,
    content='',
    contentless_delete=1
);
```

- title: Page title.
- content: Materialized markdown from pages.raw_markdown.
- tags: Space-separated tag names from page_tags join.
- content='': Contentless mode. FTS5 stores only the index, not the original text. Saves storage at the cost of requiring triggers to keep the index synchronized.
- contentless_delete=1: Enables row deletion without the original content.
BM25 Weights
```sql
bm25(pages_fts, 10.0, 1.0, 5.0) as score
```

| Column | Weight | Rationale |
|---|---|---|
| title | 10.0 | Title matches are most relevant |
| content | 1.0 | Baseline body text weight |
| tags | 5.0 | Tag matches are high-signal metadata |
BM25 returns negative values (lower is better); the application takes the absolute value for a positive score.
Snippet Priority
Snippets are extracted with a priority cascade:
- Content snippet (column 1): body text context.
- Tags snippet (column 2): matching tag names.
- Title snippet (column 0): fallback to title.

```sql
COALESCE(
    NULLIF(snippet(pages_fts, 1, '<mark>', '</mark>', '...', 32), ''),
    NULLIF(snippet(pages_fts, 2, '<mark>', '</mark>', '...', 32), ''),
    snippet(pages_fts, 0, '<mark>', '</mark>', '...', 32),
    ''
) as snippet
```

Tokenizer
FTS5 uses the default unicode61 tokenizer, which treats non-alphanumeric characters as separators. This means custom
syntax like {{age:34}} is tokenized to ["age", "34"] without preprocessing, because braces and colons are natural
separators.
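To make the splitting behavior concrete, the following sketch approximates it in Rust. It is only an illustration: the tokenize function is hypothetical, the real tokenization happens inside SQLite, and unicode61 additionally removes diacritics by default, which this sketch does not attempt.

```rust
/// Rough approximation of unicode61 token splitting: every non-alphanumeric
/// character is a separator, empty pieces are dropped, and tokens are
/// lowercased (unicode61 is case-insensitive).
pub fn tokenize(text: &str) -> Vec<String> {
    text.split(|c: char| !c.is_alphanumeric())
        .filter(|token| !token.is_empty())
        .map(|token| token.to_lowercase())
        .collect()
}
```

Calling tokenize("{{age:34}}") yields the two tokens age and 34, matching the example above.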
Trigger-Based Synchronization
The FTS5 index is kept in sync via SQLite triggers:
- pages_fts_insert: Indexes new pages (skips is_deleted = 1).
- pages_fts_update: Re-indexes on page update.
- pages_fts_delete: Removes from index on page delete.
- page_tags_fts_insert / page_tags_fts_delete: Re-indexes the page when tags are added or removed.
- tags_fts_name_update: Re-indexes all pages with a tag when the tag is renamed.
Semantic Search
Query Embedding
When a query is classified as semantic, SearchRouter calls EmbeddingProvider::embed(query) to generate a
768-dimensional vector using the snowflake-arctic-embed-m-v2.0 model via ONNX Runtime. See
Embedding System for model details.
Similarity Search
SqliteEmbeddingRepository::query_similar() performs brute-force cosine similarity:
- Loads all non-deleted page embeddings (capped at 10,000 rows).
- Computes cosine similarity in Rust (not in SQL; no sqlite-vec dependency).
- Filters out results below MIN_SIMILARITY_THRESHOLD (0.3).
- Sorts descending by score and truncates to limit.
This is O(n) over stored embeddings. The 10,000-row cap prevents unbounded memory usage. For workspace-scale datasets (< 10k pages) this is adequate.
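The four steps can be sketched as follows. This is a minimal in-memory illustration under stated assumptions, not the repository code: the function signatures, the MAX_ROWS constant name, the (page ID, vector) tuple layout, and the inclusive threshold comparison are hypothetical, and the real implementation loads its rows from SQLite.

```rust
/// Values taken from the text above; `MAX_ROWS` is a hypothetical name.
const MIN_SIMILARITY_THRESHOLD: f32 = 0.3;
const MAX_ROWS: usize = 10_000;

/// Cosine similarity of two vectors; 0.0 for mismatched or zero-length input.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    if a.len() != b.len() {
        return 0.0;
    }
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        0.0
    } else {
        dot / (norm_a * norm_b)
    }
}

/// Brute-force scan: score every stored embedding, drop weak matches,
/// sort best-first, and keep at most `limit` results.
pub fn query_similar(
    query: &[f32],
    stored: &[(String, Vec<f32>)],
    limit: usize,
) -> Vec<(String, f32)> {
    let mut scored: Vec<(String, f32)> = stored
        .iter()
        .take(MAX_ROWS) // cap the number of rows considered
        .map(|(id, vector)| (id.clone(), cosine_similarity(query, vector)))
        .filter(|(_, score)| *score >= MIN_SIMILARITY_THRESHOLD)
        .collect();
    scored.sort_by(|a, b| b.1.total_cmp(&a.1)); // descending by score
    scored.truncate(limit);
    scored
}
```

Because every stored vector is visited once, the cost grows linearly with the number of pages, which is the O(n) behavior noted above.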
Reciprocal Rank Fusion (RRF)
When both FTS5 and semantic results are available, they are merged using RRF with k=60 (the standard value from the original research paper):

```
RRF score = sum over lists of 1 / (60 + rank_i)
```

Here rank_i is the page's 1-based rank in list i. Each page accumulates contributions from every list it appears in. Pages appearing in both lists are boosted above pages in only one list.
Score semantics: After merging, the score field contains an RRF score (typically ~0.01-0.03 range), NOT a cosine
similarity or BM25 score. The FTS5 snippet is preserved for merged results; semantic results have no snippet.
Example with k=60:
- Page in FTS5 at rank 1: 1/61 = 0.0164
- Same page in semantic at rank 2: + 1/62 = 0.0161
- Combined: 0.0164 + 0.0161 = 0.0325
- A page only in FTS5 at rank 2: 1/62 = 0.0161
The intersection page ranks first.
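The arithmetic above can be reproduced with a short sketch of the fusion step. It is an illustration only: this merge_rrf works on plain page IDs, whereas the real function also carries snippets and other result fields, and the tie-break by page ID is an assumption added to keep the output deterministic.

```rust
use std::collections::HashMap;

/// RRF constant k from the text above.
const RRF_K: f64 = 60.0;

/// Merge two ranked lists of page IDs with Reciprocal Rank Fusion.
/// Each list contributes 1 / (k + rank) per page, with 1-based ranks.
pub fn merge_rrf(fts: &[&str], semantic: &[&str], limit: usize) -> Vec<(String, f64)> {
    let mut scores: HashMap<String, f64> = HashMap::new();
    for list in [fts, semantic] {
        for (index, id) in list.iter().enumerate() {
            let rank = (index + 1) as f64;
            *scores.entry(id.to_string()).or_insert(0.0) += 1.0 / (RRF_K + rank);
        }
    }
    let mut merged: Vec<(String, f64)> = scores.into_iter().collect();
    // Highest fused score first; ties broken by page ID (assumed, for stable output).
    merged.sort_by(|a, b| b.1.total_cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
    merged.truncate(limit);
    merged
}
```

For example, merging the FTS5 list ["a", "b"] with the semantic list ["c", "a"] ranks a first with a score of 1/61 + 1/62, ahead of c (1/61) and b (1/62).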
SearchRouter Construction
The SearchRouter is generic over its dependencies:

```rust
pub struct SearchRouter<PR, ER, EP>
where
    PR: PageRepository,
    ER: EmbeddingRepository,
    EP: EmbeddingProvider,
{
    page_repo: Arc<PR>,
    embedding_repo: Arc<ER>,
    embedding_provider: Option<Arc<EP>>,
}
```

embedding_provider may be None if the ONNX model is not yet downloaded or loaded. In this case, all queries fall
back to FTS5 only.
The router is constructed during workspace open (in start_embedding_pipeline) and cached in AppState::search_router.
It is also shared with McpState for MCP search operations.
Graceful Degradation
The search system degrades gracefully at every layer:
| Failure | Behavior |
|---|---|
| No embedding model downloaded | FTS5 only (no semantic search) |
| Embedding provider fails to embed query | FTS5 only + warning logged |
| Similarity search fails | FTS5 only + warning logged |
| No workspace open | Error returned to caller |
| Missing SearchUse capability | InvalidOperation error |
This ensures search always returns results as long as the workspace is accessible, even if the embedding subsystem is completely unavailable.
Permission Gating
SearchRouter::search() requires the SearchUse capability:

```rust
guard
    .require(Capability::SearchUse)
    .map_err(|e| PageRepositoryError::InvalidOperation(e.to_string()))?;
```

This is checked before any database access. The capability is part of the standard owner permission set (always granted for local users).
Key Code Paths
| Scenario | Entry Point |
|---|---|
| Search query dispatched | SearchRouter::search() in crates/application/src/search/search_router.rs |
| Intent classified | SearchRouter::classify_intent() in crates/application/src/search/search_router.rs |
| FTS5 search executed | SqlitePageRepository::search() in crates/infrastructure/sqlite/src/workspace/page_repository.rs |
| Query embedded | OnnxEmbeddingProvider::embed() in crates/infrastructure/onnx/src/provider.rs |
| Similarity query | SqliteEmbeddingRepository::query_similar() in crates/infrastructure/sqlite/src/workspace/embedding_repository.rs |
| Results merged | merge_rrf() in crates/application/src/search/search_router.rs |
| MCP search tool | tools::discovery::search() in apps/desktop/src-tauri/src/mcp/tools/discovery.rs |
Related
- Embedding System: Provides the embedding provider and repository used for semantic search.
- Page System: FTS5 index is maintained via triggers on the pages table.
- MCP System: search tool and search resource delegate to SearchRouter.
- Tags System: Tags are indexed as the third FTS5 column with weight 5.0.