
Search System

Status: Implemented
Depends On: Page System, Embedding System


The Search System provides a unified search experience combining SQLite FTS5 keyword search with ONNX-powered semantic search. Users interact with a single search input; the system transparently classifies query intent, dispatches to the appropriate backends, and merges results via Reciprocal Rank Fusion (RRF).

The SearchRouter is the single entry point for all page search. It is cached in AppState as search_router and shared with the MCP server.



Framework (Tauri)
└── search_pages command          apps/desktop/src-tauri/src/commands/
    Delegates to SearchRouter
Application
└── SearchRouter                  crates/application/src/search/search_router.rs
    ├── classify_intent()         Rule-based query classification
    ├── search()                  Unified dispatch + merge
    └── merge_rrf()               Reciprocal Rank Fusion
Infrastructure
├── SqlitePageRepository          crates/infrastructure/sqlite/src/workspace/page_repository.rs
│   FTS5 search (pages_fts virtual table, bm25 ranking)
├── SqliteEmbeddingRepository     crates/infrastructure/sqlite/src/workspace/embedding_repository.rs
│   Brute-force cosine similarity over stored embeddings
└── OnnxEmbeddingProvider         crates/infrastructure/onnx/src/provider.rs
    Query embedding generation (snowflake-arctic-embed-m-v2.0)

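Calls flow from the outside in: Framework -> Application (SearchRouter) -> Infrastructure (FTS5 + embedding repos). SearchRouter is generic over its repository and provider dependencies, so the application layer is not tied to the SQLite or ONNX implementations.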


SearchRouter::search(workspace_path, query, limit)
├─ classify_intent(query)
│    Keyword  → FTS5 only
│    Semantic → FTS5 + semantic + RRF merge
├─ Always: PageRepository::search() (FTS5)
└─ If Semantic AND provider present:
     EmbeddingProvider::embed(query) → query vector
     EmbeddingRepository::query_similar() → Vec<SimilarPage>
     merge_rrf(fts_results, semantic_results, limit)

Every search runs FTS5. Semantic search is additive: it activates only when the intent is classified as Semantic and an embedding provider is available. If the embedding provider is None (model not downloaded), or if embedding or similarity search fails at runtime, the router falls back to FTS5 results; runtime failures are logged as warnings.
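The decision of whether the semantic branch runs can be sketched as a small function. This is an illustration under stated assumptions, not the actual SearchRouter code: the synchronous EmbeddingProvider trait, the String error type, the StubProvider test double, and the semantic_query_vector helper are all hypothetical names introduced here.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Intent {
    Keyword,
    Semantic,
}

/// Hypothetical, simplified stand-in for the real provider trait.
pub trait EmbeddingProvider {
    fn embed(&self, query: &str) -> Result<Vec<f32>, String>;
}

/// Illustrative stub: returns a fixed vector, or an error when `fail` is set.
pub struct StubProvider {
    pub fail: bool,
}

impl EmbeddingProvider for StubProvider {
    fn embed(&self, _query: &str) -> Result<Vec<f32>, String> {
        if self.fail {
            Err("model error".to_string())
        } else {
            Ok(vec![0.1, 0.2, 0.3])
        }
    }
}

/// Returns the query vector when the semantic branch should run, or `None`
/// when the router should serve FTS5 results alone.
pub fn semantic_query_vector<P: EmbeddingProvider>(
    intent: Intent,
    query: &str,
    provider: Option<&P>,
) -> Option<Vec<f32>> {
    if intent != Intent::Semantic {
        return None; // Keyword intent: FTS5 only.
    }
    let provider = provider?; // No provider (model not downloaded): FTS5 only.
    match provider.embed(query) {
        Ok(vector) => Some(vector),
        Err(err) => {
            // Runtime failure: warn and fall back to FTS5.
            eprintln!("warning: query embedding failed, using FTS5 only: {}", err);
            None
        }
    }
}
```

In this sketch the FTS5 query is assumed to have run already; a `Some` result is what would feed query_similar() and then merge_rrf().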


SearchRouter::classify_intent() uses rule-based heuristics (no ML model):

| Condition | Intent | Rationale |
|---|---|---|
| Empty or whitespace | Keyword | Nothing to embed |
| Surrounded by quotes ("..." or '...') | Keyword | Exact match intent |
| Contains FTS5 operators (AND, OR, NOT, NEAR) | Keyword | User wants boolean logic |
| Contains a date pattern (YYYY-MM-DD or YYYY/MM/DD) | Keyword | Structured lookup |
| 1-2 words | Keyword | Too short for meaningful embedding |
| Single hyphenated lowercase token (my-page-slug) | Keyword | Slug-like lookup |
| 3+ words, no special patterns | Semantic | Natural language query |

Known limitation: Natural-language use of AND/OR/NOT (e.g., “Pros AND Cons”) is classified as Keyword because the detector treats uppercase AND/OR/NOT/NEAR as FTS5 boolean operators.
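The rules in the table can be approximated with plain string checks. The following is a minimal sketch, assuming the rules are evaluated in the order listed; the SearchIntent enum and the helpers contains_date and is_slug are hypothetical names, and the real classify_intent() may differ in detail (for example in how it recognizes NEAR).

```rust
/// The two intents from the classification table.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SearchIntent {
    Keyword,
    Semantic,
}

/// True if `s` contains a YYYY-MM-DD or YYYY/MM/DD pattern anywhere.
fn contains_date(s: &str) -> bool {
    s.as_bytes().windows(10).any(|w| {
        let digits = |r: std::ops::Range<usize>| w[r].iter().all(u8::is_ascii_digit);
        (w[4] == b'-' || w[4] == b'/')
            && w[7] == w[4]
            && digits(0..4)
            && digits(5..7)
            && digits(8..10)
    })
}

/// True for a single lowercase token with at least one hyphen, e.g. `my-page-slug`.
fn is_slug(s: &str) -> bool {
    s.contains('-')
        && s.chars().all(|c| c.is_ascii_lowercase() || c.is_ascii_digit() || c == '-')
}

pub fn classify_intent(query: &str) -> SearchIntent {
    let q = query.trim();
    if q.is_empty() {
        return SearchIntent::Keyword; // Nothing to embed.
    }
    let quoted = q.len() >= 2
        && ((q.starts_with('"') && q.ends_with('"'))
            || (q.starts_with('\'') && q.ends_with('\'')));
    if quoted {
        return SearchIntent::Keyword; // Exact match intent.
    }
    let words: Vec<&str> = q.split_whitespace().collect();
    // Only uppercase operators count, which is why "Pros AND Cons" is Keyword.
    let has_operator = words
        .iter()
        .any(|w| matches!(*w, "AND" | "OR" | "NOT" | "NEAR") || w.starts_with("NEAR("));
    if has_operator || contains_date(q) || is_slug(q) || words.len() <= 2 {
        return SearchIntent::Keyword;
    }
    SearchIntent::Semantic // 3+ words, no special patterns.
}
```

Keeping the classifier as pure string logic means it needs no model and can be unit tested without a workspace.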


The pages_fts virtual table is a 3-column contentless FTS5 index:

CREATE VIRTUAL TABLE IF NOT EXISTS pages_fts USING fts5(
    title,
    content,
    tags,
    content='',
    contentless_delete=1
);

  • title: Page title.
  • content: Materialized markdown from pages.raw_markdown.
  • tags: Space-separated tag names from page_tags join.
  • content='': Contentless mode — FTS5 stores only the index, not the original text. Saves storage at the cost of requiring triggers to keep the index synchronized.
  • contentless_delete=1: Enables row deletion without the original content (contentless-delete tables require SQLite 3.43.0 or later).
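
Results are ranked with per-column BM25 weights:

bm25(pages_fts, 10.0, 1.0, 5.0) as score

| Column | Weight | Rationale |
|---|---|---|
| title | 10.0 | Title matches are most relevant |
| content | 1.0 | Baseline body text weight |
| tags | 5.0 | Tag matches are high-signal metadata |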

FTS5's bm25() returns negative values, where a more negative value means a better match; the application takes the absolute value so that the score is positive and higher means better.

Snippets are extracted with a priority cascade:

  1. Content snippet (column 1) — body text context.
  2. Tags snippet (column 2) — matching tag names.
  3. Title snippet (column 0) — fallback to title.

COALESCE(
    NULLIF(snippet(pages_fts, 1, '<mark>', '</mark>', '...', 32), ''),
    NULLIF(snippet(pages_fts, 2, '<mark>', '</mark>', '...', 32), ''),
    snippet(pages_fts, 0, '<mark>', '</mark>', '...', 32),
    ''
) as snippet
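
For readers less familiar with COALESCE and NULLIF, the same cascade can be expressed in Rust. This is only a sketch of the selection logic; pick_snippet is a hypothetical helper, with `None` standing in for SQL NULL.

```rust
/// Mirrors COALESCE(NULLIF(content, ''), NULLIF(tags, ''), title, ''):
/// the first non-empty of content and tags wins, then the title, then "".
pub fn pick_snippet<'a>(
    content: Option<&'a str>,
    tags: Option<&'a str>,
    title: Option<&'a str>,
) -> &'a str {
    content
        .filter(|s| !s.is_empty())
        .or_else(|| tags.filter(|s| !s.is_empty()))
        .or(title)
        .unwrap_or("")
}
```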

FTS5 uses the default unicode61 tokenizer, which treats non-alphanumeric characters as separators. This means custom syntax like {{age:34}} is tokenized to ["age", "34"] without preprocessing — braces and colons are natural separators.
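
The effect on custom syntax can be illustrated with a rough approximation of the tokenizer. This sketch only splits on non-alphanumeric characters and lowercases; the real unicode61 tokenizer also removes diacritics by default and classifies characters by its own Unicode tables, so approx_unicode61_tokens (a hypothetical name) is not a faithful reimplementation.

```rust
/// Rough approximation of unicode61: split on non-alphanumeric characters,
/// drop empty pieces, and lowercase each token.
pub fn approx_unicode61_tokens(text: &str) -> Vec<String> {
    text.split(|c: char| !c.is_alphanumeric())
        .filter(|token| !token.is_empty())
        .map(|token| token.to_lowercase())
        .collect()
}
```

With this approximation, `{{age:34}}` yields the tokens `age` and `34`, matching the behavior described above.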

The FTS5 index is kept in sync via SQLite triggers:

  • pages_fts_insert: Indexes new pages (skips is_deleted = 1).
  • pages_fts_update: Re-indexes on page update.
  • pages_fts_delete: Removes from index on page delete.
  • page_tags_fts_insert / page_tags_fts_delete: Re-indexes the page when tags are added or removed.
  • tags_fts_name_update: Re-indexes all pages with a tag when the tag is renamed.

When a query is classified as semantic, SearchRouter calls EmbeddingProvider::embed(query) to generate a 768-dimensional vector using the snowflake-arctic-embed-m-v2.0 model via ONNX Runtime. See Embedding System for model details.

SqliteEmbeddingRepository::query_similar() performs brute-force cosine similarity:

  1. Loads all non-deleted page embeddings (capped at 10,000 rows).
  2. Computes cosine similarity in Rust (not in SQL — no sqlite-vec dependency).
  3. Filters below MIN_SIMILARITY_THRESHOLD (0.3).
  4. Sorts descending by score, truncates to limit.

The scan is linear in the number of stored embeddings: O(n) similarity computations per query. The 10,000-row cap bounds memory usage. This is adequate for workspace-scale datasets (under 10k pages).
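
The four steps above can be sketched as follows. This is a simplified illustration, not the repository code: embeddings are passed in as an in-memory slice instead of being loaded from SQLite, and the SimilarPage fields shown here are assumptions.

```rust
/// Hypothetical result row; the real SimilarPage type may carry more fields.
#[derive(Debug, Clone, PartialEq)]
pub struct SimilarPage {
    pub page_id: String,
    pub score: f32,
}

const MIN_SIMILARITY_THRESHOLD: f32 = 0.3;

/// Cosine similarity; returns 0.0 for mismatched lengths or zero vectors.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    if a.len() != b.len() {
        return 0.0;
    }
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        0.0
    } else {
        dot / (norm_a * norm_b)
    }
}

/// Brute-force scan: score every stored embedding, drop weak matches,
/// sort best-first, and keep at most `limit` results.
pub fn query_similar(
    query: &[f32],
    stored: &[(String, Vec<f32>)],
    limit: usize,
) -> Vec<SimilarPage> {
    let mut hits: Vec<SimilarPage> = stored
        .iter()
        .map(|(id, vector)| SimilarPage {
            page_id: id.clone(),
            score: cosine_similarity(query, vector),
        })
        .filter(|page| page.score >= MIN_SIMILARITY_THRESHOLD)
        .collect();
    hits.sort_by(|a, b| b.score.total_cmp(&a.score));
    hits.truncate(limit);
    hits
}
```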


When both FTS5 and semantic results are available, they are merged using RRF with k=60 (the standard value from the original research paper):

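RRF score(page) = sum over lists i of 1 / (k + rank_i), with k = 60

Here rank_i is the page's 1-based rank in list i; a list that does not contain the page contributes nothing.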

Each page accumulates contributions from every list it appears in. Pages appearing in both lists are boosted above pages in only one list.

Score semantics: After merging, the score field contains an RRF score (typically in the 0.01-0.03 range), NOT a cosine similarity or BM25 score. The FTS5 snippet is preserved for merged results; results found only by semantic search have no snippet.

Example with k=60:

  • Page in FTS5 at rank 1: 1/61 = 0.0164
  • Same page in semantic at rank 2: + 1/62 = 0.0161
  • Combined: 0.0164 + 0.0161 = 0.0325
  • A page only in FTS5 at rank 2: 1/62 = 0.0161

The intersection page ranks first.
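
A minimal sketch of the merge, assuming each input is a list of page ids ordered best-first. The signature is hypothetical: the real merge_rrf() works on full result types and preserves FTS5 snippets, which this illustration omits. Ties are broken by page id here only to make the output deterministic.

```rust
use std::collections::HashMap;

const RRF_K: f64 = 60.0;

/// Merges two ranked id lists with Reciprocal Rank Fusion.
/// Each list contributes 1 / (k + rank), with 1-based ranks.
pub fn merge_rrf<'a>(
    fts: &[&'a str],
    semantic: &[&'a str],
    limit: usize,
) -> Vec<(String, f64)> {
    let mut scores: HashMap<&str, f64> = HashMap::new();
    for list in [fts, semantic] {
        for (index, id) in list.iter().enumerate() {
            *scores.entry(*id).or_insert(0.0) += 1.0 / (RRF_K + (index + 1) as f64);
        }
    }
    let mut merged: Vec<(String, f64)> = scores
        .into_iter()
        .map(|(id, score)| (id.to_string(), score))
        .collect();
    // Highest score first; ties fall back to id order.
    merged.sort_by(|a, b| b.1.total_cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
    merged.truncate(limit);
    merged
}
```

Calling it with FTS5 results `["p1", "p2"]` and semantic results `["p3", "p1"]` reproduces the example: p1 scores 1/61 + 1/62 and ranks first, ahead of p3 (1/61) and p2 (1/62).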


The SearchRouter is generic over its dependencies:

pub struct SearchRouter<PR, ER, EP>
where
    PR: PageRepository,
    ER: EmbeddingRepository,
    EP: EmbeddingProvider,
{
    page_repo: Arc<PR>,
    embedding_repo: Arc<ER>,
    embedding_provider: Option<Arc<EP>>,
}

embedding_provider may be None if the ONNX model is not yet downloaded or loaded. In this case, all queries fall back to FTS5 only.

The router is constructed during workspace open (in start_embedding_pipeline) and cached in AppState::search_router. It is also shared with McpState for MCP search operations.


The search system degrades gracefully at every layer:

| Failure | Behavior |
|---|---|
| No embedding model downloaded | FTS5 only (no semantic search) |
| Embedding provider fails to embed query | FTS5 only + warning logged |
| Similarity search fails | FTS5 only + warning logged |
| No workspace open | Error returned to caller |
| Missing SearchUse capability | InvalidOperation error |

This ensures that search keeps returning FTS5 results whenever a workspace is open and the caller holds the SearchUse capability, even if the embedding subsystem is completely unavailable.


SearchRouter::search() requires the SearchUse capability:

guard
    .require(Capability::SearchUse)
    .map_err(|e| PageRepositoryError::InvalidOperation(e.to_string()))?;

This is checked before any database access. The capability is part of the standard owner permission set (always granted for local users).
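
To show how the check short-circuits before any database work, here is a self-contained sketch. The CapabilityGuard type, its constructor, the PageWrite variant, and search_guarded are hypothetical stand-ins; only Capability::SearchUse, require(), and PageRepositoryError::InvalidOperation appear in the snippet above.

```rust
use std::collections::HashSet;

#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum Capability {
    SearchUse,
    PageWrite, // Hypothetical second capability, for illustration only.
}

#[derive(Debug, PartialEq, Eq)]
pub enum PageRepositoryError {
    InvalidOperation(String),
}

/// Hypothetical guard holding the capabilities granted to the caller.
pub struct CapabilityGuard {
    granted: HashSet<Capability>,
}

impl CapabilityGuard {
    pub fn new(granted: Vec<Capability>) -> Self {
        Self { granted: granted.into_iter().collect() }
    }

    pub fn require(&self, capability: Capability) -> Result<(), String> {
        if self.granted.contains(&capability) {
            Ok(())
        } else {
            Err(format!("missing capability: {:?}", capability))
        }
    }
}

/// The guard runs first; a missing capability returns before any query work.
pub fn search_guarded(
    guard: &CapabilityGuard,
    _query: &str,
) -> Result<Vec<String>, PageRepositoryError> {
    guard
        .require(Capability::SearchUse)
        .map_err(|e| PageRepositoryError::InvalidOperation(e.to_string()))?;
    // Database access would start here; this sketch returns no rows.
    Ok(Vec::new())
}
```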


| Scenario | Entry Point |
|---|---|
| Search query dispatched | SearchRouter::search() in crates/application/src/search/search_router.rs |
| Intent classified | SearchRouter::classify_intent() in crates/application/src/search/search_router.rs |
| FTS5 search executed | SqlitePageRepository::search() in crates/infrastructure/sqlite/src/workspace/page_repository.rs |
| Query embedded | OnnxEmbeddingProvider::embed() in crates/infrastructure/onnx/src/provider.rs |
| Similarity query | SqliteEmbeddingRepository::query_similar() in crates/infrastructure/sqlite/src/workspace/embedding_repository.rs |
| Results merged | merge_rrf() in crates/application/src/search/search_router.rs |
| MCP search tool | tools::discovery::search() in apps/desktop/src-tauri/src/mcp/tools/discovery.rs |

  • Embedding System — Provides the embedding provider and repository used for semantic search.
  • Page System — FTS5 index is maintained via triggers on the pages table.
  • MCP System — search tool and search resource delegate to SearchRouter.
  • Tags System — Tags are indexed as the third FTS5 column with weight 5.0.
