Data Flow

Search

How a search query is classified, executed in parallel, and merged into ranked results.



The user types in the search input. The React component debounces input and calls:

invoke<SearchResult[]>("search_pages", { query, limit: 20 })

Code: apps/desktop/src-tauri/src/commands/search.rs

The search_pages command validates the query string, resolves the workspace and permission guard (requires SearchUse capability), then delegates to the SearchRouter cached in AppState:

let router = state.search_router.lock();
let results = router.search(&guard, &query, limit)?;

The router is constructed once during workspace open (start_embedding_pipeline) and shared as an Arc-wrapped instance. It is also shared with the MCP server for MCP search tool calls.

Code: crates/application/src/search/search_router.rs (SearchRouter::classify_intent)

classify_intent applies rule-based heuristics with no ML model:

| Condition | Intent |
|---|---|
| Empty or whitespace | Keyword |
| Surrounded by quotes ("..." or '...') | Keyword |
| Contains FTS5 operators (AND, OR, NOT, NEAR) | Keyword |
| Contains a date pattern (YYYY-MM-DD or YYYY/MM/DD) | Keyword |
| 1-2 words | Keyword |
| Single hyphenated lowercase token (my-page-slug) | Keyword |
| 3+ words, none of the above | Semantic |

Classification is synchronous and executes in microseconds. It runs before any I/O.

Known limitation: Natural-language queries that contain uppercase AND, OR, or NOT (e.g., “Pros AND Cons”) are classified as Keyword because the detector treats those words as FTS5 boolean operators.
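The rules in the table can be sketched as a single pure function. This is a minimal illustration, not the actual SearchRouter::classify_intent: the Intent enum, the contains_date helper, the rule ordering, and the handling of NEAR( are all assumptions made for the example.

```rust
/// Query intent, mirroring the two outcomes in the table above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Intent {
    Keyword,
    Semantic,
}

/// Hypothetical helper: true if `s` contains YYYY-MM-DD or YYYY/MM/DD.
fn contains_date(s: &str) -> bool {
    s.as_bytes().windows(10).any(|w| {
        let digits = |r: std::ops::Range<usize>| w[r].iter().all(|c| c.is_ascii_digit());
        let sep = w[4];
        (sep == b'-' || sep == b'/')
            && w[7] == sep
            && digits(0..4)
            && digits(5..7)
            && digits(8..10)
    })
}

/// Rule-based classification; no I/O and no model involved.
pub fn classify_intent(query: &str) -> Intent {
    let q = query.trim();
    if q.is_empty() {
        return Intent::Keyword;
    }

    // Whole query wrapped in matching double or single quotes.
    let quoted = q.len() >= 2
        && ['"', '\''].iter().any(|&c| q.starts_with(c) && q.ends_with(c));

    let words: Vec<&str> = q.split_whitespace().collect();

    // Uppercase FTS5 operators as standalone words (assumption: NEAR may
    // also appear in its function form, `NEAR(`).
    let has_operator = words
        .iter()
        .any(|w| matches!(*w, "AND" | "OR" | "NOT" | "NEAR") || w.starts_with("NEAR("));

    // Single lowercase token containing a hyphen, e.g. `my-page-slug`.
    let is_slug = words.len() == 1
        && q.contains('-')
        && q.chars()
            .all(|c| c.is_ascii_lowercase() || c.is_ascii_digit() || c == '-');

    if quoted || has_operator || contains_date(q) || is_slug || words.len() <= 2 {
        Intent::Keyword
    } else {
        Intent::Semantic
    }
}
```

In this sketch, "Pros AND Cons of Rust" is classified as Keyword, which reproduces the known limitation described above.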

Code: crates/infrastructure/sqlite/src/workspace/page/search.rs (SqlitePageRepository::search)

FTS5 search runs for every query regardless of intent. The pages_fts virtual table uses a 3-column contentless index:

SELECT pages.id, pages.slug, pages.title, pages.page_type,
       bm25(pages_fts, 10.0, 1.0, 5.0) AS score,
       COALESCE(snippet(...), ...) AS snippet
FROM pages_fts
JOIN pages ON pages.id = pages_fts.rowid
WHERE pages_fts MATCH ?
  AND pages.is_deleted = 0
ORDER BY score
LIMIT ?

BM25 weights: title=10.0, content=1.0, tags=5.0. The tokenizer is unicode61 (default), which treats non-alphanumeric characters as separators.

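SQLite's bm25() returns negative scores, and a more negative value means a better match, so ORDER BY score in ascending order puts the best match first. The application takes the absolute value before returning the score.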

If the query is classified as Semantic and an embedding provider is available:

Code: crates/infrastructure/onnx/src/provider.rs (OnnxEmbeddingProvider::embed)

The query string is tokenized and run through ONNX Runtime with the snowflake-arctic-embed-m-v2.0 model, producing a 768-dimensional Vec<f32>. Model inference takes approximately 5-15 ms and dominates total search latency.

If embedding fails at runtime, SearchRouter logs a warn! and returns FTS5 results only — search never errors out due to an embedding failure.

Code: crates/infrastructure/sqlite/src/workspace/embedding_repository.rs (SqliteEmbeddingRepository::query_similar)

  1. Load all non-deleted page embeddings from the page_embeddings table (capped at 10,000 rows).
  2. Compute cosine similarity in Rust against the query vector (not in SQL — no sqlite-vec dependency).
  3. Filter results below MIN_SIMILARITY_THRESHOLD (0.3).
  4. Sort descending by score, truncate to limit.

This is O(n) over stored embeddings. The 10,000-row cap prevents unbounded memory usage. For workspace-scale datasets (fewer than 10k pages), a linear scan is adequate without an ANN index.
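Steps 2 to 4 can be sketched as a brute-force scan in plain Rust. This is an illustration under assumptions, not the repository's code: the (String, Vec<f32>) row shape and the function signatures are hypothetical, and loading rows from page_embeddings is left out.

```rust
/// Threshold value taken from the text above.
const MIN_SIMILARITY_THRESHOLD: f32 = 0.3;

/// Cosine similarity of two vectors; returns 0.0 for mismatched lengths
/// or zero-norm input so callers never see NaN.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    if a.len() != b.len() || a.is_empty() {
        return 0.0;
    }
    let (mut dot, mut norm_a, mut norm_b) = (0.0f32, 0.0f32, 0.0f32);
    for (x, y) in a.iter().zip(b) {
        dot += x * y;
        norm_a += x * x;
        norm_b += y * y;
    }
    if norm_a == 0.0 || norm_b == 0.0 {
        return 0.0;
    }
    dot / (norm_a.sqrt() * norm_b.sqrt())
}

/// Score every stored embedding against `query`, drop weak matches,
/// sort best-first, and keep at most `limit` results.
pub fn query_similar(
    query: &[f32],
    stored: &[(String, Vec<f32>)], // hypothetical (page_id, embedding) rows
    limit: usize,
) -> Vec<(String, f32)> {
    let mut scored: Vec<(String, f32)> = stored
        .iter()
        .map(|(id, vector)| (id.clone(), cosine_similarity(query, vector)))
        .filter(|(_, score)| *score >= MIN_SIMILARITY_THRESHOLD)
        .collect();
    scored.sort_by(|a, b| b.1.total_cmp(&a.1));
    scored.truncate(limit);
    scored
}
```

The cost is one dot product per stored row, which is why the row cap matters more than the sort.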

Code: crates/application/src/search/search_router.rs (merge_rrf)

When both FTS5 and semantic results are available, they are merged using Reciprocal Rank Fusion with k=60 (the standard value from the original RRF research paper):

RRF score for page P = sum over all lists L of:
1 / (60 + rank_of_P_in_L)

Pages that appear in both lists accumulate contributions from both and rank above pages found in only one list. The FTS5 snippet is preserved for any page that has one (semantic results carry no snippet).

After merging, the score field on each SearchResult holds the RRF score (typically 0.01-0.03), not a BM25 or cosine value.
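A minimal sketch of the fusion step, assuming 1-based ranks and lists of page IDs already sorted best-first. The signature is hypothetical and snippet handling is omitted; the real merge_rrf may differ.

```rust
use std::collections::HashMap;

/// k = 60, as stated in the text above.
const RRF_K: f32 = 60.0;

/// Fuse any number of ranked ID lists into one list sorted by RRF score.
pub fn merge_rrf(lists: &[Vec<&str>], limit: usize) -> Vec<(String, f32)> {
    let mut scores: HashMap<&str, f32> = HashMap::new();
    for list in lists {
        for (index, id) in list.iter().enumerate() {
            // Assumption: ranks start at 1, so the top hit contributes 1/61.
            let rank = (index + 1) as f32;
            *scores.entry(*id).or_insert(0.0) += 1.0 / (RRF_K + rank);
        }
    }

    let mut merged: Vec<(String, f32)> = scores
        .into_iter()
        .map(|(id, score)| (id.to_string(), score))
        .collect();
    // Highest score first; break ties by ID so the output is deterministic.
    merged.sort_by(|a, b| b.1.total_cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
    merged.truncate(limit);
    merged
}
```

With these assumptions, a page ranked first in both lists scores 2/61 ≈ 0.033 and a page ranked first in only one list scores 1/61 ≈ 0.016, which matches the typical range quoted above.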

The command returns Vec<SearchResult> to the frontend via Tauri IPC. Each result contains:

| Field | Source |
|---|---|
| id | Page UUID |
| slug | Page slug |
| title | Page title |
| snippet | Highlighted excerpt from FTS5 (empty for semantic-only results) |
| score | RRF score (hybrid) or BM25 absolute value (keyword-only) |
| page_type | Page type from the pages table |

| Scenario | Behavior |
|---|---|
| No embedding model downloaded | FTS5 only; no error to user |
| Embedding provider fails to embed query | FTS5 only; warn! logged |
| Similarity search fails | FTS5 only; warn! logged |
| No SearchUse capability | InvalidOperation error |
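The degradation rules above amount to treating the semantic path as optional and its errors as non-fatal. The sketch below illustrates that pattern only: the Outcome enum and search_with_fallback are hypothetical names, the semantic step is passed in as a closure, and eprintln! stands in for the warn! log call.

```rust
/// What the caller ends up with after the optional semantic step.
#[derive(Debug, PartialEq)]
pub enum Outcome {
    KeywordOnly(Vec<String>),
    Hybrid {
        keyword: Vec<String>,
        semantic: Vec<String>,
    },
}

/// `semantic` is None when no embedding provider is available; when it is
/// present, any error it returns is logged and the FTS5 hits are used alone.
pub fn search_with_fallback<S>(keyword_hits: Vec<String>, semantic: Option<S>) -> Outcome
where
    S: FnOnce() -> Result<Vec<String>, String>,
{
    match semantic {
        None => Outcome::KeywordOnly(keyword_hits),
        Some(run) => match run() {
            Ok(semantic_hits) => Outcome::Hybrid {
                keyword: keyword_hits,
                semantic: semantic_hits,
            },
            Err(err) => {
                // Stand-in for the warn! log mentioned above.
                eprintln!("warn: semantic search failed, using FTS5 only: {}", err);
                Outcome::KeywordOnly(keyword_hits)
            }
        },
    }
}
```

The permission check is deliberately outside this function: a missing SearchUse capability is a hard error, while everything on the semantic path degrades silently.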

The SearchRouter is generic over its dependencies:

pub struct SearchRouter<PR: PageRepository, ER: EmbeddingRepository, EP: EmbeddingProvider> {
    page_repo: Arc<PR>,
    embedding_repo: Arc<ER>,
    embedding_provider: Option<Arc<EP>>, // None if model not loaded
}

It is constructed during workspace open and stored in AppState. The same instance is shared with McpState — MCP search tool calls use the same router with no duplication.
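Sharing one instance between two state structs usually means cloning an Arc, not the router itself. The sketch below shows that pattern with a stand-in Router type. The field names, the open_workspace function, and the use of std::sync::Mutex are assumptions for illustration; the real AppState and McpState may be laid out differently or use another lock type.

```rust
use std::sync::{Arc, Mutex};

/// Stand-in for the real SearchRouter; the counter only makes sharing observable.
pub struct Router {
    pub queries_served: u64,
}

/// Hypothetical state structs, each holding a handle to the same router.
pub struct AppState {
    pub search_router: Arc<Mutex<Router>>,
}

pub struct McpState {
    pub search_router: Arc<Mutex<Router>>,
}

/// Build the router once and hand a cloned Arc handle to each consumer.
pub fn open_workspace() -> (AppState, McpState) {
    let router = Arc::new(Mutex::new(Router { queries_served: 0 }));
    (
        AppState {
            search_router: Arc::clone(&router),
        },
        McpState {
            search_router: router,
        },
    )
}
```

Arc::clone copies only the pointer and bumps a reference count, so both states observe the same router.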


  • Search System: Full search system reference, including FTS5 configuration and intent classification details
  • Embedding System: ONNX embedding provider, model download, and background embedding pipeline
  • MCP System: The search MCP tool delegates to the same SearchRouter
  • Tag System: Tags are indexed as the third FTS5 column with BM25 weight 5.0
