Search

How a search query is classified, executed in parallel, and merged into ranked results.

Overview

Step-by-Step Details

1. Frontend Query

The user types in the search input. The React component debounces input and calls:

invoke<SearchResult[]>("search_pages", { query, limit: 20 })

2. Tauri Command

Code: apps/desktop/src-tauri/src/commands/search.rs

The search_pages command validates the query string, resolves the workspace and permission guard (requires SearchUse capability), then delegates to the SearchRouter cached in AppState:

let router = state.search_router.lock();
let results = router.search(&guard, &query, limit)?;

The router is constructed once during workspace open (start_embedding_pipeline) and shared as an Arc-wrapped instance. It is also shared with the MCP server for MCP search tool calls.

3. Intent Classification

Code: crates/application/src/search/search_router.rs — SearchRouter::classify_intent

classify_intent applies rule-based heuristics with no ML model:

Condition	Intent
Empty or whitespace	Keyword
Surrounded by quotes (`"..."` or `'...'`)	Keyword
Contains FTS5 operators (`AND`, `OR`, `NOT`, `NEAR`)	Keyword
Contains a date pattern (`YYYY-MM-DD` or `YYYY/MM/DD`)	Keyword
1-2 words	Keyword
Single hyphenated lowercase token (`my-page-slug`)	Keyword
3+ words, none of the above	Semantic

Classification is synchronous and executes in microseconds. It runs before any I/O.

Known limitation: Natural-language queries that contain an uppercase AND, OR, or NOT (e.g., “Pros AND Cons”) are classified as Keyword, because the detector treats those words as FTS5 boolean operators.
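The rules in the table can be sketched roughly as follows. This is an illustrative reconstruction, not the code in search_router.rs: the Intent enum, the helper names, and the order of the checks are assumptions. It also assumes a hyphenated slug counts as a single whitespace-separated token, so the 1-2 word rule covers it.

```rust
/// Search intent, mirroring the two outcomes in the table above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Intent {
    Keyword,
    Semantic,
}

/// True if `s` contains a `YYYY-MM-DD` or `YYYY/MM/DD` pattern anywhere.
fn contains_date(s: &str) -> bool {
    s.as_bytes().windows(10).any(|w| {
        let digits = |r: std::ops::Range<usize>| w[r].iter().all(|c| c.is_ascii_digit());
        (w[4] == b'-' || w[4] == b'/')
            && w[7] == w[4]
            && digits(0..4)
            && digits(5..7)
            && digits(8..10)
    })
}

/// True if the whole query is wrapped in matching single or double quotes.
fn is_quoted(s: &str) -> bool {
    s.len() >= 2
        && ((s.starts_with('"') && s.ends_with('"'))
            || (s.starts_with('\'') && s.ends_with('\'')))
}

/// True if any whitespace-separated token is an uppercase FTS5 operator.
fn has_fts5_operator(s: &str) -> bool {
    s.split_whitespace()
        .any(|t| matches!(t, "AND" | "OR" | "NOT" | "NEAR"))
}

/// Hypothetical stand-in for SearchRouter::classify_intent.
pub fn classify_intent(query: &str) -> Intent {
    let q = query.trim();
    if q.is_empty() || is_quoted(q) || has_fts5_operator(q) || contains_date(q) {
        return Intent::Keyword;
    }
    // 1-2 words; a single hyphenated slug is one token, so it lands here too.
    if q.split_whitespace().count() <= 2 {
        return Intent::Keyword;
    }
    Intent::Semantic
}
```

Because every check is a string scan with no I/O, a classifier of this shape stays cheap enough to run on every keystroke after debouncing.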

4. FTS5 Search (always runs)

Code: crates/infrastructure/sqlite/src/workspace/page/search.rs — SqlitePageRepository::search

FTS5 search runs for every query regardless of intent. The pages_fts virtual table uses a 3-column contentless index:

SELECT pages.id, pages.slug, pages.title, pages.page_type,
       bm25(pages_fts, 10.0, 1.0, 5.0) as score,
       COALESCE(snippet(...), ...) as snippet
FROM pages_fts
JOIN pages ON pages.id = pages_fts.rowid
WHERE pages_fts MATCH ?
  AND pages.is_deleted = 0
ORDER BY score
LIMIT ?

BM25 weights: title=10.0, content=1.0, tags=5.0. The tokenizer is unicode61 (default), which treats non-alphanumeric characters as separators.

SQLite's bm25() returns negative values, with more negative meaning a better match, so ORDER BY score puts the best matches first. The application takes the absolute value before returning the score.

5. Semantic Search (Semantic intent only)

If the query is classified as Semantic and an embedding provider is available:

5a. Query Embedding

Code: crates/infrastructure/onnx/src/provider.rs — OnnxEmbeddingProvider::embed

The query string is tokenized and run through the ONNX Runtime with the snowflake-arctic-embed-m-v2.0 model. The output is a 768-dimensional Vec<f32>. Model inference takes approximately 5-15ms and dominates the total search latency.

If embedding fails at runtime, SearchRouter logs a warn! and returns FTS5 results only — search never errors out due to an embedding failure.

5b. Similarity Search

Code: crates/infrastructure/sqlite/src/workspace/embedding_repository.rs — SqliteEmbeddingRepository::query_similar

1. Load all non-deleted page embeddings from the page_embeddings table (capped at 10,000 rows).
2. Compute cosine similarity in Rust against the query vector (not in SQL; there is no sqlite-vec dependency).
3. Discard results below MIN_SIMILARITY_THRESHOLD (0.3).
4. Sort descending by score and truncate to limit.

This is a linear scan, O(n) in the number of stored embeddings. The 10,000-row cap prevents unbounded memory usage. For workspace-scale datasets (fewer than 10k pages), a linear scan is adequate without an ANN index.
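A minimal sketch of the scan described above, assuming the embeddings have already been loaded into memory as (page id, vector) pairs. The function signatures and the tuple representation are hypothetical; only the 0.3 threshold and the filter, sort, truncate order come from the text.

```rust
/// Threshold named in the text above.
const MIN_SIMILARITY_THRESHOLD: f32 = 0.3;

/// Cosine similarity; returns 0.0 for mismatched lengths or zero-norm vectors.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    if a.len() != b.len() {
        return 0.0;
    }
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        0.0
    } else {
        dot / (norm_a * norm_b)
    }
}

/// Hypothetical brute-force scan over `(page_id, embedding)` pairs.
pub fn query_similar(
    query: &[f32],
    stored: &[(String, Vec<f32>)],
    limit: usize,
) -> Vec<(String, f32)> {
    let mut hits: Vec<(String, f32)> = stored
        .iter()
        .map(|(id, emb)| (id.clone(), cosine_similarity(query, emb)))
        // Keep only results at or above the threshold.
        .filter(|(_, score)| *score >= MIN_SIMILARITY_THRESHOLD)
        .collect();
    // Highest similarity first.
    hits.sort_by(|a, b| b.1.total_cmp(&a.1));
    hits.truncate(limit);
    hits
}
```

Each comparison costs one dot product and two norms over 768 floats, so a full pass over the 10,000-row cap is a few million multiply-adds.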

6. RRF Merge

Code: crates/application/src/search/search_router.rs — merge_rrf

When both FTS5 and semantic results are available, they are merged using Reciprocal Rank Fusion with k=60 (the standard value from the original RRF research paper):

RRF score for page P = sum over all lists L of:
    1 / (60 + rank_of_P_in_L)

Pages that appear in both lists accumulate contributions from both and rank above pages found in only one list. The FTS5 snippet is preserved for any page that has one (semantic results carry no snippet).

After merging, score on each SearchResult is the RRF score (typically 0.01-0.03), not a BM25 or cosine value.
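The fusion step can be sketched as below. This is not the actual merge_rrf: it works on bare page ids rather than SearchResult values, omits snippet preservation, and assumes ranks are 1-based, which the text does not state. The tie-break by id is also an assumption, added to keep the output deterministic.

```rust
use std::collections::HashMap;

/// The k constant named in the text above.
const RRF_K: f64 = 60.0;

/// Hypothetical RRF over ranked lists of page ids (best match first).
pub fn merge_rrf(lists: &[Vec<String>], limit: usize) -> Vec<(String, f64)> {
    let mut scores: HashMap<String, f64> = HashMap::new();
    for list in lists {
        for (idx, id) in list.iter().enumerate() {
            // Assumption: ranks start at 1, so the top hit contributes 1/61.
            let rank = (idx + 1) as f64;
            *scores.entry(id.clone()).or_insert(0.0) += 1.0 / (RRF_K + rank);
        }
    }
    let mut merged: Vec<(String, f64)> = scores.into_iter().collect();
    // Highest fused score first; ties broken by id for a stable order.
    merged.sort_by(|a, b| b.1.total_cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
    merged.truncate(limit);
    merged
}
```

Under the 1-based assumption, a page ranked first in both lists scores 2/61 ≈ 0.033 and a page ranked first in only one list scores 1/61 ≈ 0.016, which is consistent with the 0.01-0.03 range quoted above.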

7. Results Returned

The command returns Vec<SearchResult> to the frontend via Tauri IPC. Each result contains:

Field	Source
`id`	Page UUID
`slug`	Page slug
`title`	Page title
`snippet`	Highlighted excerpt from FTS5 (empty for semantic-only results)
`score`	RRF score (hybrid) or BM25 absolute value (keyword-only)
`page_type`	Page type from the `pages` table

Graceful Degradation

Scenario	Behavior
No embedding model downloaded	FTS5 only; no error to user
Embedding provider fails to embed query	FTS5 only; `warn!` logged
Similarity search fails	FTS5 only; `warn!` logged
No `SearchUse` capability	`InvalidOperation` error
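The first three rows share one shape: any failure on the semantic path yields an empty semantic list, and the caller proceeds with FTS5 results alone. A rough sketch under stated assumptions: the function name and the closure-based signature are hypothetical stand-ins for the embedding provider and repository, and eprintln! stands in for the warn! macro.

```rust
/// Hypothetical helper: returns semantic hits, or an empty list whenever the
/// semantic path is unavailable or fails, so the caller falls back to FTS5.
pub fn semantic_ids_or_empty<E, S>(embed: Option<E>, similar: S, query: &str) -> Vec<String>
where
    E: Fn(&str) -> Result<Vec<f32>, String>,
    S: Fn(&[f32]) -> Result<Vec<String>, String>,
{
    // No embedding model downloaded: skip the semantic path silently.
    let Some(embed) = embed else {
        return Vec::new();
    };
    // Either step may fail; both failures degrade to FTS5 only.
    match embed(query).and_then(|v| similar(v.as_slice())) {
        Ok(ids) => ids,
        Err(err) => {
            // Stand-in for `warn!` in the real router.
            eprintln!("warn: semantic search failed, using FTS5 only: {err}");
            Vec::new()
        }
    }
}
```

Keeping the fallback in one place means the merge step never needs to know why the semantic list is empty.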

SearchRouter Construction

The SearchRouter is generic over its dependencies:

pub struct SearchRouter<PR: PageRepository, ER: EmbeddingRepository, EP: EmbeddingProvider> {
    page_repo: Arc<PR>,
    embedding_repo: Arc<ER>,
    embedding_provider: Option<Arc<EP>>,  // None if model not loaded
}

It is constructed during workspace open and stored in AppState. The same instance is shared with McpState — MCP search tool calls use the same router with no duplication.

Search System — Full search system reference including FTS5 configuration and intent classification details
Embedding System — ONNX embedding provider, model download, and background embedding pipeline
MCP System — search MCP tool delegates to the same SearchRouter
Tag System — Tags are indexed as the third FTS5 column with BM25 weight 5.0
