
Embedding

How page content is vectorized and indexed for semantic search, covering both incremental and bulk paths.



Code: apps/desktop/src-tauri/src/side_effects.rs, apps/desktop/src-tauri/src/embedding.rs

After SaveBlockContentUseCase succeeds, WriteEffectCoordinator::on_block_content_saved calls:

task.events.try_send(EmbeddingEvent::EmbedPage { page_id })

try_send is non-blocking. If the channel is full (capacity 256), the event is silently dropped — the next save will re-queue it. This keeps slow embedding from ever applying backpressure to the write path.
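
A minimal sketch of this fire-and-forget send, assuming a tokio::sync::mpsc bounded channel; the event enum shape and the helper function are illustrative, not the codebase's actual types:

```rust
use tokio::sync::mpsc;

enum EmbeddingEvent {
    EmbedPage { page_id: u64 }, // page_id type is an assumption
    IndexWorkspace,
}

fn notify_page_saved(events: &mpsc::Sender<EmbeddingEvent>, page_id: u64) {
    // try_send never awaits: a full channel returns an error instead of
    // applying backpressure to the caller.
    if let Err(mpsc::error::TrySendError::Full(_)) =
        events.try_send(EmbeddingEvent::EmbedPage { page_id })
    {
        // Dropped on purpose: the next save of this page re-queues it.
    }
}
```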

Code: apps/desktop/src-tauri/src/embedding.rs

On workspace open or after a model upgrade, EmbeddingEvent::IndexWorkspace is sent to the same bounded channel. The EmbeddingTask gives individual pages priority: if both EmbedPage events and an IndexWorkspace event appear in the same batch, the workspace index runs only after all individual pages are processed.

Code: apps/desktop/src-tauri/src/embedding.rs (EmbeddingTask::handle_batch)

The EmbeddingTask accumulates events with a 2-second debounce window. Within a batch, page IDs from EmbedPage events are deduplicated via HashSet — if a page is saved 10 times within the debounce window, it is embedded only once.
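A sketch of the batch loop under those rules, reusing the event enum from the sketch above; the function names and loop structure are assumptions, while the 2-second debounce, HashSet dedup, and pages-before-workspace ordering come from the text:

```rust
use std::collections::HashSet;
use std::time::Duration;
use tokio::sync::mpsc;
use tokio::time::timeout;

async fn run_task(mut events: mpsc::Receiver<EmbeddingEvent>) {
    loop {
        // Block until the first event of a batch arrives.
        let Some(first) = events.recv().await else { return };
        let mut batch = vec![first];
        // Debounce: keep draining until 2 seconds pass with no new event.
        while let Ok(Some(event)) = timeout(Duration::from_secs(2), events.recv()).await {
            batch.push(event);
        }
        handle_batch(batch).await;
    }
}

async fn handle_batch(batch: Vec<EmbeddingEvent>) {
    let mut pages = HashSet::new();
    let mut index_workspace = false;
    for event in batch {
        match event {
            // The HashSet dedups repeated saves of the same page.
            EmbeddingEvent::EmbedPage { page_id } => {
                pages.insert(page_id);
            }
            EmbeddingEvent::IndexWorkspace => index_workspace = true,
        }
    }
    for page_id in pages {
        let _ = page_id; // embed_page(page_id).await
    }
    if index_workspace {
        // The bulk pass runs last, after every individual page in the batch.
        // index_workspace_pass().await
    }
}
```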

ONNX inference is CPU-bound and can take 10-50ms per page. Running it on a Tokio async thread would block the executor. EmbeddingTask uses tokio::task::spawn_blocking to move all embedding work to a dedicated thread pool thread.
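A sketch of that handoff; tokio::task::spawn_blocking is the real API, while run_inference is a hypothetical stand-in for the tokenizer and ONNX session:

```rust
use tokio::task;

async fn embed_page(page_id: u64) -> Result<Vec<f32>, task::JoinError> {
    // CPU-bound ONNX work moves to the blocking thread pool so it never
    // stalls an async worker thread for 10-50ms.
    task::spawn_blocking(move || run_inference(page_id)).await
}

// Hypothetical stand-in for the real tokenize + ONNX session call.
fn run_inference(_page_id: u64) -> Vec<f32> {
    vec![0.0; 768]
}
```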

Code: crates/application/src/embedding/pipeline.rs (EmbeddingPipeline::embed_page)

The pipeline loads page content via PageRepository::get_text_content(page_id), which returns (title, Vec<String>) — the page title and block text contents. This method intentionally avoids loading content_loro BLOBs; only the materialized text columns are needed for embedding.

Text is assembled as:

{title}\n\n{block_text_1}\n\n{block_text_2}\n...

Pages with empty assembled text are skipped (no embedding stored).
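A sketch of the assembly, assuming the (title, Vec<String>) shape returned by get_text_content; returning None models the skip for empty pages, and the function name is illustrative:

```rust
/// Assembles "{title}\n\n{block_text_1}\n\n{block_text_2}..." and returns
/// None for empty pages so the caller can skip embedding entirely.
fn assemble_text(title: &str, blocks: &[String]) -> Option<String> {
    let text = std::iter::once(title.to_string())
        .chain(blocks.iter().cloned())
        .collect::<Vec<_>>()
        .join("\n\n");
    if text.trim().is_empty() { None } else { Some(text) }
}
```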

6. ONNX Inference: Tokenize -> Infer -> Pool -> Normalize


Code: crates/infrastructure/onnx/src/provider.rs

OnnxEmbeddingProvider uses the snowflake-arctic-embed-m-v2.0 model with 768-dimensional output and a 512-token context window.

Pipeline for a single text:

  1. Tokenize: HuggingFace tokenizers crate truncates to 512 tokens and pads for batch alignment
  2. ONNX inference: the Session runs with GraphOptimizationLevel::Level3 and a single intra-op thread (no parallelism per call — the Mutex ensures single-threaded session access)
  3. Pool: If the model outputs [batch, seq_len, hidden_size] (token-level), mean pooling over non-masked tokens is applied. If the model outputs [batch, hidden_size] (pre-pooled), pooling is skipped
  4. L2 normalize: The pooled vector is divided by its L2 norm so cosine similarity can be computed as a dot product

For batch indexing, embed_batch tokenizes all texts together and runs a single ONNX session call.
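The pooling and normalization steps in plain Rust, as a sketch over raw slices rather than the real ONNX output tensor; the shapes and mask convention here are assumptions:

```rust
/// Mean-pools token-level output (seq_len x hidden) over non-masked tokens,
/// then L2-normalizes so cosine similarity becomes a plain dot product.
fn pool_and_normalize(token_embeddings: &[Vec<f32>], attention_mask: &[u32]) -> Vec<f32> {
    let hidden = token_embeddings.first().map_or(0, Vec::len);
    let mut pooled = vec![0.0f32; hidden];
    let mut count = 0.0f32;
    for (token, &mask) in token_embeddings.iter().zip(attention_mask) {
        if mask == 1 {
            for (p, v) in pooled.iter_mut().zip(token) {
                *p += v;
            }
            count += 1.0;
        }
    }
    for p in pooled.iter_mut() {
        *p /= count.max(1.0);
    }
    // L2 normalize: divide by the Euclidean norm (guarded against zero).
    let norm = pooled.iter().map(|v| v * v).sum::<f32>().sqrt().max(f32::EPSILON);
    for p in pooled.iter_mut() {
        *p /= norm;
    }
    pooled
}
```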

Code: crates/infrastructure/sqlite/src/workspace/embedding_repository.rs

The 768-dimensional Vec<f32> is serialized as a raw byte BLOB (f32 little-endian) and upserted into the page_embeddings table with (page_id, model_id, model_version) as the key. Existing rows are updated on conflict.

For bulk indexing, upsert_batch wraps all inserts in a single transaction.
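A sketch of the serialization and upsert, assuming the rusqlite crate; the table name and conflict key come from the text, while the embedding column name, ID types, and exact SQL are illustrative:

```rust
use rusqlite::{params, Connection};

/// Serializes the 768-dim vector as little-endian f32 bytes and upserts it
/// keyed on (page_id, model_id, model_version).
fn upsert_embedding(
    conn: &Connection,
    page_id: &str,
    model_id: &str,
    model_version: i64,
    embedding: &[f32],
) -> rusqlite::Result<()> {
    let blob: Vec<u8> = embedding.iter().flat_map(|v| v.to_le_bytes()).collect();
    conn.execute(
        "INSERT INTO page_embeddings (page_id, model_id, model_version, embedding)
         VALUES (?1, ?2, ?3, ?4)
         ON CONFLICT (page_id, model_id, model_version)
         DO UPDATE SET embedding = excluded.embedding",
        params![page_id, model_id, model_version, blob],
    )?;
    Ok(())
}
```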

Code: crates/application/src/embedding/pipeline.rs (EmbeddingPipeline::index_workspace)

The pipeline fetches stale pages in batches of 100 (pages missing an embedding for the current model_id+model_version). Each batch is chunked into groups of 16 (batch_size) for ONNX batch inference. Progress is reported via an FnMut(completed, total) callback that publishes to a watch::Sender<IndexingStatus> channel visible to the UI.
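A sketch of the chunking and progress reporting; the 16-page chunks and watch::Sender<IndexingStatus> come from the text, while the struct fields and the direct send (standing in for the FnMut callback) are assumptions:

```rust
use tokio::sync::watch;

#[derive(Clone)]
struct IndexingStatus {
    completed: usize,
    total: usize,
}

fn index_stale_pages(stale_pages: &[String], status: &watch::Sender<IndexingStatus>) {
    let total = stale_pages.len();
    let mut completed = 0;
    // Each 16-page chunk becomes one ONNX batch-inference call.
    for chunk in stale_pages.chunks(16) {
        // embed_batch(chunk) and upsert_batch(...) would run here.
        completed += chunk.len();
        // A send error only means no UI subscriber is listening; ignore it.
        let _ = status.send(IndexingStatus { completed, total });
    }
}
```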

Code: crates/application/src/search/search_router.rs

At search time, SearchRouter::classify_intent determines whether to run semantic search (queries of 3+ natural-language words). For semantic queries:

  1. The query text is embedded via the same OnnxEmbeddingProvider (5-15ms)
  2. All non-deleted page embeddings are loaded from SQLite (capped at 10,000 rows)
  3. Cosine similarity is computed in Rust as a dot product (vectors are pre-normalized)
  4. Results below MIN_SIMILARITY_THRESHOLD (0.3) are filtered out
  5. Semantic results are merged with FTS5 BM25 results via Reciprocal Rank Fusion (k=60)
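
Steps 3 and 5 as a sketch: cosine over pre-normalized vectors reduces to a dot product, and Reciprocal Rank Fusion with k=60 scores each page by summed reciprocal ranks across the two result lists (the threshold filter from step 4 would sit between them). Function names and ID types are illustrative:

```rust
use std::collections::HashMap;

/// Cosine similarity reduces to a dot product for L2-normalized vectors.
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Reciprocal Rank Fusion over two rank-ordered lists of page IDs:
/// score(page) = sum over lists of 1 / (k + rank), with rank starting at 1.
fn rrf_merge(semantic: &[String], fts: &[String]) -> Vec<(String, f64)> {
    const K: f64 = 60.0;
    let mut scores: HashMap<String, f64> = HashMap::new();
    for list in [semantic, fts] {
        for (i, id) in list.iter().enumerate() {
            *scores.entry(id.clone()).or_default() += 1.0 / (K + (i + 1) as f64);
        }
    }
    let mut merged: Vec<_> = scores.into_iter().collect();
    merged.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    merged
}
```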

| Failure | Behavior |
| --- | --- |
| Channel full (try_send fails) | Event dropped; page embedded on next save |
| Page not found in get_text_content | warn! logged; page skipped; no embedding stored |
| Empty page content | Embedding skipped silently; no row upserted |
| ONNX model not downloaded | Provider not constructed; SearchRouter falls back to FTS5 only |
| ONNX inference error | PipelineError::Embedding logged; page skipped during bulk; query falls back to FTS5 |
| spawn_blocking panics | error! logged; TaskError returned; task continues processing next batch |
| SQLite upsert failure | PipelineError::Repository logged; page skipped |
| Semantic search failure at query time | warn! logged; FTS5 results returned; no error to user |

  • Embedding System — Model download, ONNX provider configuration, and page_embeddings schema
  • Search System — SearchRouter intent classification and RRF merge details
  • Write Path — WriteEffectCoordinator triggers the embedding pipeline after every block save
  • Search Data Flow — Full search query flow from frontend to ranked results
