Sync Engine Safety: Cursor Advancement, Poison Pill Recovery, and Self-Update Prevention

Problem

Three interrelated cursor and queue safety issues were discovered during the INK-72 Multi-Device Sync code review:

  1. Cursor advances past merge failures — The pull phase computed new_cursor from all pulled update IDs, even if some blocks failed to merge. On the next cycle, those updates are skipped permanently.
  2. Poison pill queue entries — A malformed or permanently-failing sync queue entry retries forever with exponential backoff, blocking all newer updates behind it.
  3. Self-update re-merge — After pushing updates, the device’s own realtime subscription fires, triggering a pull that re-fetches and re-merges its own updates. Wasteful and can cause cursor confusion.

Symptoms:

  • Blocks appear to lose remote edits after a merge failure
  • Sync queue drains slowly or stalls entirely
  • CPU/network waste from redundant self-merge cycles
  • sync_pull_merge_failed errors in logs with no retry

Root Cause

Cursor Advancement

The original pull logic advanced the cursor based on the fetched update ID, not the successfully merged update ID:

// WRONG: cursor advances regardless of merge success
let new_cursor = updates.iter().map(|u| u.id).max().unwrap_or(cursor);
// ... merge loop with `continue` on failure ...
let _ = self.block_storage.set_sync_cursor(workspace_path, new_cursor); // error silently discarded

When a block’s merge fails (corrupt data, schema mismatch, etc.), the continue skips it, but the cursor has already been computed to include it. On the next pull, the cursor is past that block’s updates — they’re permanently lost. For example, if updates 41–43 are pulled and 42’s merge fails, the cursor still lands on 43, so update 42 is never fetched again.

Poison Pill

The sync queue had no retry cap. A permanently-failing entry (e.g., block deleted on server) would be retried on every sync cycle, incrementing retry_count but never being removed. Combined with the exponential backoff state machine, this could stall the entire sync pipeline.
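
The shape of the problem, as a minimal sketch (delay values are illustrative, not taken from the codebase): the per-attempt delay saturates, but nothing caps the number of attempts.

// Illustrative sketch: the backoff delay is capped, but retry_count is
// unbounded. A permanently-failing entry cycles through this forever, and
// because the queue drains oldest-first, every newer entry waits behind it.
fn next_backoff(retry_count: u32) -> std::time::Duration {
    let secs = 2u64.saturating_pow(retry_count).min(300);
    std::time::Duration::from_secs(secs)
}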

Self-Update

When Device A pushes an update to Supabase, Postgres Realtime fires an INSERT event on block_updates. If Device A subscribes to that table, it receives its own update notification, triggers a pull, and re-fetches/re-merges its own changes. While Loro handles idempotent merges, this wastes computation and network.

Solution

1. Track Max Successfully Merged ID

// In sync_engine.rs pull_phase()
let mut max_success_id = cursor;

for (block_id, block_updates) in &grouped {
    // Filter self-updates (see #3 below)
    let remote_updates: Vec<_> = block_updates
        .iter()
        .filter(|u| u.device_id != device_id_str)
        .collect();

    if remote_updates.is_empty() {
        // Still advance cursor past self-updates
        for update in block_updates {
            if update.id > max_success_id {
                max_success_id = update.id;
            }
        }
        continue;
    }

    // Attempt merge...
    match self.merge_block_updates(workspace_path, block_id, &remote_updates) {
        Ok(()) => {
            // Only advance cursor for successfully merged blocks
            for update in block_updates {
                if update.id > max_success_id {
                    max_success_id = update.id;
                }
            }
        }
        Err(e) => {
            tracing::warn!(
                block_id = %block_id,
                error = %e,
                "sync_pull_merge_failed — cursor NOT advanced for this block"
            );
            // Do NOT advance cursor — these updates will be retried
            continue;
        }
    }
}

// Only update cursor if we had any successes
if max_success_id > cursor {
    if let Err(e) = self.block_storage.set_sync_cursor(workspace_path, max_success_id) {
        tracing::warn!(error = %e, "sync_pull_local_cursor_update_failed");
    }
}

Key semantic: The cursor represents “all updates up to this point have been successfully processed”, not “seen”. Failed blocks stay below the cursor watermark and are re-fetched on the next cycle.
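
This works because the pull fetches strictly above the stored cursor. Assuming the fetch is shaped roughly like the following (illustrative only; the real fetch goes through the Supabase client, and column names beyond id, block_id, and device_id are guesses), a failed block’s updates keep IDs above the cursor and reappear on the next cycle:

// Assumed shape of the pull fetch (hypothetical SQL, not the actual query)
let sql = "SELECT id, block_id, device_id, update_bytes \
           FROM block_updates \
           WHERE id > ?1 \
           ORDER BY id ASC";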

2. Max Retry Cap for Sync Queue

// In offline_queue.rs dequeue_batch()
fn dequeue_batch(&self, workspace_path: &Path, limit: usize) -> SyncResult<Vec<QueuedUpdate>> {
    let db = Self::open_db(workspace_path)?;
    db.with_connection(|conn| {
        let mut stmt = conn.prepare(
            "SELECT id, block_id, update_bytes, created_at, retry_count \
             FROM sync_queue \
             WHERE retry_count < 10 \
             ORDER BY created_at ASC LIMIT ?1",
        )?;
        // ...
    })
}

Entries with retry_count >= 10 are effectively dead-lettered — they remain in the table for debugging but don’t block the queue. A future improvement could move them to a sync_dead_letters table.
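
A hedged sketch of that dead-letter move (the sync_dead_letters table and the helper name are hypothetical; the columns mirror the dequeue query above):

// Hypothetical dead-letter sweep, not yet implemented: copy exhausted
// entries aside, then remove them from the live queue, in one transaction.
fn dead_letter_exhausted(conn: &mut rusqlite::Connection) -> rusqlite::Result<()> {
    let tx = conn.transaction()?;
    tx.execute_batch(
        "INSERT INTO sync_dead_letters (id, block_id, update_bytes, created_at, retry_count)
         SELECT id, block_id, update_bytes, created_at, retry_count
         FROM sync_queue
         WHERE retry_count >= 10;
         DELETE FROM sync_queue WHERE retry_count >= 10;",
    )?;
    tx.commit()
}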

3. Filter Self-Updates During Pull

let device_id_str = device_id.to_string();

for (block_id, block_updates) in &grouped {
    let remote_updates: Vec<_> = block_updates
        .iter()
        .filter(|u| u.device_id != device_id_str)
        .collect();

    if remote_updates.is_empty() {
        // Advance cursor past self-updates (no merge needed)
        for update in block_updates {
            if update.id > max_success_id {
                max_success_id = update.id;
            }
        }
        continue;
    }

    // Merge only remote_updates...
}

Important: Self-updates still advance the cursor. The device knows it already has these changes locally — skipping the merge is safe, but the cursor must move past them to avoid re-fetching on the next cycle.
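
Filtering removes the redundant merge, but the device’s own notification still arrives and can still schedule a pull. The frontend additionally debounces realtime events by 500ms (see realtime-sync.ts in the references). A minimal sketch of that idea, written here in Rust rather than the actual TypeScript, with illustrative names:

// Trailing-edge debounce: a burst of notifications (including the device's
// own echo) collapses into a single pull once things stay quiet for 500ms.
use std::time::Duration;
use tokio::sync::mpsc;

async fn debounced_pull_loop(mut notifications: mpsc::Receiver<()>) {
    while notifications.recv().await.is_some() {
        // Restart the quiet window each time another notification arrives.
        while let Ok(Some(())) =
            tokio::time::timeout(Duration::from_millis(500), notifications.recv()).await
        {}
        // Quiet for 500ms (or channel closed): run one pull for the burst.
        // sync_engine.pull().await; // hypothetical entry point
    }
}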

Prevention

Design Principles

  • Cursor = success watermark: In any cursor-based sync system, the cursor must represent the last successfully processed item, never just the last fetched item.
  • Poison pill defense: Any retry/queue system needs a max retry cap. Without it, one bad entry can stall the entire pipeline.
  • Self-filtering in distributed systems: When a device both writes to and reads from a shared data store, it will receive its own writes. Filter them explicitly.

Warning Signs

  • let _ = patterns around cursor update calls (silently discarding errors)
  • Cursor computed from updates.iter().max() instead of from a success accumulator
  • Retry loops without max iteration bounds
  • Realtime subscriptions without device_id != self filtering

Code Review Checklist

  • Cursor advancement is gated on successful processing
  • Failed items remain below cursor watermark for retry
  • Cursor update errors are logged (not silently discarded)
  • Queue has max retry cap
  • Self-updates are filtered during pull
  • Cursor still advances past filtered self-updates

References

  • Commit 9afe07e — sync engine safety remediation (INK-72)
  • crates/application/src/sync/sync_engine.rs — pull_phase implementation
  • crates/infrastructure/sqlite/src/sync/offline_queue.rs — max retry cap
  • apps/desktop/src-react/lib/realtime-sync.ts — 500ms debounce for realtime events
