Sync Engine Safety: Cursor Advancement, Poison Pill Recovery, and Self-Update Prevention

Problem

Three interrelated cursor and queue safety issues were discovered during the INK-72 Multi-Device Sync code review:

  1. Cursor advances past merge failures — The pull phase computed new_cursor from all pulled update IDs, even if some blocks failed to merge. On the next cycle, those updates are skipped permanently.
  2. Poison pill queue entries — A malformed or permanently-failing sync queue entry retries forever with exponential backoff, blocking all newer updates behind it.
  3. Self-update re-merge — After pushing updates, the device’s own realtime subscription fires, triggering a pull that re-fetches and re-merges its own updates. Wasteful and can cause cursor confusion.

Symptoms:

  • Blocks appear to lose remote edits after a merge failure
  • Sync queue drains slowly or stalls entirely
  • CPU/network waste from redundant self-merge cycles
  • sync_pull_merge_failed errors in logs with no retry

Root Cause

Cursor Advancement

The original pull logic advanced the cursor based on the fetched update ID, not the successfully merged update ID:

// WRONG: cursor advances regardless of merge success
let new_cursor = updates.iter().map(|u| u.id).max().unwrap_or(cursor);
// ... merge loop with `continue` on failure ...
let _ = self.block_storage.set_sync_cursor(workspace_path, new_cursor); // error silently discarded

When a block’s merge fails (corrupt data, schema mismatch, etc.), the continue skips it, but the cursor has already been computed to include it. On the next pull, the cursor is past that block’s updates — they’re permanently lost. For example, if updates 41–43 are pulled and 42’s merge fails, the cursor still lands on 43, so update 42 is never fetched again.

Poison Pill

The sync queue had no retry cap. A permanently-failing entry (e.g., block deleted on server) would be retried on every sync cycle, incrementing retry_count but never being removed. Combined with the exponential backoff state machine, this could stall the entire sync pipeline.
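
The shape of the problem, as a minimal sketch (delay values are illustrative, not taken from the codebase): the per-attempt delay saturates, but nothing caps the number of attempts.

// Illustrative sketch: the backoff delay is capped, but retry_count is
// unbounded. A permanently-failing entry cycles through this forever, and
// because the queue drains oldest-first, every newer entry waits behind it.
fn next_backoff(retry_count: u32) -> std::time::Duration {
    let secs = 2u64.saturating_pow(retry_count).min(300);
    std::time::Duration::from_secs(secs)
}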

Self-Update

When Device A pushes an update to Supabase, Postgres Realtime fires an INSERT event on block_updates. If Device A subscribes to that table, it receives its own update notification, triggers a pull, and re-fetches/re-merges its own changes. While Loro handles idempotent merges, this wastes computation and network.

Solution

1. Track Max Successfully Merged ID

// In sync_engine.rs pull_phase()
let mut max_success_id = cursor;

for (block_id, block_updates) in &grouped {
    // Filter self-updates (see #3 below)
    let remote_updates: Vec<_> = block_updates
        .iter()
        .filter(|u| u.device_id != device_id_str)
        .collect();

    if remote_updates.is_empty() {
        // Still advance cursor past self-updates
        for update in block_updates {
            if update.id > max_success_id {
                max_success_id = update.id;
            }
        }
        continue;
    }

    // Attempt merge...
    match self.merge_block_updates(workspace_path, block_id, &remote_updates) {
        Ok(()) => {
            // Only advance cursor for successfully merged blocks
            for update in block_updates {
                if update.id > max_success_id {
                    max_success_id = update.id;
                }
            }
        }
        Err(e) => {
            tracing::warn!(
                block_id = %block_id,
                error = %e,
                "sync_pull_merge_failed — cursor NOT advanced for this block"
            );
            // Do NOT advance cursor — these updates will be retried
            continue;
        }
    }
}

// Only update cursor if we had any successes
if max_success_id > cursor {
    if let Err(e) = self.block_storage.set_sync_cursor(workspace_path, max_success_id) {
        tracing::warn!(error = %e, "sync_pull_local_cursor_update_failed");
    }
}

Key semantic: The cursor represents “all updates up to this point have been successfully processed”, not “seen”. Failed blocks stay below the cursor watermark and are re-fetched on the next cycle.
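
This works because the pull fetches strictly above the stored cursor. Assuming the fetch is shaped roughly like the following (illustrative only; the real fetch goes through the Supabase client, and column names beyond id, block_id, and device_id are guesses), a failed block’s updates keep IDs above the cursor and reappear on the next cycle:

// Assumed shape of the pull fetch (hypothetical SQL, not the actual query)
let sql = "SELECT id, block_id, device_id, update_bytes \
           FROM block_updates \
           WHERE id > ?1 \
           ORDER BY id ASC";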

2. Max Retry Cap for Sync Queue

// In offline_queue.rs dequeue_batch()
fn dequeue_batch(&self, workspace_path: &Path, limit: usize) -> SyncResult<Vec<QueuedUpdate>> {
    let db = Self::open_db(workspace_path)?;
    db.with_connection(|conn| {
        let mut stmt = conn.prepare(
            "SELECT id, block_id, update_bytes, created_at, retry_count \
             FROM sync_queue \
             WHERE retry_count < 10 \
             ORDER BY created_at ASC LIMIT ?1",
        )?;
        // ...
    })
}

Entries with retry_count >= 10 are effectively dead-lettered — they remain in the table for debugging but don’t block the queue. A future improvement could move them to a sync_dead_letters table.
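
A hedged sketch of that dead-letter move (the sync_dead_letters table and the helper name are hypothetical; the columns mirror the dequeue query above):

// Hypothetical dead-letter sweep, not yet implemented: copy exhausted
// entries aside, then remove them from the live queue, in one transaction.
fn dead_letter_exhausted(conn: &mut rusqlite::Connection) -> rusqlite::Result<()> {
    let tx = conn.transaction()?;
    tx.execute_batch(
        "INSERT INTO sync_dead_letters (id, block_id, update_bytes, created_at, retry_count)
         SELECT id, block_id, update_bytes, created_at, retry_count
         FROM sync_queue
         WHERE retry_count >= 10;
         DELETE FROM sync_queue WHERE retry_count >= 10;",
    )?;
    tx.commit()
}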

3. Filter Self-Updates During Pull

let device_id_str = device_id.to_string();

for (block_id, block_updates) in &grouped {
    let remote_updates: Vec<_> = block_updates
        .iter()
        .filter(|u| u.device_id != device_id_str)
        .collect();

    if remote_updates.is_empty() {
        // Advance cursor past self-updates (no merge needed)
        for update in block_updates {
            if update.id > max_success_id {
                max_success_id = update.id;
            }
        }
        continue;
    }

    // Merge only remote_updates...
}

Important: Self-updates still advance the cursor. The device knows it already has these changes locally — skipping the merge is safe, but the cursor must move past them to avoid re-fetching on the next cycle.
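
Filtering removes the redundant merge, but the device’s own notification still arrives and can still schedule a pull. The frontend additionally debounces realtime events by 500ms (see realtime-sync.ts in the references). A minimal sketch of that idea, written here in Rust rather than the actual TypeScript, with illustrative names:

// Trailing-edge debounce: a burst of notifications (including the device's
// own echo) collapses into a single pull once things stay quiet for 500ms.
use std::time::Duration;
use tokio::sync::mpsc;

async fn debounced_pull_loop(mut notifications: mpsc::Receiver<()>) {
    while notifications.recv().await.is_some() {
        // Restart the quiet window each time another notification arrives.
        while let Ok(Some(())) =
            tokio::time::timeout(Duration::from_millis(500), notifications.recv()).await
        {}
        // Quiet for 500ms (or channel closed): run one pull for the burst.
        // sync_engine.pull().await; // hypothetical entry point
    }
}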

Prevention

Design Principles

  • Cursor = success watermark: In any cursor-based sync system, the cursor must represent the last successfully processed item, never just the last fetched item.
  • Poison pill defense: Any retry/queue system needs a max retry cap. Without it, one bad entry can stall the entire pipeline.
  • Self-filtering in distributed systems: When a device both writes to and reads from a shared data store, it will receive its own writes. Filter them explicitly.

Warning Signs

  • let _ = patterns around cursor update calls (silently discarding errors)
  • Cursor computed from updates.iter().max() instead of from a success accumulator
  • Retry loops without max iteration bounds
  • Realtime subscriptions without device_id != self filtering

Code Review Checklist

  • Cursor advancement is gated on successful processing
  • Failed items remain below cursor watermark for retry
  • Cursor update errors are logged (not silently discarded)
  • Queue has max retry cap
  • Self-updates are filtered during pull
  • Cursor still advances past filtered self-updates

References

  • Commit 9afe07e — sync engine safety remediation (INK-72)
  • crates/application/src/sync/sync_engine.rs — pull_phase implementation
  • crates/infrastructure/sqlite/src/sync/offline_queue.rs — max retry cap
  • apps/desktop/src-react/lib/realtime-sync.ts — 500ms debounce for realtime events
