Production Safeguards

The Production Safeguards service monitors system resources during vector database operations and automatically intervenes when memory, CPU, or error rates exceed safe thresholds.

The service tracks:

| Metric | Source | What it measures |
| --- | --- | --- |
| Heap used | process.memoryUsage() | V8 heap memory consumption |
| Heap total | process.memoryUsage() | V8 allocated heap size |
| RSS | process.memoryUsage() | Resident set size (total process) |
| External | process.memoryUsage() | C++ objects bound to V8 |
| CPU user | process.cpuUsage() | User-mode CPU time |
| CPU system | process.cpuUsage() | Kernel-mode CPU time |

Monitoring runs periodically: each check samples resource usage and triggers recovery strategies when a threshold is exceeded.
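The sampling described above can be sketched as follows. This is a minimal illustration, not the service's actual code: `ResourceSnapshot`, `takeSnapshot`, `startMonitoring`, and the conversion of CPU time into a percentage are assumptions made for the example. Only `process.memoryUsage()` and `process.cpuUsage()` come from the table.

```typescript
// Hypothetical snapshot shape: the field names are illustrative, not the service's own.
interface ResourceSnapshot {
  heapUsedMB: number;
  heapTotalMB: number;
  rssMB: number;
  externalMB: number;
  cpuPercent: number;
}

const BYTES_PER_MB = 1024 * 1024;

// CPU time (microseconds) as a percentage of elapsed wall-clock time (milliseconds).
// The result can exceed 100 when the process keeps more than one core busy.
function cpuPercent(userMicros: number, systemMicros: number, elapsedMs: number): number {
  if (elapsedMs <= 0) return 0;
  return ((userMicros + systemMicros) / 1000 / elapsedMs) * 100;
}

// Takes one sample. Passing the previous reading to process.cpuUsage() makes it
// return the CPU time consumed since that reading.
function takeSnapshot(prevCpu: { user: number; system: number }, elapsedMs: number): ResourceSnapshot {
  const mem = process.memoryUsage();
  const cpuDelta = process.cpuUsage(prevCpu);
  return {
    heapUsedMB: mem.heapUsed / BYTES_PER_MB,
    heapTotalMB: mem.heapTotal / BYTES_PER_MB,
    rssMB: mem.rss / BYTES_PER_MB,
    externalMB: mem.external / BYTES_PER_MB,
    cpuPercent: cpuPercent(cpuDelta.user, cpuDelta.system, elapsedMs),
  };
}

// Samples every intervalMs and hands each snapshot to onSample.
// Returns a function that stops the monitoring loop.
function startMonitoring(intervalMs: number, onSample: (s: ResourceSnapshot) => void): () => void {
  let lastCpu = process.cpuUsage();
  let lastTime = Date.now();
  const timer = setInterval(() => {
    const now = Date.now();
    onSample(takeSnapshot(lastCpu, now - lastTime));
    lastCpu = process.cpuUsage();
    lastTime = now;
  }, intervalMs);
  return () => clearInterval(timer);
}
```

A caller would typically keep the returned stop function and invoke it on shutdown, for example `const stop = startMonitoring(5000, (s) => console.log(s.heapUsedMB));`. The 5-second interval here is arbitrary.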

| Limit | Default | Description |
| --- | --- | --- |
| maxMemoryMB | 1,024 MB | Total RSS ceiling |
| maxHeapMB | 512 MB | V8 heap ceiling |
| maxCpuPercent | 80% | CPU usage ceiling |
| gcThresholdMB | 256 MB | Heap size that triggers GC suggestion |
| alertThresholdMB | 400 MB | Heap size that triggers first recovery |
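To make the limits concrete, here is a hedged sketch of how they might be expressed and checked. The limit names and defaults are taken from the table. The `SafeguardLimits` interface, the `Breach` labels, and `findBreaches` are hypothetical, and the example assumes that "heap" means heap used rather than heap total.

```typescript
// Limit names and defaults mirror the table; the interface itself is illustrative.
interface SafeguardLimits {
  maxMemoryMB: number;
  maxHeapMB: number;
  maxCpuPercent: number;
  gcThresholdMB: number;
  alertThresholdMB: number;
}

const DEFAULT_LIMITS: SafeguardLimits = {
  maxMemoryMB: 1024,
  maxHeapMB: 512,
  maxCpuPercent: 80,
  gcThresholdMB: 256,
  alertThresholdMB: 400,
};

// Hypothetical labels for the thresholds a sample has crossed.
type Breach = "gc-suggested" | "heap-alert" | "heap-max" | "rss-max" | "cpu-max";

interface Usage {
  heapUsedMB: number; // assumption: limits are compared against heap used
  rssMB: number;
  cpuPercent: number;
}

// Returns every threshold the sample exceeds, lowest heap threshold first.
function findBreaches(usage: Usage, limits: SafeguardLimits = DEFAULT_LIMITS): Breach[] {
  const breaches: Breach[] = [];
  if (usage.heapUsedMB > limits.gcThresholdMB) breaches.push("gc-suggested");
  if (usage.heapUsedMB > limits.alertThresholdMB) breaches.push("heap-alert");
  if (usage.heapUsedMB > limits.maxHeapMB) breaches.push("heap-max");
  if (usage.rssMB > limits.maxMemoryMB) breaches.push("rss-max");
  if (usage.cpuPercent > limits.maxCpuPercent) breaches.push("cpu-max");
  return breaches;
}
```

With the defaults, a sample of 450 MB heap used, 600 MB RSS, and 10% CPU crosses both the GC threshold and the alert threshold but none of the hard ceilings.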

Recovery actions are tried in priority order. Each has a cooldown to prevent thrashing:

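| Priority | Action | Trigger condition | Cooldown | Max retries |
| --- | --- | --- | --- | --- |
| 1 | Clear cache | Heap > alertThresholdMB | 30s | 3 |
| 3 | Reduce batch size | Heap > 80% of alertThresholdMB AND indexing active | 60s | 2 |
| 4 | Pause indexing | Heap > 90% of maxHeapMB AND indexing active | 2min | 1 |
| 5 | Restart worker | Heap > maxHeapMB AND indexing active | 5min | 1 |
| 6 | Emergency stop | RSS > maxMemoryMB | | 1 |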

Recovery is context-aware: REDUCE_BATCH_SIZE, PAUSE_INDEXING, and RESTART_WORKER only trigger when indexing is actually in progress (checked via ServiceStatusChecker). CLEAR_CACHE always triggers since it’s safe regardless of activity.
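The selection logic can be sketched roughly as below. Priorities, triggers, cooldowns, and retry counts are copied from the table. Everything else is an assumption for illustration: the `Strategy` shape, the `EMERGENCY_STOP` identifier, the zero cooldown for the emergency stop (the table gives none), the reading of "max retries" as a cap on total attempts, and a plain boolean standing in for the ServiceStatusChecker lookup.

```typescript
type RecoveryAction =
  | "CLEAR_CACHE" | "REDUCE_BATCH_SIZE" | "PAUSE_INDEXING" | "RESTART_WORKER" | "EMERGENCY_STOP";

interface HeapUsage { heapUsedMB: number; rssMB: number }
interface HeapLimits { maxMemoryMB: number; maxHeapMB: number; alertThresholdMB: number }

// Hypothetical strategy record; one entry per table row.
interface Strategy {
  action: RecoveryAction;
  priority: number;
  cooldownMs: number;
  maxRetries: number;
  requiresIndexing: boolean;
  trigger: (u: HeapUsage, l: HeapLimits) => boolean;
}

const STRATEGIES: Strategy[] = [
  { action: "CLEAR_CACHE", priority: 1, cooldownMs: 30_000, maxRetries: 3, requiresIndexing: false,
    trigger: (u, l) => u.heapUsedMB > l.alertThresholdMB },
  { action: "REDUCE_BATCH_SIZE", priority: 3, cooldownMs: 60_000, maxRetries: 2, requiresIndexing: true,
    trigger: (u, l) => u.heapUsedMB > 0.8 * l.alertThresholdMB },
  { action: "PAUSE_INDEXING", priority: 4, cooldownMs: 120_000, maxRetries: 1, requiresIndexing: true,
    trigger: (u, l) => u.heapUsedMB > 0.9 * l.maxHeapMB },
  { action: "RESTART_WORKER", priority: 5, cooldownMs: 300_000, maxRetries: 1, requiresIndexing: true,
    trigger: (u, l) => u.heapUsedMB > l.maxHeapMB },
  // Assumption: no cooldown is listed for the emergency stop, so 0 is used here.
  { action: "EMERGENCY_STOP", priority: 6, cooldownMs: 0, maxRetries: 1, requiresIndexing: false,
    trigger: (u, l) => u.rssMB > l.maxMemoryMB },
];

interface Attempt { lastRunAt: number; attempts: number }

// Returns the actions that are currently allowed to run, in priority order.
// indexingActive stands in for the ServiceStatusChecker lookup.
function eligibleActions(
  usage: HeapUsage,
  indexingActive: boolean,
  limits: HeapLimits,
  history: Map<RecoveryAction, Attempt>,
  now: number,
): RecoveryAction[] {
  return [...STRATEGIES]
    .sort((a, b) => a.priority - b.priority)
    .filter((s) => {
      if (s.requiresIndexing && !indexingActive) return false; // context-aware gate
      if (!s.trigger(usage, limits)) return false;
      const past = history.get(s.action);
      if (!past) return true;
      if (past.attempts >= s.maxRetries) return false; // retries exhausted
      return now - past.lastRunAt >= s.cooldownMs; // still cooling down otherwise
    })
    .map((s) => s.action);
}
```

Keeping the thresholds in a data table rather than in branching code means the context gate, cooldown, and retry checks are written once and applied uniformly to every action.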

The service includes a circuit breaker pattern for operation execution:

graph LR
  A["CLOSED<br/>Normal operation"] -->|"Failures reach threshold"| B["OPEN<br/>All operations rejected"]
  B -->|"Timeout expires"| C["HALF_OPEN<br/>Single test request"]
  C -->|"Success"| A
  C -->|"Failure"| B
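The state machine in the diagram could be implemented along these lines. This is a generic circuit breaker sketch, not the service's own class: the `CircuitBreaker` name, its method names, and the default threshold and timeout values are all assumptions. The clock is injectable so the transitions can be exercised without waiting.

```typescript
type BreakerState = "CLOSED" | "OPEN" | "HALF_OPEN";

class CircuitBreaker {
  private state: BreakerState = "CLOSED";
  private failures = 0;
  private openedAt = 0;
  private probeInFlight = false;
  private readonly failureThreshold: number;
  private readonly resetTimeoutMs: number;
  private readonly now: () => number;

  // Defaults are placeholders; the source does not state the real values.
  constructor(failureThreshold = 5, resetTimeoutMs = 30_000, now: () => number = Date.now) {
    this.failureThreshold = failureThreshold;
    this.resetTimeoutMs = resetTimeoutMs;
    this.now = now;
  }

  getState(): BreakerState {
    return this.state;
  }

  // Returns true if the caller may run an operation right now.
  allowRequest(): boolean {
    // OPEN -> HALF_OPEN once the timeout has expired.
    if (this.state === "OPEN" && this.now() - this.openedAt >= this.resetTimeoutMs) {
      this.state = "HALF_OPEN";
      this.probeInFlight = false;
    }
    if (this.state === "CLOSED") return true;
    if (this.state === "HALF_OPEN" && !this.probeInFlight) {
      this.probeInFlight = true; // only a single test request is let through
      return true;
    }
    return false;
  }

  // HALF_OPEN -> CLOSED on success; also clears the failure count.
  recordSuccess(): void {
    this.state = "CLOSED";
    this.failures = 0;
    this.probeInFlight = false;
  }

  // CLOSED -> OPEN when failures reach the threshold; HALF_OPEN -> OPEN on any failure.
  recordFailure(): void {
    if (this.state === "HALF_OPEN") {
      this.trip();
      return;
    }
    this.failures += 1;
    if (this.failures >= this.failureThreshold) this.trip();
  }

  private trip(): void {
    this.state = "OPEN";
    this.openedAt = this.now();
    this.probeInFlight = false;
  }
}
```

Callers are expected to ask `allowRequest()` before running an operation and to report the outcome with `recordSuccess()` or `recordFailure()` afterwards.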

Operations wrapped in executeWithSafeguards() benefit from:

  • Timeout — Operations killed after a deadline
  • Retry — Configurable retry count
  • Circuit breaker bypass — Option to skip for critical operations
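The three behaviours listed above can be combined in a wrapper such as the following sketch. The real signature of executeWithSafeguards() is not shown in this document, so the example uses a differently named, hypothetical `runWithSafeguards` with assumed option names (`timeoutMs`, `retries`, `bypassCircuitBreaker`) and a minimal assumed `BreakerLike` interface.

```typescript
// Hypothetical option names; the actual executeWithSafeguards() options may differ.
interface SafeguardOptions {
  timeoutMs: number;
  retries: number; // extra attempts after the first one
  bypassCircuitBreaker?: boolean;
}

// Minimal breaker surface the wrapper relies on (assumed).
interface BreakerLike {
  allowRequest(): boolean;
  recordSuccess(): void;
  recordFailure(): void;
}

// Rejects if the work does not settle before the deadline. Note that this only
// abandons the promise; the underlying work is not cancelled unless it cooperates.
function withTimeout<T>(work: Promise<T>, timeoutMs: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error(`Timed out after ${timeoutMs} ms`)), timeoutMs);
    work.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (err) => { clearTimeout(timer); reject(err); },
    );
  });
}

async function runWithSafeguards<T>(
  operation: () => Promise<T>,
  options: SafeguardOptions,
  breaker?: BreakerLike,
): Promise<T> {
  // Critical operations can opt out of the breaker entirely.
  const activeBreaker = options.bypassCircuitBreaker ? undefined : breaker;
  let lastError: unknown = new Error("Operation was never attempted");

  for (let attempt = 0; attempt <= options.retries; attempt++) {
    if (activeBreaker && !activeBreaker.allowRequest()) {
      throw new Error("Circuit breaker is open");
    }
    try {
      const result = await withTimeout(operation(), options.timeoutMs);
      activeBreaker?.recordSuccess();
      return result;
    } catch (err) {
      activeBreaker?.recordFailure();
      lastError = err; // try again if attempts remain
    }
  }
  throw lastError;
}
```

One design point worth noting: a promise-based timeout stops the caller from waiting but cannot by itself terminate the work in progress. If operations must truly be killed at the deadline, they would need to accept something like an `AbortSignal` and honour it.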

When RSS exceeds maxMemoryMB, the emergency stop activates:

  1. Sets emergencyStopActive = true
  2. All subsequent executeWithSafeguards() calls are rejected
  3. Requires manual recovery or extension restart
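The steps above amount to a one-way latch, which can be sketched as follows. Only the `emergencyStopActive` flag name and the RSS > maxMemoryMB condition come from this page. The `EmergencyStop` class and its method names, including `resetManually`, are hypothetical.

```typescript
class EmergencyStop {
  // Flag name taken from the steps above; once set it stays set.
  private emergencyStopActive = false;

  // Called on each monitoring sample. Latches when RSS exceeds the ceiling and
  // deliberately does not clear itself when RSS later drops.
  check(rssMB: number, maxMemoryMB: number): void {
    if (rssMB > maxMemoryMB) this.emergencyStopActive = true;
  }

  isActive(): boolean {
    return this.emergencyStopActive;
  }

  // Guard to call at the start of every safeguarded operation.
  assertRunning(): void {
    if (this.emergencyStopActive) {
      throw new Error("Emergency stop active: operation rejected");
    }
  }

  // Hypothetical manual recovery hook; a process or extension restart has the same effect.
  resetManually(): void {
    this.emergencyStopActive = false;
  }
}
```

Making the latch one-way is what forces the manual recovery step: a brief dip in RSS after the stop does not silently re-enable operations.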