
The Production Safeguards service monitors system resources during vector database operations and automatically intervenes when memory, CPU, or error rates exceed safe thresholds.

The service tracks:

| Metric | Source | What it measures |
| --- | --- | --- |
| Heap used | process.memoryUsage() | V8 heap memory consumption |
| Heap total | process.memoryUsage() | V8 allocated heap size |
| RSS | process.memoryUsage() | Resident set size (total process) |
| External | process.memoryUsage() | C++ objects bound to V8 |
| CPU user | process.cpuUsage() | User-mode CPU time |
| CPU system | process.cpuUsage() | Kernel-mode CPU time |
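As a minimal sketch of how these metrics can be read in Node.js, the snippet below collects all six values into one object. The `ResourceSnapshot` shape and the `takeSnapshot` name are assumptions made for illustration; only `process.memoryUsage()` and `process.cpuUsage()` come from the table above.

```typescript
// Hypothetical snapshot shape; the field names are illustrative only.
interface ResourceSnapshot {
  heapUsedMB: number;
  heapTotalMB: number;
  rssMB: number;
  externalMB: number;
  cpuUserMs: number;
  cpuSystemMs: number;
}

const BYTES_PER_MB = 1024 * 1024;

function takeSnapshot(): ResourceSnapshot {
  const mem = process.memoryUsage(); // all fields are reported in bytes
  const cpu = process.cpuUsage(); // cumulative microseconds since process start
  return {
    heapUsedMB: mem.heapUsed / BYTES_PER_MB,
    heapTotalMB: mem.heapTotal / BYTES_PER_MB,
    rssMB: mem.rss / BYTES_PER_MB,
    externalMB: mem.external / BYTES_PER_MB,
    cpuUserMs: cpu.user / 1000,
    cpuSystemMs: cpu.system / 1000,
  };
}

console.log(takeSnapshot());
```

The CPU figures are cumulative, so a usage percentage has to be derived from the difference between two snapshots rather than from a single reading.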

Monitoring runs on a periodic interval, checking resource usage and triggering recovery strategies when thresholds are exceeded.

| Limit | Default | Description |
| --- | --- | --- |
| maxMemoryMB | 1,024 MB | Total RSS ceiling |
| maxHeapMB | 512 MB | V8 heap ceiling |
| maxCpuPercent | 80% | CPU usage ceiling |
| gcThresholdMB | 256 MB | Heap size that triggers GC suggestion |
| alertThresholdMB | 400 MB | Heap size that triggers first recovery |
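The limits above can be sketched as a plain configuration object plus a pure check that a periodic monitor might run on each tick. The type names, `findBreaches`, and `cpuPercent` are hypothetical, and the sketch assumes that the heap thresholds compare against heap used, which the table does not state.

```typescript
interface SafeguardLimits {
  maxMemoryMB: number;
  maxHeapMB: number;
  maxCpuPercent: number;
  gcThresholdMB: number;
  alertThresholdMB: number;
}

// Default values taken from the limits table.
const DEFAULT_LIMITS: SafeguardLimits = {
  maxMemoryMB: 1024,
  maxHeapMB: 512,
  maxCpuPercent: 80,
  gcThresholdMB: 256,
  alertThresholdMB: 400,
};

interface Usage {
  heapUsedMB: number;
  rssMB: number;
  cpuPercent: number;
}

// Converts a CPU time delta (microseconds, e.g. from process.cpuUsage(previous))
// into a percentage of the elapsed wall-clock time.
function cpuPercent(delta: { user: number; system: number }, elapsedMs: number): number {
  if (elapsedMs <= 0) return 0;
  const busyMs = (delta.user + delta.system) / 1000;
  return (busyMs / elapsedMs) * 100;
}

type Breach = "gc" | "alert" | "heap" | "rss" | "cpu";

// Returns every limit the current usage exceeds (assumption: heap limits use heap used).
function findBreaches(usage: Usage, limits: SafeguardLimits = DEFAULT_LIMITS): Breach[] {
  const breaches: Breach[] = [];
  if (usage.heapUsedMB > limits.gcThresholdMB) breaches.push("gc");
  if (usage.heapUsedMB > limits.alertThresholdMB) breaches.push("alert");
  if (usage.heapUsedMB > limits.maxHeapMB) breaches.push("heap");
  if (usage.rssMB > limits.maxMemoryMB) breaches.push("rss");
  if (usage.cpuPercent > limits.maxCpuPercent) breaches.push("cpu");
  return breaches;
}
```

A monitor built this way could call `findBreaches` from a `setInterval` callback and hand the result to whatever selects the recovery action.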

Recovery actions are tried in priority order. Each has a cooldown to prevent thrashing:

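| Priority | Action | Trigger condition | Cooldown | Max retries |
| --- | --- | --- | --- | --- |
| 1 | Clear cache | Heap > alertThresholdMB | 30s | 3 |
| 3 | Reduce batch size | Heap > 80% of alertThresholdMB AND indexing active | 60s | 2 |
| 4 | Pause indexing | Heap > 90% of maxHeapMB AND indexing active | 2min | 1 |
| 5 | Restart worker | Heap > maxHeapMB AND indexing active | 5min | 1 |
| 6 | Emergency stop | RSS > maxMemoryMB | | 1 |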

Recovery is context-aware: REDUCE_BATCH_SIZE, PAUSE_INDEXING, and RESTART_WORKER only trigger when indexing is actually in progress (checked via ServiceStatusChecker). CLEAR_CACHE always triggers since it’s safe regardless of activity.
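The selection rules can be sketched as a table of strategies plus a function that picks the first eligible one. This is an illustration under assumptions, not the service's implementation: `EMERGENCY_STOP` is a name inferred by analogy with the other identifiers, its cooldown is set to 0 because the table gives none, and skipping an action that is on cooldown in favor of the next priority is a guess at the intended behavior.

```typescript
type RecoveryAction =
  | "CLEAR_CACHE"
  | "REDUCE_BATCH_SIZE"
  | "PAUSE_INDEXING"
  | "RESTART_WORKER"
  | "EMERGENCY_STOP"; // assumed name, not confirmed by the documentation

interface RecoveryContext { heapMB: number; rssMB: number; indexing: boolean; }
interface RecoveryLimits { maxMemoryMB: number; maxHeapMB: number; alertThresholdMB: number; }

interface Strategy {
  action: RecoveryAction;
  priority: number;
  cooldownMs: number;
  maxRetries: number;
  shouldTrigger(ctx: RecoveryContext, limits: RecoveryLimits): boolean;
}

// Mirrors the recovery table; indexing-only actions check ctx.indexing first.
const STRATEGIES: Strategy[] = [
  { action: "CLEAR_CACHE", priority: 1, cooldownMs: 30_000, maxRetries: 3,
    shouldTrigger: (c, l) => c.heapMB > l.alertThresholdMB },
  { action: "REDUCE_BATCH_SIZE", priority: 3, cooldownMs: 60_000, maxRetries: 2,
    shouldTrigger: (c, l) => c.indexing && c.heapMB > 0.8 * l.alertThresholdMB },
  { action: "PAUSE_INDEXING", priority: 4, cooldownMs: 120_000, maxRetries: 1,
    shouldTrigger: (c, l) => c.indexing && c.heapMB > 0.9 * l.maxHeapMB },
  { action: "RESTART_WORKER", priority: 5, cooldownMs: 300_000, maxRetries: 1,
    shouldTrigger: (c, l) => c.indexing && c.heapMB > l.maxHeapMB },
  { action: "EMERGENCY_STOP", priority: 6, cooldownMs: 0, maxRetries: 1, // cooldown assumed
    shouldTrigger: (c, l) => c.rssMB > l.maxMemoryMB },
];

type History = Map<RecoveryAction, { lastRunAt: number; attempts: number }>;

// Returns the first action (by priority) that triggers, is off cooldown,
// and still has retries left; records the attempt in `history`.
function nextAction(
  ctx: RecoveryContext, limits: RecoveryLimits, history: History, now: number,
): RecoveryAction | null {
  const ordered = [...STRATEGIES].sort((a, b) => a.priority - b.priority);
  for (const s of ordered) {
    if (!s.shouldTrigger(ctx, limits)) continue;
    const past = history.get(s.action);
    if (past && past.attempts >= s.maxRetries) continue;
    if (past && now - past.lastRunAt < s.cooldownMs) continue;
    history.set(s.action, { lastRunAt: now, attempts: (past?.attempts ?? 0) + 1 });
    return s.action;
  }
  return null;
}
```

Passing `now` in explicitly keeps the selection logic deterministic, which makes cooldown behavior easy to exercise without real timers.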

The service includes a circuit breaker pattern for operation execution:

graph LR
  A["CLOSED<br/>Normal operation"] -->|"Failures reach threshold"| B["OPEN<br/>All operations rejected"]
  B -->|"Timeout expires"| C["HALF_OPEN<br/>Single test request"]
  C -->|"Success"| A
  C -->|"Failure"| B
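The three states in the diagram can be sketched as a small state machine. The class and method names, the constructor parameters, and the choice to count consecutive failures are all assumptions; the documentation only names the states and their transitions.

```typescript
type BreakerState = "CLOSED" | "OPEN" | "HALF_OPEN";

class CircuitBreaker {
  private state: BreakerState = "CLOSED";
  private failures = 0;
  private openedAt = 0;
  private probeInFlight = false;
  private failureThreshold: number;
  private resetTimeoutMs: number;
  private now: () => number;

  // `now` is injectable so the timeout can be tested without waiting.
  constructor(failureThreshold: number, resetTimeoutMs: number, now: () => number = () => Date.now()) {
    this.failureThreshold = failureThreshold;
    this.resetTimeoutMs = resetTimeoutMs;
    this.now = now;
  }

  // OPEN becomes HALF_OPEN once the reset timeout has expired.
  getState(): BreakerState {
    if (this.state === "OPEN" && this.now() - this.openedAt >= this.resetTimeoutMs) {
      this.state = "HALF_OPEN";
    }
    return this.state;
  }

  // CLOSED admits everything; HALF_OPEN admits a single test request; OPEN rejects.
  tryAcquire(): boolean {
    const state = this.getState();
    if (state === "CLOSED") return true;
    if (state === "HALF_OPEN" && !this.probeInFlight) {
      this.probeInFlight = true;
      return true;
    }
    return false;
  }

  recordSuccess(): void {
    this.state = "CLOSED";
    this.failures = 0;
    this.probeInFlight = false;
  }

  recordFailure(): void {
    this.probeInFlight = false;
    this.failures += 1;
    // A failed test request reopens immediately; otherwise open at the threshold.
    if (this.state === "HALF_OPEN" || this.failures >= this.failureThreshold) {
      this.state = "OPEN";
      this.openedAt = this.now();
    }
  }
}
```

A caller would check `tryAcquire()` before running an operation and report the outcome with `recordSuccess()` or `recordFailure()` afterwards.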

Operations wrapped in executeWithSafeguards() benefit from:

  • Timeout: operations are aborted once their deadline passes
  • Retry: failed operations are retried up to a configurable count
  • Circuit breaker bypass: critical operations can opt out of the circuit breaker
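The three behaviors listed above might combine roughly as follows. The real signature of executeWithSafeguards() is not shown in this document, so the function name `executeWithSafeguardsSketch`, the option names, and the `BreakerGate` interface are hypothetical. Cancellation here is cooperative: the operation receives an `AbortSignal` and is expected to honor it.

```typescript
interface SafeguardOptions {
  timeoutMs: number;
  retries: number; // extra attempts after the first one
  bypassCircuitBreaker?: boolean;
}

// Minimal view of a circuit breaker; any object with these methods works.
interface BreakerGate {
  tryAcquire(): boolean;
  recordSuccess(): void;
  recordFailure(): void;
}

// Rejects if the operation has not settled within timeoutMs and signals it to stop.
function runWithDeadline<T>(
  op: (signal: AbortSignal) => Promise<T>, timeoutMs: number,
): Promise<T> {
  const controller = new AbortController();
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => {
      controller.abort();
      reject(new Error(`Operation timed out after ${timeoutMs} ms`));
    }, timeoutMs);
    Promise.resolve()
      .then(() => op(controller.signal))
      .then(
        (value) => { clearTimeout(timer); resolve(value); },
        (error) => { clearTimeout(timer); reject(error); },
      );
  });
}

async function executeWithSafeguardsSketch<T>(
  op: (signal: AbortSignal) => Promise<T>,
  opts: SafeguardOptions,
  breaker?: BreakerGate,
): Promise<T> {
  // Critical operations can skip the breaker entirely.
  const gate = opts.bypassCircuitBreaker ? undefined : breaker;
  let lastError: unknown = new Error("Operation was never attempted");
  for (let attempt = 0; attempt <= opts.retries; attempt++) {
    if (gate && !gate.tryAcquire()) throw new Error("Circuit breaker is open");
    try {
      const result = await runWithDeadline(op, opts.timeoutMs);
      gate?.recordSuccess();
      return result;
    } catch (error) {
      gate?.recordFailure();
      lastError = error;
    }
  }
  throw lastError;
}
```

Whether a timeout should count as a breaker failure, and whether retries should back off between attempts, are design choices this sketch leaves at their simplest setting.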

When RSS exceeds maxMemoryMB, the emergency stop activates:

  1. Sets emergencyStopActive = true
  2. All subsequent executeWithSafeguards() calls are rejected
  3. Requires manual recovery or extension restart
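The steps above can be sketched as a latched flag. Only `emergencyStopActive` and `maxMemoryMB` come from this document; the `EmergencyStopGuard` class and its `check`, `execute`, and `reset` methods are hypothetical stand-ins, with `reset` representing whatever manual recovery looks like in practice.

```typescript
class EmergencyStopGuard {
  emergencyStopActive = false;
  maxMemoryMB: number;

  constructor(maxMemoryMB: number = 1024) {
    this.maxMemoryMB = maxMemoryMB;
  }

  // Latches the flag once RSS exceeds the ceiling; it does not clear on its own
  // if memory later drops back under the limit.
  check(rssMB: number): void {
    if (rssMB > this.maxMemoryMB) {
      this.emergencyStopActive = true;
    }
  }

  // Rejects every operation while the emergency stop is active.
  execute<T>(op: () => T): T {
    if (this.emergencyStopActive) {
      throw new Error("Emergency stop active: operation rejected");
    }
    return op();
  }

  // Hypothetical manual recovery hook.
  reset(): void {
    this.emergencyStopActive = false;
  }
}
```

Keeping the flag latched means a brief dip in memory cannot silently re-enable operations, which matches the requirement for manual recovery or a restart.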