Retention BC runbook

Tier matrix

Resource                       Local TTL   Prod TTL
connector_events body          2h          12h
connector_events row           72h         720h
ingestion_debug_artifacts      1h          3h
ingestion_job_runs             168h        2160h
offer_observations             720h        8760h
offer_characteristic_raw       720h        4320h
canonical_assignment_events    336h        2160h
outbox_events                  24h         168h
S3 raw expire                  1d          7d
S3 cleaner older-than          2h          12h

Operations

Force-run cycle

docker restart tracium-api-1: the ticker re-runs at startup. Alternative: a direct psql trigger via SELECT pg_advisory_lock(hashtext('retention_cleanup')) plus manual cleanup. Not recommended (races with the running ticker).
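
Before attempting anything manual, it can help to see whether some session currently holds an advisory lock. The query below is a minimal sketch that uses only the standard pg_locks and pg_stat_activity views. It assumes (this runbook does not confirm it) that the ticker holds the retention_cleanup advisory lock for the duration of a cleanup pass:

```sql
-- List sessions that hold or wait for advisory locks.
-- Read-only: this query does not take the lock itself.
SELECT a.pid,
       a.application_name,
       a.state,
       l.granted,                      -- false = the session is waiting for the lock
       now() - a.query_start AS age
  FROM pg_locks l
  JOIN pg_stat_activity a ON a.pid = l.pid
 WHERE l.locktype = 'advisory'
 ORDER BY a.query_start;
```

An empty result only means that no advisory lock is held at that moment; it says nothing about when the next tick starts.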

Disable retention

RETENTION_TICK_INTERVAL=0 → Ticker.Run logs retention.ticker.disabled and blocks until ctx is done. After a restart with this env var set, retention stays paused.

Check live state

curl https://admin.tracium.dev:4444/api/v1/admin/retention/status

Requires an admin JWT cookie. The response is JSON with the per-resource policy, last tick result, table size, and S3 status.

Slow cleanup / I/O pressure

Symptoms: a retention DELETE hangs for several minutes, Postgres shows IO/DataFileRead, and the ingestion/facts/canonical workers look "frozen". Check the active cleanup queries:

SELECT pid, now() - query_start AS age, wait_event_type, wait_event, query
  FROM pg_stat_activity
 WHERE query ILIKE '%DELETE FROM %'
   AND pid <> pg_backend_pid()  -- skip this monitoring query itself
 ORDER BY query_start;

For catalog_projector_queue and outbox_events, cleanup must go through the partial timestamp index; otherwise large tables get a full scan:

EXPLAIN SELECT ctid
  FROM catalog_projector_queue
 WHERE processed_at IS NOT NULL
   AND processed_at < now() - interval '7 days'
 LIMIT 10000;
 
EXPLAIN SELECT ctid
  FROM outbox_events
 WHERE sent_at IS NOT NULL
   AND sent_at < now() - interval '7 days'
 LIMIT 10000;

Expected plan: Index Scan using catalog_projector_queue_processed_at_idx and Index Scan using outbox_events_sent_at_idx. If an index is missing, it can be created on the live database without blocking writes:

CREATE INDEX CONCURRENTLY IF NOT EXISTS catalog_projector_queue_processed_at_idx
    ON catalog_projector_queue (processed_at)
    WHERE processed_at IS NOT NULL;
 
CREATE INDEX CONCURRENTLY IF NOT EXISTS outbox_events_sent_at_idx
    ON outbox_events (sent_at)
    WHERE sent_at IS NOT NULL;

After the hot fix, be sure to add the same DDL to the migrations; otherwise the next environment rebuild loses this protection.
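
A CREATE INDEX CONCURRENTLY that fails or is cancelled leaves an invalid index behind, and a rerun with IF NOT EXISTS then skips it with only a notice. A quick check, as a sketch against the standard pg_index and pg_class catalogs, using the index names from this section:

```sql
-- indisvalid = false means the concurrent build did not complete
-- and the planner will not use the index.
SELECT c.relname AS index_name,
       i.indisvalid,
       i.indisready
  FROM pg_index i
  JOIN pg_class c ON c.oid = i.indexrelid
 WHERE c.relname IN ('catalog_projector_queue_processed_at_idx',
                     'outbox_events_sent_at_idx');
```

If an index shows up as invalid, the usual remedy is DROP INDEX CONCURRENTLY followed by a fresh CREATE INDEX CONCURRENTLY.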

Default partition non-zero

If connector_events_default or offer_observations_default contains rows, partition preallocation is lagging. Manual SQL:

INSERT INTO connector_events_2026_05
  SELECT * FROM connector_events_default
   WHERE created_at >= '2026-05-01' AND created_at < '2026-06-01';
DELETE FROM connector_events_default
 WHERE created_at >= '2026-05-01' AND created_at < '2026-06-01';
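
The two statements above run separately, so a row that lands in the default partition between them could be deleted without having been copied. One possible variant, offered as a sketch rather than the established procedure, first shows which months are affected and then moves the rows in a single statement with a data-modifying CTE. It assumes the same table names and date range as above:

```sql
-- Which months have rows stuck in the default partition?
SELECT date_trunc('month', created_at) AS month, count(*) AS row_count
  FROM connector_events_default
 GROUP BY 1
 ORDER BY 1;

-- Move one month in a single statement: the rows deleted here
-- are exactly the rows that get inserted.
WITH moved AS (
    DELETE FROM connector_events_default
     WHERE created_at >= '2026-05-01' AND created_at < '2026-06-01'
 RETURNING *
)
INSERT INTO connector_events_2026_05
SELECT * FROM moved;
```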

Drop _legacy tables

После 30d (prod) / 3d (local):

SELECT COUNT(*) FROM connector_events_legacy; -- must be 0 for several consecutive ticks
SELECT COUNT(*) FROM offer_observations_legacy;

Then, in a separate migration:

DROP TABLE connector_events_legacy CASCADE;
DROP TABLE offer_observations_legacy CASCADE;
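
Because CASCADE removes dependent objects without asking, it may be worth listing them first. The queries below are a sketch using standard catalog views; this runbook does not say whether any such dependents exist in this schema:

```sql
-- Foreign keys in other tables that reference the legacy tables.
SELECT conname AS constraint_name,
       conrelid::regclass AS referencing_table
  FROM pg_constraint
 WHERE contype = 'f'
   AND confrelid IN ('connector_events_legacy'::regclass,
                     'offer_observations_legacy'::regclass);

-- Views that select from the legacy tables.
SELECT view_schema, view_name, table_name
  FROM information_schema.view_table_usage
 WHERE table_name IN ('connector_events_legacy', 'offer_observations_legacy');
```

Both queries returning zero rows suggests the DROP will not take anything else with it.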

Emergency rollback retention BC

  1. Comment out retention.Module() + di.RetentionStatus() in backend/cmd/api-server/main.go.
  2. Restore observability/connector_events_cleanup.go via git checkout <pre-cycle-sha>~1 -- backend/internal/platform/observability/connector_events_cleanup.go.
  3. Restore the registerConnectorEventsCleanup invocation in backend/internal/platform/di/job_history.go.
  4. Re-enable canonical-events-retention-worker compose service.
  5. make local-prod-up.