Retention BC runbook
Tier matrix
| Resource | Local TTL | Prod TTL |
|---|---|---|
| connector_events body | 2h | 12h |
| connector_events row | 72h | 720h |
| ingestion_debug_artifacts | 1h | 3h |
| ingestion_job_runs | 168h | 2160h |
| offer_observations | 720h | 8760h |
| offer_characteristic_raw | 720h | 4320h |
| canonical_assignment_events | 336h | 2160h |
| outbox_events | 24h | 168h |
| S3 raw expire | 1d | 7d |
| S3 cleaner older-than | 2h | 12h |
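Most TTLs in the matrix are given in hours, while the S3 expiry row uses days. As a convenience when comparing rows, here is a minimal shell sketch that converts an hour-based TTL to days. The helper name ttl_to_days is hypothetical and not part of the repo.

```shell
# Hypothetical helper: convert an hour TTL such as "2160h" to whole days.
# Fails for values that are not a multiple of 24h (for example "2h").
ttl_to_days() {
  hours="${1%h}"
  if [ $((hours % 24)) -ne 0 ]; then
    echo "$1 is not a whole number of days" >&2
    return 1
  fi
  echo "$((hours / 24))d"
}
```

For example, ttl_to_days 2160h prints 90d and ttl_to_days 8760h prints 365d.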
Operations
Force-run cycle
docker restart tracium-api-1: the ticker re-runs at startup. Alternative:
a direct psql trigger via SELECT pg_advisory_lock(hashtext('retention_cleanup'))
plus manual cleanup. Not recommended (races with the running ticker).
Disable retention
RETENTION_TICK_INTERVAL=0 → Ticker.Run logs retention.ticker.disabled
and blocks until ctx is done. After a restart with this env var set, retention stays paused.
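To confirm that the pause took effect, one option is to look for the disabled marker in the container logs. This is a minimal sketch. It assumes the API container logs to stdout/stderr and that the literal string retention.ticker.disabled appears in the log line. The helper name ticker_disabled is hypothetical.

```shell
# Hypothetical helper: reads log lines on stdin and succeeds only if the
# ticker reported itself as disabled.
ticker_disabled() {
  grep -q 'retention\.ticker\.disabled'
}

# Usage (assumes the container name from "Force-run cycle"):
#   docker logs tracium-api-1 2>&1 | ticker_disabled && echo "retention is paused"
```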
Check live state
curl https://admin.tracium.dev:4444/api/v1/admin/retention/status
Requires the admin JWT cookie. The response is JSON with per-resource policy, last tick result, table size, and S3 status.
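A minimal sketch of the same call with the cookie attached. The cookie name admin_jwt is an assumption, since the runbook does not name it; replace it with whatever your admin login sets. Both helper names are hypothetical.

```shell
# Assumed cookie name; override RETENTION_COOKIE_NAME if yours differs.
RETENTION_COOKIE_NAME="${RETENTION_COOKIE_NAME:-admin_jwt}"
RETENTION_STATUS_URL="https://admin.tracium.dev:4444/api/v1/admin/retention/status"

# Builds the "name=value" string that curl sends as a Cookie header.
retention_cookie() {
  printf '%s=%s' "$RETENTION_COOKIE_NAME" "$1"
}

# Fetches the status JSON; $1 is the admin JWT value.
retention_status() {
  curl --fail --silent --show-error \
    --cookie "$(retention_cookie "$1")" \
    "$RETENTION_STATUS_URL"
}

# Usage:
#   retention_status "$ADMIN_JWT" | jq .
```

With --fail, an HTTP error (for example an expired JWT) makes curl exit non-zero instead of printing the error body.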
Slow cleanup / I/O pressure
Symptoms: a retention DELETE hangs for several minutes, Postgres shows
IO/DataFileRead, and the ingestion/facts/canonical workers look "frozen".
Check the active cleanup queries:
SELECT pid, now() - query_start AS age, wait_event_type, wait_event, query
FROM pg_stat_activity
WHERE query ILIKE '%DELETE FROM %'
AND pid <> pg_backend_pid() -- exclude this monitoring session
ORDER BY query_start;
For catalog_projector_queue and outbox_events, cleanup must go through the
partial timestamp index; otherwise large tables get a full scan:
EXPLAIN SELECT ctid
FROM catalog_projector_queue
WHERE processed_at IS NOT NULL
AND processed_at < now() - interval '7 days'
LIMIT 10000;
EXPLAIN SELECT ctid
FROM outbox_events
WHERE sent_at IS NOT NULL
AND sent_at < now() - interval '7 days'
LIMIT 10000;
Expected plan: Index Scan using catalog_projector_queue_processed_at_idx and Index Scan using outbox_events_sent_at_idx. If the index is missing, it can be created hot:
CREATE INDEX CONCURRENTLY IF NOT EXISTS catalog_projector_queue_processed_at_idx
ON catalog_projector_queue (processed_at)
WHERE processed_at IS NOT NULL;
CREATE INDEX CONCURRENTLY IF NOT EXISTS outbox_events_sent_at_idx
ON outbox_events (sent_at)
WHERE sent_at IS NOT NULL;
After the hotfix, add the same DDL to the migrations; otherwise the next environment rebuild loses this protection.
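One caveat with hot index creation: if CREATE INDEX CONCURRENTLY fails midway, Postgres leaves an invalid index behind, and a later IF NOT EXISTS run skips it because the name already exists. A minimal sketch for checking validity follows. DATABASE_URL is an assumed connection string and the helper name is hypothetical.

```shell
# Hypothetical helper: prints SQL that reports whether the two cleanup
# indexes exist and whether Postgres considers them valid.
cleanup_index_check_sql() {
  cat <<'SQL'
SELECT c.relname AS index_name, i.indisvalid, i.indisready
FROM pg_index i
JOIN pg_class c ON c.oid = i.indexrelid
WHERE c.relname IN (
  'catalog_projector_queue_processed_at_idx',
  'outbox_events_sent_at_idx'
);
SQL
}

# Usage:
#   cleanup_index_check_sql | psql "$DATABASE_URL"
```

A row with indisvalid = f means the index should be dropped and rebuilt. A missing row means the index was never created.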
Default partition non-zero
If connector_events_default or offer_observations_default contains
rows, partition preallocation is lagging. Manual SQL:
INSERT INTO connector_events_2026_05
SELECT * FROM connector_events_default
WHERE created_at >= '2026-05-01' AND created_at < '2026-06-01';
DELETE FROM connector_events_default
WHERE created_at >= '2026-05-01' AND created_at < '2026-06-01';
Drop _legacy tables
After 30d (prod) / 3d (local):
SELECT COUNT(*) FROM connector_events_legacy; -- must be 0 for several consecutive ticks
SELECT COUNT(*) FROM offer_observations_legacy;
Then, in a separate migration:
DROP TABLE connector_events_legacy CASCADE;
DROP TABLE offer_observations_legacy CASCADE;
Emergency rollback retention BC
- Comment out retention.Module() + di.RetentionStatus() in backend/cmd/api-server/main.go.
- Restore observability/connector_events_cleanup.go via git checkout <pre-cycle-sha>~1 -- backend/internal/platform/observability/connector_events_cleanup.go.
- Restore the registerConnectorEventsCleanup invocation in backend/internal/platform/di/job_history.go.
- Re-enable the canonical-events-retention-worker compose service.
- make local-prod-up.