How a solo archival system gets built in public
Last reviewed: 2026-04-20
This is the concise current-state note for maintainers. It should stay shorter and more current than the imported handoff/history docs.
The CSJ Collector is in a credible modular and archival state. It should now be treated as a maintained archival data product, not a one-off scraper.
Current picture:
- csj/scripts/collector.py
- --force without --full, bare --dry-run, non-positive worker/hour counts)
- captured_by_run_id
- csj_reconstructions/ without mutating immutable evidence
- --backfill-archive-metadata mode now exists to normalize existing current records and asset manifests safely

Important caveat:
- status is only a broad bucket; use lifecycle_status for true live-vs-missing-vs-withdrawn interpretation
- csj_jobs/ projection
- expired_sid_redirect is medium-confidence evidence that a stored detail URL no longer yields a vacancy page, not hard withdrawal confirmation
- docs/lifecycle-semantics-pressure-test-note.md for the focused maintainer write-up

Key runtime/code surfaces:
- scripts/collector.py: stable public entrypoint
- scripts/collector_impl.py: compatibility/integration layer
- csj/cli.py: CLI and runtime context
- csj/run.py: orchestration, statuses, and summary writing
- csj/native.py: source-specific acquisition/parsing
- csj/lifecycle.py: lifecycle and repair logic
- csj/assets.py: asset capture/manifests/history/events
- csj/history.py: version snapshots and event logging
- csj/hashing.py: diff/hash semantics
- csj/telemetry.py: JSONL run telemetry

This is already a substantial archive rather than a small scrape output. The most important operational takeaway is structural:
- csj_jobs/ and csj_asset_manifests/ are the mutable current projection
- csj_history/, csj_asset_history/, and csj_run_manifests/ are the stronger historical evidence layers
- csj_runs/, csj_latest.json, and csj_run_events.jsonl are operational/derived run artifacts

Current runtime behavior:
- csj_latest.json and csj_runs/{run_id}.json
- csj_run_events.jsonl remains the append-only run telemetry stream

Open semantics question:
- Should csj_latest.json reflect every run mode, or only native collection runs?

Most useful current review items are:
The most useful immediate follow-up would be:
- decide whether csj_latest.json should remain all-modes or become native-collection-only

That keeps the operational story clear while preserving the archive/derived-layer boundary.
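To make the csj_latest.json question concrete, here is a minimal sketch of the two candidate policies. It is not the collector's actual code: the function name write_run_summary, the policy constants, the "run_id" and "mode" summary keys, and the mode values "native" and "backfill" are all hypothetical. It also assumes that the per-run file csj_runs/{run_id}.json is written for every mode under either policy.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical policy names; the note poses the question but does not name these.
ALL_MODES = "all-modes"
NATIVE_ONLY = "native-collection-only"


def write_run_summary(root: Path, summary: dict, latest_policy: str = ALL_MODES) -> bool:
    """Write csj_runs/{run_id}.json and, if the policy allows, refresh csj_latest.json.

    Returns True when csj_latest.json was updated.
    The "run_id" and "mode" keys are assumed field names, not the real schema.
    """
    runs_dir = root / "csj_runs"
    runs_dir.mkdir(parents=True, exist_ok=True)
    payload = json.dumps(summary, indent=2, sort_keys=True)

    # Assumption: the per-run summary is kept for every mode under both policies.
    (runs_dir / f"{summary['run_id']}.json").write_text(payload, encoding="utf-8")

    # Under the native-only policy, non-native runs leave the pointer untouched.
    if latest_policy == NATIVE_ONLY and summary.get("mode") != "native":
        return False

    # Write to a temp file and rename it so readers never see a partial file.
    tmp = root / "csj_latest.json.tmp"
    tmp.write_text(payload, encoding="utf-8")
    tmp.replace(root / "csj_latest.json")
    return True


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        root = Path(d)
        write_run_summary(root, {"run_id": "r1", "mode": "native"}, NATIVE_ONLY)
        write_run_summary(root, {"run_id": "r2", "mode": "backfill"}, NATIVE_ONLY)
        latest = json.loads((root / "csj_latest.json").read_text(encoding="utf-8"))
        print(latest["run_id"])  # r1: the backfill run did not move the pointer
```

Under the all-modes policy the same two calls would leave csj_latest.json pointing at r2. Consumers that read csj_latest.json as "the last collection" would then see a maintenance run instead, which is the trade-off the follow-up needs to settle.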