How a solo archival system gets built in public
The original "Do now" hardening tranche is complete and verified.
Completed work:
- Added a run_id for each collector run
- Per-run summaries under ~/.hermes/workspace/csj/csj_runs/
- Kept csj_latest.json as latest snapshot
- Added csj/telemetry.py with JSONL event output
- Run outcomes: success, degraded, failed
- Silent-failure cleanup in csj/run.py, csj/records.py, and csj/state.py
- Added atomic_write_json() and applied it to critical JSON outputs
Verification completed:
- pytest: 69 passed
- python3 scripts/collector.py --help passed
- python3 scripts/collector.py --repair-lifecycle --dry-run passed
Only optional cleanup remains from this plan:
Operational follow-up now matters more than further hardening:
- Use uvx pytest ..., not plain pytest, because pytest was not installed directly.
These are the changes with the best payoff-to-complexity ratio.
Implement:
- run_id for every run
- ~/.hermes/workspace/csj/csj_runs/{run_id}.json
- csj_latest.json as the latest snapshot
This is the backbone for:
You do not need an elaborate run framework first. A lightweight summary object or even a disciplined dict is enough initially.
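A minimal sketch of the "disciplined dict" approach, assuming a run_id made of a UTC timestamp plus a short random suffix. The function names, the run_id format, and the summary keys are illustrative assumptions, not the collector's actual code. The location of csj_latest.json is passed in rather than guessed.

```python
import json
import uuid
from datetime import datetime, timezone
from pathlib import Path


def new_run_id() -> str:
    """Sortable UTC timestamp plus a short random suffix (format is an assumption)."""
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    return f"{stamp}-{uuid.uuid4().hex[:8]}"


def write_run_summary(summary: dict, runs_dir: Path, latest_path: Path) -> Path:
    """Write {run_id}.json under runs_dir and refresh the latest snapshot."""
    runs_dir.mkdir(parents=True, exist_ok=True)
    latest_path.parent.mkdir(parents=True, exist_ok=True)
    payload = json.dumps(summary, indent=2, sort_keys=True)
    run_path = runs_dir / f"{summary['run_id']}.json"
    run_path.write_text(payload, encoding="utf-8")
    # The latest snapshot is a plain copy of the most recent summary.
    latest_path.write_text(payload, encoding="utf-8")
    return run_path
```

A run would build one dict at start, fill in counters and errors as it goes, and call write_run_summary() once at the end. The plain write_text() calls here could later be swapped for an atomic write helper.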
Add:
- csj/telemetry.py
- run_started
- run_completed
- run_warning
- run_failed
- detail_fetch_failed
This gives you real observability without changing the collector’s core logic much.
Start with:
- print() output
Do not try to build a logging platform.
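As a rough illustration of how small this can stay, here is a sketch of a JSONL event emitter. The function name emit_event, its parameters, and the record fields (ts, run_id, event) are assumptions for illustration. Only the event names come from the list above.

```python
import json
import time
from pathlib import Path


def emit_event(log_path: Path, run_id: str, event: str, *, echo: bool = False, **fields) -> dict:
    """Append one JSON object per line to log_path; optionally print it too."""
    record = {"ts": time.time(), "run_id": run_id, "event": event, **fields}
    # default=str keeps the emitter from raising on values like Path or datetime.
    line = json.dumps(record, sort_keys=True, default=str)
    log_path.parent.mkdir(parents=True, exist_ok=True)
    with log_path.open("a", encoding="utf-8") as fh:
        fh.write(line + "\n")
    if echo:
        print(line)
    return record
```

Call sites stay one-liners, for example emit_event(log_path, run_id, "detail_fetch_failed", job_id="j1", error="timeout"), and the resulting file can be inspected with any line-oriented tool.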
Every run should end as one of:
- success
- degraded
- failed
This is more important than lots of sub-classification.
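One way to keep the three outcomes mechanical is a single function that maps a few counters to a status. This is a sketch only. The parameter names and the 20 percent failure threshold are assumptions, and the right cutoff for "meaningful failure volume" depends on what a normal run looks like.

```python
def classify_run(completed: bool, attempted: int, failed: int,
                 subsystem_lost: bool = False,
                 degraded_ratio: float = 0.2) -> str:
    """Return 'failed', 'degraded', or 'success' for one collector run."""
    if not completed:
        # The collector could not meaningfully run.
        return "failed"
    failure_ratio = failed / attempted if attempted else 0.0
    if subsystem_lost or failure_ratio >= degraded_ratio:
        # The run finished, but with notable failure volume or a lost subsystem.
        return "degraded"
    return "success"
```

Keeping the rule in one place means the threshold can be tuned later without touching call sites.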
- failed: collector could not meaningfully run
- degraded: collector completed but with meaningful failure volume or subsystem loss
- success: normal completion
Clean up the highest-value silent failure points, especially in:
- csj/run.py
- csj/records.py
- csj/state.py
Focus on places where the code currently does some form of:
except Exception: pass
This is one of the biggest trust gaps in the current architecture.
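A low-effort way to close that gap without changing control flow is a context manager that still suppresses the exception but records it first. The sketch below assumes errors are collected into a plain list that later lands in the run summary. The name recorded and the record fields are hypothetical.

```python
import traceback
from contextlib import contextmanager


@contextmanager
def recorded(errors: list, where: str):
    """Suppress an exception like `except Exception: pass`, but record it first."""
    try:
        yield
    except Exception as exc:
        errors.append({
            "where": where,
            "type": type(exc).__name__,
            "message": str(exc),
            "traceback": traceback.format_exc(),
        })


errors: list = []
with recorded(errors, "example.parse"):
    int("not a number")  # raises ValueError, which is recorded and suppressed
print(len(errors), errors[0]["type"])  # prints: 1 ValueError
```

Behavior for the caller is unchanged, since execution continues after the with block, but the failure is no longer invisible.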
Do not try to eliminate every broad exception immediately. Just make swallowed failures:
Add only a small initial fixture set:
This protects against the most likely real-world breakage: upstream CSJ drift.
Do not overbuild the fixture corpus initially.
Add 1–2 tests that cover:
- normalize_job()
- finalize_job()
and compare against stable expected JSON-ish outputs.
This catches silent output drift better than lots of helper-level tests.
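A golden-file comparison along these lines can be sketched in a few lines. The helper names and the golden-file layout are assumptions, and the call shown in the trailing comment is hypothetical because the real signatures of normalize_job() and finalize_job() are not given here.

```python
import json
from pathlib import Path


def canonical(obj) -> str:
    """Stable text form: sorted keys and fixed indentation, so diffs are meaningful."""
    return json.dumps(obj, indent=2, sort_keys=True, ensure_ascii=False)


def assert_matches_golden(actual, golden_path: Path, update: bool = False) -> None:
    """Compare actual output to a stored golden file; rewrite it when update=True."""
    text = canonical(actual)
    if update:
        golden_path.parent.mkdir(parents=True, exist_ok=True)
        golden_path.write_text(text + "\n", encoding="utf-8")
        return
    # A missing golden file raises FileNotFoundError instead of passing silently.
    expected = golden_path.read_text(encoding="utf-8").rstrip("\n")
    assert text == expected, f"output drifted from {golden_path}"


# Hypothetical use inside a pytest test (call signature is an assumption):
#     assert_matches_golden(normalize_job(raw_fixture), Path("tests/golden/normalize_job.json"))
```

Regenerating a golden file is a deliberate act (update=True), so output drift shows up as a reviewed diff rather than a quietly changed expectation.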
Add a tiny helper, e.g.:
atomic_write_json(path, data)
Use it for:
This is high-value and easy to justify.
Just:
- os.replace
No need for advanced durability engineering right now.
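For reference, a minimal sketch of such a helper: write to a temporary file in the same directory, then swap it into place with os.replace. Only the name atomic_write_json(path, data) comes from the plan above. The body is an assumption about how it might look, and it deliberately skips fsync in keeping with the "no advanced durability" scope.

```python
import json
import os
import tempfile
from pathlib import Path


def atomic_write_json(path, data) -> None:
    """Write JSON so readers see either the old file or the new one, never a partial file."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    # The temp file must live in the same directory so os.replace stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, prefix=path.name + ".", suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump(data, fh, indent=2, sort_keys=True)
            fh.write("\n")
        os.replace(tmp_name, path)
    except BaseException:
        # Clean up the temp file on any failure, then re-raise the original error.
        try:
            os.unlink(tmp_name)
        except FileNotFoundError:
            pass
        raise
```

If serialization fails partway through, the existing file at path is left untouched and no stray temp file remains.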
These are good ideas, but only after the core hardening lands.
Replacing build_run_context() dict with a typed CollectorRuntime object is nice.
It improves maintainability, but it is not the main operational risk today.
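If and when this gets done, a frozen dataclass is probably enough. The sketch below is purely illustrative: the fields (run_id, workspace, dry_run) and the from_context() helper are assumptions, since the actual keys returned by build_run_context() are not listed here.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class CollectorRuntime:
    """Typed, read-only replacement for a run-context dict (fields are hypothetical)."""
    run_id: str
    workspace: Path
    dry_run: bool = False

    @classmethod
    def from_context(cls, ctx: dict) -> "CollectorRuntime":
        # Bridge from the existing dict so call sites can migrate gradually.
        return cls(
            run_id=ctx["run_id"],
            workspace=Path(ctx["workspace"]),
            dry_run=bool(ctx.get("dry_run", False)),
        )
```

A bridge constructor like this lets the dict and the typed object coexist while call sites are converted one at a time.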
Adding FetchResult and RunSummary dataclasses is good cleanup.
Useful, but not as urgent as:
Instrumenting:
- csj/assets.py
- csj/history.py
- csj/state.py
- csj/records.py
with richer structured events may be worth it later.
You’ll probably get 80% of the value from instrumenting:
- csj/run.py
- csj/native.py
first.
Good to do after the implementation settles:
Docs are more useful once the actual shape of the hardening is real.
These are not bad ideas, but I would not schedule them now.
csj/failures.py
A separate failure taxonomy module is only worth it if:
For now, structured error records inside telemetry/run summary are enough.
You do not need:
Right now that would be architecture overhead.
Do not try to build a giant test archive immediately.
Start with:
Expand only when real breakages justify it.
Documenting every theoretical failure mode can wait.
Only harden/document recovery paths that are:
- run_id
- csj/telemetry.py
- run_started / run_completed
- atomic_write_json()
I’d define the hardening pass as complete when all of these are true:
- run_id
- success / degraded / failed
Once you reach that, stop and reassess before doing more architecture cleanup.