Building The Longhand Archive

How a solo archival system gets built in public

View My GitHub Profile

CSJ Collector hardening — trimmed plan

Status

Completed

The original Do now hardening tranche is complete and verified.

Completed work:

Verification completed:

Remaining

Only optional cleanup remains from this plan:

Operational follow-up now matters more than further hardening:

Environment notes

Must do now

These are the changes with the best payoff-to-complexity ratio.

1. Run identity and durable run artifacts

Implement:

Why

This is the backbone for:

Keep it simple

You do not need an elaborate run framework first. A lightweight summary object or even a disciplined dict is enough initially.


2. Structured telemetry

Add:

Why

This gives you real observability without changing the collector’s core logic much.

Keep it simple

Start with:

Do not try to build a logging platform.


3. Explicit run status

Every run should end as one of:

Why

This is more important than lots of sub-classification.

Suggested rule


4. Surface swallowed failures

Clean up the highest-value silent failure points, especially in:

Focus on places where the code currently does some form of:

Why

This is one of the biggest trust gaps in the current architecture.

Important

Do not try to eliminate every broad exception immediately. Just make swallowed failures:


5. Parser fixture tests

Add only a small initial fixture set:

Listing fixtures

Detail fixtures

Why

This protects against the most likely real-world breakage: upstream CSJ drift.

Important

Do not overbuild the fixture corpus initially.


6. Golden normalized-output tests

Add 1–2 tests that cover:

and compare against stable expected JSON-ish outputs.

Why

This catches silent output drift better than lots of helper-level tests.


7. Atomic writes for critical JSON files

Add a tiny helper, e.g.:

Use it for:

Why

This is high-value and easy to justify.

Keep it simple

Just:

No need for advanced durability engineering right now.


Later if still needed

These are good ideas, but only after the core hardening lands.

8. Typed runtime object

Replacing build_run_context() dict with a typed CollectorRuntime object is nice.

Why later

It improves maintainability, but it is not the main operational risk today.


9. Typed fetch/run result objects

Adding:

is good cleanup.

Why later

Useful, but not as urgent as:


10. Broader module-level telemetry

Instrumenting:

with richer structured events may be worth it later.

Why later

You’ll probably get 80% of the value from instrumenting:

first.


11. Richer run architecture docs

Good to do after the implementation settles:

Why later

Docs are more useful once the actual shape of the hardening is real.


Drop unless pain appears

These are not bad ideas, but I would not schedule them now.

12. Dedicated csj/failures.py

A separate failure taxonomy module is only worth it if:

For now, structured error records inside telemetry/run summary are enough.


13. Deep formal failure hierarchy

You do not need:

Right now that would be architecture overhead.


14. Large fixture corpus

Do not try to build a giant test archive immediately.

Start with:

Expand only when real breakages justify it.


15. Broad recovery-playbook engineering

Documenting every theoretical failure mode can wait.

Only harden/document recovery paths that are:


Recommended revised sequence

Phase 1 — observability MVP

  1. add run artifact paths
  2. add run_id
  3. write per-run summaries
  4. add csj/telemetry.py
  5. emit run_started / run_completed
  6. add timings
  7. add explicit run status

Phase 2 — trust and drift protection

  1. surface major swallowed failures
  2. add 3–4 parser fixtures
  3. add listing/detail parser tests
  4. add 1–2 golden normalized-output tests

Phase 3 — persistence hardening

  1. add atomic_write_json()
  2. use it for state/summaries/job records
  3. surface malformed local JSON/state as warnings

Phase 4 — optional cleanup

  1. decide whether typed runtime object is still worth doing
  2. decide whether typed result objects are still worth doing
  3. update architecture docs

Clear stopping condition

I’d define the hardening pass as complete when all of these are true:

Once you reach that, stop and reassess before doing more architecture cleanup.


Short version

Do now

Later if needed

Drop for now