Building The Longhand Archive

How a solo archival system gets built in public


CSJ Collector Reliability & Lifecycle Updates Plan

Execution note: This plan is updated live during implementation whenever a design or scope change is discovered.

Goal: Make the collector classify missing/refresh failures more accurately, reduce false operational-failure signals, and close lifecycle gaps exposed by recent runs.

Architecture: Keep the existing single-file collector architecture, but tighten the decision points around refresh, missing-job verification, and end-of-run reporting. Add a small pytest test layer around lifecycle/verification helpers before changing behaviour.

Tech Stack: Python 3, collector at ~/.hermes/skills/research/civil-service-jobs-collector/scripts/collector.py, pytest for new tests.


Learnings driving this work

  1. Refresh failures for jobs absent from search are often expected because stored detail URLs contain expiring SID tokens.
  2. Jobs genuinely disappear and reappear, so the nuanced missing/reopened lifecycle needs to stay.
  3. Missing jobs with no closing date can get stuck in missing_unconfirmed limbo.
  4. Past-due active jobs can remain wrongly active due to lifecycle blind spots.
  5. Run reporting still lacks the operational nuance needed for unattended cron summaries.

Planned tasks

Task 1: Add a minimal pytest harness
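
Because the collector is a single script rather than an installed package, the tests first need a way to import it by path. A minimal sketch of that, assuming the script guards its entry point with `if __name__ == "__main__":` so importing it has no side effects. The names `load_module_from_path` and `is_terminal` are hypothetical, and the test uses a stand-in script rather than the real collector:

```python
import importlib.util
from pathlib import Path
from types import ModuleType


def load_module_from_path(path: Path, name: str = "collector") -> ModuleType:
    """Import a standalone script as a module so tests can call its helpers."""
    spec = importlib.util.spec_from_file_location(name, path)
    if spec is None or spec.loader is None:
        raise ImportError(f"cannot load module from {path}")
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module


def test_loader_exposes_helpers(tmp_path: Path) -> None:
    # Stand-in for collector.py; the helper below is purely illustrative.
    script = tmp_path / "fake_collector.py"
    script.write_text("def is_terminal(status):\n    return status == 'closed'\n")
    module = load_module_from_path(script, name="fake_collector")
    assert module.is_terminal("closed")
    assert not module.is_terminal("active")
```

Under pytest, `tmp_path` is supplied by the built-in fixture of that name, so the test needs no setup code of its own.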

Task 2: Introduce explicit detail-fetch failure classification
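
One possible shape for this classification is sketched here. The enum members, the function names, and the HTTP status groupings are all assumptions made for illustration, not the collector's actual code. The only rule taken from the learnings above is that an expired-SID failure for a job absent from search is expected rather than operational:

```python
from enum import Enum
from typing import Optional


class FetchOutcome(Enum):
    OK = "ok"
    EXPIRED_SID = "expired_sid"
    NOT_FOUND = "not_found"
    TRANSIENT = "transient"
    UNKNOWN = "unknown"


def classify_detail_fetch(status_code: Optional[int], redirected_home: bool) -> FetchOutcome:
    """Map a detail-page fetch result to an outcome (hypothetical rules)."""
    if status_code is None:
        # No response at all: timeout, DNS failure, connection reset.
        return FetchOutcome.TRANSIENT
    if redirected_home:
        # The stored URL no longer resolves to the job; treat as expired SID.
        return FetchOutcome.EXPIRED_SID
    if 200 <= status_code < 300:
        return FetchOutcome.OK
    if status_code in (404, 410):
        return FetchOutcome.NOT_FOUND
    if status_code == 429 or status_code >= 500:
        return FetchOutcome.TRANSIENT
    return FetchOutcome.UNKNOWN


def is_operational_failure(outcome: FetchOutcome, in_search: bool) -> bool:
    """Decide whether an outcome should count as a real failure in run reports."""
    if outcome is FetchOutcome.OK:
        return False
    if outcome is FetchOutcome.EXPIRED_SID and not in_search:
        # Expected: the job left search and its stored URL token expired.
        return False
    return True
```

Keeping classification separate from the "is this a failure?" decision means reporting thresholds can change later without touching the fetch code.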

Task 3: Detect expired-SID homepage redirects explicitly
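
A rough sketch of one way to spot such a redirect by comparing the requested URL with the URL the fetch finally landed on. This plan does not describe how the site actually signals an expired SID, so the comparison rule, the function names, and the `example.org` URLs are assumptions. A real check might need to inspect the response body instead:

```python
from typing import Iterable, Tuple
from urllib.parse import urlsplit


def _normalise(url: str) -> Tuple[str, str, str, str]:
    """Reduce a URL to comparable parts, ignoring fragment and trailing slash."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/") or "/"
    return (parts.scheme.lower(), parts.netloc.lower(), path, parts.query)


def is_expired_sid_redirect(requested_url: str, final_url: str, home_urls: Iterable[str]) -> bool:
    """True when a fetch for a non-home URL ended up on a known homepage URL."""
    requested = _normalise(requested_url)
    final = _normalise(final_url)
    homes = {_normalise(u) for u in home_urls}
    return final != requested and final in homes


# Placeholder URLs, not the real site.
HOME = ["https://jobs.example.org"]
print(is_expired_sid_redirect("https://jobs.example.org/detail?SID=abc", "https://jobs.example.org/", HOME))
```

With an HTTP client that follows redirects, `final_url` would be whatever URL the client reports for the final response.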

Task 4: Tighten promotion from missing_unconfirmed to withdrawn_confirmed
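
The promotion rule could be sketched along these lines, assuming the collector tracks how many consecutive runs a job has been missing and whether a verification step positively confirmed its removal. Both inputs, the function name, and the thresholds are hypothetical. The second branch is there so that jobs without a closing date cannot stay in missing_unconfirmed indefinitely:

```python
def should_promote_to_withdrawn(
    missing_runs: int,
    verified_gone: bool,
    confirm_runs: int = 2,      # assumed: runs required once removal is verified
    max_limbo_runs: int = 10,   # assumed: cap on runs spent unconfirmed
) -> bool:
    """Decide whether a missing_unconfirmed job becomes withdrawn_confirmed."""
    if missing_runs < 0:
        raise ValueError("missing_runs must be non-negative")
    if verified_gone and missing_runs >= confirm_runs:
        # Verified and consistently absent: safe to promote.
        return True
    # Unverified jobs are promoted only after a long absence, so a single
    # bad run cannot withdraw a job that is about to reappear.
    return missing_runs >= max_limbo_runs
```

Because the function is pure, it can be covered by table-driven tests before it is wired into the collector.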

Task 5: Add retroactive cleanup for stale active jobs whose closing date has passed
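
As an illustration, the cleanup can start as a read-only pass that lists the affected jobs. The record layout (`id`, `status`, and an ISO-format `closing_date`) and the function name are assumptions about the stored data, not its actual schema:

```python
from datetime import date
from typing import Iterable, List, Mapping, Optional


def find_stale_active(jobs: Iterable[Mapping[str, Optional[str]]], today: date) -> List[str]:
    """Return ids of jobs still marked active whose closing date is before today."""
    stale = []
    for job in jobs:
        if job.get("status") != "active":
            continue
        closing = job.get("closing_date")
        if not closing:
            # No closing date recorded: this check cannot judge the job.
            continue
        if date.fromisoformat(closing) < today:
            stale.append(job["id"])
    return stale


# Illustrative records only.
jobs = [
    {"id": "a", "status": "active", "closing_date": "2024-01-10"},
    {"id": "b", "status": "active", "closing_date": "2024-02-01"},
    {"id": "c", "status": "active", "closing_date": None},
    {"id": "d", "status": "withdrawn_confirmed", "closing_date": "2024-01-01"},
]
print(find_stale_active(jobs, today=date(2024, 1, 15)))
```

A job that closes today is deliberately not treated as stale here, on the assumption that adverts stay open until the end of their closing day.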

Task 6: Refactor end-of-run reporting into explicit counters
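
A small sketch of what explicit counters might look like. The field names are illustrative guesses based on the lifecycle states and failure types mentioned in this plan, not the collector's real report format:

```python
from dataclasses import asdict, dataclass


@dataclass
class RunCounters:
    new_jobs: int = 0
    refreshed: int = 0
    refresh_failed_expected: int = 0      # e.g. expired SID on a job absent from search
    refresh_failed_operational: int = 0   # failures a human should look at
    missing_unconfirmed: int = 0
    withdrawn_confirmed: int = 0

    def exit_code(self) -> int:
        """Non-zero only when something operational went wrong."""
        return 1 if self.refresh_failed_operational else 0

    def summary(self) -> str:
        """One-line key=value summary suitable for a cron email or log line."""
        return " ".join(f"{key}={value}" for key, value in asdict(self).items())


counters = RunCounters()
counters.refreshed += 3
counters.refresh_failed_expected += 2
print(counters.summary())
```

Splitting expected from operational refresh failures lets an unattended run exit cleanly when the only failures were the anticipated expired-SID ones.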

Task 7: Improve csj_latest.json semantics

Task 8: Update docs/skill guidance

Task 9: Consider a one-off lifecycle repair mode


Initial execution order

Phase 1 — High value, low risk

  1. Add tests
  2. Add failure classification
  3. Add expired-SID detection
  4. Add richer reporting

Phase 2 — Lifecycle correctness

  1. Improve missing → withdrawn promotion
  2. Fix stale-active close-date cleanup

Phase 3 — Usability

  1. Improve csj_latest.json
  2. Update docs
  3. Decide whether repair mode is warranted

Live implementation notes