Building The Longhand Archive

How a solo archival system gets built in public

CSJ Collector v2.3 Hardening Implementation Plan

For Hermes: Use the subagent-driven-development skill to implement this plan task by task.

Goal: Evolve the Civil Service Jobs collector from a current-state collector into a historical archive and change-tracking pipeline.

Architecture: Keep the existing per-reference current snapshot file as the fast canonical view, then add two append-oriented layers: a per-reference version history store for meaningful changes and a global event log for lifecycle/change analytics. Extend lifecycle handling for missing-before-expiry jobs and add structured supporting-asset extraction.

Tech Stack: Python 3, requests, stdlib json/hashlib/pathlib/datetime/re, existing collector at ~/.hermes/skills/research/civil-service-jobs-collector/scripts/collector.py


File targets

Primary implementation file:

Plan storage:

Recommended new test files:

Recommended fixture directory:

Recommended new output directories created by collector:


Phase 0: Baseline and safety rails

Task 0.1: Add parser version constant and new directory constants

Objective: Establish versioned provenance and dedicated paths for history/events.

Files:

Step 1: Add constants near the existing output-path constants for:

Step 2: Ensure runtime startup creates HISTORY_DIR.

Step 3: Add parser_version to all newly written current snapshot records.

Verification:

Commit suggestion:

Task 0.2: Add a DRY internal helper for stable record normalization

Objective: Create one normalization path used by hashing, diffing, and comparisons.

Files:

Step 1: Add helper functions:

Step 2: Normalize:

Step 3: Exclude purely operational fields from comparison:
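
A minimal sketch of what a single normalization path could look like. The function names and the contents of `OPERATIONAL_FIELDS` are assumptions made for illustration; the plan's actual helper names and excluded fields are not reproduced here.

```python
import re
from typing import Any

# Hypothetical set of operational fields to ignore when comparing records.
OPERATIONAL_FIELDS = {"last_seen", "scraped_at", "parser_version"}


def normalize_value(value: Any) -> Any:
    """Collapse whitespace in strings and normalize nested containers."""
    if isinstance(value, str):
        return re.sub(r"\s+", " ", value).strip()
    if isinstance(value, dict):
        return {k: normalize_value(v) for k, v in sorted(value.items())}
    if isinstance(value, (list, tuple)):
        return [normalize_value(v) for v in value]
    return value


def comparable_record(record: dict) -> dict:
    """Return a normalized copy of the record without operational fields."""
    return {
        key: normalize_value(val)
        for key, val in sorted(record.items())
        if key not in OPERATIONAL_FIELDS
    }
```

Routing hashing, diffing, and comparisons through one function like `comparable_record` means a whitespace tweak or a fresh `last_seen` timestamp cannot produce a spurious difference in any of them.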

Verification:

Commit suggestion:


Phase 1: Historical versioning and change detection

Task 1.1: Add field hashing helpers

Objective: Enable cheap per-field diffing and a stable overall content hash.

Files:

Step 1: Add helper:

Step 2: Use normalized/comparable records as hash input.

Step 3: Include on current snapshot:
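
One way the hashing helpers might be written, assuming SHA-256 over canonical JSON. The names `hash_value` and `build_hashes` are hypothetical; `content_hash` and `field_hashes` are the snapshot fields referred to later in this plan.

```python
import hashlib
import json
from typing import Any, Dict, Tuple


def hash_value(value: Any) -> str:
    """SHA-256 hex digest of a value's canonical JSON form."""
    canonical = json.dumps(
        value, sort_keys=True, ensure_ascii=False, separators=(",", ":")
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def build_hashes(comparable: Dict[str, Any]) -> Tuple[str, Dict[str, str]]:
    """Return (content_hash, field_hashes) for an already-normalized record."""
    field_hashes = {key: hash_value(val) for key, val in comparable.items()}
    # Hashing the per-field hashes keeps the overall hash independent of key order.
    content_hash = hash_value(field_hashes)
    return content_hash, field_hashes
```

Comparing two `field_hashes` dictionaries key by key then shows which fields changed without re-reading the full text of either version.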

Verification:

Commit suggestion:

Task 1.2: Add meaningful-change classification

Objective: Distinguish meaningful content changes from operational or cosmetic updates.

Files:

Step 1: Define field groups:

Step 2: Add helper:

Step 3: Ensure whitespace-only or formatting-only text changes do not count as meaningful changes.
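
A rough illustration of the whitespace rule in Step 3. The two field groups and the field names in them are placeholders, not the groups this plan defines.

```python
import re
from typing import List, Optional

# Hypothetical field groups, for illustration only.
ALWAYS_MEANINGFUL = {"closing_date", "salary"}
TEXT_FIELDS = {"job_summary", "job_description"}


def _squash(text: Optional[str]) -> str:
    """Collapse runs of whitespace so formatting-only edits compare equal."""
    return re.sub(r"\s+", " ", text or "").strip()


def changed_meaningful_fields(old: dict, new: dict) -> List[str]:
    """List tracked fields whose content differs after text normalization."""
    changed = []
    for field in sorted(ALWAYS_MEANINGFUL | TEXT_FIELDS):
        before, after = old.get(field), new.get(field)
        if field in TEXT_FIELDS:
            before, after = _squash(before), _squash(after)
        if before != after:
            changed.append(field)
    return changed
```

An empty list means the update was operational or cosmetic, so no history snapshot or change event would be needed.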

Verification:

Commit suggestion:

Task 1.3: Add per-reference history snapshots

Objective: Preserve previous meaningful versions rather than overwriting everything in place.

Files:

Step 1: Add helper:

Step 2: On initial creation:

Step 3: On meaningful change:

Step 4: On no meaningful change:
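
The sketch below shows one possible shape for the history writer, assuming one JSON file per version under `csj_history/{reference}/`. The `v0001.json` naming scheme and the function names are assumptions, and the record is expected to already carry a `content_hash`.

```python
import json
from pathlib import Path
from typing import Optional


def latest_history_hash(ref_dir: Path) -> Optional[str]:
    """Return the content_hash of the newest stored version, if any."""
    files = sorted(ref_dir.glob("*.json"))
    if not files:
        return None
    latest = json.loads(files[-1].read_text(encoding="utf-8"))
    return latest.get("content_hash")


def write_history_if_changed(
    history_dir: Path, reference: str, record: dict
) -> Optional[Path]:
    """Write a new version file only when the content hash has changed."""
    ref_dir = history_dir / reference
    ref_dir.mkdir(parents=True, exist_ok=True)
    if latest_history_hash(ref_dir) == record["content_hash"]:
        return None  # no meaningful change: history stays untouched
    version = len(list(ref_dir.glob("*.json"))) + 1
    path = ref_dir / f"v{version:04d}.json"  # zero-padded so names sort in order
    path.write_text(json.dumps(record, sort_keys=True, indent=2), encoding="utf-8")
    return path
```

The first call for a reference writes the initial version; later calls add a file only when the hash differs, which keeps history sparse.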

Verification:

Commit suggestion:

Task 1.4: Add append-only event logging

Objective: Create a lightweight chronological ledger for analytics and auditing.

Files:

Step 1: Add helper:

Step 2: Emit events for:

Step 3: Write events as JSONL to csj_events.jsonl.
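
A small sketch of an append-only JSONL writer. The event keys (`timestamp`, `event_type`, `reference`, `details`) and the name `append_event` are assumptions; the event record shape this plan specifies may differ.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def append_event(events_path: Path, event_type: str, reference: str, **details) -> dict:
    """Append one event as a single JSON line and return it."""
    event = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "event_type": event_type,
        "reference": reference,
        "details": details,
    }
    # Mode "a" only ever adds to the end of the file; earlier lines are never rewritten.
    with events_path.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(event, sort_keys=True) + "\n")
    return event
```

Because `json.dumps` escapes newlines inside strings, each event occupies exactly one line, so the log can be read back line by line.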

Verification:

Commit suggestion:


Phase 2: Lifecycle classification for historical integrity

Task 2.1: Expand lifecycle fields on current records

Objective: Track ambiguity around jobs that disappear before expiry.

Files:

Step 1: Add fields to current records:

Step 2: Keep status as the broad state for compatibility, but use lifecycle_status for nuance.

Step 3: Mapping:

Verification:

Commit suggestion:

Task 2.2: Add missing-before-expiry progression rules

Objective: Avoid both false closures and indefinite active states.

Files:

Step 1: In the lifecycle pass, if a previously known job is absent from the results and its close date is still in the future:

Step 2: Add threshold constant, e.g.:

Step 3: After the threshold is met, queue direct URL verification.
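
The progression could be sketched as below. The threshold value, the constant name, and the `missing_count`, `closing_date`, and `needs_url_verification` fields are assumptions; the `missing` and `withdrawn_candidate` values are the lifecycle statuses named in Task 2.4.

```python
from datetime import date

# Hypothetical threshold: consecutive runs a job may be absent before verification.
MISSING_RUNS_BEFORE_VERIFY = 3


def advance_missing_state(record: dict, today: date) -> dict:
    """Return an updated copy of a record that was absent from this run's results."""
    updated = dict(record)
    close_date = date.fromisoformat(record["closing_date"])
    if close_date <= today:
        return updated  # not a missing-before-expiry case; normal expiry applies
    updated["missing_count"] = record.get("missing_count", 0) + 1
    if updated["missing_count"] >= MISSING_RUNS_BEFORE_VERIFY:
        updated["lifecycle_status"] = "withdrawn_candidate"
        updated["needs_url_verification"] = True
    else:
        updated["lifecycle_status"] = "missing"
    return updated
```

Counting consecutive absences guards against a false closure after one flaky run, while the threshold stops a vanished job from staying active indefinitely.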

Verification:

Commit suggestion:

Task 2.3: Add direct-URL verification for suspected withdrawals

Objective: Confirm likely withdrawals where possible.

Files:

Step 1: Add helper:

Step 2: Classify outcomes:

Step 3: Emit event with confidence level.
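
As an illustration, outcome classification can be kept separate from the HTTP call so that it is testable offline. The outcome labels, confidence levels, and the "no longer available" marker phrase are all hypothetical; the real site's behavior would need checking before relying on any of them.

```python
from typing import Optional, Tuple


def classify_direct_check(status_code: Optional[int], body: str = "") -> Tuple[str, str]:
    """Map a direct-URL fetch result to (outcome, confidence).

    status_code is None when the request failed before any response arrived.
    """
    if status_code is None:
        return "unverified", "low"
    if status_code in (404, 410):
        return "withdrawn", "high"
    if status_code == 200:
        # Hypothetical marker text for a removed advert served with HTTP 200.
        if "no longer available" in body.lower():
            return "withdrawn", "medium"
        return "still_live", "high"
    return "unverified", "low"  # redirects, 5xx, rate limiting, and so on
```

In the collector, the inputs would come from something like `response = requests.get(url, timeout=30)` followed by `classify_direct_check(response.status_code, response.text)`, with `None` passed when `requests.RequestException` is raised.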

Verification:

Commit suggestion:

Task 2.4: Handle reappearance cleanly

Objective: Preserve a job’s continuity when it returns after a missing state.

Files:

Step 1: If a job previously had lifecycle_status of missing/withdrawn_candidate and appears again:

Step 2: Decide whether reopened should persist as a status or only as an event. Recommendation: keep active as current lifecycle_status after emitting a reopened event.
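
Following that recommendation, reappearance handling might look like this sketch. The function name, the event keys, and the `missing_count` field are assumptions.

```python
from typing import Optional, Tuple

MISSING_STATES = {"missing", "withdrawn_candidate"}


def handle_reappearance(record: dict, seen_at: str) -> Tuple[dict, Optional[dict]]:
    """Return (updated_record, reopened_event or None) for a job seen again."""
    previous = record.get("lifecycle_status")
    updated = dict(record)
    updated["last_seen"] = seen_at
    if previous not in MISSING_STATES:
        return updated, None
    # "reopened" is recorded as an event only; the current status returns to active.
    updated["lifecycle_status"] = "active"
    updated["missing_count"] = 0
    event = {
        "event_type": "reopened",
        "reference": record.get("reference"),
        "timestamp": seen_at,
        "previous_lifecycle_status": previous,
    }
    return updated, event
```

Keeping `reopened` out of `lifecycle_status` leaves the current snapshot simple while the event log still records that the job went missing and returned.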

Verification:

Commit suggestion:


Phase 3: Supporting asset extraction

Objective: Capture supporting collateral metadata without downloading binaries.

Files:

Step 1: Add extraction helpers for:

Step 2: Store on each job:

Step 3: Categorize links by domain/extension heuristics:
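
A minimal sketch of domain and extension heuristics. The category names, extensions, and domains listed are examples chosen for illustration, not the plan's actual categories.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Hypothetical heuristics; extend these sets to suit the real data.
DOCUMENT_EXTENSIONS = {".pdf", ".doc", ".docx", ".odt"}
VIDEO_DOMAINS = {"youtube.com", "youtu.be", "vimeo.com"}


def categorize_link(url: str) -> str:
    """Classify a link as 'document', 'video', or 'other' without fetching it."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    suffix = PurePosixPath(parsed.path).suffix.lower()
    if suffix in DOCUMENT_EXTENSIONS:
        return "document"
    if host in VIDEO_DOMAINS:
        return "video"
    return "other"
```

Only the URL is inspected, which fits the objective of capturing asset metadata without downloading binaries.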

Verification:

Commit suggestion:

Task 3.2: Add embed detection

Objective: Capture videos or embeds not visible as ordinary text links.

Files:

Step 1: Parse common embed carriers:

Step 2: Add provider inference:
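
Embed detection can be sketched with the standard library's `html.parser`. The tag set and provider rules here are assumptions; the plan's own list of embed carriers and providers is not reproduced.

```python
from html.parser import HTMLParser
from typing import Dict, List
from urllib.parse import urlparse

# Hypothetical set of tags that can carry embedded media via a src attribute.
EMBED_TAGS = {"iframe", "video", "source", "embed"}


def infer_provider(url: str) -> str:
    """Guess the hosting provider from the embed URL's hostname."""
    host = (urlparse(url).hostname or "").lower()
    if "youtube" in host or host == "youtu.be":
        return "youtube"
    if "vimeo" in host:
        return "vimeo"
    return "unknown"


class EmbedCollector(HTMLParser):
    """Collect src URLs from embed-carrying tags."""

    def __init__(self) -> None:
        super().__init__()
        self.embeds: List[Dict[str, str]] = []

    def handle_starttag(self, tag, attrs):
        if tag in EMBED_TAGS:
            src = dict(attrs).get("src")
            if src:
                self.embeds.append(
                    {"tag": tag, "url": src, "provider": infer_provider(src)}
                )


def find_embeds(html: str) -> List[Dict[str, str]]:
    collector = EmbedCollector()
    collector.feed(html)
    return collector.embeds
```

This only sees embeds present in the fetched HTML; anything injected later by JavaScript would be missed.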

Verification:

Commit suggestion:

Task 3.3: Emit asset-change events

Objective: Make collateral changes analytically visible.

Files:

Step 1: Compare old vs new asset sets.

Step 2: If changed:
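
A sketch of the asset comparison, assuming each asset is a dictionary with a `url` key and that URL identity is enough to detect change. The `assets_changed` event type and the function names are hypothetical.

```python
from typing import Dict, List, Optional


def diff_assets(old: List[dict], new: List[dict]) -> Dict[str, List[str]]:
    """Compare asset lists by URL and report additions and removals."""
    old_urls = {asset["url"] for asset in old}
    new_urls = {asset["url"] for asset in new}
    return {
        "added": sorted(new_urls - old_urls),
        "removed": sorted(old_urls - new_urls),
    }


def asset_change_event(reference: str, old: List[dict], new: List[dict]) -> Optional[dict]:
    """Build an event payload when the asset set changed, else return None."""
    diff = diff_assets(old, new)
    if not diff["added"] and not diff["removed"]:
        return None
    return {"event_type": "assets_changed", "reference": reference, **diff}
```

Comparing sets rather than lists means a reordering of the same links on the page does not register as a change.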

Verification:

Commit suggestion:


Phase 4: Test harness and smoke-test scenarios

Task 4.1: Add fixture-driven parsing tests

Objective: Make parser changes safe and repeatable without live-site dependency.

Files:

Step 1: Save representative HTML or parsed-record fixtures for:

Step 2: Assert meaningful-change detection behaves correctly.

Verification:

Commit suggestion:

Task 4.2: Add lifecycle classification tests

Objective: Codify early-withdrawal and missing-before-expiry logic.

Files:

Step 1: Add tests for:

Verification:

Commit suggestion:

Task 4.3: Add supporting asset tests

Objective: Ensure rich collateral is captured for senior campaign pages.

Files:

Step 1: Add fixtures with:

Step 2: Assert assets are captured and categorized.

Verification:

Commit suggestion:

Task 4.4: Add version-history behavior tests

Objective: Ensure history is sparse and meaningful rather than noisy.

Files:

Step 1: Add tests for:

Verification:

Commit suggestion:


Phase 5: Live smoke test checklist

Task 5.1: Run controlled live smoke tests after implementation

Objective: Validate end-to-end behavior against the real site without waiting for organic edge cases.

Files:

Checklist:

  1. Run incremental scrape
  2. Run full + refresh scrape
  3. Manually set one active job’s last_seen to an old timestamp and verify that refresh re-fetches it
  4. Start one run and confirm that the lock blocks a second concurrent run
  5. Inspect one senior role manually for supporting assets
  6. Verify that one meaningful field edit in the fixture path generates a history snapshot and an event
  7. Verify csj_events.jsonl appends valid JSON lines
  8. Verify csj_history/{reference}/ only grows when meaningful changes occur
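
Checklist item 7 can be automated with a short check like the one below. The function name is hypothetical, and the rule that every line must be a JSON object is an assumption about the event format.

```python
import json
from pathlib import Path
from typing import List


def invalid_jsonl_lines(path: Path) -> List[int]:
    """Return 1-based numbers of lines that are not valid JSON objects."""
    bad = []
    with path.open(encoding="utf-8") as handle:
        for number, line in enumerate(handle, start=1):
            try:
                parsed = json.loads(line)
            except json.JSONDecodeError:
                bad.append(number)  # unparsable, including blank lines
                continue
            if not isinstance(parsed, dict):
                bad.append(number)  # valid JSON, but not an event object
    return bad
```

An empty result for csj_events.jsonl after a smoke run indicates that every appended line parsed cleanly.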

Suggested commands:

Success criteria:


Proposed schema additions

Add to current snapshot records:

History record shape:

Event record shape:


Meaningful change policy

Always create history/event on changes to:

Create history/event on substantive changes to:

Do not create history/event for:


Resume instructions if interrupted

If work stops mid-implementation, resume in this order:

  1. Read this file: /root/.hermes/workspace/csj/plans/2026-04-13-csj-v2.3-hardening-plan.md
  2. Inspect current collector: /root/.hermes/skills/research/civil-service-jobs-collector/scripts/collector.py
  3. Check whether content_hash, field_hashes, and lifecycle_status already exist
  4. Continue from the first incomplete phase above
  5. Re-run fixture tests before any live smoke test

Final outcome expected from v2.3

After this plan is implemented, the collector should support: