How a solo archival system gets built in public
For Hermes: Use subagent-driven-development skill to implement this plan task-by-task.
Goal: Evolve the Civil Service Jobs collector from a current-state collector into a historical archive and change-tracking pipeline.
Architecture: Keep the existing per-reference current snapshot file as the fast canonical view, then add two append-oriented layers: a per-reference version history store for meaningful changes and a global event log for lifecycle/change analytics. Extend lifecycle handling for missing-before-expiry jobs and add structured supporting-asset extraction.
Tech Stack: Python 3, requests, stdlib json/hashlib/pathlib/datetime/re, existing collector at ~/.hermes/skills/research/civil-service-jobs-collector/scripts/collector.py
Primary implementation file:
/root/.hermes/skills/research/civil-service-jobs-collector/scripts/collector.py
Plan storage:
/root/.hermes/workspace/csj/plans/2026-04-13-csj-v2.3-hardening-plan.md
Recommended new test files:
/root/.hermes/workspace/csj/tests/test_change_detection.py
/root/.hermes/workspace/csj/tests/test_lifecycle_classification.py
/root/.hermes/workspace/csj/tests/test_supporting_assets.py
/root/.hermes/workspace/csj/tests/test_version_history.py
Recommended fixture directory:
/root/.hermes/workspace/csj/tests/fixtures/
Recommended new output directories created by collector:
/root/.hermes/workspace/csj/csj_history/
/root/.hermes/workspace/csj/csj_events.jsonl

Objective: Establish versioned provenance and dedicated paths for history/events.
Files:
/root/.hermes/skills/research/civil-service-jobs-collector/scripts/collector.py
Step 1: Add constants near the existing output-path constants for:
PARSER_VERSION = "2.3"
HISTORY_DIR = OUTPUT_DIR / "csj_history"
EVENTS_FILE = OUTPUT_DIR / "csj_events.jsonl"
Step 2: Ensure runtime startup creates HISTORY_DIR.
Step 3: Add parser_version to all newly written current snapshot records.
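The three steps above could look roughly like the sketch below. The OUTPUT_DIR value is inferred from the output paths listed in this plan, and ensure_output_dirs and stamp_parser_version are hypothetical helper names, not existing collector functions.

```python
from pathlib import Path

PARSER_VERSION = "2.3"

# Assumption: OUTPUT_DIR already exists in collector.py; this value is
# inferred from the output paths listed in the plan.
OUTPUT_DIR = Path("/root/.hermes/workspace/csj")
HISTORY_DIR = OUTPUT_DIR / "csj_history"
EVENTS_FILE = OUTPUT_DIR / "csj_events.jsonl"


def ensure_output_dirs(history_dir: Path = HISTORY_DIR) -> None:
    """Create the history directory at startup (hypothetical helper)."""
    history_dir.mkdir(parents=True, exist_ok=True)


def stamp_parser_version(record: dict) -> dict:
    """Return a copy of a snapshot record with parser_version set."""
    stamped = dict(record)
    stamped["parser_version"] = PARSER_VERSION
    return stamped
```

Passing the directory as a parameter keeps the helper testable against a temporary path instead of the live workspace.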
Verification:
python3 /root/.hermes/skills/research/civil-service-jobs-collector/scripts/collector.py --details -w 1 -n 1 --full --force
parser_version: "2.3"
Commit suggestion:
git commit -m "feat: add v2.3 output path and parser version constants"

Objective: Create one normalization path used by hashing, diffing, and comparisons.
Files:
/root/.hermes/skills/research/civil-service-jobs-collector/scripts/collector.py
Step 1: Add helper functions:
normalize_text_for_diff(value)
normalize_list_for_diff(values)
build_comparable_record(job_dict)
Step 2: Normalize:
Step 3: Exclude purely operational fields from comparison:
scraped_at
last_seen
first_seen
status
parser_version
schema_version
Verification:
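One way to check the normalization path is a small sketch like the following. The helper names and the excluded fields come from the steps above. The specific rules (collapse whitespace, treat None as an empty string, sort and de-duplicate lists of strings) are assumptions, since the plan does not spell them out here.

```python
import re

OPERATIONAL_FIELDS = {
    "scraped_at", "last_seen", "first_seen",
    "status", "parser_version", "schema_version",
}


def normalize_text_for_diff(value):
    """Collapse whitespace runs and trim; None becomes "" (assumed rule)."""
    if value is None:
        return ""
    return re.sub(r"\s+", " ", str(value)).strip()


def normalize_list_for_diff(values):
    """Normalize each item, drop empties, de-duplicate and sort (assumed rule)."""
    normalized = {normalize_text_for_diff(v) for v in (values or [])}
    normalized.discard("")
    return sorted(normalized)


def build_comparable_record(job_dict):
    """Copy a record without operational fields, normalizing text and lists."""
    comparable = {}
    for key, value in job_dict.items():
        if key in OPERATIONAL_FIELDS:
            continue
        if isinstance(value, (list, tuple)):
            comparable[key] = normalize_list_for_diff(value)
        elif value is None or isinstance(value, str):
            comparable[key] = normalize_text_for_diff(value)
        else:
            comparable[key] = value
    return comparable
```

Two scrapes of the same job that differ only in timestamps, spacing, or list order should then produce identical comparable records.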
Commit suggestion:
git commit -m "feat: add comparable-record normalization helpers"

Objective: Enable cheap per-field diffing and a stable overall content hash.
Files:
/root/.hermes/skills/research/civil-service-jobs-collector/scripts/collector.py
Step 1: Add helper:
stable_json_dumps(obj)
hash_value(obj)
compute_field_hashes(job_dict)
compute_content_hash(job_dict)
Step 2: Use normalized/comparable records as hash input.
Step 3: Include on current snapshot:
field_hashes
content_hash
last_changed_at
Verification:
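A quick check of the hashing helpers might look like this sketch. SHA-256 and the compact, key-sorted JSON encoding are assumptions because the plan names only the helpers. The input is expected to have already passed through build_comparable_record.

```python
import hashlib
import json


def stable_json_dumps(obj):
    """Serialize with sorted keys and fixed separators so output is stable."""
    return json.dumps(obj, sort_keys=True, separators=(",", ":"), ensure_ascii=False)


def hash_value(obj):
    """Hex SHA-256 of the stable JSON form (hash choice is an assumption)."""
    return hashlib.sha256(stable_json_dumps(obj).encode("utf-8")).hexdigest()


def compute_field_hashes(job_dict):
    """One hash per field, for cheap per-field diffing."""
    return {key: hash_value(value) for key, value in sorted(job_dict.items())}


def compute_content_hash(job_dict):
    """One hash over the whole comparable record."""
    return hash_value(job_dict)
```

Key order should not affect content_hash, and changing a single field should change only that field's entry in field_hashes.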
Commit suggestion:
git commit -m "feat: add content and field hashing for historical diffing"

Objective: Distinguish meaningful content changes from operational or cosmetic updates.
Files:
/root/.hermes/skills/research/civil-service-jobs-collector/scripts/collector.py
Step 1: Define field groups:
CRITICAL_FIELDS
TEXT_FIELDS
OPERATIONAL_FIELDS
Step 2: Add helper:
diff_job_records(old_record, new_record) -> {changed_fields, critical_changed, text_changed, cosmetic_only}
Step 3: Ensure whitespace-only or formatting-only text changes do not count as meaningful changes.
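The classification above could be sketched as follows. The members of CRITICAL_FIELDS and TEXT_FIELDS are placeholders chosen for illustration, and the whitespace-collapsing comparison stands in for whatever normalization path the collector ends up using.

```python
import re

# Placeholder memberships: the real field names are assumptions.
CRITICAL_FIELDS = {"title", "salary", "closing_date"}
TEXT_FIELDS = {"job_summary", "job_description"}
OPERATIONAL_FIELDS = {
    "scraped_at", "last_seen", "first_seen",
    "status", "parser_version", "schema_version",
}


def _normalize(value):
    """Collapse whitespace in strings; leave other values untouched."""
    if isinstance(value, str):
        return re.sub(r"\s+", " ", value).strip()
    return value


def diff_job_records(old_record, new_record):
    """Compare two records, ignoring operational fields and cosmetic edits."""
    keys = (set(old_record) | set(new_record)) - OPERATIONAL_FIELDS
    raw_changed = sorted(k for k in keys if old_record.get(k) != new_record.get(k))
    changed_fields = [
        k for k in raw_changed
        if _normalize(old_record.get(k)) != _normalize(new_record.get(k))
    ]
    return {
        "changed_fields": changed_fields,
        "critical_changed": any(k in CRITICAL_FIELDS for k in changed_fields),
        "text_changed": any(k in TEXT_FIELDS for k in changed_fields),
        # Raw values differ, but nothing survives normalization.
        "cosmetic_only": bool(raw_changed) and not changed_fields,
    }
```

Here cosmetic_only is true only when something changed on the surface yet every difference disappears after normalization, which is the case Step 3 asks to suppress.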
Verification:
Commit suggestion:
git commit -m "feat: classify meaningful CSJ record changes"

Objective: Preserve previous meaningful versions rather than overwriting everything in place.
Files:
/root/.hermes/skills/research/civil-service-jobs-collector/scripts/collector.py
Step 1: Add helper:
write_history_version(reference, snapshot_dict, changed_fields, change_type)
Step 2: On initial creation:
change_type = "first_seen"
Step 3: On meaningful change:
changed_fields
change_type such as critical_fields_changed or text_fields_changed
Step 4: On no meaningful change:
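A minimal sketch of the history writer from Steps 1 to 3, assuming one JSON file per version under csj_history/{reference}/. The timestamp-based filename and the history_dir parameter are assumptions added for illustration; the plan fixes only the helper name, its first four arguments, and the history_recorded_at, change_type and changed_fields fields. When a run finds no meaningful change, the caller would skip this helper so the directory only grows on real changes.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def write_history_version(reference, snapshot_dict, changed_fields, change_type,
                          history_dir=Path("csj_history")):
    """Write one immutable version file and return its path."""
    recorded_at = datetime.now(timezone.utc)
    record = dict(snapshot_dict)
    record.update({
        "history_recorded_at": recorded_at.isoformat(),
        "change_type": change_type,
        "changed_fields": list(changed_fields),
    })
    target_dir = Path(history_dir) / str(reference)
    target_dir.mkdir(parents=True, exist_ok=True)
    # Assumed filename layout: UTC timestamp with microseconds.
    path = target_dir / f"{recorded_at.strftime('%Y%m%dT%H%M%S%f')}Z.json"
    path.write_text(
        json.dumps(record, indent=2, sort_keys=True, ensure_ascii=False),
        encoding="utf-8",
    )
    return path
```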
Verification:
csj_history/{reference}/
Commit suggestion:
git commit -m "feat: add per-reference history snapshot storage"

Objective: Create a lightweight chronological ledger for analytics and auditing.
Files:
/root/.hermes/skills/research/civil-service-jobs-collector/scripts/collector.py
Step 1: Add helper:
append_event(reference, event_type, summary, changed_fields=None, old_values=None, new_values=None, confidence="high", source="collector")
Step 2: Emit events for:
Step 3: Write events as JSONL to csj_events.jsonl.
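The event helper could be sketched as below, using the signature from Step 1 and the event record fields listed in the data contract. The events_file keyword and the choice to store empty containers instead of None are assumptions made so the sketch can run against a temporary file.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def append_event(reference, event_type, summary, changed_fields=None,
                 old_values=None, new_values=None, confidence="high",
                 source="collector", events_file=Path("csj_events.jsonl")):
    """Append one event as a single JSON line and return the event dict."""
    event = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "reference": reference,
        "event_type": event_type,
        "summary": summary,
        "changed_fields": list(changed_fields) if changed_fields else [],
        "old_values": old_values or {},
        "new_values": new_values or {},
        "confidence": confidence,
        "source": source,
    }
    # Append mode keeps earlier lines untouched; one event per line.
    with open(events_file, "a", encoding="utf-8") as handle:
        handle.write(json.dumps(event, ensure_ascii=False) + "\n")
    return event
```

Because every line is an independent JSON object, the log can be tailed, grepped, or loaded line by line without parsing the whole file.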
Verification:
field_changed.
Commit suggestion:
git commit -m "feat: add append-only CSJ event log"

Objective: Track ambiguity around jobs that disappear before expiry.
Files:
/root/.hermes/skills/research/civil-service-jobs-collector/scripts/collector.py
Step 1: Add fields to current records:
lifecycle_status
first_missing_at
consecutive_missing_runs
Step 2: Keep status as the broad state for compatibility, but use lifecycle_status for nuance.
Step 3: Mapping:
status=active, lifecycle_status=active
status=inactive, lifecycle_status=closed
status=active, lifecycle_status=missing_unconfirmed
Verification:
Commit suggestion:
git commit -m "feat: add nuanced lifecycle fields for missing roles"

Objective: Avoid both false closures and indefinite active states.
Files:
/root/.hermes/skills/research/civil-service-jobs-collector/scripts/collector.py
Step 1: In the lifecycle pass, if a previously known job is absent from results and close date is still future:
first_missing_at empty, set it
consecutive_missing_runs
lifecycle_status = "missing_unconfirmed"
missing_from_results event
Step 2: Add threshold constant, e.g.:
MISSING_CONFIRMATION_RUNS = 3
Step 3: After threshold is met, queue direct URL verification.
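The lifecycle pass above might be sketched as follows. The closing_date field name, its ISO date format, the function name, and the decision to treat the closing day itself as still open are all assumptions. The sketch also reads Step 1 as incrementing consecutive_missing_runs on each missing run.

```python
from datetime import date

MISSING_CONFIRMATION_RUNS = 3


def mark_missing_before_expiry(record, today, now_iso):
    """Update a known job that is absent from results; return an event stub.

    Returns None when the job has already passed its close date, since that
    case belongs to ordinary closure handling rather than this path.
    """
    closing = date.fromisoformat(record["closing_date"])  # assumed field
    if closing < today:
        return None
    if not record.get("first_missing_at"):
        record["first_missing_at"] = now_iso
    record["consecutive_missing_runs"] = record.get("consecutive_missing_runs", 0) + 1
    record["lifecycle_status"] = "missing_unconfirmed"
    return {
        "event_type": "missing_from_results",
        # True once the threshold is met, so the caller can queue a URL check.
        "queue_url_verification":
            record["consecutive_missing_runs"] >= MISSING_CONFIRMATION_RUNS,
    }
```

The broad status field is left alone here, matching the mapping of status=active with lifecycle_status=missing_unconfirmed.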
Verification:
Commit suggestion:
git commit -m "feat: track repeated missing-before-expiry jobs"

Objective: Confirm likely withdrawals where possible.
Files:
/root/.hermes/skills/research/civil-service-jobs-collector/scripts/collector.py
Step 1: Add helper:
verify_missing_job_url(job_record)
Step 2: Classify outcomes:
withdrawn_confirmed
missing_unconfirmed or active_hidden
Step 3: Emit event with confidence level.
Verification:
Commit suggestion:
git commit -m "feat: verify suspected withdrawn jobs via direct URL"

Objective: Preserve a job’s continuity when it returns after a missing state.
Files:
/root/.hermes/skills/research/civil-service-jobs-collector/scripts/collector.py
Step 1: If a job previously had lifecycle_status of missing/withdrawn_candidate and appears again:
lifecycle_status = "reopened" briefly or set active + emit reopened event
first_missing_at
consecutive_missing_runs
Step 2: Decide whether reopened should persist as a status or only as an event.
Recommendation: keep active as current lifecycle_status after emitting a reopened event.
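Following that recommendation, the reappearance handling could be sketched as below. The set of states treated as "missing", the function name, and the reading of Step 1 as clearing first_missing_at and resetting consecutive_missing_runs are assumptions.

```python
# Assumed set of lifecycle states that count as "was missing".
MISSING_STATES = {"missing_unconfirmed", "withdrawn_candidate", "withdrawn_confirmed"}


def handle_reappearance(record):
    """Reset missing-state fields for a job seen again; return a reopened event.

    Returns None when the job was not in a missing state, so ordinary
    sightings do not produce events.
    """
    previous = record.get("lifecycle_status")
    if previous not in MISSING_STATES:
        return None
    event = {
        "event_type": "reopened",
        "old_values": {
            "lifecycle_status": previous,
            "first_missing_at": record.get("first_missing_at"),
            "consecutive_missing_runs": record.get("consecutive_missing_runs", 0),
        },
        "new_values": {"lifecycle_status": "active"},
    }
    record["status"] = "active"
    record["lifecycle_status"] = "active"
    record["first_missing_at"] = None
    record["consecutive_missing_runs"] = 0
    return event
```

Keeping the old values inside the event preserves the gap for later analysis even though the current record returns to a plain active state.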
Verification:
Commit suggestion:
git commit -m "feat: track reappearing jobs as reopened events"

Objective: Capture supporting collateral metadata without downloading binaries.
Files:
/root/.hermes/skills/research/civil-service-jobs-collector/scripts/collector.py
Step 1: Add extraction helpers for:
href
Step 2: Store on each job:
supporting_links
attachments
embeds
Step 3: Categorize links by domain/extension heuristics:
Verification:
Commit suggestion:
git commit -m "feat: extract structured supporting links and assets"

Objective: Capture videos or embeds not visible as ordinary text links.
Files:
/root/.hermes/skills/research/civil-service-jobs-collector/scripts/collector.py
Step 1: Parse common embed carriers:
iframe src
video source
Step 2: Add provider inference:
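Steps 1 and 2 could be sketched with the standard-library HTML parser. The provider table, the embed record keys (tag, url, provider), and the names EmbedExtractor, extract_embeds and infer_provider are assumptions for illustration; the existing collector may parse pages differently.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

# Assumed provider table; extend as real pages require.
PROVIDER_DOMAINS = {
    "youtube.com": "youtube",
    "youtu.be": "youtube",
    "vimeo.com": "vimeo",
}


def infer_provider(url):
    """Map an embed URL to a provider name by hostname suffix."""
    host = (urlparse(url).hostname or "").lower()
    for domain, provider in PROVIDER_DOMAINS.items():
        if host == domain or host.endswith("." + domain):
            return provider
    return "unknown"


class EmbedExtractor(HTMLParser):
    """Collect src URLs from iframe, video, and source-inside-video tags."""

    def __init__(self):
        super().__init__()
        self.embeds = []
        self._in_video = False

    def handle_starttag(self, tag, attrs):
        src = dict(attrs).get("src")
        if tag == "video":
            self._in_video = True
        is_carrier = tag in ("iframe", "video") or (tag == "source" and self._in_video)
        if is_carrier and src:
            self.embeds.append({"tag": tag, "url": src, "provider": infer_provider(src)})

    def handle_endtag(self, tag):
        if tag == "video":
            self._in_video = False


def extract_embeds(html):
    """Return embed metadata found in an HTML string; nothing is downloaded."""
    parser = EmbedExtractor()
    parser.feed(html)
    parser.close()
    return parser.embeds
```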
Verification:
embeds record captured.
Commit suggestion:
git commit -m "feat: capture embedded collateral metadata"

Objective: Make collateral changes analytically visible.
Files:
/root/.hermes/skills/research/civil-service-jobs-collector/scripts/collector.py
Step 1: Compare old vs new asset sets.
Step 2: If changed:
supporting_asset_added / supporting_asset_removed / generic field_changed
Verification:
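A simple check of the asset comparison could use a sketch like this one. It assumes each asset is a dict keyed by url and that identity is decided by URL alone; the function name diff_supporting_assets is hypothetical.

```python
def diff_supporting_assets(old_assets, new_assets):
    """Return added/removed event stubs by comparing asset URLs as sets."""
    old_urls = {asset["url"] for asset in old_assets}
    new_urls = {asset["url"] for asset in new_assets}
    events = [
        {"event_type": "supporting_asset_added", "url": url}
        for url in sorted(new_urls - old_urls)
    ]
    events += [
        {"event_type": "supporting_asset_removed", "url": url}
        for url in sorted(old_urls - new_urls)
    ]
    return events
```

Sorting the URLs keeps event order deterministic across runs, which makes fixture-based assertions straightforward.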
Commit suggestion:
git commit -m "feat: log changes in supporting collateral"

Objective: Make parser changes safe and repeatable without live-site dependency.
Files:
/root/.hermes/workspace/csj/tests/test_change_detection.py
/root/.hermes/workspace/csj/tests/fixtures/
Step 1: Save representative HTML or parsed-record fixtures for:
Step 2: Assert meaningful-change detection behaves correctly.
Verification:
python3 -m pytest /root/.hermes/workspace/csj/tests/test_change_detection.py -q
Commit suggestion:
git commit -m "test: add fixture-driven change detection tests"

Objective: Codify early-withdrawal and missing-before-expiry logic.
Files:
/root/.hermes/workspace/csj/tests/test_lifecycle_classification.py
Step 1: Add tests for:
Verification:
python3 -m pytest /root/.hermes/workspace/csj/tests/test_lifecycle_classification.py -q
Commit suggestion:
git commit -m "test: add lifecycle classification coverage"

Objective: Ensure rich collateral is captured for senior campaign pages.
Files:
/root/.hermes/workspace/csj/tests/test_supporting_assets.py
Step 1: Add fixtures with:
Step 2: Assert assets are captured and categorized.
Verification:
python3 -m pytest /root/.hermes/workspace/csj/tests/test_supporting_assets.py -q
Commit suggestion:
git commit -m "test: add supporting collateral extraction coverage"

Objective: Ensure history is sparse and meaningful rather than noisy.
Files:
/root/.hermes/workspace/csj/tests/test_version_history.py
Step 1: Add tests for:
Verification:
python3 -m pytest /root/.hermes/workspace/csj/tests/test_version_history.py -q
Commit suggestion:
git commit -m "test: add historical versioning behavior tests"

Objective: Validate end-to-end behavior against the real site without waiting for organic edge cases.
Files:
Checklist:
last_seen to old timestamp and verify refresh re-fetches it
csj_events.jsonl appends valid JSON lines
csj_history/{reference}/ only grows when meaningful changes occur
Suggested commands:
python3 /root/.hermes/skills/research/civil-service-jobs-collector/scripts/collector.py --details -w 5
python3 /root/.hermes/skills/research/civil-service-jobs-collector/scripts/collector.py --details -w 5 --full --refresh
Success criteria:
Add to current snapshot records:
parser_version: str
content_hash: str
field_hashes: dict[str, str]
last_changed_at: str | null
lifecycle_status: str
first_missing_at: str | null
consecutive_missing_runs: int
supporting_links: list
attachments: list
embeds: list
History record shape:
history_recorded_at
change_type
changed_fields
Event record shape:
timestamp
reference
event_type
summary
changed_fields
old_values
new_values
confidence
source
Always create history/event on changes to:
Create history/event on substantive changes to:
Do not create history/event for:
If work stops mid-implementation, resume in this order:
/root/.hermes/workspace/csj/plans/2026-04-13-csj-v2.3-hardening-plan.md
/root/.hermes/skills/research/civil-service-jobs-collector/scripts/collector.py
content_hash, field_hashes, and lifecycle_status already exist
After this plan is implemented, the collector should support: