CSJ Collector v2.3 Hardening Implementation Plan

For Hermes: Use subagent-driven-development skill to implement this plan task-by-task.

Goal: Evolve the Civil Service Jobs collector from a current-state collector into a historical archive and change-tracking pipeline.

Architecture: Keep the existing per-reference current snapshot file as the fast canonical view, then add two append-oriented layers: a per-reference version history store for meaningful changes and a global event log for lifecycle/change analytics. Extend lifecycle handling for missing-before-expiry jobs and add structured supporting-asset extraction.

Tech Stack: Python 3, requests, stdlib json/hashlib/pathlib/datetime/re, existing collector at /root/.hermes/skills/research/civil-service-jobs-collector/scripts/collector.py

File targets

Primary implementation file:

Modify: /root/.hermes/skills/research/civil-service-jobs-collector/scripts/collector.py

Plan storage:

This file: /root/.hermes/workspace/csj/plans/2026-04-13-csj-v2.3-hardening-plan.md

Phase 0: Baseline and safety rails

Task 0.1: Add parser version constant and new directory constants

Objective: Establish versioned provenance and dedicated paths for history/events.

Files:

Modify: /root/.hermes/skills/research/civil-service-jobs-collector/scripts/collector.py

Step 1: Add constants near the existing output-path constants for:

PARSER_VERSION = "2.3"
HISTORY_DIR = OUTPUT_DIR / "csj_history"
EVENTS_FILE = OUTPUT_DIR / "csj_events.jsonl"

Step 2: Ensure runtime startup creates HISTORY_DIR.

Step 3: Add parser_version to all newly written current snapshot records.
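
A minimal sketch of Steps 1 to 3, assuming OUTPUT_DIR is already defined as a pathlib.Path in collector.py. The OUTPUT_DIR value below is a placeholder, and ensure_output_dirs and stamp_parser_version are hypothetical helper names, not existing collector functions.

```python
from pathlib import Path

# Assumption: collector.py already defines OUTPUT_DIR; this value is a placeholder.
OUTPUT_DIR = Path("csj_output_example")

PARSER_VERSION = "2.3"
HISTORY_DIR = OUTPUT_DIR / "csj_history"
EVENTS_FILE = OUTPUT_DIR / "csj_events.jsonl"


def ensure_output_dirs(history_dir: Path = HISTORY_DIR) -> None:
    """Create the history directory and any missing parents; safe to call on every run."""
    Path(history_dir).mkdir(parents=True, exist_ok=True)


def stamp_parser_version(record: dict) -> dict:
    """Return a copy of a current snapshot record that carries the parser version."""
    stamped = dict(record)
    stamped["parser_version"] = PARSER_VERSION
    return stamped
```

Calling ensure_output_dirs() once at startup keeps the later history writes free of directory checks.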

Verification:

Run: python3 /root/.hermes/skills/research/civil-service-jobs-collector/scripts/collector.py --details -w 1 -n 1 --full --force
Expected: current output file includes parser_version: "2.3"

Commit suggestion:

git commit -m "feat: add v2.3 output path and parser version constants"

Task 0.2: Add a DRY internal helper for stable record normalization

Objective: Create one normalization path used by hashing, diffing, and comparisons.

Files:

Modify: /root/.hermes/skills/research/civil-service-jobs-collector/scripts/collector.py

Step 1: Add helper functions:

normalize_text_for_diff(value)
normalize_list_for_diff(values)
build_comparable_record(job_dict)

Step 2: Normalize:

whitespace collapsing for long text fields
line ending normalization
stable ordering for semantically unordered arrays where appropriate
null/empty handling consistency

Step 3: Exclude purely operational fields from comparison:

scraped_at
last_seen
first_seen
status
parser_version
schema_version
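
The three helpers could look roughly like the sketch below. It assumes list fields hold plain strings (non-string items are stringified) and that the set of semantically unordered list fields is known up front; UNORDERED_LIST_FIELDS and its single entry are illustrative guesses, and the extra ordered parameter is an addition to the signature listed in Step 1.

```python
import re

# Operational fields excluded from comparison, as listed in Step 3.
OPERATIONAL_FIELDS = {
    "scraped_at", "last_seen", "first_seen",
    "status", "parser_version", "schema_version",
}

# Assumption: illustrative only; the real set depends on the collector's schema.
UNORDERED_LIST_FIELDS = {"locations"}


def normalize_text_for_diff(value):
    """Normalize line endings, collapse whitespace runs, and map None to ''."""
    if value is None:
        return ""
    text = str(value).replace("\r\n", "\n").replace("\r", "\n")
    return re.sub(r"\s+", " ", text).strip()


def normalize_list_for_diff(values, ordered=False):
    """Normalize each item, drop empties, and sort unless order is meaningful."""
    items = [normalize_text_for_diff(v) for v in (values or [])]
    items = [item for item in items if item]
    return items if ordered else sorted(items)


def build_comparable_record(job_dict):
    """Return a copy suitable for hashing and diffing, without operational fields."""
    comparable = {}
    for key, value in job_dict.items():
        if key in OPERATIONAL_FIELDS:
            continue
        if isinstance(value, (list, tuple)):
            comparable[key] = normalize_list_for_diff(
                value, ordered=key not in UNORDERED_LIST_FIELDS
            )
        elif isinstance(value, str) or value is None:
            comparable[key] = normalize_text_for_diff(value)
        else:
            comparable[key] = value  # numbers, booleans, nested dicts pass through
    return comparable
```

Treating None and the empty string as the same value is one way to satisfy the "null/empty handling consistency" item; a stricter policy would need a different rule here.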

Verification:

In a Python shell, run a temporary test snippet to confirm that records differing only in whitespace produce identical comparable structures.

Commit suggestion:

git commit -m "feat: add comparable-record normalization helpers"

Phase 1: Historical versioning and change detection

Task 1.1: Add field hashing helpers

Objective: Enable cheap per-field diffing and a stable overall content hash.

Files:

Modify: /root/.hermes/skills/research/civil-service-jobs-collector/scripts/collector.py

Step 1: Add helpers:

stable_json_dumps(obj)
hash_value(obj)
compute_field_hashes(job_dict)
compute_content_hash(job_dict)

Step 2: Use normalized/comparable records as hash input.

Step 3: Include on current snapshot:

field_hashes
content_hash
last_changed_at
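
One possible shape for the hashing helpers, as a sketch only. It assumes the input has already been passed through build_comparable_record, and it picks SHA-256 over canonical JSON as the hash; the plan does not name an algorithm, so that choice is an assumption.

```python
import hashlib
import json


def stable_json_dumps(obj):
    """Serialize with sorted keys and fixed separators so equal data gives equal text."""
    return json.dumps(obj, sort_keys=True, separators=(",", ":"), ensure_ascii=False)


def hash_value(obj):
    """Hex SHA-256 digest of the canonical JSON form of obj."""
    return hashlib.sha256(stable_json_dumps(obj).encode("utf-8")).hexdigest()


def compute_field_hashes(comparable_record):
    """One digest per field, which allows cheap per-field change checks."""
    return {key: hash_value(value) for key, value in sorted(comparable_record.items())}


def compute_content_hash(comparable_record):
    """Single digest over the whole comparable record."""
    return hash_value(comparable_record)
```

Because keys are sorted during serialization, dict insertion order does not affect either hash.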

Verification:

Run collector twice with no meaningful change.
Expected: content_hash is identical across both runs.

Commit suggestion:

git commit -m "feat: add content and field hashing for historical diffing"

Task 1.2: Add meaningful-change classification

Objective: Distinguish meaningful content changes from operational or cosmetic updates.

Files:

Modify: /root/.hermes/skills/research/civil-service-jobs-collector/scripts/collector.py

Step 1: Define field groups:

CRITICAL_FIELDS
TEXT_FIELDS
OPERATIONAL_FIELDS

Step 2: Add helper:

diff_job_records(old_record, new_record) -> {changed_fields, critical_changed, text_changed, cosmetic_only}

Step 3: Ensure whitespace-only or formatting-only text changes do not count as meaningful changes.
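
A sketch of diff_job_records under stated assumptions: the field groups below are partial examples drawn from the "Meaningful change policy" section, and whitespace collapsing stands in for the fuller normalization from Task 0.2. The helper _collapse is hypothetical.

```python
# Assumption: partial, illustrative field groups; the full lists live in the policy section.
CRITICAL_FIELDS = {"title", "salary", "location", "closes", "num_roles"}
TEXT_FIELDS = {"job_summary", "job_description", "person_spec", "benefits", "contact"}
OPERATIONAL_FIELDS = {
    "scraped_at", "last_seen", "first_seen",
    "status", "parser_version", "schema_version",
}


def _collapse(value):
    """Collapse whitespace in strings; leave other types untouched."""
    return " ".join(value.split()) if isinstance(value, str) else value


def diff_job_records(old_record, new_record):
    """Classify differences between two records, ignoring operational fields."""
    keys = (set(old_record) | set(new_record)) - OPERATIONAL_FIELDS
    # Fields whose raw values differ at all.
    raw_changed = sorted(k for k in keys if old_record.get(k) != new_record.get(k))
    # Fields that still differ after cosmetic normalization.
    changed = [
        k for k in raw_changed
        if _collapse(old_record.get(k)) != _collapse(new_record.get(k))
    ]
    return {
        "changed_fields": changed,
        "critical_changed": any(k in CRITICAL_FIELDS for k in changed),
        "text_changed": any(k in TEXT_FIELDS for k in changed),
        "cosmetic_only": bool(raw_changed) and not changed,
    }
```

Here cosmetic_only is true only when something differed in the raw values but nothing survived normalization, so an entirely unchanged record reports False for every flag.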

Verification:

Compare old/new records where only spacing differs.
Expected: no meaningful change flagged.

Commit suggestion:

git commit -m "feat: classify meaningful CSJ record changes"

Task 1.3: Add per-reference history snapshots

Objective: Preserve previous meaningful versions rather than overwriting everything in place.

Files:

Modify: /root/.hermes/skills/research/civil-service-jobs-collector/scripts/collector.py

Step 1: Add helper:

write_history_version(reference, snapshot_dict, changed_fields, change_type)

Step 2: On initial creation:

write current snapshot as normal
optionally write an initial history version with change_type = "first_seen"

Step 3: On meaningful change:

write a history version before or alongside updating current snapshot
include changed_fields
include change_type such as critical_fields_changed or text_fields_changed

Step 4: On no meaningful change:

update operational timestamps only
do not create redundant history entries
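
A hedged sketch of write_history_version. The file naming scheme (a UTC timestamp per version), the history_dir parameter, and the placeholder HISTORY_DIR value are assumptions; the plan only fixes the csj_history/{reference}/ layout and the extra fields on each history record.

```python
import json
from datetime import datetime, timezone
from pathlib import Path

# Assumption: placeholder; collector.py would supply the real HISTORY_DIR.
HISTORY_DIR = Path("csj_history")


def write_history_version(reference, snapshot_dict, changed_fields, change_type,
                          history_dir=HISTORY_DIR):
    """Write one immutable history file under <history_dir>/<reference>/ and return its path."""
    ref_dir = Path(history_dir) / str(reference)
    ref_dir.mkdir(parents=True, exist_ok=True)

    recorded_at = datetime.now(timezone.utc)
    record = dict(snapshot_dict)
    record["history_recorded_at"] = recorded_at.isoformat()
    record["change_type"] = change_type
    record["changed_fields"] = list(changed_fields or [])

    # Hypothetical naming scheme: one file per version, named by UTC timestamp.
    path = ref_dir / (recorded_at.strftime("%Y%m%dT%H%M%S%fZ") + ".json")
    # Mode "x" fails instead of overwriting if the file already exists.
    with path.open("x", encoding="utf-8") as fh:
        json.dump(record, fh, ensure_ascii=False, indent=2, sort_keys=True)
    return path
```

The caller would invoke this only when the diff reports a meaningful change, which keeps the Step 4 rule (no redundant entries) outside the writer itself.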

Verification:

Force one real field change in a test fixture.
Expected: exactly one new file appears in csj_history/{reference}/

Commit suggestion:

git commit -m "feat: add per-reference history snapshot storage"

Task 1.4: Add append-only event logging

Objective: Create a lightweight chronological ledger for analytics and auditing.

Files:

Modify: /root/.hermes/skills/research/civil-service-jobs-collector/scripts/collector.py

Step 1: Add helper:

append_event(reference, event_type, summary, changed_fields=None, old_values=None, new_values=None, confidence="high", source="collector")

Step 2: Emit events for:

first_seen
field_changed
refreshed
closed
missing_from_results
withdrawn_confirmed
reopened
supporting_asset_added
supporting_asset_removed

Step 3: Write events as JSONL to csj_events.jsonl.
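
A sketch of append_event using the signature from Step 1 and the record shape from "Proposed schema additions". The events_file parameter, the placeholder path, and the event-type validation are assumptions added for illustration; supporting_asset_removed is included because Task 3.3 emits it.

```python
import json
from datetime import datetime, timezone
from pathlib import Path

# Assumption: placeholder; collector.py would supply the real EVENTS_FILE.
EVENTS_FILE = Path("csj_events.jsonl")

EVENT_TYPES = {
    "first_seen", "field_changed", "refreshed", "closed",
    "missing_from_results", "withdrawn_confirmed", "reopened",
    "supporting_asset_added", "supporting_asset_removed",
}


def append_event(reference, event_type, summary, changed_fields=None,
                 old_values=None, new_values=None, confidence="high",
                 source="collector", events_file=EVENTS_FILE):
    """Append one event as a single JSON line and return the event dict."""
    if event_type not in EVENT_TYPES:
        raise ValueError(f"unknown event_type: {event_type!r}")
    event = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "reference": reference,
        "event_type": event_type,
        "summary": summary,
        "changed_fields": list(changed_fields or []),
        "old_values": old_values or {},
        "new_values": new_values or {},
        "confidence": confidence,
        "source": source,
    }
    # json.dumps escapes embedded newlines, so each event stays on one line.
    line = json.dumps(event, ensure_ascii=False, sort_keys=True)
    with Path(events_file).open("a", encoding="utf-8") as fh:
        fh.write(line + "\n")
    return event
```

Opening in append mode on every call is simple and adequate if only one collector run writes at a time, which the existing run lock is expected to guarantee.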

Verification:

Run one test update that changes salary.
Expected: one JSONL event with event type field_changed.

Commit suggestion:

git commit -m "feat: add append-only CSJ event log"

Phase 2: Lifecycle classification for historical integrity

Task 2.1: Expand lifecycle fields on current records

Objective: Track ambiguity around jobs that disappear before expiry.

Files:

Modify: /root/.hermes/skills/research/civil-service-jobs-collector/scripts/collector.py

Step 1: Add fields to current records:

lifecycle_status
first_missing_at
consecutive_missing_runs

Step 2: Keep status as the broad state for compatibility, but use lifecycle_status for nuance.

Step 3: Mapping:

active listing in results -> status=active, lifecycle_status=active
missing after expiry -> status=inactive, lifecycle_status=closed
missing before expiry first time -> status=active, lifecycle_status=missing_unconfirmed
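
The mapping above can be expressed as a small pure function. This sketch assumes closes_iso is a YYYY-MM-DD string, that a job closing today has not yet expired, and that a job with no close date is treated as not expired; classify_lifecycle is a hypothetical name.

```python
from datetime import date


def classify_lifecycle(in_results, closes_iso, today):
    """Return (status, lifecycle_status) for one known job."""
    if in_results:
        return "active", "active"
    # Assumption: the close date itself still counts as open.
    if closes_iso and date.fromisoformat(closes_iso) < today:
        return "inactive", "closed"
    # Missing while the close date is still in the future (or unknown).
    return "active", "missing_unconfirmed"
```

For example, with today set to 2026-04-13, a job absent from results with closes_iso "2026-05-01" maps to status active and lifecycle_status missing_unconfirmed.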

Verification:

Run lifecycle pass against controlled fixture set.
Expected: missing-before-expiry jobs get lifecycle_status=missing_unconfirmed instead of staying plain active.

Commit suggestion:

git commit -m "feat: add nuanced lifecycle fields for missing roles"

Task 2.2: Add missing-before-expiry progression rules

Objective: Avoid both false closures and indefinite active states.

Files:

Modify: /root/.hermes/skills/research/civil-service-jobs-collector/scripts/collector.py

Step 1: In the lifecycle pass, if a previously known job is absent from results and close date is still future:

if first_missing_at empty, set it
increment consecutive_missing_runs
set lifecycle_status = "missing_unconfirmed"
append missing_from_results event

Step 2: Add threshold constant, e.g.:

MISSING_CONFIRMATION_RUNS = 3

Step 3: After threshold is met, queue direct URL verification.
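
A sketch of the progression rule for one missing run. The function name record_missing_run and its (record, needs_verification) return value are assumptions; emitting the missing_from_results event is left to the caller.

```python
MISSING_CONFIRMATION_RUNS = 3


def record_missing_run(record, now_iso):
    """Update missing-state counters for a job absent from results before its close date.

    Returns (updated_record, needs_verification).
    """
    updated = dict(record)
    if not updated.get("first_missing_at"):
        updated["first_missing_at"] = now_iso
    updated["consecutive_missing_runs"] = int(updated.get("consecutive_missing_runs") or 0) + 1
    updated["lifecycle_status"] = "missing_unconfirmed"
    # Queue direct URL verification once the threshold is reached.
    needs_verification = updated["consecutive_missing_runs"] >= MISSING_CONFIRMATION_RUNS
    return updated, needs_verification
```

Returning a copy rather than mutating in place makes the three-run simulation in the verification step easy to assert against.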

Verification:

Simulate 3 consecutive missing runs on a future-closing job.
Expected: job enters verification path rather than remaining plain active.

Commit suggestion:

git commit -m "feat: track repeated missing-before-expiry jobs"

Task 2.3: Add direct-URL verification for suspected withdrawals

Objective: Confirm likely withdrawals where possible.

Files:

Modify: /root/.hermes/skills/research/civil-service-jobs-collector/scripts/collector.py

Step 1: Add helper:

verify_missing_job_url(job_record)

Step 2: Classify outcomes:

404 / unavailable / explicit withdrawal copy -> withdrawn_confirmed
reachable detail page with valid reference -> remain missing_unconfirmed or active_hidden
ambiguous/broken fetch -> keep low-confidence missing state

Step 3: Emit event with confidence level.
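
The outcome classification can be kept separate from the HTTP fetch so it is testable against the synthetic fixtures. The sketch below is that pure classifier only; the phrase list, the treatment of HTTP 410, and the "medium" and "low" confidence labels are assumptions, and classify_verification is a hypothetical helper that verify_missing_job_url might call after fetching.

```python
# Assumption: example wording only; real withdrawal copy must be taken from the live site.
WITHDRAWAL_PHRASES = (
    "this job has been withdrawn",
    "this vacancy is no longer available",
)


def classify_verification(status_code, html, reference):
    """Return (lifecycle_status, confidence) from a direct-URL fetch result."""
    if status_code in (404, 410):
        return "withdrawn_confirmed", "high"
    if status_code != 200:
        # Timeouts (None), server errors, and redirects are treated as ambiguous.
        return "missing_unconfirmed", "low"
    text = (html or "").lower()
    if any(phrase in text for phrase in WITHDRAWAL_PHRASES):
        return "withdrawn_confirmed", "high"
    if reference and str(reference).lower() in text:
        # Page is reachable and still shows the reference: not withdrawn yet.
        return "missing_unconfirmed", "medium"
    return "missing_unconfirmed", "low"
```

The returned confidence can be passed straight into the confidence argument of the event emitted in Step 3.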

Verification:

Test with synthetic HTML fixtures for:
- valid page
- unavailable page
- withdrawn wording

Commit suggestion:

git commit -m "feat: verify suspected withdrawn jobs via direct URL"

Task 2.4: Handle reappearance cleanly

Objective: Preserve a job’s continuity when it returns after a missing state.

Files:

Modify: /root/.hermes/skills/research/civil-service-jobs-collector/scripts/collector.py

Step 1: If a job previously had a missing or withdrawn lifecycle_status (for example missing_unconfirmed or withdrawn_confirmed) and appears in results again:

set lifecycle_status = "reopened" briefly or set active + emit reopened event
reset first_missing_at
reset consecutive_missing_runs

Step 2: Decide whether reopened should persist as a status or only as an event. Recommendation: keep active as current lifecycle_status after emitting a reopened event.
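
Following the recommendation in Step 2, a sketch that sets lifecycle_status back to active and returns a reopened event payload for the caller to log. The set of states counted as "missing" and the handle_reappearance name are assumptions.

```python
# Assumption: which prior states count as "was missing" is illustrative.
MISSING_STATES = {"missing_unconfirmed", "withdrawn_confirmed"}


def handle_reappearance(record):
    """Reset missing-state tracking when a job shows up in results again.

    Returns (updated_record, event_or_None).
    """
    previous = record.get("lifecycle_status")
    if previous not in MISSING_STATES:
        return dict(record), None
    updated = dict(record)
    updated["status"] = "active"
    updated["lifecycle_status"] = "active"
    updated["first_missing_at"] = None
    updated["consecutive_missing_runs"] = 0
    event = {
        "reference": record.get("reference"),
        "event_type": "reopened",
        "summary": f"Listing reappeared after {previous}",
    }
    return updated, event
```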

Verification:

Simulate missing then reappearing listing.
Expected: event written, counters reset.

Commit suggestion:

git commit -m "feat: track reappearing jobs as reopened events"

Phase 3: Supporting asset extraction

Task 3.1: Add structured link extraction from detail pages

Objective: Capture supporting collateral metadata without downloading binaries.

Files:

Modify: /root/.hermes/skills/research/civil-service-jobs-collector/scripts/collector.py

Step 1: Add extraction helpers for:

anchor href
visible link text
domain
inferred type/category

Step 2: Store on each job:

supporting_links
attachments
embeds

Step 3: Categorize links by domain/extension heuristics:

youtube
vimeo
pdf_candidate_pack
campaign_site
webinar
attachment_other
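
A stdlib-only sketch of the extraction and categorization steps, using html.parser rather than whatever parser the collector currently uses (an assumption). The heuristics are deliberately simple guesses: relative links are skipped as site navigation, every PDF is labelled pdf_candidate_pack, and any other external link falls back to campaign_site.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class LinkCollector(HTMLParser):
    """Collect href and visible text for each <a> element."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        href = dict(attrs).get("href")
        if tag == "a" and href:
            self._href, self._text = href, []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            text = " ".join("".join(self._text).split())
            self.links.append({"href": self._href, "text": text})
            self._href = None


def _on_domain(domain, root):
    return domain == root or domain.endswith("." + root)


def categorize_link(href, text=""):
    """Infer a category from domain and extension; heuristics are illustrative."""
    parsed = urlparse(href)
    domain, path = parsed.netloc.lower(), parsed.path.lower()
    if _on_domain(domain, "youtube.com") or domain == "youtu.be":
        return "youtube"
    if _on_domain(domain, "vimeo.com"):
        return "vimeo"
    if path.endswith(".pdf"):
        return "pdf_candidate_pack"
    if "webinar" in text.lower() or "webinar" in path:
        return "webinar"
    if path.endswith((".doc", ".docx", ".odt", ".xlsx")):
        return "attachment_other"
    return "campaign_site"


def extract_supporting_links(html):
    """Return structured records for absolute links found in the HTML."""
    collector = LinkCollector()
    collector.feed(html)
    records = []
    for link in collector.links:
        domain = urlparse(link["href"]).netloc.lower()
        if not domain:
            continue  # assumption: relative links are site navigation, not collateral
        records.append({**link, "domain": domain,
                        "category": categorize_link(link["href"], link["text"])})
    return records
```

How these records are split across supporting_links and attachments is left open here; one option is to route the pdf_candidate_pack and attachment_other categories to attachments.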

Verification:

Use fixture HTML containing:
- YouTube link
- PDF link
- external microsite link
Expected: all three appear in structured asset fields.

Commit suggestion:

git commit -m "feat: extract structured supporting links and assets"

Task 3.2: Add embed detection

Objective: Capture videos or embeds not visible as ordinary text links.

Files:

Modify: /root/.hermes/skills/research/civil-service-jobs-collector/scripts/collector.py

Step 1: Parse common embed carriers:

iframe src
video source
known embed div data attributes if present

Step 2: Add provider inference:

YouTube
Vimeo
other
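
A sketch of embed detection for the first two carriers (iframe src and video source), again with html.parser as an assumed parser. Embed div data attributes are omitted because their names are site-specific and not given in this plan. The record keys carrier, src, and provider are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


def infer_provider(src):
    """Map an embed URL to 'youtube', 'vimeo', or 'other' by hostname."""
    host = urlparse(src).netloc.lower()

    def on(root):
        return host == root or host.endswith("." + root)

    if on("youtube.com") or on("youtube-nocookie.com") or host == "youtu.be":
        return "youtube"
    if on("vimeo.com"):
        return "vimeo"
    return "other"


class EmbedCollector(HTMLParser):
    """Collect iframe sources and video sources."""

    def __init__(self):
        super().__init__()
        self.embeds = []
        self._in_video = False

    def _add(self, carrier, src):
        self.embeds.append({"carrier": carrier, "src": src,
                            "provider": infer_provider(src)})

    def handle_starttag(self, tag, attrs):
        src = dict(attrs).get("src")
        if tag == "iframe" and src:
            self._add("iframe", src)
        elif tag == "video":
            self._in_video = True
            if src:
                self._add("video", src)
        elif tag == "source" and self._in_video and src:
            self._add("video", src)

    def handle_endtag(self, tag):
        if tag == "video":
            self._in_video = False


def extract_embeds(html):
    """Return a list of embed records found in the HTML."""
    collector = EmbedCollector()
    collector.feed(html)
    return collector.embeds
```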

Verification:

Fixture with iframe embed.
Expected: one embeds record captured.

Commit suggestion:

git commit -m "feat: capture embedded collateral metadata"

Task 3.3: Emit asset-change events

Objective: Make collateral changes analytically visible.

Files:

Modify: /root/.hermes/skills/research/civil-service-jobs-collector/scripts/collector.py

Step 1: Compare old vs new asset sets.

Step 2: If changed:

append supporting_asset_added / supporting_asset_removed / generic field_changed
create history snapshot
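
Comparing asset sets can be reduced to set difference over (field, URL) pairs. The sketch below assumes each asset record is a dict carrying its URL under href, src, or url; those key names and the diff_assets helper are assumptions.

```python
ASSET_FIELDS = ("supporting_links", "attachments", "embeds")


def asset_keys(record):
    """Return the set of (field, url) pairs for all assets on a record."""
    keys = set()
    for field in ASSET_FIELDS:
        for item in record.get(field) or []:
            url = item.get("href") or item.get("src") or item.get("url")
            if url:
                keys.add((field, url))
    return keys


def diff_assets(old_record, new_record):
    """Return event payloads for assets added or removed between two records."""
    old_keys, new_keys = asset_keys(old_record), asset_keys(new_record)
    events = []
    for field, url in sorted(new_keys - old_keys):
        events.append({"event_type": "supporting_asset_added", "field": field, "url": url})
    for field, url in sorted(old_keys - new_keys):
        events.append({"event_type": "supporting_asset_removed", "field": field, "url": url})
    return events
```

An empty result means no asset event and no history snapshot; a non-empty result would trigger both, with the affected field names passed along as changed_fields.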

Verification:

Add a link in fixture on second fetch.
Expected: event logged and history version created.

Commit suggestion:

git commit -m "feat: log changes in supporting collateral"

Phase 4: Test harness and smoke-test scenarios

Task 4.1: Add fixture-driven parsing tests

Objective: Make parser changes safe and repeatable without live-site dependency.

Files:

Create: /root/.hermes/workspace/csj/tests/test_change_detection.py
Create: /root/.hermes/workspace/csj/tests/fixtures/

Step 1: Save representative HTML or parsed-record fixtures for:

unchanged record
salary change
location change
num_roles change
text-only meaningful change
whitespace-only cosmetic change

Step 2: Assert meaningful-change detection behaves correctly.

Verification:

Run: python3 -m pytest /root/.hermes/workspace/csj/tests/test_change_detection.py -q
Expected: all tests pass.

Commit suggestion:

git commit -m "test: add fixture-driven change detection tests"

Task 4.2: Add lifecycle classification tests

Objective: Codify early-withdrawal and missing-before-expiry logic.

Files:

Create: /root/.hermes/workspace/csj/tests/test_lifecycle_classification.py

Step 1: Add tests for:

close after expiry
missing before expiry first run
missing before expiry repeated runs
direct URL withdrawn confirmation
reappearance after missing state

Verification:

Run: python3 -m pytest /root/.hermes/workspace/csj/tests/test_lifecycle_classification.py -q

Commit suggestion:

git commit -m "test: add lifecycle classification coverage"

Task 4.3: Add supporting asset tests

Objective: Ensure rich collateral is captured for senior campaign pages.

Files:

Create: /root/.hermes/workspace/csj/tests/test_supporting_assets.py

Step 1: Add fixtures with:

YouTube link
PDF candidate pack
iframe embed
external campaign site

Step 2: Assert assets are captured and categorized.

Verification:

Run: python3 -m pytest /root/.hermes/workspace/csj/tests/test_supporting_assets.py -q

Commit suggestion:

git commit -m "test: add supporting collateral extraction coverage"

Task 4.4: Add version-history behavior tests

Objective: Ensure history is sparse and meaningful rather than noisy.

Files:

Create: /root/.hermes/workspace/csj/tests/test_version_history.py

Step 1: Add tests for:

first seen creates initial state
unchanged refresh creates no duplicate history version
meaningful field change creates one history version
asset change creates one history version
lifecycle change creates one history version/event

Verification:

Run: python3 -m pytest /root/.hermes/workspace/csj/tests/test_version_history.py -q

Commit suggestion:

git commit -m "test: add historical versioning behavior tests"

Phase 5: Live smoke test checklist

Task 5.1: Run controlled live smoke tests after implementation

Objective: Validate end-to-end behavior against the real site without waiting for organic edge cases.

Files:

No code changes required

Checklist:

Run incremental scrape
Run full + refresh scrape
Manually set one active job’s last_seen to an old timestamp and verify that refresh re-fetches it
Start one run and confirm lock blocks a second concurrent run
Inspect one senior role manually for supporting assets
Verify one meaningful field edit in fixture path generates history + event
Verify csj_events.jsonl appends valid JSON lines
Verify csj_history/{reference}/ only grows when meaningful changes occur

Suggested commands:

python3 /root/.hermes/skills/research/civil-service-jobs-collector/scripts/collector.py --details -w 5
python3 /root/.hermes/skills/research/civil-service-jobs-collector/scripts/collector.py --details -w 5 --full --refresh

Success criteria:

No redundant versions on unchanged refreshes
Changed content generates one history snapshot and one event
Missing-before-expiry jobs are classified instead of silently left active
Supporting asset metadata exists where present

Proposed schema additions

Add to current snapshot records:

parser_version: str
content_hash: str
field_hashes: dict[str, str]
last_changed_at: str | null
lifecycle_status: str
first_missing_at: str | null
consecutive_missing_runs: int
supporting_links: list
attachments: list
embeds: list

History record shape:

all structured fields from current snapshot
history_recorded_at
change_type
changed_fields

Event record shape:

timestamp
reference
event_type
summary
changed_fields
old_values
new_values
confidence
source

Meaningful change policy

Always create history/event on changes to:

title
department
business_area
grade
grade_normalized
salary
salary_min
salary_max
contract_type
working_pattern
location
location_primary
closes
closes_iso
num_roles
security_clearance
supporting asset sets
lifecycle_status

Create history/event on substantive changes to:

job_summary
job_description
person_spec
benefits
contact

Do not create history/event for:

whitespace-only changes
line-wrap only changes
reordered equivalent lists
operational timestamp changes alone

Resume instructions if interrupted

If work stops mid-implementation, resume in this order:

Read this file: /root/.hermes/workspace/csj/plans/2026-04-13-csj-v2.3-hardening-plan.md
Inspect current collector: /root/.hermes/skills/research/civil-service-jobs-collector/scripts/collector.py
Check whether content_hash, field_hashes, and lifecycle_status already exist
Continue from the first incomplete phase above
Re-run fixture tests before any live smoke test

Final outcome expected from v2.3

After this plan is implemented, the collector should support:

current-state querying
meaningful historical versioning
lifecycle classification beyond simple closed/active
supporting collateral capture for senior campaign roles
future analytics on departmental hiring behavior over time