How a solo archival system gets built in public
The Civil Service Jobs (CSJ) collector maintains a historical archive of UK government job postings from civilservicejobs.service.gov.uk. Unlike the public website (which only shows current listings at a point in time), this system captures, version-tracks, and archives every job over its full lifecycle — from first appearance to closure.
The system runs two scheduled jobs:
All data lives under ~/.hermes/workspace/csj/ in a flat directory structure:
csj/
├── csj_jobs/ # Current state of every job (one JSON per reference)
│ ├── 456882.json
│ ├── 457103.json
│ └── ...
├── csj_history/ # Versioned snapshots per job
│ ├── 456882/
│ │ ├── 2026-04-10T14:30:00.json
│ │ ├── 2026-04-12T09:15:00.json
│ │ └── ...
│ └── ...
├── csj_events.jsonl # Append-only lifecycle event log
├── csj_state.json # Collector state (last reference, last run, count)
├── csj_latest.json # Summary of most recent scrape run
├── csj_attachments/ # Downloaded attachments + transcripts
│ ├── _pool/ # Content-addressed deduplicated storage
│ │ ├── a3f2b1c9...pdf # Files stored by full SHA-256 hash
│ │ ├── a3f2b1c9...pdf.md # Markdown conversion alongside original
│ │ └── ...
│ ├── 456882/ # Per-job directory (symlinks into _pool/)
│ │ ├── role_profile__a3f2b1.pdf -> ../_pool/a3f2b1c9...pdf
│ │ ├── role_profile__a3f2b1.pdf.md -> ../_pool/a3f2b1c9...pdf.md
│ │ └── ...
│ └── ...
├── csj_asset_manifests/ # Per-job asset metadata (current state)
│ ├── 456882.json
│ └── ...
├── csj_asset_history/ # Immutable per-asset version snapshots
│ ├── 456882/
│ │ ├── asset_v1.json
│ │ └── ...
│ └── ...
└── csj_asset_events.jsonl # Asset-level lifecycle events
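
The _pool/ layout in the tree above (files named by their SHA-256 hash, with per-job symlinks pointing at them) can be sketched as follows. This is a minimal illustration, not the collector's actual code: the function name store_in_pool, its parameters, and the six-character hash suffix in the link name are assumptions read off the example filenames.

```python
import hashlib
import os
import shutil
from pathlib import Path


def store_in_pool(src: Path, root: Path, reference: str, label: str) -> Path:
    """Copy src into _pool/ under its SHA-256 name and symlink it from the job directory.

    Hypothetical helper: names and link format are assumptions, not the collector's API.
    """
    digest = hashlib.sha256(src.read_bytes()).hexdigest()
    ext = src.suffix  # e.g. ".pdf"

    pool = root / "csj_attachments" / "_pool"
    pool.mkdir(parents=True, exist_ok=True)
    pooled = pool / f"{digest}{ext}"
    if not pooled.exists():  # identical content is stored only once
        shutil.copyfile(src, pooled)

    job_dir = root / "csj_attachments" / reference
    job_dir.mkdir(parents=True, exist_ok=True)
    link = job_dir / f"{label}__{digest[:6]}{ext}"
    if not link.is_symlink():
        # Relative target, so the whole archive directory can be moved intact.
        link.symlink_to(os.path.relpath(pooled, start=job_dir))
    return link
```

Because the pool filename depends only on content, two jobs that attach the same candidate pack share one stored file and differ only in their symlinks.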
Each job file in csj_jobs/{reference}.json contains ~42 fields:
| Field | Type | Description |
|---|---|---|
| reference | string | Unique CSJ reference number (globally sequential, never reused) |
| title | string | Job title |
| url | string | Direct link to job detail page |
| schema_version | string | Record format version (currently “2.2”) |
| parser_version | string | Parser version (currently “2.7”) |
| Field | Type | Description |
|---|---|---|
| department | string | Hiring department (138 unique values) |
| grade | string | Raw grade string from listing |
| grade_normalized | string\|null | Canonical grade: AA/AO/EO/HEO/SEO/G7/G6/SCS1-4 |
| salary | string | Raw salary text (e.g. “£48,350 - £57,500”) |
| salary_min | int\|null | Lower bound parsed from salary string |
| salary_max | int\|null | Upper bound parsed from salary string |
| contract_type | string | Permanent/Fixed term/etc |
| business_area | string\|null | Business area within department |
| working_pattern | array | e.g. ["Flexible working", "Full-time"] |
| location | string | Raw location text |
| location_primary | string\|null | Cleaned city list (postcodes/regions stripped) |
| closes | string | Human-readable closing date |
| closes_iso | string\|null | ISO datetime closing date |
| num_roles | int | Number of positions |
| security_clearance | string\|null | BPSS/CTC/SC/DV extracted from text |
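
As an illustration of how salary_min and salary_max might be derived from the raw salary string, here is a minimal sketch. The function name parse_salary and the rule that a single figure fills both bounds are assumptions; the collector's real parser may handle more formats.

```python
import re
from typing import Optional, Tuple

# Matches pound amounts with or without thousands separators, e.g. "£48,350" or "£30000".
_AMOUNT = re.compile(r"£\s*(\d{1,3}(?:,\d{3})+|\d+)")


def parse_salary(raw: str) -> Tuple[Optional[int], Optional[int]]:
    """Return (salary_min, salary_max), or (None, None) when no amount is found."""
    amounts = [int(m.replace(",", "")) for m in _AMOUNT.findall(raw)]
    if not amounts:
        return None, None
    # Assumption: a single figure is used as both the lower and upper bound.
    return min(amounts), max(amounts)


print(parse_salary("£48,350 - £57,500"))
```

Keeping the raw salary field next to the parsed bounds means records can be re-parsed later if the pattern turns out to miss a format.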
| Field | Type | Description |
|---|---|---|
| status | string | Broad state: “active” or “closed” |
| lifecycle_status | string | Specific: “active”, “closed”, “missing_unconfirmed”, “withdrawn_confirmed” |
| first_seen | string | ISO timestamp when reference was first scraped |
| last_seen | string | ISO timestamp when last confirmed live in search results |
| first_missing_at | string\|null | When role first disappeared before confirmed closure |
| consecutive_missing_runs | int | Full unfiltered runs the role stayed missing |
| last_changed_at | string\|null | Last time meaningful content changed |
| scraped_at | string | ISO timestamp of this particular scrape |
| Field | Type | Description |
|---|---|---|
| job_summary | string\|null | Job summary section |
| job_description | string\|null | Full job description |
| person_spec | string\|null | Person specification |
| benefits | string\|null | Benefits and pension info |
| contact | string\|null | Contact details |
| full_text | string\|null | Raw page text (source of truth for re-parsing) |
| Field | Type | Description |
|---|---|---|
| content_hash | string | Stable hash of the normalized comparable record |
| field_hashes | dict | Per-field stable hashes for granular diffing |
| asset_versions | array | Pointers to exact asset versions for this job snapshot |
| archive_completeness | string | “complete”, “partial_missing_assets”, “partial_failed_transcripts”, “no_auxiliary_assets” |
| attachments | array\|null | Downloaded files (PDFs, DOCX, candidate packs) |
| supporting_links | array\|null | External links found in job description |
| embeds | array\|null | Embedded media (YouTube/Vimeo iframes) |
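
The four archive_completeness values could be assigned along these lines. This is a hedged sketch: the downloaded and transcript_ok keys are hypothetical item fields, and the precedence (missing assets reported before failed transcripts) is an assumption rather than documented behaviour.

```python
from typing import Iterable, Mapping, Optional


def rate_completeness(
    attachments: Optional[Iterable[Mapping]],
    embeds: Optional[Iterable[Mapping]],
) -> str:
    """Classify a job's auxiliary assets into one archive_completeness value."""
    attachments = list(attachments or [])
    embeds = list(embeds or [])

    if not attachments and not embeds:
        return "no_auxiliary_assets"
    # Hypothetical key: "downloaded" marks an attachment that was fetched successfully.
    if any(not a.get("downloaded", False) for a in attachments):
        return "partial_missing_assets"
    # Hypothetical key: "transcript_ok" marks an embed whose transcript was captured.
    if any(not e.get("transcript_ok", False) for e in embeds):
        return "partial_failed_transcripts"
    return "complete"
```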
The CSJ site uses ALTCHA, a proof-of-work CAPTCHA. The collector solves it natively in pure Python (SHA-512 brute force, <0.5s) to establish an authenticated session.
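
The brute-force step can be sketched as below, assuming the challenge arrives as JSON with algorithm, challenge, salt, signature and maxnumber fields, as in ALTCHA's published challenge format. The function name solve_altcha is hypothetical, and how the CSJ site delivers the challenge or accepts the resulting payload is not covered here.

```python
import base64
import hashlib
import json
from typing import Optional


def solve_altcha(challenge: dict) -> Optional[str]:
    """Brute-force the number whose salted hash equals the challenge digest."""
    algo = challenge["algorithm"].replace("-", "").lower()  # "SHA-512" -> "sha512"
    salt = challenge["salt"]
    target = challenge["challenge"]
    max_number = challenge.get("maxnumber", 1_000_000)  # assumed default upper bound

    for number in range(max_number + 1):
        digest = hashlib.new(algo, f"{salt}{number}".encode()).hexdigest()
        if digest == target:
            payload = {
                "algorithm": challenge["algorithm"],
                "challenge": target,
                "number": number,
                "salt": salt,
                "signature": challenge.get("signature", ""),
            }
            # The solution is returned as base64-encoded JSON.
            return base64.b64encode(json.dumps(payload).encode()).decode()
    return None  # no solution within the search range
```

The cost is one hash per candidate number, so the running time depends on where the answer falls within the search range.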
- markitdown
- csj_jobs/{reference}.json
- csj_attachments/_pool/ (content-addressed by SHA-256)

The collector tracks each job through a defined lifecycle:

                    ┌─────────────────┐
                    │   first_seen    │
                    │ status: active  │
                    └────────┬────────┘
                             │
              ┌──────────────┴──────────────┐
              │                             │
    Job still in results          Job gone from results
      ┌───────────────┐           ┌───────────────────┐
      │   last_seen   │           │ Closes date past? │
      │    updated    │           └────┬─────────┬────┘
      └───────────────┘                │         │
                                      Yes        No
                               ┌───────┴──┐ ┌────┴────────────────┐
                               │  closed  │ │ missing_unconfirmed │
                               │ status:  │ │ consecutive_        │
                               │  closed  │ │ missing_runs++      │
                               └──────────┘ └──────────┬──────────┘
                                                       │
                                              After repeated runs
                                              + direct URL check
                                                       │
                                                ┌──────┴───────┐
                                                │  withdrawn_  │
                                                │  confirmed   │
                                                └──────────────┘

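
The transitions in the diagram above can be sketched as a single function applied to each job after a full run. Several details are assumptions, not documented behaviour: the function name apply_run, the threshold WITHDRAW_AFTER_RUNS, mapping withdrawn_confirmed to the broad status "closed", and resetting the missing counters when a job is seen again.

```python
from datetime import datetime
from typing import Optional

WITHDRAW_AFTER_RUNS = 3  # assumption: the real threshold is not stated


def apply_run(job: dict, in_results: bool, now: datetime,
              url_live: Optional[bool] = None) -> dict:
    """Return an updated copy of job after one full, unfiltered run."""
    job = dict(job)
    if in_results:
        job.update(status="active", lifecycle_status="active",
                   last_seen=now.isoformat(),
                   first_missing_at=None, consecutive_missing_runs=0)
        return job

    closes = job.get("closes_iso")
    # Timestamps are compared as naive datetimes here for simplicity.
    if closes and datetime.fromisoformat(closes) <= now:
        job.update(status="closed", lifecycle_status="closed")
        return job

    # Gone before its closing date: count the miss, but do not close yet.
    job["first_missing_at"] = job.get("first_missing_at") or now.isoformat()
    job["consecutive_missing_runs"] = job.get("consecutive_missing_runs", 0) + 1
    job["lifecycle_status"] = "missing_unconfirmed"

    # url_live is the result of a direct URL check (None means not checked).
    if job["consecutive_missing_runs"] >= WITHDRAW_AFTER_RUNS and url_live is False:
        job.update(status="closed", lifecycle_status="withdrawn_confirmed")
    return job
```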
Key rules:
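- Missing-run tracking only counts full unfiltered runs (no --limit, --grade, --department filters)
- A job that vanishes from results before its closing date becomes missing_unconfirmed (not immediately closed)
- After repeated missing runs and a direct URL check, it becomes withdrawn_confirmed
- If a job reappears, a reopened event is emitted
- Job files stay in csj_jobs/ regardless of status: filter by status field, not file presence

Every job has: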
- content_hash: a stable hash of the full normalized record
- field_hashes: per-field hashes for granular change detection

When a re-scrape produces a different content_hash, the collector:
- Saves a snapshot to csj_history/{reference}/
- Emits a field_changed event with changed_fields, old_values, new_values
- Updates last_changed_at on the current job

Each csj_history/{reference}/ directory contains immutable JSON snapshots of every meaningful version of the job. Comparing snapshots reveals what changed between versions.
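
One way to compute stable hashes and derive changed_fields is sketched below. The field list COMPARABLE_FIELDS, the use of SHA-256 over canonical JSON, and the function names are all assumptions; the source only says the hashes are stable and cover a normalized record.

```python
import hashlib
import json
from typing import Dict, List, Tuple

# Assumption: an illustrative subset of the fields that count as meaningful content.
COMPARABLE_FIELDS = ("title", "salary", "closes", "job_description")


def stable_hash(value) -> str:
    """Hash a JSON-serialisable value in a form independent of key order."""
    canonical = json.dumps(value, sort_keys=True, ensure_ascii=False,
                           separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def fingerprint(job: dict) -> Tuple[str, Dict[str, str]]:
    """Return (content_hash, field_hashes) for the comparable part of a job."""
    comparable = {f: job.get(f) for f in COMPARABLE_FIELDS}
    field_hashes = {f: stable_hash(v) for f, v in comparable.items()}
    return stable_hash(comparable), field_hashes


def changed_fields(old: Dict[str, str], new: Dict[str, str]) -> List[str]:
    """List the fields whose hashes differ between two field_hashes dicts."""
    return sorted(f for f in old.keys() | new.keys() if old.get(f) != new.get(f))
```

Volatile fields such as scraped_at are left out of the comparable set, so a plain refresh does not register as a content change.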
csj_events.jsonl is an append-only log recording:
| Event Type | Meaning |
|---|---|
| first_seen | Job first captured by collector |
| refreshed | Job detail re-fetched, still active, no meaningful changes |
| field_changed | Meaningful content changed between scrapes |
| missing_from_results | Job vanished from search results |
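
Appending to and reading from a JSONL event log like this one might look as follows. The record keys timestamp, event_type and reference, and both function names, are hypothetical; the actual event schema is not shown in this document.

```python
import json
from datetime import datetime, timezone
from pathlib import Path
from typing import Iterator, Optional


def append_event(log_path: Path, event_type: str, reference: str, **details) -> dict:
    """Append one event as a single JSON line; existing lines are never rewritten."""
    event = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "event_type": event_type,
        "reference": reference,
        **details,
    }
    with log_path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(event, ensure_ascii=False) + "\n")
    return event


def read_events(log_path: Path, event_type: Optional[str] = None) -> Iterator[dict]:
    """Yield events in file order, optionally filtered by type."""
    with log_path.open(encoding="utf-8") as fh:
        for line in fh:
            if line.strip():
                event = json.loads(line)
                if event_type is None or event["event_type"] == event_type:
                    yield event
```

One JSON object per line keeps writes cheap and lets the log be filtered with ordinary line-oriented tools.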
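- Attachments are stored in csj_attachments/_pool/ using content-addressed naming ({sha256}.ext)
- Each pooled file gets a .md Markdown conversion alongside it
- YouTube transcripts are fetched with youtube-transcript-api and saved as plain text (.txt) and timestamped Markdown (.md)

Each csj_asset_manifests/{reference}.json tracks: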
- first_seen_at, last_seen_at, status

csj_asset_history/{reference}/ preserves immutable per-asset metadata versions.

csj_asset_events.jsonl records: asset_added, asset_changed, asset_removed, transcript_added, transcript_changed, transcript_unavailable.

Each job gets an archive_completeness rating:
- complete: all attachments and transcripts captured
- partial_missing_assets: some attachments failed to download
- partial_failed_transcripts: some YouTube transcripts unavailable (often due to cloud IP blocking)
- no_auxiliary_assets: job had no attachments or embeds to capture

| Metric | Count |
|---|---|
| Total job references captured | ~1,639 |
| Currently active | ~50% |
| Closed | ~50% |
| Unique departments | 138 |
| Unique raw grade strings | 150+ (mappable to ~10 canonical grades) |
| Lifecycle events | ~3,000+ |
| Jobs with attachments | ~244 |
| Jobs with supporting links | ~428 |
| Jobs with embedded video | ~43 |