CSJ v2.3 Phase 4 Checkpoint
Updated: 2026-04-13
Completed in this phase
- Added asset classification helper: classify_asset_url(url, text='')
- Added asset extraction helper: extract_supporting_assets_from_html(html)
- Added structured asset fields to normalized records:
  - supporting_links
  - attachments
  - embeds
- Added those asset fields into CRITICAL_FIELDS so meaningful collateral changes can generate history/events
- Native detail fetch now parses supporting collateral directly from detail-page HTML and stores it in the raw record before normalization
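A minimal sketch of how the two helpers named above could fit together, using only the standard library. This is not the project's implementation: the host lists, the "video", "pdf", "external_link" and "other" labels, and the shape of each stored item are assumptions made for illustration. Only the function names, the three field names, and the internal_link and pdf_candidate_pack labels come from these notes. The sketch keeps every anchor it finds, including internal ones, so any filtering is left to the caller.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

# Assumed host lists, for illustration only.
INTERNAL_HOSTS = {"www.civilservicejobs.service.gov.uk", "civilservicejobs.service.gov.uk"}
VIDEO_HOSTS = {"youtube.com", "www.youtube.com", "youtu.be", "vimeo.com", "player.vimeo.com"}


def classify_asset_url(url, text=""):
    """Return a coarse label for a URL (label set is partly hypothetical)."""
    parsed = urlparse(url)
    if parsed.scheme not in ("", "http", "https"):
        return "other"  # e.g. mailto: or tel:
    host = (parsed.hostname or "").lower()
    path = parsed.path.lower()
    if not host or host in INTERNAL_HOSTS:
        return "internal_link"  # relative URLs are treated as internal
    if host in VIDEO_HOSTS:
        return "video"
    if path.endswith(".pdf"):
        haystack = path + " " + text.lower()
        return "pdf_candidate_pack" if "candidate" in haystack else "pdf"
    return "external_link"


class _AssetParser(HTMLParser):
    """Collects <a href> and <iframe src> targets into the three asset fields."""

    def __init__(self):
        super().__init__()
        self.assets = {"supporting_links": [], "attachments": [], "embeds": []}
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self._href, self._text = attrs["href"], []
        elif tag == "iframe" and attrs.get("src"):
            src = attrs["src"]
            self.assets["embeds"].append({"url": src, "kind": classify_asset_url(src)})

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            text = " ".join("".join(self._text).split())  # collapse whitespace
            kind = classify_asset_url(self._href, text)
            bucket = "attachments" if kind.startswith("pdf") else "supporting_links"
            self.assets[bucket].append({"url": self._href, "text": text, "kind": kind})
            self._href = None


def extract_supporting_assets_from_html(html):
    parser = _AssetParser()
    parser.feed(html)
    parser.close()
    return parser.assets


sample = """
<a href="/csr/index.cgi">Home</a>
<a href="https://www.youtube.com/watch?v=abc">Watch our video</a>
<a href="https://example.org/files/candidate-pack.pdf">Candidate pack</a>
<iframe src="https://player.vimeo.com/video/123"></iframe>
"""
assets = extract_supporting_assets_from_html(sample)
for field, items in assets.items():
    print(field, [item["kind"] for item in items])
```

Routing PDFs to attachments and iframes to embeds is one plausible mapping onto the three fields; the real helper may split them differently.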
Noise reduction refinement made during validation
- Initially the extractor also kept CSJ internal links, which produced too much obvious navigation noise.
- Refined extraction to skip internal_link items so the stored asset set is focused on non-CSJ destinations and attachments.
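The refinement above can be sketched as a filter over already-classified items. This assumes each extracted item is a dict carrying its classification under a "kind" key; that key name and the function name drop_internal_links are hypothetical and may not match the real record shape.

```python
def drop_internal_links(assets):
    """Return a copy of the asset dict without items labelled internal_link.

    Assumes `assets` maps field names (supporting_links, attachments, embeds)
    to lists of dicts, each with a hypothetical "kind" key.
    """
    return {
        field: [item for item in items if item.get("kind") != "internal_link"]
        for field, items in assets.items()
    }


example = {
    "supporting_links": [
        {"url": "/csr/index.cgi", "kind": "internal_link"},
        {"url": "https://www.youtube.com/watch?v=abc", "kind": "video"},
    ],
    "attachments": [{"url": "https://example.org/pack.pdf", "kind": "pdf_candidate_pack"}],
    "embeds": [],
}
filtered = drop_internal_links(example)
print(len(filtered["supporting_links"]))
```

Filtering after classification keeps the skip rule in one place; the real extractor may instead skip these items while parsing.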
Validation completed
- Synthetic validation:
  - confirmed extractor correctly identifies:
    - YouTube links
    - PDF candidate pack links
    - Vimeo iframe embeds
- Live SCS validation:
  - fetched first five SCS role detail pages directly via NativeCollector
  - observed collateral extraction on live senior roles
  - one sampled role (452781) included a pdf_candidate_pack attachment
  - no YouTube/video embeds were observed in the first ten SCS roles checked live
- Live write-path validation:
  - ran:
    python3 ~/.hermes/skills/research/civil-service-jobs-collector/scripts/collector.py -g SCS --details -w 1 -n 1 --full --force
  - verified saved job file includes populated asset fields
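The saved-file check in the write-path validation could be scripted roughly as follows. This assumes the saved job file is a single JSON object with the three asset fields at the top level; the file format, layout, and both function names are assumptions, not confirmed by these notes.

```python
import json

ASSET_FIELDS = ("supporting_links", "attachments", "embeds")


def populated_asset_fields(record):
    """Return the asset field names that are present and non-empty."""
    return [field for field in ASSET_FIELDS if record.get(field)]


def check_saved_job(path):
    """Load a saved job file (assumed to be JSON) and report populated asset fields."""
    with open(path, encoding="utf-8") as handle:
        record = json.load(handle)
    return populated_asset_fields(record)
```

Calling check_saved_job on a saved job path returns a list such as the names of the fields that carry at least one item; an empty list means no collateral was stored for that role.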
Observed live behavior / caveat
- Senior roles do include meaningful outbound collateral, but many extracted links are still broad campaign/support links (e.g. GOV.UK guidance, pensions, Success Profiles, webinar/Teams links) rather than only bespoke campaign collateral.
- This is useful for the historical record, but the extraction is not yet refined enough if the goal is to store only the most role-specific supporting material.
- A future refinement could add allow/deny heuristics to down-rank or exclude boilerplate domains/pages while preserving candidate packs, webinar links, external campaign sites, and media.
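One possible shape for the allow/deny heuristic described above, as a sketch only. The deny hosts, the allowed kinds, the scoring values, and the "url"/"kind" item keys are all hypothetical; a real list would need to come from the boilerplate actually observed in stored records.

```python
from urllib.parse import urlparse

# Hypothetical lists, for illustration only.
DENY_HOSTS = {"www.gov.uk", "www.civilservicepensionscheme.org.uk"}
ALLOW_KINDS = {"pdf_candidate_pack", "video", "webinar"}


def score_asset(item):
    """Return 2 for always-keep kinds, 0 for boilerplate hosts, 1 otherwise."""
    if item.get("kind") in ALLOW_KINDS:
        return 2  # allow list wins, even on a denied host
    host = (urlparse(item["url"]).hostname or "").lower()
    if host in DENY_HOSTS:
        return 0
    return 1


def rank_assets(items, drop_boilerplate=False):
    """Order items from most to least role-specific; optionally drop score-0 items."""
    scored = [(score_asset(item), item) for item in items]
    if drop_boilerplate:
        scored = [pair for pair in scored if pair[0] > 0]
    # sorted() is stable, so items with equal scores keep their original order.
    return [item for _, item in sorted(scored, key=lambda pair: pair[0], reverse=True)]


items = [
    {"url": "https://www.gov.uk/guidance/example", "kind": "external_link"},
    {"url": "https://example.org/pack.pdf", "kind": "pdf_candidate_pack"},
    {"url": "https://campaign.example.com/", "kind": "external_link"},
]
for item in rank_assets(items):
    print(score_asset(item), item["url"])
```

Checking the allow list before the deny list means a candidate pack hosted on a denied domain is still preserved. Down-ranking rather than deleting keeps the historical record intact.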
Current status after Phase 4
Done:
- historical versioning foundations
- event logging foundations
- lifecycle classification foundations
- supporting asset extraction foundations
Still to do:
- fixture-based pytest coverage for:
  - asset extraction
  - lifecycle transitions
  - versioning behavior
- optional refinement to reduce generic/supporting-link noise
- live validation of explicit withdrawn/reopened scenarios if/when they occur naturally or are simulated