CSJ v2.3 Phase 4 Checkpoint

Updated: 2026-04-13

Completed in this phase

Added asset classification helper:
- classify_asset_url(url, text='')
Added asset extraction helper:
- extract_supporting_assets_from_html(html)
Added structured asset fields to normalized records:
- supporting_links
- attachments
- embeds
Added those asset fields into CRITICAL_FIELDS so meaningful collateral changes can generate history/events
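
To illustrate why the asset fields belong in CRITICAL_FIELDS, here is a minimal sketch of change detection over two normalized records. Only the CRITICAL_FIELDS name and the three asset field names come from this checkpoint. The "title" entry, the changed_critical_fields helper, and the record shape are assumptions for illustration, not the collector's actual code.

```python
# Hypothetical sketch: the real CRITICAL_FIELDS contents and event logic may differ.
CRITICAL_FIELDS = (
    "title",             # assumed pre-existing critical field, illustration only
    "supporting_links",  # asset fields added in this phase
    "attachments",
    "embeds",
)


def _comparable(value):
    """Sort list-valued fields so a pure reordering does not count as a change."""
    if isinstance(value, list):
        return sorted(value, key=repr)
    return value


def changed_critical_fields(old, new):
    """Return the critical field names whose values differ between two records."""
    return [
        name
        for name in CRITICAL_FIELDS
        if _comparable(old.get(name)) != _comparable(new.get(name))
    ]
```

Under these assumptions, a record that gains a candidate pack between two collections would report "attachments" as changed, which is the kind of difference a history/event entry could be built from.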
Native detail fetch now parses supporting collateral directly from detail-page HTML and stores it in the raw record before normalization
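
The two helpers can be sketched roughly as below, using only the standard library. The function names and the internal_link and pdf_candidate_pack labels appear in this checkpoint. Everything else (the host lists, the "video", "pdf", "external_link" and "other" labels, the item shape, and the parser class) is an assumption, so treat this as an illustration of the approach rather than the shipped implementation. Internal links are skipped, in line with the refinement described under "Noise reduction refinement made during validation".

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

# Assumed host lists; the real helper's rules are not shown in this checkpoint.
CSJ_HOSTS = {"www.civilservicejobs.service.gov.uk", "civilservicejobs.service.gov.uk"}
VIDEO_HOSTS = {"youtube.com", "www.youtube.com", "youtu.be", "vimeo.com", "player.vimeo.com"}


def classify_asset_url(url, text=''):
    """Return a coarse kind label for a URL found on a detail page."""
    parsed = urlparse(url)
    if parsed.scheme and parsed.scheme not in ("http", "https"):
        return "other"  # e.g. mailto: or tel: links
    host = (parsed.hostname or "").lower()
    path = parsed.path.lower()
    if not host or host in CSJ_HOSTS:
        return "internal_link"  # relative URLs and CSJ hosts count as internal
    if host in VIDEO_HOSTS:
        return "video"
    if path.endswith(".pdf"):
        hint = path + " " + text.lower()
        return "pdf_candidate_pack" if "candidate" in hint else "pdf"
    return "external_link"


class _AssetParser(HTMLParser):
    """Collect (tag, url, link_text) tuples for <a href> and <iframe src>."""

    def __init__(self):
        super().__init__()
        self.items = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self._href, self._text = attrs["href"], []
        elif tag == "iframe" and attrs.get("src"):
            self.items.append(("iframe", attrs["src"], ""))

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.items.append(("a", self._href, "".join(self._text).strip()))
            self._href = None


def extract_supporting_assets_from_html(html):
    """Group non-internal assets into the three structured record fields."""
    parser = _AssetParser()
    parser.feed(html)
    parser.close()
    assets = {"supporting_links": [], "attachments": [], "embeds": []}
    for tag, url, text in parser.items:
        kind = classify_asset_url(url, text)
        if kind == "internal_link":
            continue  # skip CSJ navigation noise
        item = {"url": url, "kind": kind, "text": text}
        if tag == "iframe":
            assets["embeds"].append(item)
        elif kind in ("pdf", "pdf_candidate_pack"):
            assets["attachments"].append(item)
        else:
            assets["supporting_links"].append(item)
    return assets
```

In this sketch an iframe always lands in embeds regardless of its host, and any PDF lands in attachments. The real helpers may split these differently.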

Noise reduction refinement made during validation

Initially the extractor also kept CSJ internal links, which produced too much obvious navigation noise.
Refined extraction to skip internal_link items so the stored asset set is focused on non-CSJ destinations and attachments.

Validation completed

Synthetic validation:
- confirmed extractor correctly identifies:
  - YouTube links
  - PDF candidate pack links
  - Vimeo iframe embeds
Live SCS validation:
- fetched first five SCS role detail pages directly via NativeCollector
- observed collateral extraction on live senior roles
- one sampled role (452781) included a pdf_candidate_pack attachment
- no YouTube/video embeds were observed in the first ten SCS roles checked live
Live write-path validation:
- ran: python3 ~/.hermes/skills/research/civil-service-jobs-collector/scripts/collector.py -g SCS --details -w 1 -n 1 --full --force
- verified saved job file includes populated asset fields
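
The saved-file check above could be scripted along these lines. This sketch assumes the saved job file is a JSON document with the asset fields at the top level. That layout and the populated_asset_fields helper name are assumptions, not something this checkpoint confirms.

```python
import json
from pathlib import Path

ASSET_FIELDS = ("supporting_links", "attachments", "embeds")


def populated_asset_fields(job_path):
    """Return the asset fields that are present and non-empty in a saved job file."""
    # Assumes a top-level JSON object; adjust if the record is nested.
    record = json.loads(Path(job_path).read_text(encoding="utf-8"))
    return [name for name in ASSET_FIELDS if record.get(name)]
```

An empty result for a role known to carry collateral would suggest the write path dropped the fields somewhere between the raw record and the saved file.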

Observed live behavior / caveat

Senior roles do include meaningful outbound collateral, but many extracted links are still broad campaign/support links (e.g. GOV.UK guidance, pensions, Success Profiles, webinar/Teams links) rather than only bespoke campaign collateral.
This is useful as a historical record, but the stored set is not yet narrow enough if the goal is to keep only the most role-specific supporting material.
A future refinement could add allow/deny heuristics to down-rank or exclude boilerplate domains/pages while preserving candidate packs, webinar links, external campaign sites, and media.
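
One possible shape for such allow/deny heuristics is sketched below. The host lists, the kind labels other than pdf_candidate_pack, the scoring values, and the function names are all hypothetical. A real deny list would need to be built from the boilerplate destinations actually observed in live data.

```python
from urllib.parse import urlparse

# Hypothetical lists for illustration only.
BOILERPLATE_HOSTS = {"www.gov.uk", "www.civilservicepensionscheme.org.uk"}
WEBINAR_HOSTS = {"teams.microsoft.com"}
ALWAYS_KEEP_KINDS = {"pdf_candidate_pack", "video"}


def score_asset(item):
    """Return a relevance score; higher means more role-specific."""
    host = (urlparse(item["url"]).hostname or "").lower()
    if item.get("kind") in ALWAYS_KEEP_KINDS or host in WEBINAR_HOSTS:
        return 2  # allow list wins over the deny list
    if host in BOILERPLATE_HOSTS:
        return 0  # generic guidance, pensions, and similar pages
    return 1      # unknown external sites, e.g. campaign microsites


def refine_assets(items, min_score=1):
    """Rank assets by score and drop those below min_score."""
    scored = [(score_asset(item), item) for item in items]
    # sorted() is stable, so items keep their page order within each score
    ranked = sorted(scored, key=lambda pair: -pair[0])
    return [item for score, item in ranked if score >= min_score]
```

Calling refine_assets(items, min_score=0) only down-ranks boilerplate instead of excluding it, which keeps the historical record intact while still surfacing role-specific material first.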

Current status after Phase 4 Done:

Still to do

fixture-based pytest coverage for:
- asset extraction
- lifecycle transitions
- versioning behavior
optional refinement to reduce generic/supporting-link noise
live validation of explicit withdrawn/reopened scenarios if/when they occur naturally or are simulated