How a solo archival system gets built in public
For Hermes: Use the subagent-driven-development skill to implement this plan task by task.
Goal: Make attachments and YouTube transcripts first-class archival assets, managed with the same rigor as role description changes.
Architecture: Separate asset capture/versioning from main job-detail parsing. The final job JSON should reference stable asset metadata rather than volatile local filenames. Asset content hashes and stable identifiers must be computed before the final job record is hashed and saved, so job history reflects the actual saved record.
Tech Stack: Python collector at ~/.hermes/skills/research/civil-service-jobs-collector/scripts/collector.py, markitdown, youtube-transcript-api, JSON manifests/events, file hashing via hashlib.
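The ordering constraint in the Architecture note can be sketched as follows. This is a minimal illustration, not the collector's actual code: `finalize_job`, `canonical_hash`, and the `assets` key are hypothetical names, and the canonical JSON encoding is an assumption about how the job record might be hashed.

```python
import hashlib
import json


def sha256_bytes(data: bytes) -> str:
    """Return the hex SHA-256 digest of raw bytes."""
    return hashlib.sha256(data).hexdigest()


def canonical_hash(record: dict) -> str:
    """Hash a dict through a canonical JSON encoding (sorted keys, fixed separators)."""
    encoded = json.dumps(record, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return sha256_bytes(encoded.encode("utf-8"))


def finalize_job(job: dict, asset_payloads: dict[str, bytes]) -> dict:
    """Attach asset references first, then hash the finished record."""
    job = dict(job)
    # Step 1: asset content hashes are computed before the job record is hashed.
    job["assets"] = [
        {"logical_name": name, "content_hash": sha256_bytes(data)}
        for name, data in sorted(asset_payloads.items())
    ]
    # Step 2: the job hash covers the record as it will be saved (minus the hash itself).
    job["content_hash"] = canonical_hash(
        {k: v for k, v in job.items() if k != "content_hash"}
    )
    return job
```

Because the hash is taken last, recomputing it from the saved record (excluding `content_hash`) should reproduce the stored value.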
The current collector can now:
But auxiliary assets are not yet managed to the same archival standard as role descriptions because:
Objective: Ensure the saved job JSON matches its stored content_hash, field_hashes, and last_changed_at.
Files:
~/.hermes/skills/research/civil-service-jobs-collector/scripts/collector.py
Implementation:
fetch_one() so the flow is:
normalize_job() into:
Verification:
Objective: Prevent false-positive history events caused by filename/path churn.
Files:
~/.hermes/skills/research/civil-service-jobs-collector/scripts/collector.py
Implementation:
local_path
local_filename
markdown_path
transcript_path
transcript_md_path
download_error
transcript_error
Verification:
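One way to keep filename and path churn out of history events is to drop these fields before computing the comparable hash. The sketch below assumes that reading of the task; `strip_volatile` and `comparable_hash` are hypothetical helper names, not existing collector functions.

```python
import hashlib
import json

# Field names taken from this task's list; assumed to be excluded from comparison.
VOLATILE_FIELDS = frozenset({
    "local_path", "local_filename", "markdown_path", "transcript_path",
    "transcript_md_path", "download_error", "transcript_error",
})


def strip_volatile(value):
    """Recursively drop volatile keys from nested dicts and lists."""
    if isinstance(value, dict):
        return {k: strip_volatile(v) for k, v in value.items() if k not in VOLATILE_FIELDS}
    if isinstance(value, list):
        return [strip_volatile(v) for v in value]
    return value


def comparable_hash(record: dict) -> str:
    """Hash only the stable part of a record."""
    encoded = json.dumps(strip_volatile(record), sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(encoded.encode("utf-8")).hexdigest()
```

Two runs that save the same attachment under different local filenames then produce the same comparable hash, so no history event is emitted.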
Objective: Give attachments/transcripts stable identities based on actual content.
Files:
~/.hermes/skills/research/civil-service-jobs-collector/scripts/collector.py
Implementation:
content_hash (sha256)
byte_size
media_type
fetched_at
markdown_content_hash
transcript_content_hash
transcript_md_content_hash
transcript_length
video_duration
transcript_fetched_at
Verification:
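The first four fields can be derived from the downloaded bytes alone. A minimal sketch, assuming the media type is guessed from the filename when no response header is available; `describe_asset` is a hypothetical helper name.

```python
import hashlib
import mimetypes
from datetime import datetime, timezone
from typing import Optional


def describe_asset(data: bytes, filename: str, now: Optional[datetime] = None) -> dict:
    """Build content-derived metadata for one downloaded asset."""
    # Guessing from the filename is an assumption; a Content-Type header may be better.
    media_type, _encoding = mimetypes.guess_type(filename)
    fetched = now or datetime.now(timezone.utc)
    return {
        "content_hash": hashlib.sha256(data).hexdigest(),
        "byte_size": len(data),
        "media_type": media_type or "application/octet-stream",
        "fetched_at": fetched.isoformat(),
    }
```

Only `content_hash`, `byte_size`, and `media_type` are stable across reruns; `fetched_at` changes every time and so should stay out of any comparable hash.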
Objective: Stop duplicate files and synthetic changes on reruns.
Files:
~/.hermes/skills/research/civil-service-jobs-collector/scripts/collector.py
Implementation:
_1, _2, etc.) with content-based identity.
csj_attachments/{reference}/{slug}__{short_hash}.ext
Verification:
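The naming pattern above could be produced along these lines. The slug rules, the 12-character hash prefix, and the `bin` fallback extension are assumptions made for illustration; `asset_relpath` and `slugify` are hypothetical names.

```python
import hashlib
import re
from pathlib import PurePosixPath


def slugify(name: str) -> str:
    """Lowercase the name and collapse runs of non-alphanumerics to single hyphens."""
    slug = re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-")
    return slug or "asset"


def asset_relpath(reference: str, original_name: str, data: bytes, hash_len: int = 12) -> str:
    """Return csj_attachments/{reference}/{slug}__{short_hash}.ext for the given bytes."""
    stem, dot, ext = original_name.rpartition(".")
    if not dot:
        # No extension in the original name: use a generic one (assumption).
        stem, ext = original_name, "bin"
    short_hash = hashlib.sha256(data).hexdigest()[:hash_len]
    filename = f"{slugify(stem)}__{short_hash}.{ext.lower()}"
    return str(PurePosixPath("csj_attachments") / reference / filename)
```

The same bytes always map to the same path, so a rerun overwrites or skips the existing file instead of creating a numbered duplicate.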
Objective: Separate stable asset metadata from transient job JSON details.
Files:
~/.hermes/skills/research/civil-service-jobs-collector/scripts/collector.py
~/.hermes/workspace/csj/csj_asset_manifests/
Implementation:
Create csj_asset_manifests/{reference}.json with entries containing:
asset_id
reference
asset_type
logical_name
source_url_raw
source_url_normalized
content_hash
media_type
byte_size
first_seen_at
last_seen_at
fetched_at
status
derived_from_asset_id
local_path
Verification:
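A manifest update has to preserve `first_seen_at` while refreshing `last_seen_at`. The sketch below assumes the manifest is a dict with an `assets` mapping keyed by `asset_id`; that layout and the `upsert_asset` name are hypothetical.

```python
def upsert_asset(manifest: dict, entry: dict, now: str) -> dict:
    """Insert a new manifest entry or refresh an existing one, keyed by asset_id."""
    assets = manifest.setdefault("assets", {})
    existing = assets.get(entry["asset_id"])
    if existing is None:
        # First sighting: both timestamps start at the current run time.
        assets[entry["asset_id"]] = {**entry, "first_seen_at": now, "last_seen_at": now}
    else:
        # Later sighting: keep the original first_seen_at, move last_seen_at forward.
        first_seen = existing["first_seen_at"]
        existing.update(entry)
        existing["first_seen_at"] = first_seen
        existing["last_seen_at"] = now
    return manifest
```

The resulting dict can be written to `csj_asset_manifests/{reference}.json` with `json.dump`.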
Objective: Track asset changes explicitly.
Files:
collector.py
~/.hermes/workspace/csj/csj_asset_events.jsonl
~/.hermes/workspace/csj/csj_asset_history/{reference}/
Implementation: Emit events:
asset_added
asset_removed
asset_changed
asset_download_failed
transcript_added
transcript_changed
transcript_unavailable
Store immutable per-version asset metadata snapshots.
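The three core events can be derived by comparing the previous and current `{asset_id: content_hash}` maps. This sketch assumes `asset_id` stays stable when content changes (for example, because it is derived from the source URL rather than the bytes); `diff_assets` and the `previous_content_hash` key are hypothetical.

```python
def diff_assets(previous: dict[str, str], current: dict[str, str]) -> list[dict]:
    """Compare {asset_id: content_hash} maps and return asset events."""
    events = []
    for asset_id in sorted(current.keys() - previous.keys()):
        events.append({"event": "asset_added", "asset_id": asset_id,
                       "content_hash": current[asset_id]})
    for asset_id in sorted(previous.keys() - current.keys()):
        events.append({"event": "asset_removed", "asset_id": asset_id,
                       "content_hash": previous[asset_id]})
    for asset_id in sorted(previous.keys() & current.keys()):
        if previous[asset_id] != current[asset_id]:
            events.append({"event": "asset_changed", "asset_id": asset_id,
                           "previous_content_hash": previous[asset_id],
                           "content_hash": current[asset_id]})
    return events
```

Each returned dict can be serialized with `json.dumps` and appended as one line to `csj_asset_events.jsonl`. An unchanged asset produces no event.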
Verification:
asset_changed event and preserved previous version
Objective: Allow full reconstruction of historical advert state.
Files:
collector.py
Implementation: Each job history snapshot should record the asset IDs + content hashes active at that time. Do not rely on mutable current local paths.
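A snapshot can carry just the two stable fields per asset and nothing path-related. A minimal sketch, assuming the caller passes the manifest entries that are active at snapshot time; `snapshot_asset_refs` is a hypothetical helper name.

```python
def snapshot_asset_refs(active_entries: list[dict]) -> list[dict]:
    """Project manifest entries down to stable references, sorted for determinism."""
    refs = [
        {"asset_id": entry["asset_id"], "content_hash": entry["content_hash"]}
        for entry in active_entries
    ]
    # Sorting keeps the snapshot identical regardless of discovery order.
    return sorted(refs, key=lambda ref: ref["asset_id"])
```

Because `local_path` and similar fields are never copied, moving or renaming stored files later does not invalidate old snapshots.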
Verification:
Objective: Give closed jobs an explicit archive state.
Files:
collector.py
Implementation:
Add archive_completeness, e.g.:
complete
partial_missing_assets
partial_failed_transcripts
no_auxiliary_assets
On closure:
Verification:
Objective: Reduce noisy supporting_links and accidental captures from page chrome/footer.
Files:
collector.py
Implementation:
Verification:
Objective: Make transcript records more useful and stable.
Files:
collector.py
Implementation:
here
Verification:
Objective: Avoid silent archive gaps.
Files:
collector.py
Implementation:
If markitdown or youtube-transcript-api is missing:
Verification:
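Detecting a missing optional dependency up front lets the collector record an explicit reason instead of silently skipping the asset. The sketch below uses `importlib.util.find_spec`; the `missing_dependencies` helper is hypothetical, and the mapping assumes the usual import names `markitdown` and `youtube_transcript_api`.

```python
import importlib.util
from typing import Optional

# Distribution name -> import name (assumed to be the standard module names).
OPTIONAL_DEPENDENCIES = {
    "markitdown": "markitdown",
    "youtube-transcript-api": "youtube_transcript_api",
}


def missing_dependencies(dependencies: Optional[dict] = None) -> list:
    """Return the distribution names whose modules cannot be imported."""
    deps = OPTIONAL_DEPENDENCIES if dependencies is None else dependencies
    return sorted(
        package
        for package, module in deps.items()
        if importlib.util.find_spec(module) is None
    )
```

A caller could store the returned names on the affected asset record so the gap is visible in the archive rather than hidden in a log.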
Objective: Treat asset freshness separately from role-text freshness.
Files:
collector.py
Implementation: Future flags:
--refresh-assets
--retry-failed-assets
--backfill-assets
Use these to:
Verification:
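The three future flags could be declared with `argparse` as plain boolean switches. This is a sketch only: the help strings are placeholders, `build_parser` is a hypothetical name, and the existing collector options are not reproduced here.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the planned asset flags as independent boolean switches."""
    parser = argparse.ArgumentParser(prog="collector.py")
    # Help texts are placeholders; the plan does not yet fix exact behaviour.
    parser.add_argument("--refresh-assets", action="store_true",
                        help="planned: refresh auxiliary assets")
    parser.add_argument("--retry-failed-assets", action="store_true",
                        help="planned: retry assets whose last fetch failed")
    parser.add_argument("--backfill-assets", action="store_true",
                        help="planned: backfill assets for already-saved jobs")
    return parser
```

Keeping these separate from the existing role-text options means asset freshness can be controlled without forcing a re-parse of the role description.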
asset_manifest_path
archive_completeness
asset_id
asset_type
logical_name
source_url_raw
source_url_normalized
content_hash
media_type
byte_size
first_seen_at
last_seen_at
fetched_at
status
derived_from_asset_id
local_path
Design rule: job-level comparable hashes must use stable asset identities and content hashes, not volatile local file paths.
The implementation is in a good state when:
Use these after each phase:
python3 -m py_compile ~/.hermes/skills/research/civil-service-jobs-collector/scripts/collector.py
Known useful test references:
457480 - contact fix case
453120 - attachment + contact
453156 - PDF + contact
454928 - two attachments + YouTube transcript
Suggested manual verification runs:
python3 ~/.hermes/skills/research/civil-service-jobs-collector/scripts/collector.py --details -n 1 --full --force
python3 ~/.hermes/skills/research/civil-service-jobs-collector/scripts/collector.py --details -n 1 --refresh
Expected outcomes: