Building The Longhand Archive

How a solo archival system gets built in public

CSJ Auxiliary Asset Archival Implementation Plan

For Hermes: Use the subagent-driven-development skill to implement this plan task by task.

Goal: Make attachments and YouTube transcripts first-class archival assets, managed with the same rigor as role description changes.

Architecture: Separate asset capture/versioning from main job-detail parsing. The final job JSON should reference stable asset metadata rather than volatile local filenames. Asset content hashes and stable identifiers must be computed before the final job record is hashed and saved, so job history reflects the actual saved record.

Tech Stack: Python collector at ~/.hermes/skills/research/civil-service-jobs-collector/scripts/collector.py, markitdown, youtube-transcript-api, JSON manifests/events, file hashing via hashlib.


Problem Summary

The current collector can now:

But auxiliary assets are not yet managed to the same archival standard as role descriptions because:


Phase 1 — Make the Current Model Internally Correct

Task 1: Move asset enrichment before final hashing

Objective: Ensure the saved job JSON matches its stored content_hash, field_hashes, and last_changed_at.

Files:

Implementation:

  1. Refactor fetch_one() so the flow is:
    • fetch detail page
    • parse raw fields
    • merge listing fields
    • enrich raw record with downloaded attachments/transcripts
    • only then normalize/hash/diff
    • save final record
  2. If needed, split normalize_job() into:
    • pre-hash cleanup/normalization
    • post-enrichment finalization + hashing
  3. Ensure no enrichment happens after the final comparable record is produced.
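
The ordering above can be sketched as follows. This is a minimal illustration, not the collector's actual code: build_final_record, canonical_hash, verify, and the enrich callback are hypothetical names, and the rule that detail-page fields win over listing fields is an assumption.

```python
import hashlib
import json


def canonical_hash(record: dict) -> str:
    """Hash a record via a canonical (sorted-key) JSON encoding."""
    payload = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def build_final_record(raw: dict, listing: dict, enrich) -> dict:
    """Hypothetical ordering for fetch_one(): enrich first, hash last."""
    # Merge listing fields; detail-page fields win on conflict (assumption).
    record = {**listing, **raw}
    # Attachments/transcripts are attached before any hashing happens.
    record = enrich(record)
    # The hash is computed over the enriched record, then stored on it.
    record["content_hash"] = canonical_hash(record)
    return record


def verify(record: dict) -> bool:
    """Check that a saved record still matches its stored content_hash."""
    body = {k: v for k, v in record.items() if k != "content_hash"}
    return canonical_hash(body) == record["content_hash"]
```

A check like verify() could serve as the verification step for this task: it should return True for every freshly saved job JSON, and False if anything mutates the record after hashing.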

Verification:

Task 2: Stop using volatile local paths as change-significant identity

Objective: Prevent false-positive history events caused by filename/path churn.

Files:

Implementation:

  1. Define volatile asset fields that should not drive job-level meaningful change:
    • local_path
    • local_filename
    • markdown_path
    • transcript_path
    • transcript_md_path
    • download_error
    • transcript_error
  2. Update the comparable-record hashing logic to ignore these fields.
  3. Compare stable asset fields instead:
    • normalized source URL
    • logical title/name
    • content hash
    • media type
    • byte size
    • asset type/category
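
A minimal sketch of hashing a comparable view of the job record, with the volatile fields listed above stripped out. The function names and the "attachments" key are assumptions for illustration; sorting the assets so that list order does not affect the hash is an extra design choice, not something the plan requires.

```python
import hashlib
import json

# Field names taken from the list above.
VOLATILE_ASSET_FIELDS = {
    "local_path", "local_filename", "markdown_path", "transcript_path",
    "transcript_md_path", "download_error", "transcript_error",
}


def comparable_asset(asset: dict) -> dict:
    """Return a copy of the asset without volatile, path-like fields."""
    return {k: v for k, v in asset.items() if k not in VOLATILE_ASSET_FIELDS}


def comparable_hash(job: dict, asset_key: str = "attachments") -> str:
    """Hash the job using only stable asset fields (asset_key is hypothetical)."""
    view = dict(job)
    assets = [comparable_asset(a) for a in job.get(asset_key, [])]
    # Sort by canonical JSON so reordering assets does not look like a change.
    view[asset_key] = sorted(assets, key=lambda a: json.dumps(a, sort_keys=True))
    payload = json.dumps(view, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

With this shape, two runs that differ only in local_path produce the same comparable hash, while a changed content hash on any asset produces a different one.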

Verification:

Task 3: Add real content hashes for assets

Objective: Give attachments/transcripts stable identities based on actual content.

Files:

Implementation:

  1. For each downloaded attachment, compute and store:
    • content_hash (sha256)
    • byte_size
    • media_type
    • fetched_at
  2. For each Markdown derivative, compute and store:
    • markdown_content_hash
  3. For each transcript, compute and store:
    • transcript_content_hash
    • transcript_md_content_hash
    • transcript_length
    • video_duration
    • transcript_fetched_at
  4. Reuse these stable hashes in job-level comparison instead of local paths.
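
As a sketch of step 1, the attachment metadata could be computed in a single streaming pass over the downloaded file. describe_attachment is a hypothetical helper; guessing media_type from the filename extension is an assumption, and a real collector might prefer the HTTP Content-Type header when it is available.

```python
import hashlib
import mimetypes
from datetime import datetime, timezone
from pathlib import Path


def describe_attachment(path: Path, chunk_size: int = 1 << 20) -> dict:
    """Compute content_hash, byte_size, media_type, fetched_at for one file."""
    digest = hashlib.sha256()
    size = 0
    with path.open("rb") as fh:
        # Read in chunks so large attachments are not loaded into memory at once.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
            size += len(chunk)
    # Extension-based guess (assumption); returns None for unknown types.
    media_type, _ = mimetypes.guess_type(path.name)
    return {
        "content_hash": digest.hexdigest(),
        "byte_size": size,
        "media_type": media_type,
        "fetched_at": datetime.now(timezone.utc).isoformat(),
    }
```

The same hashing loop would apply to Markdown derivatives and transcript files; only the field names change.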

Verification:

Task 4: Make attachment/transcript capture idempotent

Objective: Stop duplicate files and synthetic changes on reruns.

Files:

Implementation:

  1. Replace filename-suffix dedup (_1, _2, etc.) with content-based identity.
  2. Recommended naming scheme:
    • csj_attachments/{reference}/{slug}__{short_hash}.ext
  3. If the same content already exists for that job, reuse it.
  4. Only create a new file when content actually changes.
  5. Apply the same principle to transcript files and Markdown derivatives.
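
One possible shape for content-addressed storage following the naming scheme above. store_attachment and slugify are hypothetical helpers, the 12-character short hash is an arbitrary choice, and ext is assumed to include its leading dot.

```python
import hashlib
import re
from pathlib import Path
from typing import Tuple


def slugify(title: str) -> str:
    """Lowercase the title and collapse non-alphanumerics to hyphens."""
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    return slug or "asset"


def store_attachment(root: Path, reference: str, title: str,
                     ext: str, data: bytes) -> Tuple[Path, bool]:
    """Write data under a content-derived name; return (path, created)."""
    short_hash = hashlib.sha256(data).hexdigest()[:12]
    target = (root / "csj_attachments" / reference
              / f"{slugify(title)}__{short_hash}{ext}")
    if target.exists():
        # Same content already stored for this job: reuse it, write nothing.
        return target, False
    target.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temporary name first so a crash never leaves a partial file
    # under the final, content-addressed name.
    tmp = target.with_name(target.name + ".tmp")
    tmp.write_bytes(data)
    tmp.replace(target)
    return target, True
```

Because the filename is derived from the content, a rerun with unchanged bytes resolves to the existing file, and a new file appears only when the bytes differ.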

Verification:


Phase 2 — Promote Auxiliary Files to First-Class Archival Assets

Task 5: Introduce a per-job asset manifest

Objective: Separate stable asset metadata from transient job JSON details.

Files:

Implementation: Create csj_asset_manifests/{reference}.json with entries containing:

Verification:

Task 6: Add asset-level history and events

Objective: Track asset changes explicitly.

Files:

Implementation: Emit events:

Store immutable per-version asset metadata snapshots.

Verification:

Task 7: Record asset versions in job history snapshots

Objective: Allow full reconstruction of historical advert state.

Files:

Implementation: Each job history snapshot should record the asset IDs and content hashes that were active at that time. Do not rely on mutable current local paths.
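
A small sketch of what such a snapshot might carry. The field names asset_id, job_content_hash, and captured_at are assumptions for illustration; the point is that only identifiers and hashes are copied, never local paths.

```python
from typing import Iterable, List


def asset_refs(assets: Iterable[dict]) -> List[dict]:
    """Reduce full asset records to stable (asset_id, content_hash) pairs."""
    refs = [
        {"asset_id": a["asset_id"], "content_hash": a["content_hash"]}
        for a in assets
    ]
    # Deterministic order keeps snapshots comparable across runs.
    return sorted(refs, key=lambda r: r["asset_id"])


def make_snapshot(job_content_hash: str, assets: Iterable[dict],
                  captured_at: str) -> dict:
    """Build a history snapshot that pins the asset versions active right now."""
    return {
        "captured_at": captured_at,
        "job_content_hash": job_content_hash,
        "assets": asset_refs(assets),
    }
```

Given immutable per-version asset metadata, a snapshot like this is enough to look up exactly which files belonged to the advert at that point in time.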

Verification:

Task 8: Define closed-job archival completeness

Objective: Give closed jobs an explicit archive state.

Files:

Implementation: Add archive_completeness, e.g.:

On closure:

Verification:


Phase 3 — Improve Extraction Quality and Reliability

Task 9: Scope extraction to the vacancy content region

Objective: Reduce noisy supporting_links and accidental captures from page chrome/footer.

Files:

Implementation:

Verification:

Task 10: Improve YouTube metadata capture

Objective: Make transcript records more useful and stable.

Files:

Implementation:

Verification:

Task 11: Make dependency failures explicit

Objective: Avoid silent archive gaps.

Files:

Implementation: If markitdown or youtube-transcript-api is missing:
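
One way to detect a missing dependency up front, shown as a hedged sketch rather than the plan's prescribed behavior. dependency_status and the "available"/"missing" labels are hypothetical; the import names markitdown and youtube_transcript_api are the modules those two packages install.

```python
import importlib.util

# Maps the package name (as installed) to its importable module name.
OPTIONAL_DEPENDENCIES = {
    "markitdown": "markitdown",
    "youtube-transcript-api": "youtube_transcript_api",
}


def dependency_status(deps: dict = OPTIONAL_DEPENDENCIES) -> dict:
    """Report each optional dependency as 'available' or 'missing'."""
    return {
        package: (
            "available"
            if importlib.util.find_spec(module) is not None
            else "missing"
        )
        for package, module in deps.items()
    }
```

The resulting status could be written into the run output or the asset record, so that an absent Markdown derivative or transcript is attributable to a missing tool rather than looking like an asset that never existed.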

Verification:

Task 12: Add dedicated asset refresh/backfill policy

Objective: Treat asset freshness separately from role-text freshness.

Files:

Implementation: Future flags:

Use these to:

Verification:


Job JSON should contain

Asset manifest entries should contain

Design rule: job-level comparable hashes must use stable asset identities and content hashes, not volatile local file paths.


Implementation Order

  1. Fix hash timing bug
  2. Make attachment/transcript capture idempotent with content hashes
  3. Strip volatile local-path fields from change-significant comparisons
  4. Add per-job asset manifest
  5. Add asset events/history
  6. Add closed-job archive completeness
  7. Improve extraction scoping and YouTube metadata
  8. Add dedicated asset refresh/backfill commands

Success Criteria

The implementation is in a good state when:


Verification Commands

Use these after each phase:

python3 -m py_compile ~/.hermes/skills/research/civil-service-jobs-collector/scripts/collector.py

Known useful test references:

Suggested manual verification runs:

python3 ~/.hermes/skills/research/civil-service-jobs-collector/scripts/collector.py --details -n 1 --full --force
python3 ~/.hermes/skills/research/civil-service-jobs-collector/scripts/collector.py --details -n 1 --refresh

Expected outcomes: