Building The Longhand Archive

How a solo archival system gets built in public


CSJ Skill Modular Refactor Implementation Plan

For Hermes: Use the subagent-driven-development skill to implement this plan task by task.

Status (2026-04-19): Core modular refactor is now effectively complete.

Outcome: This skill should now be treated as the CSJ Collector. It is a native-only modular collector for Civil Service Jobs, developed in the context of The Longhand Archive. The stable public entrypoint remains scripts/collector.py, but most logic has been moved into focused modules under csj/.

Working approach: Keep architecture, planning, implementation, and documentation work inside this skill until a second real collector exists and shared patterns are proven. Evolve toward broader archive concepts from tested CSJ Collector behavior rather than speculative multi-source abstractions.

Current Architecture:

Tech Stack: Python 3.11+, existing skill at ~/.hermes/skills/research/civil-service-jobs-collector/, pytest, requests, optional markitdown / youtube-transcript-api.


Non-Negotiable Constraints

  1. The CSJ system must continue functioning fully as a Hermes skill throughout the refactor.
  2. Keep scripts/collector.py working as the canonical entrypoint used by the skill docs and cron jobs.
  3. Do not change output paths under ~/.hermes/workspace/csj/.
  4. Do not change current JSON schema or lifecycle semantics unless a task explicitly says otherwise.
  5. Do not break current cron jobs (b7b55c30587d, 0f2135aa9efa) — they should still call the same script path.
  6. Refactor under TDD: every new module extraction step starts with a failing test or a strengthened existing regression test.
  7. Prefer small extractions over a big-bang rewrite.

Current Structure Snapshot

Current modular structure is:

Current tests include:

Final Internal Package Layout

Implemented internal package:

~/.hermes/skills/research/civil-service-jobs-collector/csj/
├── __init__.py
├── cli.py
├── config.py
├── state.py
├── hashing.py
├── history.py
├── lifecycle.py
├── assets.py
├── normalize.py
├── native.py
└── run.py

collector_impl.py is not inside csj/. It stays under scripts/ and is reached indirectly via the scripts wrapper.

Public entrypoint remains:

~/.hermes/skills/research/civil-service-jobs-collector/scripts/collector.py

Current wrapper/import chain is effectively:

scripts/collector.py -> csj.cli.main() -> scripts/collector_impl.py::scrape()
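
A minimal sketch of how the thin wrapper could make the csj package importable when cron invokes the script by absolute path. The helper names (skill_root_for, ensure_importable, run_wrapper) are hypothetical and not taken from the actual collector; the only things assumed from this plan are the scripts/ location and the csj.cli.main() call.

```python
import sys
from pathlib import Path


def skill_root_for(script_path: Path) -> Path:
    """Return the skill directory, assuming the script lives in <skill>/scripts/."""
    return script_path.resolve().parent.parent


def ensure_importable(root: Path) -> None:
    """Put the skill root at the front of sys.path exactly once."""
    root_str = str(root)
    if root_str not in sys.path:
        sys.path.insert(0, root_str)


def run_wrapper(script_path: Path):
    """Hypothetical wrapper body: fix sys.path first, then delegate to csj.cli.main()."""
    ensure_importable(skill_root_for(script_path))
    from csj.cli import main  # imported late, after sys.path has been adjusted

    return main()
```

In scripts/collector.py this would be called as `raise SystemExit(run_wrapper(Path(__file__)))`, so the cron jobs keep calling the same script path regardless of their working directory.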

BrowserCollector / browser-use support is intentionally no longer part of the target layout.


Full Test Plan

Test Strategy Overview

The refactor is successful only if three layers remain green:

  1. Unit/regression tests for extracted helpers
  2. CLI smoke tests for the public entrypoint
  3. Behavioural parity checks on summary / repair logic and no-regression outputs

Test Layers

Layer A — Existing regression tests must stay green

These are the baseline guardrails and must pass after every task:

pytest -q ~/.hermes/skills/research/civil-service-jobs-collector/tests

Expected: all pass.

Layer B — New module-import tests

Add tests that verify the wrapper script still exposes the same callable behaviour after moving code into csj/.

Examples:
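
One possible shape for such a test, as a sketch: load the script by file path and assert that it still exposes a callable main. The load_script_module helper is hypothetical, and a stand-in script is written to a temp directory so the example runs anywhere; a real test would point at scripts/collector.py instead.

```python
import importlib.util
import tempfile
from pathlib import Path


def load_script_module(path: Path, name: str = "collector_under_test"):
    """Import a standalone .py script by file path without touching sys.path."""
    spec = importlib.util.spec_from_file_location(name, path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module


def test_wrapper_exposes_callable_main():
    with tempfile.TemporaryDirectory() as tmp:
        # Stand-in for scripts/collector.py (assumption: the wrapper defines main()).
        script = Path(tmp) / "collector.py"
        script.write_text("def main():\n    return 0\n", encoding="utf-8")

        module = load_script_module(script)

        assert callable(module.main)
        assert module.main() == 0
```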

Layer C — Extraction parity tests

For each extracted cluster, add tests that compare old expected behaviour with new module calls.

Examples:
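
A parity check can be sketched as a small helper that runs the old and new callables over the same inputs and reports any differences. Everything here is illustrative: legacy_clean and extracted_clean are stand-ins, not functions from the collector.

```python
import re
from collections.abc import Callable, Iterable
from typing import Any


def find_parity_mismatches(
    old_fn: Callable[[Any], Any],
    new_fn: Callable[[Any], Any],
    cases: Iterable[Any],
) -> list[tuple[Any, Any, Any]]:
    """Return (input, old_result, new_result) for every case where results differ."""
    mismatches = []
    for case in cases:
        old_result, new_result = old_fn(case), new_fn(case)
        if old_result != new_result:
            mismatches.append((case, old_result, new_result))
    return mismatches


def legacy_clean(text: str) -> str:
    """Stand-in for a helper as it existed in the monolithic script."""
    return " ".join(text.split())


def extracted_clean(text: str) -> str:
    """Stand-in for the same helper after extraction into a module."""
    return re.sub(r"\s+", " ", text).strip()


def test_clean_parity():
    cases = ["a  b", " x\ty ", "", "already clean"]
    assert find_parity_mismatches(legacy_clean, extracted_clean, cases) == []
```

Returning the mismatches rather than asserting inside the loop means a failing run shows every divergent input at once.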

Layer D — Smoke tests for CLI entrypoint

Run after major milestones:

python3 ~/.hermes/skills/research/civil-service-jobs-collector/scripts/collector.py --help
python3 ~/.hermes/skills/research/civil-service-jobs-collector/scripts/collector.py --repair-lifecycle --dry-run
python3 -m py_compile ~/.hermes/skills/research/civil-service-jobs-collector/scripts/collector.py

Expected:

Layer E — Optional live smoke test after final integration

Run one safe real-world command:

python3 ~/.hermes/skills/research/civil-service-jobs-collector/scripts/collector.py --repair-lifecycle --dry-run

Then optionally a normal scrape command if desired:

python3 ~/.hermes/skills/research/civil-service-jobs-collector/scripts/collector.py --details -w 5 --full --refresh

Use only after all unit tests are green.


New Test Files To Add

~/.hermes/skills/research/civil-service-jobs-collector/tests/test_cli_wrapper.py
~/.hermes/skills/research/civil-service-jobs-collector/tests/test_hashing.py
~/.hermes/skills/research/civil-service-jobs-collector/tests/test_normalize.py
~/.hermes/skills/research/civil-service-jobs-collector/tests/test_state_and_paths.py

Keep existing:

Specific Behaviour To Test

Hashing / diff tests

Normalization tests

State/path tests

CLI wrapper tests

Lifecycle tests

Continue existing coverage plus add:


Implementation Tasks

Task 1: Create the internal package skeleton

Objective: Introduce a package structure without changing runtime behaviour.

Files:

Step 1: Write failing test

Add a CLI smoke test like:

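import subprocess
import sys
from pathlib import Path

# Resolve the wrapper from the home directory instead of hard-coding /root.
COLLECTOR = (
    Path.home()
    / ".hermes/skills/research/civil-service-jobs-collector/scripts/collector.py"
)


def test_collector_wrapper_help_runs():
    # Use the interpreter running pytest so the test does not depend on PATH.
    result = subprocess.run(
        [sys.executable, str(COLLECTOR), "--help"],
        capture_output=True,
        text=True,
    )
    assert result.returncode == 0
    assert "Civil Service Jobs Collector" in result.stdout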

Step 2: Run test to verify failure

Run:

pytest -q ~/.hermes/skills/research/civil-service-jobs-collector/tests/test_cli_wrapper.py -v

Expected: FAIL until wrapper/package is created.

Step 3: Write minimal implementation

Step 4: Run tests

pytest -q ~/.hermes/skills/research/civil-service-jobs-collector/tests/test_cli_wrapper.py -v
pytest -q ~/.hermes/skills/research/civil-service-jobs-collector/tests
python3 -m py_compile ~/.hermes/skills/research/civil-service-jobs-collector/scripts/collector.py

Expected: all pass.

Step 5: Commit

git add ~/.hermes/skills/research/civil-service-jobs-collector
git commit -m "refactor: add csj package skeleton and wrapper entrypoint"

Task 2: Extract configuration and path constants

Objective: Centralize constants and file paths without changing values.

Files:

Step 1: Write failing test

Add assertions such as:

from csj import config


def test_config_paths_match_existing_workspace_layout():
    assert str(config.DATA_DIR).endswith('/.hermes/workspace/csj')
    assert str(config.JOBS_DIR).endswith('/.hermes/workspace/csj/csj_jobs')
    assert config.SCHEMA_VERSION == '2.2'

Step 2: Verify failure

pytest -q ~/.hermes/skills/research/civil-service-jobs-collector/tests/test_state_and_paths.py -v

Step 3: Implement
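
A sketch of a csj/config.py that would satisfy the assertions above. The three asserted names and values come from the test; HERMES_HOME is a hypothetical intermediate constant, and the real module presumably carries more settings.

```python
from pathlib import Path

# Hypothetical base constant; only the three names below are asserted by the test.
HERMES_HOME = Path.home() / ".hermes"

# Output locations must stay exactly where the existing collector writes them.
DATA_DIR = HERMES_HOME / "workspace" / "csj"
JOBS_DIR = DATA_DIR / "csj_jobs"

# Schema version is copied, not bumped: the refactor must not change the JSON schema.
SCHEMA_VERSION = "2.2"
```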

Step 4: Verify

pytest -q ~/.hermes/skills/research/civil-service-jobs-collector/tests/test_state_and_paths.py -v
pytest -q ~/.hermes/skills/research/civil-service-jobs-collector/tests

Task 3: Extract state and lock management

Objective: Move filesystem bootstrapping/state I/O into a small stable module.
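
As an illustration of what a small state module might contain, here is a sketch of atomic JSON state writes plus an exclusive lock file. The function names (save_state, load_state, run_lock) and the lock-file approach are assumptions for the example, not the collector's actual implementation.

```python
import json
import os
import tempfile
from contextlib import contextmanager
from pathlib import Path


def save_state(path: Path, state: dict) -> None:
    """Write state to a temp file in the same directory, then atomically swap it in."""
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, prefix=path.name, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(state, handle, indent=2, sort_keys=True)
        os.replace(tmp_name, path)
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise


def load_state(path: Path) -> dict:
    """Return the saved state, or an empty dict if no state file exists yet."""
    try:
        return json.loads(path.read_text(encoding="utf-8"))
    except FileNotFoundError:
        return {}


@contextmanager
def run_lock(lock_path: Path):
    """Hold an exclusive lock file; raises FileExistsError if another run holds it."""
    lock_path.parent.mkdir(parents=True, exist_ok=True)
    fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    try:
        os.write(fd, str(os.getpid()).encode())
        yield
    finally:
        os.close(fd)
        os.unlink(lock_path)
```

Because every function takes its path as an argument, tests can point them at a temp directory without patching module globals.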

Files:

Functions to move:

Tests to add:

Verification:

pytest -q ~/.hermes/skills/research/civil-service-jobs-collector/tests/test_state_and_paths.py -v
pytest -q ~/.hermes/skills/research/civil-service-jobs-collector/tests

Task 4: Extract hashing and diff logic

Objective: Make content hashing reusable and independently testable.
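
To show why extraction makes this logic easy to test, here is a sketch of a pure content hash and field diff. The function names, the use of SHA-256 over canonical JSON, and the "scraped_at" volatile field are all assumptions for illustration; the collector's real hashing rules may differ.

```python
import hashlib
import json

# Hypothetical set of fields that change on every run and should not affect the hash.
VOLATILE_FIELDS = frozenset({"scraped_at"})


def content_hash(record: dict, ignore: frozenset = VOLATILE_FIELDS) -> str:
    """Hash a record via canonical JSON so key order does not change the digest."""
    stable = {key: value for key, value in record.items() if key not in ignore}
    payload = json.dumps(stable, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def changed_fields(old: dict, new: dict, ignore: frozenset = VOLATILE_FIELDS) -> list[str]:
    """List non-volatile keys whose values differ, including added or removed keys."""
    keys = (set(old) | set(new)) - set(ignore)
    return sorted(key for key in keys if old.get(key) != new.get(key))
```

Both functions are pure, so tests for them need no filesystem or network setup.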

Files:

Functions to move:

Tests:

Verification:

pytest -q ~/.hermes/skills/research/civil-service-jobs-collector/tests/test_hashing.py -v
pytest -q ~/.hermes/skills/research/civil-service-jobs-collector/tests

Task 5: Extract normalization helpers

Objective: Isolate parsing/cleanup logic from scraping flow.

Files:

Functions to move:

Test cases:

Verification:

pytest -q ~/.hermes/skills/research/civil-service-jobs-collector/tests/test_normalize.py -v
pytest -q ~/.hermes/skills/research/civil-service-jobs-collector/tests

Task 6: Extract history/event persistence

Objective: Separate write-side persistence from orchestration.

Files:

Functions to move:

Testing approach: Use temporary directory monkeypatching where possible so tests assert:
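
A sketch of that approach, assuming events are appended to a JSON Lines file (the storage format, file name, and function names here are hypothetical). The write helper takes its target path as a parameter, so the test can pass pytest's tmp_path fixture directly; where the real code reads a module-level directory constant, monkeypatch.setattr can redirect it to tmp_path in the same way.

```python
import json
from pathlib import Path


def append_event(history_file: Path, event: dict) -> None:
    """Append one event as a single JSON line, creating parent directories as needed."""
    history_file.parent.mkdir(parents=True, exist_ok=True)
    with history_file.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(event, sort_keys=True) + "\n")


def read_events(history_file: Path) -> list[dict]:
    """Read all events back in the order they were written."""
    if not history_file.exists():
        return []
    lines = history_file.read_text(encoding="utf-8").splitlines()
    return [json.loads(line) for line in lines if line.strip()]


def test_append_event_stays_inside_tmp_dir(tmp_path):
    history_file = tmp_path / "history" / "events.jsonl"

    append_event(history_file, {"type": "created", "job_id": "example-1"})
    append_event(history_file, {"type": "updated", "job_id": "example-1"})

    events = read_events(history_file)
    assert [event["type"] for event in events] == ["created", "updated"]
    # Nothing should be written outside the temporary directory.
    assert [p.name for p in tmp_path.iterdir()] == ["history"]
```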

Verification:

pytest -q ~/.hermes/skills/research/civil-service-jobs-collector/tests

Task 7: Extract lifecycle logic

Objective: Move the newly-stabilized lifecycle logic into its own module without behaviour drift.

Files:

Functions to move:

Critical tests:

Verification:

pytest -q ~/.hermes/skills/research/civil-service-jobs-collector/tests/test_lifecycle.py -v
pytest -q ~/.hermes/skills/research/civil-service-jobs-collector/tests/test_repair_mode.py -v
pytest -q ~/.hermes/skills/research/civil-service-jobs-collector/tests
python3 ~/.hermes/skills/research/civil-service-jobs-collector/scripts/collector.py --repair-lifecycle --dry-run

Task 8: Extract asset and archival logic

Objective: Isolate the noisiest subsystem without changing storage semantics.

Files:

Functions to move:

Test plan: Start with pure helper coverage first. Suggested tests:

Note: This is the highest-risk extraction because it has the most filesystem side effects. Keep steps small.

Verification:

pytest -q ~/.hermes/skills/research/civil-service-jobs-collector/tests
python3 -m py_compile ~/.hermes/skills/research/civil-service-jobs-collector/scripts/collector.py

Task 9: Finish native collector boundary cleanup

Objective: Finish separating native transport/auth/search/detail parsing concerns from the remaining CLI-facing glue.

Files:

Focus areas:

Test plan: Do not attempt full integration tests for live scraping here. Instead:

Verification:

pytest -q ~/.hermes/skills/research/civil-service-jobs-collector/tests
python3 ~/.hermes/skills/research/civil-service-jobs-collector/scripts/collector.py --help

Optional manual smoke test:

python3 ~/.hermes/skills/research/civil-service-jobs-collector/scripts/collector.py --repair-lifecycle --dry-run

Task 10: Extract orchestration and leave the script as a thin wrapper

Objective: Finish the modularization while preserving the public entrypoint.

Files:

Functions to move:

End state:

Verification:

pytest -q ~/.hermes/skills/research/civil-service-jobs-collector/tests
python3 -m py_compile ~/.hermes/skills/research/civil-service-jobs-collector/scripts/collector.py
python3 ~/.hermes/skills/research/civil-service-jobs-collector/scripts/collector.py --help
python3 ~/.hermes/skills/research/civil-service-jobs-collector/scripts/collector.py --repair-lifecycle --dry-run

Optional final live smoke test:

python3 ~/.hermes/skills/research/civil-service-jobs-collector/scripts/collector.py --details -w 5 --full --refresh

Task 11: Refactor migrate_attachment_pool.py only if it reduces duplication cleanly

Objective: Share asset-path/hash helpers where useful without creating coupling hazards.

Files:

Rule: Only do this if the shared code is clearly stable. Do not force this extraction if it risks entangling the migration utility with collector runtime dependencies.

Verification:

python3 -m py_compile ~/.hermes/skills/research/civil-service-jobs-collector/scripts/migrate_attachment_pool.py
pytest -q ~/.hermes/skills/research/civil-service-jobs-collector/tests

Task 12: Documentation update for module layout and stable entrypoints

Objective: Document the new internal structure without changing how the skill is operated.

Files:

Docs must state explicitly:

Verification:


Final Verification Checklist

Before declaring the modular refactor complete:


Recommended Execution Order

  1. Package skeleton + wrapper
  2. Config/constants
  3. State/locks
  4. Hashing/diff
  5. Normalization
  6. History/events
  7. Lifecycle
  8. Assets
  9. Native collector boundary cleanup
  10. Orchestration
  11. Optional migration utility cleanup
  12. Docs

This order front-loads low-risk, pure-function extractions and delays high-risk I/O and collector-class movement until the guardrails are stronger.


Live execution notes

Risks To Watch

  1. Import path breakage — the wrapper must work when invoked directly by cron.
  2. Circular imports — especially among config/state/history/assets/orchestration.
  3. Filesystem side effects in tests — use temp dirs or monkeypatch aggressively.
  4. Asset subsystem coupling — likely the hardest extraction.
  5. Docs drift — keep plan/status docs aligned with the now-native-only collector.
  6. Terminology/compatibility drift — prefer Collector wording in docs while preserving compatibility-sensitive collector names.

Suggested First Milestone

If you want to de-risk this work, stop after Tasks 1–5 first. That gets you:

That would deliver a meaningful modularity win with relatively low operational risk.