How a solo archival system gets built in public
For Hermes: Use subagent-driven-development skill to implement this plan task-by-task.
Status (2026-04-19): Core modular refactor is now effectively complete.
Outcome: This skill should now be treated as the CSJ Collector. It is a native-only modular collector for Civil Service Jobs, developed in the context of The Longhand Archive. The stable public entrypoint remains scripts/collector.py, but most logic has been moved into focused modules under csj/.
Working approach: Keep architecture, planning, implementation, and documentation work inside this skill until a second real collector exists and shared patterns are proven. Evolve toward broader archive concepts from tested CSJ Collector behavior rather than speculative multi-source abstractions.
Current Architecture:
scripts/collector.py: stable public entrypoint wrapper
scripts/collector_impl.py: thin integration layer / top-level scrape flow
csj/run.py: collector orchestration pipeline, including the top-level scrape flow
csj/native.py: native requests + ALTCHA client
csj/assets.py, csj/lifecycle.py, csj/history.py, csj/hashing.py, csj/normalize.py, csj/records.py, csj/state.py, csj/config.py
Tech Stack: Python 3.11+, existing skill at ~/.hermes/skills/research/civil-service-jobs-collector/, pytest, requests, optional markitdown / youtube-transcript-api.
scripts/collector.py working as the canonical entrypoint used by the skill docs and cron jobs.
~/.hermes/workspace/csj/.
b7b55c30587d, 0f2135aa9efa); they should still call the same script path.
Current modular structure is:
~/.hermes/skills/research/civil-service-jobs-collector/scripts/collector.py
~/.hermes/skills/research/civil-service-jobs-collector/scripts/collector_impl.py
~/.hermes/skills/research/civil-service-jobs-collector/csj/config.py
~/.hermes/skills/research/civil-service-jobs-collector/csj/state.py
~/.hermes/skills/research/civil-service-jobs-collector/csj/hashing.py
~/.hermes/skills/research/civil-service-jobs-collector/csj/history.py
~/.hermes/skills/research/civil-service-jobs-collector/csj/lifecycle.py
~/.hermes/skills/research/civil-service-jobs-collector/csj/assets.py
~/.hermes/skills/research/civil-service-jobs-collector/csj/normalize.py
~/.hermes/skills/research/civil-service-jobs-collector/csj/records.py
~/.hermes/skills/research/civil-service-jobs-collector/csj/native.py
~/.hermes/skills/research/civil-service-jobs-collector/csj/run.py
~/.hermes/skills/research/civil-service-jobs-collector/csj/cli.py
Current tests include:
~/.hermes/skills/research/civil-service-jobs-collector/tests/test_asset_runtime.py
~/.hermes/skills/research/civil-service-jobs-collector/tests/test_cli_wrapper.py
~/.hermes/skills/research/civil-service-jobs-collector/tests/test_hashing.py
~/.hermes/skills/research/civil-service-jobs-collector/tests/test_lifecycle.py
~/.hermes/skills/research/civil-service-jobs-collector/tests/test_normalize.py
~/.hermes/skills/research/civil-service-jobs-collector/tests/test_repair_mode.py
~/.hermes/skills/research/civil-service-jobs-collector/tests/test_state_and_paths.py
Implemented internal package:
~/.hermes/skills/research/civil-service-jobs-collector/csj/
├── __init__.py
├── cli.py
├── config.py
├── state.py
├── hashing.py
├── history.py
├── lifecycle.py
├── assets.py
├── normalize.py
├── native.py
├── run.py
└── collector_impl.py (indirectly via scripts wrapper, not inside csj/)
Public entrypoint remains:
~/.hermes/skills/research/civil-service-jobs-collector/scripts/collector.py
Current wrapper/import chain is effectively:
scripts/collector.py -> csj.cli.main() -> scripts/collector_impl.py::scrape()
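The first hop of the chain above can be sketched as follows. This is a minimal illustration, not the actual wrapper: it assumes csj/ sits one level above scripts/ (as the listed paths suggest) and that csj.cli.main() returns an exit code. The helper names skill_root, ensure_importable, and run are hypothetical.

```python
"""Sketch of a thin scripts/collector.py wrapper (layout assumed from the paths above)."""
import sys
from pathlib import Path


def skill_root(script_path: Path) -> Path:
    """Return the skill directory that contains both scripts/ and csj/."""
    return script_path.resolve().parent.parent


def ensure_importable(root: Path) -> None:
    """Put the skill root on sys.path so `import csj` works from cron or any cwd."""
    root_str = str(root)
    if root_str not in sys.path:
        sys.path.insert(0, root_str)


def run() -> int:
    """Prepare sys.path, then delegate to the package entrypoint."""
    ensure_importable(skill_root(Path(__file__)))
    from csj.cli import main  # imported lazily, after sys.path is prepared

    return main()  # assumption: main() returns an exit code (or None)


# In the real wrapper the last line would be: raise SystemExit(run())
```

Keeping the csj import inside run() means the wrapper module can be imported by tests without triggering the whole collector.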
BrowserCollector / browser-use support is intentionally no longer part of the target layout.
The refactor is successful only if three layers remain green:
These are the baseline guardrails and must pass after every task:
pytest -q ~/.hermes/skills/research/civil-service-jobs-collector/tests
Expected: all pass.
Add tests that verify the wrapper script still exposes the same callable behaviour after moving code into csj/.
Examples:
scripts/collector.py still works
csj.lifecycle, csj.assets, csj.normalize works directly
--help still works
For each extracted cluster, add tests that compare old expected behaviour with new module calls.
Examples:
Run after major milestones:
python3 ~/.hermes/skills/research/civil-service-jobs-collector/scripts/collector.py --help
python3 ~/.hermes/skills/research/civil-service-jobs-collector/scripts/collector.py --repair-lifecycle --dry-run
python3 -m py_compile ~/.hermes/skills/research/civil-service-jobs-collector/scripts/collector.py
Expected:
--help prints usage
Run one safe real-world command:
python3 ~/.hermes/skills/research/civil-service-jobs-collector/scripts/collector.py --repair-lifecycle --dry-run
Then, optionally, run a normal scrape command:
python3 ~/.hermes/skills/research/civil-service-jobs-collector/scripts/collector.py --details -w 5 --full --refresh
Use only after all unit tests are green.
~/.hermes/skills/research/civil-service-jobs-collector/tests/test_cli_wrapper.py
~/.hermes/skills/research/civil-service-jobs-collector/tests/test_hashing.py
~/.hermes/skills/research/civil-service-jobs-collector/tests/test_normalize.py
~/.hermes/skills/research/civil-service-jobs-collector/tests/test_state_and_paths.py
Keep existing:
test_lifecycle.py
test_repair_mode.py
normalize_url_for_diff() strips volatile CSJ query state consistently
normalize_asset_source_url() preserves stable attachment identity
compute_field_hashes() returns stable results for same input
compute_content_hash() changes when meaningful fields change
diff_job_records() distinguishes meaningful vs non-meaningful changes
sys.path
scripts/collector.py --help works
scripts/collector.py --repair-lifecycle --dry-run reaches package entrypoint
Continue existing coverage plus add:
csj.lifecycle
Objective: Introduce a package structure without changing runtime behaviour.
Files:
~/.hermes/skills/research/civil-service-jobs-collector/csj/__init__.py
~/.hermes/skills/research/civil-service-jobs-collector/csj/config.py
~/.hermes/skills/research/civil-service-jobs-collector/csj/cli.py
~/.hermes/skills/research/civil-service-jobs-collector/scripts/collector.py
~/.hermes/skills/research/civil-service-jobs-collector/tests/test_cli_wrapper.py
Step 1: Write failing test
Add a CLI smoke test like:
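import subprocess
import sys
from pathlib import Path

# Resolve the wrapper from the home directory instead of hard-coding /root.
COLLECTOR = Path.home() / ".hermes/skills/research/civil-service-jobs-collector/scripts/collector.py"


def test_collector_wrapper_help_runs():
    # Run the wrapper with the same interpreter that runs pytest.
    result = subprocess.run(
        [sys.executable, str(COLLECTOR), "--help"],
        capture_output=True,
        text=True,
    )
    assert result.returncode == 0
    assert "Civil Service Jobs Collector" in result.stdout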
Step 2: Run test to verify failure
Run:
pytest -q ~/.hermes/skills/research/civil-service-jobs-collector/tests/test_cli_wrapper.py -v
Expected: FAIL until wrapper/package is created.
Step 3: Write minimal implementation
Create csj.cli
Make scripts/collector.py a thin wrapper importing csj.cli.main
csj.cli can still call back into legacy functions in scripts/collector.py if needed via an intermediate compatibility layer, but avoid circular imports
Step 4: Run tests
pytest -q ~/.hermes/skills/research/civil-service-jobs-collector/tests/test_cli_wrapper.py -v
pytest -q ~/.hermes/skills/research/civil-service-jobs-collector/tests
python3 -m py_compile ~/.hermes/skills/research/civil-service-jobs-collector/scripts/collector.py
Expected: all pass.
Step 5: Commit
git add ~/.hermes/skills/research/civil-service-jobs-collector
git commit -m "refactor: add csj package skeleton and wrapper entrypoint"
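For the skeleton step, a csj.cli module only needs enough argument parsing to make the smoke test pass. The sketch below is an assumption-laden illustration: the flag names --repair-lifecycle and --dry-run and the description string come from this plan, while build_parser, the help texts, and the printed messages are hypothetical placeholders for the real dispatch.

```python
import argparse
from typing import Optional, Sequence


def build_parser() -> argparse.ArgumentParser:
    """Build a minimal parser; the description matches the string the smoke test expects."""
    parser = argparse.ArgumentParser(
        prog="collector.py",
        description="Civil Service Jobs Collector",
    )
    parser.add_argument(
        "--repair-lifecycle",
        action="store_true",
        help="run lifecycle repair instead of a scrape (help text is illustrative)",
    )
    parser.add_argument(
        "--dry-run",
        action="store_true",
        help="report what would change without writing (help text is illustrative)",
    )
    return parser


def main(argv: Optional[Sequence[str]] = None) -> int:
    """Parse arguments and dispatch; the branches below are placeholders."""
    args = build_parser().parse_args(argv)
    if args.repair_lifecycle:
        # Hypothetical: the real code would call into the lifecycle repair routine.
        print(f"repair-lifecycle requested (dry_run={args.dry_run})")
        return 0
    # Hypothetical: the real code would hand off to the scrape flow.
    print("scrape flow would start here")
    return 0
```

Accepting argv as a parameter lets tests call main([...]) in-process, which is faster than spawning a subprocess for every case.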
Objective: Centralize constants and file paths without changing values.
Files:
~/.hermes/skills/research/civil-service-jobs-collector/csj/config.py
~/.hermes/skills/research/civil-service-jobs-collector/scripts/collector.py
~/.hermes/skills/research/civil-service-jobs-collector/tests/test_state_and_paths.py
Step 1: Write failing test
Add assertions such as:
from csj import config

def test_config_paths_match_existing_workspace_layout():
    assert str(config.DATA_DIR).endswith('/.hermes/workspace/csj')
    assert str(config.JOBS_DIR).endswith('/.hermes/workspace/csj/csj_jobs')
    assert config.SCHEMA_VERSION == '2.2'
Step 2: Verify failure
pytest -q ~/.hermes/skills/research/civil-service-jobs-collector/tests/test_state_and_paths.py -v
Step 3: Implement
Move GRADE_VALUES, GRADE_NORMALIZE, DEPT_ALIASES, regex patterns into csj.config
Step 4: Verify
pytest -q ~/.hermes/skills/research/civil-service-jobs-collector/tests/test_state_and_paths.py -v
pytest -q ~/.hermes/skills/research/civil-service-jobs-collector/tests
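A csj/config.py that satisfies the path assertions above could start as small as this. It is only a sketch: the three values are taken from the test in this task, and deriving them from Path.home() is an assumption about how the existing constants are built.

```python
from pathlib import Path

# Workspace root for collected data; the suffix matches the test expectation above.
DATA_DIR = Path.home() / ".hermes" / "workspace" / "csj"

# Per-job records live in a subdirectory of the workspace.
JOBS_DIR = DATA_DIR / "csj_jobs"

# Schema version asserted by the test; kept as a string, not a float.
SCHEMA_VERSION = "2.2"
```

Deriving JOBS_DIR from DATA_DIR keeps the two paths from drifting apart if the workspace root ever moves.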
Objective: Move filesystem bootstrapping/state I/O into a small stable module.
Files:
~/.hermes/skills/research/civil-service-jobs-collector/csj/state.py
~/.hermes/skills/research/civil-service-jobs-collector/scripts/collector.py
~/.hermes/skills/research/civil-service-jobs-collector/tests/test_state_and_paths.py
Functions to move:
ensure_dirs
load_state
save_state
acquire_lock
release_lock
Tests to add:
Verification:
pytest -q ~/.hermes/skills/research/civil-service-jobs-collector/tests/test_state_and_paths.py -v
pytest -q ~/.hermes/skills/research/civil-service-jobs-collector/tests
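To show the kind of small, side-effect-contained module this task aims for, here is a sketch of the five functions. Only the function names come from this plan. The signatures, the JSON state format, the atomic-write approach, and the exclusive-create lock file are all assumptions, and the existing collector may behave differently.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Any, Dict


def ensure_dirs(*dirs: Path) -> None:
    """Create each directory (and parents) if missing."""
    for directory in dirs:
        directory.mkdir(parents=True, exist_ok=True)


def load_state(path: Path) -> Dict[str, Any]:
    """Return the saved state, or an empty dict when no state file exists yet."""
    if not path.exists():
        return {}
    return json.loads(path.read_text(encoding="utf-8"))


def save_state(path: Path, state: Dict[str, Any]) -> None:
    """Write to a temp file in the same directory, then atomically replace the target."""
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump(state, handle, indent=2, sort_keys=True)
    os.replace(tmp_name, path)


def acquire_lock(lock_path: Path) -> bool:
    """Create the lock file exclusively; return False if another run already holds it."""
    try:
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as handle:
        handle.write(str(os.getpid()))
    return True


def release_lock(lock_path: Path) -> None:
    """Remove the lock file; a missing file is not an error."""
    lock_path.unlink(missing_ok=True)
```

Because every function takes its path as an argument, tests can point them at a temporary directory without touching the real workspace.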
Objective: Make content hashing reusable and independently testable.
Files:
~/.hermes/skills/research/civil-service-jobs-collector/csj/hashing.py
~/.hermes/skills/research/civil-service-jobs-collector/scripts/collector.py
~/.hermes/skills/research/civil-service-jobs-collector/tests/test_hashing.py
Functions to move:
stable_json_dumps
normalize_text_for_diff
normalize_list_for_diff
normalize_url_for_diff
normalize_asset_source_url
strip_volatile_asset_fields
build_comparable_record
hash_value
compute_field_hashes
compute_content_hash
diff_job_records
Tests:
Verification:
pytest -q ~/.hermes/skills/research/civil-service-jobs-collector/tests/test_hashing.py -v
pytest -q ~/.hermes/skills/research/civil-service-jobs-collector/tests
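The properties this task wants to test (stable hashes for equal input, volatile URL state ignored) can be illustrated with a small sketch. The function names come from the list above, but the bodies are assumptions: SHA-256 over canonical JSON is one common choice, and the query key "SID" and the example.org URLs are purely hypothetical stand-ins for whatever the collector treats as volatile.

```python
import hashlib
import json
from typing import Any, Dict
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical: which query parameters count as volatile is an assumption.
VOLATILE_QUERY_KEYS = {"SID"}


def stable_json_dumps(value: Any) -> str:
    """Serialize with sorted keys and fixed separators so equal data gives equal text."""
    return json.dumps(value, sort_keys=True, separators=(",", ":"), ensure_ascii=False)


def normalize_url_for_diff(url: str) -> str:
    """Drop volatile query parameters and the fragment; sort what remains."""
    parts = urlsplit(url)
    kept = sorted(
        (key, val)
        for key, val in parse_qsl(parts.query, keep_blank_values=True)
        if key not in VOLATILE_QUERY_KEYS
    )
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


def compute_content_hash(record: Dict[str, Any]) -> str:
    """Hash the canonical JSON form of a record."""
    return hashlib.sha256(stable_json_dumps(record).encode("utf-8")).hexdigest()


# Usage: key order and volatile query state should not affect the results.
same = compute_content_hash({"a": 1, "b": 2}) == compute_content_hash({"b": 2, "a": 1})
clean = normalize_url_for_diff("https://example.org/job?SID=abc&id=7")
```

Tests built on this shape compare two inputs that differ only in non-meaningful ways, then two that differ meaningfully, and assert equal and unequal hashes respectively.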
Objective: Isolate parsing/cleanup logic from scraping flow.
Files:
~/.hermes/skills/research/civil-service-jobs-collector/csj/normalize.py
~/.hermes/skills/research/civil-service-jobs-collector/scripts/collector.py
~/.hermes/skills/research/civil-service-jobs-collector/tests/test_normalize.py
Functions to move:
null_if_empty
parse_salary_int
normalize_grade
normalize_working_pattern
extract_security_clearance
parse_closes_iso
clean_location_primary
Test cases:
Verification:
pytest -q ~/.hermes/skills/research/civil-service-jobs-collector/tests/test_normalize.py -v
pytest -q ~/.hermes/skills/research/civil-service-jobs-collector/tests
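Pure normalizers like these are the easiest to cover with table-style tests. The sketch below shows two of the listed functions with assumed behaviour: the real null_if_empty and parse_salary_int may handle more formats, and the rule of taking the first pound amount is a guess made for illustration.

```python
import re
from typing import Optional


def null_if_empty(value: Optional[str]) -> Optional[str]:
    """Return the stripped string, or None when nothing is left."""
    if value is None:
        return None
    stripped = value.strip()
    return stripped or None


def parse_salary_int(text: Optional[str]) -> Optional[int]:
    """Return the first amount written like '£35,000' as an int, or None if absent."""
    if not text:
        return None
    match = re.search(r"£\s*(\d[\d,]*)", text)
    if match is None:
        return None
    return int(match.group(1).replace(",", ""))


# Usage with an illustrative salary string (not taken from real CSJ data).
lower_bound = parse_salary_int("£35,000 - £40,000")
```

Because neither function touches the network or filesystem, their tests run instantly and can be executed after every extraction step.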
Objective: Separate write-side persistence from orchestration.
Files:
~/.hermes/skills/research/civil-service-jobs-collector/csj/history.py
~/.hermes/skills/research/civil-service-jobs-collector/scripts/collector.py
Functions to move:
append_event
write_history_version
save_job_record
Testing approach: Use temporary directory monkeypatching where possible so tests assert:
Verification:
pytest -q ~/.hermes/skills/research/civil-service-jobs-collector/tests
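The temporary-directory approach described for this task can be sketched with pytest's built-in tmp_path fixture. The append_event below is a hypothetical stand-in that writes JSON Lines, and the event fields, file name, and read_events helper are invented for illustration. The real function's signature and storage format may differ.

```python
import json
from pathlib import Path
from typing import Any, Dict, List


def append_event(events_path: Path, event: Dict[str, Any]) -> None:
    """Hypothetical stand-in: append one JSON object per line."""
    events_path.parent.mkdir(parents=True, exist_ok=True)
    with events_path.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(event, sort_keys=True) + "\n")


def read_events(events_path: Path) -> List[Dict[str, Any]]:
    """Test helper: load every non-blank line back as a dict."""
    with events_path.open(encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


def test_append_event_writes_inside_tmp_dir(tmp_path: Path) -> None:
    """All writes land under pytest's tmp_path, never in the real workspace."""
    events_path = tmp_path / "history" / "events.jsonl"
    append_event(events_path, {"type": "created", "job_id": "123"})
    append_event(events_path, {"type": "updated", "job_id": "123"})
    assert [event["type"] for event in read_events(events_path)] == ["created", "updated"]
```

If the real functions read their target directory from module-level constants, pytest's monkeypatch.setattr can point those constants at tmp_path for the duration of a test.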
Objective: Move the newly-stabilized lifecycle logic into its own module without behaviour drift.
Files:
~/.hermes/skills/research/civil-service-jobs-collector/csj/lifecycle.py
~/.hermes/skills/research/civil-service-jobs-collector/scripts/collector.py
test_lifecycle.py, test_repair_mode.py
Functions to move:
verify_missing_job_url
is_probable_csj_homepage
classify_fetch_failure
compute_search_refs
evaluate_repair_action
run_lifecycle_repair
Critical tests:
Verification:
pytest -q ~/.hermes/skills/research/civil-service-jobs-collector/tests/test_lifecycle.py -v
pytest -q ~/.hermes/skills/research/civil-service-jobs-collector/tests/test_repair_mode.py -v
pytest -q ~/.hermes/skills/research/civil-service-jobs-collector/tests
python3 ~/.hermes/skills/research/civil-service-jobs-collector/scripts/collector.py --repair-lifecycle --dry-run
Objective: Isolate the noisiest subsystem without changing storage semantics.
Files:
~/.hermes/skills/research/civil-service-jobs-collector/csj/assets.py
~/.hermes/skills/research/civil-service-jobs-collector/scripts/collector.py
scripts/migrate_attachment_pool.py to import shared helpers only after the main refactor is stable
Functions to move:
classify_asset_url
extract_supporting_assets_from_html
download_attachments
fetch_youtube_transcripts
Test plan: Start with pure helper coverage first. Suggested tests:
Note: This is the highest-risk extraction because it has the most filesystem side effects. Keep steps small.
Verification:
pytest -q ~/.hermes/skills/research/civil-service-jobs-collector/tests
python3 -m py_compile ~/.hermes/skills/research/civil-service-jobs-collector/scripts/collector.py
Objective: Finish separating native transport/auth/search/detail parsing concerns from the remaining CLI-facing glue.
Files:
~/.hermes/skills/research/civil-service-jobs-collector/csj/native.py
~/.hermes/skills/research/civil-service-jobs-collector/scripts/collector_impl.py
~/.hermes/skills/research/civil-service-jobs-collector/csj/cli.py
Focus areas:
csj/native.py
scripts/collector_impl.py
Test plan: Do not attempt full integration tests for live scraping here. Instead:
--help and repair mode still work
Verification:
pytest -q ~/.hermes/skills/research/civil-service-jobs-collector/tests
python3 ~/.hermes/skills/research/civil-service-jobs-collector/scripts/collector.py --help
Optional manual smoke test:
python3 ~/.hermes/skills/research/civil-service-jobs-collector/scripts/collector.py --repair-lifecycle --dry-run
Objective: Finish the modularization while preserving the public entrypoint.
Files:
~/.hermes/skills/research/civil-service-jobs-collector/csj/orchestration.py
~/.hermes/skills/research/civil-service-jobs-collector/csj/cli.py
~/.hermes/skills/research/civil-service-jobs-collector/scripts/collector.py
Functions to move:
finalize_job
normalize_job
scrape
End state:
scripts/collector.py is wrapper-only
csj/
Verification:
pytest -q ~/.hermes/skills/research/civil-service-jobs-collector/tests
python3 -m py_compile ~/.hermes/skills/research/civil-service-jobs-collector/scripts/collector.py
python3 ~/.hermes/skills/research/civil-service-jobs-collector/scripts/collector.py --help
python3 ~/.hermes/skills/research/civil-service-jobs-collector/scripts/collector.py --repair-lifecycle --dry-run
Optional final live smoke test:
python3 ~/.hermes/skills/research/civil-service-jobs-collector/scripts/collector.py --details -w 5 --full --refresh
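The py_compile command above checks only the wrapper script. To cover the moved modules as well, a small helper can byte-compile every Python file under the skill directory. This is a sketch using the standard-library py_compile module. The compile_all name and the idea of returning a list of error strings are illustrative choices, not part of the existing skill.

```python
import py_compile
from pathlib import Path
from typing import List


def compile_all(root: Path) -> List[str]:
    """Byte-compile every .py file under root; return error messages (empty means success)."""
    errors: List[str] = []
    for path in sorted(root.rglob("*.py")):
        try:
            # doraise=True turns syntax errors into PyCompileError instead of printing them.
            py_compile.compile(str(path), doraise=True)
        except py_compile.PyCompileError as exc:
            errors.append(f"{path}: {exc.msg}")
    return errors
```

Calling compile_all on the skill directory and asserting that the result is empty gives a single check for the wrapper and every module under csj/.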
migrate_attachment_pool.py only if it reduces duplication cleanly
Objective: Share asset-path/hash helpers where useful without creating coupling hazards.
Files:
~/.hermes/skills/research/civil-service-jobs-collector/scripts/migrate_attachment_pool.py
~/.hermes/skills/research/civil-service-jobs-collector/csj/assets.py
Rule: Only do this if the shared code is clearly stable. Do not force this extraction if it risks entangling the migration utility with collector runtime dependencies.
Verification:
python3 -m py_compile ~/.hermes/skills/research/civil-service-jobs-collector/scripts/migrate_attachment_pool.py
pytest -q ~/.hermes/skills/research/civil-service-jobs-collector/tests
Objective: Document the new internal structure without changing how the skill is operated.
Files:
~/.hermes/skills/research/civil-service-jobs-collector/SKILL.md
~/.hermes/workspace/csj/CSJ-ARCHITECTURE.md
~/.hermes/skills/research/civil-service-jobs-collector/README.md
Docs must state explicitly:
scripts/collector.py remains the stable public entrypoint
csj/
Verification:
Before declaring the modular refactor complete:
scripts/collector.py still works as the public skill entrypoint
python3 -m py_compile passes on wrapper and moved modules
--help works
--repair-lifecycle --dry-run works
This order front-loads low-risk, pure-function extractions and delays high-risk I/O and collector-class movement until the guardrails are stronger.
scripts/collector.py is now a stable wrapper that imports csj.cli.main while re-exporting legacy symbols from scripts/collector_impl.py for test compatibility during the transition.
csj/ and completed extractions for:
config.py
state.py
hashing.py
normalize.py
history.py
lifecycle.py
assets.py (including manifest/enrichment/path/archive helpers and live download/transcript routines)
pytest -q ~/.hermes/skills/research/civil-service-jobs-collector/tests → 45 passed
python3 ~/.hermes/skills/research/civil-service-jobs-collector/scripts/collector.py --repair-lifecycle --dry-run → passed
python3 ~/.hermes/skills/research/civil-service-jobs-collector/scripts/collector.py --help → passed
csj/native.py and scripts/collector_impl.py
455795), which is expected runtime drift rather than a refactor regression.
collector names.
If you want to de-risk this work, stop after Tasks 1–5 first. That gets you:
That would deliver a meaningful modularity win with relatively low operational risk.