How a solo archival system gets built in public
Last updated: 2026-04-19
The CSJ Collector is now in a good modular state and remains fully functional.
In architecture and documentation terms it has been reframed as a collector, while compatibility-sensitive names and paths are preserved where needed.
The current working approach is:
Browser-use/browser fallback support was removed intentionally. The collector is now native-only.
Result:
The old monolithic implementation has been split into focused modules.
Current code layout:
scripts/collector.py: stable public entrypoint wrapper
scripts/collector_impl.py: thin integration layer / top-level scrape flow
csj/run.py: collector orchestration pipeline, including the top-level scrape flow
csj/native.py: native requests + ALTCHA collection/parsing
csj/assets.py: attachment/transcript capture, manifests, asset history/event handling
csj/lifecycle.py: lifecycle classification, repair mode, withdrawn/missing verification
csj/history.py: job history versions and event logging
csj/hashing.py: comparable-record hashing, diffing, asset/source normalization
csj/normalize.py: normalization helpers
csj/records.py: job-record normalization, finalization, and persistence helpers
csj/state.py: state/locking
csj/config.py: constants and paths
csj/cli.py: CLI entrypoint, parser, and runtime context assembly
scripts/collector_impl.py (evaluate_repair_action is still re-exported from the collector module surface)
Safe wording has been moved toward Collector in:
Compatibility-sensitive names were intentionally left unchanged, including:
scripts/collector.py
scripts/collector_impl.py
civil-service-jobs-collector
scraped_at
/tmp/csj_collector.log
The following references now exist and should be maintained as we go:
references/csj-collector-architecture.md
references/csj-archive-envelope.md
source = "csj"
content_type = "job_posting"
references/csj-field-mapping.md
references/refresh-lifecycle-edge-cases.md
/root/.hermes/workspace/csj/CSJ-MODULAR-REFACTOR-PLAN.md has been updated so it reflects the current post-refactor state rather than the older future-state plan.
Most recent verified state:
uvx pytest -q ~/.hermes/skills/research/civil-service-jobs-collector/tests
Verified:
python3 ~/.hermes/skills/research/civil-service-jobs-collector/scripts/collector.py --help
python3 ~/.hermes/skills/research/civil-service-jobs-collector/scripts/collector.py --repair-lifecycle --dry-run
Help output identifies the tool as:
Civil Service Jobs Collector
A real lifecycle bug was found and fixed after hardening work:
refresh_existing_listing_state() in csj/run.py could silently reactivate already-closed records by resetting skipped/search-present records back to status=active / lifecycle_status=active
--repair-lifecycle runs produced duplicate repair_closed history snapshots/events for a subset of refs
collector.py --repair-lifecycle --dry-run returned closed: 0 on the repaired dataset
Historical duplicate repair-closure entries from earlier runs still exist in the archive, but the current-state reactivation risk in that path appears resolved.
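The kind of guard that prevents this class of bug can be sketched as follows. This is a minimal illustration and not the actual code in csj/run.py: the function names refresh_listing_state and append_event_once, the CLOSED_STATES set, and the record/event dictionary shapes are all assumptions made for the example.

```python
from copy import deepcopy

# Assumed set of terminal lifecycle states (hypothetical, for illustration).
CLOSED_STATES = {"closed", "withdrawn"}


def refresh_listing_state(record: dict, seen_at: str) -> dict:
    """Return an updated copy of a record that is still present in search results."""
    updated = deepcopy(record)
    # Guard: an already-closed record is returned untouched, so a skipped or
    # search-present record can never be silently flipped back to active.
    if updated.get("lifecycle_status") in CLOSED_STATES:
        return updated
    updated["last_seen"] = seen_at
    updated["status"] = "active"
    updated["lifecycle_status"] = "active"
    return updated


def append_event_once(events: list, ref: str, event_type: str, at: str) -> bool:
    """Append an event unless the latest event for this ref already has that type."""
    for event in reversed(events):
        if event["ref"] == ref:
            if event["type"] == event_type:
                return False  # duplicate of the most recent event, skip it
            break
    events.append({"ref": ref, "type": event_type, "at": at})
    return True
```

With a guard like this, repeated repair runs over the same closed record leave both its status and its event list unchanged, which is the property the dry run above checks for.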
These were made deliberately and should not be changed casually:
scripts/collector.py and scripts/collector_impl.py
civil-service-jobs-collector
scraped_at
collector* to collector*
scraped_at -> collected_at
These are the best next actions inside this skill.
When changing the CSJ Collector, continue updating:
references/csj-collector-architecture.md
references/csj-archive-envelope.md
references/csj-field-mapping.md
This is the default low-risk path.
Potentially start emitting or documenting conceptual metadata such as:
source = csj
content_type = job_posting
Important: do this first in docs/summaries if needed, not as a breaking schema rewrite.
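One non-breaking way to surface this metadata is to attach it in a derived summary view and leave the stored record alone. The sketch below assumes a record is a plain dictionary; the function name envelope_summary and the "envelope"/"record" keys are hypothetical and are not part of the current schema.

```python
def envelope_summary(record: dict) -> dict:
    """Build a summary view that adds conceptual metadata next to a record.

    The stored record is copied, not modified, so existing fields such as
    scraped_at keep their current names.
    """
    return {
        "envelope": {
            "source": "csj",
            "content_type": "job_posting",
        },
        "record": dict(record),  # shallow copy; original keys preserved as-is
    }
```

Because the metadata lives only in the derived view, it can be dropped or reshaped later without any migration of archived records.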
A useful next design task would be to explicitly mark which current modules/functions are most likely to become shared later across The Longhand Archive.
Strong candidates include:
first_seen, last_seen, status, last_changed_at
But this should stay conceptual until a second collector exists.
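As a purely conceptual illustration, those four fields could be grouped like this. The class name LifecycleFields and the observe method are hypothetical, timestamps are assumed to be ISO-8601 strings, and nothing here reflects how the collector stores them today.

```python
from dataclasses import dataclass


@dataclass
class LifecycleFields:
    """Candidate shared lifecycle fields (conceptual sketch only)."""

    first_seen: str
    last_seen: str
    status: str
    last_changed_at: str

    def observe(self, at: str, status: str) -> None:
        """Record a new observation of the item at time `at`."""
        self.last_seen = at
        # last_changed_at moves only when the status actually changes.
        if status != self.status:
            self.status = status
            self.last_changed_at = at
```

Keeping first_seen immutable and moving last_changed_at only on a real status change is one possible convention; a second collector may well need something different.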
If there is a collector improvement/bugfix/new archival behavior to implement, the architecture is now clean enough to do that work with less risk.
Use this principle for future work:
Build The Longhand Archive iteratively by evolving the CSJ Collector first. Generalize only from working, tested collector behavior. Keep planning and architecture inside this skill until a second collector justifies extraction.
Start by reading:
references/csj-collector-architecture.md
references/csj-archive-envelope.md
references/csj-field-mapping.md
references/refresh-lifecycle-edge-cases.md
Then verify current operational health with:
pytest -q ~/.hermes/skills/research/civil-service-jobs-collector/tests
python3 ~/.hermes/skills/research/civil-service-jobs-collector/scripts/collector.py --help
python3 ~/.hermes/skills/research/civil-service-jobs-collector/scripts/collector.py --repair-lifecycle --dry-run
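If these three checks are run often, they could be wrapped in a small helper. The sketch below only assembles and runs the same commands listed above; the names SKILL_ROOT, build_health_checks, and run_health_checks are hypothetical, and it assumes pytest and python3 are on PATH.

```python
import subprocess
from pathlib import Path

# Skill location as used in the commands above.
SKILL_ROOT = Path("~/.hermes/skills/research/civil-service-jobs-collector").expanduser()


def build_health_checks(skill_root: Path) -> list[list[str]]:
    """Return the three verification commands as argv lists."""
    collector = str(skill_root / "scripts" / "collector.py")
    return [
        ["pytest", "-q", str(skill_root / "tests")],
        ["python3", collector, "--help"],
        ["python3", collector, "--repair-lifecycle", "--dry-run"],
    ]


def run_health_checks(checks: list[list[str]], runner=subprocess.run) -> list[int]:
    """Run each check in order and return the exit codes."""
    return [runner(argv).returncode for argv in checks]
```

Calling run_health_checks(build_health_checks(SKILL_ROOT)) returns one exit code per command, so a result of all zeros corresponds to a healthy state. The runner parameter exists only so the helper can be exercised without launching real processes.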
If working on naming/terminology again, remember:
keep the compatibility-sensitive collector names unchanged until there is a deliberate migration plan