# Reproducibility Checklist — FORJA
Use this checklist before a pilot run, benchmark batch, or audit bundle handoff.
## 1. Canonical scope

- [ ] Official benchmark claims are restricted to `SA`, `TS`, `ILS`, `GRASP`, `METIS`, and `KaHIP` (see the filter sketch after this list).
- [ ] Any `greedy` run kept in the repository is labeled exploratory and excluded from official benchmark tables, manifests, and selector-label claims.
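A minimal sketch of the scope rule, assuming runs are available as records with a `solver` key (a hypothetical shape, not part of the FORJA contract); the canonical set itself is taken from the items above:

```python
# Canonical solver set for official benchmark claims (from the checklist).
OFFICIAL_SOLVERS = {"SA", "TS", "ILS", "GRASP", "METIS", "KaHIP"}

def official_only(runs):
    """Drop exploratory runs (e.g. `greedy`) before building official tables."""
    return [r for r in runs if r["solver"] in OFFICIAL_SOLVERS]
```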
## 2. Environment

- [ ] OS, kernel, CPU, and thread controls are recorded.
- [ ] Any report using this release states that the benchmark ran in an audited WSL2 environment and does not overclaim bare-metal timing equivalence.
- [ ] Python environment is pinned and archived (`poetry.lock`, `pip list`, or equivalent).
- [ ] `gpmetis` is available when METIS runs are expected, and its runtime-reported version is recorded (a capture sketch follows this list).
- [ ] `kaffpa` is available when KaHIP runs are expected, and its runtime-reported version is recorded.
- [ ] The active experimental narrative treats KaHIP 3.17 as the official repository reference.
- [ ] If a vendored `./KaHIP` tree is present, it is logged separately as auxiliary repository material rather than assumed to match the runtime binary.
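A minimal sketch of how the availability and version-banner checks above could be automated, using only the standard library. The record layout is illustrative, and because neither binary's version flag is specified here, the sketch archives the raw banner each binary prints when invoked bare rather than assuming a `--version` option:

```python
import json
import platform
import shutil
import subprocess

def capture_banner(binary: str) -> dict:
    """Record the resolved path and raw banner output of an external solver."""
    path = shutil.which(binary)
    if path is None:
        return {"binary": binary, "found": False}
    try:
        # Archive stdout+stderr verbatim; many partitioner CLIs print a
        # usage/version banner when run with no arguments.
        proc = subprocess.run([path], capture_output=True, text=True, timeout=10)
        banner = (proc.stdout + proc.stderr).strip()
    except subprocess.TimeoutExpired:
        banner = "<no banner: binary did not exit within 10s>"
    return {"binary": binary, "found": True, "path": path, "banner": banner}

env_record = {
    "platform": platform.platform(),      # OS and kernel string
    "machine": platform.machine(),        # CPU architecture
    "python": platform.python_version(),  # interpreter version
    "solvers": [capture_banner(b) for b in ("gpmetis", "kaffpa")],
}
print(json.dumps(env_record, indent=2))
```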
## 3. Plans and fairness

- [ ] The archived plan YAML uses `schema: forja-exp-v1` (a pre-flight check sketch follows this list).
- [ ] All compared solvers receive the same per-instance wall-clock budget.
- [ ] The same balance semantics are enforced across all compared families.
- [ ] Hyperparameters were frozen in the pilot before the main benchmark claim was made.
- [ ] Metaheuristic benchmark plans use the D-012 frozen profiles for `SA`, `TS`, `ILS`, and `GRASP`.
- [ ] Outputs are validated independently after execution.
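A pre-flight check sketch for an archived plan. Only the `schema: forja-exp-v1` key is mandated by the checklist; the plan path, the `solvers` list, and its `name`/`budget_s` keys are hypothetical stand-ins for however the plan actually encodes per-solver wall-clock budgets:

```python
import yaml  # PyYAML

with open("plan.yaml") as fh:  # hypothetical plan path
    plan = yaml.safe_load(fh)

# The one key the checklist fixes:
assert plan["schema"] == "forja-exp-v1", f"unexpected schema: {plan.get('schema')}"

# Fairness check: every compared solver gets the same per-instance budget.
budgets = {s["name"]: s["budget_s"] for s in plan["solvers"]}
assert len(set(budgets.values())) == 1, f"unequal per-instance budgets: {budgets}"
```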
## 4. Result artifacts

- [ ] Artifacts validate against `specs/jsonschema/solver_run.schema.v1.json` (a validation sketch follows this list).
- [ ] Total serialized runtime is stored as `elapsed_ms`.
- [ ] Checkpoint timestamps are stored as `checkpoints[].time_ms`.
- [ ] `time_ns` does not appear as an official serialized checkpoint field.
- [ ] Legacy names such as `runtime_ms` and `elapsed_wall_ms` are absent from current benchmark artifacts.
- [ ] When NFE is emitted by a metaheuristic, it is treated as diagnostic instrumentation rather than as the universal cross-family budget.
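A sketch of the artifact checks above: validate against the canonical schema, then scan for the field names the contract forbids. The artifact path and the recursive key-walk are illustrative; the schema file itself remains the authority:

```python
import json
import jsonschema  # pip install jsonschema

FORBIDDEN = {"time_ns", "runtime_ms", "elapsed_wall_ms"}

def all_keys(node):
    """Yield every key occurring anywhere in a nested JSON structure."""
    if isinstance(node, dict):
        for k, v in node.items():
            yield k
            yield from all_keys(v)
    elif isinstance(node, list):
        for item in node:
            yield from all_keys(item)

with open("specs/jsonschema/solver_run.schema.v1.json") as fh:
    schema = json.load(fh)
with open("run_artifact.json") as fh:  # hypothetical artifact path
    artifact = json.load(fh)

jsonschema.validate(instance=artifact, schema=schema)
assert "elapsed_ms" in artifact
assert all("time_ms" in cp for cp in artifact.get("checkpoints", []))

leaked = FORBIDDEN & set(all_keys(artifact))
assert not leaked, f"legacy/forbidden fields present: {leaked}"
```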
## 5. Execution trace

- [ ] Instance identifiers and hashes are archived.
- [ ] Commit SHA, hostname, solver name, `k`, `beta`, seed, and budget are preserved per run (a provenance sketch follows this list).
- [ ] `stdout`/`stderr` are kept when relevant to diagnose external-solver behaviour.
- [ ] Output locations match the selected plan's `output.raw_dir`.
- [ ] If TTT or ECDF is reported, the target rule, censoring rule, and wall-clock interpretation follow D-010.
- [ ] If performance profiles are reported, the ratio definition and the common-feasible instance set are stated explicitly (a common convention is shown after this list).
- [ ] Selector evaluation, when reported, uses an instance-level preregistered outer holdout over the collapsed per-instance table.
- [ ] No selector preprocessing or model choice touches the outer test partition.
- [ ] Selector claims do not rely on hyperparameter search or nested model selection unless the canon is explicitly reopened.
- [ ] The instantiated CART parameter tuple is recorded before selector evaluation is reported.
- [ ] Pilot runs are linked to `EXP-BENCH-PILOT-001` and main campaign runs are linked to `EXP-BENCH-MAIN-001`.
- [ ] Main campaign execution did not begin before documented pilot review.
- [ ] Reported artifact fields match the active confirmed contract (`elapsed_ms`, `checkpoints[].time_ms`, status semantics, optional NFE only when instrumented).
- [ ] Manifest linkage between raw outputs and later benchmark tables is preserved under the active release-candidate contract.
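An illustrative per-run provenance record covering the fields the checklist requires. The key names and the function signature are hypothetical; only the required content (commit SHA, hostname, solver name, `k`, `beta`, seed, budget) is taken from the checklist:

```python
import socket
import subprocess

def run_provenance(solver: str, k: int, beta: float,
                   seed: int, budget_s: float) -> dict:
    """Assemble the minimal per-run trace record the checklist asks for."""
    sha = subprocess.run(
        ["git", "rev-parse", "HEAD"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    return {
        "commit_sha": sha,
        "hostname": socket.gethostname(),
        "solver": solver,
        "k": k,
        "beta": beta,
        "seed": seed,
        "budget_s": budget_s,
    }

record = run_provenance("SA", k=8, beta=0.03, seed=42, budget_s=60.0)
```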
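For the performance-profile item, one common convention is the Dolan–Moré ratio, restricted here to the common-feasible instance set $P$. The checklist only requires that whatever definition is used be stated explicitly, so this is a reference point rather than the mandated form:

$$
r_{p,s} = \frac{t_{p,s}}{\min_{s'} t_{p,s'}}, \qquad
\rho_s(\tau) = \frac{\lvert \{\, p \in P : r_{p,s} \le \tau \,\} \rvert}{\lvert P \rvert},
$$

where $t_{p,s}$ is the performance measure of solver $s$ on instance $p$, and $\rho_s(\tau)$ is the fraction of instances on which solver $s$ is within a factor $\tau$ of the best solver.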
## 6. Reporting

- [ ] Tables and plots cite the commit, plan, and dataset slice used.
- [ ] Any legacy material consulted during the analysis is identified as such.
- [ ] Any deviation from the canonical contract is documented explicitly before the result is reused in prose.