# Reproducibility Checklist — FORJA
Use this checklist before a pilot run, benchmark batch, or audit bundle handoff.
## 1. Canonical scope

- [ ] Official benchmark claims are restricted to `SA`, `TS`, `ILS`, `GRASP`, `METIS`, and `KaHIP` (see the filter sketch after this list).
- [ ] Any `greedy` run kept in the repository is labeled exploratory and excluded from official benchmark tables, manifests, and selector-label claims.
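A minimal sketch of the scope rule, assuming runs are available as records with a `solver` key (a hypothetical shape, not part of the FORJA contract); the canonical set itself is taken from the items above:

```python
# Canonical solver set for official benchmark claims (from the checklist).
OFFICIAL_SOLVERS = {"SA", "TS", "ILS", "GRASP", "METIS", "KaHIP"}

def official_only(runs):
    """Drop exploratory runs (e.g. `greedy`) before building official tables."""
    return [r for r in runs if r["solver"] in OFFICIAL_SOLVERS]
```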
## 2. Environment

- [ ] OS, kernel, CPU, and thread controls are recorded.
- [ ] Any report using this release states that the benchmark ran in an audited WSL2 environment and does not overclaim bare-metal timing equivalence.
- [ ] Python environment is pinned and archived (`poetry.lock`, `pip list`, or equivalent).
- [ ] `gpmetis` is available when METIS runs are expected, and its runtime-reported version is recorded (a capture sketch follows this list).
- [ ] `kaffpa` is available when KaHIP runs are expected, and its runtime-reported version is recorded.
- [ ] The active experimental narrative treats KaHIP 3.17 as the official repository reference.
- [ ] If a vendored `./KaHIP` tree is present, it is logged separately as auxiliary repository material rather than assumed to match the runtime binary.
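A minimal sketch of how the availability and version-banner checks above could be automated, using only the standard library. The record layout is illustrative, and because neither binary's version flag is specified here, the sketch archives the raw banner each binary prints when invoked bare rather than assuming a `--version` option:

```python
import json
import platform
import shutil
import subprocess

def capture_banner(binary: str) -> dict:
    """Record the resolved path and raw banner output of an external solver."""
    path = shutil.which(binary)
    if path is None:
        return {"binary": binary, "found": False}
    try:
        # Archive stdout+stderr verbatim; many partitioner CLIs print a
        # usage/version banner when run with no arguments.
        proc = subprocess.run([path], capture_output=True, text=True, timeout=10)
        banner = (proc.stdout + proc.stderr).strip()
    except subprocess.TimeoutExpired:
        banner = "<no banner: binary did not exit within 10s>"
    return {"binary": binary, "found": True, "path": path, "banner": banner}

env_record = {
    "platform": platform.platform(),      # OS and kernel string
    "machine": platform.machine(),        # CPU architecture
    "python": platform.python_version(),  # interpreter version
    "solvers": [capture_banner(b) for b in ("gpmetis", "kaffpa")],
}
print(json.dumps(env_record, indent=2))
```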
## 3. Plans and fairness

- [ ] The archived plan YAML uses `schema: forja-exp-v1` (a pre-flight check sketch follows this list).
- [ ] All compared solvers receive the same per-instance wall-clock budget.
- [ ] The same balance semantics are enforced across all compared families.
- [ ] Hyperparameters were frozen in the pilot before the main benchmark claim was made.
- [ ] Metaheuristic benchmark plans use the D-012 frozen profiles for `SA`, `TS`, `ILS`, and `GRASP`.
- [ ] Outputs are validated independently after execution.
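A pre-flight check sketch for an archived plan. Only the `schema: forja-exp-v1` key is mandated by the checklist; the plan path, the `solvers` list, and its `name`/`budget_s` keys are hypothetical stand-ins for however the plan actually encodes per-solver wall-clock budgets:

```python
import yaml  # PyYAML

with open("plan.yaml") as fh:  # hypothetical plan path
    plan = yaml.safe_load(fh)

# The one key the checklist fixes:
assert plan["schema"] == "forja-exp-v1", f"unexpected schema: {plan.get('schema')}"

# Fairness check: every compared solver gets the same per-instance budget.
budgets = {s["name"]: s["budget_s"] for s in plan["solvers"]}
assert len(set(budgets.values())) == 1, f"unequal per-instance budgets: {budgets}"
```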
## 4. Result artifacts

- [ ] Artifacts validate against `specs/jsonschema/solver_run.schema.v1.json` (a validation sketch follows this list).
- [ ] Total serialized runtime is stored as `elapsed_ms`.
- [ ] Checkpoint timestamps are stored as `checkpoints[].time_ms`.
- [ ] `time_ns` does not appear as an official serialized checkpoint field.
- [ ] Legacy names such as `runtime_ms` and `elapsed_wall_ms` are absent from current benchmark artifacts.
- [ ] When NFE is emitted by a metaheuristic, it is treated as diagnostic instrumentation rather than as the universal cross-family budget.
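A sketch of the artifact checks above: validate against the canonical schema, then scan for the field names the contract forbids. The artifact path and the recursive key-walk are illustrative; the schema file itself remains the authority:

```python
import json
import jsonschema  # pip install jsonschema

FORBIDDEN = {"time_ns", "runtime_ms", "elapsed_wall_ms"}

def all_keys(node):
    """Yield every key occurring anywhere in a nested JSON structure."""
    if isinstance(node, dict):
        for k, v in node.items():
            yield k
            yield from all_keys(v)
    elif isinstance(node, list):
        for item in node:
            yield from all_keys(item)

with open("specs/jsonschema/solver_run.schema.v1.json") as fh:
    schema = json.load(fh)
with open("run_artifact.json") as fh:  # hypothetical artifact path
    artifact = json.load(fh)

jsonschema.validate(instance=artifact, schema=schema)
assert "elapsed_ms" in artifact
assert all("time_ms" in cp for cp in artifact.get("checkpoints", []))

leaked = FORBIDDEN & set(all_keys(artifact))
assert not leaked, f"legacy/forbidden fields present: {leaked}"
```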
## 5. Execution trace

- [ ] Instance identifiers and hashes are archived.
- [ ] Commit SHA, hostname, solver name, `k`, `beta`, seed, and budget are preserved per run (a provenance sketch follows this list).
- [ ] `stdout`/`stderr` are kept when relevant to diagnose external-solver behaviour.
- [ ] Output locations match the selected plan's `output.raw_dir`.
- [ ] If TTT or ECDF is reported, the target rule, censoring rule, and wall-clock interpretation follow D-010.
- [ ] If performance profiles are reported, the ratio definition and the common-feasible instance set are stated explicitly (a common convention is shown after this list).
- [ ] Selector evaluation, when reported, uses an instance-level preregistered outer holdout over the collapsed per-instance table.
- [ ] No selector preprocessing or model choice touches the outer test partition.
- [ ] Selector claims do not rely on hyperparameter search or nested model selection unless the canon is explicitly reopened.
- [ ] The instantiated CART parameter tuple is recorded before selector evaluation is reported.
- [ ] Pilot runs are linked to `EXP-BENCH-PILOT-001` and main campaign runs are linked to `EXP-BENCH-MAIN-001`.
- [ ] Main campaign execution did not begin before documented pilot review.
- [ ] Reported artifact fields match the active confirmed contract (`elapsed_ms`, `checkpoints[].time_ms`, status semantics, optional NFE only when instrumented).
- [ ] Manifest linkage between raw outputs and later benchmark tables is preserved under the active release-candidate contract.
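An illustrative per-run provenance record covering the fields the checklist requires. The key names and the function signature are hypothetical; only the required content (commit SHA, hostname, solver name, `k`, `beta`, seed, budget) is taken from the checklist:

```python
import socket
import subprocess

def run_provenance(solver: str, k: int, beta: float,
                   seed: int, budget_s: float) -> dict:
    """Assemble the minimal per-run trace record the checklist asks for."""
    sha = subprocess.run(
        ["git", "rev-parse", "HEAD"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    return {
        "commit_sha": sha,
        "hostname": socket.gethostname(),
        "solver": solver,
        "k": k,
        "beta": beta,
        "seed": seed,
        "budget_s": budget_s,
    }

record = run_provenance("SA", k=8, beta=0.03, seed=42, budget_s=60.0)
```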
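For the performance-profile item, one common convention is the Dolan–Moré ratio, restricted here to the common-feasible instance set $P$. The checklist only requires that whatever definition is used be stated explicitly, so this is a reference point rather than the mandated form:

$$
r_{p,s} = \frac{t_{p,s}}{\min_{s'} t_{p,s'}}, \qquad
\rho_s(\tau) = \frac{\lvert \{\, p \in P : r_{p,s} \le \tau \,\} \rvert}{\lvert P \rvert},
$$

where $t_{p,s}$ is the performance measure of solver $s$ on instance $p$, and $\rho_s(\tau)$ is the fraction of instances on which solver $s$ is within a factor $\tau$ of the best solver.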
## 6. Reporting

- [ ] Tables and plots cite the commit, plan, and dataset slice used.
- [ ] Any legacy material consulted during the analysis is identified as such.
- [ ] Any deviation from the canonical contract is documented explicitly before the result is reused in prose.