Reproducible rigs

Experiments, re-runnable

Every cited number in an Aresalab paper comes out of one of the scripts below. Same script, same input, same output — to the byte for the deterministic ones, within a tolerance band for the benchmarks. If a paper claim isn't backed by something here, it shouldn't be in the paper.

Rules of the lab
  • If a paper cites a number, that number is regenerated by a script in the repo. No fluff, no ghost benchmarks.
  • Rigs run on stdlib + open data wherever possible. The Mars Colony rig has zero dependencies; the Fire Safety rig only reads the same JSONs the live dashboard does.
  • Numbers are stamped with the timestamp of the run that produced them. The card metric you see below was the headline at that moment; a re-run will refresh both this page and the paper.
  • Last run: 2026-04-20 10:40 UTC (1d ago) · Runtime: ~1 s on M2 Pro · Stack: Python

    pass@k synthetic benchmark

    chapter · Applied ML 2026 · Ch. 1 — pass@k

    Generates a 120-problem synthetic grade-school math benchmark across three difficulty tiers, runs four solvers (regex baseline, parser, iid noisy oracle, sticky noisy oracle) with 20 samples each, and computes pass@1 / pass@5 / pass@10 with bootstrap 95% CIs. Demonstrates how the classical 1−(1−p)^k bound breaks under correlated sampling.

    run
    $ cd genass/publications/quarto/applied_ml_2026/
    $ python3 experiments/run.py
    iid oracle pass@1: 0.702
    iid oracle pass@10: 1.000
    sticky oracle pass@10: 0.986
    Correlation penalty @10: 1.5 pp
    Parser baseline pass@1: 0.843
    Outputs & notes
    Writes
    • experiments/results/headline.json — overall pass@1/5/10 across solvers
    • experiments/results/metrics.json — per-solver, per-tier, per-k with 95% CIs
    • Applied ML 2026 Ch. 1 prose consumes the same JSON; PDF and web reader cannot drift
    Notes
    • Standard library only — no third-party deps, no API keys. Deterministic from `--seed`.
    • A real LLM can be dropped in as a one-line adapter to replace `noisy_oracle`; the harness, grading, and CIs all work unchanged.
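    The correlation effect on the card can be sketched in a few lines: the unbiased pass@k estimator 1 − C(n−c, k)/C(n, k) applied to an iid sampler versus a "sticky" sampler that tends to repeat its previous outcome. The solve rate, `stick` probability, and problem count below are illustration values, not the rig's actual parameters.

    ```python
    import math
    import random

    def pass_at_k(n: int, c: int, k: int) -> float:
        """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k)."""
        if n - c < k:
            return 1.0
        return 1.0 - math.comb(n - c, k) / math.comb(n, k)

    def iid_samples(p, n, rng):
        # independent Bernoulli(p) attempts
        return [rng.random() < p for _ in range(n)]

    def sticky_samples(p, n, rng, stick=0.6):
        # correlated sampler: with prob `stick`, repeat the previous outcome
        out = [rng.random() < p]
        for _ in range(n - 1):
            out.append(out[-1] if rng.random() < stick else rng.random() < p)
        return out

    rng = random.Random(0)
    p, n, k, problems = 0.7, 20, 10, 120
    for name, sampler in [("iid", iid_samples), ("sticky", sticky_samples)]:
        est = sum(pass_at_k(n, sum(sampler(p, n, rng)), k)
                  for _ in range(problems)) / problems
        print(f"{name:6s} pass@{k} = {est:.3f}  "
              f"(iid bound 1-(1-p)^k = {1 - (1 - p) ** k:.3f})")
    ```

    Both samplers have the same per-attempt solve rate p, but the sticky one produces long failure runs, so its empirical pass@k falls below the 1−(1−p)^k prediction.
    
    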
  • Last run: 2026-04-20 09:01 UTC (2d ago) · Runtime: <1 s on M2 Pro · Stack: Python

    Fire Safety dashboard-as-paper

    paper · Data-Driven Fire Safety Analytics: A Reproducible Audit of 5…

    Consumes the same pre-aggregated JSONs the live dashboard does, re-derives every paper claim, and runs a strict consistency check against stats.json (the dashboard's oracle). Fails loudly if dashboard and paper drift apart.

    run
    $ cd genass/publications/quarto/fire_safety_dashboard/
    $ uv run python experiments/run.py
    Records Audited: 550,145
    Fire Alarms: 205,398
    Commercial Share: 64.0%
    Bounded Annual Cost: $131.5M
    Outputs & notes
    Writes
    • experiments/results/headline.json — card-facing rollup
    • experiments/results/breakdown.json — per-year / per-season / per-city / per-priority slices
    • data/fire_safety_results.json — consumed verbatim by the paper's Quarto cells
    Notes
    • Two denominators are tracked explicitly: 550,145 total dispatches (incl. EMS / non-fire) vs. 367,444 fire-specific. Every percent in the paper says which denominator it uses.
    • To refresh upstream data: rerun `convert-to-parquet.py` + `precompute-aggregations.ts` in yev/apps/fire-safety/.
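    The "fails loudly" check can be sketched as below; the flat key-to-number layout is an illustrative assumption, not the real stats.json schema.

    ```python
    def check_consistency(derived: dict, oracle: dict, rel_tol: float = 1e-9) -> None:
        """Abort with a loud error if any re-derived paper number drifts
        from the dashboard oracle (a sketch; key layout is hypothetical)."""
        for key, want in oracle.items():
            got = derived.get(key)
            if got is None:
                raise SystemExit(f"DRIFT: paper never derives {key!r}")
            if abs(got - want) > rel_tol * max(1, abs(want)):
                raise SystemExit(f"DRIFT: {key}: paper={got} dashboard={want}")
        print(f"OK: {len(oracle)} metrics match the dashboard oracle")

    # demo with two card numbers from this page
    oracle = {"total_dispatches": 550_145, "fire_alarms": 205_398}
    check_consistency({"total_dispatches": 550_145, "fire_alarms": 205_398}, oracle)
    ```

    Raising `SystemExit` keeps the run's exit code nonzero, so CI (or a pre-render hook) refuses to publish a paper that disagrees with the dashboard.
    
    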
  • Last run: 2026-04-20 09:01 UTC (2d ago) · Runtime: ~3 s on M2 Pro · Stack: Python

    Mars Colony rule substrate

    paper · Emergent Social Collaboration in Multi-Agent LLM Systems: A …

    Deterministic Python reimplementation of the Mars Colony simulator's rule layer: 10 agents, role-affinity bond matrix, need-decay action selector, hard $0.007 cost ceiling per session. Generates every number cited in the paper across 10 seeds × 10K ticks.

    run
    $ cd genass/publications/quarto/mars_colony_collaboration/
    $ uv run python experiments/run.py
    Strong Friendships (mean): 3.5
    Working Relationships (mean): 14.6
    Construction Sites Done: 3 / 3
    Cost Ceiling / Session: $0.007
    Action Mix · Work: 49.2%
    Outputs & notes
    Writes
    • experiments/results/headline.json — card-facing rollup
    • experiments/results/aggregate.json — mean / std across seeds
    • experiments/results/metrics.json — full per-session telemetry
    • data/simulation_results.json — consumed verbatim by the paper's Quarto cells
    Notes
    • stdlib-only — no third-party deps. Use --quick for a 3-seed × 2K-tick smoke run.
    • The LLM dialogue layer (GPT-4o-mini / Claude Haiku) is NOT called here; what's reproduced is the rule-governed dynamics and the cost envelope.
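    The need-decay action selector can be sketched as follows. The need names, decay rate, and restore amount are illustrative placeholders, not the simulator's actual constants: each tick, every need decays a little, the agent acts on its most depleted need, and acting restores that need.

    ```python
    import random
    from dataclasses import dataclass, field

    NEEDS = ("energy", "social", "work")

    @dataclass
    class Agent:
        name: str
        needs: dict = field(default_factory=lambda: {n: 1.0 for n in NEEDS})

    def step(agent: Agent, rng: random.Random, decay: float = 0.02) -> str:
        """One tick of a need-decay selector (rates are made-up sketch values)."""
        for n in NEEDS:
            agent.needs[n] = max(0.0, agent.needs[n] - decay * rng.random())
        action = min(agent.needs, key=agent.needs.get)  # most depleted need wins
        agent.needs[action] = min(1.0, agent.needs[action] + 0.1)  # acting restores it
        return action

    rng = random.Random(42)  # seeded, so the run is deterministic
    agents = [Agent(f"colonist_{i}") for i in range(10)]
    mix = {n: 0 for n in NEEDS}
    for _ in range(2000):
        for a in agents:
            mix[step(a, rng)] += 1
    total = sum(mix.values())
    print({k: round(v / total, 3) for k, v in mix.items()})
    ```

    Because selection is a pure function of needs and a seeded RNG, re-running with the same seed reproduces the action mix exactly, which is what makes the rig's numbers citable.
    
    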
  • Last run: 2026-04-20 07:02 UTC (2d ago) · Runtime: ~30 s on M2 Pro · Stack: Rust + Python

    AresaDB benchmarks

    paper · AresaDB: A High-Performance Multi-Model Database in Rust

    Reproduces the four card metrics in the AresaDB paper (point-lookup p50/p99, batch-insert throughput, HNSW vs brute-force speedup, secondary-index speedup) by running the Rust bench suite over a 50K-node / 250K-edge / 10K-vector workload.

    run
    $ cd genass/publications/quarto/aresadb_technical_report/
    $ uv run python experiments/run.py
    Point Lookup: 5 µs (p50)
    Insert Rate: 75K/sec batch
    Vector Search: 7 µs (HNSW)
    Index Speedup: 23×
    Outputs & notes
    Writes
    • experiments/results/headline.json — card-facing rollup
    • experiments/results/raw/*.json — full Criterion output per benchmark
    • Quarto paper consumes the same JSON; LaTeX render is byte-for-byte deterministic
    Notes
    • Rust toolchain required (cargo). The Python wrapper invokes `cargo bench` for each suite, parses Criterion JSON, and writes the unified headline.
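    The Criterion-parsing step can be sketched as below, assuming Criterion's default output layout (`target/criterion/<bench>/new/estimates.json`, times in nanoseconds); the wrapper's real paths and key choices may differ.

    ```python
    import json
    from pathlib import Path

    def criterion_point_estimates(criterion_dir: str = "target/criterion") -> dict:
        """Collect per-benchmark median point estimates (ns) from Criterion's
        estimates.json files under `criterion_dir` (default layout assumed)."""
        results = {}
        for est in Path(criterion_dir).glob("*/new/estimates.json"):
            data = json.loads(est.read_text())
            # benchmark name is the directory two levels up from estimates.json
            results[est.parent.parent.name] = data["median"]["point_estimate"]
        return results
    ```

    A unified headline then only needs one `json.dump` of this dict, so the Quarto paper and the card read the same file.
    
    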