Reproducible rigs

Experiments, re-runnable

Every cited number in an Aresalab paper comes out of one of the scripts below. Same script, same input, same output — to the byte for the deterministic ones, within a tolerance band for the benchmarks. If a paper claim isn't backed by something here, it shouldn't be in the paper.

Rules of the lab
  • If a paper cites a number, that number is regenerated by a script in the repo. No fluff, no ghost benchmarks.
  • Rigs run on stdlib + open data wherever possible. The Mars Colony rig has zero dependencies; the Fire Safety rig only reads the same JSONs the live dashboard does.
  • Numbers are stamped with the timestamp of the run that produced them. The card metric you see below was the headline at that moment; a re-run will refresh both this page and the paper.
  • Last run: 2026-04-20 10:40 UTC (1d ago) · Runtime: ~1 s on M2 Pro · Stack: Python

    pass@k synthetic benchmark

    chapter · Applied ML 2026 · Ch. 1 — pass@k

    Generates a 120-problem synthetic grade-school math benchmark across three difficulty tiers, runs four solvers (regex baseline, parser, iid noisy oracle, sticky noisy oracle) with 20 samples each, and computes pass@1 / pass@5 / pass@10 with bootstrap 95% CIs. Demonstrates how the classical 1−(1−p)^k bound breaks under correlated sampling.

    run
    $ cd genass/publications/quarto/applied_ml_2026/
    $ python3 experiments/run.py
    iid oracle pass@1: 0.702
    iid oracle pass@10: 1.000
    sticky oracle pass@10: 0.986
    Correlation penalty @10: 1.5 pp
    Parser baseline pass@1: 0.843
    Outputs & notes
    Writes
    • experiments/results/headline.json — overall pass@1/5/10 across solvers
    • experiments/results/metrics.json — per-solver, per-tier, per-k with 95% CIs
    • Applied ML 2026 Ch. 1 prose consumes the same JSON; PDF and web reader cannot drift
    Notes
    • Standard library only — no third-party deps, no API keys. Deterministic from `--seed`.
    • A real LLM can be dropped in as a one-line adapter to replace `noisy_oracle`; the harness, grading, and CIs all work unchanged.
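    The correlation effect on the card can be sketched in a few lines: the unbiased pass@k estimator 1 − C(n−c, k)/C(n, k) applied to an iid sampler versus a "sticky" sampler that tends to repeat its previous outcome. The solve rate, `stick` probability, and problem count below are illustration values, not the rig's actual parameters.

    ```python
    import math
    import random

    def pass_at_k(n: int, c: int, k: int) -> float:
        """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k)."""
        if n - c < k:
            return 1.0
        return 1.0 - math.comb(n - c, k) / math.comb(n, k)

    def iid_samples(p, n, rng):
        # independent Bernoulli(p) attempts
        return [rng.random() < p for _ in range(n)]

    def sticky_samples(p, n, rng, stick=0.6):
        # correlated sampler: with prob `stick`, repeat the previous outcome
        out = [rng.random() < p]
        for _ in range(n - 1):
            out.append(out[-1] if rng.random() < stick else rng.random() < p)
        return out

    rng = random.Random(0)
    p, n, k, problems = 0.7, 20, 10, 120
    for name, sampler in [("iid", iid_samples), ("sticky", sticky_samples)]:
        est = sum(pass_at_k(n, sum(sampler(p, n, rng)), k)
                  for _ in range(problems)) / problems
        print(f"{name:6s} pass@{k} = {est:.3f}  "
              f"(iid bound 1-(1-p)^k = {1 - (1 - p) ** k:.3f})")
    ```

    Both samplers have the same per-attempt solve rate p, but the sticky one produces long failure runs, so its empirical pass@k falls below the 1−(1−p)^k prediction.
    
    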
  • Last run: 2026-04-20 09:01 UTC (2d ago) · Runtime: <1 s on M2 Pro · Stack: Python

    Fire Safety dashboard-as-paper

    paper · Data-Driven Fire Safety Analytics: A Reproducible Audit of 5…

    Consumes the same pre-aggregated JSONs the live dashboard does, re-derives every paper claim, and runs a strict consistency check against stats.json (the dashboard's oracle). Fails loudly if dashboard and paper drift apart.

    run
    $ cd genass/publications/quarto/fire_safety_dashboard/
    $ uv run python experiments/run.py
    Records Audited: 550,145
    Fire Alarms: 205,398
    Commercial Share: 64.0%
    Bounded Annual Cost: $131.5M
    Outputs & notes
    Writes
    • experiments/results/headline.json — card-facing rollup
    • experiments/results/breakdown.json — per-year / per-season / per-city / per-priority slices
    • data/fire_safety_results.json — consumed verbatim by the paper's Quarto cells
    Notes
    • Two denominators are tracked explicitly: 550,145 total dispatches (incl. EMS / non-fire) vs. 367,444 fire-specific. Every percent in the paper says which denominator it uses.
    • To refresh upstream data: rerun `convert-to-parquet.py` + `precompute-aggregations.ts` in yev/apps/fire-safety/.
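    The "fails loudly" check can be sketched as below; the flat key-to-number layout is an illustrative assumption, not the real stats.json schema.

    ```python
    def check_consistency(derived: dict, oracle: dict, rel_tol: float = 1e-9) -> None:
        """Abort with a loud error if any re-derived paper number drifts
        from the dashboard oracle (a sketch; key layout is hypothetical)."""
        for key, want in oracle.items():
            got = derived.get(key)
            if got is None:
                raise SystemExit(f"DRIFT: paper never derives {key!r}")
            if abs(got - want) > rel_tol * max(1, abs(want)):
                raise SystemExit(f"DRIFT: {key}: paper={got} dashboard={want}")
        print(f"OK: {len(oracle)} metrics match the dashboard oracle")

    # demo with two card numbers from this page
    oracle = {"total_dispatches": 550_145, "fire_alarms": 205_398}
    check_consistency({"total_dispatches": 550_145, "fire_alarms": 205_398}, oracle)
    ```

    Raising `SystemExit` keeps the run's exit code nonzero, so CI (or a pre-render hook) refuses to publish a paper that disagrees with the dashboard.
    
    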
  • Last run: 2026-04-20 09:01 UTC (2d ago) · Runtime: ~3 s on M2 Pro · Stack: Python

    Mars Colony rule substrate

    paper · Emergent Social Collaboration in Multi-Agent LLM Systems: A …

    Deterministic Python reimplementation of the Mars Colony simulator's rule layer: 10 agents, role-affinity bond matrix, need-decay action selector, hard $0.007 cost ceiling per session. Generates every number cited in the paper across 10 seeds × 10K ticks.

    run
    $ cd genass/publications/quarto/mars_colony_collaboration/
    $ uv run python experiments/run.py
    Strong Friendships (mean): 3.5
    Working Relationships (mean): 14.6
    Construction Sites Done: 3 / 3
    Cost Ceiling / Session: $0.007
    Action Mix · Work: 49.2%
    Outputs & notes
    Writes
    • experiments/results/headline.json — card-facing rollup
    • experiments/results/aggregate.json — mean / std across seeds
    • experiments/results/metrics.json — full per-session telemetry
    • data/simulation_results.json — consumed verbatim by the paper's Quarto cells
    Notes
    • stdlib-only — no third-party deps. Use --quick for a 3-seed × 2K-tick smoke run.
    • The LLM dialogue layer (GPT-4o-mini / Claude Haiku) is NOT called here; what's reproduced is the rule-governed dynamics and the cost envelope.
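    The need-decay action selector can be sketched as follows. The need names, decay rate, and restore amount are illustrative placeholders, not the simulator's actual constants: each tick, every need decays a little, the agent acts on its most depleted need, and acting restores that need.

    ```python
    import random
    from dataclasses import dataclass, field

    NEEDS = ("energy", "social", "work")

    @dataclass
    class Agent:
        name: str
        needs: dict = field(default_factory=lambda: {n: 1.0 for n in NEEDS})

    def step(agent: Agent, rng: random.Random, decay: float = 0.02) -> str:
        """One tick of a need-decay selector (rates are made-up sketch values)."""
        for n in NEEDS:
            agent.needs[n] = max(0.0, agent.needs[n] - decay * rng.random())
        action = min(agent.needs, key=agent.needs.get)  # most depleted need wins
        agent.needs[action] = min(1.0, agent.needs[action] + 0.1)  # acting restores it
        return action

    rng = random.Random(42)  # seeded, so the run is deterministic
    agents = [Agent(f"colonist_{i}") for i in range(10)]
    mix = {n: 0 for n in NEEDS}
    for _ in range(2000):
        for a in agents:
            mix[step(a, rng)] += 1
    total = sum(mix.values())
    print({k: round(v / total, 3) for k, v in mix.items()})
    ```

    Because selection is a pure function of needs and a seeded RNG, re-running with the same seed reproduces the action mix exactly, which is what makes the rig's numbers citable.
    
    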
  • Last run: 2026-04-20 07:02 UTC (2d ago) · Runtime: ~30 s on M2 Pro · Stack: Rust + Python

    AresaDB benchmarks

    paper · AresaDB: A High-Performance Multi-Model Database in Rust

    Reproduces the four card metrics in the AresaDB paper (point-lookup p50/p99, batch-insert throughput, HNSW vs brute-force speedup, secondary-index speedup) by running the Rust bench suite over a 50K-node / 250K-edge / 10K-vector workload.

    run
    $ cd genass/publications/quarto/aresadb_technical_report/
    $ uv run python experiments/run.py
    Point Lookup: 5 µs (p50)
    Insert Rate: 75K/sec batch
    Vector Search: 7 µs (HNSW)
    Index Speedup: 23×
    Outputs & notes
    Writes
    • experiments/results/headline.json — card-facing rollup
    • experiments/results/raw/*.json — full Criterion output per benchmark
    • Quarto paper consumes the same JSON; LaTeX render is byte-for-byte deterministic
    Notes
    • Rust toolchain required (cargo). The Python wrapper invokes `cargo bench` for each suite, parses Criterion JSON, and writes the unified headline.
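    The Criterion-parsing step can be sketched as below, assuming Criterion's default output layout (`target/criterion/<bench>/new/estimates.json`, times in nanoseconds); the wrapper's real paths and key choices may differ.

    ```python
    import json
    from pathlib import Path

    def criterion_point_estimates(criterion_dir: str = "target/criterion") -> dict:
        """Collect per-benchmark median point estimates (ns) from Criterion's
        estimates.json files under `criterion_dir` (default layout assumed)."""
        results = {}
        for est in Path(criterion_dir).glob("*/new/estimates.json"):
            data = json.loads(est.read_text())
            # benchmark name is the directory two levels up from estimates.json
            results[est.parent.parent.name] = data["median"]["point_estimate"]
        return results
    ```

    A unified headline then only needs one `json.dump` of this dict, so the Quarto paper and the card read the same file.
    
    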