# Bench: validating forge's flexibility without performance regression

This chapter describes the public benchmark suite shipped with `mujoco-wasm-forge`.
The suite exists to justify *why forge exists*:

- **Flexibility**: a minimal, configurable interface surface and multiple distribution variants (e.g. single-threaded vs pthreads).
- **Flatness**: flat handles and a predictable ABI that downstream tools and apps can build around (Worker / Node / Web).
- **Rule-based, auditable generation**: wrappers/exports are generated and checked by rules so changes are explicit and reviewable.

In forge, **flexibility** is framed as a complement to the official embind distribution. Official embind typically offers
a richer, stable, and ergonomic binding surface, but that breadth also brings a larger surface area and a more
standardized distribution shape, which can make fine-grained control over variants and the exported ABI less direct.
Forge instead relies on a **flat ABI organization** and an **automated output structure** (stable dist layout + ABI
artifacts + rules/gates): most changes follow the loop `change rules -> regenerate -> gate`, which is designed to be
auditable and variant-friendly.

Forge is intended to be usable in **extreme deployment environments** (e.g. online demos with strict security and
resource constraints) and in **research prototyping** workflows, while still being a reasonable base for routine use.

The benchmark asks a concrete question:

> Under these design constraints, does forge stay in the same performance class as the official MuJoCo embind WASM build
> for Simulate-style workloads? (And when it differs, is the reason attributable and controllable?)

This is **not** a claim that forge is universally better than official embind. The two distributions have different
goals; the bench is here to make trade-offs visible, reproducible, and discussable.

## What we measure (organized by user needs)

| Need / scenario | Why we care | What we measure (metric) | Notes |
| --- | --- | --- | --- |
| Ship fast (web demo / docs) | Download + caching affects real users | Artifact size (`.js` + `.wasm`) | Measured from produced bundles |
| Page becomes interactive | Cold start is a hard UX constraint | Init time (`modFactory() -> Module.ready`) | Dominated by threading/runtime initialization |
| First meaningful output (Simulate) | First step / first snapshot is when users feel success | TTFS (ready -> first step / first scene snapshot) | Browser TTFS is end-to-end; Node TTFS isolates engine load/step |
| Predictable resource budget | Thread pools + compilation can cause spikes | RSS after init + **peak** RSS | Node provides more reliable RSS sampling than browsers |
| Smooth model load | XML compilation and mesh processing dominate load time | XML compile/load latency | Sensitive to threading defaults and asset shape |
| Stable steady-state simulation | After load, engine hot-loop dominates | `ms/step` / steps/sec | Requires comparable threading configuration for strict comparisons |
| Low JS <-> WASM overhead | Flat/handle-based APIs should reduce crossing cost | FFI microbench (ns/call for lightweight getters) | Microbench is intentionally small and repeatable |
| Robust lifecycle (reset/reload) | Demos frequently reset or reload models | Reload loop time + RSS drift | Amplifies leaks and binding overhead |
| Diagnosable failures | Public demos must fail with readable errors | Errmsg/errno gates (bad XML, missing plugin, missing assets) | Forge has explicit helper gates for this |
| Plugin baseline | Some Simulate models require plugins | Plugin availability smoke (e.g. `touch_grid`) | Treated as a product-level contract difference |

## Comparison matrix (what we compare)

| Label | Intended meaning | Threading policy |
| --- | --- | --- |
| `official-3.4.0` | Official embind baseline | As built upstream |
| `official-3.5.0-hc4` / `official-3.5.0-hc32` | Official embind (3.5.0) under different effective pool sizing | Pthreads enabled; pool approximated via `hardwareConcurrency` |
| `forge-3.4.0-single` | Forge baseline | Single-threaded |
| `forge-3.5.0-single` | Forge single-threaded variant | Single-threaded |
| `forge-3.5.0-pthreads` | Forge pthread variant | Pthreads enabled; pool defaults to 4 (override via `MJWF_PTHREAD_POOL_SIZE`; in 3.5.0 the default clamp equals the pool size) |

For strict (thread-matched) comparisons, we primarily compare:
`forge-3.5.0-pthreads (pool=4)` vs `official-3.5.0-hc4`.

## Summary findings (reference run)

The suite is designed to be rerun on different machines, so absolute numbers will vary. The key question is whether
forge's organization and flexibility introduce a performance penalty.

| Dimension | Observation (selected numbers) | Interpretation (what it means) | Caveats |
| --- | --- | --- | --- |
| Distribution size | `forge-3.5.0-single`: wasm 3.33 MiB / JS 256.1 KiB; `official-3.5.0-hc4`: wasm 8.24 MiB / JS 296.7 KiB | Minimal/flat surface can reduce shipping footprint | Not guaranteed across future upstreams |
| Init + memory | `forge-3.5.0-pthreads`: init 52.7ms / peak RSS 462.8 MiB; `official-3.5.0-hc4`: init 84.4ms / peak 659.2 MiB; `official-3.5.0-hc32`: RSS after init 507.9 MiB | Threading policy and pool sizing can dominate cost | RSS is platform-dependent; browsers differ |
| Plugin baseline | `sensor`/`touch_grid`: forge=ok, official=error | Forge can treat plugin availability as part of the dist contract | Official plugin packaging may evolve upstream |
| `ms/step` (thread-matched) | e.g. `raj` 0.086 vs 0.103 ms/step; `cards` 0.325 vs 0.343 (forge pthreads vs official hc4) | Forge organization does not inherently degrade simulation throughput | Small models can be dominated by measurement overhead |
| FFI microbench | `model_nq`: 14.71ns vs 33.56ns/call; `mj_version`: 4.23ns vs 5.24ns | Flat handles reduce high-frequency boundary cost | Microbench != whole-app performance |
| TTFS (Simulate) | `humanoid` ready->snapshot: 77ms (pthreads) vs 111ms (single); `cards`: 505ms (pthreads) vs 225ms (single) | End-to-end pthread benefit depends on model/load path | HUD metrics depend on environment/COI configuration |
| Reload / lifecycle | 50 iterations: 0.840ms/iter (forge pthreads) vs 1.003ms/iter (official hc4); RSS drift: +1.9 MiB vs +7.0 MiB | Lifecycle pressure can amplify binding overhead differences | Not a proxy for every workload |

### Reference snapshot tables

Reference run environment (for the snapshot tables below): Windows x64 / Node `v22.16.0` / 32 logical CPUs.
Unless noted otherwise:
- Node metrics are medians of 5 runs.
- Browser HUD metrics default to 3 runs (median).
- Sizes and memory are reported in KiB/MiB (1024-based).
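
The aggregation conventions above can be sketched in a few lines of Node. This is a minimal illustration, not the suite's actual reporting code: the helper names (`median`, `toKiB`, `toMiB`) and the sample values are assumptions made for the example.

```javascript
// Median of a list of samples (e.g. 5 Node runs or 3 browser HUD runs).
function median(samples) {
  if (samples.length === 0) throw new Error("median of empty sample set");
  const sorted = [...samples].sort((a, b) => a - b); // numeric sort, non-mutating
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1
    ? sorted[mid]
    : (sorted[mid - 1] + sorted[mid]) / 2;
}

// 1024-based unit conversions, matching the KiB/MiB convention above.
const toKiB = (bytes) => bytes / 1024;
const toMiB = (bytes) => bytes / (1024 * 1024);

// Illustrative samples only (not taken from the reference run).
const initSamplesMs = [12, 10, 11, 15, 9];
console.log(`init median: ${median(initSamplesMs)} ms`);
console.log(`2048 bytes = ${toKiB(2048)} KiB`);
```

Reporting the median rather than the mean keeps a single slow run (for example one hit by a background process) from shifting the reported value.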

### Snapshot (Node: bundle / init / memory)

| Label | JS (KiB) | WASM (MiB) | Init (ms) | RSS after init (MiB) | RSS peak (MiB) |
| --- | --- | --- | --- | --- | --- |
| `forge-3.5.0-single` | 256.1 | 3.33 | 19.4 | 59.8 | 208.8 |
| `forge-3.5.0-pthreads` | 272.6 | 3.33 | 52.7 | 110.7 | 462.8 |
| `official-3.5.0-hc4` | 296.7 | 8.24 | 84.4 | 125.5 | 659.2 |
| `official-3.5.0-hc32` | 296.7 | 8.24 | 169.8 | 507.9 | 873.9 |

### Model load + stepping (thread-matched, Node: `pool=4` vs `hc=4`)

| Model | Forge compile/load (ms) | Official compile/load (ms) | Forge steady `ms/step` | Official steady `ms/step` |
| --- | --- | --- | --- | --- |
| `cards` | 90.8 | 109.6 | 0.325 | 0.343 |
| `raj` | 303.4 | 384.3 | 0.086 | 0.103 |
| `humanoid` | 5.4 | 7.6 | 0.020 | 0.021 |
| `flex_bunny` | 59.5 | 102.5 | 0.426 | 0.465 |
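
To read the table as relative differences, one option is to compute the forge/official ratio per model. The sketch below copies the numbers from the table above; the field and function names are chosen for this example and are not part of the bench tooling.

```javascript
// Rows copied from the thread-matched table above (forge pool=4 vs official hc=4).
const rows = [
  { model: "cards", forgeLoad: 90.8, officialLoad: 109.6, forgeStep: 0.325, officialStep: 0.343 },
  { model: "raj", forgeLoad: 303.4, officialLoad: 384.3, forgeStep: 0.086, officialStep: 0.103 },
  { model: "humanoid", forgeLoad: 5.4, officialLoad: 7.6, forgeStep: 0.020, officialStep: 0.021 },
  { model: "flex_bunny", forgeLoad: 59.5, officialLoad: 102.5, forgeStep: 0.426, officialStep: 0.465 },
];

// A ratio below 1 means forge spent less time than official in this reference run.
const loadRatio = (row) => row.forgeLoad / row.officialLoad;
const stepRatio = (row) => row.forgeStep / row.officialStep;

for (const row of rows) {
  console.log(
    `${row.model}: load ratio ${loadRatio(row).toFixed(3)}, step ratio ${stepRatio(row).toFixed(3)}`
  );
}
```

For small models such as `humanoid`, the step times differ only in the last reported digit, so the ratio is at the limit of what the table's precision supports.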

### Functional baseline (Node: plugin-dependent models)

| Model | Forge 3.5.0 single | Official 3.5.0 (hc=4) |
| --- | --- | --- |
| `sensor` | ok | error |
| `touch_grid` | ok | error |

### FFI microbench (Node: apples-to-apples subset)

> This microbench times 1e6 calls per metric. It is meant for comparing relative organization/binding overhead, not for
> extrapolating whole-app throughput.

| Metric | Forge 3.5.0 pthreads ns/call (median) | Official 3.5.0 ns/call (median) |
| --- | --- | --- |
| `mj_version` | 4.23 | 5.24 |
| `model_nq` | 14.71 | 33.56 |
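
A ns/call figure of this kind can be obtained with a simple timed loop. The sketch below is an assumption about the general shape of such a microbench, not the suite's implementation; `getNq` is a hypothetical stand-in for a lightweight getter, and a real run would call into the loaded module instead.

```javascript
// Time `calls` invocations of `fn` and return nanoseconds per call.
function nsPerCall(fn, calls = 1e6) {
  let sink = 0; // accumulate results so the calls are less likely to be optimized away
  // Warm-up so the JIT has a chance to compile the hot path before timing starts.
  for (let i = 0; i < 10000; i++) sink += fn();
  const start = process.hrtime.bigint();
  for (let i = 0; i < calls; i++) sink += fn();
  const elapsedNs = Number(process.hrtime.bigint() - start);
  return { nsPerCall: elapsedNs / calls, sink };
}

// Hypothetical stand-in for a getter such as `model_nq`.
const fakeModel = { nq: 28 };
const getNq = () => fakeModel.nq;

const result = nsPerCall(getNq);
console.log(`getter: ${result.nsPerCall.toFixed(2)} ns/call`);
```

Because the loop itself costs a few nanoseconds per iteration, results like these are most useful as relative comparisons between builds measured the same way.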

### Simulate TTFS (Browser: forge variants only)

> Browser numbers come from `mujoco-wasm-play` + Playwright HUD sampling (default: 3 runs, median).
> In some environments, the pthread variant may not expose HUD `CPU (ms/step)`, but TTFS and FPS are still sampled.

| Model | Variant | TTFS ready->snapshot median (ms) | TTFS nav->snapshot median (ms) | FPS median |
| --- | --- | --- | --- | --- |
| `humanoid` | single | 111 | 322 | 120 |
| `humanoid` | pthreads | 77 | 287 | 120 |
| `cards` | single | 225 | 499 | 120 |
| `cards` | pthreads | 505 | 791 | 120 |
| `raj` | single | 338 | 570 | 121 |
| `raj` | pthreads | 270 | 510 | 45.5 |

### Reload / lifecycle (thread-matched, Node: `pool=4` vs `hc=4`)

| Label | Iterations | ms/iter | RSS drift (MiB) | RSS peak (MiB) |
| --- | --- | --- | --- | --- |
| `forge-3.5.0-pthreads` | 50 | 0.840 | 1.9 | 462.8 |
| `official-3.5.0-hc4` | 50 | 1.003 | 7.0 | 659.2 |
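
The columns above correspond to a loop of roughly the following shape. This is a sketch under assumptions: `load` and `unload` are hypothetical callbacks standing in for whatever model load/free calls a given build exposes, and the real harness may sample memory differently.

```javascript
const bytesToMiB = (bytes) => bytes / (1024 * 1024);

// Run `iterations` load/unload cycles and report time per iteration,
// RSS drift (end minus start), and the highest RSS observed.
function reloadLoop(load, unload, iterations = 50) {
  const rssStart = process.memoryUsage.rss();
  let rssPeak = rssStart;
  const t0 = performance.now();
  for (let i = 0; i < iterations; i++) {
    const handle = load();
    unload(handle);
    // Sampling inside the loop adds a small cost to ms/iter; acceptable for a sketch.
    rssPeak = Math.max(rssPeak, process.memoryUsage.rss());
  }
  const elapsedMs = performance.now() - t0;
  const rssEnd = process.memoryUsage.rss();
  return {
    iterations,
    msPerIter: elapsedMs / iterations,
    rssDriftMiB: bytesToMiB(rssEnd - rssStart),
    rssPeakMiB: bytesToMiB(Math.max(rssPeak, rssEnd)),
  };
}

// Hypothetical stand-ins: allocate a buffer as the "model", then drop it.
const stats = reloadLoop(
  () => new Float64Array(1024),
  (handle) => handle.fill(0)
);
console.log(stats);
```

RSS drift is noisy for short loops because the allocator and garbage collector release memory lazily, which is one reason to compare builds over the same iteration count.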

## Limitations (what this bench may still miss)

- Official embind and forge necessarily use different JS-level APIs, even when the underlying engine version is comparable.
- Browser pthread deployments require COOP/COEP + `SharedArrayBuffer`; deployment constraints are part of the trade-off.
- Some metrics (especially browser memory) are harder to measure precisely and are currently more reliable in Node.
- Being the fastest possible build is not the goal; the goal is to deliver forge's capabilities without unacceptable performance regression.
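
The COOP/COEP requirement mentioned above amounts to two response headers on the page that loads the pthread build. The sketch below shows one way to apply them and to fall back to the single-threaded variant when isolation is unavailable. The helper names are assumptions for illustration; the header names and the `crossOriginIsolated` global are standard web platform features.

```javascript
// Headers that opt a page into cross-origin isolation, which browsers
// require before exposing SharedArrayBuffer (needed by pthread builds).
const COI_HEADERS = {
  "Cross-Origin-Opener-Policy": "same-origin",
  "Cross-Origin-Embedder-Policy": "require-corp",
};

// Apply the headers to any response object with a Node-style setHeader().
function applyCrossOriginIsolation(res) {
  for (const [name, value] of Object.entries(COI_HEADERS)) {
    res.setHeader(name, value);
  }
  return res;
}

// Choose a variant at runtime: `crossOriginIsolated` is true in browsers
// only when isolation is actually in effect for the current page.
function pickVariant(env = globalThis) {
  const isolated =
    env.crossOriginIsolated === true && typeof env.SharedArrayBuffer === "function";
  return isolated ? "pthreads" : "single";
}

// Demo with a mock response object and a mock non-isolated environment.
const recorded = {};
applyCrossOriginIsolation({ setHeader: (name, value) => { recorded[name] = value; } });
console.log(recorded);
console.log(pickVariant({ crossOriginIsolated: false }));
```

In a Node `http` server, `applyCrossOriginIsolation(res)` would be called in the request handler before the response body is written.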

## Reproducing the bench

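See `bench/README.md` for the full procedure. The short version is three steps: build the official embind baselines,
run the Node matrix and generate the report, then run the browser bench: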

```powershell
powershell -ExecutionPolicy Bypass -File tools/build_official_embind.ps1 -Ref 3.4.0 -OutDir C:\dev\mjwf-bench\official\3.4.0 -Clean
powershell -ExecutionPolicy Bypass -File tools/build_official_embind.ps1 -Ref 3.5.0 -OutDir C:\dev\mjwf-bench\official\3.5.0
```

```bash
node bench/node/run_matrix.mjs --official340 C:/dev/mjwf-bench/official/3.4.0 --official350 C:/dev/mjwf-bench/official/3.5.0
node bench/node/report.mjs
```

```powershell
powershell -ExecutionPolicy Bypass -File bench/browser/run_playwright_bench.ps1
```