Benchmarking
Capsem includes capsem-bench, a Python benchmarking tool that runs inside the VM. It outputs rich tables to stderr for humans and saves structured JSON to /tmp/capsem-benchmark.json for machine consumption.
The default capsem-bench all run includes storage split diagnostics so Linux
and macOS artifacts carry the same rootfs/workspace/tmpfs attribution data.
Running benchmarks
Section titled “Running benchmarks”just benchmark # Standard artifact-recording benchmark suitejust bench # Alias for just benchmarkjust benchmark-compare # Compare committed Linux/macOS artifactsjust run "capsem-bench disk" # Disk I/O onlyjust run "capsem-bench rootfs" # Rootfs reads onlyjust run "capsem-bench storage" # Rootfs/workspace/tmpfs splitjust run "capsem-bench startup" # CLI cold-start onlyjust run "capsem-bench http" # HTTP through proxyjust run "capsem-bench throughput" # 100MB downloadjust run "capsem-bench snapshot" # Snapshot operations onlyjust run "capsem-bench mitm-load" # MITM proxy concurrency/load testjust run "capsem-bench mcp-load" # Guest MCP endpoint concurrency/load testjust run "capsem-bench dns-load" # DNS proxy concurrency/load testcargo bench -p capsem-security-engine --bench security_engine_celuv run pytest tests/capsem-serial/test_security_engine_benchmark.py -xvsjust full-test # Full validation including benchmarksBoot timing
Section titled “Boot timing”Boot timing is measured independently from capsem-bench. The guest init script (capsem-init) records the wall-clock duration of each boot stage using /proc/uptime. The PTY agent sends these measurements to the host over the vsock control channel, where they are displayed as an inline table with a proportional bar chart.
Measured stages
Section titled “Measured stages”| Stage | What happens |
|---|---|
squashfs | Mount the compressed read-only rootfs from the virtio block device |
virtiofs | Mount the VirtioFS shared directory from the host |
overlayfs | Create the overlay filesystem (ext4 loopback upper + squashfs lower) |
workspace | Bind-mount /root from the VirtioFS workspace |
network | Configure dummy0 interface and iptables DNS/HTTPS redirect rules |
dns_proxy | Start capsem-dns-proxy and bridge DNS to host vsock:5007 |
net_proxy | Start the TCP-to-vsock proxy for HTTPS interception |
deploy | Copy MCP server, capsem-doctor, capsem-bench, and diagnostics from initrd |
venv | Create the Python virtualenv (uses uv for speed) |
agent_start | Launch the PTY agent and connect vsock ports |
Invariant
Section titled “Invariant”The diagnostic suite enforces that total boot time stays under 1 second (test_environment.py::test_boot_time_under_1s). Stages exceeding 500ms are flagged as slow. The most common regression is venv — if uv is missing from the rootfs, Python falls back to python3 -m venv which is ~10x slower.
Benchmark categories
Section titled “Benchmark categories”Host-native baseline
Section titled “Host-native baseline”just benchmark records a host-native artifact under benchmarks/host-native/
on every run. It uses the same artifact envelope as VM benchmarks and records
UTC time, host CPU/RAM/OS metadata, git state, filesystem context, local disk
I/O, CLI startup, synthetic small-file reads, and metadata-stat throughput. Use
this artifact as the local bare-host reference for VM comparison; it is not
produced by capsem-bench inside the guest. By default the temporary host I/O
workload runs under target/host-native-benchmark so it measures the project
filesystem rather than /tmp tmpfs; override with
CAPSEM_HOST_NATIVE_BENCH_DIR when comparing a specific disk.
Cross-platform artifact comparison
Section titled “Cross-platform artifact comparison”Use just benchmark-compare after Linux and macOS have committed artifacts
from the same benchmark version. The command reads benchmarks/, compares
Linux x86_64 against macOS arm64, reports ratios and percentages for common
lanes, and lists missing lanes such as host-native or Criterion artifacts when
one side has not rerun the current just benchmark suite yet.
just benchmark also runs benchmark retention. Before the run, it copies the
current host architecture’s active generated artifacts into benchmarks/archive/
so same-version reruns do not silently overwrite the prior evidence. After the
run, active category directories keep the latest generated data_*.json for
each category, architecture, and benchmark lane; superseded generated artifacts
are zipped under benchmarks/archive/ with a manifest containing their paths,
hashes, version, architecture, lane, timestamp, and source commit. Historical
archives are for engineering provenance, while current docs and performance
claims should cite the active latest artifacts.
Disk I/O (disk)
Section titled “Disk I/O (disk)”Measures scratch disk performance in /root (VirtioFS-backed workspace).
| Test | Method | Metric |
|---|---|---|
| Sequential write | Write 256MB in 1MB blocks, fdatasync at end | Throughput (MB/s) |
| Sequential read | Read 256MB in 1MB blocks after drop_caches | Throughput (MB/s) |
| Random 4K write | 10,000 random pwrite calls on 64MB file, fdatasync per write | IOPS, throughput |
| Random 4K read | 10,000 random pread calls on 64MB file after drop_caches | IOPS, throughput |
Write test size is configurable via CAPSEM_BENCH_SIZE_MB (default: 256).
Rootfs reads (rootfs)
Section titled “Rootfs reads (rootfs)”Measures read performance on the compressed squashfs rootfs where binaries and libraries live.
| Test | Method | Metric |
|---|---|---|
| Sequential read | Read the largest file in /usr/bin, /usr/lib, /opt/ai-clis in 1MB blocks | Throughput (MB/s) |
| Random 4K read | 5,000 random pread calls across all rootfs files (>4KB) | IOPS, throughput |
Storage split diagnostics (storage)
Section titled “Storage split diagnostics (storage)”Measures rootfs reads plus writable-path I/O across /root, /tmp,
/var/tmp, /var/log, and /run by default. Use it when Linux and macOS
benchmarks diverge and you need to separate VirtioFS workspace costs from
tmpfs, overlayfs, squashfs/rootfs reads, and host filesystem behavior.
This section is recorded by the canonical just benchmark path because
capsem-bench all includes storage; the long-running load tests remain
explicit opt-ins.
The path set is configurable via CAPSEM_STORAGE_BENCH_PATHS; write test size
is configurable via CAPSEM_STORAGE_BENCH_SIZE_MB (default: 64). The detailed
I/O profile also records sequential 4K/64K/1M read/write IOPS and random 4K
read plus sync-write IOPS with latency percentiles. Its file size and random
operation count are configurable via CAPSEM_STORAGE_IO_PROFILE_SIZE_MB
(default: 64) and CAPSEM_STORAGE_IO_PROFILE_RANDOM_OPS (default: 2000).
The rootfs section reports the booted squashfs compression and block/chunk size
from /dev/vda, plus overlay lower/upper/work directories when visible. The
top-level kernel section records /proc/cmdline, virtio block queue settings,
FUSE connection backpressure knobs, and known host-side KVM queue sizes.
CLI cold-start (startup)
Section titled “CLI cold-start (startup)”Measures wall-clock time to run <cli> --version with page cache dropped between runs. Each command is timed 3 times.
| Command | What it tests |
|---|---|
python3 --version | CPython interpreter startup |
node --version | Node.js runtime startup |
claude --version | Claude Code CLI (Node-based) |
gemini --version | Gemini CLI (Node-based) |
codex --version | Codex CLI (native binary + Node) |
HTTP (http)
Section titled “HTTP (http)”Measures HTTP throughput through the MITM proxy using concurrent GET requests.
- Default: 50 requests to
https://www.google.com/with concurrency 5 - Custom:
capsem-bench http <URL> <N> <C> - Reports: successful/failed count, requests/sec, latency percentiles (p50, p95, p99, min, max)
Each worker thread uses a persistent requests.Session. Latency includes the full round-trip: guest -> net-proxy -> vsock -> host MITM proxy -> internet -> response back.
Proxy throughput (throughput)
Section titled “Proxy throughput (throughput)”Downloads a ~10 MB PDF through the MITM proxy and reports end-to-end throughput.
Uses curl -L to download https://cdn.elie.net/static/files/i-am-a-legend/i-am-a-legend-slides.pdf (301-redirects to elie.net, so the selected profile must allow both hosts). This measures the maximum sustained bandwidth the proxy pipeline can deliver, including TLS termination, body inspection, and re-encryption.
Load tests (mitm-load, mcp-load, dns-load)
Section titled “Load tests (mitm-load, mcp-load, dns-load)”These modes are opt-in because they stress hot paths more aggressively than the default all suite.
| Mode | What it exercises |
|---|---|
mitm-load | Concurrent HTTPS requests through the MITM proxy |
mcp-load | Guest MCP framed transport and host endpoint dispatch |
dns-load | DNS redirect, capsem-dns-proxy, host DNS policy, and resolver path |
Security Engine CEL microbenchmarks
Section titled “Security Engine CEL microbenchmarks”The host-side Rust Criterion harness measures canonical Security Engine CEL paths without booting a VM:
cargo bench -p capsem-security-engine --bench security_engine_celcargo bench -p capsem-core --bench security_packsThe S08d harness covers CEL compile time, warm enforcement evaluation,
detection evaluation, backtest evidence deduplication, runtime registry
operations, compiled-plan rebuild cost, policy-context projection/
materialization, 100-rule last-match evaluation, Detection IR parse/lowering,
and a native Rust lookup comparator for the same HTTP policy. These numbers
explain runtime hot-path and rule-pack costs; they do not replace
VM-originated benchmark artifacts. just benchmark runs both Criterion
harnesses, archives their target/criterion estimates as JSON under
benchmarks/security-engine/, and then runs the VM-originated security
benchmark.
Security Engine VM-originated benchmarks
Section titled “Security Engine VM-originated benchmarks”The host-side serial benchmark measures the real VM-originated enforcement path for a process security event:
uv run pytest tests/capsem-serial/test_security_engine_benchmark.py -xvsThe first S08d paths install runtime CEL enforcement rules, send repeated
blocked process exec, blocked HTTPS request, blocked DNS lookup, and blocked
MCP tools/call workloads through live VMs, assert the expected block results,
check runtime match counters, verify canonical security_events rows in
session.db, and confirm logs exposes the Security Engine decision with
VM/profile/user/rule attribution. DNS artifacts also verify the legacy
dns_events row carries the runtime policy action and qname. MCP artifacts
verify mcp_calls policy fields and request-id-matched server/tool log
projection. Committed artifacts are written to
benchmarks/security-engine/.
Snapshot operations (snapshot)
Section titled “Snapshot operations (snapshot)”End-to-end latency for snapshot operations via the guest MCP endpoint. Tests at 3 workspace sizes (10, 100, 500 files of 4KB each):
| Operation | What it does |
|---|---|
create | Populate workspace, create a named snapshot via snapshots create |
list | List all snapshots with change diffs |
changes | List files changed since the last checkpoint |
revert | Revert a single modified file from the snapshot |
delete | Delete the snapshot |
Each operation is measured as the full round-trip: guest CLI -> MCP server (NDJSON over vsock) -> host gateway -> APFS filesystem operation -> response back to guest.
JSON output
Section titled “JSON output”All benchmarks save structured JSON to /tmp/capsem-benchmark.json inside the VM:
{ "version": "0.3.0", "timestamp": 1711561234.5, "hostname": "capsem", "disk": { "seq_write": { "throughput_mbps": 1180, ... }, ... }, "rootfs": { ... }, "startup": { "commands": { "python3": { "mean_ms": 9.0 }, ... } }, "http": { "requests_per_sec": 58, "latency_ms": { "p50": 67, ... } }, "throughput": { "throughput_mbps": 34.3, ... }, "snapshot": { "10_files": { "create_ms": 879, ... }, ... }, "dns_load": { "qname": "api.openai.com", "levels": [...] }}Adding a new benchmark
Section titled “Adding a new benchmark”- Create a new module in
guest/artifacts/capsem_bench/(e.g.,mytest.py) with amytest_bench()function that returns a dict and prints a Rich table to stderr - Add the mode name to
VALID_MODESincapsem_bench/__main__.py - Wire it into
main()with theif mode in ("name", "all"):pattern (lazy import) - Update the
dev-benchmarkskill and this page