Sam Stokes

Benchmark Reproducibility

TL;DR:

  • Benchmarking results must be reproducible to be trusted

  • OP Labs made some recent improvements to achieve reproducible results

  • Load test flakiness reduced from 42% to <1% after our reproducibility improvements

Measuring a chain’s sustained throughput is easy. Getting a number you can trust is much more difficult. We must be able to measure a chain’s performance in order to know what our current capabilities are and how changes to our software/architecture affect that performance. To reliably attribute performance differences across runs to changes in the software itself, we must eliminate other potential sources of variability first: the test tooling and methodology. If the results of our measurements are unreliable, then we risk wasting time re-running tests and chasing red herrings instead of addressing true performance bottlenecks.

At OP Labs we have a team dedicated to building benchmark tools that yield reproducible results, so that those results are reliable, actionable, and can be confidently shared with users and customers. We use an open-source tx-spammer called contender, which means external parties don’t have to simply take our word for the results. They can verify them by triggering a load test from contender that targets any evm-compatible rpc endpoint.

Potential causes of variability

  • hardware types: cloud or local machine types that host ELs, CLs, etc. during tests

  • chain config: gasLimit, gasTarget, block times, flashblocks enabled

  • EL config: mempool size

  • load test type: what type of txs are being mined during the load test; some tx types are much easier to process than others so we should be explicit about the traffic profile tied to any Mgas/s or mined-tps values.

  • load test duration: short bursts of high throughput are not as reliable as long-term sustained throughput

  • network latency: location of tx spammer relative to location of the target network/chain

How to eliminate test variability

Chain configuration consistency

The first place to start is to choose a sane, consistent default configuration for any chain we test. Most often we black-box load test an entire chain by spamming the outermost proxyd url and measuring results via the blocks produced by the chain during that load test. This allows us to closely mirror the chain’s production setup and to identify bottlenecks at any level, whether that’s proxyd ingress and other network hops, DA throughput, or the L2 block-builder itself. But each additional component in the chain adds new potential variability if we don’t keep its configuration consistent. If we change too many variables between consecutive runs, it becomes very difficult to pinpoint exactly what affected performance.
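As an illustration of that black-box measurement (not our actual tooling), here is a minimal sketch that walks a block range over standard JSON-RPC and derives Mgas/s and mined tps from gasUsed, tx counts, and block timestamps. The endpoint URL and block window below are placeholders.

```python
import requests

RPC_URL = "http://localhost:8545"  # placeholder: the proxyd (or any EL) endpoint under test

def rpc(method, params):
    resp = requests.post(RPC_URL, json={"jsonrpc": "2.0", "id": 1,
                                        "method": method, "params": params})
    resp.raise_for_status()
    return resp.json()["result"]

def throughput(start_block, end_block):
    """Derive Mgas/s and mined tps from the blocks produced during a load test."""
    total_gas, total_txs = 0, 0
    for n in range(start_block, end_block + 1):
        block = rpc("eth_getBlockByNumber", [hex(n), False])
        total_gas += int(block["gasUsed"], 16)
        total_txs += len(block["transactions"])
    # Elapsed time from the parent of the first block to the last block.
    parent = rpc("eth_getBlockByNumber", [hex(start_block - 1), False])
    last = rpc("eth_getBlockByNumber", [hex(end_block), False])
    elapsed = int(last["timestamp"], 16) - int(parent["timestamp"], 16)
    return total_gas / elapsed / 1e6, total_txs / elapsed

# Hypothetical block window covering the load test.
mgas_per_s, mined_tps = throughput(1000, 1030)
print(f"{mgas_per_s:.1f} Mgas/s, {mined_tps:.0f} mined tps")
```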

Reliable test tooling

The primary benchmark tooling we use is:

  • contender: open-source tx spammer; built by flashbots, significant contributions made by OP Labs engineers

  • op-benchmark: closed-source wrapper around contender which has awareness of and access to OP Labs devnet internals; gives us more detailed metrics/data and allows us to wipe mempools between each run so that every load test starts with the same clean slate (i.e. not affected by previous runs); a sketch of that pre-run check follows this list
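op-benchmark itself is closed source, but the “clean slate” idea is simple to sketch. Assuming the EL node exposes the geth/reth-style txpool namespace, a pre-run step can poll txpool_status until the pool has drained; the endpoint and timeouts here are placeholders, not op-benchmark’s real implementation.

```python
import time
import requests

RPC_URL = "http://localhost:8545"  # placeholder: an EL node that exposes the txpool namespace

def txpool_counts():
    """Read pending/queued tx counts via the geth/reth-style txpool_status RPC."""
    resp = requests.post(RPC_URL, json={"jsonrpc": "2.0", "id": 1,
                                        "method": "txpool_status", "params": []})
    resp.raise_for_status()
    status = resp.json()["result"]
    return int(status["pending"], 16), int(status["queued"], 16)

def wait_for_clean_mempool(timeout_s=120, poll_s=5):
    """Hold the next run until the pool has drained, so runs don't bleed into each other."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        pending, queued = txpool_counts()
        if pending == 0 and queued == 0:
            return
        time.sleep(poll_s)
    raise RuntimeError("mempool did not drain; the previous run would skew this one")

wait_for_clean_mempool()
```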

Contender supports a variety of tx-types (aka “scenarios” in the contender codebase) that we can use during a given load test. There are several built-in scenarios as well as support for custom toml-based scenarios. If we’re not consistent with the tx-types we use across load tests, that becomes a major source of variability in the final Mgas/s and mined-tps values we use to represent overall throughput. If “gas” were a perfect representation of the effort it takes a sequencer to process a tx, then tx-types would not matter; in practice it is not, so the traffic mix does. Currently OP Labs focuses on the following contender scenarios to compare performance for different chains/devnets:

  • erc20

    • token transfers from a limited pool of “sender addresses” to fuzzed token recipients

    • ~55kgas/tx

    • stresses storage since every tx hits unique storage slots

  • groth16Verify

    • groth16 zk proof verification

    • ~300kgas/tx

    • stresses cpu since elliptic curve operations are cpu-intensive

  • (future) uniV3

    • uniswap swaps of one token to another

    • common txs sent on many chains, applicable to many OP Enterprise customer targets

The worst performing scenario helps us set the threshold for the max gasTarget and gasLimit a given chain can use before it opens itself up to a DOS attack in which it cannot process valid user txs as fast as they are sent.
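To make that concrete, here is a rough back-of-the-envelope sketch of the sizing logic; the throughput number, block time, and safety margin are illustrative assumptions, not recommendations for any real chain.

```python
def max_safe_gas_target(sustained_mgas_per_s: float, block_time_s: float,
                        safety_margin: float = 0.8) -> int:
    """Translate worst-case sustained throughput into a per-block gas budget.

    If the worst-performing scenario can only be mined at `sustained_mgas_per_s`,
    then a gasTarget much above that rate * block time lets an attacker submit
    valid txs faster than the sequencer can process them.
    """
    return int(sustained_mgas_per_s * 1e6 * block_time_s * safety_margin)

# Illustrative numbers only: a chain that sustains 90 Mgas/s on its worst
# scenario with 2s blocks would keep gasTarget at or below ~144M gas.
print(max_safe_gas_target(sustained_mgas_per_s=90, block_time_s=2))
```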

We also build custom scenarios for customers who expect specific apps/txs to consume significant block space on their chain. Our future plans include expanding the standard suite of scenarios we use as part of every benchmarking process to stress-test different dimensions of potential bottlenecks.

Recent improvements

For a while we unexpectedly got catastrophic failures on almost half of our black-box load tests that ran with significant load for >1 min. This made it difficult to differentiate when a load test found the maximum throughput versus when a transient issue corrupted our load test results. After implementing fixes across the benchmark tooling and core op-stack components, the load tests are now much more resilient and reliable. Transient events (e.g. a single rpc failure or an L1 derivation stall) now result in minor temporary hits to throughput instead of cascading failures that force the entire load test to be discarded (e.g. the introduction of a nonce gap that prevents all future txs from making it into a block).

Three changes did most of the work:

  • contender: harden against transient rpc failures (merged pr), so a single rpc failure can no longer cause the entire load test to fail due to a nonce gap

  • op-node config: use light-CL mode, which outsources L1 derivation to a separate rpc node instead of doing that work within the sequencer. This eliminates some contention within op-node since block building and L1 derivation run in the same thread. This type of issue is usually described as a “derivation pipeline stall” or “FCU avalanche”.

  • op-reth config: increase mempool max pending/queued txs so that we are resilient to minor transient blips that temporarily throttle the block-builder’s throughput. This prevents the mempool from reaching its capacity, at which point it starts dropping txs, leading to nonce gaps in the txs being sent by contender (a rough sizing sketch follows).
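Here is a back-of-the-envelope sketch of that mempool sizing; the stall duration and headroom factor are assumptions for illustration, not the values we actually use.

```python
def min_mempool_capacity(send_tps: float, stall_s: float, headroom: float = 2.0) -> int:
    """Rough lower bound on pending-tx capacity needed to ride out a transient stall.

    While the block-builder is throttled for `stall_s` seconds, the spammer keeps
    sending at `send_tps`, so roughly send_tps * stall_s txs pile up in the pool.
    If capacity is below that, txs get dropped and nonce gaps appear.
    """
    return int(send_tps * stall_s * headroom)

# Illustrative only: at 1700 tps, riding out a 30s blip means roughly 51k txs
# accumulate, so with 2x headroom the pool needs room for ~102k pending txs.
print(min_mempool_capacity(send_tps=1700, stall_s=30))
```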

Results

Before improving our benchmark reproducibility, our results were shaky. When we ran the same load test (i.e. same test config against the same target chain), the test would pass most of the time but also frequently fail on re-runs. A “pass” in this context means all sent txs were successfully mined and the chain generally was healthy. For one particular high load test (scenario: erc20, tps: 1700) we had a 58/42 pass/fail split.

After the fixes and reconfigurations, those same re-runs succeeded 100% of the time (n=66). That is a huge improvement that allows us to trust the results and avoid wasting time debugging flaky tests. This means we can devote more time to identifying and addressing actual performance bottlenecks, which delivers real value to customers and users.
