
Alberto Cuesta Cañada

Stefano Charissis

March 11, 2026

Benchmarking the OP Stack

TL;DR:

  • We’re investing heavily in benchmarking the OP Stack, with a focus on end-to-end benchmarks on full devnets

  • We treat bottleneck identification as a first-class deliverable, not a footnote

  • We measure sustainable ‘real work’ under representative workloads with UX guardrails

  • This is the first post in our benchmarking blog series

Most blockchain benchmarks are great at producing a number but not so good at explaining what it means. You’ll see “X TPS” or “Y gas/sec,” but you won’t see the transaction mix, the hardware, the state shape, the failure modes, the end-to-end throughput, or whether the system stayed stable once the load was sustained. Too often, our industry fixates on TPS, and we end up optimizing the wrong thing: a headline number instead of sustained, user-visible capacity.

For engineers and operators, this framing provides an incomplete picture and is almost unusable. If you can’t reproduce a result, you can’t compare releases. If you can’t explain the bottleneck, you can’t fix it. That’s why we treat bottlenecks as a first-class output: every run should tell you what limited throughput and why.

For these reasons, and more, we’re building a workload-based, end-to-end benchmarking framework for the OP Stack: to measure real capacity under realistic conditions, with guardrails and evidence.

The goal isn’t a flashy peak metric. It’s to make scaling work predictable. Scaling is often the ultimate goal of benchmarking, but to achieve it we must first embrace being in a state of perpetual learning: learning what the system can actually sustain, what breaks first, and how different choices (protocol, client, config, infra) change the shape of the performance envelope.

Benchmarking is how we turn that learning into something useful: a shared, defensible source of truth that engineers can iterate on, operators can plan around, and partners can trust.

This post is an introduction to how we think about benchmarking the OP Stack: what we measure, why “TPS” alone doesn’t cut it anymore, and what makes our approach different from the status quo.

Over the next posts, we’ll dig into definitions (what we count), methodology (how we make results reproducible), and how benchmark outputs translate into real engineering decisions.

Why benchmarking matters (and why it’s harder than it looks)

At OP Labs, we’re investing heavily in benchmarking because we think it’s one of the highest-leverage ways to make the OP Stack faster, safer, and easier to operate. We don’t want performance to be a collection of one-off tests, heroic debugging sessions, or benchmark folklore. To push the limits of Ethereum scaling, we need a measurement system that’s as serious as the engineering.

If you’ve spent any time in crypto, you’ve seen the classic performance debate:

“We do X TPS, and Y gas/sec… We’re faster because our number is bigger.”

Those numbers aren’t wrong. They’re just usually incomplete because they’re answering a different question than the one people actually care about:

“How much real work can the system sustain, end-to-end, without user experience degrading?”

That’s the benchmark question we care most about. And when a system is pushed to its limits, it rarely fails in a tidy, single-variable way. It fails as a system:

  • Latencies creep up

  • Inclusion time increases (txs sit in the mempool / queue longer before inclusion)

  • Tail percentiles get spiky

  • Verifier nodes fall behind

  • Engineering teams start getting paged

The important part is: these aren’t mysteries. Before a chain “goes down,” you can usually see it coming in the metrics. Benchmarking is how we surface those early warning signals under controlled conditions, so we can raise real capacity without learning the hard way.

Full-Network, End-to-End Benchmarks

The status quo: execution-only benchmarking

Most performance numbers in crypto still come from execution-only tests: one node, one client, one narrow slice of the system. Even in the L2 world — where “performance” is clearly more than just EVM execution — most benchmarks you see are still shaped like execution-layer tests.

There are a few notable exceptions, but it’s still uncommon to treat the entire rollup pipeline as the thing you benchmark by default. That’s too narrow. It’s a bit like testing how powerful a car engine is on a stand and declaring you’ve measured the car. Useful? Sure. Sufficient? Not even close.

Execution-only benchmarks usually answer:

  • “How fast can a single node execute a specific kind of transaction?”

  • “What’s the biggest number we can hit in a clean, controlled setup?”

They’re often less helpful for the questions protocol engineers and operators end up caring about: end-to-end inclusion behavior, stability under sustained load, and what the current bottleneck actually is.

Our approach: full devnet, end-to-end (with a bottleneck diagnosis)

Our approach starts from the opposite direction: we benchmark the OP Stack end-to-end on a full devnet, because that’s where real limits show up — queueing, coordination costs, tail latency, and the failure modes you only see when the whole system is under pressure.

We want system-level truth under realistic traffic patterns. End-to-end devnet benchmarks answer:

  • “How much can the whole system sustain before user experience starts to degrade?”

  • “What breaks first, and why?”

  • “If we change one thing, what bottleneck moves next?”

For rollups, “performance” is the product of a whole pipeline working together: transaction ingestion and propagation, sequencing and block building, execution, state access and storage behavior, L1 coordination and publishing, node health under sustained load, and everything surrounding that (plus the messy stuff: variance, failure modes, recovery).

This isn’t the easiest path. Full-system tests are noisier than single-node tests, and they’re harder to make perfectly repeatable. But they also surface the problems that actually matter in production: inclusion time blowing out, failure rates rising, backlogs that don’t drain, and nodes falling behind.

And importantly: every run is expected to end with a bottleneck diagnosis, not just a headline number. Once an end-to-end run points to a real limiter, we can zoom in with microbenchmarks to measure that component in isolation — but we start with the full system so we don’t optimize the wrong thing.

A simple example: you double sequencer block-building performance, but blocks can't reach verifier nodes fast enough via p2p, so verifier nodes fall behind. End-to-end benchmarks make that obvious.
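That example can be sketched as a toy pipeline model, where end-to-end throughput is capped by the slowest stage. The stage names and tx/s figures below are invented for illustration, not measured OP Stack numbers:

```python
# Toy pipeline model of an end-to-end rollup: system throughput is capped by
# the slowest stage, so speeding up a non-bottleneck stage changes nothing.
# Stage names and numbers are illustrative, not real measurements.

stages = {
    "ingestion": 5_000,          # txs/sec each stage sustains in isolation
    "block_building": 1_200,
    "p2p_propagation": 900,
    "verifier_execution": 1_500,
}

def system_throughput(stages: dict) -> tuple:
    """Return the limiting stage and the throughput it caps the system at."""
    bottleneck = min(stages, key=stages.get)
    return bottleneck, stages[bottleneck]

print(system_throughput(stages))   # ('p2p_propagation', 900)

# "Double sequencer block-building performance"...
stages["block_building"] *= 2
# ...and the system is still stuck at the same ceiling:
print(system_throughput(stages))   # ('p2p_propagation', 900)
```

A single-node benchmark of the block builder would report a 2x win here; the end-to-end run shows nothing changed for users.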

TPS & Gas/s

TPS is a tempting number to use: it’s a standard across the tech industry, familiar and intuitive. The problem is that not all transactions are the same.

A simple ETH transfer costs ~21,000 gas. It touches two account balances, executes no contract logic, and carries no calldata. It's lightweight, predictable, and parallelizable — two transfers with non-overlapping accounts have zero state dependencies and can run concurrently.

On the other end of the spectrum, onchain ZK proof verification (Groth16 or PLONK) runs ~200,000 to 2,000,000+ gas. Most of that cost comes from elliptic curve pairing operations via the ecPairing precompile — 45,000 base gas plus 34,000 per point pair — with the proof, public inputs, and verification key all landing as calldata. Memory expansion is significant, and the call puts real CPU pressure on the node.
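The arithmetic above can be made concrete. This sketch uses the gas figures from the text (a 21,000-gas transfer; ecPairing at 45,000 base plus 34,000 per point pair, per EIP-1108); the 4-pair Groth16 check and the 30M-gas block are illustrative assumptions:

```python
# Gas arithmetic from the text: a plain ETH transfer vs. the ecPairing
# precompile used by onchain ZK proof verifiers (EIP-1108 pricing).
# The 4-pair Groth16 check and the 30M-gas block are assumptions for
# illustration; real verifiers also pay for calldata, memory, and logic.

TRANSFER_GAS = 21_000        # intrinsic cost of a simple ETH transfer
ECPAIRING_BASE = 45_000      # ecPairing base cost
ECPAIRING_PER_PAIR = 34_000  # ecPairing cost per point pair

def ecpairing_gas(pairs: int) -> int:
    """Gas charged by the ecPairing precompile for `pairs` point pairs."""
    return ECPAIRING_BASE + ECPAIRING_PER_PAIR * pairs

# A Groth16 verifier typically performs one 4-pair pairing check:
groth16_pairing_gas = ecpairing_gas(4)            # 181_000 gas
ratio = groth16_pairing_gas / TRANSFER_GAS        # ~8.6 transfers' worth

# In a hypothetical 30M-gas block: ~1,428 transfers fit, but the pairing
# check alone costs more CPU per unit of gas than a transfer does.
transfers_per_block = 30_000_000 // TRANSFER_GAS  # 1_428
```

The point isn’t the exact numbers; it’s that a “transaction” spans two orders of magnitude in gas, and the CPU load per unit of gas varies on top of that.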

And two chains running identical gas/s numbers can be bottlenecked on completely different resources — one on execution, one on state root computation — requiring entirely different scaling investments to fix.

The same amount of gas. Completely different load on the network.

This is why headline TPS is the wrong metric — it flattens workload complexity into a single number that tells you almost nothing about real throughput. Meaningful benchmarking measures what the system can sustain under representative workloads, and uses those results to surface bottlenecks, not just scores.

A rollup can “do a lot of TPS” by pushing cheap, homogeneous transactions that don’t look like what users do in the real world. Gas/sec has a similar issue: it compresses a multi-dimensional system into a single scalar that can be optimized in ways that don’t necessarily improve user outcomes.

As workloads diversify (more contract-heavy activity, different calldata patterns, different state access behavior, different latency sensitivity) you start caring less about “how many transactions” and more about:

  • What kinds of actions are those transactions representing?

  • Are they sustainable over time?

  • Do inclusion time and failure rates stay healthy?

  • What breaks first when you turn the dial up?

That’s another reason we’re focused on measuring real work under representative workloads with guardrails.

Workloads

Instead of chasing one universal headline number, we define workloads: versioned, documented mixes of transaction types that represent a real category of onchain activity.

This matters because onchain activity is diverging fast. A stablecoin payment workload looks nothing like a DeFi trading workload — different transaction types, different state access patterns, different resource constraints, different latency requirements. As the OP Stack powers more fintech, neo-finance, and institutional use cases alongside consumer DeFi, a single throughput number becomes not just incomplete but actively misleading. It tells you a chain is fast without telling you what it's fast at.

Workload-based benchmarking is how we solve this. Each workload has a name and a version, a defined transaction mix, and an intended interpretation, giving operators and partners an apples-to-apples basis for evaluating the system against their specific use case. If you're building a stablecoin payment rail, you should know exactly how the chain performs under a payment workload, not a generic one that doesn't represent your users.

Workloads also stop you from accidentally benchmarking the wrong thing. A concrete example: a stablecoin payment workload is data-availability-bound — high transaction count, low compute per transaction. A DeFi trading workload is execution-bound — high compute, significant state contention from concurrent pool interactions. Same chain, same hardware, completely different binding constraint. If your benchmark doesn't reflect the actual workload, you'll optimize the wrong thing and never know it.
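As a sketch, a versioned workload can be as simple as a named, validated transaction mix. The shape, workload names, and mixes below are hypothetical, not OP Labs’ actual workload definitions:

```python
# Hypothetical shape of a versioned workload definition: a name, a version,
# and a transaction mix that must sum to 1.0. All values here are invented
# for illustration, not real OP Labs workload data.
from dataclasses import dataclass

@dataclass(frozen=True)
class Workload:
    name: str
    version: str
    tx_mix: dict  # transaction type -> fraction of traffic

    def __post_init__(self):
        total = sum(self.tx_mix.values())
        if abs(total - 1.0) > 1e-9:
            raise ValueError(f"tx mix must sum to 1.0, got {total}")

# A payment workload: many cheap transfers, little compute.
STABLECOIN_PAYMENTS = Workload(
    name="stablecoin-payments",
    version="1.0.0",
    tx_mix={"erc20_transfer": 0.85, "erc20_approve": 0.10, "eth_transfer": 0.05},
)

# A trading workload: heavy compute and state contention per transaction.
DEFI_TRADING = Workload(
    name="defi-trading",
    version="1.0.0",
    tx_mix={"swap": 0.60, "add_liquidity": 0.15, "erc20_approve": 0.25},
)
```

Because each mix is named and versioned, a result like “X actions/sec on stablecoin-payments v1.0.0” stays comparable across releases in a way a bare TPS number never can.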

Guardrails

Throughput doesn’t really count if you have to break UX to get it. So we treat throughput as valid only when the system stays inside defined “guardrails,” such as:

  • Transaction inclusion time staying within acceptable bounds

  • Failure rates staying low

  • Stability under sustained load (no unbounded backlogs)

  • Predictable behavior at the tail

This also means that new capacity consumed entirely by spam doesn’t count as added throughput; meaningful gains help users take action.

This is the heart of “max sustained” benchmarking: we don’t accept wins that degrade the user experience. If p50 looks great but p99 is on fire, you didn’t win; you just moved the problem.
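A guardrail check can be sketched as a simple validity filter on a run’s reported throughput. The thresholds below are invented for this sketch, not the OP Stack’s actual guardrails:

```python
# Illustrative guardrail check: throughput from a run only "counts" if the
# system stayed inside UX bounds for the whole run. Thresholds are invented.

GUARDRAILS = {
    "p50_inclusion_s": 2.0,   # median inclusion time bound (seconds)
    "p99_inclusion_s": 10.0,  # tail inclusion time bound (seconds)
    "failure_rate": 0.01,     # at most 1% failed transactions
}

def sustained_throughput(measured_tps: float, metrics: dict) -> float:
    """Return measured throughput if every guardrail held, else 0 (invalid run)."""
    within_bounds = all(metrics[k] <= bound for k, bound in GUARDRAILS.items())
    backlog_ok = metrics["backlog_drained"]  # no unbounded queue growth
    return measured_tps if (within_bounds and backlog_ok) else 0.0

good_run = {"p50_inclusion_s": 1.2, "p99_inclusion_s": 7.5,
            "failure_rate": 0.004, "backlog_drained": True}
bad_run = dict(good_run, p99_inclusion_s=45.0)  # p50 looks great, p99 on fire

print(sustained_throughput(1000.0, good_run))  # 1000.0
print(sustained_throughput(1000.0, bad_run))   # 0.0 -- that "win" doesn't count
```

Note the `bad_run` case: a healthy median with a blown-out tail is exactly the “p50 looks great but p99 is on fire” failure the guardrails are there to catch.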

Bottlenecks are a Deliverable

A benchmark that ends with a report full of numbers and charts is only half a benchmark. What we want from every meaningful run is:

  • A defensible headline result, and

  • A clear, evidence-backed explanation of what limited us

This turns performance tuning from a guessing game into a science. Accompanying our charts are the artifacts engineers actually use: system metrics across percentiles, profiling/tracing outputs, and structured context to compare runs meaningfully.

The goal is to shorten the loop from: “we saw a regression / we hit a ceiling” to “here’s the evidence, here’s the limiter, here’s what to fix next.”

That loop is the compounding advantage.

The Goal

In summary, these are the goals of our benchmarking platform:

  • Proficiency: build a standardized, repeatable system we trust (and can explain)

  • Confidence: make production changes without fear and catch regressions early

  • Transparency: publish our methodology openly so the ecosystem can reproduce and build on our results

  • Evidence: inform our scalability roadmap by prioritizing what actually moves the limit

What’s Next

This post was the “why” and the big-picture approach: measure real performance end-to-end, on a full devnet, with guardrails.

In the next posts, we’ll get more specific. Planned topics include:

  • Onchain Actions/sec: what we mean by “real work,” what we count, and what we don’t

  • Workloads: how we define and version transaction mixes so results stay comparable over time

  • Guardrails: what “sustained” really means, and why p95/p99 matter as much as the headline number

  • Reproducibility: how we make runs consistent and defensible across releases and environments

  • Bottlenecks: how we go from a benchmark result to a clear “this is the limiter” story engineers can act on

If you’ve ever looked at a TPS chart and thought “cool, but what does this actually mean?”, we’ll make that concrete in the next posts.
