Background: Why Timing Matters
Race conditions in web applications have been exploitable in theory for as long as concurrent request processing has existed. In practice, the window has always been too narrow to hit reliably. A classic time-of-check-to-time-of-use (TOCTOU) gap in a coupon redemption endpoint might be 50 microseconds wide. If your two requests arrive 2 milliseconds apart — a gap that feels instantaneous to a human but spans millions of CPU cycles — the server processes them sequentially and the race never fires.
James Kettle's Smashing the State Machine research, presented at Black Hat USA 2023, changed the calculus. The core insight: HTTP/2 multiplexing allows multiple requests to be packed into a single TCP packet. When the server's kernel delivers that packet to userspace via a single read() syscall, all requests land in the same buffer at the same instant. The timing window collapses from network jitter (milliseconds) to kernel scheduling granularity (sub-microsecond). Suddenly, race conditions that were theoretically possible become reliably triggerable.
Arbiter implements this methodology as a fully automated pipeline spanning four execution strategies, a 12-type race taxonomy, and a multi-signal anomaly detection system. This article walks through our approach in detail.
The Predict-Probe-Prove Pipeline
The race engine follows a three-phase pipeline, analogous to Arbiter's constraint-driven approach but specialized for concurrency bugs:
- Predict — Identify candidate endpoints and classify potential race types from the state graph.
- Probe — Execute simultaneous requests using the optimal synchronization strategy.
- Prove — Analyze responses for anomalies that confirm exploitability.
The Predict phase leverages Arbiter's state graph — the same model used for constraint-driven discovery. Shared resource analysis identifies write-write conflicts — endpoints where POST, PUT, PATCH, or DELETE operations touch the same resource. Pattern matching against known race-prone paths (/transfer, /redeem, /vote, /reserve) provides additional candidates. But the real work happens in the Probe phase, where timing precision determines whether a race fires or fizzles.
HTTP/2 Single-Packet Attack
The primary execution strategy is the HTTP/2 single-packet attack. The goal is simple: deliver N complete HTTP requests to the server in a single TCP segment so that they are processed from the same kernel buffer in the same event loop iteration.
We use the h2 crate directly rather than hyper. This is a deliberate architectural decision — hyper abstracts away frame-level control, and we need to manipulate exactly when HEADERS and DATA frames are sent on the wire. The h2 crate gives us a SendRequest<Bytes> handle with direct control over stream lifecycle.
Phase 1: Stream Reservation
The attack begins by opening N HTTP/2 streams without completing them. For each request in the race batch:
1. Send HEADERS frame on a new HTTP/2 stream, withholding the body
- If the request has a body: do NOT set END_STREAM on HEADERS
- This keeps the stream open, waiting for DATA frames
2. Store the response handle and send handle for later use
At this point, the server has received N HEADERS frames and allocated N stream contexts, but no request is complete. The server is waiting for DATA frames (or END_STREAM) on each stream. All HEADERS frames travel over the same TCP connection — HTTP/2 multiplexing guarantees this. But they may arrive across multiple packets since we are sending them sequentially. That is fine; the critical synchronization happens next.
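Below is a minimal sketch of the reservation phase against the h2 crate's public API. The function name, the Vec plumbing, and the error handling are ours, not Arbiter's actual code:

```rust
use bytes::Bytes;
use h2::client::{ResponseFuture, SendRequest};
use h2::SendStream;
use http::Request;

/// Phase 1 (sketch): send HEADERS for N requests without END_STREAM,
/// so every stream stays open awaiting its body.
async fn reserve_streams(
    mut h2: SendRequest<Bytes>,
    requests: Vec<Request<()>>,
) -> Result<Vec<(ResponseFuture, SendStream<Bytes>)>, h2::Error> {
    let mut prepared = Vec::with_capacity(requests.len());
    for req in requests {
        // Wait for the connection to accept another stream.
        h2 = h2.ready().await?;
        // end_of_stream = false: the server allocates a stream context
        // but cannot complete the request yet.
        let (response, send_stream) = h2.send_request(req, false)?;
        prepared.push((response, send_stream));
    }
    Ok(prepared)
}
```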
Phase 2: Simultaneous Body Release
Once all streams are reserved, a tight loop releases all request bodies at once:
For each prepared stream (tight loop, no awaits):
1. Send DATA frame with request body, setting END_STREAM=true
// All sends happen on the same H2 connection back-to-back
Because all send_data calls happen on the same h2 connection, the DATA frames are written to the same TCP socket buffer. With TCP_NODELAY enabled (Nagle disabled), the kernel flushes this buffer immediately. If the total frame data fits within a single TCP segment (MSS, typically 1460 bytes for Ethernet), all DATA frames leave in one packet.
On the server side, the kernel's TCP stack delivers the entire segment to userspace in a single read() call. The HTTP/2 framing layer demultiplexes the DATA frames and completes all N requests in the same parsing pass. Each request enters the application's handler within the same event loop tick — the timing window between them is limited to the server's frame parsing overhead, typically single-digit microseconds.
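In code, the release loop is nothing more than synchronous sends with no await points. A sketch, assuming the flow-control window was sized up front (as noted in Phase 3 below) so no capacity negotiation happens mid-loop:

```rust
use bytes::Bytes;
use h2::SendStream;

/// Phase 2 (sketch): queue every body back-to-back on the shared
/// connection; with TCP_NODELAY the frames flush together.
fn release_bodies(
    streams: &mut [(SendStream<Bytes>, Bytes)],
) -> Result<(), h2::Error> {
    for (send_stream, body) in streams.iter_mut() {
        // END_STREAM = true: this DATA frame completes the request.
        send_stream.send_data(body.clone(), true)?;
    }
    Ok(())
}
```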
Phase 3: Response Collection
All ResponseFuture handles are awaited with a configurable timeout. Response bodies are collected via h2 flow control:
For each stream:
1. Await response with a configurable timeout
2. Collect the response body, releasing H2 flow control capacity as chunks arrive
3. Record status, headers, and body for anomaly analysis
Connection parameters are tuned for race conditions: the HTTP/2 flow control window is sized to avoid stalls mid-race, and concurrent stream limits are pushed high. TLS is negotiated with ALPN h2 to ensure HTTP/2 from the first byte.
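A sketch of per-stream collection; the 10-second timeout is illustrative, not Arbiter's default:

```rust
use bytes::Bytes;
use h2::client::ResponseFuture;
use tokio::time::{timeout, Duration};

/// Phase 3 (sketch): await one response, releasing flow-control
/// capacity as body chunks arrive.
async fn collect_response(
    response: ResponseFuture,
) -> Result<(http::StatusCode, Vec<u8>), Box<dyn std::error::Error>> {
    let response = timeout(Duration::from_secs(10), response).await??;
    let status = response.status();
    let mut body = response.into_body();
    let mut flow = body.flow_control().clone();
    let mut bytes = Vec::new();
    while let Some(chunk) = body.data().await {
        let chunk: Bytes = chunk?;
        // Release capacity so the server can keep sending.
        flow.release_capacity(chunk.len())?;
        bytes.extend_from_slice(&chunk);
    }
    Ok((status, bytes))
}
```

In the sketch's terms, the window sizing corresponds to `h2::client::Builder::initial_window_size` and `initial_connection_window_size`, and the ALPN requirement to setting `alpn_protocols` to `h2` on the rustls `ClientConfig`.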
Enhanced HTTP/2: END_STREAM Delay
The basic single-packet attack works well when request bodies are small enough to fit in one TCP segment. But what if you need to send 20 requests with 500-byte JSON bodies each? The total frame data exceeds one segment, and the tight-loop approach may span multiple packets, reintroducing timing variance.
The enhanced strategy solves this with three techniques drawn directly from Kettle's research.
END_STREAM Delay Technique
The key insight: an HTTP/2 server does not begin processing a request until it receives a frame with the END_STREAM flag set. By sending all HEADERS and DATA frames without END_STREAM, then triggering END_STREAM on all streams simultaneously, we decouple "data delivery" from "request completion."
Phase 1: For each stream, send HEADERS without END_STREAM
Phase 2: For each stream, send DATA (request body) without END_STREAM
-- Server has all data but considers requests incomplete --
Phase 3: Tight loop sends zero-length DATA with END_STREAM on all streams
-- All requests complete simultaneously --
The zero-length DATA frames with END_STREAM are tiny — just the 9-byte HTTP/2 frame header each. Even with 150 concurrent streams, that is 1,350 bytes, comfortably within a single 1460-byte TCP segment. The server receives all END_STREAM signals in one read() and begins processing all requests from the same event loop iteration.
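Sketched against the h2 handles reserved earlier, the trigger is one empty send per stream:

```rust
use bytes::Bytes;
use h2::SendStream;

/// END_STREAM delay (sketch): each stream already holds its full body
/// (sent with end_of_stream = false). Empty DATA frames carrying only
/// the END_STREAM flag complete every request at once.
fn trigger_end_stream(streams: &mut [SendStream<Bytes>]) -> Result<(), h2::Error> {
    for stream in streams.iter_mut() {
        // 9 bytes on the wire per stream: just the frame header.
        stream.send_data(Bytes::new(), true)?;
    }
    Ok(())
}
```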
This technique is strictly superior to the basic attack for requests with bodies larger than a few hundred bytes. The timing diagram illustrates the difference:
Basic single-packet attack:
Client Server
|--- [HEADERS stream 1] ----------------------> |
|--- [HEADERS stream 2] ----------------------> |
|--- [DATA+ES stream 1] [DATA+ES stream 2] --> | <-- must fit in 1 segment
| |--> process(1), process(2)
END_STREAM delay:
Client Server
|--- [HEADERS stream 1] ----------------------> | (buffered, not processed)
|--- [HEADERS stream 2] ----------------------> | (buffered, not processed)
|--- [DATA stream 1] -------------------------> | (buffered, not processed)
|--- [DATA stream 2] -------------------------> | (buffered, not processed)
| |
|--- [0-byte DATA+ES 1] [0-byte DATA+ES 2] --> | <-- always fits in 1 segment
| |--> process(1), process(2)
Connection Warming
TCP connections are not steady-state from the first byte. The congestion window starts small (typically 10 MSS per RFC 6928), the server's receive buffers may not be fully allocated, and TLS session parameters are still being cached. Sending the race payload on a cold connection introduces variance that has nothing to do with the target application.
Connection warming sends HTTP/2 PING frames before the attack to stabilize the connection. A configurable number of pings with inter-ping delays are sent to warm up the path.
The pings accomplish several things: they grow the TCP congestion window past slow-start, they prime the server's HTTP/2 connection state machine, and they give the kernel time to allocate optimal buffer sizes. In practice, connection warming significantly reduces timing spread.
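A sketch of the warming loop; the ping count and spacing here are illustrative, and the `PingPong` handle is obtained via `Connection::ping_pong()` before the connection task is spawned onto the runtime:

```rust
use h2::{Ping, PingPong};
use tokio::time::{sleep, Duration};

/// Warming (sketch): round-trip PINGs before the race. Each ack
/// confirms a full round trip and keeps the congestion window growing.
async fn warm_connection(pings: &mut PingPong) -> Result<(), h2::Error> {
    for _ in 0..5 {
        pings.ping(Ping::opaque()).await?;
        sleep(Duration::from_millis(20)).await;
    }
    Ok(())
}
```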
Timing Spread Analysis
After executing a race, the engine computes a timing spread from response arrival times — the difference between the fastest and slowest response in the batch. This spread, along with mean and standard deviation, serves as a quality signal for how tightly synchronized the requests actually were.
A low spread (sub-millisecond) strongly indicates single-packet delivery. Higher spreads suggest multi-packet delivery, and the engine may retry with connection warming or fall back to the END_STREAM delay technique. This feedback loop is automatic — the Probe phase adapts its strategy based on observed timing quality.
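The computation itself is simple. A sketch over arrival timestamps captured as each response future resolves:

```rust
use std::time::Instant;

/// Timing spread (sketch): returns (spread, mean, stddev) in microseconds,
/// measured relative to the earliest arrival in the batch.
fn timing_spread(arrivals: &[Instant]) -> Option<(f64, f64, f64)> {
    let base = *arrivals.iter().min()?;
    let us: Vec<f64> = arrivals
        .iter()
        .map(|t| t.duration_since(base).as_secs_f64() * 1e6)
        .collect();
    // The minimum maps to 0, so the maximum is the fastest-to-slowest spread.
    let spread = us.iter().cloned().fold(0.0_f64, f64::max);
    let mean = us.iter().sum::<f64>() / us.len() as f64;
    let var = us.iter().map(|x| (x - mean).powi(2)).sum::<f64>() / us.len() as f64;
    Some((spread, mean, var.sqrt()))
}
```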
Last-Byte Synchronization for HTTP/1.1
Not all targets support HTTP/2. When the server only speaks HTTP/1.1, multiplexing is unavailable — each request requires its own TCP connection, and there is no protocol-level mechanism to batch them. The last-byte synchronization strategy achieves comparable timing precision through a different mechanism.
The Approach
The idea is to send all but the final byte of each HTTP request on separate connections, then release all final bytes simultaneously. Each connection's server-side TCP stack has buffered an almost-complete request and cannot begin processing until the last byte arrives. A single coordinated byte release on all connections triggers simultaneous processing.
For each request:
1. Serialize the full HTTP/1.1 request to bytes
2. Split at the last byte: prefix (all bytes except the last) and final byte (1 byte)
3. Each request gets its own TCP connection (or TLS wrapper)
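A sketch of the serialize-and-split step; the header set is illustrative:

```rust
/// Last-byte sync prep (sketch): serialize an HTTP/1.1 request and
/// split off its final byte. Connection: close lets the reader later
/// detect end-of-response by socket close.
fn split_request(host: &str, path: &str, body: &str) -> (Vec<u8>, u8) {
    let raw = format!(
        "POST {path} HTTP/1.1\r\nHost: {host}\r\nContent-Type: application/json\r\nContent-Length: {}\r\nConnection: close\r\n\r\n{body}",
        body.len()
    );
    let mut prefix = raw.into_bytes();
    let last = prefix.pop().expect("request is never empty");
    (prefix, last)
}
```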
OS Threads and Spin-Loop Barriers
Each request gets its own OS thread — not a Tokio task, not a green thread, but a real std::thread::spawn thread. This is deliberate. Async runtimes introduce scheduling jitter: a Tokio task might be descheduled between being woken and actually executing its I/O. OS threads with spin-loop synchronization give us deterministic timing.
The synchronization primitive is a precision barrier — a shared atomic structure that coordinates all threads to release simultaneously. Each thread signals its arrival, then enters a busy-wait spin loop. When all threads have arrived, the main thread triggers the barrier, and every waiting thread proceeds at essentially the same instant. An optional target timestamp allows the barrier to delay release until a precise nanosecond, further tightening synchronization.
Why spin-loop waiting instead of OS sleep? Sleep-based timing carries milliseconds of jitter: the kernel's timer resolution on Linux defaults to 1ms (with CONFIG_HZ=1000), Windows's default timer granularity is 15.6ms, and sleep() involves a context switch through the scheduler. A busy-wait spin loop burns CPU but achieves dramatically tighter synchronization — orders of magnitude better than sleep-based coordination.
The barrier uses strong memory ordering on all atomic operations — correctness (every thread seeing the release signal at the same time) matters more than the nanoseconds saved by weaker orderings. The barrier is executed once per race attempt; it is not a hot path.
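A minimal sketch of such a barrier. The optional target-timestamp release is omitted, and `SeqCst` is the strong ordering discussed above:

```rust
use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};

/// Precision barrier (sketch): threads register arrival, then busy-wait
/// on a release flag. Runs once per race attempt, so the cost of
/// SeqCst ordering is irrelevant.
pub struct SpinBarrier {
    arrived: AtomicUsize,
    released: AtomicBool,
}

impl SpinBarrier {
    pub fn new() -> Self {
        Self {
            arrived: AtomicUsize::new(0),
            released: AtomicBool::new(false),
        }
    }

    /// Called by each worker thread: signal arrival, then spin.
    pub fn wait(&self) {
        self.arrived.fetch_add(1, Ordering::SeqCst);
        while !self.released.load(Ordering::SeqCst) {
            std::hint::spin_loop(); // keep the core hot without yielding
        }
    }

    /// Called by the coordinator once all threads have arrived.
    pub fn release(&self, expected: usize) {
        while self.arrived.load(Ordering::SeqCst) < expected {
            std::hint::spin_loop();
        }
        self.released.store(true, Ordering::SeqCst);
    }
}
```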
Execution Flow
The full last-byte synchronization sequence:
1. Main thread: create barrier expecting N threads
2. Spawn N OS threads, each:
a. TCP connect (or TLS handshake)
b. Set TCP_NODELAY
c. Write prefix bytes (all but the final byte) to socket
d. Flush socket -- server has buffered an incomplete request
e. Wait on barrier -- thread spins until released
3. Main thread: wait for all threads to arrive at barrier
4. Main thread: brief settle pause to let all threads stabilize
5. Main thread: release barrier
6. All threads simultaneously:
a. Write final byte
b. Flush
c. Read response
The settle pause exists because thread arrival at the barrier is not perfectly simultaneous. Some threads complete their TLS handshake faster than others. The settle period ensures all threads are spinning on the barrier before any of them are released, preventing early arrivals from writing their final byte before stragglers have even reached the wait point.
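Putting it together, each worker thread's body looks roughly like this, reusing the SpinBarrier sketched above. Plain TCP for brevity; the real engine also wraps TLS:

```rust
use std::io::{Read, Write};
use std::net::TcpStream;
use std::sync::Arc;

/// Worker thread body (sketch): assumes the request was serialized with
/// Connection: close, so read_to_end terminates on server close.
fn race_thread(
    addr: String,
    prefix: Vec<u8>,
    last: u8,
    barrier: Arc<SpinBarrier>,
) -> std::io::Result<Vec<u8>> {
    let mut sock = TcpStream::connect(&addr)?;
    sock.set_nodelay(true)?; // disable Nagle: every write flushes immediately
    sock.write_all(&prefix)?; // server now buffers an incomplete request
    barrier.wait(); // spin until the coordinator releases all threads
    sock.write_all(&[last])?; // the coordinated final byte
    let mut response = Vec::new();
    sock.read_to_end(&mut response)?;
    Ok(response)
}
```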
First-Sequence Synchronization
For extreme concurrency requirements — 10,000+ simultaneous requests — neither HTTP/2 multiplexing nor last-byte sync scales cleanly. HTTP/2 servers typically cap concurrent streams at 100-250. Last-byte sync at 10,000 threads requires 10,000 TCP connections and 10,000 OS threads, which is operationally expensive and introduces its own scheduling variance.
The first-sequence synchronization strategy uses IP fragmentation to achieve kernel-level simultaneity at massive scale. Each HTTP request is split at the MTU boundary. All fragments except the final one are sent first. TCP reassembly on the server buffers incomplete segments. Then all final fragments are released simultaneously — each target connection's TCP stack reassembles the complete segment and delivers it to userspace at the same kernel tick.
This technique requires raw socket access (CAP_NET_RAW on Linux, root on macOS) and involves manually constructing IP packets with correct fragmentation offsets and the More Fragments flag. It is the most invasive strategy and is only selected when the constraint probing phase determines that lower-overhead approaches cannot achieve the required concurrency.
Strategy Selection
Arbiter does not blindly pick a synchronization strategy. A constraint probing phase tests the target's capabilities before the race:
| Probe | What It Tests | Fallback Behavior |
|---|---|---|
| Max H2 streams | Server's `SETTINGS_MAX_CONCURRENT_STREAMS` | < 32 streams: fall back to last-byte sync |
| Body size limits | Maximum request body the server accepts | Constrains payload size per stream |
| Keep-alive timeout | How long the server holds idle connections | Adjusts warming and preparation timing |
| Rate limiting | Whether the target returns 429 responses | Switch to first-sequence sync for burst-before-limit |
The decision tree: if the target supports HTTP/2 with at least 32 concurrent streams, use the enhanced H2 strategy with END_STREAM delay and connection warming. If H2 is available but stream limits are low, use basic H2 single-packet. If only HTTP/1.1, use last-byte synchronization. If rate limiting is detected and the required concurrency exceeds what connection-per-request strategies can sustain before hitting the limit, escalate to first-sequence sync.
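The same decision tree as a sketch; `ProbeResults` and its fields are illustrative stand-ins for the probing phase's output, and the rate-limit condition is simplified:

```rust
/// Probe outcomes feeding strategy selection (sketch).
struct ProbeResults {
    http2: bool,
    max_streams: u32,
    rate_limited: bool,
    required_concurrency: u32,
}

enum Strategy {
    EnhancedH2,    // END_STREAM delay + connection warming
    BasicH2,       // single-packet attack
    LastByteSync,  // HTTP/1.1 fallback
    FirstSequence, // raw-socket IP fragmentation
}

fn select_strategy(p: &ProbeResults) -> Strategy {
    if p.rate_limited && p.required_concurrency > p.max_streams {
        Strategy::FirstSequence
    } else if p.http2 && p.max_streams >= 32 {
        Strategy::EnhancedH2
    } else if p.http2 {
        Strategy::BasicH2
    } else {
        Strategy::LastByteSync
    }
}
```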
Race Type Taxonomy
The detection system classifies races into 12 types across two categories. The first six are classic race conditions well-known in application security. The second six are the more subtle classes identified by Kettle's research, which exploit state machine inconsistencies that only manifest under precise concurrency.
Classic Race Types
| Type | Severity (1-10) | Concurrency | Description |
|---|---|---|---|
| `DoubleSpend` | 10 | 10-20 | Same token/coupon/credit redeemed multiple times |
| `LimitBypass` | 8 | 20-50 | Rate/quantity limits exceeded via concurrent requests |
| `Toctou` | 9 | 10-20 | Check and use separated by exploitable window |
| `StateConfusion` | 7 | 20-30 | Application enters invalid state under concurrency |
| `PrivilegeRace` | 9 | 10-20 | Privilege change races with privileged action |
| `FileRace` | 7 | 20-30 | File creation/deletion/rename TOCTOU |
Kettle State Machine Classes
| Type | Severity (1-10) | Multi-Endpoint | Description |
|---|---|---|---|
| `ObjectMasking` | 8 | Yes | Object created and masked/deleted simultaneously, leaving orphaned references. Cf. GitLab HackerOne #604534. |
| `TokenMisrouting` | 9 | Yes | Token issued to one session delivered to another under concurrent auth flows. Follows the CVE-2022-4037 pattern. |
| `PartialConstruction` | 7 | No | Object accessed before initialization completes, exposing default/zero values. |
| `DeferredCollision` | 6 | No | Effects manifest after the race window closes, requiring delayed observation. |
| `MultiEndpoint` | 8 | Yes | Race between different endpoints that share backend state (e.g., password reset + login). |
| `SessionCorruption` | 9 | Yes | Concurrent session operations produce cross-user data leakage. |
Each type has a recommended concurrency level — the number of simultaneous requests that maximizes detection probability without overwhelming the target. Double-spend races fire with as few as 10 concurrent requests; limit bypasses may require 50. The distinction between single-endpoint and multi-endpoint races is important: single-endpoint races involve one endpoint racing against itself (the common case), while multi-endpoint races involve two different endpoints racing against shared state (the Kettle cases). Multi-endpoint races require the detection engine to correlate responses across different URL patterns.
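As a sketch, the taxonomy reduces to an enum carrying per-type tuning. The concurrency values below are midpoints of the recommended ranges in the classic table; the flat default for the Kettle classes is our illustrative guess, since the second table does not specify one:

```rust
/// The 12-type taxonomy (sketch).
enum RaceType {
    DoubleSpend, LimitBypass, Toctou, StateConfusion, PrivilegeRace, FileRace,
    ObjectMasking, TokenMisrouting, PartialConstruction, DeferredCollision,
    MultiEndpoint, SessionCorruption,
}

impl RaceType {
    /// Midpoint of the recommended range from the tables above.
    fn recommended_concurrency(&self) -> u32 {
        use RaceType::*;
        match self {
            DoubleSpend | Toctou | PrivilegeRace => 15,
            LimitBypass => 35,
            StateConfusion | FileRace => 25,
            _ => 20, // Kettle classes: illustrative default
        }
    }

    /// Multi-endpoint races need cross-URL response correlation.
    fn is_multi_endpoint(&self) -> bool {
        use RaceType::*;
        matches!(
            self,
            ObjectMasking | TokenMisrouting | MultiEndpoint | SessionCorruption
        )
    }
}
```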
Detection Heuristics
The detection system combines three complementary strategies to identify race candidates from the state graph without requiring any manual annotation.
1. Pattern Matching
A pattern library matches endpoint paths against known race-prone operations. For example:
- `/transfer`, `/send`, `/withdraw` with body keywords like `amount`, `recipient` → DoubleSpend (critical)
- `/redeem`, `/coupon`, `/promo`, `/voucher` with body keywords like `code`, `coupon_code` → DoubleSpend (critical)
- `/vote`, `/like`, `/upvote`, `/rate` with body keywords like `item_id`, `post_id` → LimitBypass (high)
- `/reserve`, `/book`, `/claim`, `/lock` with body keywords like `resource_id`, `slot` → TOCTOU (high)
- Additional patterns for file operations, auth flows, etc.
Path matching provides the initial signal, but body keyword reinforcement is what elevates a candidate from "maybe interesting" to "worth testing." An endpoint at `/api/v2/actions` matches no path pattern, but if its request body contains `amount` and `recipient`, keyword reinforcement associates it with DoubleSpend.
2. Shared Resource Analysis
The more powerful strategy operates on the state graph directly. It builds a mapping from each inferred resource to the endpoints that write to it. A "resource" here is derived from path pattern analysis — /api/users/{id}/balance and /api/users/{id}/transfer share the users/{id} resource.
Any resource with two or more write endpoints (methods POST, PUT, PATCH, or DELETE) is a race candidate. If both endpoints modify the same resource, concurrent execution may violate the atomicity assumptions of either handler. This analysis discovers race conditions in custom application logic that no pattern list could anticipate.
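A sketch of the analysis; `Endpoint` stands in for Arbiter's state-graph node type, and the resource-key derivation here is deliberately crude (numeric segments only, no UUID handling):

```rust
use std::collections::HashMap;

#[derive(Clone)]
struct Endpoint {
    method: String,
    path: String,
}

/// Map each inferred resource to its write endpoints; any resource
/// with >= 2 writers is a race candidate.
fn race_candidates(endpoints: &[Endpoint]) -> Vec<(String, Vec<Endpoint>)> {
    const WRITE: [&str; 4] = ["POST", "PUT", "PATCH", "DELETE"];
    let mut by_resource: HashMap<String, Vec<Endpoint>> = HashMap::new();
    for ep in endpoints {
        if !WRITE.contains(&ep.method.as_str()) {
            continue;
        }
        // Crude key: first three path segments with numeric IDs templated,
        // e.g. /api/users/42/balance -> api/users/{id}.
        let resource = ep
            .path
            .trim_start_matches('/')
            .split('/')
            .map(|seg| {
                if !seg.is_empty() && seg.chars().all(|c| c.is_ascii_digit()) {
                    "{id}"
                } else {
                    seg
                }
            })
            .take(3)
            .collect::<Vec<_>>()
            .join("/");
        by_resource.entry(resource).or_default().push(ep.clone());
    }
    by_resource.into_iter().filter(|(_, eps)| eps.len() >= 2).collect()
}
```

With this key, `/api/users/{id}/balance` and `/api/users/{id}/transfer` both collapse to `api/users/{id}` and surface as one candidate pair.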
3. Rate Limit Detection
Endpoints that return HTTP 429 (Too Many Requests) are automatically flagged as limit bypass candidates. The reasoning: if the application bothers to enforce a rate limit, the operation behind it is sensitive. And rate limits implemented in application code (rather than at the reverse proxy level) are often vulnerable to the single-packet attack — the limit-check and limit-increment are separate database operations with a TOCTOU gap between them.
Anomaly Detection
After the Probe phase executes a race and collects responses, the Prove phase analyzes them for anomalies. Several anomaly types are defined, ranked from strongest to weakest indicators of a real race condition:
| Anomaly Type | Strength | Indicator |
|---|---|---|
| Duplicate objects | Strong | Multiple responses contain newly created objects where only one should exist (e.g., two successful coupon redemptions). |
| Token misdirection | Strong | A response contains a token/session belonging to a different user or request context. |
| State leakage | Strong | Response exposes internal state that should have been invalidated (e.g., accessing a deleted resource returns data). |
| Inconsistent responses | Moderate | Identical requests produce different status codes or response structures within the same race batch. |
| Sporadic errors | Moderate | Some requests in the batch produce 500-series errors while others succeed, suggesting database deadlocks or constraint violations. |
| Timing skew | Weak | One response takes significantly longer than others, suggesting it hit a lock or retry path. |
The confidence ranking reflects diagnostic precision. Duplicate objects is a near-certain indicator — if you sent 20 identical coupon redemptions and got 3 successes, that is a race condition. Timing skew is weaker; a single slow response could be caused by garbage collection, connection reuse, or unrelated server load. The engine treats strong indicators as high-confidence evidence suitable for automated reporting.
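The strongest check is also the simplest to sketch — count successes where exactly one is expected:

```rust
/// Duplicate-objects check (sketch): returns Some(n) when more than one
/// request in the batch succeeded on an operation that should succeed once.
fn duplicate_success(statuses: &[u16]) -> Option<usize> {
    let successes = statuses
        .iter()
        .copied()
        .filter(|s| (200..300).contains(s))
        .count();
    (successes > 1).then_some(successes)
}
```

Run against the coupon example above, a batch of 20 statuses containing three 200s returns `Some(3)`.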
HTTP/2 Request Smuggling
The race engine shares infrastructure with Arbiter's HTTP/2 smuggling detection, which tests 11 variant classes:
| Variant | Mechanism |
|---|---|
| `H2.CL` | HTTP/2 front-end, Content-Length disagreement with backend over HTTP/1.1 downgrade |
| `H2.TE` | `Transfer-Encoding: chunked` injected into HTTP/2 request, honored after downgrade |
| `H2.Downgrade` | Generic downgrade: HTTP/2 to HTTP/1.1 between front-end proxy and backend |
| `PseudoHeader` | Manipulation of `:path`, `:method`, `:authority`, or `:scheme` pseudo-headers |
| `HeaderCase` | Mixed-case headers (e.g., `Transfer-encoding`) bypass header filters |
| `ConnectionHeader` | `Connection:` hop-by-hop header smuggled through HTTP/2 |
| `HeaderNormalization` | Exploiting differences in header normalization between proxy tiers |
| `CRLFInjection` | `\r\n` sequences in HTTP/2 header values, interpreted after downgrade |
| `RequestLineInjection` | Spaces or control characters in `:path` that split the HTTP/1.1 request line |
| `ChunkedWithCL` | Both Transfer-Encoding and Content-Length present post-downgrade |
| `H2.Zero` | Zero Content-Length with non-empty body (or vice versa) after downgrade |
These classes exploit the impedance mismatch between HTTP/2's binary framing and HTTP/1.1's text-based protocol. When a reverse proxy downgrades an HTTP/2 request to HTTP/1.1 for a backend server, header values that were safely binary-framed in HTTP/2 may become injection vectors in the text protocol. The h2 crate's frame-level access is essential here — hyper would sanitize exactly the malformed inputs we need to send.
Architectural Decisions
Several implementation choices deserve explicit justification, because they contradict conventional wisdom:
h2 over hyper. Hyper is the standard Rust HTTP client and it uses h2 internally. But hyper's API is request/response-oriented — you hand it a Request and get back a Response. There is no way to send HEADERS now and DATA later, or to control END_STREAM placement. The h2 crate exposes the frame lifecycle directly, which is what makes the END_STREAM delay technique possible.
OS threads for last-byte sync. Tokio tasks are cooperative: they yield at .await points and rely on the runtime to schedule them. A spin loop in a Tokio task would block the entire runtime. OS threads can spin independently without affecting each other, and because the waiters never sleep, the barrier release is a single atomic store that spinning cores observe through cache coherence within microseconds — no scheduler wakeup sits on the critical path.
rustls over native-tls. The reason is ALPN. ALPN (Application-Layer Protocol Negotiation) is the TLS extension that lets client and server agree on HTTP/2 during the handshake. rustls makes ALPN configuration straightforward and consistent across platforms. native-tls delegates to OpenSSL/Schannel/SecureTransport, and ALPN support varies. Since our entire race strategy depends on confirmed HTTP/2 negotiation, we cannot afford platform-dependent ALPN behavior.
State graph integration. The race engine is not a standalone tool. It consumes the same state graph that feeds constraint inference, IDOR detection, and authentication testing. This means race candidates are discovered automatically from observed application behavior, not from a manually curated list of "endpoints to test." When the state graph identifies two POST endpoints that share a resource, the race engine can test them without any human in the loop.
Conclusion
The single-packet attack transforms race conditions from theoretical curiosities into reliably exploitable bugs. The key is timing precision: HTTP/2 multiplexing for tightly synchronized delivery, END_STREAM delay for large payloads, last-byte sync for HTTP/1.1 targets, and spin-loop barriers for coordination without OS scheduler jitter.
But the technique is only as useful as the detection and analysis around it. Knowing how to send 20 requests simultaneously is necessary but not sufficient. You need to know which 20 requests to send (the Predict phase), and you need to determine whether the responses indicate a real race or benign variance (the Prove phase). The full pipeline — state graph analysis, pattern matching, shared resource detection, strategy selection, precision timing, and anomaly classification — is what makes automated race condition discovery practical.
Kettle's research opened the door. The engineering challenge is walking through it at scale.