The Combinatorial Trap

Traditional web application scanners work by enumeration. For each discovered endpoint, they iterate over every parameter and fire a battery of payloads: SQL injection strings, XSS vectors, path traversal sequences, command injection attempts. If an application exposes 200 endpoints with an average of 5 parameters each, and the scanner carries 500 payloads per vulnerability class across 15 classes, the search space is on the order of 200 × 5 × 500 × 15 = 7.5 million requests. Most of those requests are meaningless. The scanner does not know which parameters accept user IDs, which endpoints enforce authorization, or which state transitions the application expects. It is brute-forcing a problem that has structure.

Fuzzers face the same issue from a different angle. Coverage-guided fuzzers are excellent at finding memory corruption in parsers and protocol implementations, but they are structurally blind to business logic. A fuzzer will never discover that swapping a user_id path parameter between two authenticated sessions yields another user's data, because it has no model of identity, ownership, or authorization. It does not understand that GET /api/orders/1742 returns different data depending on who is asking.

The vulnerability is not in the input format. It is in the constraint the application fails to enforce.

This observation is the foundation of Arbiter's approach. Rather than enumerating payloads against parameters, we first build a model of the application's behavior, infer the constraints it appears to enforce, and then generate targeted hypotheses that test whether those constraints hold. The search space becomes proportional to the number of meaningful constraints — typically tens or low hundreds — rather than the Cartesian product of endpoints, parameters, and payloads.

Architecture: Observe, Infer, Violate

The system is organized as a three-phase pipeline. Each phase produces a typed intermediate representation that feeds the next. There is no shared mutable state between phases; the output of one is the input to the next, making each phase independently testable and its behavior reproducible.

  1. Observe — Construct a typed state graph from raw HTTP traffic.
  2. Infer — Extract constraints from the state graph via specialized inference engines.
  3. Violate — Generate hypotheses from constraints, test them, and produce proofs.

The following sections walk through each phase in detail.

Phase 1: State Graph Construction

The input to the pipeline is a set of HTTP request-response pairs, captured from a proxy, HAR import, or active crawl. The Observe phase transforms this flat sequence into a state graph: a directed graph where nodes are endpoint aggregates and edges are observed transitions between them.

Graph Representation

Each node in the graph represents an endpoint aggregate: a unique combination of HTTP method and parameterized path pattern. Nodes carry typed parameter metadata (location, inferred type, observed values) and the authentication context under which they were observed. Edges represent observed transitions between endpoints, capturing both frequency and timing.

Path parameterization is nontrivial. The system must recognize that /api/users/3fa85f64-5717-4562-b3fc-2c963f66afa6/orders and /api/users/a1b2c3d4-e5f6-7890-abcd-ef1234567890/orders share the same path pattern. We detect UUID, integer, slug, and base64 segments via pattern classification, then merge endpoints that differ only in parameterized segments.
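Segment classification of this kind can be sketched with a few regular expressions. The patterns and placeholder names below are illustrative assumptions, not Arbiter's actual rules:

```python
import re

# Heuristic segment classifiers, checked in order of specificity.
# These patterns are illustrative, not Arbiter's actual rule set.
SEGMENT_PATTERNS = [
    ("uuid", re.compile(
        r"^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$",
        re.IGNORECASE)),
    ("int", re.compile(r"^\d+$")),
    ("slug", re.compile(r"^[a-z0-9]+(?:-[a-z0-9]+)+$")),
]

def parameterize(path: str) -> str:
    """Replace variable segments with typed placeholders so that paths
    differing only in IDs collapse to a single pattern."""
    out = []
    for seg in path.strip("/").split("/"):
        for name, pattern in SEGMENT_PATTERNS:
            if pattern.match(seg):
                out.append("{" + name + "}")
                break
        else:
            out.append(seg)  # literal segment, kept as-is
    return "/" + "/".join(out)
```

With this, both UUID paths from the example above collapse to `/api/users/{uuid}/orders`, and endpoint aggregates can be merged by pattern equality.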

Data Flow Tracking

Two categories of cross-request data flow are tracked: parameter value flows and CSRF token flows.

Parameter value flows capture cases where a value appearing in one endpoint's response later appears as a parameter in a request to another endpoint. For example, if an order_id returned by POST /api/orders later appears as the {id} path parameter in GET /api/orders/{id}, that flow is recorded. The system is JSON-path-aware: it tracks values through nested JSON structures rather than relying on naive string matching, which avoids false correlations.
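A minimal sketch of JSON-path-aware flow tracking, assuming a simplified representation where a response body is compared against a later request's parameters:

```python
import json

def json_paths(obj, prefix="$"):
    """Yield (json_path, value) pairs for every leaf in parsed JSON."""
    if isinstance(obj, dict):
        for key, value in obj.items():
            yield from json_paths(value, f"{prefix}.{key}")
    elif isinstance(obj, list):
        for i, value in enumerate(obj):
            yield from json_paths(value, f"{prefix}[{i}]")
    else:
        yield prefix, obj

def find_flows(response_body: str, later_request_params: dict):
    """Record response-value -> request-parameter flows, keyed by the
    JSON path the value came from. Tracking by path rather than raw
    string match is what avoids false correlations."""
    flows = []
    for path, value in json_paths(json.loads(response_body)):
        for param, pvalue in later_request_params.items():
            if value is not None and str(value) == str(pvalue):
                flows.append({"from": path, "to": param, "value": str(value)})
    return flows
```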

CSRF token flow tracking records which endpoint issues a token and which endpoint consumes it. This provenance information feeds directly into the CSRF bypass detection in Phase 3.

Authentication Boundary Detection

The graph also identifies authentication boundaries. The system inspects request headers for four authentication schemes — Bearer tokens, session cookies, HTTP Basic, and API keys — and groups endpoints by which scheme they require. Endpoints that return 200 without credentials and 200 with credentials are classified differently from endpoints that return 401/403 without credentials. This distinction is critical: it separates public endpoints from authenticated ones, and it identifies which authentication scheme protects which surface.

Phase 2: Constraint Inference

Phase 2 is where the system transitions from observation to understanding. Seven specialized inference engines analyze the state graph and produce typed constraint values. Each engine is responsible for one class of application behavior.

The Constraint Type System

The system defines eleven constraint categories, each carrying a confidence score:

  • Identity Binding — a parameter is bound to the authenticated user's identity (e.g., user_id in the path must match the JWT sub claim)
  • Role Required — endpoint requires a specific role or privilege level (e.g., /admin/users requires the admin role)
  • State Precondition — request depends on a prior state-changing operation (e.g., POST /checkout requires items in the cart)
  • Numeric Bound — numeric parameter has an observed min/max range (e.g., quantity always between 1 and 99)
  • Temporal — time-based constraint on request validity (e.g., token expires after 300s)
  • Rate Limit — endpoint enforces request frequency limits (e.g., 429 after 100 requests per minute)
  • Ordering — endpoints must be called in a specific sequence (e.g., GET /cart before POST /checkout)
  • Parameter Dependency — a parameter's value depends on a prior response (e.g., order_id must come from POST /orders)
  • Auth Required — endpoint requires a specific authentication scheme (e.g., Bearer token required for /api/*)
  • Negative — inferred from consistent failure without a prerequisite (e.g., requests fail unless preceded by POST /login)
  • Custom — application-specific constraint from heuristic analysis (e.g., coupon code only valid once per user)

Every constraint carries a confidence score between 0.0 and 1.0. Only constraints exceeding a tunable inference threshold survive into Phase 3. The threshold reflects the minimum confidence at which generating a hypothesis is cheaper than the false-positive noise it would create.

Identity Binding Inference

The most consequential inference engine answers the question: which parameters, if any, are bound to the identity of the authenticated user? Finding these is the prerequisite for IDOR detection.

The algorithm correlates parameter values against user identifiers extracted from authentication contexts across all observations. For each parameter on each endpoint, it computes the match rate: what fraction of observed parameter values correspond to a known user ID from the associated auth context? If the match rate exceeds a threshold, the parameter is flagged as identity-bound.

Confidence is adjusted by multiple signals. Parameter location matters: a path parameter is a stronger indicator of identity binding than a query or header parameter. The parameter's inferred type matters: UUID-typed parameters are less likely to be coincidental matches than integers. And semantic name matching matters: the system recognizes that user_id, sub (from JWT claims), owner_id, and account_id often refer to the same identity concept. This cross-referencing between parameter names and auth context field names catches identity bindings that a purely value-based approach would miss.
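A simplified sketch of the match-rate computation with a location adjustment. The threshold and weights here are illustrative assumptions; the actual engine also folds in type and semantic-name signals as described above:

```python
def identity_binding_confidence(observations, location, threshold=0.8):
    """observations: list of (param_value, auth_user_id) pairs for one
    parameter on one endpoint. Returns a confidence score, or None when
    the match rate falls below the flagging threshold."""
    matches = sum(1 for value, user_id in observations if value == user_id)
    rate = matches / len(observations)
    if rate < threshold:
        return None
    # Path parameters are stronger identity signals than query or
    # header parameters; these weights are illustrative.
    location_weight = {"path": 1.0, "query": 0.85, "header": 0.7}
    return round(rate * location_weight.get(location, 0.7), 3)
```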

Role Inference

The role inference engine identifies endpoints that are gated by privilege level. It analyzes two signals: the ratio of 401/403 responses to 200 responses across different auth contexts, and path prefix matching against common administrative paths (/admin, /staff, /internal, /manage). An endpoint that returns 403 for most users but 200 for a specific auth context is likely role-gated. The confidence is proportional to the disparity in access patterns.
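A rough sketch of this scoring, assuming per-context response histories. The disparity formula and prefix bonus are illustrative inventions, not the engine's actual arithmetic:

```python
# Common administrative path prefixes, per the heuristic above.
ADMIN_PREFIXES = ("/admin", "/staff", "/internal", "/manage")

def role_gate_confidence(endpoint_path, responses_by_context):
    """responses_by_context: {auth_context: [status_code, ...]}.
    Infers a role-required constraint when most contexts are denied
    but at least one consistently succeeds."""
    allowed = [c for c, codes in responses_by_context.items()
               if codes and all(s < 400 for s in codes)]
    denied = [c for c, codes in responses_by_context.items()
              if codes and all(s in (401, 403) for s in codes)]
    if not allowed or not denied:
        return 0.0
    # Confidence grows with the share of contexts that are denied.
    disparity = len(denied) / (len(allowed) + len(denied))
    bonus = 0.1 if endpoint_path.startswith(ADMIN_PREFIXES) else 0.0
    return min(1.0, round(disparity + bonus, 3))
```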

Ordering and Dependency Inference

The ordering inference engine operates on transition edges in the state graph. It counts how many times endpoint A is followed by endpoint B across all observed sessions. If a pair (A, B) appears consistently, the system infers an ordering constraint. This captures workflows like "add to cart, then checkout" or "create payment method, then create subscription."

The parameter dependency engine operates on tracked value flows. If a value flows from endpoint A's response into endpoint B's request parameter, that constitutes a parameter dependency constraint: the parameter cannot be freely chosen but must come from a prior interaction with the application.
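The transition-counting step can be sketched as follows; the support threshold is an illustrative assumption:

```python
from collections import Counter

def infer_ordering(sessions, min_support=2):
    """sessions: list of endpoint sequences, one per observed session.
    Infers an ordering constraint A -> B when the adjacent transition
    appears in at least min_support sessions."""
    transitions = Counter()
    for seq in sessions:
        for a, b in zip(seq, seq[1:]):
            transitions[(a, b)] += 1
    return [{"before": a, "after": b, "support": n}
            for (a, b), n in transitions.items() if n >= min_support]
```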

Negative Inference

Negative inference is the inverse of the others. It identifies endpoints that return both 4xx and 2xx responses, then examines the sessions that produced successes. If those successful sessions consistently include a specific predecessor endpoint that the failing sessions lack, the predecessor is inferred as a prerequisite. This is a form of abductive reasoning: we observe the effect (success vs. failure) and infer the cause (presence or absence of a prior step).
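The abductive step can be sketched as a set computation over session outcomes; the session representation is illustrative:

```python
def infer_prerequisite(sessions):
    """sessions: list of dicts with 'steps' (endpoints called, in order)
    and 'outcome' ('2xx' or '4xx') for a target endpoint. Returns the
    endpoints present in every successful session and absent from every
    failing one -- the inferred prerequisites."""
    ok = [set(s["steps"]) for s in sessions if s["outcome"] == "2xx"]
    fail = [set(s["steps"]) for s in sessions if s["outcome"] == "4xx"]
    if not ok or not fail:
        return set()  # need both outcomes to reason abductively
    always_in_successes = set.intersection(*ok)
    seen_in_failures = set.union(*fail)
    return always_in_successes - seen_in_failures
```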

Deduplication and Ranking

After all seven engines run, the system deduplicates constraints using a semantic equivalence function that normalizes away incidental differences like specific IDs and confidence values. If two engines produce constraints that are semantically equivalent, only the one with the higher confidence survives. The final constraint set is sorted by confidence (descending) and capped at a configurable maximum to keep the hypothesis space tractable.

Deduplication is not just an optimization. It prevents hypothesis explosion in Phase 3, where each constraint may generate multiple violation hypotheses. Keeping the constraint set small and high-confidence is what makes the overall approach tractable.
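A minimal sketch of this deduplication, assuming the semantic key is (category, endpoint, parameter); the real equivalence function may normalize more fields:

```python
def dedupe(constraints):
    """Keep the highest-confidence constraint per semantic key, then
    rank the survivors by confidence, descending."""
    best = {}
    for c in constraints:
        key = (c["category"], c["endpoint"], c.get("parameter"))
        if key not in best or c["confidence"] > best[key]["confidence"]:
            best[key] = c
    return sorted(best.values(), key=lambda c: -c["confidence"])
```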

Phase 3: Violation Detection

With a ranked set of constraints in hand, Phase 3 asks: what would it look like if this constraint were violated? The hypothesis generator maps each constraint category to one or more violation hypotheses, each with a priority score that determines testing order.

Hypothesis Generation

  • Identity Binding — Identity Violation hypothesis (high priority) — maps to IDOR (CWE-639)
  • Role Required — Privilege Escalation hypothesis (high priority) — maps to Vertical Privilege Escalation (CWE-269)
  • Numeric Bound — Bound Violation hypothesis (medium priority) — maps to Business Logic (CWE-840)
  • Rate Limit — Rate Limit Bypass hypothesis (medium priority) — maps to Rate Limiting Bypass
  • Ordering — State Bypass hypothesis (medium priority) — maps to Workflow Bypass (CWE-840)
  • Auth Required — Auth Bypass hypothesis (highest priority) — maps to Authentication Bypass (CWE-287)

Priority determines testing order, not importance. Identity violations and auth bypasses are tested first because they tend to be the most straightforward to confirm or reject: the response either contains another user's data or it does not. Business logic violations, which may require multi-step state manipulation, are tested later.
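The mapping and testing order can be sketched as follows; the numeric priority encoding (lower tests first) is an illustrative assumption:

```python
# Constraint category -> (hypothesis kind, test priority).
# Lower priority numbers are tested first; auth bypass leads.
HYPOTHESIS_MAP = {
    "auth_required": ("auth_bypass", 0),
    "identity_binding": ("identity_violation", 1),
    "role_required": ("privilege_escalation", 1),
    "numeric_bound": ("bound_violation", 2),
    "rate_limit": ("rate_limit_bypass", 2),
    "ordering": ("state_bypass", 2),
}

def generate_hypotheses(constraints):
    """Expand each constraint into its violation hypothesis and return
    the list in testing order."""
    hyps = []
    for c in constraints:
        if c["category"] in HYPOTHESIS_MAP:
            kind, priority = HYPOTHESIS_MAP[c["category"]]
            hyps.append({"kind": kind, "priority": priority, "constraint": c})
    return sorted(hyps, key=lambda h: h["priority"])
```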

Proof-of-Concept Generation

Each hypothesis type has a dedicated PoC generation strategy:

IDOR (Identity Violation): The system constructs a request to the target endpoint using the attacker's authentication context but substitutes the victim's resource identifier into the identity-bound parameter. If the original request was GET /api/users/attacker-uuid/documents/42, the PoC becomes GET /api/users/victim-uuid/documents/42 with the attacker's Bearer token. A 2xx response containing data that matches the victim's profile confirms the vulnerability.
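The substitution step can be sketched as follows; the request shape is an illustrative dict, not Arbiter's internal type:

```python
def build_idor_poc(original_request, identity_param, victim_value):
    """Swap the victim's identifier into the identity-bound parameter
    while keeping the attacker's credentials untouched."""
    return {
        "method": original_request["method"],
        "path": original_request["path"].replace(
            original_request["params"][identity_param], victim_value),
        # The attacker's own auth headers are preserved deliberately.
        "headers": dict(original_request["headers"]),
    }
```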

Privilege Escalation: The system sends the low-privilege user's credentials to endpoints that were only observed succeeding with high-privilege credentials. If GET /admin/users returned 200 for an admin session and 403 for a regular user, the PoC re-sends the regular user's request to determine whether the 403 is enforced at the application level or merely in the UI.

Ordering/State Bypass: If the inferred ordering is A then B, the system sends B directly, skipping A. A checkout endpoint that succeeds without a prior add-to-cart step, or a payment confirmation that processes without a valid payment intent, indicates a state machine violation.

Passive Violation Detection

In addition to hypothesis-driven testing, the system runs three passive detection passes over the state graph:

  • Ordering bypass detection — Searches the existing traffic for sessions where endpoint B was called without its prerequisite A, and succeeded. This catches violations that already happened during crawling.
  • Auth bypass detection — Identifies endpoints that returned 2xx for unauthenticated requests despite having an auth requirement inferred from other sessions.
  • CSRF bypass detection — Checks whether state-changing requests (POST, PUT, DELETE) succeed without valid CSRF tokens. Cross-references token provenance tracking to verify the token was not merely absent but also not validated.

Crucially, only 2xx responses are flagged. A 403 or 401 response to a violation attempt is a correct rejection, not a finding. This seems obvious, but many scanning tools generate findings based on the attempt rather than the outcome.

Confidence Scoring

Each validated finding receives a composite confidence score built from four weighted components: reproducibility (can the finding be reproduced across multiple attempts?), constraint clarity (how confident was the original constraint inference?), impact certainty (how clearly does the evidence demonstrate impact?), and false-positive resistance (how unlikely is this to be a false positive?).

Reproducibility carries the most weight because a vulnerability that cannot be reproduced is not useful in a report. A finding that reproduced consistently across different resource IDs is significantly more credible than one that worked once.

The validation threshold is deliberately higher than the inference threshold. The system is allowed to be speculative when generating hypotheses — that is cheap. But it must be conservative when reporting findings — that is what the user sees.
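The weighted composite might be sketched as follows. The specific weights are illustrative assumptions, chosen only to reflect the stated rationale that reproducibility dominates:

```python
# Illustrative weights for the four components described above.
WEIGHTS = {
    "reproducibility": 0.4,
    "constraint_clarity": 0.25,
    "impact_certainty": 0.2,
    "fp_resistance": 0.15,
}

def composite_confidence(components: dict) -> float:
    """Each component is scored in [0, 1]; the finding's confidence is
    their weighted sum."""
    return round(sum(WEIGHTS[k] * components[k] for k in WEIGHTS), 3)
```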

Reasoning Chains

Every finding includes a structured reasoning chain — a typed sequence of steps that documents how the system arrived at the conclusion. The chain progresses through distinct phases: initial observation of relevant traffic, constraint inference with supporting evidence, hypothesis formation, evidence collection (the requests sent and responses received), validation (reproduction attempts), and a final conclusion with CWE mapping and confidence score.

The reasoning chain serves two purposes. For the user, it provides a complete audit trail: "We observed this traffic, inferred this constraint, formed this hypothesis, sent these requests, and got these responses." For the system, it enables automated triage: a finding whose chain includes a high-confidence identity binding and confirmed reproduction is almost certainly real. One whose chain relies on a marginal negative inference should be reviewed more carefully.

Search Space Reduction in Practice

The theoretical claim is that constraint-driven testing reduces the search space from O(parameters × payloads) to something proportional to the number of meaningful constraints. How does this play out in practice?

Consider a REST API with 150 endpoints, 600 total parameters, and a traditional scanner carrying 500 payloads across 15 vulnerability classes. The brute-force search space is approximately 4.5 million requests. Most of these are noise: SQL injection payloads sent to UUID parameters, XSS vectors sent to integer fields, path traversal attempts against endpoints that do not read files.

The constraint-driven approach on the same application might produce:

  • 12 identity binding constraints (12 parameters bound to user identity across 8 endpoints)
  • 4 role-required constraints (admin-only endpoints)
  • 6 ordering constraints (multi-step workflows)
  • 8 parameter dependency constraints (values that must flow from prior responses)
  • 3 auth-required constraints (endpoints requiring specific auth schemes)
  • 2 rate limit constraints
  • 5 numeric bound constraints

That is 40 constraints, generating roughly 50-60 violation hypotheses after expansion. Each hypothesis requires 1-5 requests to test (a single substitution for IDOR, a small sequence for state bypass). The total request count is in the low hundreds — three to four orders of magnitude fewer than the brute-force approach.

But the reduction is not just quantitative. The qualitative difference is that every request in the constraint-driven approach tests a specific, meaningful property of the application. There are no wasted requests. There is no "spray SQL injection at everything and hope" phase.

Business Logic: The Category Scanners Miss

The constraint framework extends naturally to business logic vulnerabilities — the category that traditional scanners are worst at detecting, because these bugs are not about malformed input but about valid operations performed in invalid sequences or with invalid quantities.

The business logic module includes several specialized analyzers:

State Machine Modeling

The state machine modeler reconstructs the application's intended state machine from observed traffic. It identifies states (order created, payment pending, payment confirmed, shipped) and transitions (create -> pay -> confirm -> ship). A workflow analyzer then generates step-skip test plans: for each transition A -> B -> C, test whether A -> C succeeds by skipping B. A checkout that succeeds without payment, or a shipment that processes without confirmation, is a business logic flaw.
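Generating step-skip plans from a reconstructed flow can be sketched as:

```python
def step_skip_plans(transitions):
    """transitions: ordered list of states, e.g. the create -> pay ->
    confirm -> ship flow above. For each intermediate step, emit a plan
    that omits it and replays the rest of the sequence, so the tester
    can check whether the later transitions still succeed."""
    plans = []
    for i in range(1, len(transitions) - 1):
        plans.append({
            "skip": transitions[i],
            "sequence": transitions[:i] + transitions[i + 1:],
        })
    return plans
```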

Price Manipulation

The price manipulation detector monitors numeric fields in cart and checkout flows. If a price or total field appears in a request body (not just a response), the system tests whether modifying it changes the server-side total. This catches applications that trust client-submitted prices, a vulnerability class that payload-based scanners cannot detect because the "payload" is a valid number — just not the right one.

Discount and Idempotency Analysis

Discount analysis tests whether discount codes or promotional offers can be applied multiple times, stacked beyond intended limits, or used after expiration by replaying captured requests. Idempotency analysis tests whether operations that should be idempotent (payment processing, order submission) can be repeated to cause duplicate effects.

Browser Verification

For vulnerability classes where server response analysis alone is insufficient, the system includes browser-based verification using headless Chrome:

  • XSS verification: The system injects candidate payloads and monitors for JavaScript dialog events (alert, confirm, prompt). A triggered dialog is proof of execution. The system captures both a screenshot at the moment of the dialog and a DOM snapshot showing the injected payload in context.
  • SQL injection timing analysis: For blind SQLi candidates, the system measures response timing with statistical rigor — multiple baseline measurements followed by timed injection payloads, looking for statistically significant deviations rather than single-sample comparisons.
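The statistical comparison for timing analysis can be sketched as a simple z-score test. The threshold is an illustrative assumption; a production implementation would also need to control for network jitter:

```python
from statistics import mean, stdev

def is_timing_anomaly(baseline, injected, z_threshold=3.0):
    """Flag a blind-SQLi candidate when the mean injected response time
    sits z_threshold standard deviations above the baseline mean,
    rather than comparing single samples."""
    mu, sigma = mean(baseline), stdev(baseline)
    if sigma == 0:
        sigma = 1e-9  # avoid division by zero on a flat baseline
    z = (mean(injected) - mu) / sigma
    return z > z_threshold
```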

Browser verification is expensive (each check requires launching a browser context, navigating, waiting for events), so it is only triggered for findings that have already passed the confidence threshold from the constraint-based analysis. This is another example of the constraint framework paying dividends: by reducing the candidate set before expensive verification, the total assessment time drops dramatically.

Why Constraints Find What Fuzzers Miss

The gap between constraint-driven analysis and fuzzing is not about capability — it is about the class of bugs each approach can reach.

Fuzzers excel at finding bugs where the vulnerability is in how input is processed: buffer overflows, format string bugs, injection via parser edge cases. These are input-level bugs. The input itself is malformed, and the bug is the application's failure to handle it safely.

Constraint violations are fundamentally different. The input is valid. The request is well-formed. The parameters have correct types and reasonable values. The bug is that the application allows a valid operation that it should not — accessing another user's data, skipping a required workflow step, processing a payment with a client-modified total. No amount of input mutation will surface these bugs, because there is nothing wrong with the input. The error is in the authorization model, the state machine, or the trust boundary.

This is why the observe-infer-violate pipeline must first build a model of the application's behavior. You cannot test whether a constraint holds without first knowing the constraint exists. And you cannot know the constraint exists without observing enough of the application's behavior to infer it.

The most dangerous vulnerabilities are not the ones where the application crashes. They are the ones where it responds 200 OK — with someone else's data.

Practical Implications

The constraint-driven approach has several concrete consequences for how security testing works in practice:

Coverage is measured differently. Traditional scanners measure coverage as "percentage of endpoints tested" or "percentage of parameters fuzzed." Constraint-driven testing measures coverage as "percentage of inferred constraints tested." An application might have 150 endpoints but only 40 meaningful constraints. Testing all 40 constraints provides more security assurance than fuzzing all 150 endpoints with generic payloads.

False positive rates drop. Because each test targets a specific constraint with a specific expected outcome, the system knows what a true positive looks like before sending the request. An IDOR test expects to see the victim's data in the response. A state bypass test expects a 2xx where a 4xx should have been. This targeted expectation makes false positives structurally less likely than in payload-spray approaches, where the scanner must guess at the significance of each response.

Reports become explainable. The reasoning chain attached to each finding is not a post-hoc rationalization — it is the actual inference path the system followed. This means the report can show: "We observed that user_id in the path always matched the JWT sub claim (identity binding, high confidence). We hypothesized that substituting a different user's ID would bypass this binding. We tested with three different victim IDs and received 200 OK with victim data in all three cases." That is a finding a human can evaluate, reproduce, and act on.

The approach composes with existing workflows. Because the input to Phase 1 is HTTP traffic, the system works with any traffic source: manual browsing through a proxy, automated crawl results, CI/CD integration test suites, or production traffic samples. You do not need to replace your existing testing workflow. You feed it the traffic you already have, and the constraint engine extracts structure from it.


Arbiter currently covers 52 vulnerability classes and implements 267 MCP tools that expose this pipeline to AI agents. The constraint engine, hypothesis generator, and business logic analyzers are all accessible as individual tools, allowing an agent to reason about the application at whatever level of abstraction the situation requires — from high-level "scan this application" to low-level "test whether this specific identity binding holds for this parameter."

The core insight remains simple: intelligence is constraint. The more structure you can extract from observed behavior, the less you need to search. And the less you search, the more meaningful each test becomes.