Arbiter exposes 267 MCP tools to the AI agent. That number sounds impressive in a feature table, but what does it actually mean during an assessment? How do those tools compose into a coherent workflow? This article walks through an illustrative security assessment of a realistic target application, from traffic import through to a submission-ready bug bounty report — showing how the agent reasons at each phase, and how the system correlates findings into attack chains that are greater than the sum of their parts.
For this walkthrough, the target is a representative multi-tenant SaaS application — a financial dashboard with a REST API, OAuth2 login, role-based access control, and a React SPA frontend. The application is enrolled in a bug bounty program. The details below are illustrative: they show the workflow and decision-making process, not a specific engagement.
The 267 Tools: A Map Before the Territory
Before diving into the assessment, it is worth understanding how those 267 tools are organized. They are not a flat list — they are grouped into handler modules with fast dispatch. The agent never browses a menu of 267 items. It reasons about what it needs to do next, and the orchestration layer maps that intent to the right tool.
| Category | Purpose |
|---|---|
| Vulnerability Scanners | 52 vulnerability classes (XSS, SQLi, SSRF, IDOR, CSRF, CORS, and more), each with targeted detection logic |
| Auth & Crypto | Authentication protocol analysis (OAuth, SAML), session management, cryptographic weaknesses |
| Infrastructure | Protocol-specific and cloud infrastructure testing (DNS, GraphQL, WebSocket, gRPC, HTTP/3) |
| Reconnaissance | Discovery, crawling, fingerprinting, OSINT, secret detection |
| Orchestration | Multi-tool pipelines, phase management, coverage tracking, combo operations |
| Verification | Browser-based proof generation and finding validation |
| Reporting | CVSS 4.0 scoring, attack chain construction, platform export |
| AI-Assisted | Adaptive bypass generation, AI security auditing, Prompt Shield |
| Support | Proxy management, out-of-band correlation, session handling, WAF analysis |
The key architectural pattern: every tool that produces findings feeds them into a shared finding store with automatic deduplication, and after every execution the system extracts new findings and runs auto-correlation. Findings are never isolated — they are always evaluated in the context of everything else the assessment has already discovered.
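A minimal sketch of that pattern, with illustrative names rather than Arbiter's actual internals: findings are keyed for deduplication, and a correlation callback fires after every successful insert.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Finding:
    vuln_type: str   # e.g. "xss", "idor"
    endpoint: str    # normalized method + path
    severity: str

class FindingStore:
    """Shared store: dedupes on (type, endpoint), re-correlates on every insert."""
    def __init__(self, correlate):
        self._findings = {}
        self._correlate = correlate  # callback run after each new finding lands

    def add(self, finding: Finding) -> bool:
        key = (finding.vuln_type, finding.endpoint)
        if key in self._findings:
            return False                  # duplicate: the first copy wins
        self._findings[key] = finding
        self._correlate(list(self._findings.values()))  # auto-correlation pass
        return True
```

The important property is that correlation runs on every mutation, not at the end: a new finding is immediately evaluated against everything already in the store.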
Phase 1: Traffic Import
Every assessment begins with traffic. Arbiter supports five ingestion paths with a unified internal representation. For this assessment, the tester has been manually browsing the target application through Burp Suite, building up a sitemap of authenticated interactions. The agent begins by importing that traffic:
// Import Burp Suite traffic capture
// -> 847 exchanges imported, 312 unique endpoints
The five ingestion paths serve different workflows:
| Method | Use Case |
|---|---|
| HAR Import | Browser DevTools exports, CI pipeline captures |
| Burp XML | Existing Burp Suite sitemaps |
| Live Proxy | Real-time capture during manual testing (non-MITM HTTP proxy, tunnels HTTPS) |
| Browser Extension | Direct integration with browser, captures requests in-context |
| API Spec Import | Generate traffic from OpenAPI or Postman collections |
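Whatever the source, each exchange normalizes into the same internal record. A sketch of the HAR path, assuming the standard HAR 1.2 layout (`log.entries[]` with `request`/`response` sub-objects) and an illustrative record shape:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Exchange:
    """Unified internal representation every ingestion path normalizes into."""
    method: str
    url: str
    status: int

def from_har_entry(entry: dict) -> Exchange:
    # HAR 1.2 stores each exchange with request/response sub-objects
    return Exchange(
        method=entry["request"]["method"],
        url=entry["request"]["url"],
        status=entry["response"]["status"],
    )
```

Burp XML, proxy capture, and spec-generated traffic would each get their own adapter producing the same `Exchange`, which is what lets the downstream tools ignore where traffic came from.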
With traffic loaded, the agent sets up authentication context so the system knows which sessions belong to which users:
// Configure two test sessions:
// Session 1: "member" role (low-privilege user)
// Session 2: "admin" role (high-privilege user)
// Auth type: Bearer token
This is critical. Without auth context, the system cannot detect identity bindings, role boundaries, or privilege escalation. The state graph builder uses these contexts to classify endpoints into authentication phases — pre-auth, auth transition, post-auth — and to map authentication boundaries (session cookies, bearer tokens, API keys) to endpoint groups.
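A rough sketch of the phase classification (the heuristic and path names are illustrative, not Arbiter's actual logic): the login endpoint is the transition point, and anything observed behind an authentication boundary is post-auth.

```python
def classify_auth_phase(path: str, saw_auth: bool,
                        transition_paths=("/api/v2/auth/login",)) -> str:
    """Classify an endpoint into the three phases the state graph uses."""
    if path in transition_paths:
        return "auth-transition"   # the endpoint that issues credentials
    return "post-auth" if saw_auth else "pre-auth"
```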
Phase 2: Reconnaissance
The agent now initiates the first phase of the assessment playbook — Recon. The playbook defines a 5-phase assessment workflow with decision points, expected outputs, and advancement triggers. The agent requests an ordered tool chain for the current phase:
// Phase: Recon
// Target: the SaaS application
// -> Returns ordered tool chain:
// subdomain enumeration, DNS analysis, CT log scanning,
// cloud asset discovery, technology fingerprinting
The agent executes the chain. Subdomain enumeration discovers a staging instance (out of scope, noted but not tested), a WebSocket endpoint, and a CDN subdomain. Technology fingerprinting identifies:
// Technology fingerprinting results:
// -> React 18.2, Next.js 14.1, Express 4.18
// Server: cloudflare
// X-Powered-By: Express (information disclosure - noted)
// HSTS: present, max-age=31536000
// CSP: default-src 'self'; script-src 'self' 'unsafe-inline'
Two immediate observations. The X-Powered-By header is an information disclosure — minor, but it goes into the finding store. The CSP includes 'unsafe-inline' for script-src, which the system flags as relevant context for any XSS findings that come later. This is not a finding on its own yet, but the auto-correlation engine will use it.
Phase 3: Discovery
Discovery builds the application map. The agent uses a combo crawl-and-scan tool for efficiency — it crawls the SPA, extracts JavaScript bundles, and performs lightweight scanning in a single pass:
// Crawl and scan the target SPA
// -> 47 new endpoints discovered
// 3 source maps found
// SPA routes: /dashboard, /accounts/{id}, /transfers, /settings,
// /admin/users, /admin/audit-log
// 12 forms with 38 parameters extracted
Source map extraction is significant. An auth-related source map reveals client-side route guards — the admin routes /admin/users and /admin/audit-log are guarded by a simple if (user.role !== 'admin') check in JavaScript. This is client-side authorization only. The agent notes this for Phase 4, where it will test whether the API endpoints behind these routes enforce authorization server-side.
The agent also runs JS bundle analysis to search for hardcoded secrets and API patterns:
// JS bundle analysis
// -> Found: Public Stripe key (not sensitive)
// Found: API base URL pattern: /api/v2/{resource}
// Found: Commented-out debug endpoint: /api/v2/debug/user-lookup
A commented-out debug endpoint. The agent immediately tests it:
// Send raw request to debug endpoint
// GET /api/v2/debug/[email protected]
// -> 200 OK
// -> Returns user data: user_id, email, role, created_at
The debug endpoint is live in production, and it returns user data for any email address. This is immediately logged as a finding: Information Disclosure via Debug Endpoint (CWE-215). The auto-correlation engine runs and notes this finding, but cannot yet correlate it with anything else. That will change.
Phase 4: Scanning
This is where the bulk of the 267 tools come into play. The agent selects a thorough scan plan, which sequences vulnerability scanners across all discovered endpoints with appropriate payloads for each parameter type:
// Start thorough scan plan
// Scope: the target API
// Auth sessions: member + admin
// Exclude: health check and version endpoints
The scan plan orchestrates tool chains with automatic data routing between tools. Behind the scenes, this invokes dozens of individual scanners. Rather than listing every tool call, here are the significant findings that emerge:
Finding 1: Reflected XSS in Search
// XSS scanner on /api/v2/transactions/search
// -> VULNERABLE: reflected XSS in search parameter
// Payload reflected unescaped in JSON error response
// Content-Type: application/json (but rendered in browser error page)
// Severity: Medium
Finding 2: Missing CSRF Protection on Transfer Endpoint
// CSRF scanner on POST /api/v2/transfers
// -> VULNERABLE: accepts requests without CSRF token or Origin validation
// Cookie-based auth with SameSite=Lax (but GET-to-POST bypass possible)
// Severity: Medium
Finding 3: IDOR on Account Details
// IDOR scanner on /api/v2/accounts/{id}
// Tests cross-user access with both auth sessions
// -> VULNERABLE: GET /api/v2/accounts/{id} returns account data
// regardless of authenticated user
// Low-privilege user can read other users' account data
// Severity: High
Finding 4: Broken Access Control on Admin Endpoints
// Auth scanner on admin endpoints
// Tests: member role vs admin role access
// -> VULNERABLE: GET /api/v2/admin/users returns 200 for member role
// Server returns full user list with emails and roles
// GET /api/v2/admin/audit-log returns 200 for member role
// Severity: High (user list), Medium (audit log)
The source map finding from Phase 3 predicted this. The admin routes had client-side guards but no server-side enforcement.
Finding 5: CORS Misconfiguration
// CORS scanner on /api/v2/accounts
// -> VULNERABLE: Origin reflection — server reflects any Origin
// in Access-Control-Allow-Origin header
// Access-Control-Allow-Credentials: true
// Severity: Medium
Finding 6: Weak Session Token Entropy
// Session security analysis (50 token samples)
// -> WARNING: Session cookie uses sequential component
// Effective entropy: ~48 bits (below 128-bit recommendation)
// HttpOnly: true, Secure: true, SameSite: Lax
// Severity: Low
Auto-Correlation: Where Findings Become Attack Chains
After each tool execution, the system runs auto-correlation. With eight findings now in the store (the six scan findings plus the two logged during recon and discovery), the correlation engine activates several of its 16+ rules. This is where Arbiter diverges most sharply from traditional scanners — individual medium-severity findings combine into critical attack chains.
Correlation Rule: XSS + Missing CSP
The XSS finding (Medium) is correlated with the CSP observation from Phase 2. The CSP includes 'unsafe-inline', which means the XSS payload can execute arbitrary inline scripts without CSP blocking it. The correlation engine upgrades the combined severity:
// Correlation: XSS + Weak CSP
// Combined severity: Critical
// Rationale: Reflected XSS with unsafe-inline CSP permits full
// script execution — cookie theft, session hijacking,
// keylogging. CSP provides no mitigation.
Correlation Rule: CORS Bypass + Cookie Without SameSite=Strict
// Correlation: CORS Bypass + Cookie Without SameSite=Strict
// Combined severity: High
// Rationale: CORS origin reflection with credentials:true allows
// any origin to make authenticated requests. SameSite=Lax
// permits this for top-level navigations. Attacker-controlled
// origin can read authenticated API responses.
Correlation Rule: Account Enumeration + Debug Endpoint
// Correlation: Account Enumeration + Information Disclosure
// Combined severity: Critical
// Rationale: Debug endpoint reveals user emails and IDs.
// Admin user list accessible to any authenticated user.
// Combined: attacker can enumerate all users, obtain their
// account IDs, and exploit IDOR to access their data.
The correlation engine also constructs a multi-step attack chain template — the "XSS Cookie Exfiltration Chain":
// Attack chain: XSS Cookie Exfiltration
// Required findings: XSS + CORS misconfiguration + weak CSP
// Steps:
// 1. Exploit reflected XSS in /api/v2/transactions/search
// 2. CSP allows inline script execution (unsafe-inline)
// 3. Script reads document.cookie (HttpOnly=true blocks this —
// chain adjusted: use XSS to make authenticated API calls instead)
// 4. CORS misconfiguration allows exfiltrating API responses
// to attacker-controlled origin
// Adjusted chain:
// 1. Attacker sends victim link triggering XSS
// 2. Injected script makes fetch() to /api/v2/accounts (same origin)
// 3. CORS allows reading response from any origin
// 4. Script exfiltrates account data to attacker server
// Impact: Full account data theft for any user who clicks the link
Notice how the chain was adjusted. The system detected that HttpOnly: true blocks direct cookie theft, so it modified the attack path to use the XSS for same-origin API calls instead. This is the attack reasoning engine in action — it maps finding types to prioritized follow-up actions and adapts when one path is blocked.
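The adjustment logic can be sketched as a planner that walks the chain and substitutes a fallback whenever a step's precondition is blocked (the step names and field layout here are illustrative):

```python
def plan_chain(steps, blocked):
    """Walk the chain; swap in a fallback when a step's precondition is blocked."""
    plan = []
    for step in steps:
        if step["requires"] not in blocked:
            plan.append(step["action"])
        elif "fallback" in step:
            plan.append(step["fallback"])   # e.g. HttpOnly blocks the cookie read
        else:
            return None                     # blocked with no fallback: chain dies
    return plan

# The cookie-exfiltration chain from the walkthrough, as data
chain = [
    {"action": "trigger reflected XSS", "requires": "xss"},
    {"action": "read document.cookie", "requires": "cookie-readable",
     "fallback": "same-origin fetch() to /api/v2/accounts"},
    {"action": "exfiltrate via permissive CORS", "requires": "cors-bypass"},
]
```

With `cookie-readable` blocked by `HttpOnly`, the planner emits the adjusted chain rather than abandoning it.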
The full set of auto-correlation rules covers 16+ combinations:
| Finding Combination | Upgraded Severity | Attack Outcome |
|---|---|---|
| XSS + Missing CSP | Critical | Full script execution, no browser mitigation |
| CORS Bypass + Cookie Without SameSite | High | Cross-origin credential theft |
| Account Enumeration + Weak Reset Token | Critical | Account takeover via predictable reset |
| Open Redirect + OAuth Misconfiguration | Critical | Token theft via redirect chain |
| IDOR + Information Disclosure | Critical | Mass data exfiltration with known IDs |
| CSRF + State-Changing Endpoint | High | Unauthorized actions on behalf of victim |
| BAC + Admin Functionality | Critical | Privilege escalation to admin |
| SSRF + Cloud Metadata | Critical | Cloud credential theft via metadata service |
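Conceptually, rules like these are data, not code: a mapping from finding-type combinations to an upgraded severity and outcome. A sketch, with illustrative type labels and only a subset of the rules:

```python
CORRELATION_RULES = {
    frozenset({"xss", "weak-csp"}):           ("Critical", "full script execution"),
    frozenset({"cors-bypass", "lax-cookie"}): ("High", "cross-origin credential theft"),
    frozenset({"idor", "info-disclosure"}):   ("Critical", "mass data exfiltration"),
    frozenset({"ssrf", "cloud-metadata"}):    ("Critical", "cloud credential theft"),
}

def fire_rules(present: set) -> list:
    """A rule fires when every finding type in its combination is present."""
    return [(severity, outcome)
            for combo, (severity, outcome) in CORRELATION_RULES.items()
            if combo <= present]
```

Because the rules key on unordered sets, it does not matter which finding arrived first; the rule fires as soon as the last member of the combination lands in the store.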
Phase 5: Exploitation and Browser Verification
Findings and correlations are hypotheses until verified. The agent now enters the exploitation phase, using browser verification to produce concrete proof for each finding. Arbiter drives headless Chrome via the Chrome DevTools Protocol, capturing screenshots, DOM snapshots, console logs, and network requests as evidence.
Verifying the XSS
The attack reasoning engine for XSS findings specifies a verification sequence: validate with proof, generate PoC, capture screenshot, check CSP. The agent follows this:
// Verify XSS finding in headless browser
// -> Launches headless Chrome
// Navigates to vulnerable URL with payload
// Detects JavaScript alert dialog
// Captures:
// - Screenshot at moment of dialog trigger
// - DOM snapshot showing injected payload in context
// - Console log entries
// - Network request/response pair
// Result: CONFIRMED, confidence: 0.97
The XSS verifier specifically monitors for dialog events — alert, confirm, prompt — as proof of JavaScript execution. A triggered dialog is unambiguous evidence. The system captures both the screenshot and the DOM state, because the screenshot alone is insufficient for a bug bounty report: the reviewer needs to see the injected payload in the page source.
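The confirmation logic can be sketched as a pure function over the captured evidence. The event names mirror the dialog types mentioned above; the status labels and confidence values are purely illustrative:

```python
DIALOG_EVENTS = {"alert", "confirm", "prompt"}  # unambiguous proof of JS execution

def confirm_xss(events, dom_snapshot: str, payload: str):
    """Confirm only when a dialog fired AND the payload appears in the DOM,
    so the report shows both execution and injection context."""
    dialog_fired = any(e in DIALOG_EVENTS for e in events)
    payload_in_dom = payload in dom_snapshot
    if dialog_fired and payload_in_dom:
        return "CONFIRMED", 0.97
    if dialog_fired or payload_in_dom:
        return "LIKELY", 0.6    # partial evidence: flag for manual review
    return "UNCONFIRMED", 0.0
```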
Verifying the IDOR
// Generate multi-step PoC for IDOR finding (cURL format)
// -> Step 1: Authenticate as low-privilege user
// POST /api/v2/auth/login
//
// Step 2: Request another user's account data
// GET /api/v2/accounts/{victim-account-id}
// with attacker's Bearer token
//
// Step 3: Observe victim's data in response
// Expected: 403 Forbidden
// Actual: 200 OK with victim's account balance, transactions
Verifying the Admin Access
// Verify broken access control finding in headless browser
// -> Logs in as member role
// Navigates directly to /admin/users (bypassing client-side guard)
// Page renders full admin user management interface
// Screenshot captured showing user list
// DOM snapshot shows email addresses, roles, account status
// Result: CONFIRMED, confidence: 0.99
The browser verification for the broken access control finding is particularly compelling because it shows the actual admin interface rendering for a non-admin user. This is stronger evidence than a raw API response — it demonstrates that the entire admin functionality is accessible, not just a single endpoint.
Out-of-Band Verification
During the scanning phase, one finding warranted out-of-band verification — a potential blind SSRF in a URL preview feature. The agent uses the OOB correlation server:
// Start OOB correlation server (HTTP + DNS)
// Generate OOB payload with correlation ID
// Send payload via URL preview feature: POST /api/v2/links/preview
// Poll for OOB interactions (30s timeout)
// -> No interaction received. SSRF not confirmed.
// Finding discarded.
The OOB server listens on both HTTP and DNS, and each payload carries a correlation ID that links the incoming interaction back to the specific test that generated it. In this case, the server-side URL preview did not make the outbound request, likely because an allowlist or other SSRF protection is in place. The finding is discarded. This is the system working correctly: generating hypotheses cheaply, testing them, and discarding the ones that do not confirm.
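The correlation-ID mechanism can be sketched as follows (class name and domain are hypothetical): the ID is embedded as the leftmost DNS label of the payload, so both HTTP and DNS interactions carry it back to the registry.

```python
import secrets

class OOBRegistry:
    """Maps correlation IDs embedded in payload hostnames back to the test
    that generated them."""
    def __init__(self, base_domain: str):
        self.base_domain = base_domain
        self._tests = {}

    def payload_for(self, test_name: str) -> str:
        cid = secrets.token_hex(8)        # unguessable correlation ID
        self._tests[cid] = test_name
        return f"http://{cid}.{self.base_domain}/"

    def match(self, hostname: str):
        cid = hostname.split(".", 1)[0]   # leftmost label carries the ID
        return self._tests.get(cid)       # None -> interaction we never sent
```

Embedding the ID in the hostname (rather than the path) matters because a DNS-only interaction, such as a resolver lookup without a follow-up HTTP request, still leaks the correlation ID.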
CVSS 4.0 Scoring
With findings verified, the agent generates CVSS 4.0 scores. Arbiter computes these automatically based on finding type and confirmed impact:
// CVSS 4.0 scoring for IDOR finding:
// -> CVSS:4.0/AV:N/AC:L/AT:N/PR:L/UI:N/VC:H/VI:N/VA:N/SC:N/SI:N/SA:N
// Base Score: 7.1 (High)
// With correlation context (mass enumeration possible): 8.7 (High)
// CVSS 4.0 scoring for broken access control:
// -> CVSS:4.0/AV:N/AC:L/AT:N/PR:L/UI:N/VC:H/VI:H/VA:N/SC:N/SI:N/SA:N
// Base Score: 8.7 (High)
// CVSS 4.0 scoring for XSS:
// -> Base Score: 5.3 (Medium)
// With CSP correlation: 8.7 (High) — upgraded due to full exploitation path
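These vector strings are machine-readable. A small parser splits them into metric/value pairs; this is a sketch only, and actual scoring should use an official CVSS 4.0 implementation, since the score derives from the specification's MacroVector lookup tables:

```python
def parse_cvss4_vector(vector: str) -> dict:
    """Split a CVSS 4.0 vector string into its metric/value pairs.
    (Score computation is intentionally out of scope here.)"""
    prefix, *metrics = vector.split("/")
    if prefix != "CVSS:4.0":
        raise ValueError(f"not a CVSS 4.0 vector: {prefix}")
    return dict(m.split(":", 1) for m in metrics)
```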
Report Generation
The agent now generates the final report. Arbiter's reporting engine supports configurable output with screenshot embedding, DOM diffs, and console logs. Findings are grouped by vulnerability type:
// Generate assessment report
// All 8 findings included
// Options: screenshots, DOM diffs, console logs,
// attack chains, remediation guidance
The report includes the attack chain constructed during correlation, showing how individual findings compose into a critical exploitation path. It also generates remediation guidance:
// Remediation for IDOR:
// -> Primary: Implement server-side authorization check on
// /api/v2/accounts/{id} — verify authenticated user owns
// the requested account before returning data.
// Secondary: Add rate limiting to prevent mass enumeration.
// Reference: OWASP ASVS V4.0 - 4.2.1
// Executive summary:
// -> 8 findings: 3 Critical (correlated), 2 High, 2 Medium, 1 Low
// Critical attack path: User enumeration -> IDOR -> mass data theft
// Most urgent: Debug endpoint in production, admin access control bypass
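The primary IDOR remediation above, a server-side ownership check, can be sketched in a few lines (the data store and handler signature are illustrative, not the target's actual stack):

```python
from dataclasses import dataclass

@dataclass
class User:
    id: str

# Toy data store standing in for the real database
ACCOUNTS = {"acct-1": {"owner_id": "u-alice", "balance": 1200}}

def get_account(account_id: str, current_user: User):
    """The check the vulnerable endpoint lacked: verify ownership on the
    server before returning any data."""
    account = ACCOUNTS.get(account_id)
    if account is None:
        return 404, None
    if account["owner_id"] != current_user.id:
        return 403, None          # authorization enforced server-side
    return 200, account
```

The fix is deliberately boring: one comparison before the data leaves the server. The client-side route guard from Phase 3 can stay for UX, but it carries no security weight.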
Bug Bounty Platform Export
The tester wants to submit the IDOR finding to HackerOne. Arbiter supports three bug bounty platforms, each with its own formatting requirements:
| Platform | Severity Model | Report Format |
|---|---|---|
| HackerOne | CVSS severity levels | Summary / Steps to Reproduce / Impact / Supporting Material |
| Bugcrowd | P1-P5 priority + VRT categories | Title / VRT Category / Description / Impact / PoC |
| Intigriti | CVSS severity | Summary / Steps to Reproduce / Impact / PoC |
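The export logic reduces to ordering a finding's sections per platform. A sketch, assuming findings are stored as section-name to text mappings (Bugcrowd, with its VRT taxonomy, would need a richer adapter than this):

```python
PLATFORM_SECTIONS = {
    "hackerone": ["Summary", "Steps to Reproduce", "Impact", "Supporting Material"],
    "intigriti": ["Summary", "Steps to Reproduce", "Impact", "PoC"],
}

def render_report(finding: dict, platform: str) -> str:
    """Emit the finding's sections in the order the platform expects."""
    sections = PLATFORM_SECTIONS[platform]
    return "\n\n".join(f"## {name}\n{finding[name]}" for name in sections)
```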
The platform export feature adapts each finding to the target platform's conventions:
// Export IDOR finding for HackerOne (with correlation context)
// -> Output:
//
// ## Summary
// Insecure Direct Object Reference (IDOR) on the account details
// endpoint allows any authenticated user to read any other user's
// account data by substituting the account ID in the URL path.
//
// ## Steps to Reproduce
// 1. Authenticate as any user
// 2. Note your own account ID
// 3. Request another user's account ID with your own Bearer token
// 4. Observe: 200 OK response contains the victim's full account data
//
// ## Impact
// Any authenticated user can access any other user's financial data.
// Combined with the debug endpoint (information disclosure finding),
// an attacker can enumerate all user account IDs and systematically
// exfiltrate the entire user base's records.
// CVSS 4.0 Base Score: 7.1 (High), correlated: 8.7.
//
// ## Supporting Material
// - Screenshot: browser verification showing victim data returned
// - cURL PoC: [attached]
// - DOM snapshot: [attached]
// - Correlation note: exploitability amplified by info disclosure
The agent can also export findings as Nuclei templates for automated regression testing:
// Export as Nuclei template
// -> Generates YAML template that reproduces the IDOR check
// Can be integrated into CI/CD pipeline to verify the fix
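A sketch of what such a generated template might look like. The field names follow Nuclei's published template schema (`id`, `info`, `http`, `matchers`), but treat this as an illustration rather than Arbiter's actual output; the matcher deliberately fires when the vulnerable 200 response reappears:

```python
def idor_regression_template(path: str) -> str:
    """Minimal sketch of a Nuclei HTTP template for regression-testing the fix.
    The template matches when the endpoint again returns 200 for a
    cross-user request."""
    return f"""\
id: idor-account-access-regression
info:
  name: IDOR - cross-user account access (regression check)
  severity: high
http:
  - method: GET
    path:
      - "{{{{BaseURL}}}}{path}"
    matchers:
      - type: status
        status:
          - 200
"""
```

In practice the request would also need the low-privilege session's Bearer token attached, since an unauthenticated 403 would mask a regression.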
The Prompt Shield Sidebar
One category of tools deserves separate mention: Prompt Shield. As more applications integrate LLM features, Arbiter includes tools specifically for AI security auditing. During this assessment, the agent noticed an AI-powered transaction categorization feature and ran prompt injection and hidden content scans:
// Prompt injection scan on /api/v2/transactions/categorize
// -> Tested 47 prompt injection patterns
// No injection confirmed — input appears to be used as
// classification context only, not as instruction
// Hidden content scan on /dashboard
// -> No hidden AI instructions or system prompts detected in page source
Prompt Shield also includes tools for testing MCP server implementations themselves, detecting data leakage through AI features, and analyzing whether tool schemas could be exploited. These tools reflect a reality of modern web applications: AI features introduce a new attack surface that traditional scanners do not address.
State Graph: Seeing the Application Whole
Throughout the assessment, the state graph builder has been constructing a graph of the application's behavior. Endpoint nodes are indexed by method and path pattern, with path parameter detection using UUID, numeric, and hex ID regexes. The graph captures:
- State transitions — Which endpoints are called in sequence, with observed counts and average delays
- Auth boundaries — Which endpoints require session cookie, bearer token, basic auth, or API key authentication
- Auth phases — Pre-auth (login, registration), auth transition (the login endpoint itself), post-auth (everything behind authentication)
- CSRF token flows — Which endpoints issue CSRF tokens and which consume them
- Parameter value flows — How values flow from response bodies into subsequent request parameters
This graph is not just a visualization aid. It is the data structure that enables constraint inference, which feeds the IDOR detection, the state bypass testing, and the correlation engine. The state graph is the single representation that makes the 267 tools work as a coherent system rather than a bag of independent scanners.
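Under illustrative assumptions about the internals, the core of such a graph (ID normalization plus transition counting) fits in a few lines:

```python
import re
from collections import defaultdict

# Path-parameter detection: collapse concrete IDs into pattern slots
ID_PATTERNS = [
    (re.compile(r"/[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}\b"),
     "/{uuid}"),
    (re.compile(r"/\d+(?=/|$)"), "/{id}"),
]

def normalize_path(path: str) -> str:
    for pattern, slot in ID_PATTERNS:
        path = pattern.sub(slot, path)
    return path

class StateGraph:
    """Endpoint nodes keyed by normalized path; edges count observed transitions."""
    def __init__(self):
        self.transitions = defaultdict(int)

    def observe(self, prev_path: str, next_path: str):
        edge = (normalize_path(prev_path), normalize_path(next_path))
        self.transitions[edge] += 1
```

Normalization is what makes the graph useful: requests to `/accounts/123` and `/accounts/456` collapse onto the same node, so transition counts accumulate per endpoint pattern rather than per concrete URL.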
Tool Composition: Pre-Built Chains and Custom Pipelines
The assessment above used individual tools for clarity, but in practice the agent frequently uses pre-built tool chains — pipelines that wire output from one tool as input to the next automatically. Arbiter ships with pre-built chains covering common workflows:
| Chain Type | Tools Composed | Output |
|---|---|---|
| Full recon | Subdomains + DNS + CT logs + cloud assets + fingerprinting | Complete target surface map |
| Header audit | Security header check + CSP analysis + CORS check + cookie analysis | Header security posture |
| JS secrets | JS bundle extraction + source map analysis + secret detection | Leaked credentials and API patterns |
| Injection sweep | SQLi + XSS + CMDi + SSTI + path traversal across all params | Injection findings with severity |
| Auth audit | OAuth + session + RBAC + password policy + 2FA check | Authentication weakness report |
| CORS exploit | CORS scan + origin manipulation + credential leak test | Exploitable CORS findings |
| SSRF chain | SSRF scan + OOB server + cloud metadata test | Confirmed SSRF with impact |
| Full assessment | All phases sequenced with correlation | Complete assessment report |
| API security | API conformance + versioning + GraphQL introspection + auth | API-specific security findings |
| Deep recon | Full recon + Wayback + GitHub dorks + technology deep-dive | Extended OSINT results |
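The composition primitive itself is simple. A sketch of the automatic data routing, where each tool is just a callable:

```python
def run_chain(tools, seed):
    """Automatic data routing: each tool's output becomes the next tool's input."""
    data = seed
    for tool in tools:
        data = tool(data)
    return data
```

A chain like "JS secrets" is then an ordered list of tool callables: bundle extraction feeds source-map analysis, which feeds secret detection, with no per-step reasoning required from the agent.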
Additionally, multiple scan plans define different assessment intensities — from quick header audits and top-10 checks, through focused auth-only or injection-only deep scans, to a full end-to-end assessment. The agent selects the appropriate plan based on scope, time constraints, and what the target application looks like.
What 267 Tools Actually Means
The number 267 is not the point. A tool count is a proxy for coverage — the breadth of behaviors the system can observe, the hypotheses it can generate, and the verification methods it has available. What matters is the architecture:
- Every tool feeds the same finding store. There is no siloed output. An observation from the fingerprinting phase is available to the correlation engine when a vulnerability scanner runs three phases later.
- Auto-correlation runs after every tool execution. The system does not wait until the end to connect dots. A finding that becomes critical only in combination with an earlier finding is flagged immediately, so the agent can prioritize verification.
- The playbook provides structure. The 5-phase workflow (Recon, Discovery, Scanning, Exploitation, Reporting) is not a rigid sequence but a set of decision points. The agent can revisit earlier phases if new information warrants it — discovering the debug endpoint during JS analysis sent the agent back to test it immediately, not at the end of a predetermined scan queue.
- Tool composition reduces latency. Pre-built chains and scan plans mean the agent does not need to reason about every individual tool invocation. It operates at the level of "run an auth audit" rather than "call OAuth scanner, then session scanner, then RBAC scanner, then password policy checker."
This illustrative assessment produced 8 findings, 3 auto-correlated attack chains, and a submission-ready HackerOne report — using approximately 40 distinct tools out of the 267 available. The other 227 tools were not wasted: they represent coverage for vulnerability classes, protocols, and application architectures that this target did not expose. A GraphQL application would have activated the GraphQL introspection and query complexity tools. A WebSocket-heavy application would have activated the WebSocket scanner. A Next.js application would have activated the server-side rendering and API route tools.
267 tools is not a feature. It is a surface area — the system's ability to meet whatever application architecture it encounters with appropriate, targeted analysis rather than generic payload spraying.
Arbiter is currently in closed beta. If you are a bug bounty hunter, penetration tester, or security engineer interested in agent-driven assessments, join the waitlist.