The Model Context Protocol has become the standard interface between LLMs and external tools. Its design is clean: each tool declares a JSON Schema for its inputs, and the model selects tools by name, filling in structured arguments. For a handful of tools this works well. But what happens when you have 267 of them?

Arbiter exposes 267 MCP tools across 52 vulnerability classes. We built Forgemax to solve the scaling problem this creates. This post walks through the token economics, the two-tool collapse architecture, the V8 sandbox that makes it safe, and the gateway that holds it all together.

The Problem: Schema Bloat

Every MCP tool definition includes its name, description, and a full JSON Schema for its inputSchema. When a client calls tools/list, the server returns all of these definitions, and they land in the model's context window. Every turn. For the entire conversation.

A typical tool schema runs about 400 tokens. This accounts for the tool name, a description paragraph, and a handful of parameters with their own descriptions, types, and constraints. Some tools are leaner; tools with nested objects or enum arrays run considerably higher. But 400 tokens is a reasonable median across real MCP servers.

The arithmetic is straightforward:

Tool Count Estimated Context Cost % of 128K Window
10 tools ~4,200 tokens 3.3%
50 tools ~21,000 tokens 16.4%
76 tools ~33,100 tokens 25.9%
150 tools ~61,800 tokens 48.3%
267 tools ~110,000 tokens 85.9%

At 267 tools, the schema definitions alone consume roughly 110,000 tokens. On a 128K context model, that leaves about 18,000 tokens for the system prompt, conversation history, the actual task, and the model's reasoning. On a 200K model, it is more survivable, but you are still burning over half your context budget on tool metadata that the model will never use in any single turn -- a typical interaction touches only 3-5 tools.

There are secondary costs too. Larger prompts increase time to first token. Provider pricing is per-token, and the schema payload repeats on every round trip. And tool selection accuracy degrades as the list grows -- the model has to scan 267 candidate schemas to pick the right one, a needle-in-a-haystack problem that leads to hallucinated tool names and malformed arguments.

The Two-Tool Collapse

Forgemax replaces the entire N-tool surface area with exactly two MCP tools: search and execute. Both accept a single parameter: a JavaScript async arrow function as a string.

search(code: string)

The search tool queries an in-memory capability manifest. It lets the model discover what tools exist, what categories they belong to, and what their schemas look like -- without loading all of that into context upfront.

// List all connected MCP servers
async () => {
  return manifest.servers.map(s => s.name);
}

// Find tools related to authentication testing
async () => {
  return manifest.search("authentication");
}

// Get the full schema for a specific tool
async () => {
  return manifest.tool("narsil", "find_symbols");
}

execute(code: string)

The execute tool runs JavaScript that calls downstream MCP tools through typed proxies. Instead of one tool call per round trip, the model writes code that chains multiple operations in a single execution.

// Call a single tool directly
async () => {
  return forge.callTool("narsil", "find_symbols", {
    pattern: "main"
  });
}

// Fluent API for the same operation
async () => {
  return forge.server("narsil").ast.parse({ file: "main.rs" });
}

// Chain multiple tools in one round trip
async () => {
  const symbols = await forge.callTool("narsil", "find_symbols", {
    pattern: "handle_request"
  });
  const results = [];
  for (const sym of symbols.matches.slice(0, 5)) {
    const analysis = await forge.callTool("narsil", "analyze_function", {
      file: sym.file,
      function_name: sym.name
    });
    results.push({ symbol: sym.name, complexity: analysis.complexity });
  }
  return results;
}

The key insight is that LLMs already know JavaScript. They have been trained on millions of JavaScript files. Writing an async arrow function that calls forge.callTool(server, tool, args) is trivially within their capability distribution. What is not in their training distribution is correctly selecting from a list of 267 JSON schemas and filling in structured arguments for each one.

Type Awareness via forge.d.ts

Forgemax compiles approximately 135 lines of TypeScript definitions (forge.d.ts) into the binary itself. These definitions are exposed to the LLM via the MCP instructions field, which clients typically inject into the system prompt. The type definitions cover the Forge, ForgeStash, and Manifest APIs -- enough for the model to write correctly typed code without ever seeing the downstream tool schemas.

interface Forge {
  callTool(server: string, tool: string, args?: object): Promise<ToolResult>;
  server(name: string): ServerProxy;
  readResource(server: string, uri: string): Promise<ResourceContent>;
  stash: ForgeStash;
}

interface ForgeStash {
  set(key: string, value: any): void;
  get(key: string): any;
  keys(): string[];
  delete(key: string): boolean;
}

interface Manifest {
  servers: ServerInfo[];
  search(query: string): SearchResult[];
  tool(server: string, tool: string): ToolSchema | null;
}

Token Economics: The Benchmark

The Forgemax codebase includes a benchmark at crates/forge-manifest/examples/token_savings.rs that measures the actual token cost of both approaches. The Forgemax side of the ledger stays flat no matter how many downstream tools exist:

Downstream Tools Raw MCP Tokens Forgemax Tokens Savings
10 ~4,200 ~1,100 73%
76 ~33,100 ~1,100 96%
150 ~61,800 ~1,100 98%
267 ~110,000 ~1,100 99%

The Forgemax token count is constant at approximately 1,100 tokens. That covers the two tool schemas (search and execute) plus the forge.d.ts type definitions in the instructions field. It does not grow with the number of downstream tools because the downstream schemas are never loaded into context -- they are discovered on demand through the manifest.

Progressive Discovery via Manifest Layers

Rather than dumping 267 tool schemas into context, Forgemax implements a four-layer manifest that the model navigates via the search tool. Each layer provides increasing detail, and the model only drills into what it needs.

Layer Content Approx. Tokens
Layer 0 Server names + descriptions ~50
Layer 1 Categories per server ~200
Layer 2 Tool list per category ~500
Layer 3 Full schema for a specific tool ~200 each

A typical discovery sequence looks like this: the model calls search to list servers (Layer 0, 50 tokens in the response), identifies the relevant server, drills into its categories (Layer 1), finds the right category, lists tools (Layer 2), and then fetches the full schema for the 1-2 tools it actually needs (Layer 3). Total context consumed: roughly 1,000 tokens of manifest data, versus 110,000 for the full schema dump.

LiveManifest and Lock-Free Reads

The manifest is implemented as a LiveManifest that uses arc_swap::ArcSwap for lock-free reads with atomic swap on updates. This matters because manifest reads happen on the hot path -- every search call reads the manifest -- while updates happen in the background when servers are re-discovered.

Background re-discovery runs on a configurable interval. Forgemax also supports SIGHUP-triggered refresh, so you can force a manifest rebuild without restarting the process. When a downstream MCP server adds or removes tools, the manifest updates atomically; in-flight reads see either the old or the new version, never a torn state.
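
A minimal sketch of that pattern with arc_swap (the manifest fields here are illustrative, not Forgemax's actual layout):

use std::sync::Arc;
use arc_swap::ArcSwap;

// Illustrative manifest type; the real LiveManifest carries layered data.
struct Manifest {
    servers: Vec<String>,
}

struct LiveManifest {
    inner: ArcSwap<Manifest>,
}

impl LiveManifest {
    fn new(initial: Manifest) -> Self {
        Self { inner: ArcSwap::from_pointee(initial) }
    }

    // Hot path: lock-free read. Callers get a cheap snapshot that
    // stays valid even if a background update swaps the manifest.
    fn snapshot(&self) -> Arc<Manifest> {
        self.inner.load_full()
    }

    // Background path: atomically replace the manifest after re-discovery.
    // In-flight readers keep the old Arc; new readers see the new one.
    fn replace(&self, next: Manifest) {
        self.inner.store(Arc::new(next));
    }
}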

The V8 Sandbox

Accepting arbitrary JavaScript from an LLM and executing it requires serious sandboxing. Forgemax uses V8 isolates via deno_core, wrapped in a purpose-built sandbox (forge-sandbox crate) that provides defense in depth.

Runtime Architecture

V8 isolates are !Send in Rust -- they cannot be moved across threads. This is a V8 design constraint: each isolate is bound to the thread that created it. Forgemax handles this by running all JsRuntime operations on a dedicated thread with its own single-threaded tokio runtime. The public API remains async + Send via tokio::sync::oneshot channels that bridge the gap.

Every execution gets a fresh V8 runtime. There is no state carried between execute calls -- no global variables, no module cache, no prototype chain modifications. This eliminates an entire class of state-leakage attacks where one execution poisons the environment for the next.
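
A sketch of the bridging pattern under those constraints, assuming a tokio mpsc job queue (the job type is simplified; the real runtime carries an ExecutionContext and returns a structured ExecutionResult):

use tokio::sync::{mpsc, oneshot};

// Simplified job: code in, string result out.
struct Job {
    code: String,
    reply: oneshot::Sender<String>,
}

fn spawn_v8_thread(mut jobs: mpsc::Receiver<Job>) -> std::thread::JoinHandle<()> {
    std::thread::spawn(move || {
        // Single-threaded tokio runtime pinned to this thread, because
        // any V8 isolate created here must never move off it (!Send).
        let rt = tokio::runtime::Builder::new_current_thread()
            .enable_all()
            .build()
            .expect("runtime");
        rt.block_on(async move {
            while let Some(job) = jobs.recv().await {
                // A fresh JsRuntime would be created and run here per job;
                // this placeholder just echoes the code back.
                let result = format!("executed: {}", job.code);
                let _ = job.reply.send(result);
            }
        });
    })
}

// Caller side stays async + Send: submit a job, await the oneshot.
async fn execute(tx: &mpsc::Sender<Job>, code: String) -> String {
    let (reply, rx) = oneshot::channel();
    tx.send(Job { code, reply }).await.expect("worker thread alive");
    rx.await.expect("worker replied")
}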

Dual Execution Modes

The sandbox supports two execution modes:

  • InProcess -- V8 runs on a dedicated thread within the same process. Used in tests for speed and simplicity.
  • ChildProcess -- Spawns a separate forgemax-worker binary as an isolated OS process. Used in production for maximum isolation.

The ChildProcess mode provides OS-level isolation on top of V8's own sandboxing. The worker process is spawned with .env_clear() (clean environment, no inherited credentials), .kill_on_drop(true) (parent kills child on timeout), and no inherited file descriptors. Credentials and MCP connections never exist inside the worker.
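
A sketch of the spawn configuration using tokio's process API (the binary path and pipe wiring are illustrative):

use std::process::Stdio;
use tokio::process::{Child, Command};

fn spawn_worker() -> std::io::Result<Child> {
    Command::new("/usr/local/bin/forgemax-worker") // absolute path, no PATH lookup
        .env_clear()                // no inherited credentials or env vars
        .kill_on_drop(true)         // parent drop / timeout kills the child
        .stdin(Stdio::piped())      // IPC: length-delimited JSON over stdio
        .stdout(Stdio::piped())
        .stderr(Stdio::null())
        .spawn()
}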

Resource Limits

Every execution is constrained by a SandboxConfig:

Limit Default Value
Execution timeout 5 seconds
Max heap size 64 MB
Max code size 64 KB
Max output size 1 MB
Max concurrent executions 8
Max tool calls per execution 50

The heap limit uses V8's near-heap-limit callback, which triggers a graceful abort when memory grows beyond the threshold. The execution timeout is enforced from the parent side: the host races the result channel against a deadline, and when the deadline fires it abandons the oneshot receiver -- in ChildProcess mode the OS process is killed as well.
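
That race can be sketched as follows, assuming a string result type for brevity:

use std::time::Duration;
use tokio::sync::oneshot;
use tokio::time::timeout;

async fn await_result(
    rx: oneshot::Receiver<String>,
    deadline: Duration,
    worker: &mut tokio::process::Child,
) -> Result<String, &'static str> {
    match timeout(deadline, rx).await {
        Ok(Ok(result)) => Ok(result),
        Ok(Err(_)) => Err("worker dropped the channel"),
        Err(_) => {
            // Deadline expired: kill the worker process outright.
            let _ = worker.kill().await;
            Err("execution timed out")
        }
    }
}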

IPC Protocol

In ChildProcess mode, communication between the parent and worker uses a length-delimited JSON protocol over stdin/stdout. Each message is a 4-byte big-endian length prefix followed by a JSON payload.

// Parent -> Worker messages
enum ParentMessage {
    Execute { code: String, context: ExecutionContext },
    ToolCallResult { id: u64, result: ToolResult },
    ResourceReadResult { id: u64, result: ResourceContent },
    StashResult { id: u64, result: StashValue },
    Reset,
}

// Worker -> Parent messages
enum ChildMessage {
    Ready,
    ToolCallRequest { id: u64, server: String, tool: String, args: Value },
    ResourceReadRequest { id: u64, server: String, uri: String },
    StashRequest { id: u64, op: StashOp },
    ExecutionComplete { result: ExecutionResult },
    Log { level: Level, message: String },
}

When LLM-generated code calls forge.callTool("narsil", "find_symbols", {pattern: "main"}), the V8 binding sends a ToolCallRequest over IPC to the parent. The parent holds the actual MCP connections. It dispatches the request to the real MCP server via IpcToolBridge, waits for the response, and sends a ToolCallResult back to the worker. The worker's JavaScript Promise resolves with the result. Credentials and MCP transport details never enter the worker process.
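
The framing itself is small. A sketch using tokio's async I/O helpers, with message types collapsed to serde_json::Value for brevity (the frame size cap is an assumption this sketch adds, in the spirit of the output limits above):

use tokio::io::{AsyncRead, AsyncReadExt, AsyncWrite, AsyncWriteExt};

// Write one frame: 4-byte big-endian length prefix, then the JSON payload.
async fn write_frame<W: AsyncWrite + Unpin>(
    w: &mut W,
    msg: &serde_json::Value,
) -> std::io::Result<()> {
    let payload = serde_json::to_vec(msg)?;
    w.write_u32(payload.len() as u32).await?; // write_u32 is big-endian
    w.write_all(&payload).await?;
    w.flush().await
}

// Read one frame, rejecting oversized lengths before allocating.
async fn read_frame<R: AsyncRead + Unpin>(
    r: &mut R,
    max_len: usize,
) -> std::io::Result<serde_json::Value> {
    let len = r.read_u32().await? as usize;
    if len > max_len {
        return Err(std::io::Error::new(
            std::io::ErrorKind::InvalidData,
            "frame too large",
        ));
    }
    let mut buf = vec![0u8; len];
    r.read_exact(&mut buf).await?;
    serde_json::from_slice(&buf).map_err(std::io::Error::from)
}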

Worker Pool

Spawning a fresh OS process for every execution adds meaningful latency. Forgemax maintains a warm pool of forgemax-worker processes for reuse. After each execution, the worker receives a Reset message, which drops the V8 runtime and creates a fresh one -- significantly faster than a full process spawn.

Pool configuration:

  • Min workers: 2 (always warm)
  • Max workers: 8 (scales to demand)
  • Max idle time: 60 seconds (then killed)
  • Max uses before recycle: 50 (prevents memory drift)
  • Health check timeout: 500ms

Pool metrics are tracked via atomic PoolMetrics counters: active workers, idle workers, total spawns, total resets, and recycle events. These are exposed for monitoring without locking overhead.
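
A sketch of counters in that shape (field names are illustrative):

use std::sync::atomic::{AtomicU64, Ordering};

#[derive(Default)]
struct PoolMetrics {
    active_workers: AtomicU64,
    idle_workers: AtomicU64,
    total_spawns: AtomicU64,
    total_resets: AtomicU64,
    recycles: AtomicU64,
}

impl PoolMetrics {
    // Relaxed ordering is enough: these are monitoring counters,
    // not synchronization points.
    fn record_spawn(&self) {
        self.total_spawns.fetch_add(1, Ordering::Relaxed);
        self.active_workers.fetch_add(1, Ordering::Relaxed);
    }

    fn record_reset(&self) {
        self.total_resets.fetch_add(1, Ordering::Relaxed);
    }
}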

Security Model: Six Layers

Accepting and executing code from an LLM is an adversarial problem. The model might be manipulated via prompt injection, or it might generate code that accidentally exfiltrates data. Forgemax's security model is defense-in-depth with six distinct layers.

Layer 1: AST Validation (Pre-Execution)

Before any code reaches V8, Forgemax parses it into an AST using oxc_parser and walks the tree looking for banned patterns. This is a static analysis pass that rejects dangerous code structurally, not by string matching.

Banned patterns include:

  • eval, Function constructor, import(), require()
  • Deno.* (deno_core runtime internals)
  • __proto__, globalThis[...] (computed property access on globals)
  • WebAssembly (arbitrary code execution via Wasm)

The validator includes multi-hop alias detection. If the model writes const e = eval; e("malicious"), the AST walker tracks the assignment and catches the indirect call. It also normalizes Unicode confusables -- Cyrillic characters that look identical to Latin ones (like Cyrillic "a" U+0430 vs Latin "a" U+0061) and fullwidth variants. Without this, an attacker could write еval using Cyrillic "e" and bypass naive string checks.
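
The normalization step can be sketched as a character mapping applied before pattern checks. This covers a tiny illustrative subset; the real confusables table is much larger:

// Map a few Latin-lookalike characters back to ASCII before pattern checks.
fn normalize_confusables(source: &str) -> String {
    source
        .chars()
        .map(|c| match c {
            '\u{0430}' => 'a', // Cyrillic а
            '\u{0435}' => 'e', // Cyrillic е
            '\u{043E}' => 'o', // Cyrillic о
            '\u{0440}' => 'p', // Cyrillic р
            '\u{FF45}' => 'e', // fullwidth e
            _ => c,
        })
        .collect()
}

fn main() {
    // "еval" with a Cyrillic е normalizes to plain "eval",
    // so the banned-pattern check catches it.
    assert_eq!(normalize_confusables("\u{0435}val"), "eval");
}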

The validator has dozens of dedicated bypass tests covering known evasion techniques.

Layer 2: V8 Bootstrap

Even after AST validation passes, the V8 runtime is bootstrapped with additional restrictions. The eval function and Function constructor are removed at runtime. Global objects are frozen. Dangerous constructors are deleted from prototypes. This catches anything the AST validator might miss -- belt and suspenders.
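
In spirit, the bootstrap evaluates a script like this in each fresh isolate before user code runs -- an illustrative fragment, not Forgemax's actual bootstrap:

// Evaluated via the JsRuntime before any user code; illustrative fragment.
const BOOTSTRAP: &str = r#"
    // Freeze shared prototypes first, while the globals still resolve,
    // so user code cannot poison objects every script inherits.
    Object.freeze(Object.prototype);
    Object.freeze(Array.prototype);
    Object.freeze(Function.prototype);
    // Then remove dynamic code evaluation outright.
    delete globalThis.eval;
    delete globalThis.Function;
    delete globalThis.WebAssembly;
"#;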

Layer 3: V8 Isolate Boundary

V8 isolates provide no filesystem access, no network access, and no environment variable access by default. The only way to interact with the outside world is through the explicitly registered forge.* bindings. Each execution gets a fresh isolate, so there is no persistent state.

Layer 4: API Boundary

The forge.callTool binding is an opaque proxy. It accepts a server name, tool name, and arguments, and returns a result. The LLM code never sees MCP connection URLs, transport details, authentication tokens, or internal routing logic. Argument validation and rate limiting happen at this boundary. Even if the model writes creative JavaScript, the only operations it can perform are the ones exposed through the binding.

Layer 5: Error Redaction

When tool calls fail, the error messages from downstream servers might contain sensitive information: internal URLs, IP addresses, file paths, bearer tokens, API keys, or full stack traces. The forge-sandbox/src/redact.rs module strips all of these before the error reaches the LLM. The model sees a sanitized error message sufficient for debugging, but not for data exfiltration.
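
A sketch of the idea with the regex crate (the patterns are illustrative stand-ins, not the actual redaction rules):

use regex::Regex;

// Illustrative redaction pass; the real module covers more patterns
// (internal URLs, stack traces, cloud keys, ...).
fn redact(message: &str) -> String {
    let rules = [
        // Bearer tokens and API-key-shaped strings.
        (Regex::new(r"(?i)bearer\s+[A-Za-z0-9._\-]+").unwrap(), "bearer [REDACTED]"),
        // IPv4 addresses.
        (Regex::new(r"\b\d{1,3}(\.\d{1,3}){3}\b").unwrap(), "[REDACTED_IP]"),
        // Absolute Unix file paths.
        (Regex::new(r"/[A-Za-z0-9._\-/]{4,}").unwrap(), "[REDACTED_PATH]"),
    ];
    let mut out = message.to_string();
    for (re, replacement) in rules {
        out = re.replace_all(&out, replacement).into_owned();
    }
    out
}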

Layer 6: Process Isolation

In production (ChildProcess mode), the worker runs as a separate OS process with:

  • Clean environment (env_clear()) -- no inherited PATH, HOME, AWS_*, etc.
  • kill_on_drop(true) -- if the parent panics or the timeout fires, the child is killed immediately
  • No inherited file descriptors
  • Absolute binary paths only -- no PATH fallback

Even if all five previous layers were bypassed, the worker process has no credentials, no network access to MCP servers, and no way to interact with the host system beyond the IPC channel.

Gateway Architecture

Forgemax is not just a sandbox -- it is a gateway that sits between the LLM and an arbitrary number of downstream MCP servers. The forge-client crate manages connections to downstream servers via stdio or HTTP+SSE transports, using the rmcp crate for MCP protocol handling.

Routing and Dispatch

When the IpcToolBridge receives a ToolCallRequest from the worker, it dispatches the call to the correct downstream server via RouterDispatcher. The router maps (server_name, tool_name) pairs to the owning McpClient instance.

Before dispatch, Forgemax performs Levenshtein distance matching (via the strsim crate) against known tool names. If the model asks for a tool called "find_symb0ls" (note the zero), the call will not silently fail -- it returns a structured TOOL_NOT_FOUND error with suggestions like "Did you mean 'find_symbols'?". This significantly reduces wasted round trips from typos and hallucinated tool names.
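
A sketch of the suggestion logic with strsim (the distance threshold is an illustrative choice):

use strsim::levenshtein;

// Given an unknown tool name, return close matches from the known set,
// nearest first. A cap of 2 edits is illustrative, not Forgemax's value.
fn suggest<'a>(requested: &str, known: &'a [String]) -> Vec<&'a str> {
    let mut scored: Vec<(usize, &str)> = known
        .iter()
        .map(|name| (levenshtein(requested, name), name.as_str()))
        .filter(|(dist, _)| *dist <= 2)
        .collect();
    scored.sort_by_key(|(dist, _)| *dist);
    scored.into_iter().map(|(_, name)| name).collect()
}

// suggest("find_symb0ls", &tools) -> ["find_symbols"]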

Resilience: Circuit Breakers and Reconnection

Each downstream MCP server connection is wrapped in a resilience stack: CircuitBreaker(Timeout(McpClient)).

  • TimeoutDispatcher -- Per-server timeouts prevent a single slow server from stalling the entire execution.
  • Circuit Breaker -- A standard Closed/Open/HalfOpen state machine. After a configurable number of failures, the circuit opens and subsequent calls fail fast instead of waiting for timeouts. After a recovery interval, the circuit enters HalfOpen state and allows a probe request through.
  • ReconnectingClient -- For stdio-based MCP servers, broken pipes are common (the server crashes, the SSH tunnel drops). The reconnecting client detects the broken connection and transparently re-spawns the server process.
  • Server Groups -- Servers can be organized into isolation groups for cross-server data flow control. A tool in the "reconnaissance" group cannot directly access results from a tool in the "exploitation" group without explicit data passing.

// Resilience stack composition (conceptual)
let client = McpClient::connect(transport).await?;
let timed = TimeoutDispatcher::new(client, server_config.timeout);
let resilient = CircuitBreaker::new(timed, CircuitBreakerConfig {
    failure_threshold: 5,
    recovery_timeout: Duration::from_secs(30),
    half_open_max_calls: 1,
});
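
The state machine behind the breaker is small. A sketch with simplified clock handling (field and method names are illustrative):

use std::time::{Duration, Instant};

#[derive(Clone, Copy)]
enum CircuitState {
    Closed { failures: u32 },
    Open { since: Instant },
    HalfOpen,
}

struct Breaker {
    state: CircuitState,
    failure_threshold: u32,
    recovery_timeout: Duration,
}

impl Breaker {
    // Decide whether a call may proceed, advancing Open -> HalfOpen
    // once the recovery interval has elapsed.
    fn allow(&mut self) -> bool {
        match self.state {
            CircuitState::Closed { .. } | CircuitState::HalfOpen => true,
            CircuitState::Open { since } => {
                if since.elapsed() >= self.recovery_timeout {
                    self.state = CircuitState::HalfOpen;
                    true // let one probe request through
                } else {
                    false // fail fast
                }
            }
        }
    }

    fn on_failure(&mut self) {
        self.state = match self.state {
            CircuitState::Closed { failures } if failures + 1 < self.failure_threshold => {
                CircuitState::Closed { failures: failures + 1 }
            }
            _ => CircuitState::Open { since: Instant::now() },
        };
    }

    fn on_success(&mut self) {
        self.state = CircuitState::Closed { failures: 0 };
    }
}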

What This Looks Like in Practice

Consider a concrete scenario. An LLM-driven code analysis task needs to discover available analysis tools, find symbols matching a pattern, then analyze each matching function for complexity metrics. In standard MCP, that is at least three round trips (discovery, search, and one analyze call per matching function), each carrying the full schema context.

With Forgemax, the model writes:

async () => {
  // Discover relevant tools (progressive manifest)
  const tools = await manifest.search("analysis");

  // Find matching symbols
  const symbols = await forge.callTool("narsil", "find_symbols", {
    pattern: "handle_request"
  });

  // Analyze each match in the same execution
  const results = [];
  for (const sym of symbols.matches.slice(0, 5)) {
    const analysis = await forge.callTool("narsil", "analyze_function", {
      file: sym.file,
      function_name: sym.name
    });
    results.push({ symbol: sym.name, complexity: analysis.complexity });
  }
  return results;
}

One MCP round trip. One execute call. The schema context is 1,100 tokens, not 110,000. The model wrote a short JavaScript function instead of orchestrating 3+ sequential tool calls. And because the code runs in a sandboxed V8 isolate with IPC-bridged tool calls, it is no less secure than standard MCP tool calling.

Trade-Offs and Limitations

This architecture is not without costs. There are trade-offs worth being explicit about:

  • Debugging complexity -- When something goes wrong inside an execute call, the error path is longer: LLM code, V8 runtime, IPC bridge, MCP server, and back. Stack traces from the worker process are redacted before reaching the model, which is correct for security but makes debugging harder.
  • Model capability requirements -- The model must be able to write correct async JavaScript. Current frontier models (Claude, GPT-4-class, Gemini) handle this well. Smaller models may struggle with the async patterns and produce code that hangs or deadlocks.
  • Latency per tool call -- Individual tool calls within an execute block have IPC overhead (~1-2ms per hop for the length-delimited JSON serialization). This is negligible for typical usage but could matter for executions that make 50 sequential tool calls.
  • Static analysis gaps -- The AST validator catches known dangerous patterns, but JavaScript is a dynamic language. The security model does not rely on the validator alone -- it is one of six layers precisely because no single layer is complete.

Results

Forgemax mediates between LLM agents and downstream MCP tools. The key results:

  • Context reduction: 99% savings at 267 tools (110,000 tokens reduced to ~1,100).
  • Round-trip reduction: Complex multi-tool workflows that previously required 5-10 LLM round trips now complete in 1-2.
  • Tool selection accuracy: Levenshtein-based suggestions significantly reduce "tool not found" errors from hallucinated names.
  • Worker pool reuse: V8 runtime resets avoid the overhead of fresh process spawns on every execution.
  • Security: Zero sandbox escapes across hundreds of tests, including dozens of targeted bypass attempts against the AST validator.

The codebase ships with hundreds of tests across the workspace and is licensed under FSL-1.1-ALv2 (converting to Apache 2.0 after two years).


Forgemax is open source at github.com/postrv/forgemax. If you are building MCP tool servers with more than a handful of tools, the schema bloat problem is likely already costing you context, latency, and accuracy. The two-tool collapse is a structural solution, not a workaround.

If you are interested in what 267 MCP tools look like in a real security assessment, join the Arbiter waitlist for early access.