Generating typed API clients for Stripe’s 588 operations costs $0.07 on Claude Opus. That’s impressive. Then I watched Postmark’s generated client set BASE_URL = "https://api.example.com" — a value the LLM fabricated instead of copying the literal string from the spec. The gap between what LLMs can do and what they actually get wrong is the design problem worth solving.
The problem
AI agents are only as useful as their integrations. Composio catalogs 901 toolkits. Filter to the high-value ones — payment processors, CRMs, communication platforms — and you get about 90 services. At 15 operations each, that’s 1,350 typed functions your agent needs to call correctly.
Four options for generating that code:
| Approach | Cost | Correctness | Understanding |
|---|---|---|---|
| Hand-code | High | High | High |
| Platform (Composio, Zapier) | Medium | High | Low — you’re locked in |
| Raw LLM generation | $0.07/service | Unreliable — fabricates URLs, mistypes params | High — writes docs, selects operations |
| Spec-based codegen (openapi-generator, Jinja) | Free | Perfect — reads the spec literally | None — no docs, no operation selection, no context |
Spec-based codegen copies the base URL exactly as the spec states it, maps every parameter type literally, and reproduces every endpoint path without invention. If the spec is wrong, the output is wrong too — but that’s a problem LLM enrichment can catch in a pre-generation step. What codegen can’t do: write a useful docstring, decide which operations matter, or infer anything the spec doesn’t state.
Spec-based codegen gets the plumbing right. LLMs get the understanding right. The hybrid combines both.
The pipeline
The first version followed the ACE paper’s approach: enrich the full OpenAPI spec with LLM-generated summaries, then pick which operations to implement. That’s backwards. Enrichment costs tokens. Selection doesn’t.
The whole effort was one engineer, AI-assisted, from paper review to working pipeline: I researched six papers, designed the scoring function, wrote the code, and tested it against three production APIs.
So I reversed the order. A deterministic scoring function picks the 15 most useful operations before any LLM call runs. The scoring: +3 for matching a known catalog of top actions for that service category, +2 for capability-matched HTTP methods (GET for read, POST for create), +1 for CRUD verb prefixes in the operation ID, -1 per path segment beyond two, -2 for admin, batch, or webhook paths. No model involved. Fully reproducible: run it twice, get the same result.
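As a sketch, the scorer is a few dozen lines. The weights follow the list above; the catalog entries here are placeholder examples, not the real per-category catalog.

```python
# Deterministic operation scoring: same spec in, same ranking out. Catalog
# contents below are illustrative stand-ins for the real per-category lists.
TOP_ACTIONS = {"send_email", "create_charge", "list_customers"}  # example entries
CRUD_PREFIXES = ("get", "list", "create", "update", "delete")

def score_operation(operation_id: str, method: str, path: str, capability: str) -> int:
    """Higher score = more likely to be selected. No model involved."""
    score = 0
    if operation_id in TOP_ACTIONS:
        score += 3                      # known top action for this service category
    if (capability, method) in {("read", "GET"), ("create", "POST")}:
        score += 2                      # HTTP method matches the declared capability
    if operation_id.lower().startswith(CRUD_PREFIXES):
        score += 1                      # CRUD verb prefix in the operation ID
    depth = len([seg for seg in path.strip("/").split("/") if seg])
    score -= max(0, depth - 2)          # -1 per path segment beyond two
    if any(seg in path for seg in ("admin", "batch", "webhook")):
        score -= 2                      # deprioritize admin/batch/webhook paths
    return score
```

Sort all operations by this score, take the top 15, and only those reach the enrichment stage.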
For Stripe, that cuts enrichment cost by roughly 97%. The spec has 588 operations; enriching all of them burns tokens describing endpoints you’ll never use. Enriching 15 is cheap.
The pipeline has six stages: fetch the OpenAPI spec, select operations deterministically, enrich only the selected operations with an LLM, generate the adapter code, generate tests, then self-correct if tests fail. Each stage feeds the next. If one fails, you know exactly where.
Generated adapters don’t contain HTTP plumbing. They import shared _http and _auth modules that handle requests, retries, and authentication. The LLM generates business logic — parameter mapping, response parsing, the parts that actually differ between APIs. This keeps the generated code focused and prevents the model from reinventing request handling in each function.
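A minimal sketch of that split, with hypothetical names. The shared request helper is stubbed here so the example is self-contained; in the real pipeline it lives in the _http module and handles retries and auth.

```python
from typing import Optional

def _request(method: str, url: str, json: Optional[dict] = None) -> dict:
    """Stand-in for the shared _http module (the real one adds retries and auth)."""
    return {"method": method, "url": url, "json": json}

BASE_URL = "https://api.postmarkapp.com"  # copied verbatim from the spec's servers field

def send_email(from_addr: str, to: str, subject: str,
               html_body: Optional[str] = None,
               reply_to: Optional[str] = None) -> dict:
    """Generated business logic: map parameters, include optional fields only when set."""
    body: dict = {"From": from_addr, "To": to, "Subject": subject}
    if html_body is not None:
        body["HtmlBody"] = html_body
    if reply_to is not None:
        body["ReplyTo"] = reply_to
    return _request("POST", f"{BASE_URL}/email", json=body)
```

The adapter file stays small: parameter mapping and response shape, nothing else.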
The self-correction loop has one rule: tests are immutable. If a generated adapter fails its tests, the LLM fixes the adapter. It never touches the test file. Without this constraint, models solve failures by weakening assertions. That’s not fixing — that’s hiding.
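The loop itself fits in a few lines. In this sketch, regenerate and run_tests stand in for the LLM call and the test runner; the invariant is that only the adapter argument is ever rewritten.

```python
def self_correct(adapter: str, tests: str, regenerate, run_tests, max_cycles: int = 3) -> str:
    """Fix-and-retry loop. `tests` is immutable: the LLM only ever rewrites the adapter."""
    for _ in range(max_cycles):
        if run_tests(adapter, tests):
            return adapter
        adapter = regenerate(adapter, tests)  # LLM call: fix the adapter against the tests
    raise RuntimeError("adapter still failing after max_cycles")
```

Passing the tests in read-only is what keeps the model from weakening assertions to make failures disappear.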
Total cost per service: $0.07 for Stripe’s 15 selected operations on Claude Opus (February 2026 pricing).
What I tested
Three real APIs, chosen to stress different failure modes.
| | Resend | Postmark | Stripe |
|---|---|---|---|
| Spec size | 156 KB, 70 ops | 40 KB, 22 ops | 7.2 MB, 588 ops |
| Selected | 15 (21%) | 15 (68%) | 15 (2.6%) |
| Generated functions | 15 | 15 | 15 |
| Auth | Bearer token | Custom header (X-Postmark-Server-Token) | Bearer token |
| Parses cleanly | Yes | Yes | Yes |
| Fully typed | Yes | Yes | No — falls back to **kwargs: Any |
| Fix-and-retry cycles | 0 | 0 | 0 |
All three generated cleanly on the first pass. No self-correction needed. That should have been a warning sign.
Where LLM generation succeeds
Given an OpenAPI spec, the LLM reads the schema, infers intent, and produces function signatures that look like a human wrote them.
Auth patterns were correct across all three services. Return types matched the API responses. Docstrings were clear and accurate. Resend’s send_email function correctly assembled a conditional request body from eight optional parameters — cc, bcc, reply_to, headers, and four others — each included only when provided. That’s not trivial template work.
The LLM also filled in what the specs left out. Parameter descriptions, usage examples, type annotations that the raw OpenAPI document didn’t include.
Where LLM generation fails
The problems are in plumbing. Copying a URL from one field to another. Mapping a schema type to a language type. Interpolating a path parameter. Mechanical operations where the correct answer is sitting in the spec, waiting to be transcribed.
Postmark’s BASE_URL came back as api.example.com. The correct value — api.postmarkapp.com — was right there in the spec’s servers field. A Jinja template would copy it verbatim. The LLM fabricated a placeholder instead.
Type mappings broke in predictable ways. Postmark’s request body was defined as type: object in the schema. The generated adapter declared it body: Optional[str]. Object became string. The schema was unambiguous. Stripe’s adapter included **kwargs: Any on its core methods. Pragmatic if you’re writing a quick script. Useless if the whole point of generating typed adapters is to catch errors at development time. The escape hatch defeats the purpose.
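The correct mapping is a lookup table. A simplified sketch of that table (it ignores formats, $ref, and unions, which the full mapping would handle):

```python
# Deterministic OpenAPI-schema-type -> Python-annotation lookup: the operation
# the LLM got wrong when it turned `type: object` into `Optional[str]`.
SCHEMA_TO_PY = {
    "string": "str",
    "integer": "int",
    "number": "float",
    "boolean": "bool",
    "array": "list",
    "object": "dict",
}

def python_type(schema: dict, required: bool = True) -> str:
    base = SCHEMA_TO_PY.get(schema.get("type", ""), "Any")
    return base if required else f"Optional[{base}]"
```

There is no judgment call anywhere in that function, which is exactly why it should not be delegated to a model.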
Auth handling showed a subtler failure. Postmark uses a custom X-Postmark-Server-Token header. The generated adapter passed it as a function parameter on every call instead of encapsulating it in an auth configuration object. It works. It’s also the kind of repetitive, error-prone pattern that adapter generation is supposed to eliminate.
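The encapsulation it should have produced is only a few lines; the names here are illustrative, not the pipeline's actual types.

```python
from dataclasses import dataclass

@dataclass
class AuthConfig:
    """Auth lives in one object instead of a parameter threaded through every call."""
    header_name: str
    token: str

    def headers(self) -> dict:
        return {self.header_name: self.token}

auth = AuthConfig("X-Postmark-Server-Token", "example-token")
```

Every adapter function then pulls headers from the config rather than accepting the token as its own parameter.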
Every one of these is a lookup operation. Read a field, write it somewhere else. The LLM adds nothing here. A template copies servers[0].url into BASE_URL with zero risk. The LLM reads the same field and sometimes invents a different value.
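That lookup, in full. The spec fragment is trimmed to the relevant field:

```python
def base_url_from_spec(spec: dict) -> str:
    """Zero-risk transcription: read servers[0].url, invent nothing."""
    return spec["servers"][0]["url"]

spec = {"servers": [{"url": "https://api.postmarkapp.com"}]}
generated_line = f'BASE_URL = "{base_url_from_spec(spec)}"'
```

One field read, one line emitted, no opportunity to fabricate a placeholder.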
The self-correction loop ran zero iterations. Not because the generated code was correct. Because the LLM-generated tests can’t catch semantic errors. They mock HTTP calls. They never hit real URLs. A fabricated BASE_URL passes every test. Wrong types pass every test. The loop that was supposed to catch mistakes is LLMs checking LLMs — and the errors that matter are invisible to both.
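A minimal reproduction of why. The mock intercepts the transport, so the fabricated URL is only ever compared against itself and never hits the network; names are illustrative.

```python
from unittest.mock import MagicMock

BASE_URL = "https://api.example.com"  # fabricated value; no mocked test will notice

def get_server(http_get):
    """Adapter function taking the HTTP client as a parameter so tests can mock it."""
    return http_get(f"{BASE_URL}/server").json()

# The generated test mocks the transport, so the wrong URL is never exercised:
mock_get = MagicMock()
mock_get.return_value.json.return_value = {"ID": 1}
result = get_server(mock_get)  # "passes" despite the fabricated BASE_URL
```

The assertion checks that the function called the mock with whatever BASE_URL happens to be, which is circular: a wrong constant validates itself.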
The hybrid architecture
The evidence points to a specific split. Use LLMs to add what’s missing from the spec. Use templates and code generation to extract what’s already there.
Phase 1 — Enrich with LLM. Feed the OpenAPI spec to a language model and ask it to add what’s missing: parameter descriptions, usage examples, error response schemas, edge case documentation. The model reads SDK documentation, community forums, existing MCP servers — and produces a richer spec than the one you started with. In IBM’s internal evaluation of their Watsonx Orchestrate platform, ACE-enriched metadata improved tool selection and invocation accuracy by 27 percentage points over minimal metadata — and matched human-authored descriptions. The value is in understanding, not generation.
Phase 2 — Generate with templates. Take the enriched spec and run it through deterministic code generation. BASE_URL extraction, function signatures, URL construction, request serialization, authentication injection — all produced by templates that read the spec literally. OpenAPI Generator and similar tools have done this for years. The template doesn’t improvise. It doesn’t invent query parameters. It doesn’t decide a path segment looks like it should be plural. Same input, same output, every time.
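In miniature, with Python's stdlib string.Template standing in for Jinja, and the operation dict as a simplified spec shape rather than raw OpenAPI:

```python
from string import Template

# One function per operation; every value below is copied literally from the spec.
FUNC_TEMPLATE = Template(
    "def ${op_id}(**params):\n"
    '    return _http.request("${method}", "${base_url}${path}", params=params)\n'
)

def render(op: dict, base_url: str) -> str:
    return FUNC_TEMPLATE.substitute(
        op_id=op["operationId"], method=op["method"],
        base_url=base_url, path=op["path"],
    )

code = render(
    {"operationId": "list_servers", "method": "GET", "path": "/servers"},
    "https://api.postmarkapp.com",
)
```

The template has no opinion about the values it interpolates, which is the whole point.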
Phase 3 — Wrap with LLM. Now layer understanding back on top. Docstrings that explain when you’d use each endpoint. Type aliases that match the domain language. Convenience functions that bundle common multi-step workflows. The LLM adds the developer experience. The plumbing underneath is already correct.
I haven’t built this hybrid pipeline yet. The pure-LLM approach is what found the boundary. Enrichment works. Structural comprehension works. Mechanical transcription is where the failures cluster. That’s the evidence, and this is the architecture it points to.
With templates generating the mechanical code, correctness is a property of the toolchain, not the output. The only remaining quality gate is live API testing — does the API actually behave as documented? That’s a simpler question than “did the LLM correctly transcribe twelve URL patterns and four authentication schemes?”
Template-generated code is deterministic and auditable. Same spec in, same adapter out. You can explain that toolchain to a compliance reviewer in healthcare or finance. You can version it, diff it, trace every line back to a spec field. LLM-generated code offers none of that — you’d need to re-verify every output, every time.
This is an architecture conviction, not an AI decision. Deterministic tools for deterministic tasks. Probabilistic tools for probabilistic tasks. Match the tool to what the problem actually requires.
What the research misses
Every paper in this space — ACE, ToolMaker, ToolFactory — uses an LLM for the full code generation step. ACE separates enrichment from generation, which is the right instinct, but still hands the code output to a language model. The hybrid split — enrich with LLM, generate with templates, wrap with LLM — doesn’t appear in any of them.
The comparison experiment hasn’t been run either. ToolFactory benchmarked 167 APIs with 744 endpoints, the largest dataset in the literature. A hybrid pipeline tested against 90 APIs with three-level evaluation (static analysis, mock execution, live API calls) would already be broader than most published work.
Licensing is unaddressed. If an LLM trains on open source code and produces an adapter that resembles it, who owns that output? What license applies? For any company shipping these adapters, that question matters.
Each paper also picks a single input source. Real APIs have documentation scattered across the OpenAPI spec, the official SDK, and community MCP servers. The gaps in one are often filled by another. Combining them into a single enrichment step is an obvious move that nobody’s tried.
ToolMaker generates from arbitrary GitHub repositories. My pipeline operates on structured OpenAPI specs — a different constraint that changes what’s possible with deterministic generation. That distinction doesn’t appear in the literature.
When to use what
| Scenario | Approach | Approximate cost per service |
|---|---|---|
| Fewer than 5 integrations, stable APIs | Hand-code them | ~4 hours engineer time ($200-800) |
| 50+ integrations, customization not a priority | Platform (Composio, Zapier, etc.) | $0.50-5.00/month per connector |
| 50+ integrations, need type safety + tests + ownership | Hybrid pipeline (deterministic gen + LLM) | ~$0.07 one-time generation cost |
If you have three Stripe webhooks and a Slack bot, a code generation pipeline is overkill. Write the clients by hand. You’ll finish before the pipeline finishes bootstrapping.
If you need 200 integrations and your team doesn’t care about owning the code, a managed platform gets you there faster. You pay monthly instead of upfront, and you accept their abstractions.
The hybrid approach earns its complexity when you need scale and control. Fifty-plus services, each with typed clients, generated tests, and code you can read and modify without vendor lock-in. At $0.07 per service, the generation cost rounds to zero. The value is in what you don’t spend afterward — no platform fees compounding monthly, no engineer-hours hand-coding boilerplate that a deterministic template handles in seconds.
Pick based on your constraints, not on what sounds most sophisticated.
Where this leads
Three APIs was enough to see the pattern. The LLM fabricated a URL it could have copied. It mistyped a schema it could have mapped. It escaped into **kwargs when the types got complex. Every failure was a mechanical task the LLM had no business doing. The deterministic path is quieter, but it’s correct.