MCP in Production: What the Spec Doesn't Tell You

February 27, 2026 · 5 min read

MCP · AI Agents · Architecture · Production

Building your first MCP server takes a weekend. The spec is clean, the SDKs are good, and the demo where Claude calls your tools feels like magic. I know because I shipped that demo. Then I spent the months that followed learning everything the spec doesn't cover.

Product details anonymized. Real engineering patterns.

The context: AdPilot, a platform where an AI agent manages advertising campaigns across Meta, LinkedIn and Google Ads. The agent does real things with real money. It pauses campaigns, adjusts budgets, pulls performance data. Everything goes through MCP. Here is what I wish I had known on day one.

Problem one: your tools are eating the context window

We started with one MCP server per platform. The LinkedIn server alone exposed 27 tools. Google Ads needed 36 to cover campaigns, ad groups, keywords, budgets, bidding strategies and reporting. Add Meta and you are pushing 80+ tool definitions into the model's context before the user has typed a single word.

Two things happen. First, you pay for those tokens on every request. Second, and this is worse, the agent gets dumber. With 80 tools that all sound vaguely similar (update_campaign, update_campaign_budget, update_ad_group_budget...), tool selection accuracy drops. The model picks the LinkedIn tool when the user asked about Google. It chains three calls where one would do.

The fix was not fewer features. It was consolidation: we migrated from standalone per-platform servers to a single multi-source server with shared middleware and routing by source. Fewer, smarter tools with a source parameter beat many narrow ones. Tool descriptions became the highest-leverage prompt engineering surface in the entire product. We rewrote them more often than the system prompt.
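
To make the consolidation concrete, here is a minimal sketch of one tool that routes by a source parameter instead of three near-identical per-platform tools. The handler functions, the source names and the return strings are all hypothetical stand-ins, not the actual AdPilot code.

```python
from typing import Callable, Dict

# Hypothetical per-platform handlers; real ones would call each ads API.
def _pause_meta(campaign_id: str) -> str:
    return f"meta campaign {campaign_id} paused"

def _pause_linkedin(campaign_id: str) -> str:
    return f"linkedin campaign {campaign_id} paused"

def _pause_google_ads(campaign_id: str) -> str:
    return f"google_ads campaign {campaign_id} paused"

# Routing table: the `source` value selects the backend.
PAUSE_HANDLERS: Dict[str, Callable[[str], str]] = {
    "meta": _pause_meta,
    "linkedin": _pause_linkedin,
    "google_ads": _pause_google_ads,
}

def pause_campaign(source: str, campaign_id: str) -> str:
    """One tool for all platforms instead of one tool per platform."""
    handler = PAUSE_HANDLERS.get(source)
    if handler is None:
        valid = ", ".join(sorted(PAUSE_HANDLERS))
        # Name the valid values so the model can correct itself.
        raise ValueError(f"Unknown source '{source}'. Use one of: {valid}.")
    return handler(campaign_id)
```

The model now chooses among a handful of verbs and fills in one enum-like argument, which is a much easier decision than picking the right name out of 80.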

Problem two: auth is entirely your problem

The MCP spec gives you a transport, a protocol, and (since the 2025 revisions) an OAuth framework for authorizing clients against your server. What it does not cover is everything downstream of your server: how to handle a workspace whose Meta token expires every 60 days, or what to do when a user revokes access from inside the Meta UI without telling you.

In a multi-tenant product, every tool call has to resolve credentials at runtime: which organization, which workspace, which platform account, which token. We ended up with API keys scoped per workspace, a token refresh helper that fires proactively before expiry, and revoked-token detection that propagates a status the UI can surface. None of this is glamorous. All of it is mandatory, because an agent that silently fails on an expired token does not look like an auth problem to the user. It looks like a stupid agent.
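
A rough sketch of what runtime credential resolution with proactive refresh and revoked-token detection can look like. The in-memory store, the seven-day refresh margin and the `refresh` callback are assumptions for illustration; a real system would use a database and each platform's own token exchange.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Callable, Dict, Optional, Tuple

REFRESH_MARGIN = timedelta(days=7)  # assumed margin; tune per platform

class TokenRevoked(Exception):
    """Raised so the UI can surface a 'reconnect this account' status."""

@dataclass
class StoredToken:
    value: str
    expires_at: datetime
    revoked: bool = False

# Keyed by (workspace_id, platform); stand-in for a real credential store.
TokenStore = Dict[Tuple[str, str], StoredToken]

def resolve_token(
    store: TokenStore,
    workspace_id: str,
    platform: str,
    refresh: Callable[[StoredToken], StoredToken],
    now: Optional[datetime] = None,
) -> str:
    """Resolve the token for one tool call, refreshing before it expires."""
    now = now or datetime.now(timezone.utc)
    key = (workspace_id, platform)
    token = store.get(key)
    if token is None:
        raise KeyError(f"No {platform} account connected for workspace {workspace_id}.")
    if token.revoked:
        raise TokenRevoked(
            f"Access to {platform} was revoked. Ask the user to reconnect the account."
        )
    if token.expires_at - now <= REFRESH_MARGIN:
        token = refresh(token)  # proactive: runs before the token actually expires
        store[key] = token
    return token.value
```

The point of the dedicated exception is that a revoked token becomes a visible state rather than a generic tool failure.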

For agency customers managing client accounts long-term, the standard OAuth flow was not even enough. We added a second onboarding path with non-expiring system tokens. Two auth flows for one platform, because the protocol's job ends where your users' reality begins.

Problem three: error messages are prompts

This one took me embarrassingly long to internalize. When a tool call fails, the error message is not for a developer reading logs. It is for the model, which will read it and decide what to do next.

An error like "Error 400: invalid parameter" sends the agent into a retry loop with the same broken arguments. "The date range exceeds 90 days. Split the request into smaller ranges." gets you a correct second attempt. We started writing error messages the way we write prompts: state what went wrong, state what to do instead. Tool error handling became part of the agent's reasoning loop, not an afterthought.
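
As an illustration, here is a small validator that returns an error written for the model. The function name is hypothetical, and the result dictionary only loosely mirrors the shape of an MCP tool result (an error flag plus text content); treat it as a sketch, not the SDK's API.

```python
from datetime import date, timedelta
from typing import Optional

MAX_RANGE_DAYS = 90  # limit taken from the example error message above

def check_date_range(start: date, end: date) -> Optional[dict]:
    """Return None if the range is acceptable, else an error the model can act on."""
    if end < start:
        return _error(
            f"The end date {end} is before the start date {start}. "
            "Swap the two dates and call the tool again."
        )
    span = (end - start).days
    if span > MAX_RANGE_DAYS:
        first_end = start + timedelta(days=MAX_RANGE_DAYS)
        return _error(
            f"The date range spans {span} days, which exceeds the "
            f"{MAX_RANGE_DAYS}-day limit. Split the request into smaller ranges, "
            f"for example {start} to {first_end} first."
        )
    return None

def _error(text: str) -> dict:
    # State what went wrong, then what to do instead.
    return {"isError": True, "content": [{"type": "text", "text": text}]}
```

Every branch ends with an instruction, so the next call the model makes has a fair chance of being the right one.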

Same logic for idempotency. Agents retry. Sometimes they retry things that already succeeded. Any tool with side effects, like creating a campaign or adjusting a budget, needs to tolerate being called twice with the same intent, or you will explain to a customer why their budget doubled.
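
One common way to get there is an idempotency key that the caller sends with each intent. The sketch below is a toy in-memory version; the class, the method names and the key format are assumptions, and a production version would persist the keys and expire them.

```python
from typing import Dict

class BudgetService:
    """Toy in-memory service showing idempotent budget increases."""

    def __init__(self) -> None:
        self.budgets: Dict[str, float] = {}
        self._results: Dict[str, float] = {}  # idempotency_key -> earlier result

    def increase_budget(self, campaign_id: str, amount: float, idempotency_key: str) -> float:
        # A retry with the same key returns the first result and changes nothing.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        new_budget = self.budgets.get(campaign_id, 0.0) + amount
        self.budgets[campaign_id] = new_budget
        self._results[idempotency_key] = new_budget
        return new_budget
```

A simpler alternative, where the API allows it, is to expose absolute operations ("set budget to 150") rather than relative ones ("add 50"), since setting the same value twice is harmless.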

Problem four: sometimes the best tool is SQL

The biggest unlock came from deleting tools, not adding them.

Early on, answering "compare CPA over the last three weeks versus the previous month, per campaign" meant the agent chained paginated API calls: list campaigns, fetch insights per campaign, fetch the comparison period, aggregate in its head. Slow, expensive, and wrong often enough to be embarrassing.

We had already built a data pipeline syncing ad platform data into ClickHouse, with native row-level security per tenant. So we replaced the chain of reporting tools with a single one: execute raw SQL against the reporting tables, RLS applied automatically based on the user's scope.

One tool, near-unlimited analytical flexibility. The model is good at SQL, much better than at orchestrating five paginated endpoints. The security boundary moved where it belongs: into the database. Every query must carry the tenant context, and the row policies fail with an explicit error if it is missing. Not fail-silent. Fail-loud. Even a bug in the application layer cannot leak data across workspaces.
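
On the tool side, the same fail-loud idea can be sketched as a thin wrapper that refuses to run without tenant context and always forwards it with the query. This is not the database policy itself, which is where the real enforcement lives; the `execute` callable and the `tenant_id` setting name are hypothetical placeholders for whatever database client and policy setting are in use.

```python
from typing import Any, Callable, Dict, List, Optional

class MissingTenantContext(Exception):
    """Raised instead of running a query without a tenant scope."""

# Hypothetical executor signature: (sql, per-query settings) -> rows.
Executor = Callable[[str, Dict[str, str]], List[Any]]

def run_reporting_query(sql: str, tenant_id: Optional[str], execute: Executor) -> List[Any]:
    """Run model-written SQL, always carrying the tenant context with it."""
    if not tenant_id:
        # Fail loud: never fall back to an unscoped query.
        raise MissingTenantContext(
            "No tenant context on this request. The query was not executed."
        )
    # The row policies in the database are assumed to filter on this setting.
    return execute(sql, {"tenant_id": tenant_id})
```

The wrapper is a convenience, not the security boundary: if it were removed or bypassed, the row policies would still reject the query.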

The lesson generalizes: a tool is an interface for a reasoning engine, not a wrapper around your REST API. Design for what the model is good at.

What I would tell myself at the start

The spec is the easy 20%. The real work is auth lifecycle, tenant isolation, error messages written for a model, and ruthless tool curation. Budget for that, not for the protocol.

And measure tool selection accuracy the way you measure latency. It is the metric that decides whether your agent feels smart or stupid, and nobody warns you about it.
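
A minimal way to start measuring it, assuming you keep a small labeled set of prompts with the tool you expect the agent to pick first. The `EvalCase` structure and the field names are illustrative, not a standard.

```python
from dataclasses import dataclass
from typing import Iterable

@dataclass(frozen=True)
class EvalCase:
    prompt: str
    expected_tool: str  # label written by a human
    chosen_tool: str    # first tool the agent actually called

def tool_selection_accuracy(cases: Iterable[EvalCase]) -> float:
    """Fraction of cases where the agent picked the expected tool."""
    cases = list(cases)
    if not cases:
        raise ValueError("Need at least one eval case.")
    correct = sum(1 for c in cases if c.chosen_tool == c.expected_tool)
    return correct / len(cases)
```

Run it on every change to a tool description, the same way you would run a latency benchmark on every change to a hot path.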
