
I Validated 15 Popular MCP Servers. Most Have the Same Blind Spot.

March 25, 2026 · 9 min read

MCP tool definitions are how LLMs decide which tools to call, what arguments to pass, and whether a tool is safe to run without asking. Bad definitions mean tools get ignored, called with wrong arguments, or run destructive operations without confirmation.

I audited 15 popular MCP servers — 75k+ combined GitHub stars — checking how their tool definitions hold up against the WildRun MCP Validator. The top-line scores are surprisingly good. But the same gap keeps appearing.

The Scorecard

| Server | Stars | Score | Grade | Annotations? |
| --- | --- | --- | --- | --- |
| mcp-server-kubernetes | 1.4k | 97 | A | Yes |
| exa-mcp-server | 4.1k | 97 | A | Yes (3 hints) |
| apple-docs-mcp | 1.2k | 97 | A | Yes |
| n8n-mcp-server | 1.6k | 94 | A | No |
| mcp-server-browserbase | 3.2k | 94 | A | No |
| DesktopCommanderMCP | 5.8k | 94 | A | No |

Expanded Audit (March 25 Update)

After the initial benchmark, I audited 9 more servers. The pattern got clearer with scale: well-maintained servers by large orgs tend to have annotations. Everything else doesn't.

| Server | Stars | Annotations? |
| --- | --- | --- |
| mcp-chrome | 10.9k | No — 37 tools, incl. chrome_inject_script, chrome_javascript |
| BrowserMCP | 6.2k | No — browser_click, browser_type unannotated |
| mcp-atlassian | 4.7k | Yes — destructiveHint on Jira/Confluence writes |
| magic-mcp | 4.6k | No |
| notion-mcp-server | 4.1k | Yes — auto-derived from HTTP methods |
| mcp-server-cloudflare | 3.6k | Yes — per-service annotations (D1, Workers, KV) |
| mcp-obsidian | 3.1k | No — DeleteFile has no destructiveHint |
| dbhub | 2.4k | Yes — read/write distinguished |
| google_workspace_mcp | 1.9k | No — send_gmail_message has no destructiveHint |

Scores based on tool name format, description quality, inputSchema validation, parameter descriptions, annotations, and outputSchema. Validated using wildrunai.com/tools/mcp-validator.

The Annotation Gap

Every server nails the basics: descriptive names, good descriptions, proper inputSchema with typed properties. But only 7 out of 15 use MCP annotations. The eight that don't include servers that execute arbitrary JavaScript (mcp-chrome), send real emails (google_workspace_mcp), delete notes (mcp-obsidian), run shell commands (DesktopCommanderMCP), and delete workflows (n8n-mcp-server). For all of them, the LLM has no hint about the risk level.

Here's what good looks like. From mcp-server-kubernetes:

{
  "name": "kubectl_delete",
  "description": "Delete Kubernetes resources...",
  "annotations": {
    "destructiveHint": true   // ← LLM knows to confirm
  },
  "inputSchema": { ... }
}

And from exa-mcp-server:

{
  "name": "web_search_exa",
  "annotations": {
    "readOnlyHint": true,     // ← Safe to run anytime
    "destructiveHint": false, // ← Explicitly non-destructive
    "idempotentHint": true    // ← Same input = same output
  }
}

Now compare with n8n-mcp-server:

{
  "name": "delete_workflow",
  "description": "Delete a workflow in n8n",
  // ← No annotations at all
  // ← LLM doesn't know this is destructive
  "inputSchema": { ... }
}

Our validator catches this: “delete_workflow sounds destructive — add annotations: { destructiveHint: true } so LLMs handle with care.”
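The check itself is simple enough to sketch. Here's a minimal version in Python: scan a tools/list response for destructive-sounding names that ship without a destructiveHint. The verb list and warning wording below are illustrative assumptions, not the validator's actual rules.

```python
# Flag tools whose names sound destructive but carry no destructiveHint.
# The verb list is an illustrative heuristic, not the validator's own.
DESTRUCTIVE_VERBS = ("delete", "remove", "drop", "kill", "send", "execute", "inject")

def missing_destructive_hints(tools: list) -> list:
    """Return one warning per destructive-sounding tool lacking the hint."""
    warnings = []
    for tool in tools:
        name = tool.get("name", "")
        hinted = tool.get("annotations", {}).get("destructiveHint") is True
        if any(verb in name.lower() for verb in DESTRUCTIVE_VERBS) and not hinted:
            warnings.append(
                f"{name} sounds destructive -- add annotations: "
                "{ destructiveHint: true } so LLMs handle with care"
            )
    return warnings

tools = [
    {"name": "delete_workflow", "description": "Delete a workflow in n8n"},
    {"name": "kubectl_delete", "annotations": {"destructiveHint": True}},
]
print(missing_destructive_hints(tools))  # one warning, for delete_workflow only
```

A name-based heuristic like this produces false positives (a `send_heartbeat` tool isn't dangerous), which is exactly why explicit annotations beat guessing from names — for the validator and for the LLM alike.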

Nobody Uses Output Schemas

None of the six servers in the initial benchmark define outputSchema. This is optional, but it matters for tool chaining. When an LLM calls tool A and needs to pass the result to tool B, it needs to know what A returns. Without an output schema, the LLM has to guess.

A simple addition makes a real difference:

{
  "name": "kubectl_get",
  "outputSchema": {
    "type": "object",
    "description": "Kubernetes resource(s) in JSON format",
    "properties": {
      "items": { "type": "array" },
      "kind": { "type": "string" }
    }
  }
}
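What the client gains is a contract it can check. As a hedged sketch, here's a toy conformance check against the kubectl_get schema above before piping the result into the next tool — it's not a full JSON Schema validator (use the jsonschema package for that), just the top-level type checks:

```python
# Toy checker: does a tool result match its declared outputSchema?
# Only validates top-level types; real clients should use a proper
# JSON Schema validator such as the jsonschema package.
OUTPUT_SCHEMA = {
    "type": "object",
    "properties": {"items": {"type": "array"}, "kind": {"type": "string"}},
}

PY_TYPES = {"object": dict, "array": list, "string": str}

def conforms(result, schema) -> bool:
    if not isinstance(result, PY_TYPES[schema["type"]]):
        return False
    return all(
        isinstance(result[key], PY_TYPES[sub["type"]])
        for key, sub in schema.get("properties", {}).items()
        if key in result
    )

result = {"kind": "PodList", "items": [{"metadata": {"name": "web-1"}}]}
print(conforms(result, OUTPUT_SCHEMA))  # True
```

Without the schema, the only way to write this check is to hard-code assumptions about what kubectl_get happens to return today.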

What the Best Servers Do

Three patterns separate the 97s from the 94s:

  1. Every parameter has a description. Not just a type. Not just a name. "description": "Type of resource to get (e.g., pods, deployments, services)" tells the LLM what values are valid.
  2. Annotations on every tool. The K8s server marks reads as readOnlyHint: true and deletes as destructiveHint: true. Exa goes further with idempotentHint on search tools.
  3. Descriptions explain when to use the tool, not just what it does. Exa's web_search_exa description includes “Best for: Finding current information” and “Query tips: describe the ideal page, not keywords.” This prevents misuse.
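Put together, the three patterns look something like this — a composite sketch modeled on the kubectl_get tool, not any server's literal definition:

```json
{
  "name": "kubectl_get",
  "description": "Get Kubernetes resources. Use to inspect cluster state before making changes.",
  "annotations": {
    "readOnlyHint": true,
    "idempotentHint": true
  },
  "inputSchema": {
    "type": "object",
    "properties": {
      "resourceType": {
        "type": "string",
        "description": "Type of resource to get (e.g., pods, deployments, services)"
      }
    },
    "required": ["resourceType"]
  }
}
```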

Common Mistakes That Drop Your Score

To show what a bad tool definition looks like, here's a synthetic example that scores 66/100 (C grade):

{
  "tools": [
    {
      "name": "do_thing",        // ← Vague name
      "description": "Does thing", // ← Useless description
      "inputSchema": {
        "type": "object",
        "properties": {
          "id": { "type": "string" },   // ← No description
          "data": { "type": "object" }  // ← No description
        }
      }
    },
    {
      "name": "delete-all"       // ← No description at all
                                  // ← No destructiveHint
    }
  ]
}

The validator flags 12 issues: missing descriptions, undescribed parameters, no annotations, and a destructive-sounding tool with no safety hint. Every one of these makes the LLM more likely to use the tool incorrectly — or skip it entirely.
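For contrast, here's roughly what that synthetic example looks like after fixing the flags. Since the original names were placeholders, the concrete names, descriptions, and hints below are invented for illustration:

```json
{
  "tools": [
    {
      "name": "update_record",
      "description": "Update an existing record's fields by ID. Use after fetching the record.",
      "annotations": { "destructiveHint": false, "idempotentHint": true },
      "inputSchema": {
        "type": "object",
        "properties": {
          "id": { "type": "string", "description": "ID of the record to update" },
          "data": { "type": "object", "description": "Fields to change, keyed by field name" }
        }
      }
    },
    {
      "name": "delete_all_records",
      "description": "Permanently delete every record. Irreversible.",
      "annotations": { "destructiveHint": true }
    }
  ]
}
```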

Validate Your Own Server

Three ways to check your tool definitions:

  1. Web validator: Paste your tools/list JSON response at wildrunai.com/tools/mcp-validator
  2. GitHub Action (free): Add to your CI pipeline to catch regressions on every PR:
    - uses: wildrunai/mcp-validate-action@v1
      with:
        file: ./tools.json
        min-score: 80
  3. API: POST your tool JSON to https://wildrunai.com/api/tools/mcp-validate for programmatic validation.

The Takeaway

Popular MCP servers are well-built. The basics — names, descriptions, schemas — are solid across the board. But annotations are where the ecosystem has a blind spot. Only 7 of 15 servers tested tell LLMs which tools are destructive. And almost nobody uses output schemas.

These aren't cosmetic issues. Annotations prevent LLMs from running send_gmail_message or chrome_inject_script without asking. Output schemas enable reliable tool chaining. Both are in the MCP spec. Both are easy to add. Both make your server safer and more useful.

Check your server. Fix the warnings. It takes 30 seconds.