Skip to content

fix: add panic recovery to goroutines that execute user code#959

Open
blackwell-systems wants to merge 1 commit into
modelcontextprotocol:mainfrom
blackwell-systems:fix/goroutine-panic-recovery
Open

fix: add panic recovery to goroutines that execute user code#959
blackwell-systems wants to merge 1 commit into
modelcontextprotocol:mainfrom
blackwell-systems:fix/goroutine-panic-recovery

Conversation

@blackwell-systems
Copy link
Copy Markdown
Contributor

Implementation for #958.

Problem

Library goroutines that dispatch user-provided handlers (tool handlers, middleware, etc.) lack defer recover(). A panic in user code crashes the entire host process.

As a library, the SDK doesn't own the process, doesn't control the supervisor, and doesn't know what else runs in the same address space. A panicking goroutine takes down the consumer's HTTP server, metrics exporters, graceful shutdown hooks, and any other in-flight work.

Fix

Add recovery to 5 goroutine sites:

File Goroutine Recovery behavior
internal/jsonrpc2/conn.go handleAsync handler dispatch Sends INTERNAL_ERROR JSON-RPC response to client
mcp/shared.go keepalive goroutine Logs via structured logger, goroutine exits
mcp/sse.go SSE reader goroutine Logs via slog.Default(), defer Close() still fires
mcp/streamable.go handleJSON response processor Calls c.fail() to signal connection error
mcp/streamable.go handleSSE stream processor Calls c.fail() to signal connection error
mcp/transport.go stdio reader goroutine Sends error on incoming channel

The critical path is handleAsync in internal/jsonrpc2/conn.go: this is where ALL tool, prompt, and resource handlers execute. A panicking tool handler currently kills the process; with this fix it returns an error response to the client and the server continues serving other sessions.

Tests

  • TestToolHandler_PanicRecovery: registers a tool that panics, calls it, verifies process survives and error is returned
  • TestToolHandler_PanicDoesNotAffectSubsequentCalls: panics one tool, then calls a different tool successfully, proving the server continues operating

Both tests crash without the fix (verified by reverting and running).

Notes

Library goroutines that dispatch user-provided handlers (tool handlers,
middleware, etc.) lack defer recover(). A panic in user code crashes
the entire host process.

Add recovery to 5 goroutine sites:

- internal/jsonrpc2/conn.go: handleAsync handler dispatch (critical path:
  ALL tool/prompt/resource handlers execute here)
- mcp/shared.go: keepalive goroutine
- mcp/sse.go: SSE reader goroutine
- mcp/streamable.go: handleJSON and handleSSE response processors
- mcp/transport.go: stdio reader goroutine

Recovery behavior:
- handleAsync: sends INTERNAL_ERROR JSON-RPC response to client (graceful)
- handleJSON/handleSSE: calls c.fail() to signal connection error
- keepalive: logs via structured logger, goroutine exits
- SSE reader: logs via slog.Default(), defer Close() still fires
- stdio reader: sends error on incoming channel

Tests added:
- TestToolHandler_PanicRecovery: proves panic doesn't crash process
- TestToolHandler_PanicDoesNotAffectSubsequentCalls: proves server
  continues serving after a panic

Fixes modelcontextprotocol#958
Comment thread mcp/streamable.go
Comment on lines +2061 to +2064
defer func() {
if r := recover(); r != nil {
c.fail(fmt.Errorf("%s: panic in handleJSON: %v", requestSummary, r))
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the threat this recover() is intended to address?
A recover() here wouldn't catch panics from user handlers since those execute on the goroutine spawned in handleAsync.

@blackwell-systems
Copy link
Copy Markdown
Contributor Author

The recover is inside the go func() spawned by handleAsync (line 640), on the same goroutine that calls c.handler.Handle(ctx, req.Request). The full call chain from that goroutine is synchronous:

  1. HandleServerSession.handle (server.go:1445)
  2. ServerSession.handlehandleReceive (shared.go:157)
  3. handleReceivemh(ctx, method, req) (shared.go:171, comment: "mh might be user code")
  4. mh → the user's tool handler

No additional goroutine is spawned between the recover and the user handler. jsonrpc2.Async(ctx) (called at server.go:1466) releases a concurrency semaphore so the next request can start being handled; it does not spawn a new goroutine.

The test TestToolHandler_PanicRecovery confirms: a tool handler calling panic("deliberate panic") is caught, the client receives an error response, and the process survives. Without the recover, that test crashes the binary.

@jba
Copy link
Copy Markdown
Contributor

jba commented May 16, 2026

Please see my comment on #958. Recovering from panics is not the right approach for these sorts of faults.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants