fix: add panic recovery to goroutines that execute user code#959
fix: add panic recovery to goroutines that execute user code#959blackwell-systems wants to merge 1 commit into
Conversation
Library goroutines that dispatch user-provided handlers (tool handlers, middleware, etc.) lack defer recover(). A panic in user code crashes the entire host process. Add recovery to 5 goroutine sites: - internal/jsonrpc2/conn.go: handleAsync handler dispatch (critical path: ALL tool/prompt/resource handlers execute here) - mcp/shared.go: keepalive goroutine - mcp/sse.go: SSE reader goroutine - mcp/streamable.go: handleJSON and handleSSE response processors - mcp/transport.go: stdio reader goroutine Recovery behavior: - handleAsync: sends INTERNAL_ERROR JSON-RPC response to client (graceful) - handleJSON/handleSSE: calls c.fail() to signal connection error - keepalive: logs via structured logger, goroutine exits - SSE reader: logs via slog.Default(), defer Close() still fires - stdio reader: sends error on incoming channel Tests added: - TestToolHandler_PanicRecovery: proves panic doesn't crash process - TestToolHandler_PanicDoesNotAffectSubsequentCalls: proves server continues serving after a panic Fixes modelcontextprotocol#958
| defer func() { | ||
| if r := recover(); r != nil { | ||
| c.fail(fmt.Errorf("%s: panic in handleJSON: %v", requestSummary, r)) | ||
| } |
There was a problem hiding this comment.
What is the threat this recover() is intended to address?
A recover() here wouldn't catch panics from user handlers since those execute on the goroutine spawned in handleAsync.
|
The recover is inside the
No additional goroutine is spawned between the recover and the user handler. The test |
|
Please see my comment on #958. Recovering from panics is not the right approach for these sorts of faults. |
Implementation for #958.
Problem
Library goroutines that dispatch user-provided handlers (tool handlers, middleware, etc.) lack
defer recover(). A panic in user code crashes the entire host process.As a library, the SDK doesn't own the process, doesn't control the supervisor, and doesn't know what else runs in the same address space. A panicking goroutine takes down the consumer's HTTP server, metrics exporters, graceful shutdown hooks, and any other in-flight work.
Fix
Add recovery to 5 goroutine sites:
internal/jsonrpc2/conn.gohandleAsynchandler dispatchmcp/shared.gomcp/sse.goslog.Default(),defer Close()still firesmcp/streamable.gohandleJSONresponse processorc.fail()to signal connection errormcp/streamable.gohandleSSEstream processorc.fail()to signal connection errormcp/transport.goThe critical path is
handleAsyncininternal/jsonrpc2/conn.go: this is where ALL tool, prompt, and resource handlers execute. A panicking tool handler currently kills the process; with this fix it returns an error response to the client and the server continues serving other sessions.Tests
TestToolHandler_PanicRecovery: registers a tool that panics, calls it, verifies process survives and error is returnedTestToolHandler_PanicDoesNotAffectSubsequentCalls: panics one tool, then calls a different tool successfully, proving the server continues operatingBoth tests crash without the fix (verified by reverting and running).
Notes
go vet,-race, and full test suite pass