An uncaught exception in our Node app was causing not only one, but two and then three workers to die. (Fortunately, we hardly ever encounter uncaught exceptions. Really, just this one since launch a few months ago. We’re Node studs! Right?)
The funny thing is that we’re using Express, which (via Connect) wraps each request / response in a try / catch. And we use Express’s error handler, which returns 500 on unhandled errors.
Another funny thing is we use cluster, which isolates workers from each other. They live in separate, solipsistic processes.
But instead of returning 500, our worker simply died. And, as if in sympathy, the rest immediately followed.
Time to get to the bottom of this. A Node stud like me can figure it out. No sweat. Right?
For a sanity check, I went to Chrome and Firefox’s network inspectors. Only one POST, the bad request that triggered the exception. Everything else looks normal. Sanity: verified.
Then it was on to the cluster module. That magical “OS load balancing” seemed highly suspicious. But nope, I asked in #nodejs and they said that only applies at the TCP connection level. Once a connection is assigned to a worker, it never goes to another worker. Meaning that the bad request was isolated—only the worker who received the initial connection could encounter it.
But the workers kept on dying.
These workers morted out fast. They didn’t even return 500, or any kind of response. The more I thought about it, that didn’t really seem right. Not right at all. Why no 500?
But I can only tackle one mystery at a time. I wanted to understand: why did so many workers die?
Furious googling ensued. My efforts were rewarded with this nugget:
If an HTTP/1.1 client sends a request which includes a request body, but which does not include an Expect request-header field with the “100-continue” expectation, and if the client is not directly connected to an HTTP/1.1 origin server, and if the client sees the connection close before receiving any status from the server, the client SHOULD retry the request.
My mind was somewhat blown. The browers were right after all. They were just following HTTP. And—helpfully!—hiding the resent POSTs from the network inspector.
But POSTs are dangerous. They mutate resources! I must only click the Order button once or I may get charged multiple times!
I had a thought. One I have often, yet each time, it seems new again: I have much to learn.
Back to the 500s. Or lack thereof. Which got funnier still when I realized that
other errors in our controllers that threw exceptions did return 500s. Being a
hands-on kind of guy, I added one right at the top of a route controller:
new Error("uh-oh"). My dev server spat back:
500 Error: uh-oh.
So why did that one particular error never, ever return a 500, or any response of any kind?
It’s my fault, really. I’m still a Node newbie (I must never forget this). I had missed that because async IO callbacks occur in a different call stack from the request / response cyle, one that originates from the event loop, Express’s try / catch doesn’t catch them.
It makes total sense. I have much to learn.
But how do I get this domain into my controller? My solution was to attach it to
res.locals._domain. Good solution? I don’t know. I suspect there’s a better
way. Good enough? It solved my immediate problem:
1 2 3
Sweet. Now, armed with a reference to
res in the domain error handler, I can
return a 500. Voila, the browser gets its response. No more helpful resent
POSTs. The silent gratitude of the spared workers is its own reward.
I have much to learn.