The better solution? That the 0mq libs do the right thing and don't get wedged. ...

rumcajz · on June 26, 2012

The problem with that is that a 0MQ socket abstracts multiple underlying connections. Reporting errors would mean making the connections visible to the user. There would have to be connection IDs, an accept function, error notifications etc. In the end the whole thing would boil down to an ugly version of standard BSD TCP sockets.

The right thing to do is to re-send the request after a disconnection or after a timeout has expired. It can be done easily in the application; however, if you want it inside the library, feel free to submit a patch.
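As a rough sketch of what "done in the application" could look like, the helper below re-sends a request whenever no reply arrives within a timeout. The names `request_with_retry` and `attempt` are hypothetical and not part of 0MQ; with a real REQ socket, `attempt` would typically also need to close and reopen the socket before sending again, because a REQ socket that is still waiting for a reply will not accept another send.

```python
def request_with_retry(attempt, retries=3, timeout=1.0):
    """Call attempt(timeout) until it returns a reply or retries run out.

    `attempt` is a hypothetical callable supplied by the application: it
    sends the request once and either returns the reply or raises
    TimeoutError if nothing arrived within `timeout` seconds.
    """
    for _ in range(retries):
        try:
            return attempt(timeout)
        except TimeoutError:
            continue  # peer silent or gone: send the request again
    raise TimeoutError("no reply after %d attempts" % retries)
```

Because the server may see the same request more than once, this only works safely when requests are idempotent or carry an ID the server can use to drop duplicates.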

sausagefeet · on June 26, 2012

I've been experimenting with being completely asynchronous (and working on being connection-less). The protocol layer just wraps up payloads and unpacks them. There is a background heartbeat, and when the heartbeat is not met there is a notification saying so, but the user is in charge of deciding whether this should be considered a disconnection. This is mostly inspired by how Oz does distribution. I don't have any good results yet, though.
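To make the idea concrete, here is a minimal sketch (not the actual experiment described above) of a heartbeat monitor that only reports a missed heartbeat and leaves the disconnect decision to the caller. `HeartbeatMonitor`, `on_missed`, and the injectable `clock` are all hypothetical names chosen for illustration.

```python
import time


class HeartbeatMonitor:
    """Tracks the last heartbeat and reports misses; it never disconnects."""

    def __init__(self, interval, on_missed, clock=time.monotonic):
        self.interval = interval      # max seconds allowed between beats
        self.on_missed = on_missed    # user callback: receives seconds of silence
        self.clock = clock            # injectable so tests can fake time
        self.last_beat = clock()

    def beat(self):
        """Record that a heartbeat arrived from the peer."""
        self.last_beat = self.clock()

    def check(self):
        """Notify the user if the peer has been silent for too long."""
        silence = self.clock() - self.last_beat
        if silence > self.interval:
            self.on_missed(silence)   # the user decides what this means
            return False
        return True
```

A background thread or event loop would call `check()` periodically and `beat()` whenever a heartbeat message arrives; the callback can log, retry, or tear the session down as the application sees fit.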

noselasd · on June 26, 2012

Receiving and handling I/O errors is easy, the harder part is when something goes wrong on the peer and you don't receive an error.

When dealing with network code, you need 1) Timeouts, 2) Keepalives.

What kind of keepalives and timeouts depends entirely on your needs. The problem is that most libraries/protocols don't have either, and most example code never shows this. (Any TCP example that does a read() or write() without any form of timeout is a DoS waiting to happen.)
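For illustration, a read with a timeout can be as small as the sketch below, which uses Python's standard `socket` module. The helper name `recv_with_timeout` is made up for this example, and it covers only the timeout half; keepalives are left out.

```python
import socket


def recv_with_timeout(sock, nbytes, timeout):
    """Read up to nbytes, or return None if the peer stays silent."""
    sock.settimeout(timeout)          # applies to subsequent blocking calls
    try:
        return sock.recv(nbytes)      # b"" here means the peer closed cleanly
    except socket.timeout:
        return None                   # caller decides: retry, ping, or close
```

Returning `None` on silence keeps "the peer said nothing" distinct from "the peer closed the connection", which is exactly the case that never produces an I/O error on its own.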

stavros · on June 26, 2012

There's a problem when you restart servers at the wrong moment, though, as the article mentions...

rdtsc · on June 26, 2012

So ... in other words there is a serious problem.

Restart a server, have a server crash, or hit a network outage, and now potentially thousands of clients are in a bad state they cannot recover from. And this was by design. That was the point of the post, I think. This isn't so much an oversight as a bad design decision.

rarrrrrr · on June 26, 2012

Agreed, but the bad design is not in ZMQ, but the way it is being used.

A client should never just wait forever for a response to a message. Any reliable system has to implement something like timeouts or immediate message acknowledgement (at which point maybe you can wait forever for a reply).
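A small sketch of the acknowledgement variant, assuming a hypothetical transport that puts `("ack", None)` and later `("reply", payload)` onto a per-request `queue.Queue`; none of these names come from ZMQ:

```python
import queue


def await_ack_then_reply(inbox, ack_timeout):
    """Wait briefly for an ack, then wait without limit for the reply.

    `inbox` is a hypothetical queue.Queue fed by the transport with
    ("ack", None) followed by ("reply", payload) for this request.
    """
    try:
        kind, _ = inbox.get(timeout=ack_timeout)
    except queue.Empty:
        raise TimeoutError("request was not acknowledged in time")
    if kind != "ack":
        raise RuntimeError("expected ack, got %r" % (kind,))
    kind, payload = inbox.get()       # ack seen: blocking here is a choice
    return payload
```

The short timeout covers the window in which the request may never have reached the server. The unbounded second wait is only as safe as the server's ability to finish the work once it has acknowledged it, so a long but finite timeout there is often the more cautious choice.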

There's a comprehensive discussion of this in chapter 4 "Reliable Request Reply" of the ZMQ guide.

TCP doesn't give you guaranteed response either. Just guaranteed delivery (or error.) In this case, this is exactly what the author's getting with ZMQ. The client's Request socket makes a successful delivery, then the server crashes before generating a response, and the client waits forever for the response that will not come.