
> Get used to predicate calculus, temporal logic, and writing proofs.

But at that point, you need to start considering bugs in proofs. You're saying "Don't accept 'battle-tested' as the hallmark of a reliable system or algorithm" - and that would be awesome - I'd like to use a fully verified system. The reality, however, is that so far we have verified some specific implementations of specific algorithms and made a massive effort to verify something like seL4... and that's it (in publicly available systems, anyway). Unless you're into spending billions, nobody will prove their systems correct.



> But at that point, you need to start considering bugs in proofs.

If there's a flaw in your specification, it is much more obvious than a flaw in your code. A high-level specification is usually built up from formal mathematics to model the execution of a system. If you use a model checker, it will take the specification of the model of your system and check that the invariants and properties you specify hold across all possible executions of that model. If there is a problem in your specification it will be immediately obvious, which cannot be said of computer programs.
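To make "check across all possible executions" concrete, here is a minimal sketch of what a model checker does, in Python rather than TLA+. The model is a hypothetical broken lock (my own toy example, not from the thread): two processes each read the lock and set it in *separate* steps, and a breadth-first search over every reachable state finds the interleaving that violates mutual exclusion.

```python
from collections import deque

# Toy model: two processes share a "lock". Each process moves
#   idle --(sees lock == 0)--> want --(sets lock = 1)--> crit --(lock = 0)--> idle
# The check and the set are separate steps, so the "lock" is broken.
def successors(state):
    pcs, lock = state[:2], state[2]
    for i in (0, 1):
        pc = list(pcs)
        if pcs[i] == "idle" and lock == 0:
            pc[i] = "want"
            yield (pc[0], pc[1], lock)
        elif pcs[i] == "want":
            pc[i] = "crit"
            yield (pc[0], pc[1], 1)
        elif pcs[i] == "crit":
            pc[i] = "idle"
            yield (pc[0], pc[1], 0)

def check(init, invariant):
    """Breadth-first search over all reachable states; returns a
    counterexample trace if the invariant can be violated, else None."""
    seen, queue = {init: None}, deque([init])
    while queue:
        s = queue.popleft()
        if not invariant(s):
            trace = []  # reconstruct the path back to the initial state
            while s is not None:
                trace.append(s)
                s = seen[s]
            return trace[::-1]
        for t in successors(s):
            if t not in seen:
                seen[t] = s
                queue.append(t)
    return None

mutual_exclusion = lambda s: not (s[0] == "crit" and s[1] == "crit")
trace = check(("idle", "idle", 0), mutual_exclusion)
# trace is a shortest sequence of steps ending with both processes in "crit".
```

Both processes read `lock == 0` before either has set it - exactly the kind of interleaving a human reviewer tends to miss and an exhaustive checker surfaces immediately, with the offending trace attached.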

In other words if you cannot see the error in your specification what makes you think you should be implementing it?

Now there is a long-held myth that we cannot verify the entire software stack... so what's the point?

You do not have to verify the entire software stack. A specification of a critical component is often sufficient. Take Paxos as an example: its author, Leslie Lamport, wrote the specification in TLA+. You can follow the mathematics to convince yourself of its efficacy, or you can trust that the model checker is implemented correctly and will be sufficient for your purposes. The same applies to a proof assistant. The point is that you do not need to assume these things are true: you can verify them by hand if you have to... the math will check out.

The same is true for other critical components of distributed systems. A single- and even multi-threaded application on a single computer is difficult enough to reason about (what with multiple cores executing tasks out-of-order and sharing different levels of cache... it's almost like a tiny distributed system itself). Humans make terrible predictions of performance and correctness that fail time and again when they are benchmarked and profiled. It gets orders of magnitude worse in large, distributed systems, and specifications are simply a tool for managing thinking in those terms.

It's actually a tool for simplifying the development of such complex systems. I don't trust a consensus protocol or a lock-free algorithm if it is not accompanied by some sort of verifiable specification... not some hand-waving English prose, but a hard formal mathematical specification.

Do we still use these under-specified systems? Yes, sometimes. The trade-off being that the risk is acceptable and we design ways to tolerate failures as best as we can. However I will prefer a system that provides specifications for its key components.

On the myth of cost... it's true, thinking hard about your system up-front has a cost. Where and what you choose to specify will correlate with how much time you spend. The other part of the equation, however, is how critical it is to get the system correct. Another way to think about it: what is the worst thing that could happen if your design contains a flaw?

Getting this wrong is obscenely easy in distributed systems. Our brains are not well enough equipped to hold a model of our system and trace its execution and all of its state 49 steps deep. When confronted with that much information, many of us cannot even see the forest for the trees! We think, "Bah, that could only happen once in 250,000 executions... it's not worth worrying about!", not realizing that when you make that bet a couple of million times a minute, you will see that error quite often. Finding those kinds of traces postmortem is incredibly difficult... so why waste the time searching for that error when spending the extra time up-front will save you the headache down the road?
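The arithmetic behind that bet is worth writing out, using the numbers from the paragraph above (one failure per 250,000 executions, a couple of million executions a minute):

```python
# An error that fires once per 250,000 executions on average,
# running about 2,000,000 executions per minute:
p_error = 1 / 250_000
runs_per_minute = 2_000_000

expected_hits = p_error * runs_per_minute          # ~8 failures every minute
p_clean_minute = (1 - p_error) ** runs_per_minute  # chance of one clean minute
# p_clean_minute is roughly 0.0003: the "rare" error is all but
# certain to fire within the very first minute of operation.
```

"Once in 250,000" sounds safe; eight expected hits per minute does not.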

Yes, it requires more work and costs more at the beginning of a project... but if you cannot afford to spend it, then know what you're getting yourself into and act accordingly.

I honestly believe that these mathematical tools can be taught to intermediate-to-advanced programmers, who could use them to check their designs and write better software. It may seem intimidating, but it's just booleans: there are only two possible values for any expression in predicate logic! In all seriousness, the tools available today for this work are becoming usable by mere mortals and are not the eldritch scrawls of snooty ivory-tower academics.

You don't need to spend billions. You can learn TLA+ in a couple of weeks and start writing specifications in a month. There's a fairly slim tome called Programming in the 1990s that demonstrates a practical technique for proofs that is digestible by programmers.

I say, "don't accept 'battle-tested' as the hallmark of a reliable system," because it says nothing about the underlying, undiscovered errors in the implementation (I made the mistake of using the word "bug" earlier... a habit I am trying to correct). All that buzzword signifies is that the developers implemented a white paper as best they could, wrote unit tests for the code they care about, and used their customers to find the errors and edge cases they didn't initially consider. While that is valuable to you after the fact, be wary: you'll probably find some of those errors yourself. Hopefully the paper their critical sections are based on published a specification or proof. That's always a bonus... but you need to check these things.

Worse, someone might try to sell you their distributed mutex on the strength of a well-reasoned essay describing the algorithm... don't trust it unless you can check the model or understand the proof. It may sound reasonable, but engineers don't build skyscrapers without blueprints.


> In other words if you cannot see the error in your specification what makes you think you should be implementing it?

Since we're talking about proving correctness: nobody can prove they didn't miss an error, either in the specification or in the implementation. Even Wikipedia provides a trivial case of this in the article on TLA+: https://en.wikipedia.org/wiki/TLA%2B#Liveness

Would you never miss the fact that you didn't specify that the clock must tick? Would you never miss an error like that in a system orders of magnitude more complicated?
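That clock example is easy to reproduce. Below is a sketch in Python (not TLA+) of a toy clock whose model allows a "stutter" step: every safety invariant you can check over its reachable states passes, yet nothing forces the clock to ever tick. That omission is a liveness error, and no amount of invariant checking will flag it.

```python
# A toy clock: each step either ticks (hour advances, wrapping 12 -> 1)
# or stutters (nothing changes). The safety invariant "hour is in 1..12"
# holds in every reachable state -- but a behavior that stutters forever
# is perfectly legal, because nothing in the model forces a tick.
def successors(hour):
    yield hour                           # stutter step
    yield 1 if hour == 12 else hour + 1  # tick step

def reachable(init):
    seen, frontier = {init}, [init]
    while frontier:
        h = frontier.pop()
        for n in successors(h):
            if n not in seen:
                seen.add(n)
                frontier.append(n)
    return seen

states = reachable(1)
assert all(1 <= h <= 12 for h in states)  # safety check: passes
# "The clock eventually ticks" is a temporal (liveness) property; in TLA+
# it would be stated as a fairness condition on the tick action, which
# this model omits -- and the invariant check above cannot notice that.
```

The specification is "correct" in the sense that it never reaches a bad state; it is incomplete in the sense the parent comment describes.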

So by extension - if you cannot ensure you see an error in your specifications, what makes you think you should write them? I'm not saying the specification will be incorrect - you can check that of course. I'm saying it can be incomplete.

It's the same as with tests: write more of them, catch more bugs. Specify more, verify more. In either case you can't prove you covered all the cases you implicitly meant to.

Honestly, at this point, I'd rather take an implementation that's battle-tested by a large company for months than a verified implementation which hasn't been used by anyone apart from the authors in their lab. (and yeah, that may not be a popular view with some people :) )


"So by extension - if you cannot ensure you see an error in your specifications, what makes you think you should write them?"

Your standard here is that it must achieve perfection or we don't use it at all. A realistic approach is to use anything proven to increase correctness, reliability, security, whatever, as much as one can justify. Decades of work in formal specification show these methods catch all kinds of errors. The most recent example, which I think agentultra was referring to, is Amazon's use of TLA+, which easily caught problems whose traces would have required 30 or more steps to reach in testing. They caught so many problems and improved their understanding enough that the team is totally sold on it. Same kinds of results as in the safety-critical industry and high-assurance security. Nobody doubted it was helping them.

So, such empirical evidence of all kinds of problems caught by the method, plus its other benefits, means it's a good method to use if one can. The growing complexity of distributed systems, with its growing potential for problems, means one should be increasing their tooling rather than decreasing it. For another good tool in this space, check out the Verdi system for verifying distributed systems against failures. As usual, applying it has already caught problems in real-world software.


> Your standard here is that it must achieve perfection or we don't use it at all.

Quite the opposite, actually. It was just a hyperbolic response to "why implement if you don't have a proven specification" - which, as you explained, is silly as well.

I completely agree: if you have the time, resources, and a proper setting for it, go for proofs and verification. Most software as we use it today cannot afford this and doesn't need it. The bugs are either acceptable or not, but they're never "purposefully [left] in your system" as the OP claimed. (which was the phrase that really annoyed me / caused the response)

I'm quite excited about the progress in software verification. But outside of very special cases today, I don't think it's worth it.


> The bugs are either acceptable or not, but they're never "purposefully [left] in your system" as the OP claimed. (which was the phrase that really annoyed me / caused the response)

It seems you are really annoyed that I give responsibility for "bugs," or, as I like to call them, errors, to programmers. If you do not use a specification for your lock-free mutex or consensus protocol then I posit that you are willfully choosing not to use a tool that can help you avoid errors in your design and that you should conduct yourself accordingly. The original article provides many good suggestions for coping with this situation.

And while I'm on the subject, it's worth pointing out that even when the specification checks out, the implementation may still contain errors. This is normal. The specification gives you a compass that helps you track down errors in your implementation. Each error found gives you another invariant to add to your specification and another assertion or contract to add to your code.
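As a sketch of what "invariant to assertion" looks like in code (a hypothetical example of mine, not one from the thread): suppose your specification carries the invariant `0 <= size <= capacity` for a bounded queue. The same invariant can be restated as a runtime assertion that every mutating operation re-checks:

```python
class BoundedQueue:
    """Bounded FIFO queue. Spec invariant: 0 <= len(items) <= capacity."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = []

    def _check_invariant(self):
        # The specification's invariant, restated as a runtime assertion.
        assert 0 <= len(self.items) <= self.capacity

    def put(self, x):
        if len(self.items) >= self.capacity:
            raise OverflowError("queue is full")
        self.items.append(x)
        self._check_invariant()

    def get(self):
        x = self.items.pop(0)
        self._check_invariant()
        return x
```

If an implementation bug ever drives the queue past its capacity, the assertion fails at the point of the error rather than corrupting state silently downstream - the spec tells you exactly which property to assert.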

These are good tools to have and as I and others have suggested one does not need to "verify" the entire system to make use of specifications. They are a great design tool in the programmer's tool belt that help us to write better software. Distributed systems are hard enough and we need all the help we can get.


"If you do not use a specification for your lock-free mutex or consensus protocol then I posit that you are willfully choosing not to use a tool that can help you avoid errors in your design and that you should conduct yourself accordingly."

Sort of. Remember that, if they work for companies, programmers' job is not to write bug-free software: it's to translate requirements into software within constraints imposed by management. They should put in what quality they can. They often don't have time to spec things out because management won't allow it. "Ship, ship, ship!" That's why I typically recommend they re-use previously-proven solutions for components like the ones you mentioned.

So, they certainly are leaving errors in intentionally if they avoid tools that would catch them. Yet it's not always their fault. It's also not always a bad thing if we're talking about hobbyists doing stuff for fun, where they and/or their community accept the occasional bug. On the opposite end, people would get fired at Altran/Praxis for not doing what they could for correctness. To each their own based on their needs or situation.


Hence the caveat: conduct yourself accordingly.

Just because you don't use specifications doesn't make you a bad programmer, but you must be aware of your own shortcomings and plan for them. If you're not allowed to use specifications, you're being told you're not allowed to think. If that's the case, you know there are going to be errors produced by the team, either in the design itself or in the implementation.



