An introduction to Clear Containers (2015) (lwn.net)
78 points by sillysaurus3 on June 2, 2016 | hide | past | favorite | 12 comments


I've been writing a paper and hoping to give a talk about this at KVM Forum. I wasn't ready to publish this, but here's an early version of the paper for anyone interested:

http://oirase.annexia.org/tmp/paper.pdf

Note: I'm not connected to Intel, just trying to reproduce their work in QEMU/KVM and regular kernels. KVM Forum details: http://events.linuxfoundation.org/events/kvm-forum/


> "DAX is also working in modern kernels, and it was a relatively trivial job to implement DAX."

What is DAX? Is that like an execute-in-place feature that means we don't need paging to execute code?

If so, is this feature generally useful for all QEMU use cases?


In real hardware, DAX comes in two parts. First there is an NVDIMM device (basically ordinary DRAM, but backed by additional flash chips so it preserves its contents when powered off). NVDIMMs work at RAM speeds, so although you can use them as fast block devices, it's better to map them directly and avoid the block layer completely. The other part is a filesystem implementation (I have used ext4, but xfs has one too) which lets you mmap files directly if they live on a device backed by NVDIMMs. It also gives you execute-in-place for binaries (like the old XIP support in ext2, which DAX obsoletes).
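For the filesystem half, a minimal sketch of what this looks like from the command line (assuming a kernel with DAX support and an NVDIMM namespace that shows up as /dev/pmem0; device names on your system may differ):

```shell
# Make an ext4 filesystem on the NVDIMM-backed block device...
mkfs.ext4 /dev/pmem0

# ...and mount it with the dax option, so mmap() on its files maps
# the NVDIMM pages directly, bypassing the page cache.
mount -o dax /dev/pmem0 /mnt/pmem

# Files under /mnt/pmem are now accessed at RAM speed, without
# going through the block layer.
```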

QEMU has a virtual NVDIMM ("vNVDIMM") so you can test this without needing the real NVDIMM hardware. But in this case it's also a useful way to reduce memory usage, since you're avoiding the block layer and page cache in the guest. That's the theory. Although I've been able to make it function correctly, I didn't observe any great improvements in speed or memory usage (see the paper for details).

Here is a patch which adds DAX support to libguestfs which should give you some ideas how to try out DAX in QEMU: https://www.redhat.com/archives/libguestfs/2016-May/msg00138...
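As a starting point, here is roughly what attaching a vNVDIMM looks like on the QEMU command line (a sketch only; it assumes a QEMU recent enough to have NVDIMM emulation, and the backing file path and sizes are made up):

```shell
# Sketch: expose a plain file on the host as a vNVDIMM in the guest.
# nvdimm=on enables NVDIMM emulation on the machine; the
# memory-backend-file object backs the device with an ordinary host
# file; the guest should then see it as /dev/pmem0.
qemu-system-x86_64 \
    -machine pc,nvdimm=on \
    -m 2G,slots=2,maxmem=4G \
    -object memory-backend-file,id=mem1,share=on,mem-path=/var/tmp/nvdimm.img,size=1G \
    -device nvdimm,memdev=mem1,id=nv1
```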


I enjoyed this, although it was mostly over my head.

I'd suggest adding definitions or simple explanations for the acronyms. There are a lot of them, and I spent quite a lot of time Googling as a result.


How do I compile Linux to get these results? Is there a ready-made Linux build config for this?


There is a config in the source of the talk: http://git.annexia.org/?p=libguestfs-talks.git;a=tree;f=2016... which could be a good place to start. Or you could follow the instructions in the paper about starting from the absolutely minimal config and adding features until you get something bootable.
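If you go the minimal-config route, this is the rough shape of it (a sketch; the exact option list depends on your QEMU setup, and `scripts/config` is the helper script shipped in the kernel source tree):

```shell
# Start from the smallest possible configuration...
make allnoconfig

# ...then switch on just what a KVM guest needs to boot: a tty,
# virtio devices, and (for the DAX experiments) NVDIMM/DAX support.
./scripts/config --enable 64BIT \
                 --enable TTY --enable PRINTK \
                 --enable PCI --enable VIRTIO_PCI \
                 --enable VIRTIO_BLK --enable VIRTIO_CONSOLE \
                 --enable ZONE_DEVICE --enable FS_DAX

# Resolve dependencies of the newly enabled options, then build.
make olddefconfig
make -j"$(nproc)"
```

Iterate by booting under QEMU and adding whatever option the failure points at, until you get something bootable.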


Needs (2015) tag. See also the official release announcement[0] and documentation[1].

0: https://coreos.com/blog/rkt-0.8-with-new-vm-support/

1: https://github.com/coreos/rkt/blob/master/Documentation/runn...


Good catch. I've added the year.


This explains how it's possible to boot a VM + a container in 150ms.

It's surprising that you can boot a server in the blink of an eye.


It's not really surprising, it's just a lot of hard work on small details of the boot process. Intel are way ahead here. See my other comment for a draft of a paper I've been writing about this.


This is an improvement on regular containers, but an expected one, given it's old news for microkernels and separation kernels. TU Dresden, I think, was first, with L4 and L4Linux showing you could boot a VM every second on a slow machine with a low TCB. I imagine it could have been faster. Then OKL4 and LynxSecure both showed the context-switch time could be negligible, with joint Intel and Lynx work demonstrating 100,000 context switches per second with 97% idle CPU or something. Both offer vastly stronger isolation, since the Linux part is deprivileged, plus you can run apps directly on the separation kernel.

So, the problem becomes easier once you transition from building on components and architectures not designed for security to those designed for it from the ground up. An example of the latter is the Genode OS framework. You can have as much or as little complexity in your app deployment as you want.


I have a few questions, if anyone can answer them: 1) How does this deal with networking? With containers we can delegate IPs to them in a fine-grained way, and they can even share our own IPs. 2) How does this deal with storage?



