This is possibly the next generation of video/audio/image codecs.
What he did was create a specialized compression algorithm that compresses the individual frames of Blade Runner very well, and decompresses them (lossily, like MP3) back into a video stream.
To put this into perspective: Blade Runner is 117 minutes long. At 25 frames per second, that is 175,500 frames.
As he says, the input data he used was 256x144 with 3 colour channels, meaning each frame was 110,592 bytes. This results in an input amount of 18,509 MB, uncompressed.
His neural network compresses each image down to 200 floats though, i.e. 800 bytes. So the whole movie, as compressed by the NN, is about 134 MB.
A friend of mine who works with neural networks estimated his NN to be roughly 90MB at 256x144, so the storage needed for the movie he created is about 222 MB.
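The arithmetic above can be checked in a few lines (a sketch; floats taken as 32-bit, 1 MB as 1024 * 1024 bytes):

```python
# Back-of-the-envelope check of the numbers above.
frames = 117 * 60 * 25                # 175,500 frames at 25 fps
raw_frame = 256 * 144 * 3             # 110,592 bytes, one byte per channel
latent_frame = 200 * 4                # 800 bytes for 200 float32 values

raw_mb = frames * raw_frame / 1024**2        # ~18,509 MB uncompressed
latent_mb = frames * latent_frame / 1024**2  # ~134 MB of latent vectors
print(int(raw_mb), int(latent_mb))           # 18509 133
```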
This means that if this technology can be made fast enough, and the reproduction high enough in fidelity, we could be looking at replacing compressors (Fraunhofer MP3, LAME, MP4, DivX, JPEG) made to handle all types of input equally well with compressors that are so good for one specific set of data that compressor + data is smaller than anything the traditional compressors could create.
And even if it's not fast enough, it can still be a very efficient compressor, trading size for CPU.
Plus, lastly, the NN itself can be compressed as well, and classic compressors can possibly be applied to the frame data.
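To illustrate that last point, here is a minimal sketch of running a classic compressor over quantized per-frame floats. The latent values here are hypothetical random stand-ins; real latents would be correlated and compress better:

```python
import zlib
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical stand-in for one frame's latent vector: 200 float32 values.
latent = rng.standard_normal(200).astype(np.float32)

# Quantizing to 8 bits per value is already a 4x saving (800 -> 200 bytes);
# a classic compressor like zlib can then squeeze out remaining redundancy
# (little here, since these values are random; real latents would fare better).
quantized = np.clip(latent * 32 + 128, 0, 255).astype(np.uint8)
packed = zlib.compress(quantized.tobytes(), level=9)

print(len(latent.tobytes()), len(quantized.tobytes()))  # 800 200
```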
"with compressors that can are so good for one specific set of data that compressor + data is smaller than anything the traditional compressors could create."
This is, well, hopeful but probably wrong, because
we already know how to make it smaller for this type of data.
The algorithms used for interframe/intraframe prediction are chosen to trade off speed against size. If you built a really large-scale predictor that was able to generate very small representations for changes, you would get what he's built.
(Note that encoders already make complex selections among tons of different prediction algorithms for each set of frames, etc.)
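A toy version of that per-block mode selection might look like this (illustrative only; real encoders weigh rate-distortion costs across many more modes):

```python
import numpy as np

def encode_block(block, prev_block):
    """Pick whichever toy prediction mode leaves the smaller residual:
    'inter' copies the co-located block from the previous frame,
    'intra' predicts a flat block at the block's mean value."""
    residual_inter = block - prev_block
    residual_intra = block - block.mean()
    # Sum of absolute differences as a crude cost; real encoders cost
    # the actual entropy-coded bits against the distortion.
    if np.abs(residual_inter).sum() <= np.abs(residual_intra).sum():
        return "inter", residual_inter
    return "intra", residual_intra

prev = np.zeros((8, 8))
still = np.zeros((8, 8))        # unchanged block: inter copy wins
flat = np.full((8, 8), 7.0)     # new flat block: intra prediction wins
print(encode_block(still, prev)[0], encode_block(flat, prev)[0])  # inter intra
```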
Someone could already do this if they wanted to.
We just don't.
Because it's not fast enough (and nothing in that work changes this)
Would it be useful to apply NNs to video encoders to better select among prediction modes, etc.? Probably. But that's already being done, and it is not this guy's work.
"And even if its not fast enough, it can still be a very efficient compressor, trading size for CPU.
"
The problem you have is not just compression time.
It's decompression time.
The bitstream of H.265 etc. is designed to be fast to decode.
What this guy is building is not.
If you were to make it so, it would probably look closer to a normal video codec bitstream, and take up that much space.
In fact, he hasn't built anything truly new; he's just using existing papers and making an implementation. He also says, in his master's thesis, that it is primarily an artistic exploration.
Even with hardware decoding, you can only make stuff so fast.
TL;DR: While interesting, there are people working on the things you are talking about, and it's not this guy (at least not in this work).
I would not expect magic here. We already create video codec algorithms by trading off CPU cost against size. The trick is trying to get better size without increasing CPU cost significantly.
As these resources change (and remember, Moore's law is pretty much dead), video codecs will change and videos will get smaller, but you aren't likely to see serious breakthroughs.
We already could produce very small videos by applying tremendous amounts of CPU power.
It wouldn't be decoded on the CPU, but on the GPU. Or even specialized hardware. As convnets are being used more in image processing, that isn't too unrealistic. There is interest in making specialized consumer hardware for convnets.
And it doesn't need to work on every frame, it could pump out I-frames every 15 seconds or so.
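In codec terms that's just a long GOP; a toy schedule at the 25 fps assumed above, with a hypothetical 15-second cadence:

```python
# One I-frame every 15 seconds at 25 fps; everything in between predicted (P).
fps, cadence_s = 25, 15
gop = fps * cadence_s                     # 375 frames per group
schedule = ["I" if i % gop == 0 else "P" for i in range(2 * gop)]
print(schedule[0], schedule[1], schedule[gop])  # I P I
```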
Except that, in my understanding, the decoded version looks like shit. So, while I'm optimistic that this technology might eventually be quite good, more heuristics are probably needed to find the right way to extract an optimal encoding.
As part of a compression suite, that may not matter all that much. I can imagine a hybridized version of this that starts with the ANN and then applies 'fixes' via a traditional video codec stream. If the ANN can represent most of the bits of a frame pretty well, the ANN + codec data could match a traditional codec-only approach with substantially less data.
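A rough sketch of that hybrid idea, with random pixels and a fake near-perfect "ANN reconstruction" standing in for the real thing: because the residual is mostly near zero, it entropy-codes far better than the raw frame.

```python
import zlib
import numpy as np

rng = np.random.default_rng(1)
# One hypothetical 256x144 grayscale frame of random pixels.
frame = rng.integers(0, 256, size=(144, 256), dtype=np.uint8)

# Hypothetical ANN reconstruction: right except for a little noise.
noise = rng.integers(-2, 3, size=frame.shape)
recon = np.clip(frame.astype(np.int16) + noise, 0, 255).astype(np.uint8)

# Store only the diff; small residuals have low entropy.
residual = (frame.astype(np.int16) - recon.astype(np.int16)).astype(np.int8)

raw_size = len(zlib.compress(frame.tobytes(), 9))
residual_size = len(zlib.compress(residual.tobytes(), 9))
print(raw_size, residual_size)  # the residual stream is a few times smaller
```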
The thing is that the diffs between the film and the "encoded" version look pretty fuzzy/muddy, so compressing that diff would be difficult.
Anyway, the article doesn't mention this idea. The first HN discussion had someone claim some other ridiculous use for the fuzzy/muddy images that are supposed to mean something, and ultimately I'd say it's an example of "my research project has cool images, I can use them for publicity".
Right, but my point is that your numbers don't indicate anything about the actual potential as a compression format. It is certainly possible to use it that way, but there is still much that is uncertain about whether a neural network can do better than hand-crafted compression formats. So it's a little early to call it "next generation." It is literally meaningless to point out that "his neural network compresses each image down to 200 floats", since it doesn't actually work well with such a small latent space.
It's not exactly unknown that autoencoders can reconstruct images, so I don't see why it's "impressive that it works at all."
> So it's a little early to call it "next generation."
Please don't ignore words other people write, or at least reread before hitting post. I did say possibly.
> I don't see why it's "impressive that it works at all."
I thought it was implicit from my explanation that the "works at all" includes the fact that the data sizes are reasonable already. If we had the same result, but the intermediary data was 18 gigabytes, then there would be nothing impressive about it. As it is, we're a lot below that, before further compression, so it is.
Remember: Prototype, Proof of Concept. You're looking at the first step, not the last step. You're looking at a motor carriage, not a Tesla.
> I thought it was implicit from my explanation that the "works at all" includes the fact that the data sizes are reasonable already.
Right, but why is that impressive if it doesn't actually result in a good reconstruction? I can take any collection of numbers and summarize it with the mean value, that doesn't imply that averaging is a good compression method.
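The averaging point is easy to make concrete: a hypothetical one-number "codec" has a spectacular ratio and a terrible reconstruction.

```python
import numpy as np

rng = np.random.default_rng(2)
frame = rng.integers(0, 256, size=(144, 256)).astype(np.float64)

# 'Compress' the whole frame to a single number...
code = frame.mean()
# ...and 'decompress' by repeating it everywhere.
recon = np.full_like(frame, code)

# The compression ratio is fantastic; the mean squared error is not.
mse = ((frame - recon) ** 2).mean()
print(round(code), round(mse))
```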
If you consider this a proof of concept, then what concept, exactly, does it prove? That statistics can represent a dataset?
You would store the diff between the reconstructed version and the original. The diff could be compressed to much less space than the original image, because the NN can predict most of the pixels exactly and get within a few bits on the rest.
It sounds to me, if only superficially, that this technique might benefit from work done in compressive sensing. The mechanisms of encoding and decoding seem very related.