Hi HN, I’m one of the two authors of the post and the Linum v2 text-to-video model (https://news.ycombinator.com/item?id=46721488). We're releasing our Image-Video VAE (open weights) and a deep dive on how we built it. Happy to answer questions about the work!
It's cool to see the iterative improvements to your model laid out, but for everything that worked, I imagine there were at least a million other things you tried that didn't work out. What's your process for trying these different techniques/architectures? Do you just wait for one experiment to finish and visually inspect the results every time? Seems hard since these take a while to train. How do you shorten the feedback loop in this space?
Honestly, it's really hard to shorten the feedback loop in this space. For this, we really did just run one experiment at a time and visually inspect the results every time. When you're going 0 -> 1, you're looking for "signs of life" to make sure the basic thing is working. When it comes to deciding which (of the infinite) levers to pull, a lot of it comes from intuition (which I know isn't the most fun answer). We spent a week or so just running experiments on the amount of compression we could squeeze out of the VAE without significant degradation in the final results. In hindsight, spending a week on that seems like a waste, since we got the 8x spatial, 4x temporal compression within the first 1-2 days. But in the moment, you're often unsure WHAT will be the key unlock. So, when you're in the middle of the storm, you're running a quick Bayesian process in your head, weighing what you might learn from the outcome of the experiment against the time/money it would take to run it. And you hope that your intuitions become stronger over time, as you take more repetitions. More money might help the problem (e.g. parallel experiments, more detailed explorations). But I don't think money is a cure-all. At some point, you get lost in the sauce trying to tie the threads between all the empirical findings you have at your fingertips. Maybe one day AI models could help here by integrating all these results. As it stands, they still struggle to reason about this stuff in the context of other research papers and findings (likely because all the context on arXiv is so noisy; you can't trust any particular finding, and verifying findings is so hard to do that it's hard to meta-reason about your experiments correctly).
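For a rough sense of what those compression factors mean in practice, here is a back-of-the-envelope sketch. It assumes the 4x figure refers to the temporal axis, and the function names and example clip size are made up for illustration. It also ignores details such as special handling of the first frame, which some video VAEs use.

```python
# Sketch only: the function names, the example clip size, and the reading of
# "4x" as temporal compression are assumptions, not details from the thread.
import math


def latent_grid(frames, height, width, spatial=8, temporal=4):
    """Return the (frames, height, width) of the latent grid for a clip."""
    return (
        math.ceil(frames / temporal),
        math.ceil(height / spatial),
        math.ceil(width / spatial),
    )


def position_reduction(spatial=8, temporal=4):
    """How many input pixel positions map to one latent position."""
    return spatial * spatial * temporal


clip = (48, 480, 832)  # hypothetical clip: frames, height, width
print(latent_grid(*clip))    # (12, 60, 104)
print(position_reduction())  # 256
```

Under these assumptions, each latent position covers 8 x 8 x 4 = 256 pixel positions, which is why small changes to the compression factors have such a large effect on downstream training cost.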
This seems like a great model for experimenting with fine-tuning on original art, given that it's relatively small and has an open license. Is that a fair assessment?
Thanks for the great write up and making it available to us all.
Hadn't seen that before! Seems very in line with the broader points about regularization. In Table 4 they show faster convergence in 200 epochs when used alongside REPA. I'd be curious to see whether it ends up beating REPA by itself with the full 800 epochs of training, or whether something about this new latent space leads to plateauing (learns faster but caps out on expressivity). We've seen that phenomenon before in other situations (e.g. a U-Net learns faster than a DiT because of convolutions, but stops learning beyond a certain point).
The kind that I like so much on HN. It tickles your mind but is still clear enough for an advanced beginner.