Merve Noyan PRO

merve

AI & ML interests

VLMs, vision & co

it's raining vision language models ☔️
CuMo is a new vision language model that has MoE in every step of the VLM (image encoder, MLP and text decoder) and uses Mistral-7B for the decoder part 🤓
You can try it yourself here: shi-labs/CuMo-7b-zero

the authors first pre-trained the MLP while keeping the image encoder and text decoder frozen, then warmed up the whole network by unfreezing and fine-tuning it, which they state stabilizes visual instruction tuning when the experts are brought in. 🤓
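
to make the freezing schedule concrete, here is a minimal sketch of the two stages described above as a lookup table. the stage and module names (`pre-training`, `pre-finetuning`, `image_encoder`, `mlp`, `text_decoder`) and the helper `trainable_modules` are my own labels for illustration, not CuMo's actual config.

```python
# Hypothetical stage table: True means the module is trained in that stage.
# The names are illustrative labels, not taken from the CuMo codebase.
STAGES = {
    # stage 1: only the MLP is trained, encoder and decoder stay frozen
    "pre-training": {"image_encoder": False, "mlp": True, "text_decoder": False},
    # stage 2: everything is unfrozen and fine-tuned as a warm-up
    "pre-finetuning": {"image_encoder": True, "mlp": True, "text_decoder": True},
}


def trainable_modules(stage):
    """Return the sorted names of modules that are updated in a stage."""
    return sorted(name for name, trainable in STAGES[stage].items() if trainable)


print(trainable_modules("pre-training"))
print(trainable_modules("pre-finetuning"))
```

in a real training script a flag like this would typically decide whether a module's parameters receive gradients.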

the mixture-of-experts MLP blocks are simply copies of the same MLP block, each initialized from the single MLP that was trained during pre-training and fine-tuned in pre-finetuning.
it works very well (I also tested it myself): it outperforms the previous SOTA models of its size, LLaVA NeXt and IDEFICS2-8B, on several benchmarks! 😍
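
the expert initialization described above (every expert starts as a copy of the single trained MLP) can be sketched in plain Python. this is a toy illustration under my own assumptions, not the authors' implementation: the "MLP" is a single ReLU layer, the router weights are arbitrary, top-2 routing with renormalized gates is assumed, and all function names (`upcycle`, `moe_forward`, ...) are hypothetical.

```python
import copy
import math


def mlp_forward(weights, x):
    """Toy one-layer 'MLP': y = ReLU(W x), with W given as a list of rows."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in weights]


def upcycle(single_mlp, num_experts):
    """Initialize every expert as an independent copy of the single trained MLP."""
    return [copy.deepcopy(single_mlp) for _ in range(num_experts)]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def moe_forward(experts, router, x, top_k=2):
    """Send x to the top_k experts and mix their outputs by renormalized gates."""
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in router]
    gates = softmax(logits)
    chosen = sorted(range(len(experts)), key=lambda i: gates[i], reverse=True)[:top_k]
    norm = sum(gates[i] for i in chosen)
    out = [0.0] * len(experts[0])
    for i in chosen:
        y = mlp_forward(experts[i], x)
        out = [o + (gates[i] / norm) * yi for o, yi in zip(out, y)]
    return out


single_mlp = [[0.5, -1.0], [2.0, 0.25]]  # stands in for the pre-trained MLP
experts = upcycle(single_mlp, num_experts=4)
router = [[0.1, 0.2], [0.3, -0.4], [-0.5, 0.6], [0.7, 0.8]]  # arbitrary values
x = [1.0, 2.0]

dense_out = mlp_forward(single_mlp, x)
moe_out = moe_forward(experts, router, x)
# Right after initialization all experts are identical, so both outputs match
# up to floating-point rounding.
print(dense_out, moe_out)
```

one nice property of this kind of initialization: at the moment the experts are introduced, the MoE block computes the same function as the single MLP it was copied from, so the experts only diverge as training continues.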