ConvLLaVA: Hierarchical Backbones as Visual Encoder for Large Multimodal Models (Paper • 2405.15738 • Published 8 days ago)
iVideoGPT: Interactive VideoGPTs are Scalable World Models (Paper • 2405.15223 • Published 9 days ago)
Meteor: Mamba-based Traversal of Rationale for Large Language and Vision Models (Paper • 2405.15574 • Published 8 days ago)
Visual Fact Checker: Enabling High-Fidelity Detailed Caption Generation (Paper • 2404.19752 • Published Apr 30)