News
Future MLLMs: Contribution of MIL-Based Techniques and Enriched Visual Signals
This paper concludes that MIVPG is a general, powerful component for fusing enriched visual representations in MLLMs.
MIVPG on E-commerce: Multi-Image/Multi-Patch Aggregation for Captioning
MIVPG uses hierarchical MIL to outperform patch-concatenation and single-image baselines, indicating that Correlated Self-Attention (CSA) is key to capturing instance correlation.
Gigapixel Pathology: MIVPG Outperforms Baselines in Medical Captioning
MIVPG significantly outperforms baselines by using instance correlation and shows strong domain adaptation over epochs.
Evaluating Visual Adapters: MIVPG Performance on Single and Multi-Image Inputs
Details MIVPG experiments across single- and multi-image scenarios. The model uses a frozen LLM and visual encoder, updating only the MI...
MIVPG and Instance Correlation: Enhanced Multi-Instance Learning
MIVPG uses a Correlated Self-Attention (CSA) module to unveil instance correlation, fulfilling all MIL properties while outperform...
Multimodal Fusion: MIVPG's Hierarchical MIL Approach for Multi-Image Samples
Details MIVPG's hierarchical approach to MIL for multi-image samples. It treats both image patches and whole images as 'instances'...
