News
Nov 19, 2025
Visual Prompt Generation: Cross-Attention in Q-Former
Details the Q-Former architecture: a 12-layer BERT-based model using 32 learnable query embeddings. These queries use cross-attent...
Nov 18, 2025
MIVPG on E-commerce: Multi-Image/Multi-Patch Aggregation for Captioning
MIVPG uses hierarchical multiple instance learning (MIL) to outperform patch-concatenation and single-image baselines, indicating that CSA is key to modeling correlation.
Nov 15, 2025
Evaluating Visual Adapters: MIVPG Performance on Single and Multi-Image Inputs
Details MIVPG experiments across single- and multi-image scenarios. The model uses a frozen LLM and visual encoder, updating only the MI...
