Blog
Nov 19, 2025
Visual Prompt Generation: Cross-Attention in Q-Former
Details the Q-Former architecture: a 12-layer, BERT-initialized transformer with 32 learnable query embeddings. The queries use cross-attention to extract visual features from a frozen image encoder, producing a compact sequence for MLLM input (see the sketch below).
Source: HackerNoon →
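
For intuition, here is a minimal PyTorch sketch of the pattern the post describes: a fixed set of learnable query vectors cross-attending to frozen image features. All dimensions (768 hidden size, 12 heads, 1024-dim vision features) and module names are illustrative assumptions, not the Q-Former's actual implementation, which interleaves cross-attention with self-attention across its 12 BERT layers.

```python
# Illustrative sketch of Q-Former-style cross-attention (NOT the official
# BLIP-2 code). Dimensions and names below are assumptions for clarity.
import torch
import torch.nn as nn

class QFormerCrossAttentionSketch(nn.Module):
    def __init__(self, num_queries=32, hidden_dim=768, num_heads=12, vision_dim=1024):
        super().__init__()
        # 32 learnable query embeddings, shared across all inputs
        self.queries = nn.Parameter(torch.randn(num_queries, hidden_dim) * 0.02)
        # Cross-attention: queries attend to (projected) frozen image features
        self.cross_attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        # Project vision-encoder features into the Q-Former hidden size
        self.vision_proj = nn.Linear(vision_dim, hidden_dim)
        self.norm = nn.LayerNorm(hidden_dim)

    def forward(self, image_feats):
        # image_feats: (batch, num_patches, vision_dim) from a frozen image encoder
        b = image_feats.size(0)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)      # (b, 32, hidden)
        kv = self.vision_proj(image_feats)                   # (b, patches, hidden)
        out, _ = self.cross_attn(query=q, key=kv, value=kv)  # queries pull visual info
        return self.norm(out + q)                            # (b, 32, hidden) for the LLM

# Usage: 32 queries summarize e.g. 257 ViT patch tokens into a fixed-size output
feats = torch.randn(2, 257, 1024)
print(QFormerCrossAttentionSketch()(feats).shape)  # torch.Size([2, 32, 768])
```

The key design point the post highlights: because the query count is fixed at 32, the MLLM always receives the same-length visual prefix regardless of image resolution or patch count.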