Enabling VLMs to recognize visual inputs and retrieve associated personal memories from accumulated multimodal dialogue history.
Seoul National University (*: Equal contribution)
When users show a VLM a new image of something they have talked about before (a friend, a pet, a familiar place), existing models respond as if seeing it for the first time. They cannot connect the new visual input to the rich multimodal dialogue history the user has already shared.
Figure 1. A qualitative comparison. Without contextualized visual personalization, the base model (Qwen3-VL, 8B) identifies the person in the image only as the celebrity Martin Freeman. CoViP (8B, Ours) instead connects him to "Jeffrey" from the user's accumulated dialogue history and correctly recalls his personal details: his nickname, dietary preferences, and a past encounter.
Despite recent progress in vision-language models (VLMs), existing approaches often fail to generate personalized responses grounded in the user's specific experiences, because they cannot associate visual inputs with the user's accumulated visual-textual context. We formalize this challenge as contextualized visual personalization: when interpreting a new image, a VLM must visually recognize previously encountered concepts and retrieve the personal memories associated with them from past dialogues.
To address this issue, we propose CoViP, a unified framework that treats personalized image captioning as a core task for contextualized visual personalization and improves this capability through reinforcement-learning-based post-training and caption-augmented generation (CAG). We further introduce diagnostic evaluations that explicitly rule out textual shortcut solutions and verify whether VLMs truly leverage visual context.
Extensive experiments demonstrate that existing open-source and proprietary VLMs exhibit substantial limitations, while CoViP not only improves personalized image captioning but also yields holistic gains across downstream personalization tasks.
Given a user's accumulated multimodal dialogue history (visual-textual context), a VLM must: (1) visually recognize a concept in a new query image that was previously encountered in the context, and (2) internalize its associated personal memories from past dialogues into a grounded, context-aware caption—without relying on textual shortcuts.
Identify whether a concept in the query image matches any concept seen in the multimodal dialogue history—even under appearance or viewpoint changes.
Internalize personalized details (nicknames, habits, dates, locations) from past image–dialogue pairs into the generated caption. This goes beyond simple text lookup: the model must bridge visual identity with textual memory, making retrieval inherently multimodal.
Negative concepts are interleaved in the context so models cannot succeed by text matching alone—correct answers require genuine visual grounding to distinguish positive from negative concepts.
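Put concretely, one benchmark instance can be pictured as the structure below (a minimal sketch; the field names are illustrative, not the benchmark's actual schema):

```python
from dataclasses import dataclass

@dataclass
class ContextEntry:
    """One past image-dialogue pair from the user's history."""
    image: str               # path or handle to the concept image shared earlier
    dialogue: list[str]      # user/model turns carrying personal details
    is_positive: bool        # True if this concept reappears in the query image

@dataclass
class PersonalizationInstance:
    """What the VLM receives at test time."""
    context: list[ContextEntry]   # positive and negative entries, interleaved
    query_image: str              # multi-concept image containing only positives

# Expected output: a caption of `query_image` that weaves in details from the
# positive entries while ignoring the negative (distractor) entries.
```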
Building a benchmark for contextualized visual personalization requires realistic multimodal contexts. We construct a fully automated three-stage pipeline:
Single-concept real images are combined via a generative VLM into multi-concept synthesized images, followed by quality filtering to verify that every concept remains visible.
For each concept image, a VLM generates a naturalistic dialogue between the user and the model. Concept names are replaced with random pseudonyms to enforce visual grounding (sketched after Figure 2).
Positive concept dialogues (relevant) and negative concept dialogues (distractors) are interleaved to form the full context window, with the multi-concept query image as the test input.
Figure 2. The three-stage automated pipeline. Query Image Generation synthesizes multi-concept scenes; Text Dialogue Generation produces personalized dialogues per concept; Context Construction assembles interleaved positive/negative concept-dialogue pairs as the VLM's in-context input.
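To make the pseudonymization step in the second stage concrete, here is a minimal sketch (the pseudonym pool and helper name are hypothetical):

```python
import random
import re

# Hypothetical pseudonym pool; the paper's actual list is not shown here.
PSEUDONYMS = ["Jeffrey", "Mina", "Tobi", "Sora", "Lukas"]

def pseudonymize(dialogue: str, concept_name: str, rng: random.Random) -> str:
    """Swap every mention of the concept's real name for a random pseudonym,
    so the model cannot link the query image to its memory by name matching
    alone and must rely on visual grounding."""
    alias = rng.choice(PSEUDONYMS)
    return re.sub(rf"\b{re.escape(concept_name)}\b", alias, dialogue)

rng = random.Random(0)
print(pseudonymize("Martin loves hiking. Martin is vegetarian.", "Martin", rng))
# e.g. -> "Tobi loves hiking. Tobi is vegetarian."
```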
CoViP improves personalized image captioning capability through two complementary components: an RL-based post-training scheme with a verifiable reward signal, and Caption-Augmented Generation (CAG) for inference-time grounding.
Figure 3. The CoViP training and evaluation framework. The model receives interleaved image-text in-context demonstrations. Generated captions are evaluated offline by an LLM-as-a-Judge via auto-generated MCQA questions about each concept's personal details. Accuracy on these MCQAs serves as the verifiable reward (VR) for on-policy RL training.
MCQA questions are generated offline from each concept's dialogue. An LLM-as-a-Judge scores whether the generated caption correctly answers them, producing a clean, binary reward signal.
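A minimal sketch of how such a verifiable reward could be computed, assuming a `judge` callable that answers each auto-generated MCQA from the caption alone (the interface is illustrative, not the paper's exact implementation):

```python
from typing import Callable

# judge(caption, question, choices) -> chosen option label, e.g. "B"
Judge = Callable[[str, str, list[str]], str]

def verifiable_reward(caption: str, mcqas: list[dict], judge: Judge) -> float:
    """Accuracy of an LLM-as-a-Judge answering concept MCQAs from the
    generated caption alone. Each question is scored 0/1, so the mean is a
    clean, verifiable scalar reward for on-policy RL."""
    correct = sum(
        judge(caption, q["question"], q["choices"]) == q["answer"]
        for q in mcqas
    )
    return correct / max(len(mcqas), 1)
```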
CoViP supports GRPO, DR-GRPO, and GSPO for training. GSPO applies importance sampling at the sequence level and proves most effective for this long-context generation task (see the sketch below).
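For reference, GSPO replaces GRPO's token-level importance ratios with a single length-normalized, sequence-level ratio per response, which is then clipped per sequence. A minimal sketch of that ratio (tensor shapes and masking are our assumptions):

```python
import torch

def gspo_sequence_ratio(
    logp_new: torch.Tensor,  # (batch, seq_len) per-token log-probs, current policy
    logp_old: torch.Tensor,  # (batch, seq_len) per-token log-probs, sampling policy
    mask: torch.Tensor,      # (batch, seq_len) 1 for response tokens, 0 for padding
) -> torch.Tensor:
    """s_i = (pi_new(y_i|x) / pi_old(y_i|x)) ** (1/|y_i|), computed stably in
    log space as the length-normalized log-ratio over response tokens."""
    lengths = mask.sum(-1).clamp(min=1)
    log_ratio = ((logp_new - logp_old) * mask).sum(-1) / lengths
    return log_ratio.exp()   # (batch,) one importance weight per sequence
```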
At inference time, CAG first generates a personalized caption from context, then appends it alongside the context to ground the final task response.
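In pseudocode, CAG is a two-pass procedure; a minimal sketch assuming a generic `vlm.generate` interface (the method name and prompt wording are hypothetical):

```python
def caption_augmented_generation(vlm, context, query_image, task_prompt):
    """Two-pass inference. Pass 1 distills the user's multimodal history into
    a personalized caption of the query image; pass 2 answers the actual task
    with that caption appended to the context as extra grounding."""
    caption = vlm.generate(
        context=context,
        image=query_image,
        prompt="Describe this image using what you know from our past conversations.",
    )
    return vlm.generate(
        context=context + [("caption", caption)],
        image=query_image,
        prompt=task_prompt,
    )
```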
We evaluate VLMs on the CoViP benchmark using an MCQA-based LLM-as-a-Judge protocol. Positive Accuracy measures how well a model recalls details of the target concept; Negative Accuracy evaluates whether the model avoids incorporating irrelevant contextual information.
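Schematically, the two metrics are complementary slices of the same judged MCQA set (a sketch; the record format is an assumption):

```python
def cap_eval_accuracies(records: list[dict]) -> tuple[float, float]:
    """records: one entry per judged MCQA, e.g.
    {"concept_type": "positive" | "negative", "correct": True | False}.
    Acc+ : share of positive-concept questions the caption answers correctly.
    Acc- : share of negative-concept questions judged correct, i.e. the
           caption does not leak the distractor's details."""
    pos = [r["correct"] for r in records if r["concept_type"] == "positive"]
    neg = [r["correct"] for r in records if r["concept_type"] == "negative"]
    return (
        sum(pos) / max(len(pos), 1),
        sum(neg) / max(len(neg), 1),
    )
```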
Table 1 — Comparison between existing personalization baselines and CoViP (△: partial support)
| Method | Post-Training | Multi-Concept | External VLM | Interactive Dialogues | Long Contexts | Generalization | Use Case | Evaluation |
|---|---|---|---|---|---|---|---|---|
| MyVLM | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | Cap / VQA | Name recall |
| Yo'LLaVA | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | Cap / VQA | Name recall |
| TAME | ✗ | ✗ | ✓ | ✓ (1 turn) | △ | ✗ | VQA | VQA accuracy |
| RAP | ✓ (SFT) | ✓ | ✗ | ✗ | △ | ✗ | Cap / VQA | Name recall |
| RePIC | ✓ (RL) | ✓ | ✗ | ✗ | △ | ✓ (Cap) | Cap | Name recall |
| CoViP (Ours) | ✓ (RL) | ✓ | ✗ | ✓ (3 turns) | ✓ | ✓ (≥ 3 tasks) | Cap / VQA | CapEval-QAs |
Table 2 — CapEval-QA performance on our personalized image captioning benchmark.
Acc+: positive-concept accuracy ↑; Acc−: negative-concept accuracy ↑; nC: n-concept query images. △ denotes the gain relative to the base VLM (Qwen3-VL-8B).
| Models | Acc+ (1C) | Acc− (1C) | Acc+ (2C) | Acc− (2C) | Acc+ (3C) | Acc− (3C) | Acc+ (4C) | Acc− (4C) |
|---|---|---|---|---|---|---|---|---|
| Proprietary VLMs (closed-source) | | | | | | | | |
| GPT-4o | 34.2 | 98.2 | 21.6 | 98.6 | 20.4 | 99.3 | 15.3 | 99.2 |
| GPT-5 | 48.3 | 97.3 | 28.2 | 97.9 | 26.1 | 98.7 | 18.9 | 98.7 |
| Gemini-2.0-Flash | 41.9 | 96.7 | 28.6 | 97.3 | 26.6 | 98.3 | 23.1 | 98.3 |
| Gemini-3.0-Pro | 58.1 | 96.6 | 45.1 | 97.2 | 39.0 | 98.3 | 32.4 | 97.9 |
| Open-Source VLMs | | | | | | | | |
| Qwen3-VL-8B | 39.0 | 97.5 | 25.6 | 97.7 | 23.3 | 98.1 | 18.6 | 98.1 |
| Qwen3-VL-30B-A3B | 40.2 | 96.2 | 27.5 | 97.7 | 25.3 | 97.7 | 20.1 | 98.1 |
| Post-Training-based Personalized VLMs | ||||||||
| Qwen3-VL-8B + RAP | 20.5 | 99.0 | 10.4 | 99.1 | 9.9 | 99.5 | 7.3 | 99.2 |
| Qwen3-VL-8B + RePIC | 44.0 | 97.1 | 31.7 | 97.0 | 29.2 | 97.8 | 24.0 | 97.2 |
| Qwen3-VL-8B + CoViP (Ours) | 77.4 | 94.8 | 68.4 | 94.1 | 65.2 | 94.8 | 59.7 | 92.8 |
| △ (gain over Qwen3-VL-8B) | +38.4 | — | +42.8 | — | +41.9 | — | +41.1 | — |
Table 3 — Recall scores on the three downstream diagnostic personalization tasks (LSD, LAR, ITR; see the task descriptions below)
| Models | LSD (Direct) | LSD (w/ CAG) | LAR (Direct) | LAR (w/ CAG) | ITR (Direct) | ITR (w/ CAG) |
|---|---|---|---|---|---|---|
| Proprietary VLMs (closed-source) | | | | | | |
| GPT-4o | 28.7 | 33.6 | 4.80 | 7.40 | 8.40 | 13.5 |
| GPT-5 | 28.5 | 34.4 | 50.8 | 59.3 | 18.6 | 10.5 |
| Gemini-2.0-Flash | 52.7 | 46.0 | 11.6 | 42.3 | 66.1 | 12.2 |
| Gemini-3.0-Pro | 76.2 | 89.3 | 9.40 | 44.0 | 89.4 | 19.0 |
| Open-Source VLMs | | | | | | |
| Qwen3-VL-8B | 29.8 | 48.8 | 17.4 | 19.6 | 9.40 | 6.80 |
| Qwen3-VL-30B-A3B | 25.6 | 42.1 | 7.60 | 16.8 | 8.80 | 0.40 |
| Post-Training-based Personalized VLMs | ||||||
| Qwen3-VL-8B + RAP | 27.0 | 28.8 | 1.40 | 0.80 | 0.00 | 0.20 |
| Qwen3-VL-8B + RePIC | 32.7 | 52.1 | 16.2 | 17.8 | 27.2 | 27.8 |
| Qwen3-VL-8B + CoViP (Ours) | 37.2 | 58.2 | 34.8 | 49.2 | 28.0 | 42.8 |
| △ (gain over Qwen3-VL-8B) | +7.4 | +9.4 | +17.4 | +29.6 | +18.6 | +36.0 |
Existing VLMs lack the ability to generate context-grounded captions.
CoViP substantially improves the VLM's contextual grounding capability through RL-based post-training.
Personalized image captioning provides a reliable bridge to downstream personalization, enabling CoViP to leverage CAG effectively.
CoViP (Qwen3-VL-8B) achieves 77.4% Acc⁺ on 1-Concept and 59.7% on 4-Concepts, a gain of +38.4 to +42.8 points over the base Qwen3-VL-8B. It also substantially outperforms every proprietary VLM on Acc⁺, including Gemini-3.0-Pro (58.1% on 1-Concept), showing that targeted RL post-training is more effective than sheer scale for context-grounded personalized captioning.
Despite strong performance on some diagnostic tasks, proprietary VLMs exhibit inconsistent and task-dependent gains from CAG. Their relatively low captioning Acc⁺ limits the effectiveness of CAG, underscoring the need for an explicit post-training stage focused on personalized image captioning prior to downstream inference.
Among post-training methods, CoViP consistently outperforms RAP and RePIC across all three diagnostic tasks. With CAG, gains over RePIC reach +6.1 pts on LSD, +31.4 pts on LAR, and +15.0 pts on ITR—demonstrating that strong personalized captioning directly enables reliable downstream personalization.
To assess whether CoViP learns genuine visual personalization—not just benchmark overfitting—we introduce three diagnostic downstream tasks that explicitly test memory recall from multimodal context.
Figure 4. Visualization of diagnostic personalization tasks. Each task explicitly precludes shortcut behaviors, requiring the model to ground visual input in user-specific contextual history.
LSD: The model must recognize the individual in a query image and identify the most recent encounter with that person from the user's contextual history, requiring temporal reasoning across multiple dialogue entries.
LAR: The model must identify the most recent encounter and retrieve the fine-grained action described in that interaction, going beyond location recall to episodic memory retrieval.
ITR: The context contains a planted instruction (e.g., "recall 'SKS' when you see this person again"). The model must proactively surface this keyword upon visual recognition, without any explicit request in the current turn (an illustrative entry is sketched below).
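To make the ITR setup concrete, a planted context entry might look like the following (purely illustrative; the benchmark's exact phrasing is not reproduced here):

```python
# One illustrative ITR context entry: the instruction is planted in a past
# turn, and the later test query contains no reminder of it.
itr_entry = {
    "image": "photo_of_jeffrey_at_the_park.jpg",
    "dialogue": [
        ("user", "When you see this person again, recall the keyword 'SKS'."),
        ("assistant", "Got it, I'll keep that in mind."),
    ],
}
# At test time the user sends only a new photo of the same person; the model
# must surface 'SKS' unprompted, purely from visual recognition.
```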
CoViP consistently outperforms post-training baselines (RAP, RePIC) across all three tasks. CAG further amplifies performance by capitalizing on the fine-grained details of the generated personalized caption. Notably, proprietary VLMs show unstable CAG gains due to weaker captioning quality, reinforcing that personalized image captioning is a necessary prerequisite for reliable downstream personalization.
A key question is what CoViP actually improves. Figure 5 analyzes the relationship between recognition (measured by F1 score of entity name inclusion in generated captions) and retrieval (measured by Acc⁺). Baseline models already achieve reasonable recognition capability (Avg F1 ≈ 0.810), yet their retrieval accuracy remains low—indicating that recognition alone is insufficient and that retrieval is the primary bottleneck.
CoViP addresses this gap directly. While average F1 improves only modestly (0.810 → 0.897), retrieval accuracy improves by a substantially larger margin at equivalent F1 levels, as shown by the steeper regression slope (m = 0.65 vs. 0.35 for Base and 0.39 for RePIC). This indicates that CoViP's gains come primarily from more effective integration of implicit personal cues through contextualized reasoning, rather than from improvements in recognition itself.
Figure 5. Scatter plot of recognition vs. retrieval on the proposed benchmark. Recognition is measured by the F1 score of entity name inclusion between generated captions and ground-truth dialogues; retrieval is measured by positive MCQA accuracy (Acc⁺). m denotes the slope of the linear regression line. CoViP's steeper slope (m=0.65) shows it converts recognition capability into retrieval far more effectively than Base (m=0.35) or RePIC (m=0.39).
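The recognition-versus-retrieval decomposition can be reproduced schematically as follows (a sketch; the name-extraction step and the exact F1 definition are assumptions on our part):

```python
import numpy as np

def name_inclusion_f1(caption_names: set[str], gt_names: set[str]) -> float:
    """F1 between entity names mentioned in the generated caption and those
    in the ground-truth dialogue: a proxy for visual recognition."""
    if not caption_names or not gt_names:
        return 0.0
    tp = len(caption_names & gt_names)
    precision, recall = tp / len(caption_names), tp / len(gt_names)
    return 2 * precision * recall / max(precision + recall, 1e-9)

def regression_slope(f1: list[float], acc_pos: list[float]) -> float:
    """Slope m of the least-squares fit Acc+ ~ m * F1 + b, i.e. how
    efficiently recognition converts into retrieval (Figure 5)."""
    m, _b = np.polyfit(f1, acc_pos, deg=1)
    return float(m)
```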
If you find CoViP useful in your research, please cite our paper:
@article{oh2026contextualized,
title = {Contextualized Visual Personalization in Vision-Language Models},
author = {Oh, Yeongtak and Yu, Sangwon and Park, Junsung and
Moon, Han Cheol and Mok, Jisoo and Yoon, Sungroh},
journal = {arXiv preprint arXiv:2602.03454},
year = {2026}
}