arXiv 2026

CoViP: Contextualized Visual Personalization
in Vision-Language Models

Enabling VLMs to recognize visual inputs and retrieve associated personal memories from accumulated multimodal dialogue history.

Yeongtak Oh*  ·  Sangwon Yu*  ·  Junsung Park  ·  Han Cheol Moon  ·  Jisoo Mok  ·  Sungroh Yoon

Seoul National University   (*: Equal contribution)


VLMs Forget Who You Know

When users show a VLM a new image of someone they have talked about before—a friend, a pet, a familiar place—existing models respond as if it were the first time. They cannot connect the new visual input with the rich multimodal dialogue history that the user has already shared.

Motivation: VLM with and without contextualized visual personalization

Figure 1. A qualitative comparison. Without contextualized visual personalization (Qwen3-VL, 8B), the model treats Martin Freeman as a generic celebrity. CoViP (8B, Ours) correctly recalls Jeffrey's personal details—his nickname, dietary preferences, and past encounter—from the user's accumulated dialogue history.

Overview

Despite recent progress in vision-language models (VLMs), existing approaches often fail to generate personalized responses grounded in a user's specific experiences, because they cannot associate visual inputs with the user's accumulated visual-textual context. We formalize this challenge as contextualized visual personalization, which requires a VLM to visually recognize previously encountered concepts in a new image and to retrieve the personalized details associated with them from past dialogues.

To address this issue, we propose CoViP, a unified framework that treats personalized image captioning as a core task for contextualized visual personalization and improves this capability through reinforcement-learning-based post-training and caption-augmented generation (CAG). We further introduce diagnostic evaluations that explicitly rule out textual shortcut solutions and verify whether VLMs truly leverage visual context.

Extensive experiments demonstrate that existing open-source and proprietary VLMs exhibit substantial limitations, while CoViP not only improves personalized image captioning but also yields holistic gains across downstream personalization tasks.

What is Contextualized Visual Personalization?

Given a user's accumulated multimodal dialogue history (visual-textual context), a VLM must: (1) visually recognize a concept in a new query image that was previously encountered in the context, and (2) internalize its associated personal memories from past dialogues into a grounded, context-aware caption—without relying on textual shortcuts.

👁

Visual Recognition

Identify whether a concept in the query image matches any concept seen in the multimodal dialogue history—even under appearance or viewpoint changes.

🔗

Multimodal Retrieval

Internalize personalized details (nicknames, habits, dates, locations) from past image–dialogue pairs into the generated caption. This goes beyond simple text lookup: the model must bridge visual identity with textual memory, making retrieval inherently multimodal.

🚫

Shortcut-Free Evaluation

Negative concepts are interleaved in the context so models cannot succeed by text matching alone—correct answers require genuine visual grounding to distinguish positive from negative concepts.

Automated Pipeline for Personalized Contexts

Building a benchmark for contextualized visual personalization requires realistic multimodal contexts. We construct a fully automated three-stage pipeline:

Stage I: Query Image Generation

Single-concept real images are combined via a generative VLM to produce multi-concept synthesized images, followed by quality filtering to confirm all concepts are present.

Stage II: Text Dialogue Generation

For each concept image, a VLM generates naturalistic dialogue between a user and the model. Concept names are replaced by random pseudonyms to enforce visual grounding.

Stage III: Context Construction

Positive concept dialogues (relevant) and negative concept dialogues (distractors) are interleaved to form the full context window, with the multi-concept query image as the test input.

Three-stage data construction pipeline for the CoViP benchmark

Figure 2. The three-stage automated pipeline. Query Image Generation synthesizes multi-concept scenes; Text Dialogue Generation produces personalized dialogues per concept; Context Construction assembles interleaved positive/negative concept-dialogue pairs as the VLM's in-context input.
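The context-construction stage can be sketched in a few lines: positive and negative concept dialogues are shuffled into a single in-context window so that text matching alone cannot identify the relevant entries. The dialogue dictionary fields (`image`, `turns`) are illustrative, not the pipeline's actual schema:

```python
import random

def build_context(pos_dialogues, neg_dialogues, seed=0):
    """Interleave positive (relevant) and negative (distractor) concept
    dialogues into one in-context window (sketch of Stage III)."""
    rng = random.Random(seed)
    entries = [(d, True) for d in pos_dialogues] + [(d, False) for d in neg_dialogues]
    rng.shuffle(entries)  # interleave so ordering gives no textual shortcut
    context = []
    for dialogue, is_positive in entries:
        context.append({
            "image": dialogue["image"],    # concept image shown in a past turn
            "turns": dialogue["turns"],    # pseudonymized user-model dialogue
            "is_positive": is_positive,    # hidden label, used only for evaluation
        })
    return context
```

The hidden `is_positive` label never reaches the model; it only lets the evaluator score whether distractor details leaked into the output.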

CoViP: RL Training with LLM-as-a-Judge Rewards

CoViP improves personalized image captioning capability through two complementary components: an RL-based post-training scheme with a verifiable reward signal, and Caption-Augmented Generation (CAG) for inference-time grounding.

CoViP RL training pipeline: interleaving image-text in-context demonstrations with LLM-as-a-Judge reward

Figure 3. The CoViP training and evaluation framework. The model receives interleaved image-text in-context demonstrations. Generated captions are evaluated offline by an LLM-as-a-Judge via auto-generated MCQA questions about each concept's personal details. Accuracy on these MCQAs serves as the verifiable reward (VR) for on-policy RL training.

🎯

Verifiable Reward

MCQA questions are generated offline from each concept's dialogue. An LLM-as-a-Judge scores whether the generated caption correctly answers them, producing a clean, binary reward signal.
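A minimal sketch of this reward, assuming the judge's per-question answers have already been collected offline (the function and argument names are ours, not the paper's):

```python
def verifiable_reward(judge_answers, gold_answers):
    """Verifiable reward (sketch): the LLM-as-a-Judge answers the
    offline-generated MCQAs using only the generated caption; each
    question is scored as binary correct/incorrect and the fraction
    correct becomes the reward in [0, 1]."""
    assert len(judge_answers) == len(gold_answers) > 0
    correct = sum(j == g for j, g in zip(judge_answers, gold_answers))
    return correct / len(gold_answers)
```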

RL Algorithms

CoViP supports GRPO, DR-GRPO, and GSPO training. GSPO applies importance sampling at the sequence level, proving most effective for this long-context generation task.

📄

Caption-Augmented Generation

At inference time, CAG first generates a personalized caption from context, then appends it alongside the context to ground the final task response.
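The two-step CAG procedure can be sketched as follows, assuming a hypothetical `vlm.generate(images=..., text=...)` interface; the prompts are placeholders, not the paper's actual templates:

```python
def caption_augmented_generation(vlm, context, query_image, task_prompt):
    """Caption-Augmented Generation (CAG), sketched.
    Step 1: generate a personalized caption grounded in the context.
    Step 2: answer the downstream task with that caption prepended."""
    images = [entry["image"] for entry in context] + [query_image]
    caption = vlm.generate(
        images=images,
        text="Describe the query image using the dialogue history.",
    )
    answer = vlm.generate(
        images=images,
        text=f"Personalized caption: {caption}\n\n{task_prompt}",
    )
    return caption, answer
```

The design choice here is that the caption acts as an explicit, compact summary of the relevant memories, so the second generation step no longer has to re-locate them in the long context.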

Quantitative Results

We evaluate VLMs on the CoViP benchmark using an MCQA-based LLM-as-a-Judge protocol. Positive Accuracy measures how well a model recalls details of the target concept; Negative Accuracy evaluates whether the model avoids incorporating irrelevant contextual information.
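A sketch of how these two metrics aggregate the judge's per-question verdicts; the record schema is hypothetical:

```python
def capeval_accuracy(records):
    """Compute Acc+ (accuracy on positive-concept questions) and Acc-
    (accuracy on negative-concept questions, i.e. not importing
    irrelevant distractor details). Each record is a dict:
    {"is_positive": bool, "correct": bool}."""
    pos = [r["correct"] for r in records if r["is_positive"]]
    neg = [r["correct"] for r in records if not r["is_positive"]]
    acc_pos = sum(pos) / len(pos) if pos else 0.0
    acc_neg = sum(neg) / len(neg) if neg else 0.0
    return acc_pos, acc_neg
```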

Table 1 — Comparison between existing personalization baselines and CoViP

| Method | Post-Training | Multi-Concept | External VLM | Interactive Dialogues | Long Contexts | Generalize | Use Case | Evaluation |
|---|---|---|---|---|---|---|---|---|
| MyVLM | | | | | | | Cap / VQA | Name recall |
| Yo'LLaVA | | | | | | | Cap / VQA | Name recall |
| TAME | | | | ✓ (1-turn) | | | VQA | VQA accuracy |
| RAP | ✓ (SFT) | | | | | | Cap / VQA | Name recall |
| RePIC | ✓ (RL) | ✓ (Cap) | | | | | Cap | Name recall |
| CoViP (Ours) | ✓ (RL) | | | ✓ (3-turns) | | ✓ (Tasks ≥ 3) | Cap / VQA | CapEval-QAs |

Table 2 — CapEval-QAs performance on our personalized image captioning benchmark.
Acc⁺: positive-concept accuracy ↑; Acc⁻: negative-concept accuracy ↑. Column pairs correspond to 1-, 2-, 3-, and 4-concept query images. △ denotes the gain relative to the base VLM.

| Models | Acc⁺ (1) | Acc⁻ (1) | Acc⁺ (2) | Acc⁻ (2) | Acc⁺ (3) | Acc⁻ (3) | Acc⁺ (4) | Acc⁻ (4) |
|---|---|---|---|---|---|---|---|---|
| *Proprietary VLMs (closed-source)* | | | | | | | | |
| GPT-4o | 34.2 | 98.2 | 21.6 | 98.6 | 20.4 | 99.3 | 15.3 | 99.2 |
| GPT-5 | 48.3 | 97.3 | 28.2 | 97.9 | 26.1 | 98.7 | 18.9 | 98.7 |
| Gemini-2.0-Flash | 41.9 | 96.7 | 28.6 | 97.3 | 26.6 | 98.3 | 23.1 | 98.3 |
| Gemini-3.0-Pro | 58.1 | 96.6 | 45.1 | 97.2 | 39.0 | 98.3 | 32.4 | 97.9 |
| *Open-source VLMs* | | | | | | | | |
| Qwen3-VL-8B | 39.0 | 97.5 | 25.6 | 97.7 | 23.3 | 98.1 | 18.6 | 98.1 |
| Qwen3-VL-30B-A3B | 40.2 | 96.2 | 27.5 | 97.7 | 25.3 | 97.7 | 20.1 | 98.1 |
| *Post-training-based personalized VLMs* | | | | | | | | |
| Qwen3-VL-8B + RAP | 20.5 | 99.0 | 10.4 | 99.1 | 9.9 | 99.5 | 7.3 | 99.2 |
| Qwen3-VL-8B + RePIC | 44.0 | 97.1 | 31.7 | 97.0 | 29.2 | 97.8 | 24.0 | 97.2 |
| Qwen3-VL-8B + CoViP (Ours) | 77.4 | 94.8 | 68.4 | 94.1 | 65.2 | 94.8 | 59.7 | 92.8 |
| △ (vs. Qwen3-VL-8B) | +38.4 | | +42.8 | | +41.9 | | +41.1 | |

Table 3 — Recall score performance on the downstream diagnostic personalization tasks

| Models | LSD (Direct) | LSD (w/ CAG) | LAR (Direct) | LAR (w/ CAG) | ITR (Direct) | ITR (w/ CAG) |
|---|---|---|---|---|---|---|
| *Proprietary VLMs (closed-source)* | | | | | | |
| GPT-4o | 28.7 | 33.6 | 4.80 | 7.40 | 8.40 | 13.5 |
| GPT-5 | 28.5 | 34.4 | 50.8 | 59.3 | 18.6 | 10.5 |
| Gemini-2.0-Flash | 52.7 | 46.0 | 11.6 | 42.3 | 66.1 | 12.2 |
| Gemini-3.0-Pro | 76.2 | 89.3 | 9.40 | 44.0 | 89.4 | 19.0 |
| *Open-source VLMs* | | | | | | |
| Qwen3-VL-8B | 29.8 | 48.8 | 17.4 | 19.6 | 9.40 | 6.80 |
| Qwen3-VL-30B-A3B | 25.6 | 42.1 | 7.60 | 16.8 | 8.80 | 0.40 |
| *Post-training-based personalized VLMs* | | | | | | |
| Qwen3-VL-8B + RAP | 27.0 | 28.8 | 1.40 | 0.80 | 0.00 | 0.20 |
| Qwen3-VL-8B + RePIC | 32.7 | 52.1 | 16.2 | 17.8 | 27.2 | 27.8 |
| Qwen3-VL-8B + CoViP (Ours) | 37.2 | 58.2 | 34.8 | 49.2 | 28.0 | 42.8 |
| △ (vs. Qwen3-VL-8B) | +7.4 | +9.4 | +17.4 | +29.6 | +18.6 | +36.0 |
Key Finding 1. Existing VLMs lack the ability to generate context-grounded captions.

Key Finding 2. CoViP substantially improves the VLM's contextual grounding capability through RL-based post-training.

Key Finding 3. Personalized image captioning provides a reliable bridge to downstream personalization, enabling CoViP to leverage CAG effectively.

📈

Large Gains on the Personalized Captioning Benchmark

CoViP (Qwen3-VL-8B) achieves 77.4% Acc⁺ with one concept and 59.7% with four, an average gain of +38 to +42 points over the base Qwen3-VL-8B. It also substantially outperforms all proprietary VLMs on Acc⁺, including Gemini-3.0-Pro (58.1%), showing that targeted RL post-training is more effective than scale for context-grounded personalized captioning.

Proprietary VLMs Show Unstable CAG Behavior

Despite strong performance on some diagnostic tasks, proprietary VLMs exhibit inconsistent and task-dependent gains from CAG. Their relatively low captioning Acc⁺ limits the effectiveness of CAG, underscoring the need for an explicit post-training stage focused on personalized image captioning prior to downstream inference.

🔗

Consistent Downstream Improvements over Post-Training Baselines

Among post-training methods, CoViP consistently outperforms RAP and RePIC across all three diagnostic tasks. With CAG, gains over RePIC reach +6.1 pts on LSD, +31.4 pts on LAR, and +15.0 pts on ITR—demonstrating that strong personalized captioning directly enables reliable downstream personalization.

Generalization to Real-World Personalization Scenarios

To assess whether CoViP learns genuine visual personalization—not just benchmark overfitting—we introduce three diagnostic downstream tasks that explicitly test memory recall from multimodal context.

LSD — Last Seen Detection LAR — Last Action Recall ITR — Instruction Triggered Recall
Three downstream personalization tasks: LSD, LAR, ITR

Figure 4. Visualization of diagnostic personalization tasks. Each task explicitly precludes shortcut behaviors, requiring the model to ground visual input in user-specific contextual history.

🔍

Last Seen Detection (LSD)

The model must recognize the individual in a query image and identify the most recent encounter with that person from the user's contextual history—requiring temporal reasoning across multiple dialogue entries.

🎬

Last Action Recall (LAR)

The model must identify the most recent encounter and retrieve the fine-grained action described in that interaction—going beyond location recall to episodic memory retrieval.

🔑

Instruction Triggered Recall (ITR)

The context contains a planted instruction (e.g., "recall 'SKS' when you see this person again"). The model must proactively surface this keyword upon visual recognition—without an explicit request in the current turn.

CoViP consistently outperforms post-training baselines (RAP, RePIC) across all three tasks. CAG further amplifies performance by capitalizing on the fine-grained details of the generated personalized caption. Notably, proprietary VLMs show unstable CAG gains due to weaker captioning quality, reinforcing that personalized image captioning is a necessary prerequisite for reliable downstream personalization.

CoViP Enhances Retrieval, Not Just Recognition

A key question is what CoViP actually improves. Figure 5 analyzes the relationship between recognition (measured by F1 score of entity name inclusion in generated captions) and retrieval (measured by Acc⁺). Baseline models already achieve reasonable recognition capability (Avg F1 ≈ 0.810), yet their retrieval accuracy remains low—indicating that recognition alone is insufficient and that retrieval is the primary bottleneck.

CoViP addresses this gap directly. While the average F1 improves only modestly (0.810 → 0.897), retrieval accuracy improves by a substantially larger margin at equivalent F1 levels, as shown by the steeper regression slope (m = 0.65, vs. 0.35 for Base and 0.39 for RePIC). This indicates that CoViP's gains come primarily from more effective integration of implicit personal cues through contextualized reasoning, rather than from improvements in recognition itself.
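The slope m reported in Figure 5 is a linear regression fit; a minimal ordinary-least-squares recomputation (on illustrative data, not the paper's measurements) looks like:

```python
def regression_slope(f1_scores, retrieval_accs):
    """OLS slope m of retrieval accuracy regressed on recognition F1,
    as used for the regression lines in Figure 5 (sketch)."""
    n = len(f1_scores)
    mean_x = sum(f1_scores) / n
    mean_y = sum(retrieval_accs) / n
    cov = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(f1_scores, retrieval_accs))
    var = sum((x - mean_x) ** 2 for x in f1_scores)
    return cov / var  # slope m = cov(x, y) / var(x)
```

A steeper m means each unit of recognition capability is converted into more retrieval accuracy, which is the sense in which CoViP "enhances retrieval, not just recognition."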

Figure 5: Scatter plot of recognition (F1) vs retrieval (Accuracy) for Base, RePIC, and CoViP

Figure 5. Scatter plot of recognition vs. retrieval on the proposed benchmark. Recognition is measured by the F1 score of entity name inclusion between generated captions and ground-truth dialogues; retrieval is measured by positive MCQA accuracy (Acc⁺). m denotes the slope of the linear regression line. CoViP's steeper slope (m=0.65) shows it converts recognition capability into retrieval far more effectively than Base (m=0.35) or RePIC (m=0.39).

BibTeX

If you find CoViP useful in your research, please cite our paper:

@article{oh2026contextualized,
  title     = {Contextualized Visual Personalization in Vision-Language Models},
  author    = {Oh, Yeongtak and Yu, Sangwon and Park, Junsung and
               Moon, Han Cheol and Mok, Jisoo and Yoon, Sungroh},
  journal   = {arXiv preprint arXiv:2602.03454},
  year      = {2026}
}