arXiv 2026

CoViP: Contextualized Visual Personalization
in Vision-Language Models

Enabling VLMs to recognize visual inputs and retrieve associated personal memories from accumulated multimodal dialogue history.

Yeongtak Oh*  ·  Sangwon Yu*  ·  Junsung Park  ·  Han Cheol Moon  ·  Jisoo Mok  ·  Sungroh Yoon

Seoul National University   (*: Equal contribution)


VLMs Forget Who You Know

When users show a VLM a new image of someone they have talked about before—a friend, a pet, a familiar place—existing models respond as if it were the first time. They cannot connect the new visual input with the rich multimodal dialogue history that the user has already shared.

Motivation: VLM with and without contextualized visual personalization

Figure 1. A qualitative comparison. Without contextualized visual personalization (Qwen3-VL, 8B), the model treats Martin Freeman as a generic celebrity. CoViP (8B, Ours) correctly recalls Jeffrey's personal details—his nickname, dietary preferences, and past encounter—from the user's accumulated dialogue history.

Overview

Despite recent progress in vision-language models (VLMs), existing approaches often fail to generate personalized responses grounded in a user's specific experiences, because they cannot associate visual inputs with the user's accumulated visual-textual context. We formalize this challenge as contextualized visual personalization, which requires a VLM to visually recognize previously encountered concepts in a new image and to retrieve the personalized details associated with them from past dialogues.

To address this issue, we propose CoViP, a unified framework that treats personalized image captioning as a core task for contextualized visual personalization and improves this capability through reinforcement-learning-based post-training and caption-augmented generation (CAG). We further introduce diagnostic evaluations that explicitly rule out textual shortcut solutions and verify whether VLMs truly leverage visual context.

Extensive experiments demonstrate that existing open-source and proprietary VLMs exhibit substantial limitations, while CoViP not only improves personalized image captioning but also yields holistic gains across downstream personalization tasks.

What is Contextualized Visual Personalization?

Given a user's accumulated multimodal dialogue history (visual-textual context), a VLM must: (1) visually recognize a concept in a new query image that was previously encountered in the context, and (2) internalize its associated personal memories from past dialogues into a grounded, context-aware caption—without relying on textual shortcuts.

👁

Visual Recognition

Identify whether a concept in the query image matches any concept seen in the multimodal dialogue history—even under appearance or viewpoint changes.

🔗

Multimodal Retrieval

Internalize personalized details (nicknames, habits, dates, locations) from past image–dialogue pairs into the generated caption. This goes beyond simple text lookup: the model must bridge visual identity with textual memory, making retrieval inherently multimodal.

🚫

Shortcut-Free Evaluation

Negative concepts are interleaved in the context so models cannot succeed by text matching alone—correct answers require genuine visual grounding to distinguish positive from negative concepts.

Automated Pipeline for Personalized Contexts

Building a benchmark for contextualized visual personalization requires realistic multimodal contexts. We construct a fully automated three-stage pipeline:

Stage I: Query Image Generation

Single-concept real images are combined via a generative VLM to produce multi-concept synthesized images, followed by quality filtering to confirm all concepts are present.

Stage II: Text Dialogue Generation

For each concept image, a VLM generates naturalistic dialogue between a user and the model. Concept names are replaced by random pseudonyms to enforce visual grounding.

Stage III: Context Construction

Positive concept dialogues (relevant) and negative concept dialogues (distractors) are interleaved to form the full context window, with the multi-concept query image as the test input.

Three-stage data construction pipeline for the CoViP benchmark

Figure 2. The three-stage automated pipeline. Query Image Generation synthesizes multi-concept scenes; Text Dialogue Generation produces personalized dialogues per concept; Context Construction assembles interleaved positive/negative concept-dialogue pairs as the VLM's in-context input.
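The context-construction stage can be sketched in a few lines: positive and negative concept dialogues are shuffled into a single in-context window so that text matching alone cannot identify the relevant entries. The dialogue dictionary fields (`image`, `turns`) are illustrative, not the pipeline's actual schema:

```python
import random

def build_context(pos_dialogues, neg_dialogues, seed=0):
    """Interleave positive (relevant) and negative (distractor) concept
    dialogues into one in-context window (sketch of Stage III)."""
    rng = random.Random(seed)
    entries = [(d, True) for d in pos_dialogues] + [(d, False) for d in neg_dialogues]
    rng.shuffle(entries)  # interleave so ordering gives no textual shortcut
    context = []
    for dialogue, is_positive in entries:
        context.append({
            "image": dialogue["image"],    # concept image shown in a past turn
            "turns": dialogue["turns"],    # pseudonymized user-model dialogue
            "is_positive": is_positive,    # hidden label, used only for evaluation
        })
    return context
```

The hidden `is_positive` label never reaches the model; it only lets the evaluator score whether distractor details leaked into the output.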

CoViP: RL Training with LLM-as-a-Judge Rewards

CoViP improves personalized image captioning capability through two complementary components: an RL-based post-training scheme with a verifiable reward signal, and Caption-Augmented Generation (CAG) for inference-time grounding.

CoViP RL training pipeline: interleaving image-text in-context demonstrations with LLM-as-a-Judge reward

Figure 3. The CoViP training and evaluation framework. The model receives interleaved image-text in-context demonstrations. Generated captions are evaluated offline by an LLM-as-a-Judge via auto-generated MCQA questions about each concept's personal details. Accuracy on these MCQAs serves as the verifiable reward (VR) for on-policy RL training.

🎯

Verifiable Reward

MCQA questions are generated offline from each concept's dialogue. An LLM-as-a-Judge scores whether the generated caption correctly answers them, producing a clean, binary reward signal.
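A minimal sketch of this reward, assuming the judge's per-question answers have already been collected offline (the function and argument names are ours, not the paper's):

```python
def verifiable_reward(judge_answers, gold_answers):
    """Verifiable reward (sketch): the LLM-as-a-Judge answers the
    offline-generated MCQAs using only the generated caption; each
    question is scored as binary correct/incorrect and the fraction
    correct becomes the reward in [0, 1]."""
    assert len(judge_answers) == len(gold_answers) > 0
    correct = sum(j == g for j, g in zip(judge_answers, gold_answers))
    return correct / len(gold_answers)
```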

RL Algorithms

CoViP supports GRPO, DR-GRPO, and GSPO training. GSPO applies importance sampling at the sequence level, proving most effective for this long-context generation task.

📄

Caption-Augmented Generation

At inference time, CAG first generates a personalized caption from context, then appends it alongside the context to ground the final task response.
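The two-step CAG procedure can be sketched as follows, assuming a hypothetical `vlm.generate(images=..., text=...)` interface; the prompts are placeholders, not the paper's actual templates:

```python
def caption_augmented_generation(vlm, context, query_image, task_prompt):
    """Caption-Augmented Generation (CAG), sketched.
    Step 1: generate a personalized caption grounded in the context.
    Step 2: answer the downstream task with that caption prepended."""
    images = [entry["image"] for entry in context] + [query_image]
    caption = vlm.generate(
        images=images,
        text="Describe the query image using the dialogue history.",
    )
    answer = vlm.generate(
        images=images,
        text=f"Personalized caption: {caption}\n\n{task_prompt}",
    )
    return caption, answer
```

The design choice here is that the caption acts as an explicit, compact summary of the relevant memories, so the second generation step no longer has to re-locate them in the long context.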

Quantitative Results

We evaluate VLMs on the CoViP benchmark using an MCQA-based LLM-as-a-Judge protocol. Positive Accuracy measures how well a model recalls details of the target concept; Negative Accuracy evaluates whether the model avoids incorporating irrelevant contextual information.
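A sketch of how these two metrics aggregate the judge's per-question verdicts; the record schema is hypothetical:

```python
def capeval_accuracy(records):
    """Compute Acc+ (accuracy on positive-concept questions) and Acc-
    (accuracy on negative-concept questions, i.e. not importing
    irrelevant distractor details). Each record is a dict:
    {"is_positive": bool, "correct": bool}."""
    pos = [r["correct"] for r in records if r["is_positive"]]
    neg = [r["correct"] for r in records if not r["is_positive"]]
    acc_pos = sum(pos) / len(pos) if pos else 0.0
    acc_neg = sum(neg) / len(neg) if neg else 0.0
    return acc_pos, acc_neg
```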

Table 1 — Comparison between existing personalization baselines and CoViP

| Method | Post-Training | Multi-Concept | External VLM | Interactive Dialogues | Long Contexts | Generalize | Use Case | Evaluation |
|---|---|---|---|---|---|---|---|---|
| MyVLM | | | | | | | Cap / VQA | Name recall |
| Yo'LLaVA | | | | | | | Cap / VQA | Name recall |
| TAME | | | | ✓ (1-turn) | | | VQA | VQA accuracy |
| RAP | ✓ (SFT) | | | | | | Cap / VQA | Name recall |
| RePIC | ✓ (RL) | ✓ (Cap) | | | | | Cap | Name recall |
| CoViP (Ours) | ✓ (RL) | | | ✓ (3-turns) | | ✓ (Tasks ≥ 3) | Cap / VQA | CapEval-QAs |

Table 2 — CapEval-QAs performance on our personalized image captioning benchmark.
Acc⁺: positive-concept accuracy ↑; Acc⁻: negative-concept accuracy ↑. Column pairs correspond to 1-, 2-, 3-, and 4-concept query images. △ denotes the gain relative to the base VLM.

| Models | Acc⁺ (1) | Acc⁻ (1) | Acc⁺ (2) | Acc⁻ (2) | Acc⁺ (3) | Acc⁻ (3) | Acc⁺ (4) | Acc⁻ (4) |
|---|---|---|---|---|---|---|---|---|
| *Proprietary VLMs (closed-source)* | | | | | | | | |
| GPT-4o | 34.2 | 98.2 | 21.6 | 98.6 | 20.4 | 99.3 | 15.3 | 99.2 |
| GPT-5 | 48.3 | 97.3 | 28.2 | 97.9 | 26.1 | 98.7 | 18.9 | 98.7 |
| Gemini-2.0-Flash | 41.9 | 96.7 | 28.6 | 97.3 | 26.6 | 98.3 | 23.1 | 98.3 |
| Gemini-3.0-Pro | 58.1 | 96.6 | 45.1 | 97.2 | 39.0 | 98.3 | 32.4 | 97.9 |
| *Open-source VLMs* | | | | | | | | |
| Qwen3-VL-8B | 39.0 | 97.5 | 25.6 | 97.7 | 23.3 | 98.1 | 18.6 | 98.1 |
| Qwen3-VL-30B-A3B | 40.2 | 96.2 | 27.5 | 97.7 | 25.3 | 97.7 | 20.1 | 98.1 |
| *Post-training-based personalized VLMs* | | | | | | | | |
| Qwen3-VL-8B + RAP | 20.5 | 99.0 | 10.4 | 99.1 | 9.9 | 99.5 | 7.3 | 99.2 |
| Qwen3-VL-8B + RePIC | 44.0 | 97.1 | 31.7 | 97.0 | 29.2 | 97.8 | 24.0 | 97.2 |
| Qwen3-VL-8B + CoViP (Ours) | 77.4 | 94.8 | 68.4 | 94.1 | 65.2 | 94.8 | 59.7 | 92.8 |
| △ (vs. Qwen3-VL-8B) | +38.4 | | +42.8 | | +41.9 | | +41.1 | |

Table 3 — Recall score performance on the downstream diagnostic personalization tasks

| Models | LSD (Direct) | LSD (w/ CAG) | LAR (Direct) | LAR (w/ CAG) | ITR (Direct) | ITR (w/ CAG) |
|---|---|---|---|---|---|---|
| *Proprietary VLMs (closed-source)* | | | | | | |
| GPT-4o | 28.7 | 33.6 | 4.80 | 7.40 | 8.40 | 13.5 |
| GPT-5 | 28.5 | 34.4 | 50.8 | 59.3 | 18.6 | 10.5 |
| Gemini-2.0-Flash | 52.7 | 46.0 | 11.6 | 42.3 | 66.1 | 12.2 |
| Gemini-3.0-Pro | 76.2 | 89.3 | 9.40 | 44.0 | 89.4 | 19.0 |
| *Open-source VLMs* | | | | | | |
| Qwen3-VL-8B | 29.8 | 48.8 | 17.4 | 19.6 | 9.40 | 6.80 |
| Qwen3-VL-30B-A3B | 25.6 | 42.1 | 7.60 | 16.8 | 8.80 | 0.40 |
| *Post-training-based personalized VLMs* | | | | | | |
| Qwen3-VL-8B + RAP | 27.0 | 28.8 | 1.40 | 0.80 | 0.00 | 0.20 |
| Qwen3-VL-8B + RePIC | 32.7 | 52.1 | 16.2 | 17.8 | 27.2 | 27.8 |
| Qwen3-VL-8B + CoViP (Ours) | 37.2 | 58.2 | 34.8 | 49.2 | 28.0 | 42.8 |
| △ (vs. Qwen3-VL-8B) | +7.4 | +9.4 | +17.4 | +29.6 | +18.6 | +36.0 |
Key Finding 1. Existing VLMs lack the ability to generate context-grounded captions.

Key Finding 2. CoViP substantially improves the VLM's contextual grounding capability through RL-based post-training.

Key Finding 3. Personalized image captioning provides a reliable bridge to downstream personalization, enabling CoViP to leverage CAG effectively.

📈

Large Gains on the Personalized Captioning Benchmark

CoViP (Qwen3-VL-8B) achieves 77.4% Acc⁺ with one concept and 59.7% with four, an average gain of +38 to +42 points over the base Qwen3-VL-8B. It also substantially outperforms all proprietary VLMs on Acc⁺, including Gemini-3.0-Pro (58.1%), showing that targeted RL post-training is more effective than scale for context-grounded personalized captioning.

Proprietary VLMs Show Unstable CAG Behavior

Despite strong performance on some diagnostic tasks, proprietary VLMs exhibit inconsistent and task-dependent gains from CAG. Their relatively low captioning Acc⁺ limits the effectiveness of CAG, underscoring the need for an explicit post-training stage focused on personalized image captioning prior to downstream inference.

🔗

Consistent Downstream Improvements over Post-Training Baselines

Among post-training methods, CoViP consistently outperforms RAP and RePIC across all three diagnostic tasks. With CAG, gains over RePIC reach +6.1 pts on LSD, +31.4 pts on LAR, and +15.0 pts on ITR—demonstrating that strong personalized captioning directly enables reliable downstream personalization.

Generalization to Real-World Personalization Scenarios

To assess whether CoViP learns genuine visual personalization—not just benchmark overfitting—we introduce three diagnostic downstream tasks that explicitly test memory recall from multimodal context.

LSD — Last Seen Detection LAR — Last Action Recall ITR — Instruction Triggered Recall
Three downstream personalization tasks: LSD, LAR, ITR

Figure 4. Visualization of diagnostic personalization tasks. Each task explicitly precludes shortcut behaviors, requiring the model to ground visual input in user-specific contextual history.

🔍

Last Seen Detection (LSD)

The model must recognize the individual in a query image and identify the most recent encounter with that person from the user's contextual history—requiring temporal reasoning across multiple dialogue entries.

🎬

Last Action Recall (LAR)

The model must identify the most recent encounter and retrieve the fine-grained action described in that interaction—going beyond location recall to episodic memory retrieval.

🔑

Instruction Triggered Recall (ITR)

The context contains a planted instruction (e.g., "recall 'SKS' when you see this person again"). The model must proactively surface this keyword upon visual recognition—without an explicit request in the current turn.

CoViP consistently outperforms post-training baselines (RAP, RePIC) across all three tasks. CAG further amplifies performance by capitalizing on the fine-grained details of the generated personalized caption. Notably, proprietary VLMs show unstable CAG gains due to weaker captioning quality, reinforcing that personalized image captioning is a necessary prerequisite for reliable downstream personalization.

CoViP Enhances Retrieval, Not Just Recognition

A key question is what CoViP actually improves. Figure 5 analyzes the relationship between recognition (measured by F1 score of entity name inclusion in generated captions) and retrieval (measured by Acc⁺). Baseline models already achieve reasonable recognition capability (Avg F1 ≈ 0.810), yet their retrieval accuracy remains low—indicating that recognition alone is insufficient and that retrieval is the primary bottleneck.

CoViP addresses this gap directly. While the average F1 improves only modestly (0.810 → 0.897), retrieval accuracy improves by a substantially larger margin at equivalent F1 levels, as shown by the steeper regression slope (m = 0.65, vs. 0.35 for Base and 0.39 for RePIC). This indicates that CoViP's gains come primarily from more effective integration of implicit personal cues through contextualized reasoning, rather than from improvements in recognition itself.
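The slope m reported in Figure 5 is a linear regression fit; a minimal ordinary-least-squares recomputation (on illustrative data, not the paper's measurements) looks like:

```python
def regression_slope(f1_scores, retrieval_accs):
    """OLS slope m of retrieval accuracy regressed on recognition F1,
    as used for the regression lines in Figure 5 (sketch)."""
    n = len(f1_scores)
    mean_x = sum(f1_scores) / n
    mean_y = sum(retrieval_accs) / n
    cov = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(f1_scores, retrieval_accs))
    var = sum((x - mean_x) ** 2 for x in f1_scores)
    return cov / var  # slope m = cov(x, y) / var(x)
```

A steeper m means each unit of recognition capability is converted into more retrieval accuracy, which is the sense in which CoViP "enhances retrieval, not just recognition."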

Figure 5: Scatter plot of recognition (F1) vs retrieval (Accuracy) for Base, RePIC, and CoViP

Figure 5. Scatter plot of recognition vs. retrieval on the proposed benchmark. Recognition is measured by the F1 score of entity name inclusion between generated captions and ground-truth dialogues; retrieval is measured by positive MCQA accuracy (Acc⁺). m denotes the slope of the linear regression line. CoViP's steeper slope (m=0.65) shows it converts recognition capability into retrieval far more effectively than Base (m=0.35) or RePIC (m=0.39).

BibTeX

If you find CoViP useful in your research, please cite our paper:

@article{oh2026contextualized,
  title     = {Contextualized Visual Personalization in Vision-Language Models},
  author    = {Oh, Yeongtak and Yu, Sangwon and Park, Junsung and
               Moon, Han Cheol and Mok, Jisoo and Yoon, Sungroh},
  journal   = {arXiv preprint arXiv:2602.03454},
  year      = {2026}
}