Motivation

Vision-language models describe images in a generic, averaged-out voice. Human visual description is not generic — an introvert notices different things than an extrovert; a sensing type anchors to concrete details while an intuitive type reaches for patterns. Can RLHF-style fine-tuning produce VLMs that adapt description style to personality dimensions?

Approach

Starting from LLaVA-1.6, I construct a preference dataset by:

  1. Generating N candidate image descriptions per input
  2. Scoring them against personality-keyed rubrics (MBTI dimensions: IE, SN, TF, JP)
  3. Forming preference pairs (chosen, rejected) per dimension
  4. Fine-tuning with DPO (Direct Preference Optimization) via TRL
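Steps 2 and 3 above can be sketched in a few lines. The keyword rubric, example sentences, and function names below are hypothetical stand-ins, not the actual rubrics used in the project:

```python
# Sketch of steps 2-3: score candidates against a personality-keyed rubric,
# then take the best and worst scorers as a (chosen, rejected) pair.
# The keyword rubric here is a toy stand-in for the real scoring rubrics.

RUBRICS = {
    "SN": {
        "S": ["red", "three", "left", "texture"],        # concrete, sensory anchors
        "N": ["suggests", "pattern", "overall", "mood"], # abstract, big-picture cues
    },
}

def score(description, keywords):
    """Count rubric keywords that appear in a candidate description."""
    text = description.lower()
    return sum(kw in text for kw in keywords)

def preference_pair(candidates, dim, pole):
    """Best-scoring candidate becomes `chosen`, worst becomes `rejected`."""
    ranked = sorted(candidates, key=lambda c: score(c, RUBRICS[dim][pole]))
    return ranked[-1], ranked[0]

candidates = [
    "Three red kayaks rest on the left bank, their texture worn smooth.",
    "The scene suggests a pattern of departure, an overall mood of stillness.",
]
chosen, rejected = preference_pair(candidates, "SN", "S")
# For the S (sensing) pole, the concrete description is chosen.
```

The resulting (chosen, rejected) pairs feed directly into TRL's DPO trainer as the preference dataset.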

PEFT (LoRA) adapters are trained per personality dimension, enabling runtime mixing via linear adapter combination.
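The mixing step is linear in the LoRA deltas: each adapter contributes a low-rank update B @ A, and a blended persona uses the weighted sum ΔW = Σᵢ wᵢ (Bᵢ @ Aᵢ). A toy sketch with plain-Python matrices (in PEFT this roughly corresponds to `add_weighted_adapter(..., combination_type="linear")`, mentioned here as an assumption about the implementation, not a confirmed detail):

```python
# Toy sketch of runtime adapter mixing: blend per-dimension LoRA deltas
# as delta_W = sum_i w_i * (B_i @ A_i). Lists of lists stand in for tensors.

def matmul(B, A):
    """Multiply an (m x r) matrix by an (r x n) matrix (lists of lists)."""
    return [[sum(B[i][k] * A[k][j] for k in range(len(A)))
             for j in range(len(A[0]))] for i in range(len(B))]

def mix_adapters(adapters, weights):
    """Linearly combine LoRA deltas: sum_i w_i * (B_i @ A_i)."""
    deltas = [matmul(B, A) for (B, A) in adapters]
    m, n = len(deltas[0]), len(deltas[0][0])
    return [[sum(w * d[i][j] for w, d in zip(weights, deltas))
             for j in range(n)] for i in range(m)]

# Two rank-1 adapters (e.g. IE and SN) on a 2x2 weight, mixed 50/50.
ie = ([[1.0], [0.0]], [[2.0, 0.0]])   # (B_ie, A_ie)
sn = ([[0.0], [1.0]], [[0.0, 4.0]])   # (B_sn, A_sn)
delta = mix_adapters([ie, sn], [0.5, 0.5])
# delta == [[1.0, 0.0], [0.0, 2.0]]
```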

Evaluation

Automatic: BERTScore between generated descriptions and personality-keyed reference descriptions.
Human: pairwise preference study on AMT with MBTI-tested annotators.
Downstream: VQA accuracy on COCO-QA for personality-aligned vs. generic descriptions.
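For the human track, the pairwise study reduces to a per-dimension win rate: the fraction of comparisons in which annotators prefer the personality-aligned description over the generic one. A minimal aggregation sketch; the record schema and field names are assumptions, not the actual study format:

```python
# Sketch of aggregating the pairwise human study into per-dimension win
# rates. Each judgment records a dimension and which description the
# annotator preferred ("aligned" vs. "generic"); field names are assumed.
from collections import defaultdict

def win_rates(judgments):
    """Fraction of comparisons won by the personality-aligned description."""
    wins, total = defaultdict(int), defaultdict(int)
    for j in judgments:
        total[j["dim"]] += 1
        if j["preferred"] == "aligned":
            wins[j["dim"]] += 1
    return {d: wins[d] / total[d] for d in total}

judgments = [
    {"dim": "IE", "preferred": "aligned"},
    {"dim": "IE", "preferred": "generic"},
    {"dim": "SN", "preferred": "aligned"},
    {"dim": "SN", "preferred": "aligned"},
]
rates = win_rates(judgments)
# rates == {"IE": 0.5, "SN": 1.0}
```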

Status

Dataset construction complete. DPO training done for IE and SN dimensions. TF and JP in progress. Human evaluation planned for Q2 2026.
