Personality-Aligned Vision-Language Model
Motivation
Vision-language models describe images in a generic, averaged-out voice. Human visual description is not generic — an introvert notices different things than an extrovert; a sensing type anchors to concrete details while an intuitive type reaches for patterns. Can preference-based fine-tuning produce VLMs that adapt their description style to personality dimensions?
Approach
Starting from LLaVA-1.6, I construct a preference dataset by:
- Generating N candidate image descriptions per input
- Scoring them against personality-keyed rubrics covering the four MBTI dimensions: Introversion/Extraversion (IE), Sensing/Intuition (SN), Thinking/Feeling (TF), Judging/Perceiving (JP)
- Forming preference pairs (chosen, rejected) per dimension
- Fine-tuning with DPO (Direct Preference Optimization) via TRL
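The pairing step above can be sketched as follows. This is an illustrative implementation, not the project's actual code: the candidate format, the `margin` threshold, and the best-vs-worst pairing strategy are all assumptions.

```python
def make_preference_pairs(candidates, dimension, margin=0.5):
    """Form (chosen, rejected) pairs on one MBTI dimension.

    candidates: list of dicts like {"text": str, "scores": {"IE": float, ...}}
    Pairs the top-scoring candidate with the bottom-scoring one, the
    second-best with the second-worst, and so on, keeping only pairs
    whose rubric-score gap exceeds `margin` (hypothetical threshold).
    """
    ranked = sorted(candidates, key=lambda c: c["scores"][dimension], reverse=True)
    pairs = []
    n = len(ranked)
    for i in range(n // 2):
        hi, lo = ranked[i], ranked[n - 1 - i]
        gap = hi["scores"][dimension] - lo["scores"][dimension]
        if gap <= margin:
            break  # remaining pairs are too close to be informative
        pairs.append((hi["text"], lo["text"]))
    return pairs
```

The resulting pairs map directly onto the `chosen`/`rejected` columns that TRL's `DPOTrainer` expects in its preference dataset.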
One PEFT (LoRA) adapter is trained per personality dimension, enabling runtime style mixing via linear combination of adapter weights.
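The linear combination amounts to merging the low-rank deltas into the frozen base weight: W' = W0 + sum_d alpha_d (B_d A_d). A minimal NumPy sketch of that arithmetic (a stand-in for what PEFT's weighted-adapter merging does on real model weights; all names are illustrative):

```python
import numpy as np

def mix_lora_adapters(base_weight, adapters, alphas):
    """Linearly combine per-dimension LoRA adapters at runtime.

    base_weight: frozen weight W0 of shape (out, in)
    adapters: dict name -> (B, A), with B: (out, r) and A: (r, in)
    alphas: dict name -> mixing coefficient (e.g. {"IE": 0.7, "SN": 0.3})
    Returns W0 + sum_d alpha_d * (B_d @ A_d).
    """
    delta = np.zeros_like(base_weight)
    for name, (B, A) in adapters.items():
        delta += alphas.get(name, 0.0) * (B @ A)
    return base_weight + delta
```

Because the deltas are additive, an unlisted dimension simply contributes nothing, and coefficients can be adjusted per request without retraining.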
Evaluation
- Automatic: BERTScore between generated descriptions and personality-keyed reference descriptions.
- Human: pairwise preference study on AMT with MBTI-tested annotators.
- Downstream: VQA accuracy on COCO-QA, comparing personality-aligned vs. generic descriptions.
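For the human study, the pairwise judgments reduce to a per-dimension win rate. A hypothetical aggregation sketch (the judgment record format is an assumption, not the study's actual schema):

```python
from collections import Counter

def pairwise_win_rate(judgments):
    """Aggregate AMT pairwise judgments into per-dimension win rates.

    judgments: list of dicts like
        {"dimension": "IE", "winner": "aligned"}  # or "generic"
    Returns {dimension: fraction of comparisons won by the aligned model}.
    """
    wins, totals = Counter(), Counter()
    for j in judgments:
        totals[j["dimension"]] += 1
        if j["winner"] == "aligned":
            wins[j["dimension"]] += 1
    return {d: wins[d] / totals[d] for d in totals}
```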
Status
Dataset construction complete. DPO training done for IE and SN dimensions. TF and JP in progress. Human evaluation planned for Q2 2026.