🇵🇱 LLaVA-PLLuM: Building an Open Polish Vision-Language Model

Bridging the gap in multilingual AI with culturally-aware image understanding

Grzegorz Statkiewicz, Alicja Dobrzeniecka, Aleksandra Krasnodębska, Sebastian Cygert, Wojciech Kusa
NASK National Research Institute
firstname.lastname@nask.pl

Introduction

Recent advances in multimodal large language models (MLLMs) have shown impressive capabilities in combining text and visual understanding. However, most state-of-the-art solutions are trained primarily on English data, which limits their applicability in other languages and cultural contexts. Our goal is to bridge this gap by creating a Polish multimodal model that not only understands text and images but also reflects Polish linguistic and cultural nuances.

In this blog post, we describe the methodology used to deliver a proof-of-concept for a Polish Large Language Model capable of handling both text and visual data. Our approach builds on the LLaVA-NeXT framework [3], which aligns a pretrained visual encoder with a large language model (LLM) via a lightweight MLP (Multi-Layer Perceptron) projector. We use the following components:

  • Language Model: PLLuM-12B (Polish Large Language Model) [1] - a Polish-native, instruction-tuned LLM.
  • Vision Encoder: SigLIP2 So400m/14, 384px [4] - chosen for strong multilingual image-text alignment and improved localization.

We trained our models using automatic translation combined with manual filtering, resulting in approximately 550 thousand samples for pretraining and 2 million samples for instruction fine-tuning. The models accurately describe images, incorporate Polish cultural context, and handle basic visual tasks such as OCR and object counting.

Evaluation on open-source benchmarks and qualitative analysis show notable improvements in Polish language understanding, as well as in recognizing some Polish cultural elements, while maintaining general image understanding and reasoning capabilities comparable to existing open-source models.

This proof-of-concept marks an initial step toward robust multimodal models for Polish. To accelerate progress and foster collaboration, we are releasing our model weights on Hugging Face.


Methodology

Model Architecture

We build on the LLaVA-NeXT architecture [3] which aligns a pretrained visual encoder with a large language model (LLM) via a lightweight two-layer MLP projector. This design preserves the LLM’s strong language prior while enabling efficient multimodal grounding. Compared to the original LLaVA, LLaVA-NeXT supports higher input resolutions and dynamic tiling, features that have been observed to improve fine-grained perception and OCR performance.

As the language backbone, we use PLLuM-12B-nc-instruct-250715 [1], a Polish-native, instruction-tuned LLM. For the vision tower, we replace the CLIP-like encoder commonly used in LLaVA variants with SigLIP2 So400m/14, 384px [4], selected for its strong multilingual image-text alignment.
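
To make the wiring concrete, here is a minimal PyTorch sketch of the two-layer MLP projector that maps vision-encoder patch features into the LLM embedding space. The hidden sizes (1152 for the SigLIP2 So400m features, 5120 for the 12B decoder) and the patch count are illustrative assumptions for this post, not values read from our training configuration.

```python
import torch
import torch.nn as nn

class MLPProjector(nn.Module):
    """Two-layer MLP that maps vision-encoder patch features into the LLM
    embedding space; this is the only module trained in Stage 1."""
    def __init__(self, vision_dim: int, llm_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # patch_features: (batch, num_patches, vision_dim)
        return self.net(patch_features)

# Illustrative dimensions only (assumed, not taken from our configs):
# ~1152-d SigLIP2 So400m features projected into a ~5120-d decoder space.
projector = MLPProjector(vision_dim=1152, llm_dim=5120)
image_tokens = projector(torch.randn(1, 729, 1152))  # 729 patch tokens at 384px
print(image_tokens.shape)  # torch.Size([1, 729, 5120])
```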

Training Procedure

We train the model in two stages following the LLaVA-NeXT procedure:

  • Stage 1 (Pre-training): Freeze the LLM backbone and the vision encoder and optimize only the MLP projector on the pretraining dataset to align the connector.
  • Stage 2 (Instruction Tuning): Jointly train the vision tower and the projector, and adapt the LLM with LoRA on the instruction dataset (a code sketch of both stages follows the hyperparameter table below).

| Parameter | Stage 1 | Stage 2 |
| --- | --- | --- |
| Training Samples | 558K | 2M |
| Vision Encoder (Trainable) | N/A | 400M |
| Projector (Trainable) | 30M | 30M |
| Language Model (Trainable) | N/A | 12B |
| Context Size (Tokens) | 8,192 | 8,192 |
| Batch Size | 256 | 128 |
| Learning Rate (Vision Encoder) | N/A | 2×10⁻⁶ |
| Learning Rate (Projector) | 1×10⁻³ | 2×10⁻⁵ |
| Learning Rate (Language Model) | N/A | 2×10⁻⁵ |
| LoRA Rank (Language Model) | N/A | 128 |
| LoRA Alpha (Language Model) | N/A | 256 |
| LoRA Dropout (Language Model) | N/A | 0.05 |
| Epochs | 1 | 1 |
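
For readers who want to map the table onto code, the following is a minimal sketch of the two stages using PyTorch and PEFT. The submodule names (vision_tower, projector, language_model) and the LoRA target modules are assumptions for illustration, not our actual training script.

```python
import torch.nn as nn
from peft import LoraConfig, get_peft_model

def set_trainable(module: nn.Module, trainable: bool) -> None:
    for param in module.parameters():
        param.requires_grad = trainable

def configure_stage1(model) -> None:
    """Stage 1: train only the MLP projector; the LLM and vision encoder stay frozen."""
    set_trainable(model.vision_tower, False)
    set_trainable(model.language_model, False)
    set_trainable(model.projector, True)

def configure_stage2(model) -> None:
    """Stage 2: unfreeze the vision tower and projector, adapt the LLM with LoRA
    using the rank/alpha/dropout values from the table above."""
    set_trainable(model.vision_tower, True)
    set_trainable(model.projector, True)
    lora_config = LoraConfig(
        r=128,
        lora_alpha=256,
        lora_dropout=0.05,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed targets
        task_type="CAUSAL_LM",
    )
    model.language_model = get_peft_model(model.language_model, lora_config)
```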

Training Data

As the pretraining dataset, we use the LLaVA-LCS-558K [16] following the LLaVA paper [19]. This dataset is a subset of the LAION/CC/SBU collection, filtered for balanced concept coverage. It consists of 558k image-caption pairs augmented with BLIP synthetic captions, which we translate to Polish to align the visual features with our language model.
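
As an illustration of what a single translated pretraining example looks like, the record below follows the LLaVA-style conversation schema; the id, file name, prompt, and caption are invented for this post.

```python
# One hypothetical pretraining record in the LLaVA-style conversation format,
# with the BLIP caption replaced by its Polish translation.
record = {
    "id": "000000001",
    "image": "00000/000000001.jpg",
    "conversations": [
        # "Describe this image briefly."
        {"from": "human", "value": "<image>\nOpisz krótko ten obraz."},
        # "A woman walking a dog along the beach at sunset."
        {"from": "gpt", "value": "Kobieta spacerująca z psem po plaży o zachodzie słońca."},
    ],
}
```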

Our instruction dataset spans four skill categories:

  • General: We translate the English datasets ALLaVA [5], LLaVA-Instruct [6], Q-Instruct [7], LVIS-Instruct4V [8], and A-OKVQA [9].
  • OCR: Synthetic document-style images. We generate a Polish version (SynthDoG-PL) and use the English version (SynthDoG-EN) following the SynthDoG procedure [10].
  • Knowledge: Based on the WIT dataset [12]. We select samples with human-written Polish and English captions.
  • Counting: We translate TallyQA [13].

For translation, we use the Tower+ 72B model [14], filtering the outputs with the reference-free COMET metric [15]. The resulting instruction data is mostly in Polish (85%), with a smaller share kept in English (15%).
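
A minimal sketch of such a reference-free quality filter is shown below, using the unbabel-comet package; the CometKiwi checkpoint name and the 0.75 threshold are illustrative assumptions, not the exact settings of our pipeline.

```python
from comet import download_model, load_from_checkpoint

# Reference-free (quality-estimation) COMET model; requires only source and translation.
checkpoint = download_model("Unbabel/wmt22-cometkiwi-da")
qe_model = load_from_checkpoint(checkpoint)

pairs = [
    {"src": "A cat sitting on a wooden table.",   # English source
     "mt": "Kot siedzący na drewnianym stole."},  # candidate Polish translation
]
scores = qe_model.predict(pairs, batch_size=8, gpus=1).scores
kept = [pair for pair, score in zip(pairs, scores) if score >= 0.75]  # drop low-quality translations
```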

Figure: Fine-tuning data distribution.


Evaluation & Results

We conduct a two-fold evaluation to assess the performance of our Polish vision-language model: (1) quantitative benchmarking using MMBench v1.1, and (2) a model-as-a-judge study on image captioning quality in Polish.

MMBench v1.1

Due to the absence of established multimodal evaluation benchmarks for Polish, we adapt existing English benchmarks for quantitative assessment. As the primary benchmark, we selected MMBench v1.1 [17], which evaluates multiple dimensions of visual understanding, including object recognition, OCR, commonsense reasoning, and fine-grained perception. Because the official MMBench test split has not been released, we evaluate on the development set.

To enable Polish evaluation, we translated all MMBench v1.1 questions into Polish using Tower+ 72B [14], followed by manual expert correction to ensure linguistic accuracy and eliminate translation artifacts. The resulting MMBench-PL dataset is therefore human-validated and suitable for assessing Polish multimodal reasoning.

Using the development split makes comparisons strictly fair only against the LLaVA family of models, whose training data and fine-tuning procedures are publicly documented. For other open-source VLMs (e.g., Pixtral, Qwen2.5-VL, PaliGemma), the extent of exposure to MMBench during fine-tuning is unknown; only PaliGemma partially discloses its pre-training data, and not its fine-tuning data. Direct leaderboard-style comparisons with these models should therefore be interpreted with caution.
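
For reference, one way to score the multiple-choice dev split is sketched below; the generate helper, the record fields, and the Polish answer instruction are illustrative, not our exact evaluation harness.

```python
import re

def build_prompt(record: dict) -> str:
    # record["options"] is assumed to be a dict like {"A": "...", "B": "...", ...}.
    options = "\n".join(f"{letter}. {text}" for letter, text in record["options"].items())
    # The instruction means: "Answer with the letter of the correct option."
    return f"{record['question']}\n{options}\nOdpowiedz literą poprawnej opcji."

def extract_choice(output: str) -> str:
    match = re.search(r"\b([ABCD])\b", output.upper())
    return match.group(1) if match else ""

def accuracy(records, generate) -> float:
    # generate(image, prompt) -> model output string; supplied by the eval harness.
    correct = sum(
        extract_choice(generate(r["image"], build_prompt(r))) == r["answer"]
        for r in records
    )
    return correct / len(records)
```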

| Model | MMBench (Polish) | MMBench (English) |
| --- | --- | --- |
| LLaVA-1.6-Mistral-7B | 66.41% | 72.37% |
| LLaVA-1.6-Vicuna-13B | 68.29% | 74.14% |
| LLaVA-PLLuM-12b-nc (Ours) | 73.89% (+5.6 pp) | 73.89% |
| Additional Open-Source Models (different architectures): | | |
| PaliGemma2-10B | 77.63% | 79.59% |
| Pixtral-12B | 79.04% | 81.52% |
| Qwen2.5-VL-7B | 74.38% | 79.02% |

Key Finding: Our model achieves a 5.6 percentage point improvement over LLaVA-1.6-Vicuna-13B on the Polish benchmark while maintaining comparable English performance, demonstrating significantly improved recognition of Polish context.

Model-as-a-Judge Evaluation

To evaluate abilities that go beyond multiple-choice recognition and involve open-ended text generation, we conducted a second study based on image captioning. For this purpose, we used the Polish portion of the XM3600 dataset [18]. The task in XM3600 requires models to produce accurate, relevant, and grammatically correct descriptions of images, making it a suitable testbed for generative multimodal performance.

We benchmarked our model against three competitive open-source vision-language models of different architectures: Qwen2.5-VL-7B-Instruct, Pixtral-12B, and PaliGemma-3B, complementing the MMBench evaluation.

Because no human-annotated standard for Polish caption quality currently exists, we adopted an LLM-as-a-judge evaluation strategy using LLaVA-OneVision-72B, the strongest open-source VLM at the time of evaluation that can jointly process the image and the candidate captions. We used a pairwise comparison setup in which the judge is presented with an image and two captions and determines which description is better. Since prompt wording and input order can influence the outcome, we employed two prompt formulations (one presenting caption A before B and one reversing the order) and tested each with both model assignments (our model as A and as B). The four judgments obtained for each comparison were then averaged to obtain a stable final score.
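
A minimal sketch of this judging loop is given below, assuming a judge(image, caption_a, caption_b, prompt_variant) helper that returns "A" or "B"; the helper and the prompt-variant identifiers are illustrative.

```python
def winrate(examples, judge) -> float:
    """examples: iterable of (image, our_caption, baseline_caption) triples.
    judge(image, caption_a, caption_b, prompt_variant) -> "A" or "B"."""
    wins, total = 0.0, 0
    for image, ours, baseline in examples:
        for prompt_variant in ("A_before_B", "B_before_A"):  # two prompt formulations
            for ours_is_a in (True, False):                  # both model assignments
                caption_a, caption_b = (ours, baseline) if ours_is_a else (baseline, ours)
                verdict = judge(image, caption_a, caption_b, prompt_variant)
                wins += float((verdict == "A") == ours_is_a)
                total += 1
    return wins / total  # average over the four judgments per comparison
```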

Together, these steps provide a controlled and replicable protocol for assessing Polish-language caption quality in the absence of human-annotated ground truth, while capturing the generative multimodal capabilities of the evaluated models.

| Comparison | Vision-Language Model Judge Winrate |
| --- | --- |
| LLaVA-PLLuM-12b-nc vs PaliGemma-3B | 95.2% vs 4.8% |
| LLaVA-PLLuM-12b-nc vs Qwen2.5-VL-7B | 62.7% vs 37.3% |
| LLaVA-PLLuM-12b-nc vs Pixtral-12B | 59.3% vs 40.7% |

Key Finding: Across all comparisons, LLaVA-PLLuM is consistently preferred by the judge, indicating higher caption quality in Polish. Our qualitative analysis showed that LLaVA-PLLuM produces more grammatically correct sentences, maintains proper Polish morphology, and avoids inventing non-existent Polish words—a common failure mode observed in baseline models.


Qualitative Results

To examine the models' ability to understand Polish cultural context, we collected and annotated a small dataset of images.


Summary & Next Steps

We have presented our pipeline for creating a Polish vision-language model. Crucially, this system was developed with minimal data curation, relying primarily on synthetic and machine-translated datasets, without human correction or manual annotation. Starting from the open-source LLaVA model family and equipping it with the PLLuM language model, we improved the VLM's ability to understand the Polish language as well as aspects of Polish cultural context. We show gains of 5.6 percentage points over LLaVA-based baselines on a manually corrected Polish-language version of the MMBench dataset, underscoring the effectiveness of our data-efficient approach.

This is only the first step toward creating a more capable family of Polish vision-language models. We expect that further scaling of data and leveraging more recent vision-language architectures will lead to additional improvements. We also intend to enhance the evaluation protocols by incorporating human assessments and expanding the benchmark datasets to better capture Polish-specific challenges.


References

  1. PLLuM: A Family of Polish Large Language Models - arXiv:2511.03823
  2. PLLuM Model - Hugging Face
  3. LLaVA-NeXT - Blog Post
  4. SigLIP2 - arXiv:2502.14786
  5. ALLaVA - arXiv:2402.11684
  6. Visual Instruction Tuning (LLaVA) - arXiv:2304.08485
  7. Q-Instruct - arXiv:2311.06783
  8. LVIS-Instruct4V - arXiv:2311.07574
  9. A-OKVQA - arXiv:2206.01718
  10. SynthDoG - arXiv:2111.15664
  11. MS COCO - arXiv:1405.0312
  12. WIT Dataset - ACM Digital Library
  13. TallyQA - arXiv:1810.12440
  14. Tower+ Translation Model - Hugging Face
  15. COMET Metric - Documentation
  16. LLaVA-Pretrain Dataset - Hugging Face
  17. MMBench - OpenCompass Leaderboard
  18. Crossmodal-3600: A Massively Multilingual Multimodal Evaluation Dataset - EMNLP 2022
  19. Improved Baselines with Visual Instruction Tuning (LLaVA-1.5) - arXiv:2310.03744

BibTeX


@misc{statkiewicz2025llavapllum,
  title={LLaVA-PLLuM: Building an Open Polish Vision-Language Model},
  author={Statkiewicz, Grzegorz and Dobrzeniecka, Alicja and 
          Krasnodębska, Aleksandra and Cygert, Sebastian and Kusa, Wojciech},
  year={2025},
  note={Blog post}
}