We conduct a two-fold evaluation to assess the performance of our Polish vision-language model: (1)
quantitative benchmarking using MMBench v1.1, and (2) a model-as-a-judge study on image captioning quality
in Polish.
Due to the absence of established multimodal evaluation benchmarks in Polish, we adapt existing
English benchmarks for quantitative assessment.
As a primary benchmark, we selected MMBench v1.1 [17], which evaluates multiple
dimensions of visual understanding, including object recognition, OCR, commonsense reasoning, and
fine-grained perception.
Because the official MMBench test split has not been released, we evaluate on the development
set.
To enable Polish evaluation, we translated all MMBench v1.1 questions into Polish using Tower+ 72B [14], followed by manual expert correction to ensure linguistic accuracy and
eliminate translation artifacts. The resulting MMBench-PL dataset is therefore human-validated and
suitable for assessing Polish multimodal reasoning.
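The translate-then-validate construction of MMBench-PL can be sketched as follows. This is a minimal illustration, not the actual pipeline: `translate_fn` stands in for a call to Tower+ 72B, and the item field names are assumptions rather than the official MMBench schema.

```python
def build_mmbench_pl(items, translate_fn):
    """Produce draft Polish translations of MMBench items.

    `translate_fn` stands in for the MT model (Tower+ 72B in our setup);
    the field names here are illustrative assumptions. Option keys and the
    gold label are left untouched so scoring stays aligned with the
    English original.
    """
    translated = []
    for item in items:
        translated.append({
            "question": translate_fn(item["question"]),
            "options": {key: translate_fn(text)
                        for key, text in item["options"].items()},
            "answer": item["answer"],  # gold label is language-independent
        })
    return translated  # drafts then go to human experts for correction
```

Each draft produced this way was subsequently corrected by human experts before inclusion in MMBench-PL.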
Evaluating on the development split makes comparisons strictly fair only against the LLaVA family
of models, whose training data and fine-tuning procedures are publicly documented. For other
open-source VLMs (e.g., Pixtral, Qwen2.5-VL, PaliGemma), the extent of exposure to MMBench during
fine-tuning is unknown; only PaliGemma partially discloses its pre-training data, and not its
fine-tuning data. Direct leaderboard-style comparisons with these models should therefore be
interpreted with caution.
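The MMBench scores reported below are plain multiple-choice accuracies. A minimal scorer might look like the following; the assumption that predictions and gold labels are single option letters is ours, not a detail of the official harness.

```python
def mmbench_accuracy(predictions, gold_labels):
    """Exact-match accuracy (%) over multiple-choice answers, e.g. 'A'..'D'.

    Assumes answer letters have already been extracted from model output;
    the extraction step itself is outside this sketch.
    """
    if len(predictions) != len(gold_labels):
        raise ValueError("predictions and gold labels must align one-to-one")
    correct = sum(p.strip().upper() == g.strip().upper()
                  for p, g in zip(predictions, gold_labels))
    return 100.0 * correct / len(gold_labels)
```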
| Model | MMBench (Polish) | MMBench (English) |
| --- | --- | --- |
| LLaVA-1.6-Mistral-7B | 68.18% | 76.54% |
| LLaVA-1.6-Vicuna-13B | 69.80% | 74.39% |
| LLaVA-PLLuM-12b-nc-250715 (Ours) | 76.73% | 75.23% |
| LLaVA-Bielik-11b-v2.6 (Ours) | 78.24% | 77.75% |
| LLaVA-PLLuM-12b-nc (Ours) | 79.35% (+9.55 pp) | 78.43% |
| *Additional open-source models (different architectures)* | | |
| Qwen2.5-VL-7B | 75.56% | 80.62% |
| PaliGemma2-10B | 78.39% | 80.46% |
| Pixtral-12B | 82.06% | 84.31% |
Key Finding: Our best model, LLaVA-PLLuM-12b-nc, achieves 79.35% on the Polish MMBench
v1.1 benchmark, a 9.55-percentage-point improvement over LLaVA-1.6-Vicuna-13B (69.80%),
while maintaining strong English performance at 78.43%. This demonstrates improved
recognition of Polish context and linguistic understanding. Compared with other open-source
models, LLaVA-PLLuM shows notably better Polish language understanding, outperforming
Qwen2.5-VL-7B (75.56%) and PaliGemma2-10B (78.39%) on the Polish benchmark.
To evaluate abilities that go beyond multiple-choice recognition and involve open-ended text
generation, we conducted a second study based on image captioning. For this purpose, we used the
Polish portion of the XM3600 dataset [18].
The task in XM3600 requires models to produce accurate, relevant, and grammatically correct
descriptions of images, making it a suitable testbed for generative multimodal performance.
Complementing the MMBench evaluation, we benchmarked our models against three competitive
open-source vision-language models of different architectures: Qwen2.5-VL-7B-Instruct,
Pixtral-12B, and PaliGemma2-10B. Because no human-annotated Polish standard for caption quality
currently exists, we adopted a threefold evaluation strategy:
(1) open-source LLM and VLM judges, (2) a closed-source VLM judge, and (3) human evaluation.
Please refer to the full paper for complete details of the evaluation methodology and results.
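The preference rates in the table below can be read as the share of pairwise comparisons the judge decided in our model's favour. A minimal sketch of that computation, assuming one verdict per image pair; the exact tie-handling used in the study is an assumption of this sketch, not a documented detail.

```python
def preference_rate(verdicts):
    """Percentage of pairwise judgments won by our model.

    `verdicts` holds one string per image pair: 'ours', 'baseline', or
    'tie'. Ties are split evenly between the two models here; this
    tie-handling is an assumption of the sketch.
    """
    if not verdicts:
        raise ValueError("no verdicts given")
    wins = sum(v == "ours" for v in verdicts)
    ties = sum(v == "tie" for v in verdicts)
    return 100.0 * (wins + 0.5 * ties) / len(verdicts)
```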
| Baseline | LLaVA-PLLuM-12B-nc-250715 | LLaVA-PLLuM-12B-nc | LLaVA-Bielik-11B-v2.6 |
| --- | --- | --- | --- |
| LLaVA-1.6-Mistral-7B | 84.91% | 85.81% | 82.35% |
| LLaVA-1.6-Vicuna-13B | 63.64% | 66.71% | 60.32% |
| PaliGemma2-10B | 77.47% | 77.53% | 74.10% |
| Pixtral-12B | 43.38% | 48.33% | 40.31% |
| Qwen2.5-VL-7B | 42.69% | 43.15% | 34.76% |

Preference rate (%) of our models over each baseline, as judged by an LLM (Llama-3.3-70B-Instruct) on the XM3600 dataset for linguistic correctness of descriptions.