We conduct a twofold evaluation to assess the performance of our Polish vision-language model: (1) quantitative benchmarking on MMBench v1.1, and (2) a model-as-a-judge study of image captioning quality in Polish.
In the absence of established multimodal evaluation benchmarks for Polish, we adapt an existing English benchmark for quantitative assessment.
As our primary benchmark we selected MMBench v1.1 [17], which evaluates multiple dimensions of visual understanding, including object recognition, OCR, commonsense reasoning, and fine-grained perception.
Because the official MMBench test split has not been publicly released, we evaluate on the development set.
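For reference, scoring on the development split reduces to exact-match accuracy over predicted option letters. The sketch below illustrates this under assumed column names (`question`, `A`..`D`, `answer`, and a base64-encoded `image`) and a hypothetical `ask_model` helper; it is not the official MMBench harness.

```python
import pandas as pd

def mmbench_dev_accuracy(tsv_path: str, ask_model) -> float:
    """Exact-match accuracy on an MMBench-style dev TSV.

    `ask_model(image_b64, question, options)` is a hypothetical helper that
    returns a single option letter ("A".."D"). The official protocol also
    rotates option order (CircularEval), which we omit here for brevity.
    """
    df = pd.read_csv(tsv_path, sep="\t")
    correct = 0
    for _, row in df.iterrows():
        # Not every question has all four options; skip empty cells.
        options = {c: row[c] for c in "ABCD" if isinstance(row.get(c), str)}
        pred = ask_model(row["image"], row["question"], options)
        correct += int(pred.strip().upper() == row["answer"].strip().upper())
    return correct / len(df)
```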
To enable Polish evaluation, we translated all MMBench v1.1 questions into Polish using Tower+ 72B [14], followed by manual expert correction to ensure linguistic accuracy and
eliminate translation artifacts. The resulting MMBench-PL dataset is therefore human-validated and
suitable for assessing Polish multimodal reasoning.
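As an illustration of this first pass, the sketch below drives an instruction-tuned translation model through the Hugging Face `transformers` chat interface. The checkpoint ID, prompt wording, and decoding settings are our assumptions, not necessarily those used to produce MMBench-PL.

```python
from transformers import pipeline

# Assumed checkpoint ID; substitute the actual Tower+ 72B release.
translator = pipeline("text-generation", model="Unbabel/Tower-Plus-72B",
                      device_map="auto")

def translate_to_polish(text: str) -> str:
    """First-pass EN->PL machine translation; outputs still go to experts."""
    messages = [{
        "role": "user",
        "content": ("Translate the following text from English into Polish.\n"
                    f"English: {text}\nPolish:"),
    }]
    out = translator(messages, max_new_tokens=256, do_sample=False)
    # The pipeline returns the full chat; the last turn is the translation.
    return out[0]["generated_text"][-1]["content"].strip()
```

Each machine-translated question was then reviewed and corrected by human experts before inclusion in MMBench-PL.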
The use of the development split makes comparisons strictly fair only against the LLaVA family of models, whose training data and fine-tuning procedures are publicly documented. For other open-source VLMs (e.g., Pixtral, Qwen2.5-VL, PaliGemma), the extent of exposure to MMBench during fine-tuning is unknown.
Among these, only PaliGemma partially discloses its pre-training data, but not its fine-tuning, so direct leaderboard-style comparisons should be interpreted with caution.
| Model | MMBench (Polish) | MMBench (English) |
| --- | --- | --- |
| LLaVA-1.6-Mistral-7B | 66.41% | 72.37% |
| LLaVA-1.6-Vicuna-13B | 68.29% | 74.14% |
| LLaVA-PLLuM-12b-nc (Ours) | 73.89% (+5.6 pp) | 73.89% |
| *Additional Open-Source Models (different architectures)* | | |
| PaliGemma2-10B | 77.63% | 79.59% |
| Pixtral-12B | 79.04% | 81.52% |
| Qwen2.5-VL-7B | 74.38% | 79.02% |
Key Finding: Our model achieves a +5.6 percentage point improvement on the Polish benchmark over LLaVA-1.6-Vicuna-13B (73.89% vs. 68.29%) while maintaining comparable English performance, demonstrating substantially improved handling of Polish-language content.
To evaluate abilities that go beyond multiple-choice recognition and involve open-ended text
generation, we conducted a second study based on image captioning. For this purpose, we used the
Polish portion of the XM3600 dataset [18].
The task in XM3600 requires models to produce accurate, relevant, and grammatically correct
descriptions of images, making it a suitable testbed for generative multimodal performance.
We benchmarked our model against three competitive open-source vision-language models of different
architectures: Qwen2.5-VL-7B-Instruct, Pixtral-12B, and PaliGemma-3B, complementing the MMBench
evaluation.
Because no human-annotated Polish standard for caption quality currently exists, we adopted an LLM-as-a-judge evaluation strategy using LLaVA-OneVision-72B, the strongest open-source VLM available at the time of evaluation, which can jointly process the image and both candidate captions.
We used a pairwise comparison setup in which the judge is presented with an image and two captions and
determines which description is better.
Since prompt wording and input order can influence the outcome, we employed two prompt formulations, one presenting caption A before caption B and the other reversing the order, and tested each with both model assignments (our model as A and as B). The four resulting judgments for each comparison were then averaged to obtain a stable final score.
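Concretely, this 2×2 design (presentation order × label assignment) can be scripted as below. The prompt wording and the `judge` wrapper around LLaVA-OneVision-72B are illustrative assumptions, not our exact implementation.

```python
def build_prompts(cap_a: str, cap_b: str) -> list[str]:
    """Two formulations: one presents caption A first, the other caption B."""
    question = "Which caption describes the image better? Answer 'A' or 'B'."
    return [
        f"Caption A: {cap_a}\nCaption B: {cap_b}\n{question}",
        f"Caption B: {cap_b}\nCaption A: {cap_a}\n{question}",
    ]

def pairwise_score(image, ours: str, baseline: str, judge) -> float:
    """Average of four judgments: 2 presentation orders x 2 label assignments.

    `judge(image, prompt)` is a hypothetical wrapper around
    LLaVA-OneVision-72B that returns "A" or "B". The result is the fraction
    of the four judgments preferring our model's caption.
    """
    wins = 0
    for ours_is_a in (True, False):  # our caption labelled A, then B
        cap_a, cap_b = (ours, baseline) if ours_is_a else (baseline, ours)
        for prompt in build_prompts(cap_a, cap_b):  # A-first, then B-first
            verdict = judge(image, prompt).strip().upper()
            wins += int(verdict == ("A" if ours_is_a else "B"))
    return wins / 4.0
```

Averaging over both label assignments and both presentation orders cancels first-position bias; a corpus-level winrate can then be obtained by averaging these per-image scores over the dataset.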
Together, these steps provide a controlled and replicable protocol for assessing Polish-language
caption quality in the absence of human-annotated ground truth, while capturing the generative
multimodal capabilities of the evaluated models.
| Comparison | Judge Winrate (Ours vs. Baseline) |
| --- | --- |
| LLaVA-PLLuM-12b-nc vs PaliGemma-3B | 95.2% vs 4.8% |
| LLaVA-PLLuM-12b-nc vs Qwen2.5-VL-7B | 62.7% vs 37.3% |
| LLaVA-PLLuM-12b-nc vs Pixtral-12B | 59.3% vs 40.7% |
Key Finding: Across all comparisons, LLaVA-PLLuM is consistently preferred by the judge,
indicating higher caption quality in Polish. Our qualitative analysis showed that LLaVA-PLLuM produces more
grammatically correct sentences, maintains proper Polish morphology, and avoids inventing non-existent
Polish words—a common failure mode observed in baseline models.