We conduct a two-fold evaluation to assess the performance of our Polish vision-language model: (1)
quantitative benchmarking using MMBench v1.1, and (2) a model-as-a-judge study on image captioning quality
in Polish.
Due to the absence of established multimodal evaluation benchmarks in Polish, we adapt existing
English benchmarks for quantitative assessment.
As a primary benchmark, we selected MMBench v1.1 [17], which evaluates multiple
dimensions of visual understanding, including object recognition, OCR, commonsense reasoning, and
fine-grained perception.
Because the official MMBench test split has not been released, we evaluate on the development
set.
To enable Polish evaluation, we translated all MMBench v1.1 questions into Polish using Tower+ 72B [14], followed by manual expert correction to ensure linguistic accuracy and
eliminate translation artifacts. The resulting MMBench-PL dataset is therefore human-validated and
suitable for assessing Polish multimodal reasoning.
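The translate-then-validate construction of MMBench-PL can be sketched as follows. This is a minimal illustration, not the actual pipeline: `translate_fn` stands in for a call to Tower+ 72B, and the item field names are assumptions rather than the official MMBench schema.

```python
def build_mmbench_pl(items, translate_fn):
    """Produce draft Polish translations of MMBench items.

    `translate_fn` stands in for the MT model (Tower+ 72B in our setup);
    the field names here are illustrative assumptions. Option keys and the
    gold label are left untouched so scoring stays aligned with the
    English original.
    """
    translated = []
    for item in items:
        translated.append({
            "question": translate_fn(item["question"]),
            "options": {key: translate_fn(text)
                        for key, text in item["options"].items()},
            "answer": item["answer"],  # gold label is language-independent
        })
    return translated  # drafts then go to human experts for correction
```

Each draft produced this way was subsequently corrected by human experts before inclusion in MMBench-PL.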
Evaluating on the development split makes comparisons strictly fair only against the LLaVA family
of models, whose training data and fine-tuning procedures are publicly documented. For other
open-source VLMs (e.g., Pixtral, Qwen2.5-VL, PaliGemma), the extent of exposure to MMBench during
fine-tuning is unknown; only PaliGemma partially discloses its pre-training data, and not its
fine-tuning data. Direct leaderboard-style comparisons with these models should therefore be
interpreted with caution.
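The MMBench scores reported below are plain multiple-choice accuracies. A minimal scorer might look like the following; the assumption that predictions and gold labels are single option letters is ours, not a detail of the official harness.

```python
def mmbench_accuracy(predictions, gold_labels):
    """Exact-match accuracy (%) over multiple-choice answers, e.g. 'A'..'D'.

    Assumes answer letters have already been extracted from model output;
    the extraction step itself is outside this sketch.
    """
    if len(predictions) != len(gold_labels):
        raise ValueError("predictions and gold labels must align one-to-one")
    correct = sum(p.strip().upper() == g.strip().upper()
                  for p, g in zip(predictions, gold_labels))
    return 100.0 * correct / len(gold_labels)
```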
| Model | MMBench (Polish) | MMBench (English) |
| --- | --- | --- |
| LLaVA-1.6-Mistral-7B | 68.18% | 76.54% |
| LLaVA-1.6-Vicuna-13B | 69.80% | 74.39% |
| LLaVA-PLLuM-12b-nc-250715 (Ours) | 76.73% | 75.23% |
| LLaVA-Bielik-11b-v2.6 (Ours) | 78.24% | 77.75% |
| LLaVA-PLLuM-12b-nc (Ours) | 79.35% (+9.55 pp) | 78.43% |
| *Additional open-source models (different architectures)* | | |
| Qwen2.5-VL-7B | 75.56% | 80.62% |
| PaliGemma2-10B | 78.39% | 80.46% |
| Pixtral-12B | 82.06% | 84.31% |
Key Finding: Our best model, LLaVA-PLLuM-12b-nc, achieves 79.35% on the Polish MMBench
v1.1 benchmark, a 9.55-percentage-point improvement over LLaVA-1.6-Vicuna-13B (69.80%),
while maintaining strong English performance at 78.43%. This demonstrates improved
recognition of Polish context and linguistic understanding. Compared with other open-source
models, LLaVA-PLLuM shows notably better Polish language understanding, outperforming
Qwen2.5-VL-7B (75.56%) and PaliGemma2-10B (78.39%) on the Polish benchmark.
To evaluate abilities that go beyond multiple-choice recognition and involve open-ended text
generation, we conducted a second study based on image captioning. For this purpose, we used the
Polish portion of the XM3600 dataset [18].
The task in XM3600 requires models to produce accurate, relevant, and grammatically correct
descriptions of images, making it a suitable testbed for generative multimodal performance.
Complementing the MMBench evaluation, we benchmarked our models against three competitive
open-source vision-language models of different architectures: Qwen2.5-VL-7B-Instruct,
Pixtral-12B, and PaliGemma2-10B. Because no human-annotated Polish standard for caption quality
currently exists, we adopted a threefold evaluation strategy:
(1) open-source LLM and VLM judges, (2) a closed-source VLM judge, and (3) human evaluation.
Please refer to the full paper for complete details of the evaluation methodology and results.
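The preference rates in the table below can be read as the share of pairwise comparisons the judge decided in our model's favour. A minimal sketch of that computation, assuming one verdict per image pair; the exact tie-handling used in the study is an assumption of this sketch, not a documented detail.

```python
def preference_rate(verdicts):
    """Percentage of pairwise judgments won by our model.

    `verdicts` holds one string per image pair: 'ours', 'baseline', or
    'tie'. Ties are split evenly between the two models here; this
    tie-handling is an assumption of the sketch.
    """
    if not verdicts:
        raise ValueError("no verdicts given")
    wins = sum(v == "ours" for v in verdicts)
    ties = sum(v == "tie" for v in verdicts)
    return 100.0 * (wins + 0.5 * ties) / len(verdicts)
```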
| Baseline | LLaVA-PLLuM-12B-nc-250715 | LLaVA-PLLuM-12B-nc | LLaVA-Bielik-11B-v2.6 |
| --- | --- | --- | --- |
| LLaVA-1.6-Mistral-7B | 84.91% | 85.81% | 82.35% |
| LLaVA-1.6-Vicuna-13B | 63.64% | 66.71% | 60.32% |
| PaliGemma2-10B | 77.47% | 77.53% | 74.10% |
| Pixtral-12B | 43.38% | 48.33% | 40.31% |
| Qwen2.5-VL-7B | 42.69% | 43.15% | 34.76% |

Preference rate (%) of our models over each baseline, as judged by an LLM (Llama-3.3-70B-Instruct) on the XM3600 dataset for linguistic correctness of descriptions.