Applied Multilingual NLP
Problem
Real-world applications rarely operate in one language. Building and maintaining separate models per locale is expensive and brittle, especially for low-resource languages and domain shifts. We needed a scalable approach that reuses supervision across languages and keeps quality stable in production.
Solution
We designed multilingual pipelines around transformer encoders (e.g., XLM-R / mBERT) with domain adaptation and lightweight parameter-efficient tuning (adapters/LoRA). Data efficiency came from back-translation and synthetic augmentation, plus language ID (LID) routing and locale-aware normalisation (tokenisation, diacritics/casing, number/date formats).
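The LID routing and locale-aware normalisation step can be sketched as below. This is a minimal illustration, not the production pipeline: the function names, the Turkish casing special case, and the 0.8 routing threshold are all illustrative assumptions.

```python
import unicodedata

# Hypothetical locale-aware normaliser: Unicode NFC normalisation,
# optional diacritic stripping, and locale-sensitive lowercasing.
def normalise(text: str, lang: str, strip_diacritics: bool = False) -> str:
    text = unicodedata.normalize("NFC", text)
    if strip_diacritics:
        # Decompose to NFD and drop combining marks (accents, diacritics).
        decomposed = unicodedata.normalize("NFD", text)
        text = "".join(c for c in decomposed if not unicodedata.combining(c))
    # Turkish distinguishes dotted/dotless "i"; plain lower() breaks it.
    if lang == "tr":
        text = text.replace("I", "\u0131").replace("\u0130", "i")
    return text.lower()

# Hypothetical LID router: send a document to its per-language pipeline
# when the language-ID score is confident, else to a multilingual fallback.
def route(lid_scores: dict, threshold: float = 0.8) -> str:
    lang, score = max(lid_scores.items(), key=lambda kv: kv[1])
    return lang if score >= threshold else "multi"
```

Routing low-confidence documents to a shared multilingual model avoids committing to a wrong locale-specific normaliser on code-switched or noisy input.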
As one key source we integrated ParlaMint (comparable parliamentary corpora across many European languages), enabling cross-country and cross-lingual evaluation. We targeted document-level tasks (sentiment/emotion, topic aspects) with multi-class and multi-label settings, supported by slice-based evaluation per language and domain.
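Slice-based evaluation of the kind described above can be sketched as follows: macro-F1 is computed separately for each language slice rather than pooled. The record format and label set are illustrative assumptions, not the actual evaluation harness.

```python
from collections import defaultdict

# Per-class F1 from gold/predicted label sequences.
def f1(gold, pred, label):
    tp = sum(g == p == label for g, p in zip(gold, pred))
    fp = sum(p == label and g != label for g, p in zip(gold, pred))
    fn = sum(g == label and p != label for g, p in zip(gold, pred))
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

def macro_f1_by_language(records):
    # records: iterable of (lang, gold_label, predicted_label) triples
    slices = defaultdict(list)
    for lang, gold, pred in records:
        slices[lang].append((gold, pred))
    scores = {}
    for lang, pairs in slices.items():
        gold, pred = zip(*pairs)
        labels = sorted(set(gold) | set(pred))
        scores[lang] = sum(f1(gold, pred, lb) for lb in labels) / len(labels)
    return scores
```

Reporting a score per language slice (and, analogously, per domain) surfaces regressions on low-resource languages that a single pooled metric would hide.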
Outcome
The approach reduced per-language annotation needs while improving cross-lingual generalisation. Across typical tasks the pipelines achieved robust macro-F1 and scaled to new locales with minimal additional training. Cross-lingual evaluation showed that macro-precision, macro-recall, and macro-F1 remained consistently high across multiple languages (e.g., Slovak, Czech, French, Hungarian, English, Polish, German), confirming the robustness of the multilingual pipeline even in low-resource scenarios. Clear model cards, per-language thresholds, and monitoring supported reliable deployment in production-like settings.
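Per-language decision thresholds can be selected as sketched below: for each language, pick the probability cutoff that maximises F1 on held-out data instead of using a single global 0.5. The function names, candidate grid, and data layout are illustrative assumptions.

```python
# Hypothetical per-language threshold tuning for a binary/multi-label head.
def best_threshold(probs, gold, candidates=(0.3, 0.4, 0.5, 0.6, 0.7)):
    def f1_at(t):
        pred = [p >= t for p in probs]
        tp = sum(p and g for p, g in zip(pred, gold))
        fp = sum(p and not g for p, g in zip(pred, gold))
        fn = sum(g and not p for p, g in zip(pred, gold))
        denom = 2 * tp + fp + fn
        return 2 * tp / denom if denom else 0.0
    return max(candidates, key=f1_at)

def thresholds_by_language(dev_sets):
    # dev_sets: {lang: (predicted_probs, gold_booleans)} from a held-out split
    return {lang: best_threshold(p, g) for lang, (p, g) in dev_sets.items()}
```

Tuning the cutoff per language compensates for calibration drift between high- and low-resource languages, which a shared global threshold cannot.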
Related publications

