Plain Language Detection (Legal)

Problem

Legal and administrative texts are often written in a way that overwhelms non-expert readers. This lack of clarity creates communication barriers, slows down processes, and erodes trust. Organizations need scalable, reliable tools to automatically identify unclear passages and support plain language rewriting.

Solution

We explored a wide range of machine learning approaches, from classical models (TF–IDF + SVM, fastText) through transformer-based architectures (huBERT, XLM-RoBERTa, RoBERTa) to large language models (GPT-4o-mini, Gemini 1.0 Pro). Each step showed how different levels of representation handle the task of distinguishing plain from complex legal sentences.
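To make the classical end of this spectrum concrete, the sketch below shows one way a TF–IDF + linear SVM baseline could be set up with scikit-learn. It is a minimal illustration, not the project's actual pipeline: the sentences, labels, and hyperparameters are invented placeholders.

```python
# Minimal baseline sketch, assuming a sentence-level binary corpus:
# TF-IDF features + a linear SVM, one of the classical approaches
# mentioned above. All data and settings here are placeholders.
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Toy stand-ins for plain (1) vs. complex (0) legal sentences.
train_sentences = [
    "You can appeal the decision in writing within 30 days.",
    "Pursuant to the aforementioned statutory provisions, the obligor shall effectuate remittance forthwith.",
    "We will send you the result by post.",
    "Notwithstanding any provision herein to the contrary, indemnification obligations shall survive termination.",
]
train_labels = [1, 0, 1, 0]

# Word unigrams and bigrams; sublinear TF dampens very frequent terms.
model = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True)),
    ("svm", LinearSVC(C=1.0)),
])
model.fit(train_sentences, train_labels)

test_sentences = [
    "The undersigned party shall indemnify and hold harmless the aforementioned entity.",
]
# Expected to lean toward the complex class (0) on this toy data.
print(model.predict(test_sentences))
```

The same fit/predict interface extends to stronger feature sets or classifiers, which is what makes this family of models attractive for quick, resource-efficient deployments.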
To build trust and interpretability, we applied SHAP explainability methods to highlight which features and tokens drive model decisions. This not only improved transparency but also allowed legal experts to better understand and refine the simplification process.
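As a hedged sketch of how such token-level explanations can be produced, the snippet below pairs the shap library with a Hugging Face text-classification pipeline. The model identifier is a hypothetical placeholder for a fine-tuned checkpoint (e.g. a huBERT classifier), and the example sentence is invented.

```python
# Sketch of token-level SHAP explanations for a fine-tuned transformer
# classifier. "your-org/hubert-plain-language" is a hypothetical name;
# substitute the actual fine-tuned checkpoint.
import shap
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="your-org/hubert-plain-language",  # placeholder checkpoint
    top_k=None,  # return scores for every class; SHAP needs the full distribution
)

explainer = shap.Explainer(classifier)
shap_values = explainer([
    "Pursuant to the aforementioned provisions, remittance shall be effectuated forthwith.",
])

# Per-token contributions toward the first output class; in a notebook,
# shap.plots.text(shap_values) renders the same values as highlighted text.
explanation = shap_values[0]
for token, value in zip(explanation.data, explanation.values[:, 0]):
    print(f"{token!r}: {value:+.3f}")
```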

Outcome

The experiments demonstrated that both lightweight models and large-scale transformers can reliably separate plain and complex texts. Transformer-based models achieved the best performance, while classical approaches proved useful for quick and resource-efficient deployment.
The combination of corpus-driven analysis, model training, and explainability provides a practical framework that organizations can adopt to monitor, evaluate, and improve their communication. The results, published in peer-reviewed journals, show strong generalization across domains.

Figures

Embedding map — plain vs. original sentences: how plain and original sentences separate in embedding space.
SHAP explanation of an individual classification.
Comparison of different models' performance on predicting plain language.
