Synthetic Data Augmentation

Problem

Many NLP tasks suffer from imbalanced or limited training data. Minority classes are often underrepresented, making it difficult for models to generalize reliably. Expanding datasets manually is costly and slow, creating demand for scalable augmentation strategies.

Solution

We explored a spectrum of augmentation techniques. On the lightweight side, we applied Easy Data Augmentation (EDA) methods such as synonym replacement, random insertion, random swap, and random deletion to create balanced datasets. On the more advanced side, we generated synthetic examples with LLMs, controlling prompts to produce diverse but linguistically consistent samples. Both approaches were tested in sentiment and emotion classification pipelines, supported by preprocessing, deduplication, and linguistic sanity checks, and backed by automated quality control based primarily on linguistic metrics.
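The four EDA operations can be sketched in a few lines of Python. This is a minimal illustration, not the project's actual implementation: the synonym table below is a hypothetical hand-rolled stand-in for the WordNet lookup EDA normally uses.

```python
import random

# Hypothetical synonym table; real EDA typically draws synonyms from WordNet.
SYNONYMS = {
    "good": ["great", "fine"],
    "bad": ["poor", "awful"],
    "movie": ["film"],
}

def synonym_replacement(words, n=1):
    """Replace up to n words that have known synonyms."""
    out = words[:]
    candidates = [i for i, w in enumerate(out) if w in SYNONYMS]
    for i in random.sample(candidates, min(n, len(candidates))):
        out[i] = random.choice(SYNONYMS[out[i]])
    return out

def random_insertion(words, n=1):
    """Insert n synonyms of existing words at random positions."""
    out = words[:]
    for _ in range(n):
        syns = [s for w in out for s in SYNONYMS.get(w, [])]
        if syns:
            out.insert(random.randrange(len(out) + 1), random.choice(syns))
    return out

def random_swap(words, n=1):
    """Swap two random positions, n times."""
    out = words[:]
    for _ in range(n):
        i, j = random.sample(range(len(out)), 2)
        out[i], out[j] = out[j], out[i]
    return out

def random_deletion(words, p=0.1):
    """Drop each word with probability p; never return an empty sentence."""
    out = [w for w in words if random.random() > p]
    return out if out else [random.choice(words)]
```

Each operation preserves rough grammaticality often enough that, applied to minority-class sentences, it yields cheap extra training examples.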

Outcome

Augmentation consistently improved classification scores for underrepresented classes. EDA techniques yielded up to double-digit gains in F1 scores in political sentiment tasks, while LLM-generated examples provided even richer coverage of edge cases and nuanced expressions.
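The prompt control behind the LLM-generated examples can be as simple as a parameterized template. The wording and parameters below are hypothetical, sketched for illustration rather than taken from the actual project prompts:

```python
def build_prompt(label, style, n_examples, seed_sentence):
    """Assemble a generation prompt that pins down label, register, and count."""
    return (
        f"Write {n_examples} short social-media posts expressing {label} sentiment. "
        f'Match the tone of this example without copying it: "{seed_sentence}" '
        f"Keep the register {style} and vary vocabulary and sentence structure."
    )

prompt = build_prompt("negative", "informal", 5, "this policy is a disaster")
```

Varying the label, style, and seed sentence per call is one way to steer the model toward diverse but linguistically consistent minority-class samples.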
Importantly, we systematically measured the real effect of synthetic data, distinguishing between performance gains due to better balance within the training set and genuine improvements in model generalization. This analysis ensured that improvements reflected true robustness, not just artificial inflation.
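One way to keep the two effects apart, shown here as a simplified sketch rather than the exact analysis behind the project's figures, is to score every model variant on the same untouched original test set: gains that survive there reflect generalization, while gains that only appear on a rebalanced test set largely reflect class balance.

```python
def f1_score(y_true, y_pred, label):
    """Per-class F1 computed from raw label lists (no library dependencies)."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

def generalization_gain(y_true, pred_baseline, pred_augmented, minority_label):
    """Minority-class F1 delta, measured on the ORIGINAL test set only."""
    return (f1_score(y_true, pred_augmented, minority_label)
            - f1_score(y_true, pred_baseline, minority_label))
```

Because both predictions are scored against the identical original test set, any positive delta cannot be an artifact of the augmented class distribution.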
The combined workflow offers a practical way to close data gaps, reduce annotation needs, and improve robustness across domains. Results were presented at conferences and published in peer-reviewed venues.

Figure: EDA impact on F1 scores. Synthetic data often inflates apparent F1 improvements (blue), while the true generalization gains (orange) remain smaller but consistent across emotion categories. The green line shows the ratio of synthetic to original data.

Figure: LLM augmentation workflow. Quality check for LLM-generated synthetic data: POS trigram distributions show overall similarity to the original corpus, but with systematic shifts that highlight the need for careful evaluation.
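The POS-trigram quality check described above can be sketched as follows. It assumes sentences have already been POS-tagged (e.g. by spaCy), and uses total variation distance as a stand-in for whichever divergence the actual pipeline computes:

```python
from collections import Counter

def pos_trigrams(tags):
    """All consecutive POS-tag trigrams in one tagged sentence."""
    return list(zip(tags, tags[1:], tags[2:]))

def trigram_distribution(tagged_corpus):
    """Relative frequency of each POS trigram over a corpus of tag sequences."""
    counts = Counter(tri for sent in tagged_corpus for tri in pos_trigrams(sent))
    total = sum(counts.values())
    return {tri: c / total for tri, c in counts.items()}

def total_variation(p, q):
    """0.0 means identical distributions; 1.0 means disjoint support."""
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in set(p) | set(q))
```

A synthetic corpus whose distance to the original stays small passes the check; a spike in the distance flags the kind of systematic shift noted in the figure caption.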

← Back to Projects