Unsupervised Corpus Mapping

Problem

Large-scale corpora are difficult to explore without labels. Hidden structures in the data remain invisible, which slows down qualitative analysis and makes it harder to design downstream classifiers. For a planned multi-class, multi-label document classifier with 50+ categories, we needed to understand how the corpus self-organizes before committing to annotation and training.

Solution

We combined UMAP and t-SNE visualisation with HDBSCAN clustering and topic modelling (BERTopic). In parallel, we benchmarked classical k-means clustering, using the elbow method to estimate a suitable number of clusters.
This allowed us to assess whether documents formed coherent thematic groups, how many clusters might exist, and whether noisy or ambiguous items could be filtered out. The workflow surfaced overlapping clusters, topic hierarchies, and outliers, providing a first map of the corpus.
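
The elbow heuristic mentioned above can be illustrated with a minimal, dependency-free sketch. It is not the project's code: the toy 2-D points stand in for document embeddings, the farthest-first initialisation is a simplification chosen to keep the run deterministic, and the helper names (`kmeans_inertia`, `elbow_k`) are hypothetical. In practice a library implementation such as scikit-learn's `KMeans`, which exposes the same quantity as `inertia_`, would be the more usual choice.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def farthest_first_init(points, k):
    """Deterministic init (assumption): start at points[0], then repeatedly
    add the point farthest from all centroids chosen so far."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(squared_distance(p, c) for c in centroids))
        )
    return centroids


def kmeans_inertia(points, k, n_iter=100):
    """Run Lloyd's k-means and return the within-cluster sum of squares."""
    centroids = farthest_first_init(points, k)
    for _ in range(n_iter):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members
        # (an empty cluster keeps its previous centroid).
        new_centroids = [
            tuple(sum(dim) / len(members) for dim in zip(*members))
            if members else centroids[i]
            for i, members in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return sum(min(squared_distance(p, c) for c in centroids) for p in points)


def elbow_k(inertias):
    """Return the k with the largest second difference of the inertia curve,
    i.e. the point where adding clusters stops paying off sharply."""
    ks = sorted(inertias)
    bends = {
        ks[i]: inertias[ks[i - 1]] - 2 * inertias[ks[i]] + inertias[ks[i + 1]]
        for i in range(1, len(ks) - 1)
    }
    return max(bends, key=bends.get)


# Toy stand-in for document embeddings: three well-separated 2-D blobs.
rng = random.Random(42)
centers = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
points = [
    (cx + rng.gauss(0, 0.5), cy + rng.gauss(0, 0.5))
    for cx, cy in centers
    for _ in range(40)
]

inertias = {k: kmeans_inertia(points, k) for k in range(1, 7)}
best_k = elbow_k(inertias)
print(best_k)
```

On this toy data the sharpest bend falls at k = 3, matching the three generated blobs. Real corpora rarely show such a clean elbow, so the estimate is best read as a starting point to compare against density-based results.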

Outcome

The unsupervised mapping produced actionable insights for the design of the 50+ category multi-label classifier. Clusters revealed recurring argumentation patterns and stylistic differences, while outlier detection flagged noisy documents. These results guided corpus curation, balancing of topic distributions, and the architecture of the downstream supervised model.
Even without labels, unsupervised exploration proved essential for structuring and curating the dataset before scaling to annotation-intensive supervised learning.
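
As a rough illustration of how cluster assignments can feed curation, the sketch below separates flagged outliers from clustered documents and reports each cluster's share of the remainder. It assumes labels in the HDBSCAN convention, where noise points receive the label -1; the document IDs, the labels, and the function name `summarize_clusters` are made up for the example.

```python
from collections import Counter

NOISE_LABEL = -1  # HDBSCAN convention: -1 marks points not assigned to any cluster


def summarize_clusters(doc_ids, labels):
    """Split documents into kept and flagged, and compute cluster shares.

    Returns (kept_ids, flagged_ids, shares), where shares maps each cluster
    label to its fraction of the non-noise documents.
    """
    kept = [d for d, lab in zip(doc_ids, labels) if lab != NOISE_LABEL]
    flagged = [d for d, lab in zip(doc_ids, labels) if lab == NOISE_LABEL]
    sizes = Counter(lab for lab in labels if lab != NOISE_LABEL)
    total = sum(sizes.values())
    shares = {lab: count / total for lab, count in sizes.items()} if total else {}
    return kept, flagged, shares


# Hypothetical cluster assignments for eight documents.
doc_ids = ["d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7"]
labels = [0, 0, 1, -1, 0, 2, 1, -1]

kept, flagged, shares = summarize_clusters(doc_ids, labels)
print(flagged)  # documents to review or drop before annotation
print(shares)   # imbalance here hints at where sampling needs adjusting
```

Shares far from uniform indicate clusters that may need up- or down-sampling before annotation, which is one simple way to act on the balancing step described above.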

Figure: Condensed tree from HDBSCAN showing stable clusters and their relative densities.

Figure: Elbow method applied to k-means clustering to estimate the optimal number of clusters.
