Unsupervised Corpus Mapping
Problem
Large corpora are difficult to explore without labels. Latent structure in the data stays hidden, which slows qualitative analysis and makes downstream classifiers harder to design. For a planned multi-class, multi-label document classifier with 50+ categories, we needed to understand how the corpus self-organizes before committing to annotation and training.
Solution
We applied an unsupervised workflow that combined UMAP and t-SNE projections for visualisation, HDBSCAN for density-based clustering, and BERTopic for topic modelling. In parallel, we benchmarked classical k-means clustering, using the elbow method to estimate a plausible number of clusters.
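As an illustration of the k-means benchmark, the sketch below computes an inertia curve and reads off an elbow. It is a minimal example under stated assumptions, not the project's actual code: the data is a synthetic stand-in for document embeddings, the helper names `inertia_curve` and `elbow_k` are hypothetical, and the elbow is picked with a simple "largest gap below the chord" heuristic, which is one of several ways to read the curve.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs


def inertia_curve(X, k_values, seed=0):
    """Fit k-means once per k and return the within-cluster sum of squares."""
    return [
        KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X).inertia_
        for k in k_values
    ]


def elbow_k(k_values, inertias):
    """Return the k whose point lies furthest below the chord joining the
    first and last points of the (rescaled) inertia curve."""
    k = np.asarray(k_values, dtype=float)
    w = np.asarray(inertias, dtype=float)
    # Rescale both axes to [0, 1] so the result does not depend on units.
    x = (k - k[0]) / (k[-1] - k[0])
    y = (w - w[-1]) / (w[0] - w[-1])
    # The chord runs from (0, 1) to (1, 0); the gap below it is 1 - x - y.
    gap = 1.0 - x - y
    return int(k_values[int(np.argmax(gap))])


# Synthetic stand-in for document embeddings (assumption): four compact,
# well-separated groups placed on the corners of a square.
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0], [10.0, 10.0]])
X, _ = make_blobs(n_samples=400, centers=centers, cluster_std=0.5, random_state=0)

k_values = list(range(1, 10))
inertias = inertia_curve(X, k_values)
best_k = elbow_k(k_values, inertias)
print(f"estimated number of clusters: {best_k}")
```

On real text embeddings the curve is usually much smoother than on this toy data, so the elbow is better treated as a rough estimate to compare against the density-based results than as a definitive cluster count.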
These methods let us assess whether documents formed coherent thematic groups, how many clusters might exist, and whether noisy or ambiguous items could be filtered out. The workflow surfaced overlapping clusters, topic hierarchies, and outliers, providing a first map of the corpus.
Outcome
The unsupervised mapping produced actionable insights for the design of the 50+ category multi-label classifier. Clusters revealed recurring argumentation patterns and stylistic differences, while outlier detection flagged noisy documents. These results guided corpus curation, the balancing of topic distributions, and the architecture of the downstream supervised model.
Even without labels, unsupervised exploration proved essential for structuring and curating the dataset before scaling to annotation-intensive supervised learning.

