AI-Driven Mobile UI Pattern Recognition and Design Topic Mining on RICO: Semantic Clustering and Screenshot-Based Topic Classification

Jason Kuhn; Yushan Chen; Evelyn Chan

doi:10.69987/JACS.2024.40506

Authors

Jason Kuhn, Data Science, University of Pittsburgh, PA, USA
Yushan Chen, Service Design, Savannah College of Art and Design, GA, USA
Evelyn Chan, Computer Engineering, Dartmouth College, NH, USA

Keywords:

mobile UI, design mining, topic modeling, RICO dataset, vision transformer

Abstract

Mobile UI ecosystems contain recurring layout patterns, interaction structures, and visual motifs that collectively form “design topics”. This paper presents a data-driven pipeline that mines design topics from the RICO v0.1 semantic-annotation subset and then evaluates screenshot-based topic classification. Using 66,261 RICO screens (PNG screenshots paired with JSON view hierarchies containing semantic fields such as componentLabel, iconClass, text, bounds, and clickable), we extract a compact semantic feature vector per screen and apply MiniBatch K-Means (K=20) to obtain interpretable topic clusters. These clusters serve as pseudo-labels for downstream visual recognition. We compare three lightweight models that predict the mined topics from UI screenshots alone: (i) a small convolutional neural network (CNN), (ii) a compact vision transformer (ViT), and (iii) a lightweight vision–language model (LightVLM) trained with contrastive alignment between screenshots and semantic feature vectors. Experiments use a stratified subset of 4,782 screens (train/val/test = 3,000/594/1,188; up to 150/30/60 per topic) with a fixed random seed of 42. On the held-out test set, the ViT achieves the strongest overall performance (Accuracy = 0.345, Macro-F1 = 0.284, Macro-AUC = 0.820), outperforming the CNN (Accuracy = 0.222, Macro-F1 = 0.138, Macro-AUC = 0.764) and LightVLM (Accuracy = 0.243, Macro-F1 = 0.189, Macro-AUC = 0.782). We provide topic distribution analysis, clustering visualizations, confusion matrices, and embedding plots to characterize common failure modes. Finally, a semantic-only prototype baseline (Macro-F1 = 0.605, Macro-AUC = 0.945) clarifies how strongly the mined topics are grounded in view-hierarchy semantics.
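
To make the first pipeline stage concrete, the following is a minimal sketch of how a per-screen semantic feature vector could be derived from a view hierarchy. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the feature set, so the LABELS vocabulary, the function names iter_nodes and screen_features, the "children" nesting key, and the choice of fraction-based features are all hypothetical. Only the field names componentLabel, iconClass, text, bounds, and clickable come from the abstract.

```python
from collections import Counter

# Hypothetical label vocabulary (assumption): the paper's actual
# feature definition is not given in the abstract.
LABELS = ["Text", "Icon", "Image", "Text Button", "List Item", "Toolbar"]


def iter_nodes(node):
    """Yield a view-hierarchy node and all of its descendants, depth-first."""
    yield node
    # Assumes nested nodes are stored under a "children" key.
    for child in node.get("children") or []:
        yield from iter_nodes(child)


def screen_features(root):
    """Return a fixed-length feature vector for one screen.

    Features (all assumptions for illustration):
      - fraction of nodes carrying each componentLabel in LABELS
      - fraction of nodes marked clickable
      - fraction of nodes with non-empty text
    """
    nodes = list(iter_nodes(root))
    total = len(nodes)
    counts = Counter(n.get("componentLabel") for n in nodes)
    label_fracs = [counts[label] / total for label in LABELS]
    clickable_frac = sum(bool(n.get("clickable")) for n in nodes) / total
    text_frac = sum(bool(n.get("text")) for n in nodes) / total
    return label_fracs + [clickable_frac, text_frac]


# A tiny hand-written hierarchy (not taken from RICO) with five nodes.
example = {
    "bounds": [0, 0, 1440, 2560],
    "children": [
        {
            "componentLabel": "Toolbar",
            "bounds": [0, 84, 1440, 280],
            "clickable": False,
            "children": [
                {"componentLabel": "Icon", "iconClass": "menu",
                 "bounds": [0, 84, 196, 280], "clickable": True},
                {"componentLabel": "Text", "text": "Inbox",
                 "bounds": [252, 134, 600, 230], "clickable": False},
            ],
        },
        {"componentLabel": "List Item",
         "bounds": [0, 280, 1440, 560], "clickable": True},
    ],
}

vector = screen_features(example)
print(vector)
```

Vectors of this kind, stacked over all screens, could then be passed to a clustering routine such as scikit-learn's MiniBatchKMeans with n_clusters=20 to obtain the pseudo-labels; the fraction-based normalization here is one possible design choice that keeps screens with very different node counts comparable.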