AI-Driven Mobile UI Pattern Recognition and Design Topic Mining on RICO: Semantic Clustering and Screenshot-Based Topic Classification
DOI: https://doi.org/10.69987/JACS.2024.40506
Keywords: mobile UI, design mining, topic modeling, RICO dataset, vision transformer
Abstract
Mobile UI ecosystems contain recurring layout patterns, interaction structures, and visual motifs that collectively form “design topics”. This paper presents a data-driven pipeline that mines design topics from the RICO v0.1 semantic-annotation subset and then evaluates screenshot-based topic classification. Using 66,261 RICO screens (PNG screenshots paired with JSON view hierarchies containing semantic fields such as componentLabel, iconClass, text, bounds, and clickable), we extract a compact semantic feature vector per screen and apply MiniBatch K-Means (K=20) to obtain interpretable topic clusters. These clusters serve as pseudo-labels for downstream visual recognition. We compare three lightweight models that predict the mined topics from UI screenshots alone: (i) a small convolutional neural network (CNN), (ii) a compact vision transformer (ViT), and (iii) a lightweight vision–language model (LightVLM) trained with contrastive alignment between screenshots and semantic feature vectors. Experiments use a stratified subset of 4,782 screens (train/val/test = 3,000/594/1,188; nominally 150/30/60 per topic) drawn with a fixed random seed of 42. On the held-out test set, the ViT achieves the strongest overall performance (Accuracy = 0.345, Macro-F1 = 0.284, Macro-AUC = 0.820), outperforming the CNN (Accuracy = 0.222, Macro-F1 = 0.138, Macro-AUC = 0.764) and LightVLM (Accuracy = 0.243, Macro-F1 = 0.189, Macro-AUC = 0.782). We provide topic distribution analysis, clustering visualizations, confusion matrices, and embedding plots to characterize common failure modes. Finally, a semantic-only prototype baseline (Macro-F1 = 0.605, Macro-AUC = 0.945) shows how strongly the mined topics are grounded in view-hierarchy semantics, and how much of that signal the screenshot-only models fail to recover.
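To make the feature-extraction step concrete, the following is a minimal sketch of how a per-screen semantic feature vector might be computed from a view hierarchy. It is an illustration under stated assumptions, not the paper's implementation: the `children` key, the `COMPONENT_VOCAB` list, the helper names `iter_nodes` and `semantic_features`, and the choice of normalized counts are all hypothetical. Only the field names componentLabel, iconClass, text, bounds, and clickable come from the abstract.

```python
from collections import Counter

# Hypothetical label vocabulary; the abstract does not specify the actual feature layout.
COMPONENT_VOCAB = ["Text", "Image", "Icon", "Text Button", "List Item", "Input"]


def iter_nodes(node):
    """Yield every node of a view hierarchy depth-first (assumes a 'children' list)."""
    yield node
    for child in node.get("children") or []:
        if child:  # skip null entries, if any
            yield from iter_nodes(child)


def semantic_features(root):
    """Return a fixed-length vector of per-screen fractions (an assumed design)."""
    nodes = list(iter_nodes(root))
    total = len(nodes)
    labels = Counter(n["componentLabel"] for n in nodes if n.get("componentLabel"))
    # Fraction of nodes carrying each component label in the vocabulary.
    features = [labels.get(name, 0) / total for name in COMPONENT_VOCAB]
    # Fractions of nodes that are clickable, carry an icon class, or carry text.
    features.append(sum(1 for n in nodes if n.get("clickable")) / total)
    features.append(sum(1 for n in nodes if n.get("iconClass")) / total)
    features.append(sum(1 for n in nodes if n.get("text")) / total)
    return features


# Toy screen for illustration only; not taken from RICO.
screen = {
    "bounds": [0, 0, 1440, 2560],
    "children": [
        {"componentLabel": "Text", "text": "Sign in", "clickable": False},
        {"componentLabel": "Text Button", "text": "OK", "clickable": True},
        {"componentLabel": "Icon", "iconClass": "arrow_backward", "clickable": True},
    ],
}
vec = semantic_features(screen)
```

Vectors of this kind, stacked into a matrix with one row per screen, could then be passed to a clustering routine such as scikit-learn's `MiniBatchKMeans(n_clusters=20)` to obtain topic pseudo-labels, assuming that library is the one used.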







