Multimodal UI Representation Learning: Ablation of Screenshot, Wireframe, and View-Hierarchy Proxies on an Uploaded 168-Screen Dataset

Authors

  • Yushan Chen, Service Design, Savannah College of Art and Design, GA, USA
  • Evelyn Chan, Computer Engineering, Dartmouth College, NH, USA

DOI:

https://doi.org/10.69987/JACS.2023.30101

Keywords:

UI representation learning, multimodal embeddings, wireframe, view hierarchy

Abstract

We conduct an empirical ablation study on multimodal user-interface (UI) representation learning by integrating three complementary modalities: screenshot pixels, derived wireframes, and a proxy view-hierarchy structure. Unlike prior UI datasets such as RICO and Enrico, which combine visual and structural metadata, our evaluation dataset of 168 mobile UI screenshots (UI168) contains only raster images. Since no structural annotations are available, we deterministically generate two additional modalities from each screenshot: wireframes extracted using Canny edge detection and a hierarchy proxy constructed from bounding-box containment relations of edge-connected components. We compare unimodal embeddings and four fusion strategies on pseudo-topic classification and retrieval tasks using a fixed data split (seed=42). Pseudo-topics are created through K-means clustering (K=8) on early-fused training embeddings and transferred to the validation and test partitions. Lightweight, CPU-reproducible representations are employed: grayscale tiny-image features with color histograms for screenshots, edge-based descriptors for wireframes, and TF–IDF with truncated SVD for hierarchy tokens. On the 35-image test set, early fusion achieves the strongest retrieval performance (mAP=0.413), outperforming late and attention-based fusion, while unimodal screenshot features remain competitive. Gated fusion optimized on validation data yields moderate improvements with learned weights (0.40, 0.45, 0.15). For pseudo-topic classification, early fusion attains the highest Macro-F1 (0.863). Cross-modal retrieval experiments demonstrate strong screenshot-to-wireframe alignment under CCA and improved screenshot-to-hierarchy mapping using ridge regression. An additional robustness analysis evaluates degradation under occlusion, noise, and blur. All results are empirically derived from the provided dataset using fixed hyperparameters and a fully reproducible pipeline.
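The pipeline itself is not reproduced on this page, so the sketch below is a minimal, non-authoritative reconstruction of the modality-derivation and early-fusion steps described in the abstract. It assumes OpenCV and scikit-learn; the Canny thresholds, tile size, histogram bins, hierarchy token scheme, and SVD dimensionality are illustrative placeholders rather than the authors' settings.

# Minimal sketch (not the authors' code): derive the two proxy modalities
# described in the abstract from raster screenshots and early-fuse them.
import cv2
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

def wireframe(img_bgr, lo=50, hi=150):
    """Wireframe proxy: Canny edge map of the grayscale screenshot."""
    gray = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2GRAY)
    return cv2.Canny(gray, lo, hi)

def hierarchy_tokens(edges, min_area=50):
    """Hierarchy proxy: one token per edge-connected component, encoding an
    approximate nesting depth from bounding-box containment relations."""
    n, _, stats, _ = cv2.connectedComponentsWithStats(edges, connectivity=8)
    boxes = [tuple(int(v) for v in stats[i, :4]) for i in range(1, n)
             if stats[i, cv2.CC_STAT_AREA] >= min_area]  # (x, y, w, h)
    tokens = []
    for i, (x, y, w, h) in enumerate(boxes):
        # Depth = number of other boxes that fully enclose this one.
        depth = sum(px <= x and py <= y and px + pw >= x + w and py + ph >= y + h
                    for j, (px, py, pw, ph) in enumerate(boxes) if j != i)
        tokens.append(f"node_d{depth}_ar{round(w / max(h, 1))}")
    return " ".join(tokens) if tokens else "empty"

def screenshot_features(img_bgr, tile=16, bins=8):
    """Screenshot features: grayscale tiny image plus per-channel color histogram."""
    gray = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2GRAY)
    tiny = cv2.resize(gray, (tile, tile)).flatten() / 255.0
    hist = np.concatenate([
        cv2.calcHist([img_bgr], [c], None, [bins], [0, 256]).flatten()
        for c in range(3)])
    hist /= hist.sum() + 1e-8
    return np.concatenate([tiny, hist])

def early_fusion(images):
    """Concatenate L2-normalised per-modality features (early fusion)."""
    shot = np.stack([screenshot_features(im) for im in images])
    wire = np.stack([cv2.resize(wireframe(im), (16, 16)).flatten() / 255.0
                     for im in images])
    docs = [hierarchy_tokens(wireframe(im)) for im in images]
    tfidf = TfidfVectorizer(token_pattern=r"\S+").fit_transform(docs)
    k = min(16, tfidf.shape[1] - 1)
    hier = (TruncatedSVD(n_components=k).fit_transform(tfidf)
            if k >= 1 else tfidf.toarray())
    norm = lambda m: m / (np.linalg.norm(m, axis=1, keepdims=True) + 1e-8)
    return np.hstack([norm(shot), norm(wire), norm(hier)])

Under these assumptions, running K-means with K=8 on the training-split rows of the early-fused matrix would yield the pseudo-topic labels that the abstract describes transferring to the validation and test partitions for classification and retrieval evaluation.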

Author Biographies

  • Yushan Chen, Service Design, Savannah College of Art and Design, GA, USA
  • Evelyn Chan, Computer Engineering, Dartmouth College, NH, USA

Published

2023-01-04

How to Cite

Yushan Chen, & Evelyn Chan. (2023). Multimodal UI Representation Learning: Ablation of Screenshot, Wireframe, and View-Hierarchy Proxies on an Uploaded 168-Screen Dataset. Journal of Advanced Computing Systems, 3(1), 1-15. https://doi.org/10.69987/JACS.2023.30101