Accessible UI Semantics: Reproducible Evaluation of Generated Labels for Desktop Interface Components

Dingyuan Zhang

doi:10.69987/AIMLR.2026.70207

Authors

Dingyuan Zhang, Business Analytics, University of Rochester, NY, USA

DOI:

https://doi.org/10.69987/AIMLR.2026.70207

Keywords:

accessible labels, accessible names, UI semantics, interface components, inclusive design, screen readers, vision-language models, evaluation, human-computer interaction

Abstract

Accessible names are central to screen-reader navigation, speech control, and inclusive interaction. This paper reports a reproducible empirical evaluation of generated accessible labels for desktop interface components using the specified UI Component Semantic Description Dataset. The experimental corpus contains 559 human-labelled UI events, eight component classes, three screen-density levels, and ground-truth semantic descriptions such as “refresh button” and “email inbox searchbar.” The source data package contains 100 screenshots, while the CSV used for quantitative evaluation references 82 unique screenshot filenames with labelled left-click events; the experiments therefore evaluate every record with a human reference label. We compare seven deterministic label generators and one non-deployable label-reuse oracle under screenshot-level 10-fold group validation. The evaluated systems include role-only labelling, class-majority labelling, coordinate heuristics, TF-IDF nearest-neighbor retrieval, logistic-regression closed-label prediction, and a hybrid rule-retrieval variant. The primary metric is intent F1, which excludes generic role words such as button, icon, link, and input so that a system receives credit for predicting the user-facing function rather than merely restating the component type. The best deployable generator was logistic regression, with exact match 0.036, token F1 0.392, intent F1 0.060, and approximate label-in-name pass rate 0.041. The role-only baseline had higher token F1 at 0.481 but intent F1 of only 0.009, demonstrating that surface lexical overlap can overstate accessibility usefulness. The label-reuse oracle reached exact match 0.166 and token F1 0.559, establishing an empirical upper bound for methods restricted to labels observed in training folds. The results show that available class, depth, density, and coordinate metadata are sufficient to preserve component roles but insufficient to recover fine-grained functions. This finding supports the need for screenshot-aware and structure-aware language or vision-language systems, while also providing a transparent baseline package against which such systems can be audited.
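The contrast between token F1 and intent F1 described in the abstract can be illustrated with a minimal sketch. It is not the paper's evaluation code. The role-word list contains only the four words named in the abstract (the full list may be longer), whitespace tokenization and lowercasing are assumptions, and the function names `tokens`, `token_f1`, and `intent_f1` are hypothetical. Scoring an empty intent-token set as 0.0 is also an assumption.

```python
from collections import Counter

# Assumption: only the four role words named in the abstract; the real list may differ.
ROLE_WORDS = {"button", "icon", "link", "input"}


def tokens(label):
    """Lowercase and split on whitespace (assumed tokenization)."""
    return label.lower().split()


def token_f1(pred_tokens, ref_tokens):
    """Bag-of-tokens F1 between a predicted and a reference token list."""
    if not pred_tokens or not ref_tokens:
        return 0.0  # assumption: no credit when either side is empty
    # Multiset intersection counts each shared token at most as often as it appears in both.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def intent_f1(pred, ref):
    """Token F1 computed after removing generic role words from both labels."""
    pred_intent = [t for t in tokens(pred) if t not in ROLE_WORDS]
    ref_intent = [t for t in tokens(ref) if t not in ROLE_WORDS]
    return token_f1(pred_intent, ref_intent)


if __name__ == "__main__":
    ref = "refresh button"
    # A role-only prediction overlaps lexically but carries no intent tokens.
    print(token_f1(tokens("button"), tokens(ref)))
    print(intent_f1("button", ref))
    # A prediction with the right function but a different role word keeps its intent credit.
    print(intent_f1("refresh icon", ref))
```

Under these assumptions, a role-only prediction such as "button" earns partial token F1 against "refresh button" but zero intent F1, which mirrors the gap the abstract reports for the role-only baseline.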

Author Biography

Dingyuan Zhang, Business Analytics, University of Rochester, NY, USA
