UI-E2I-Synth: Advancing GUI Grounding with Large-Scale Instruction Synthesis
In brief
The paper proposes a method that uses GPT-4o to synthesize large-scale user instructions (explicit and implicit) for GUI screens, then trains vision-language models for GUI instruction grounding on the synthesized data, improving performance on challenging benchmarks.
Executive Summary
The authors focus on the task of GUI instruction grounding: mapping a user’s natural language instruction (e.g. “click the don’t allow checkbox”) to the correct element location in a GUI screenshot. Rather than relying on platform metadata (which can vary and be unreliable across systems), they adopt a vision-based approach.
Key challenges they identify are:
- The small element-to-screen size ratio (target elements often occupy only a tiny fraction of the full screenshot),
- Imbalanced frequencies of element types (e.g. “button” vs “checkbox” vs “icon”),
- Implicit instructions (instructions that reference semantics or relationships rather than the visible text directly); a short example follows this list.
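To make the explicit/implicit distinction concrete, here is a small illustrative example. The field names and wording are ours, not the paper's actual schema:

```python
# Illustrative only: an explicit vs. an implicit instruction for the same element.
# Field names and values are hypothetical, not the paper's annotation format.
target_element = {
    "type": "checkbox",
    "text": "Don't allow",
    "bbox": [412, 96, 460, 120],  # small box relative to a 1920x1080 screenshot
}

explicit_instruction = "Click the checkbox labeled 'Don't allow'."  # names the visible text directly
implicit_instruction = "Decline the app's permission request."      # relies on semantics and context
```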
To address this, they propose UI-E2I-Synth, a pipeline using GPT-4o to synthesize training instructions at large scale, and construct a new benchmark UI-I2E-Bench with richer annotations. Models fine-tuned on this synthesized data outperform previous state-of-the-art in GUI grounding.
Key Technical Advancements
UI-E2I-Synth (Instruction Synthesis Pipeline)
The pipeline has three main stages:
- Raw Data Collection & Parsing: Collect screenshot–metadata pairs across platforms (Web, Windows, Android) and use heuristic parsers to extract element attributes (type, content, bounding boxes), reducing hallucination risk.
- Referring Expression Generation: Use GPT-4o to generate both explicit referring expressions (directly referencing visual features) and implicit referring expressions (semantic/relational descriptions), building a candidate pool of expressions per element.
- Instruction Synthesis: Simulate user behavior by sampling an action type (e.g. click, input) and generating a full instruction that combines the referring expression, action, and contextual content. This produces millions of <screenshot, user instruction, element coordinates> triples.
Together, these stages allow the training set to be scaled far beyond what manual annotation could produce; a minimal sketch of the final synthesis step is shown below.
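The sketch assumes elements have already been parsed and referring expressions generated. Function names, the template approach, and the output schema are illustrative; the paper composes instructions with GPT-4o rather than fixed templates.

```python
import random

# Hypothetical templates standing in for GPT-4o instruction generation.
ACTION_TEMPLATES = {
    "click": "Click {ref}.",
    "input": "Type '{content}' into {ref}.",
}

def synthesize_instruction(element, referring_expressions, rng=random):
    """Sample an action type, pair it with a sampled (explicit or implicit)
    referring expression, and emit one <screenshot, instruction, coordinates> triple."""
    action = rng.choice(list(ACTION_TEMPLATES))
    ref = rng.choice(referring_expressions)
    content = element.get("example_input", "hello")  # contextual content for input actions
    return {
        "screenshot": element["screenshot_path"],
        "instruction": ACTION_TEMPLATES[action].format(ref=ref, content=content),
        "target_bbox": element["bbox"],
    }
```

Repeating this sampling per parsed element is what lets the pipeline scale to millions of triples without manual labeling.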
UI-I2E-Bench (New Benchmark for GUI Grounding)
To better test model generalization and harder cases, the authors build UI-I2E-Bench, with:
- Lower element-to-screen ratio (i.e. elements tend to be smaller)
- More balanced coverage of element types
- Explicit annotation of instruction implicitness (explicit vs implicit)
Compared to prior benchmarks like ScreenSpot, UI-I2E-Bench is more challenging and representative of real-world GUI scenarios.
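The element-to-screen ratio that characterizes benchmark difficulty is straightforward to compute from bounding boxes. The sketch below assumes an (x1, y1, x2, y2) pixel convention, which may differ from the benchmark's actual annotation format:

```python
def element_to_screen_ratio(bbox, screen_w, screen_h):
    """Area of the target element relative to the full screenshot.

    Lower values mean smaller, harder-to-hit targets; UI-I2E-Bench skews toward
    lower ratios than prior benchmarks. The (x1, y1, x2, y2) pixel convention is
    an assumption, not necessarily the benchmark's exact format.
    """
    x1, y1, x2, y2 = bbox
    element_area = max(0, x2 - x1) * max(0, y2 - y1)
    return element_area / (screen_w * screen_h)

# A 48x24 px checkbox on a 1920x1080 screenshot covers only ~0.06% of the screen.
print(f"{element_to_screen_ratio((412, 96, 460, 120), 1920, 1080):.4%}")  # 0.0556%
```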
Model Training & Evaluation
They fine-tune vision-language models (based on InternVL2 and Qwen2-VL) on the synthesized data (≈9.9M instructions). The fine-tuned models (UI-I2E-VLM) outperform prior methods (SeeClick, OS-Atlas, UGround, ShowUI) across multiple benchmarks: ScreenSpot, ScreenSpot-Pro, and the new UI-I2E-Bench.
Notably, UI-I2E-VLM-7B achieves a ~9.7% relative improvement in average accuracy over OS-Atlas-7B despite using fewer training instructions, with especially strong gains on implicit instructions and small elements.
They further test on OSWorld, a live GUI agent benchmark with open-ended tasks, showing the model’s practical grounding utility when integrated with a planner.
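For reference, grounding accuracy on ScreenSpot-style benchmarks is commonly scored by checking whether the predicted click point falls inside the ground-truth element box. A minimal sketch of that metric follows; the official evaluation scripts may differ in details such as coordinate normalization:

```python
def grounding_accuracy(pred_points, gt_boxes):
    """Fraction of examples whose predicted (x, y) click point lies inside the
    ground-truth (x1, y1, x2, y2) element box. A simplified point-in-box check,
    not the paper's official evaluation code."""
    hits = sum(
        1
        for (x, y), (x1, y1, x2, y2) in zip(pred_points, gt_boxes)
        if x1 <= x <= x2 and y1 <= y <= y2
    )
    return hits / len(gt_boxes)

# e.g. grounding_accuracy([(430, 110)], [(412, 96, 460, 120)]) == 1.0
```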
Practical Implications and Use Cases
Improved GUI agents: Better grounding from natural instructions enables more robust GUI agents that can assist users by automating tasks, navigating app interfaces, or responding to user commands across platforms.
Low-cost scaling of training data: By using GPT-4o to generate high-quality instructions, the method reduces dependency on manual annotation, making it easier to scale training sets to more applications or domains.
Insight into benchmarking bias: Existing datasets may overestimate performance by oversampling large, obvious elements or explicit instructions. The new benchmark reveals performance gaps that models must close to function in realistic settings.
Cross-platform robustness: Because the method works from vision (screenshots) rather than relying on GUI metadata (which is platform-dependent), it is more adaptable to different systems, UI styles, and potentially unseen tools or apps.
Challenges and Limitations
Language coverage: The work is limited to English instructions. It does not yet handle multilingual or cross-lingual grounding, which is essential in practice for global GUI agents.
Scale and diversity: While the synthesized dataset is large, GUI environments are vast and variable. There may still be unseen UI designs, rare element types, or novel layouts not covered in the synthetic set.
Hallucination / correctness risks: Even with heuristic parsing and controlled prompting, using GPT models to generate instructions can introduce erroneous or misleading samples. The quality of synthesized instructions depends heavily on the correctness of underlying element parsing and the GPT prompt design.
Agent security / misuse risk: As the authors note, powerful automated GUI agents could be misused (e.g. automating phishing, brute-force attacks) if deployed without safeguards.
Future Outlook and Considerations
Future work could explore:
- Extending instruction synthesis to multilingual settings and varied human styles
- Incorporating online feedback or human-in-the-loop correction to refine synthesized data
- Handling more complex GUI interactions (e.g. drag, gestures, dynamic overlays)
- Expanding benchmarks to cover more platforms, UI paradigms, and real-world apps
- Embedding safety and validation checks to prevent misuse when deployed in real systems
If integrated into larger agent systems, this approach could accelerate deployment of GUI-interaction agents that understand user instructions more robustly across domains.
Conclusion
This paper makes a strong contribution by easing the dataset bottleneck in GUI grounding with a scalable instruction synthesis pipeline, and by exposing limitations in existing benchmarks through a more challenging evaluation suite. The empirical results show that models trained on synthesized data can surpass the prior state of the art in both explicit and implicit grounding, particularly in challenging settings (small elements, relational references). This work brings us closer to practical, vision-based GUI agents, though challenges in diversity, language coverage, and safety remain.