This library curates research on text-to-app generation ("vibe coding"), evaluation of multimodal LLMs and related tooling, security, and extended workflows such as MCP servers.
2025 · Evgenii Kniazev, Arseny Kravchenko, Igor Rekun, James Broadhead, Nikita Shamgunov, Pranav Sah, Pratik Nichite, Ivan Yamshchikov
The paper introduces app.build, an open-source framework that enables reliable large-scale AI-driven application generation through validation pipelines and environment scaffolding.
A focused evaluation of LLM agents' web-search abilities using a standardized DuckDuckGo interface, a carefully constructed multi-hop question set, and an exact-match scoring scheme that isolates the final answer from explanatory text.
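To make that scoring scheme concrete, here is a minimal sketch of an exact-match scorer that isolates the final answer; the "Final answer:" delimiter convention is an illustrative assumption, not the benchmark's actual format.

```python
import re

def exact_match(response: str, gold: str) -> bool:
    """Score an agent response by exact match on the final answer only.

    Assumes (hypothetically) the agent is prompted to end its response
    with a line of the form 'Final answer: <answer>'; all explanatory
    text before that line is ignored by the scorer.
    """
    found = re.search(r"Final answer:\s*(.+)", response, re.IGNORECASE)
    if not found:
        return False
    return found.group(1).strip().lower() == gold.strip().lower()

# Explanations do not affect the score; only the final answer is compared.
response = "The author moved to Lyon in 1998, so...\nFinal answer: Lyon"
assert exact_match(response, "Lyon")
```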
DesignBench is a comprehensive benchmark for evaluating Multimodal Large Language Models (MLLMs) in automated front-end engineering across multiple frameworks and tasks, including design generation, editing, and repair.
EvoAgentX is an open-source platform that automatically generates, executes, and evolves multi-agent workflows, integrating three optimization algorithms to refine prompts, tools, and workflow topology, with sizable gains across reasoning, coding, math, and real-world agent tasks.
A large-scale, privacy-preserving analysis of consumer ChatGPT usage (May 2024–July 2025) showing non-work use now dominates (≈73%) and that value is delivered primarily through decision support and writing assistance.
The paper introduces SPEC, a structured intermediate representation for UI design that externalizes designer intent and enables controllable, iterative generation. Built on SPEC, the SpecifyUI system extracts specs from references, supports targeted edits, and renders high-fidelity UIs with a multi-agent pipeline.
A systematic review of 38 peer-reviewed studies (2022–2025) mapping how LLMs are integrated across the UI/UX lifecycle, the best practices that enable effective use, and the limitations that still constrain reliability, creativity, and adoption.
The paper proposes a method to synthesize large-scale user instructions (explicit + implicit) for GUI screens using GPT-4o and uses it to train vision-language models for GUI instruction grounding, improving performance on challenging benchmarks.
UXAgent is an LLM-agent-based framework that automates usability testing for web designs, enabling researchers to simulate thousands of user personas and collect multimodal data to iterate on study designs before human-subject studies.
The paper introduces WebGen-Bench, a new benchmark to evaluate an LLM-based agent's ability to create functional, multi-file websites from scratch, and WebGen-Instruct, a corresponding training dataset.
This paper introduces EvalGen, a mixed-initiative system that aligns LLM-assisted evaluations with human preferences by combining automated assertion generation with human grading feedback.
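As a rough sketch of the alignment loop described above — candidate assertions are kept only when their verdicts agree with human grades — consider the following; the function names, threshold, and selection rule are illustrative assumptions, not EvalGen's actual interface.

```python
from typing import Callable, List, Tuple

Assertion = Callable[[str], bool]  # True means the LLM output passes

def select_aligned_assertions(
    candidates: List[Assertion],
    graded: List[Tuple[str, bool]],  # (LLM output, human grade: good?)
    min_agreement: float = 0.9,
) -> List[Assertion]:
    """Keep candidate assertions whose pass/fail verdicts agree with
    human grades often enough. A simplified stand-in for the
    mixed-initiative loop, where grades actually arrive incrementally
    and candidates are re-ranked as feedback accumulates."""
    selected = []
    for assertion in candidates:
        hits = sum(assertion(out) == label for out, label in graded)
        if hits / len(graded) >= min_agreement:
            selected.append(assertion)
    return selected

# Example: vet a trivial length-based assertion against two human grades.
candidates = [lambda out: len(out) > 10]
grades = [("a sufficiently detailed answer", True), ("too short", False)]
print(len(select_aligned_assertions(candidates, grades)))  # prints 1
```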
2021 · Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian, Clemens Winter, Philippe Tillet, Felipe Petroski Such, Dave Cummings, Matthias Plappert, Fotios Chantzis, Elizabeth Barnes, Ariel Herbert-Voss, William Hebgen Guss, Alex Nichol, Alex Paino, Nikolas Tezak, Jie Tang, Igor Babuschkin, Suchir Balaji, Shantanu Jain, William Saunders, Christopher Hesse, Andrew N. Carr, Jan Leike, Josh Achiam, Vedant Misra, Evan Morikawa, Alec Radford, Matthew Knight, Miles Brundage, Mira Murati, Katie Mayer, Peter Welinder, Bob McGrew, Dario Amodei, Sam McCandlish, Ilya Sutskever, Wojciech Zaremba
This paper introduces Codex, a GPT-based model fine-tuned on public GitHub code, and evaluates its ability to generate Python functions from natural language docstrings. The authors propose a new benchmark (HumanEval) and show Codex significantly outperforms GPT-3 and other models on functional correctness.
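The paper's headline metric is pass@k: generate n samples per problem, count the c that pass the unit tests, and estimate the probability that at least one of k samples is correct as 1 − C(n−c, k)/C(n, k). The numerically stable product form below follows the estimator given in the paper.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from the Codex paper.

    n: total samples generated for a problem
    c: samples that passed all unit tests
    k: evaluation budget
    Computes 1 - C(n-c, k) / C(n, k) as a stable product,
    avoiding overflow from large binomial coefficients.
    """
    if n - c < k:
        return 1.0  # every size-k subset contains a correct sample
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# For k=1 the estimator reduces to the raw success rate c/n.
print(round(pass_at_k(200, 34, 1), 4))  # 0.17
```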
The paper introduces SPoC, a framework and dataset for translating human-written pseudocode into functionally correct C++ programs via search guided by compilation-error–driven credit assignment.
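A heavily simplified sketch of what compilation-error-driven search can look like: each pseudocode line has several candidate C++ translations, combinations are compiled, and the compiler's reported error line is used to blame (and skip) the offending candidate. This brute-force product with naive blame stands in for SPoC's scored beam search and learned credit assignment; all names here are illustrative.

```python
import itertools
import re
import subprocess
import tempfile

def first_error_line(source: str):
    """Compile with g++ and return the 0-based line of the first
    reported error, or None if the program compiles."""
    with tempfile.NamedTemporaryFile("w", suffix=".cpp", delete=False) as f:
        f.write(source)
        path = f.name
    result = subprocess.run(
        ["g++", "-fsyntax-only", path], capture_output=True, text=True
    )
    if result.returncode == 0:
        return None
    found = re.search(r":(\d+):\d+: error", result.stderr)
    return int(found.group(1)) - 1 if found else 0

def search(candidates):
    """candidates[i] is a list of single-line candidate C++ translations
    of pseudocode line i. Try combinations, skipping any that reuse a
    candidate already blamed for a compile error on its line."""
    blamed = [set() for _ in candidates]  # per-line ruled-out candidates
    for combo in itertools.product(*(range(len(c)) for c in candidates)):
        if any(j in blamed[i] for i, j in enumerate(combo)):
            continue
        source = "\n".join(candidates[i][j] for i, j in enumerate(combo))
        err = first_error_line(source)
        if err is None:
            return source  # compiles; functional tests would run next
        blamed[err].add(combo[err])  # naive blame: rule this candidate out
    return None
```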