Text-to-App

Research Library

Curated Research on Text-to-App

This library curates research on text-to-app development (also known as "vibe coding"), evaluation of multimodal LLMs and related tools, security, and extended workflows such as MCP servers.

2025 Huanzhi Mao, Raymond Tsao, Jingzhuo Zhou, Shishir G. Patil, Joseph E. Gonzalez

BFCL V4 - Web Search

A focused evaluation of LLM agents' web-search abilities using a standardized DuckDuckGo interface, a carefully constructed multi-hop question set, and an exact-match scoring scheme that isolates the final answer from explanatory text (a toy version of the scoring idea is sketched below).

Read the summary
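
A toy illustration of the scoring idea described above, assuming a hypothetical "Answer:" response convention; BFCL V4's actual answer-extraction and normalization rules are defined by the benchmark itself.

```python
import re

def extract_final_answer(response: str) -> str:
    """Isolate the final answer from any explanatory text.
    The 'Answer:' prefix is a hypothetical convention for this
    sketch, not BFCL V4's actual response format."""
    match = re.search(r"Answer:\s*(.+)", response, re.IGNORECASE)
    return (match.group(1) if match else response).strip().lower()

def exact_match(response: str, gold: str) -> bool:
    """Score a response as correct only if its isolated final
    answer exactly matches the gold answer after normalization."""
    return extract_final_answer(response) == gold.strip().lower()

# The explanation preceding the answer does not affect the score.
assert exact_match("The capital was moved in 1997.\nAnswer: Astana", "Astana")
```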

2025 Yingxu Wang, Siwei Liu, Jinyuan Fang, Zaiqiao Meng

EvoAgentX - An Automated Framework for Evolving Agentic Workflows

EvoAgentX is an open-source platform that automatically generates, executes, and evolves multi-agent workflows. It integrates three optimization algorithms to refine prompts, tools, and workflow topology, delivering sizable gains across reasoning, coding, math, and real-world agent tasks (a toy evolution loop is sketched below).

Read the summary
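
A minimal sketch of the generate-execute-evolve loop the summary describes. All callables (`propose_workflow`, `execute`, `mutate`) are hypothetical stand-ins; EvoAgentX's real optimizers and API differ.

```python
def evolve_workflow(task, propose_workflow, execute, mutate,
                    generations=10, population=4):
    """Hypothetical evolutionary loop: generate candidate multi-agent
    workflows, score each by executing it on the task, and keep
    mutating the population around the best candidate found so far."""
    candidates = [propose_workflow(task) for _ in range(population)]
    best, best_score = None, float("-inf")
    for _ in range(generations):
        scored = [(execute(task, wf), wf) for wf in candidates]
        score, wf = max(scored, key=lambda pair: pair[0])
        if score > best_score:
            best, best_score = wf, score
        # Mutation could edit prompts, swap tools, or rewire topology.
        candidates = [mutate(best) for _ in range(population)]
    return best, best_score
```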

2025 Aaron Chatterji, Tom Cunningham, David Deming, Zoë Hitzig, Christopher Ong, Carl Shan, Kevin Wadman

How People Use ChatGPT

A large-scale, privacy-preserving analysis of consumer ChatGPT usage (May 2024–July 2025) showing that non-work use now dominates (≈73%) and that value is delivered primarily through decision support and writing assistance.

Read the summary

2025 Yuxuan Lu, Bingsheng Yao, Hansu Gu, Jing Huang, Zheshen (Jessie) Wang, Yang Li, Jiri Gesi, Qi He, Toby Jia-Jun Li, Dakuo Wang

UXAgent - An LLM-Agent-Based Usability Testing Framework for Web Design

UXAgent is an LLM-agent-based framework that automates usability testing for web designs. It lets researchers simulate thousands of user personas and collect multimodal session data, so study designs can be refined before running human-subject studies (a thin simulation driver is sketched below).

Read the summary
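
A thin, hypothetical driver for the kind of persona-scale simulation the summary describes; `Persona` and `simulate_session` are stand-ins, not UXAgent's actual API.

```python
from dataclasses import dataclass

@dataclass
class Persona:
    name: str
    goal: str        # e.g. "buy a gift under $30"
    traits: str      # e.g. "impatient, low tech literacy"

def run_usability_study(personas, simulate_session):
    """Run one simulated session per persona and collect each
    agent's action/observation trace for later analysis.
    `simulate_session` stands in for an LLM-agent browsing loop."""
    return {p.name: simulate_session(p) for p in personas}
```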

2021 Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian, Clemens Winter, Philippe Tillet, Felipe Petroski Such, Dave Cummings, Matthias Plappert, Fotios Chantzis, Elizabeth Barnes, Ariel Herbert-Voss, William Hebgen Guss, Alex Nichol, Alex Paino, Nikolas Tezak, Jie Tang, Igor Babuschkin, Suchir Balaji, Shantanu Jain, William Saunders, Christopher Hesse, Andrew N. Carr, Jan Leike, Josh Achiam, Vedant Misra, Evan Morikawa, Alec Radford, Matthew Knight, Miles Brundage, Mira Murati, Katie Mayer, Peter Welinder, Bob McGrew, Dario Amodei, Sam McCandlish, Ilya Sutskever, Wojciech Zaremba

Evaluating Large Language Models Trained on Code

This paper introduces Codex, a GPT model fine-tuned on publicly available GitHub code, and evaluates its ability to generate Python functions from natural-language docstrings. The authors propose the HumanEval benchmark and show that Codex significantly outperforms GPT-3 and other models on functional correctness, measured with the pass@k metric sketched below.

Read the summary
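
The paper measures functional correctness with pass@k: generate n samples per problem, count the number c that pass the unit tests, and estimate the probability that at least one of k drawn samples is correct. The estimator below follows the paper's definition, pass@k = 1 - C(n-c, k)/C(n, k), computed as a numerically stable product.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k).
    n: samples generated per problem, c: samples passing the
    unit tests, k: evaluation budget."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Example: 200 samples, 20 correct -> pass@1 ≈ 0.1
print(pass_at_k(200, 20, 1))
```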

2019 Sumith Kulal, Panupong Pasupat, Kartik Chandra, Mina Lee, Oded Padon, Alex Aiken, Percy Liang

SPoC - Search-based Pseudocode to Code

The paper introduces SPoC, a framework and dataset for translating human-written pseudocode into functionally correct C++ programs, using search guided by compilation-error-driven credit assignment (a simplified version of the search loop is sketched below).

Read the summary
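
A simplified sketch of search with compilation-error-driven credit assignment, in the spirit of the summary above; the candidate format and blame function are hypothetical, and SPoC's actual algorithm differs in its details.

```python
import heapq

def synthesize(candidates, compile_check):
    """Best-first search over per-line translations.
    candidates[i]: list of (cost, code) translations for pseudocode
    line i, sorted with the cheapest (most likely) first.
    compile_check(lines): None if the program compiles, else the
    index of the line blamed for the compilation error."""
    def cost(choice):
        return sum(candidates[i][j][0] for i, j in enumerate(choice))

    start = tuple(0 for _ in candidates)   # cheapest candidate per line
    heap, seen = [(cost(start), start)], {start}
    while heap:
        _, choice = heapq.heappop(heap)
        program = [candidates[i][j][1] for i, j in enumerate(choice)]
        blamed = compile_check(program)
        if blamed is None:
            return program                  # program compiles
        # Credit assignment: advance only the blamed line's candidate.
        if choice[blamed] + 1 < len(candidates[blamed]):
            nxt = choice[:blamed] + (choice[blamed] + 1,) + choice[blamed + 1:]
            if nxt not in seen:
                seen.add(nxt)
                heapq.heappush(heap, (cost(nxt), nxt))
    return None                             # candidate space exhausted
```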