This library curates research on text-to-app generation ("vibe coding"), evaluation of multimodal LLMs and related tooling, security, and extended workflows such as MCP servers.
2025 · Evgenii Kniazev, Arseny Kravchenko, Igor Rekun, James Broadhead, Nikita Shamgunov, Pranav Sah, Pratik Nichite, Ivan Yamshchikov
The paper introduces app.build, an open-source framework that enables reliable large-scale AI-driven application generation through validation pipelines and environment scaffolding.
A focused evaluation of LLM agents' web-search abilities using a standardized DuckDuckGo interface, a carefully constructed multi-hop question set, and an exact-match scoring scheme that isolates the final answer from explanatory text.
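To make that scoring scheme concrete, here is a minimal sketch of an exact-match scorer that isolates the final answer; the "Final answer:" delimiter convention is an illustrative assumption, not the benchmark's actual format.

```python
import re

def exact_match(response: str, gold: str) -> bool:
    """Score an agent response by exact match on the final answer only.

    Assumes (hypothetically) the agent is prompted to end its response
    with a line of the form 'Final answer: <answer>'; all explanatory
    text before that line is ignored by the scorer.
    """
    found = re.search(r"Final answer:\s*(.+)", response, re.IGNORECASE)
    if not found:
        return False
    return found.group(1).strip().lower() == gold.strip().lower()

# Explanations do not affect the score; only the final answer is compared.
response = "The author moved to Lyon in 1998, so...\nFinal answer: Lyon"
assert exact_match(response, "Lyon")
```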
DesignBench is a comprehensive benchmark for evaluating Multimodal Large Language Models (MLLMs) in automated front-end engineering across multiple frameworks and tasks, including design generation, editing, and repair.
EvoAgentX is an open-source platform that automatically generates, executes, and evolves multi-agent workflows, integrating three optimization algorithms to refine prompts, tools, and workflow topology, with sizable gains across reasoning, coding, math, and real-world agent tasks.
A large-scale, privacy-preserving analysis of consumer ChatGPT usage (May 2024–July 2025) showing non-work use now dominates (≈73%) and that value is delivered primarily through decision support and writing assistance.
The paper introduces SPEC, a structured intermediate representation for UI design that externalizes designer intent and enables controllable, iterative generation. Built on SPEC, the SpecifyUI system extracts specs from references, supports targeted edits, and renders high-fidelity UIs with a multi-agent pipeline.
A systematic review of 38 peer-reviewed studies (2022–2025) mapping how LLMs are integrated across the UI/UX lifecycle, the best practices that enable effective use, and the limitations that still constrain reliability, creativity, and adoption.
The paper proposes a method to synthesize large-scale user instructions (explicit + implicit) for GUI screens using GPT-4o and uses it to train vision-language models for GUI instruction grounding, improving performance on challenging benchmarks.
UXAgent is an LLM-agent-based framework that automates usability testing for web designs, enabling researchers to simulate thousands of user personas and collect multimodal data to iterate on study designs before human-subject studies.
The paper introduces WebGen-Bench, a new benchmark to evaluate an LLM-based agent's ability to create functional, multi-file websites from scratch, and WebGen-Instruct, a corresponding training dataset.
This paper introduces EvalGen, a mixed-initiative system that aligns LLM-assisted evaluations with human preferences by combining automated assertion generation with human grading feedback.
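As a rough sketch of the alignment loop described above — candidate assertions are kept only when their verdicts agree with human grades — consider the following; the function names, threshold, and selection rule are illustrative assumptions, not EvalGen's actual interface.

```python
from typing import Callable, List, Tuple

Assertion = Callable[[str], bool]  # True means the LLM output passes

def select_aligned_assertions(
    candidates: List[Assertion],
    graded: List[Tuple[str, bool]],  # (LLM output, human grade: good?)
    min_agreement: float = 0.9,
) -> List[Assertion]:
    """Keep candidate assertions whose pass/fail verdicts agree with
    human grades often enough. A simplified stand-in for the
    mixed-initiative loop, where grades actually arrive incrementally
    and candidates are re-ranked as feedback accumulates."""
    selected = []
    for assertion in candidates:
        hits = sum(assertion(out) == label for out, label in graded)
        if hits / len(graded) >= min_agreement:
            selected.append(assertion)
    return selected

# Example: vet a trivial length-based assertion against two human grades.
candidates = [lambda out: len(out) > 10]
grades = [("a sufficiently detailed answer", True), ("too short", False)]
print(len(select_aligned_assertions(candidates, grades)))  # prints 1
```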
2021 · Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian, Clemens Winter, Philippe Tillet, Felipe Petroski Such, Dave Cummings, Matthias Plappert, Fotios Chantzis, Elizabeth Barnes, Ariel Herbert-Voss, William Hebgen Guss, Alex Nichol, Alex Paino, Nikolas Tezak, Jie Tang, Igor Babuschkin, Suchir Balaji, Shantanu Jain, William Saunders, Christopher Hesse, Andrew N. Carr, Jan Leike, Josh Achiam, Vedant Misra, Evan Morikawa, Alec Radford, Matthew Knight, Miles Brundage, Mira Murati, Katie Mayer, Peter Welinder, Bob McGrew, Dario Amodei, Sam McCandlish, Ilya Sutskever, Wojciech Zaremba
This paper introduces Codex, a GPT-based model fine-tuned on public GitHub code, and evaluates its ability to generate Python functions from natural language docstrings. The authors propose a new benchmark (HumanEval) and show Codex significantly outperforms GPT-3 and other models on functional correctness.
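The paper's headline metric is pass@k: generate n samples per problem, count the c that pass the unit tests, and estimate the probability that at least one of k samples is correct as 1 − C(n−c, k)/C(n, k). The numerically stable product form below follows the estimator given in the paper.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from the Codex paper.

    n: total samples generated for a problem
    c: samples that passed all unit tests
    k: evaluation budget
    Computes 1 - C(n-c, k) / C(n, k) as a stable product,
    avoiding overflow from large binomial coefficients.
    """
    if n - c < k:
        return 1.0  # every size-k subset contains a correct sample
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# For k=1 the estimator reduces to the raw success rate c/n.
print(round(pass_at_k(200, 34, 1), 4))  # 0.17
```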
The paper introduces SPoC, a framework and dataset for translating human-written pseudocode into functionally correct C++ programs via search guided by compilation-error–driven credit assignment.
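A heavily simplified sketch of what compilation-error-driven search can look like: each pseudocode line has several candidate C++ translations, combinations are compiled, and the compiler's reported error line is used to blame (and skip) the offending candidate. This brute-force product with naive blame stands in for SPoC's scored beam search and learned credit assignment; all names here are illustrative.

```python
import itertools
import re
import subprocess
import tempfile

def first_error_line(source: str):
    """Compile with g++ and return the 0-based line of the first
    reported error, or None if the program compiles."""
    with tempfile.NamedTemporaryFile("w", suffix=".cpp", delete=False) as f:
        f.write(source)
        path = f.name
    result = subprocess.run(
        ["g++", "-fsyntax-only", path], capture_output=True, text=True
    )
    if result.returncode == 0:
        return None
    found = re.search(r":(\d+):\d+: error", result.stderr)
    return int(found.group(1)) - 1 if found else 0

def search(candidates):
    """candidates[i] is a list of single-line candidate C++ translations
    of pseudocode line i. Try combinations, skipping any that reuse a
    candidate already blamed for a compile error on its line."""
    blamed = [set() for _ in candidates]  # per-line ruled-out candidates
    for combo in itertools.product(*(range(len(c)) for c in candidates)):
        if any(j in blamed[i] for i, j in enumerate(combo)):
            continue
        source = "\n".join(candidates[i][j] for i, j in enumerate(combo))
        err = first_error_line(source)
        if err is None:
            return source  # compiles; functional tests would run next
        blamed[err].add(combo[err])  # naive blame: rule this candidate out
    return None
```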