Text-to-App

Research / 2025

DesignBench: A Comprehensive Benchmark for MLLM-based Front-end Code Generation

Jingyu Xiao, Ming Wang, Man Ho Lam, Yuxuan Wan, Junliang Liu, Yintong Huo, Michael R. Lyu

  • LLMs
  • Code Generation
  • Front-end Development
  • Benchmarking
  • Design & UX

In brief

DesignBench is a comprehensive benchmark for evaluating Multimodal Large Language Models (MLLMs) in automated front-end engineering across multiple frameworks and tasks, including design generation, editing, and repair.

Executive Summary

This paper introduces DesignBench, a comprehensive benchmark for assessing the capabilities of Multimodal Large Language Models (MLLMs) in automated front-end engineering. It addresses critical gaps in existing evaluations by incorporating modern front-end frameworks (React, Vue, Angular) and covering three essential real-world tasks: design generation, editing, and repair. For engineers, designers, and product managers, DesignBench offers concrete insight into where MLLMs already accelerate the web development workflow and where significant challenges remain in achieving robust, framework-integrated, human-quality code generation.

Key Technical Advancements

Multi-Framework Evaluation: DesignBench is the first benchmark to integrate popular front-end frameworks like React, Vue, and Angular, in addition to vanilla HTML/CSS. This allows for a more realistic assessment of MLLMs’ ability to generate framework-specific syntax and leverage component-based paradigms.
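
As an illustration (invented, not a benchmark sample), the sketch below shows the kind of framework-specific output the React track expects: a typed, reusable component rather than one-off markup.

    import React from "react";

    // Illustrative only: the same card UI as a typed, reusable React
    // component, versus a one-off <div class="card">...</div> in vanilla
    // HTML/CSS. Component and prop names are hypothetical.
    interface CardProps {
      title: string;
      body: string;
    }

    export function Card({ title, body }: CardProps) {
      return (
        <div className="card">
          <h2 className="card-title">{title}</h2>
          <p className="card-body">{body}</p>
        </div>
      );
    }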

Comprehensive Task Coverage: Beyond initial UI code generation, DesignBench evaluates MLLMs on two critical, iterative tasks in real-world development: design editing (modifying existing code based on instructions) and design repair (fixing UI display issues). This provides a holistic view of MLLM utility across the entire front-end development lifecycle.
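
For concreteness, here is a hypothetical design-edit task in the benchmark's spirit (instruction and component invented for illustration): the model receives working code plus an instruction and must return a minimally modified version.

    import React from "react";

    // Hypothetical instruction: "Make the call-to-action button full-width
    // and disable it while the form is submitting."
    //
    // Before: <button className="cta">Sign up</button>
    //
    // Expected after:
    export function CallToAction({ submitting }: { submitting: boolean }) {
      return (
        <button className="cta" style={{ width: "100%" }} disabled={submitting}>
          Sign up
        </button>
      );
    }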

Multi-Dimensional Analysis: The benchmark enables detailed analysis across various dimensions, including task difficulty (visual complexity, instruction complexity, issue severity), input context modalities (code-only, image-only, multimodal), and code metrics (compilation success, code modification similarity, MLLM-as-judge scores).
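
As one assumed instantiation (DesignBench's exact formula may differ), a code-modification-similarity score can be computed as a line-level longest-common-subsequence ratio between the model's edited file and the reference edit:

    // Assumed metric sketch: line-level LCS ratio; 1.0 means the files match.
    function lcsLength(a: string[], b: string[]): number {
      const dp = Array.from({ length: a.length + 1 }, () =>
        new Array<number>(b.length + 1).fill(0),
      );
      for (let i = 1; i <= a.length; i++) {
        for (let j = 1; j <= b.length; j++) {
          dp[i][j] =
            a[i - 1] === b[j - 1]
              ? dp[i - 1][j - 1] + 1
              : Math.max(dp[i - 1][j], dp[i][j - 1]);
        }
      }
      return dp[a.length][b.length];
    }

    function modificationSimilarity(candidate: string, reference: string): number {
      const c = candidate.split("\n");
      const r = reference.split("\n");
      return lcsLength(c, r) / Math.max(c.length, r.length);
    }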

Real-world Data Curation: DesignBench comprises 900 real-world webpage samples spanning 11 topics, 9 edit types, and 6 issue categories, meticulously curated from GitHub projects, top-visited websites, and platforms such as Vercel's v0. This ensures high fidelity to practical development scenarios.
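
A plausible record layout for one sample is sketched below; every field name is an assumption for illustration, not DesignBench's published schema.

    // Hypothetical shape of a DesignBench-style sample.
    type Framework = "html" | "react" | "vue" | "angular";
    type Task = "generation" | "edit" | "repair";

    interface DesignBenchSample {
      id: string;
      framework: Framework;
      task: Task;
      topic: string;           // one of the 11 topics
      editType?: string;       // one of the 9 edit types (edit tasks only)
      issueCategory?: string;  // one of the 6 issue categories (repair only)
      screenshotPath: string;  // rendered target or faulty UI
      sourceCode?: string;     // input code for edit/repair tasks
      origin: "github" | "top-site" | "v0";
    }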

Practical Implications & Use Cases

Technical impact: DesignBench highlights that MLLMs currently struggle with framework-specific syntax (JSX parsing, template syntax, TypeScript architecture) and component-based implementations, leading to compilation errors and diminished code reusability. This implies that while MLLMs can accelerate initial design-to-code, significant post-generation manual refinement is still required, particularly for framework-based projects. Future tool integration should focus on providing code edit location information and clear repair issue statements to aid MLLMs.
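
To make that recommendation concrete, here is a minimal prompt-builder sketch that passes an explicit edit location and issue statement to the model; the format and field names are assumptions, not the paper's template.

    // Sketch of the tooling recommendation above; the format is assumed.
    interface RepairContext {
      filePath: string;
      startLine: number;
      endLine: number;
      issue: string; // e.g. "submit button overlaps the footer on small screens"
    }

    function buildRepairPrompt(code: string, ctx: RepairContext): string {
      return [
        `Fix this UI issue: ${ctx.issue}`,
        `The faulty code is in ${ctx.filePath}, lines ${ctx.startLine}-${ctx.endLine}.`,
        `Modify only that region and return the complete corrected file.`,
        "",
        code,
      ].join("\n");
    }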

UX/UI implications: MLLMs demonstrate limitations in visual rendering accuracy and UI issue identification (e.g., alignment, crowding, occlusion). This means designers and product managers cannot fully trust MLLM-generated UIs without thorough visual inspection and debugging. Prototyping and design systems built on MLLMs will need robust human-in-the-loop validation and explicit error reporting mechanisms.

Strategic implications: The benchmark’s findings suggest that while MLLMs can boost productivity for vanilla HTML/CSS, their integration into complex framework-based workflows is still nascent. Organizations planning to leverage MLLMs for front-end development should manage expectations, focusing on MLLMs as powerful assistants rather than fully autonomous code generators, especially for editing and repairing tasks. Investments should prioritize MLLMs that show better understanding of context and component architecture.

Challenges and Limitations

Framework-Specific Syntax Challenges: MLLMs perform substantially worse on framework-based development (React, Vue, Angular) than on vanilla HTML/CSS, struggling with framework-specific syntax and failing to leverage framework features such as component-based composition and reuse.
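
An invented example of the kind of slip behind those compilation errors: HTML attribute habits leaking into JSX, which the TypeScript JSX checker rejects.

    import { useState } from "react";

    // A model often emits the HTML form, which fails in TSX:
    //   <div class="nav" onclick="toggle()">...</div>
    // The framework-correct form uses className and a function handler:
    export function Nav() {
      const [open, setOpen] = useState(false);
      return (
        <div className="nav" onClick={() => setOpen(!open)}>
          {open ? "Close menu" : "Open menu"}
        </div>
      );
    }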

Code Localization Deficiencies: In design edit and repair tasks, MLLMs face significant bottlenecks in accurately localizing the specific code segments that require modification, even when the generated code successfully compiles.
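
One way to quantify that bottleneck (an assumed measure, not the paper's metric) is the recall of reference-edited lines among the lines the model actually touched:

    // Naive positional diff; a real harness would use an LCS-based diff so
    // insertions do not shift every subsequent line.
    function changedLines(before: string[], after: string[]): Set<number> {
      const changed = new Set<number>();
      const n = Math.max(before.length, after.length);
      for (let i = 0; i < n; i++) {
        if (before[i] !== after[i]) changed.add(i);
      }
      return changed;
    }

    // Fraction of lines in the reference fix that the model also modified.
    function localizationRecall(model: Set<number>, reference: Set<number>): number {
      if (reference.size === 0) return 1;
      let hits = 0;
      for (const line of reference) if (model.has(line)) hits++;
      return hits / reference.size;
    }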

Visual Understanding Limitations: MLLM performance degrades moderately on large UI images and complex UI issues, pointing to a need for stronger visual analysis across generation, editing, and repair. Combining image and code inputs does not always yield significant improvements, suggesting that MLLMs underutilize visual information.

Poor UI Issue Identification: MLLMs consistently perform poorly in identifying UI design issues like occlusion, crowding, and alignment, with an average accuracy of only 27.14%.

Future Outlook & Considerations

Based on DesignBench's findings, future research should enrich MLLM training data with modern web development patterns and framework-specific best practices. Better multimodal information fusion is also crucial, particularly more effective visual-code alignment and specialized attention mechanisms. For adoption, development teams can meaningfully improve results by supplying explicit code edit locations, clearly stating the repair issue, and breaking complex instructions and large UI designs into simpler, atomic tasks (a sketch follows below). The long-term trajectory points to MLLMs becoming sophisticated co-pilots across the entire front-end workflow, supported by tools that intelligently merge visual and code context to overcome current limitations.
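
A minimal sketch of the atomic-task guidance, assuming a caller-supplied editWithModel callback (hypothetical, not a real API): each simple instruction is applied against the latest code state instead of bundling everything into one prompt.

    // Apply edits one atomic step at a time; editWithModel is hypothetical.
    async function applyAtomically(
      code: string,
      steps: string[],
      editWithModel: (code: string, instruction: string) => Promise<string>,
    ): Promise<string> {
      let current = code;
      for (const step of steps) {
        current = await editWithModel(current, step);
      }
      return current;
    }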

Conclusion

DesignBench provides an unprecedented, multi-faceted evaluation of MLLM capabilities in front-end code generation, editing, and repair. While MLLMs demonstrate promising potential, particularly with vanilla HTML/CSS and larger models, significant challenges remain in framework integration, code localization, and visual issue identification. For engineers, designers, and product managers, this means current text-to-app generation tools are powerful aids but not replacements for human expertise, especially in complex, framework-dependent projects. The benchmark offers clear guidance for future research and development to build more robust and intelligent MLLMs that can truly transform the front-end development process.