Research / 2021

Evaluating Large Language Models Trained on Code

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian, Clemens Winter, Philippe Tillet, Felipe Petroski Such, Dave Cummings, Matthias Plappert, Fotios Chantzis, Elizabeth Barnes, Ariel Herbert-Voss, William Hebgen Guss, Alex Nichol, Alex Paino, Nikolas Tezak, Jie Tang, Igor Babuschkin, Suchir Balaji, Shantanu Jain, William Saunders, Christopher Hesse, Andrew N. Carr, Jan Leike, Josh Achiam, Vedant Misra, Evan Morikawa, Alec Radford, Matthew Knight, Miles Brundage, Mira Murati, Katie Mayer, Peter Welinder, Bob McGrew, Dario Amodei, Sam McCandlish, Ilya Sutskever, Wojciech Zaremba

  • Code Generation
  • Program Synthesis
  • Software Engineering
  • Product Management
  • Functional Correctness
  • LLMs

In brief

This paper introduces Codex, a GPT-based model fine-tuned on public GitHub code, and evaluates its ability to generate Python functions from natural language docstrings. The authors propose a new benchmark (HumanEval) and show Codex significantly outperforms GPT-3 and other models on functional correctness.

Executive Summary

This paper presents Codex, a large language model fine-tuned on code, and evaluates its ability to generate functionally correct Python programs from natural language prompts. To benchmark this, the authors release HumanEval, a dataset of 164 hand-written programming problems, each with a function signature, docstring, and unit tests. The paper highlights both the strengths of Codex in solving real-world programming problems and its limitations, along with broader implications for engineers, designers, and product managers.
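HumanEval scores a completion by functional correctness: the generated function either passes the task's unit tests or it does not. A minimal sketch of such a check, running untrusted generated code in a separate process with a timeout (the function name and structure here are illustrative, not the paper's actual harness):

```python
import os
import subprocess
import sys
import tempfile

def run_candidate(candidate_code: str, test_code: str, timeout: float = 5.0) -> bool:
    """Execute a generated function plus its unit tests in a fresh
    interpreter process, returning True only if all assertions pass
    within the timeout. Process isolation is a minimal stand-in for the
    sandboxing the paper recommends for untrusted model output."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(candidate_code + "\n" + test_code)
        path = f.name
    try:
        result = subprocess.run(
            [sys.executable, path], capture_output=True, timeout=timeout
        )
        return result.returncode == 0  # nonzero on a failed assert or crash
    except subprocess.TimeoutExpired:
        return False  # infinite loops count as failures
    finally:
        os.remove(path)
```

A real evaluation harness would add resource limits and network isolation; this sketch only captures the pass/fail contract.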

Key Technical Advancements

  • Codex fine-tuned on GitHub code: Fine-tuning GPT models on large-scale code data (159 GB) yields significant improvements in program synthesis compared to general-purpose LLMs like GPT-3.
  • HumanEval benchmark: Introduction of a new dataset and evaluation framework based on unit-test functional correctness instead of heuristic metrics like BLEU.
  • Sampling strategy for correctness: Demonstration that repeated sampling and selection heuristics (e.g., highest mean log-probability) substantially increase success rates, solving up to 77.5% of problems with 100 samples.
  • Supervised fine-tuning (Codex-S): Training on curated standalone functions improves performance, making models more parameter-efficient and robust.
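The sampling results above are reported as pass@k: the probability that at least one of k samples solves a task. The paper estimates it without bias by drawing n ≥ k samples, counting the c that pass the unit tests, and computing 1 − C(n−c, k)/C(n, k) in a numerically stable product form:

```python
from math import prod

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples of which c are correct.

    Computes 1 - C(n-c, k) / C(n, k) as a running product to avoid
    overflowing binomial coefficients for large n."""
    if n - c < k:
        # Fewer than k incorrect samples exist, so any k-subset
        # must contain a correct one.
        return 1.0
    return 1.0 - prod(1.0 - k / i for i in range(n - c + 1, n + 1))
```

For example, with n = 10 samples, c = 3 correct, and k = 5, the estimate is 1 − C(7,5)/C(10,5) = 1 − 21/252 ≈ 0.917.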

Practical Implications & Use Cases

  • Technical impact: Codex can accelerate coding workflows, aid in code completion, and reduce time spent on boilerplate. However, human oversight remains critical due to potential errors and misalignment.
  • UX/UI implications: Enables natural language–driven code generation, lowering barriers for prototyping and experimentation. Designers may need to adapt workflows to integrate generated code responsibly.
  • Strategic implications: Adoption of Codex-like models could change software development economics, affecting labor markets, productivity, and differentiation. Companies integrating such tools must weigh productivity gains against risks of over-reliance, bias, and insecure code.
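In tool integrations that can show the user only one completion, the paper's selection heuristic of ranking candidates by highest mean token log-probability applies directly. A sketch, assuming each candidate arrives as a (code, per-token log-probabilities) pair — an illustrative data shape, not any specific API:

```python
def rank_by_mean_logprob(samples: list[tuple[str, list[float]]]) -> list[tuple[str, list[float]]]:
    """Order candidate completions best-first by mean token
    log-probability, the unit-test-free selection heuristic the paper
    found most effective for picking a single sample to surface."""
    return sorted(samples, key=lambda s: sum(s[1]) / len(s[1]), reverse=True)
```

When unit tests are available at selection time, running them (as in HumanEval) dominates any likelihood-based heuristic.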

Challenges and Limitations

  • Systematic errors in complex tasks: Codex struggles with long docstrings and multi-step logic, with performance dropping exponentially as the number of chained operations in a task increases.
  • Safety and security risks: Generated code can be misaligned, insecure, or biased, requiring careful oversight, sandboxing, and content controls.

Future Outlook & Considerations

The paper suggests that future research should focus on alignment, safety, and robustness of code generation systems. Broader impacts span economics, legal issues, bias, and environmental costs of training. Teams considering adoption must weigh efficiency gains against risks such as insecure outputs, over-reliance, and misalignment with user intent.

Conclusion

The study demonstrates that fine-tuned LLMs like Codex can significantly advance program synthesis, outperforming general LLMs and existing tools. At the same time, it highlights important safety, security, and societal considerations. For engineers, designers, and product managers, Codex represents a powerful but imperfect tool, best used with strong human oversight and thoughtful integration.