Research / 2024

Who Validates the Validators? Aligning LLM-Assisted Evaluation of LLM Outputs with Human Preferences

Shreya Shankar, J.D. Zamfirescu-Pereira, Björn Hartmann, Aditya G. Parameswaran, Ian Arawjo

  • LLMs
  • Benchmarking
  • Human-in-the-Loop
  • Prompt Engineering
  • Product Management
  • Design & UX
  • Software Engineering

In brief

This paper introduces EvalGen, a mixed-initiative system that aligns LLM-assisted evaluations with human preferences by combining automated assertion generation with human grading feedback.

Executive Summary

The paper addresses the growing reliance on LLMs to evaluate other LLMs, a practice that is powerful but risky because automated judgments can be misaligned with human expectations. The authors propose EvalGen, a system that leverages human-in-the-loop feedback to validate and refine automated evaluation metrics. This contribution is significant for engineers, designers, and product managers seeking reliable evaluation frameworks for AI systems.

Key Technical Advancements

  • EvalGen Workflow: EvalGen combines LLM-suggested evaluation criteria with user edits, then generates candidate assertions (code functions or LLM-based prompts) and aligns them using user-provided thumbs-up/down grades.
  • Assertion Selection Algorithm: The system calculates alignment scores based on coverage and false failure rates, automatically selecting the assertions most consistent with user preferences (see the sketch after this list).
  • Human-Centered Design: EvalGen is embedded in the open-source tool ChainForge and tested with practitioners, revealing criteria drift: users refine criteria as they grade outputs, which shapes evaluation iteratively.
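
To make the selection step concrete, below is a minimal sketch of coverage and false-failure-rate scoring over thumbs-up/down grades. The Candidate class, the harmonic-mean alignment score, and the FFR threshold are illustrative assumptions, not EvalGen's exact implementation.

    from dataclasses import dataclass
    from typing import Callable, Dict, List

    @dataclass
    class Candidate:
        criterion: str                  # the evaluation criterion this assertion implements
        name: str
        check: Callable[[str], bool]    # code- or LLM-backed assertion; True means "passes"

    def coverage(cand: Candidate, outputs: List[str], grades: List[bool]) -> float:
        """Fraction of thumbs-down outputs (grade=False) that the assertion also fails."""
        bad = [o for o, g in zip(outputs, grades) if not g]
        return sum(not cand.check(o) for o in bad) / len(bad) if bad else 0.0

    def false_failure_rate(cand: Candidate, outputs: List[str], grades: List[bool]) -> float:
        """Fraction of thumbs-up outputs (grade=True) that the assertion incorrectly fails."""
        good = [o for o, g in zip(outputs, grades) if g]
        return sum(not cand.check(o) for o in good) / len(good) if good else 0.0

    def select_assertions(candidates: List[Candidate], outputs: List[str],
                          grades: List[bool], max_ffr: float = 0.25) -> List[Candidate]:
        """Keep, per criterion, the candidate assertion best aligned with the human grades."""
        best: Dict[str, Candidate] = {}
        best_score: Dict[str, float] = {}
        for cand in candidates:
            cov = coverage(cand, outputs, grades)
            ffr = false_failure_rate(cand, outputs, grades)
            if ffr > max_ffr:
                continue  # too many false alarms on outputs the user liked
            denom = cov + (1 - ffr)
            score = 2 * cov * (1 - ffr) / denom if denom else 0.0  # harmonic mean (illustrative)
            if score > best_score.get(cand.criterion, -1.0):
                best[cand.criterion], best_score[cand.criterion] = cand, score
        return list(best.values())

In use, a step like this would rerun as the user adds or revises grades in the interface, so the selected assertion set keeps tracking the user's evolving criteria.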

Practical Implications & Use Cases

  • Technical impact: Provides a scalable approach to validating LLM pipelines, making automated evaluations more reliable for production systems in areas like medical records and e-commerce copywriting.
  • UX/UI implications: Highlights the importance of interactive evaluation interfaces where user feedback directly improves evaluation quality, supporting iterative design and auditing workflows.
  • Strategic implications: Enables organizations to reduce over-reliance on opaque LLM-based evaluations, improving trust, accountability, and compliance in AI-driven products.

Challenges and Limitations

  • Criteria Drift: Users’ evaluation criteria change as they interact with outputs, making it impossible to fully define evaluation rules upfront.
  • Trust in LLM-based Assertions: Practitioners find LLM-generated assertions harder to trust and maintain than code-based assertions, raising concerns for long-term deployment.

Future Outlook & Considerations

Future work should explore dynamic evaluator assistants that adjust criteria as users grade, integrate team-based grading workflows, and extend beyond binary judgments to continuous monitoring. The findings suggest that evaluation assistants must embrace iteration, messiness, and human feedback loops to remain effective in real-world LLMOps.

Conclusion

The paper demonstrates that validating the validators is essential for reliable LLM evaluation. EvalGen advances the field by embedding human judgment into automated evaluation pipelines, offering engineers, designers, and product managers a framework that is both practical and adaptive to real-world needs.