LangChain’s Align Evals closes the evaluator trust gap with prompt-level calibration

As businesses increasingly rely on AI models to judge whether their applications are functional and dependable, the discrepancies between those model-driven assessments and human evaluations have become more apparent.

To address this issue, LangChain has introduced Align Evals into LangSmith, a solution designed to close the gap between large language model-based evaluators and human preferences, thus minimizing noise. Align Evals allows LangSmith users to develop their own LLM-based evaluators and adjust them to better match company preferences. 

“A common challenge we frequently hear from teams is: ‘Our evaluation scores don’t align with what we’d expect a human on our team to conclude.’ This misalignment results in noisy comparisons and wasted time pursuing misleading signals,” LangChain wrote in a blog post.

LangChain is among the few platforms to incorporate LLM-as-a-judge, or model-driven evaluation of other models, directly into its testing dashboard.


The company said it built Align Evals on a paper by Amazon principal applied scientist Eugene Yan. In the paper, Yan lays out the framework for an application, also called AlignEval, that automates parts of the evaluation process.

Align Evals enables enterprises and other developers to iterate on evaluation prompts, compare alignment scores from human evaluators against LLM-generated scores, and establish a baseline alignment score.

LangChain stated that Align Evals “is the initial step in assisting you in building better evaluators.” Over time, the company plans to integrate analytics to monitor performance and to automate prompt optimization by generating prompt variations.

How to start 

Users first identify evaluation criteria for their application. Chat apps, for instance, generally require accuracy.

Next, users select the data they want sent for human review. These examples must illustrate both positive and negative cases so that human evaluators can get a comprehensive view of the application and assign a range of grades. Developers then manually assign scores for prompts or task goals that will serve as a benchmark.
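
As a rough illustration of what that hand-graded baseline might look like, here is a minimal sketch in Python. The field names and the 0/1 grading scale are assumptions for illustration, not LangSmith's actual schema; the point is simply that each example pairs an application input and output with a human-assigned score.

```python
# Illustrative only: a hand-graded baseline set. Field names and the 0/1
# grading scale are hypothetical, not LangSmith's actual schema.
baseline_examples = [
    {
        "input": "How do I reset my password?",
        "output": "Click 'Forgot password' on the login page and follow the emailed link.",
        "human_score": 1,  # graded accurate and complete
    },
    {
        "input": "Which plans do you offer?",
        "output": "We only offer a free tier.",  # suppose this is wrong for the product
        "human_score": 0,  # graded inaccurate
    },
]
```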

Developers then need to craft an initial prompt for the model evaluator and iterate using the alignment results from the human graders. 

“For instance, if your LLM consistently over-scores certain responses, try incorporating clearer negative criteria. Enhancing your evaluator score should be viewed as an iterative process. Explore more about best practices for iterating on your prompt in our documentation,” LangChain advised.
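
Continuing the sketch above, the iteration loop can be thought of as: prompt an LLM judge with each baseline example, then measure how often its grades agree with the human grades. The judge prompt wording, the `call_model` hook, and the simple agreement metric below are all assumptions for illustration, not LangSmith's actual implementation.

```python
from typing import Callable, Dict, List

# Hypothetical judge prompt; this is the piece you would iterate on
# against the human-graded baseline.
JUDGE_PROMPT = (
    "You are grading a chat application's answer for accuracy.\n"
    "Question: {input}\n"
    "Answer: {output}\n"
    "Reply with 1 if the answer is accurate and complete, otherwise reply 0."
)

def llm_judge(example: Dict, call_model: Callable[[str], str]) -> int:
    """Grade one example with an LLM judge.

    `call_model` is any function that sends a prompt to your model provider
    and returns its text reply (an assumption -- plug in your own client).
    """
    reply = call_model(JUDGE_PROMPT.format(input=example["input"], output=example["output"]))
    return 1 if reply.strip().startswith("1") else 0

def alignment_score(examples: List[Dict], call_model: Callable[[str], str]) -> float:
    """Fraction of examples where the judge agrees with the human grade --
    a simple stand-in for the baseline alignment score described above."""
    hits = sum(llm_judge(ex, call_model) == ex["human_score"] for ex in examples)
    return hits / len(examples)

if __name__ == "__main__":
    # Tiny demo with a stub model and one hand-graded example so the sketch
    # runs on its own; in practice call_model would hit a real LLM and the
    # examples would be the human-graded baseline built earlier.
    def stub_model(prompt: str) -> str:
        return "1"  # stand-in that always answers "accurate"

    demo = [{"input": "What is 2 + 2?", "output": "4", "human_score": 1}]
    print(f"Alignment with human baseline: {alignment_score(demo, stub_model):.2f}")
```

In this framing, each revision of the judge prompt (for example, adding clearer negative criteria) is re-scored against the same baseline until agreement with the human grades is acceptable.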

Growing number of LLM evaluations

Increasingly, enterprises are turning to evaluation frameworks to gauge the reliability, behavior, task alignment, and auditability of AI systems, including applications and agents. Being able to point to a clear score for how models or agents perform not only gives organizations the confidence to deploy AI applications but also makes it easier to compare different models.

Companies like Salesforce and AWS have started offering customers ways to evaluate performance. Salesforce’s Agentforce 3 features a command center that displays agent performance. AWS provides both human and automated evaluation on the Amazon Bedrock platform, where users can select the model to test their applications on, though these are not user-created model evaluators. OpenAI also offers model-based evaluation services.

Meta’s Self-Taught Evaluator builds on the same LLM-as-a-judge concept that LangSmith employs, though Meta has yet to make it a feature for any of its application-building platforms. 

As more developers and businesses seek easier evaluation and more customized methods to assess performance, more platforms will begin to offer integrated solutions for using models to evaluate other models, and many more will provide tailored options for enterprises. 
