Introducing Composo Align

Composo team

We've launched Composo Align, the foundation model for evaluating complex LLM applications on any custom criteria.

The Problem with Current Evaluation Methods

Today's approaches to LLM evaluation are severely limited. Most teams end up relying on manual human review because automated methods such as "LLM as a judge" fall short, especially when evaluating complex, real-world outputs:

  • Human evaluations are often subjective, vary from person to person, and struggle to scale, making them inefficient for production-level applications.
  • LLM as a judge can handle simple, binary criteria (e.g., checking grammar or confirming a structure) but lacks the nuanced understanding required for sophisticated evaluations. LLMs need lengthy, intricate scoring rubrics to perform effectively, and even then they are inconsistent and often fall short of expert human judgment.

Most real-world LLM applications require a nuanced approach that goes beyond simple checklists. Judging whether a legal argument is "compelling," whether a response shows "appropriate empathy," whether an answer is in "alignment with medical guidelines," or whether educational content has "clarity and engagement" demands flexibility, expertise, and context.

Enter Composo Align: A Radically New Approach to Evaluation

At Composo, we've developed Composo Align specifically for the most complex, nuanced use cases. Using a best-in-class foundation model built for evaluation, Composo Align learns to align LLM outputs with expert-level judgment, providing precise, scalable, and repeatable evaluations that meet the high demands of intricate applications.

How Composo Align Works

Composo Align combines a reward model with a language model architecture, making it highly specialized for determining LLM output quality. Trained on a large dataset of expert evaluations, Composo Align is dedicated to reliably assessing quality across a range of criteria.

Key details:

  • High Expert Alignment: Composo Align achieves up to 94% alignment with expert judgment, outperforming OpenAI's o1 (used as an LLM-as-a-judge) by a factor of 2.6.
  • Flexible Guidelines: Composo Align evaluates outputs based on guidelines you provide, known as a constitution.
  • Output Scoring: Composo Align generates a precise score from 0 to 1 for each input-output pair, providing a clear, quantitative measure of performance.
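To make the scoring model above concrete, here is a minimal Python sketch of how the pieces fit together on the caller's side: a constitution, an input-output pair, and a score between 0 and 1. It does not call the Composo API. The names `EvaluationCase`, `validate_score`, and `summarize`, and the 0.5 pass threshold, are assumptions made for illustration and are not taken from Composo's documentation.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class EvaluationCase:
    """One input-output pair judged against a constitution (hypothetical shape)."""
    constitution: str   # the guideline text you provide
    model_input: str    # what was sent to the LLM application
    model_output: str   # what the application returned


def validate_score(score: float) -> float:
    """Check that a score lies in the 0-to-1 range described above."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score {score!r} is outside [0, 1]")
    return score


def summarize(scores: list[float], threshold: float = 0.5) -> dict[str, float]:
    """Aggregate per-pair scores into a mean and a pass rate.

    The 0.5 threshold is an arbitrary illustration, not a Composo default.
    """
    checked = [validate_score(s) for s in scores]
    return {
        "mean": mean(checked),
        "pass_rate": sum(s >= threshold for s in checked) / len(checked),
    }


# Made-up example data; the scores here are placeholders, not real model output.
case = EvaluationCase(
    constitution="Reward responses that show appropriate empathy.",
    model_input="I just lost my job and I don't know what to do.",
    model_output="I'm sorry to hear that. Let's look at some next steps together.",
)
report = summarize([0.25, 0.75])
```

Because each pair gets a single number, tracking a mean or a pass rate across a test set is enough to compare two versions of an application against the same constitution.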

Key Characteristics of Composo Align

  1. General Capability Across Complex Domains

    Composo Align's extensive training on diverse datasets makes it highly capable across various use cases, from creative generation to highly structured outputs. This general capability enables it to excel in specialized, complex fields like healthcare, finance, and legal.

  2. Personalization for Specific Use Cases

    While Composo Align is generally capable, we know that many applications require highly customized evaluation. Composo Align's framework allows for easy personalization by incorporating human preference data to adapt its core model to your unique requirements.
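As an illustration of what human preference data can look like, the sketch below represents each expert judgment as a pair of outputs for the same prompt, one preferred and one rejected, and measures how often a scorer ranks the preferred output higher. This is one common way to check agreement with expert preferences. It is not a description of how Composo personalizes its model or computes its alignment figure. `PreferencePair`, `agreement_rate`, and the toy `length_scorer` are hypothetical names used only here.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class PreferencePair:
    """An expert's choice between two outputs for one prompt (hypothetical shape)."""
    prompt: str
    preferred: str   # the output the expert judged better
    rejected: str    # the output the expert judged worse


# Any function mapping (prompt, output) to a score in [0, 1].
Scorer = Callable[[str, str], float]


def agreement_rate(pairs: Sequence[PreferencePair], scorer: Scorer) -> float:
    """Fraction of pairs where the scorer ranks the expert-preferred output higher."""
    if not pairs:
        raise ValueError("need at least one preference pair")
    agreed = sum(
        scorer(p.prompt, p.preferred) > scorer(p.prompt, p.rejected)
        for p in pairs
    )
    return agreed / len(pairs)


def length_scorer(prompt: str, output: str) -> float:
    """Toy stand-in for a real evaluator: longer outputs score higher, capped at 1."""
    return min(len(output) / 100, 1.0)


pairs = [
    PreferencePair("Explain the ruling.", "A detailed, well-argued answer.", "Short."),
    PreferencePair("Explain the ruling.", "Concise.", "A long but rambling reply."),
]
rate = agreement_rate(pairs, length_scorer)
```

A scorer that tracks expert judgment well pushes this rate toward 1. The length-based stand-in agrees on only one of the two made-up pairs, which is the kind of gap that preference data is meant to expose.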

  3. Continuous Learning and Adaptability

    Evaluation is not static: quality standards evolve as applications and user expectations shift. Composo Align is designed to continually learn and update in real time, refining its measure of quality as more data becomes available.

How to Access

Today we're launching a public API for Composo Align, giving you direct access to our highly capable evaluation model so you can test it on your own use cases. Try out the API for free here.

Get in touch to hear more or to get access to your own personalised version based on our most powerful evaluation models here.