Evaluation concepts
The quality and development speed of AI applications are often limited by the availability of high-quality evaluation datasets and metrics, which enable you to both optimize and test your applications.
LangSmith makes building high-quality evaluations easy. This guide explains the LangSmith evaluation framework and AI evaluation techniques more broadly. The building blocks of the LangSmith framework are:
- Datasets: Collections of test inputs and reference outputs.
- Evaluators: Functions for scoring outputs.
Datasets
A dataset is a collection of examples used for evaluating an application. An example is a pair of a test input and a reference output.
Examples
Each example consists of:
- Inputs: a dictionary of input variables to pass to your application.
- Reference outputs (optional): a dictionary of reference outputs. These are not passed to your application; they are only used in evaluators.
- Metadata (optional): a dictionary of additional information that can be used to create filtered views of a dataset.
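For illustration, here is roughly how these fields map onto the LangSmith Python SDK. The dataset name, input/output keys, and metadata values below are placeholders, and exact method signatures may vary across SDK versions:

```python
from langsmith import Client

client = Client()

# Create a dataset to hold examples (name is a placeholder)
dataset = client.create_dataset(dataset_name="qa-smoke-tests")

# Each example: inputs, optional reference outputs, optional metadata
client.create_examples(
    inputs=[{"question": "What is LangSmith?"}],
    outputs=[{"answer": "A platform for tracing and evaluating LLM applications."}],
    metadata=[{"topic": "product"}],
    dataset_id=dataset.id,
)
```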
Dataset curation
There are various ways to build datasets for evaluation, including:
Manually curated examples
This is how we typically recommend people get started creating datasets. From building your application, you probably have some idea of what types of inputs you expect your application to handle, and what "good" responses look like. You'll also want to cover common scenarios and a few edge cases you can imagine. Even 10-20 high-quality, manually curated examples can go a long way.
Historical traces
Once you have an application in production, you start getting valuable information: how are users actually using it? These real-world runs make for great examples because they're, well, the most realistic!
If you're getting a lot of traffic, how can you determine which runs are valuable to add to a dataset? There are a few techniques you can use:
- User feedback: If possible, try to collect end-user feedback. You can then see which datapoints received negative feedback. That is super valuable! These are spots where your application did not perform well. You should add these to your dataset to test against in the future.
- Heuristics: You can also use other heuristics to identify "interesting" datapoints. For example, runs that took a long time to complete could be interesting to look at and add to a dataset.
- LLM feedback: You can use another LLM to detect noteworthy runs. For example, you could use an LLM to label chatbot conversations where the user had to rephrase their question or correct the model in some way, indicating the chatbot did not initially respond correctly.
Synthetic data
Once you have a few examples, you can try to artificially generate some more. It's generally advised to have a few good hand-crafted examples before this, as this synthetic data will often resemble them in some way. This can be a useful way to get a lot of datapoints, quickly.
Splits
When setting up your evaluation, you may want to partition your dataset into different splits. For example, you might use a smaller split for many rapid and cheap iterations and a larger split for your final evaluation. In addition, splits can be important for the interpretability of your experiments. For example, if you have a RAG application, you may want your dataset splits to focus on different types of questions (e.g., factual, opinion, etc.) and to evaluate your application on each split separately.
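As a rough sketch, assuming you have already assigned examples to splits (the dataset and split names below are placeholders), you might fetch a single split with the Python SDK and evaluate against only those examples:

```python
from langsmith import Client

client = Client()

# Fetch only the examples in one split (dataset and split names are placeholders)
factual_examples = client.list_examples(
    dataset_name="qa-smoke-tests",
    splits=["factual"],
)
```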
Learn how to create and manage dataset splits.
Versions
Datasets are versioned such that every time you add, update, or delete examples in your dataset, a new version of the dataset is created. This makes it easy to inspect and revert changes to your dataset in case you make a mistake. You can also tag versions of your dataset to give them a more human-readable name. This can be useful for marking important milestones in your dataset's history.
You can run evaluations on specific versions of a dataset. This can be useful when running evaluations in CI, to make sure that a dataset update doesn't accidentally break your CI pipelines.
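As a sketch, assuming you have tagged a dataset version (the tag below is illustrative), you might pin an evaluation to that version with the SDK's as_of parameter so later edits don't change what CI evaluates:

```python
from langsmith import Client

client = Client()

# Pin to a tagged dataset version ("prod-baseline" is an illustrative tag)
pinned_examples = client.list_examples(
    dataset_name="qa-smoke-tests",
    as_of="prod-baseline",
)
```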
Evaluators
Evaluators are functions that score how well your application performs on a particular example.
Evaluator inputs
Evaluators receive these inputs:
- Example: The example(s) from your Dataset. Contains inputs, (reference) outputs, and metadata.
- Run: The actual outputs and intermediate steps (child runs) from passing the example inputs to the application.
Evaluator outputs
An evaluator returns one or more metrics. These should be returned as a dictionary or list of dictionaries of the form:
- key: The name of the metric.
- score | value: The value of the metric. Use score if it's a numerical metric and value if it's categorical.
- comment (optional): The reasoning or additional string information justifying the score.
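As a sketch of that shape, here is a custom evaluator that returns one numerical and one categorical metric as a list of dictionaries. The metric names and the "answer" output key are illustrative assumptions about your application's schema:

```python
# Receives the run (actual outputs) and the example (reference outputs),
# and returns metrics in the dictionary form described above.
def answer_metrics(run, example) -> list[dict]:
    predicted = (run.outputs or {}).get("answer", "")
    expected = (example.outputs or {}).get("answer", "")
    return [
        {
            # Numerical metric: use "score"
            "key": "exact_match",
            "score": 1 if predicted.strip() == expected.strip() else 0,
            "comment": "Exact string match against the reference output.",
        },
        {
            # Categorical metric: use "value"
            "key": "length_bucket",
            "value": "short" if len(predicted) < 200 else "long",
        },
    ]
```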
Defining evaluators
There are a number of ways to define and run evaluators:
- Custom code: Define custom evaluators as Python or TypeScript functions and run them client-side using the SDKs or server-side via the UI.
- Built-in evaluators: LangSmith has a number of built-in evaluators that you can configure and run via the UI.
You can run evaluators using the LangSmith SDK (Python and TypeScript), via the Prompt Playground, or by configuring Rules to automatically run them on particular tracing projects or datasets.
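For example, a minimal client-side run with the Python SDK might look like the following sketch. The target function, evaluator, dataset name, and experiment prefix are all illustrative, and the import path may differ across SDK versions:

```python
from langsmith import evaluate  # on older SDK versions: from langsmith.evaluation import evaluate

# The application under test: a stub that echoes the question (illustrative)
def my_app(inputs: dict) -> dict:
    return {"answer": f"You asked: {inputs['question']}"}

# A minimal custom evaluator using the run/example signature
def exact_match(run, example) -> dict:
    return {
        "key": "exact_match",
        "score": int((run.outputs or {}).get("answer") == (example.outputs or {}).get("answer")),
    }

results = evaluate(
    my_app,
    data="qa-smoke-tests",        # dataset name is a placeholder
    evaluators=[exact_match],
    experiment_prefix="baseline",
)
```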
Evaluation techniques
There are a few high-level approaches to LLM evaluation:
Human
Human evaluation is often a great starting point. LangSmith makes it easy to review your LLM application outputs as well as the traces (all intermediate steps).
LangSmith's annotation queues make it easy to get human feedback on your application's outputs.
Heuristic
Heuristic evaluators are deterministic, rule-based functions. These are good for simple checks like making sure that a chatbot's response isn't empty, that a snippet of generated code can be compiled, or that a classification is exactly correct.
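As a small sketch of the generated-code check mentioned above, the evaluator below verifies that a snippet of generated Python parses. The "code" output key is an assumption about your application's schema:

```python
import ast

# Deterministic, rule-based check: does the generated Python code parse?
def code_parses(run, example) -> dict:
    code = (run.outputs or {}).get("code", "")
    try:
        ast.parse(code)
        return {"key": "code_parses", "score": 1}
    except SyntaxError as err:
        return {"key": "code_parses", "score": 0, "comment": f"SyntaxError: {err}"}
```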
LLM-as-judge
LLM-as-judge evaluators use LLMs to score the application's output. To use them, you typically encode the grading rules / criteria in the LLM prompt. They can be reference-free (e.g., checking whether the system output contains offensive content or adheres to specific criteria), or they can compare the task output to a reference output (e.g., checking whether the output is factually accurate relative to the reference).
With LLM-as-judge evaluators, it is important to carefully review the resulting scores and tune the grader prompt if needed. Often it is helpful to write these as few-shot evaluators, where you provide examples of inputs, outputs, and expected grades as part of the grader prompt.
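Here is a minimal sketch of a reference-free LLM-as-judge evaluator that calls the OpenAI API directly. The prompt wording, model name, and "answer" output key are illustrative assumptions, not a prescribed setup:

```python
from openai import OpenAI

oai = OpenAI()

# Reference-free judge: grades the application's answer for conciseness
def conciseness_judge(run, example) -> dict:
    answer = (run.outputs or {}).get("answer", "")
    response = oai.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[
            {
                "role": "user",
                "content": (
                    "Grade the following answer for conciseness. "
                    "Reply with only 'Y' (concise) or 'N' (not concise).\n\n"
                    f"Answer: {answer}"
                ),
            }
        ],
    )
    grade = response.choices[0].message.content.strip()
    return {"key": "conciseness", "value": grade, "comment": "LLM-graded: Y = concise."}
```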
Learn how to define an LLM-as-judge evaluator.