
# Get started with Laminar evaluations

Laminar provides a structured way to create and run evaluations and to track your AI system's performance through these key components:

* **Executors** - Functions that process inputs and produce outputs, such as prompt templates, LLM calls, or production logic
* **Evaluators** - Functions that assess outputs against targets or quality criteria, producing numeric scores
* **Datasets** - Collections of datapoints (test cases) with 3 key elements:
  * `data` - Required JSON input sent to the executor
  * `target` - Optional reference data sent to the evaluator, typically containing expected outputs
  * `metadata` - Optional metadata. This can be used to filter evaluation results in the UI after the evaluation is run.
* **Visualization** - Tools to track performance trends and detect regressions over time
* **Tracing** - Automatic recording of execution flow and model invocations

Example datapoint:

```json5 theme={null}
{
  "data": {
    "question": "What is the capital of France? Respond in one word.",
  },
  "target": {
    "answer": "Paris"
  },
  "metadata": {
    "category": "geography"
  }
}
```
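As a rough illustration of the shape described above, the same datapoint can be built in Python as a plain dictionary. The `Datapoint` type and the `make_datapoint` helper below are hypothetical names used only for this sketch; they are not part of the Laminar SDK.

```python
from typing import Any, Optional, TypedDict


class Datapoint(TypedDict, total=False):
    """Hypothetical type mirroring the three datapoint fields."""
    data: Any        # required: JSON input sent to the executor
    target: Any      # optional: reference data sent to the evaluator
    metadata: dict   # optional: used to filter results in the UI


def make_datapoint(
    data: Any,
    target: Optional[Any] = None,
    metadata: Optional[dict] = None,
) -> Datapoint:
    """Build a datapoint, leaving out the optional fields that are not set."""
    point: Datapoint = {"data": data}
    if target is not None:
        point["target"] = target
    if metadata is not None:
        point["metadata"] = metadata
    return point


example = make_datapoint(
    data={"question": "What is the capital of France? Respond in one word."},
    target={"answer": "Paris"},
    metadata={"category": "geography"},
)
```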

**Evaluation Groups** group related evaluations to assess one feature or component, with results aggregated for comparison.

## Evaluation Lifecycle

For each datapoint in a dataset:

1. The executor receives the `data` as input
2. The executor runs and its output is stored
3. Both the executor output and `target` are passed to the evaluator
4. The evaluator produces either a numeric score or a JSON object with multiple numeric scores
5. Results are stored and can be visualized to track performance over time

This approach helps you continuously measure your AI system's performance as you make changes, showing the impact of model updates, prompt revisions, and code changes.
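The lifecycle steps can be sketched in plain Python. This is a simplified illustration, not Laminar's implementation: `run_evaluation` is a hypothetical function, it runs everything sequentially, and it keeps results in a list instead of sending them to Laminar. How scores from a multi-score evaluator are named is also an assumption of this sketch.

```python
from typing import Any, Callable, Dict, List, Optional, Union

Score = Union[int, float]
EvaluatorResult = Union[Score, Dict[str, Score]]
Evaluator = Callable[[Any, Optional[Any]], EvaluatorResult]


def run_evaluation(
    datapoints: List[dict],
    executor: Callable[[Any], Any],
    evaluators: Dict[str, Evaluator],
) -> List[dict]:
    """Hypothetical, sequential version of the lifecycle described above."""
    results = []
    for point in datapoints:
        # Steps 1-2: the executor receives `data` and its output is kept.
        output = executor(point["data"])
        scores: Dict[str, Score] = {}
        for name, evaluator in evaluators.items():
            # Step 3: the evaluator receives the output and the target.
            result = evaluator(output, point.get("target"))
            # Step 4: either one numeric score or a dict of named scores.
            if isinstance(result, dict):
                scores.update(result)  # assumption: keep the dict's own keys
            else:
                scores[name] = result
        # Step 5: store the result (here, only in memory).
        results.append({"output": output, "scores": scores})
    return results


# Stand-in executor so the sketch runs without an LLM call.
CAPITALS = {"France": "Paris", "Germany": "Berlin"}


def stub_executor(data: dict) -> str:
    return CAPITALS.get(data["country"], "")


def accuracy(output: str, target: Optional[str]) -> int:
    return 1 if target and target in output else 0


results = run_evaluation(
    [
        {"data": {"country": "France"}, "target": "Paris"},
        {"data": {"country": "Germany"}, "target": "Berlin"},
        {"data": {"country": "Spain"}, "target": "Madrid"},
    ],
    stub_executor,
    {"accuracy": accuracy},
)
```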

### Evaluation function types

Each executor takes the `data` exactly as it is defined in the datapoints.
Each evaluator accepts the executor's output as its first argument
and the `target`, as defined in the datapoints, as its second argument.

This means that the type of the `data` fields in your datapoints must
match the type of the first parameter of the executor function. Similarly,
the type of the `target` fields in your datapoints must match the type of
the second parameter of the evaluator function(s).

Python is a bit more permissive. If you see type errors in TypeScript,
make sure the data types and the parameter types match.
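In Python, type hints can make the same contract visible even though they are not required. The sketch below uses hypothetical names (`CountryInput`, `Answer`, `lookup_capital`, `exact_match`) and a lookup table in place of an LLM call. The executor's parameter type matches the datapoint's `data`, and the evaluator's first parameter matches the executor's return type.

```python
from typing import Optional, TypedDict


class CountryInput(TypedDict):
    """Shape of the `data` field in each datapoint."""
    country: str


class Answer(TypedDict):
    """Shape of the executor's output."""
    text: str


# Stand-in for a model call so the example is self-contained.
KNOWN_CAPITALS = {"France": "Paris"}


def lookup_capital(data: CountryInput) -> Answer:
    """Executor: its parameter type matches the datapoint's `data`."""
    return {"text": KNOWN_CAPITALS.get(data["country"], "")}


def exact_match(output: Answer, target: Optional[str] = None) -> int:
    """Evaluator: first parameter is the executor output, second is `target`."""
    if target is None:
        return 0
    return 1 if output["text"].strip() == target else 0


score = exact_match(lookup_capital({"country": "France"}), "Paris")
```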

For a more precise description, here's the partial TypeScript type signature of the `evaluate` function:

```javascript theme={null}
evaluate<D, T, O>(
  data: {
    data: D,
    target?: T,
  },
  executor: (data: D, ...args: any[]) => O | Promise<O>,
  evaluators: {
    [key: string]: (output: O, target?: T, ...args: any[]) => 
      number | { [key: string]: number }
  },
  // ... other parameters
)
```

See full reference in the [SDK docs](/sdk/reference).

## Create your first evaluation

### Prerequisites

To get a project API key, open the Laminar dashboard, go to the project settings,
and generate a project API key. This works in both the cloud and self-hosted versions of Laminar.

Specify the key at `Laminar` initialization. If not specified,
Laminar will look for the key in the `LMNR_PROJECT_API_KEY` environment variable.
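The lookup order described above (explicit key first, then the environment variable) can be sketched as follows. `resolve_project_api_key` is a hypothetical helper written for illustration, and raising an error when no key is found is an assumption of this sketch, not documented Laminar behavior.

```python
import os
from typing import Optional


def resolve_project_api_key(explicit_key: Optional[str] = None) -> str:
    """Prefer an explicitly passed key, then fall back to the environment."""
    key = explicit_key or os.environ.get("LMNR_PROJECT_API_KEY")
    if not key:
        # Assumption: fail loudly here; the SDK may behave differently.
        raise RuntimeError(
            "No project API key found. Pass one explicitly or set "
            "LMNR_PROJECT_API_KEY."
        )
    return key
```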

### Create an evaluation file

<Tabs>
  <Tab title="TypeScript">
    Create a file named `my-first-evaluation.ts` and add the following code:

    ```javascript my-first-evaluation.ts theme={null}
    import { evaluate } from '@lmnr-ai/lmnr';
    import { OpenAI } from 'openai';  

    const client = new OpenAI();

    const capitalOfCountry = async (data: {country: string}) => {
      // replace this with your LLM call or custom logic
      const response = await client.chat.completions.create({
        model: 'gpt-4.1-nano',
        messages: [
          {
            role: 'user',
            content: `What is the capital of ${data['country']}? ` +
              'Answer only with the capital, no other text.'
          }
        ]
      });
      return response.choices[0].message.content || '';
    }

    evaluate({
      data: [
        {
          data: { country: 'France' },
          target: 'Paris',
        },
        {
          data: { country: 'Germany' },
          target: 'Berlin',
        },
      ],
      executor: capitalOfCountry,
      evaluators: {
        accuracy: (output: string, target: string | undefined): number => {
          if (!target) return 0;
          return output.includes(target) ? 1 : 0;
        } 
      },
      config: { 
        instrumentModules: {
          openAI: OpenAI
        }
      }
    })
    ```

    <Note>
      Pass the `config` object with `instrumentModules` to `evaluate` so that the OpenAI client, and any other modules you list there, are instrumented.
    </Note>
  </Tab>

  <Tab title="Python">
    Create a file named `my-first-evaluation.py` and add the following code:

    ```python my-first-evaluation.py theme={null}
    from lmnr import evaluate
    from openai import OpenAI

    client = OpenAI()

    def capital_of_country(data: dict) -> str:
        # replace this with your LLM call or custom logic
        response = client.chat.completions.create(
            model="gpt-4.1-nano",
            messages=[
                {
                    "role": "user",
                    "content": f"What is the capital of {data['country']}? " +
                        "Answer only with the capital, no other text."
                }
            ]
        )
        # message.content can be None, so fall back to an empty string
        return response.choices[0].message.content or ""

    def accuracy(output: str, target: str) -> int:
        return 1 if target in output else 0

    evaluate(
        data=[
            {
                "data": {"country": "France"},
                "target": "Paris"
            },
            {
                "data": {"country": "Germany"},
                "target": "Berlin"
            },
        ],
        executor=capital_of_country,
        evaluators={"accuracy": accuracy}
    )
    ```
  </Tab>
</Tabs>

### Run the evaluation

You can run evaluations in two ways: using the `lmnr eval` CLI or directly executing the evaluation file.

#### Using the CLI

The Laminar CLI automatically detects top-level `evaluate` function calls in your files - you don't need to wrap them in a `main` function or any special structure.

<Tabs>
  <Tab title="TypeScript">
    ```sh theme={null}
    export LMNR_PROJECT_API_KEY=<YOUR_PROJECT_API_KEY>
    npx lmnr eval my-first-evaluation.ts
    ```

    To run multiple evaluations, place them in an `evals` directory with the naming pattern `*.eval.{ts,js}`:

    ```
    ├─ src/
    ├─ evals/
    │  ├── my-first-evaluation.eval.ts
    │  ├── my-second-evaluation.eval.ts
    │  ├── ...
    ```

    Then run all evaluations with a single command:

    ```sh theme={null}
    npx lmnr eval
    ```

    CLI options:

    * First positional argument: path to the evaluation file (optional).
    * Without a path, `lmnr eval` runs all files matching `*.eval.{ts,js}` in `evals/`.
    * `--fail-on-error`: fail on non-critical errors (e.g., missing `evaluate` call). Default `false`.
  </Tab>

  <Tab title="Python">
    ```sh theme={null}
    # 1. Make sure `lmnr` is installed in a virtual environment
    # lmnr --help
    # 2. Run the evaluation
    export LMNR_PROJECT_API_KEY=<YOUR_PROJECT_API_KEY>
    lmnr eval my-first-evaluation.py
    ```

    To run multiple evaluations, place them in an `evals` directory with the naming pattern `eval_*.py` or `*_eval.py`:

    ```
    ├─ src/
    ├─ evals/
    │  ├── eval_first.py
    │  ├── second_eval.py
    │  ├── ...
    ```

    Then run all evaluations with a single command:

    ```sh theme={null}
    lmnr eval
    ```

    CLI options:

    * First positional argument: path to the evaluation file (optional).
    * Without a path, `lmnr eval` runs all files matching `eval_*.py` or `*_eval.py` in `evals/`.
    * `--fail-on-error`: fail on non-critical errors (e.g., missing `evaluate` call). Default `false`.
  </Tab>
</Tabs>
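To see which file names the Python naming patterns above would cover, here is a small sketch using the standard library's `fnmatch` module. It only illustrates the two glob patterns; the CLI's actual discovery logic may differ, and `is_eval_file` is a hypothetical helper.

```python
from fnmatch import fnmatchcase

# The two Python naming patterns listed above.
PYTHON_EVAL_PATTERNS = ("eval_*.py", "*_eval.py")


def is_eval_file(filename: str) -> bool:
    """Return True if the file name matches either pattern."""
    return any(fnmatchcase(filename, p) for p in PYTHON_EVAL_PATTERNS)


candidates = [
    "eval_first.py",
    "second_eval.py",
    "my-first-evaluation.py",
    "helpers.py",
]
matched = [name for name in candidates if is_eval_file(name)]
```

Under these patterns, a file such as `my-first-evaluation.py` does not match, so it would need to be passed to `lmnr eval` explicitly as the path argument.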

#### Running as a standalone script

You can also execute the evaluation file directly, without the CLI:

<Tabs>
  <Tab title="TypeScript">
    ```bash theme={null}
    ts-node my-first-evaluation.ts
    # or
    npx tsx my-first-evaluation.ts
    ```
  </Tab>

  <Tab title="Python">
    ```bash theme={null}
    python my-first-evaluation.py
    ```
  </Tab>
</Tabs>

The `evaluate` function is flexible and can be used both in standalone scripts processed by the CLI and integrated directly into your application code.

<Tip>
  Evaluator functions must return either a single numeric score or a JSON object where each key is a score name and the value is a numeric score.
</Tip>
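For instance, an evaluator that reports several scores at once could look like the sketch below. The function name `answer_quality` and the score names are illustrative choices, not names defined by Laminar.

```python
from typing import Dict


def answer_quality(output: str, target: str) -> Dict[str, float]:
    """Return several named numeric scores for a single output."""
    normalized = output.strip()
    return {
        # 1.0 if the target appears anywhere in the output
        "contains_target": 1.0 if target in normalized else 0.0,
        # 1.0 only if the output is exactly the target
        "exact_match": 1.0 if normalized == target else 0.0,
    }


scores = answer_quality("The capital is Paris.", "Paris")
```

Here `scores` holds one entry per score name, with `contains_target` at `1.0` and `exact_match` at `0.0` because the output has extra words around the target.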

<Note>
  No need to initialize Laminar - `evaluate` automatically initializes Laminar behind the scenes. All instrumented function calls and model invocations are traced without any additional setup.
</Note>

### View evaluation results

When you run an evaluation from the CLI, Laminar will output the link to the dashboard where you can view the evaluation results.

Laminar stores every evaluation result. The run for each datapoint is represented as a trace. You can view the results and their corresponding traces on the evaluations page.

<Frame>
  <img height="300" src="https://mintcdn.com/hipocap/jGau1UNMEqA4k6Af/images/evaluations/simple-eval-example.png?fit=max&auto=format&n=jGau1UNMEqA4k6Af&q=85&s=bc7cc942de92c4051a4f8ec2753a497e" alt="Example evaluation" data-path="images/evaluations/simple-eval-example.png" />
</Frame>

## Tracking evaluation progress

To track score progression over time or compare evaluations side by side, group them together by passing the `groupName` parameter (`group_name` in Python) to the `evaluate` function.

<Tabs>
  <Tab title="TypeScript">
    ```javascript {7} theme={null}
    import { evaluate, LaminarDataset } from '@lmnr-ai/lmnr';

    evaluate({
        data: new LaminarDataset("name_of_your_dataset"),
        executor: yourExecutorFunction,
        evaluators: {evaluatorName: yourEvaluator},
        groupName: "evals_group_1",
    });
    ```
  </Tab>

  <Tab title="Python">
    ```python {9} theme={null}
    from lmnr import evaluate, LaminarDataset
    import os

    evaluate(
        data=LaminarDataset("name_of_your_dataset"),
        executor=your_executor_function,
        evaluators={"evaluator_name": your_evaluator},
        project_api_key=os.environ["LMNR_PROJECT_API_KEY"],
        group_name="evals_group_1",
        # ... other optional parameters
    )
    ```
  </Tab>
</Tabs>

<Frame>
  <img height="300" src="https://mintcdn.com/hipocap/jGau1UNMEqA4k6Af/images/evaluations/eval-progress.png?fit=max&auto=format&n=jGau1UNMEqA4k6Af&q=85&s=1662974c733164eaa88dc9fbde920ea5" alt="Example evaluation progression" data-path="images/evaluations/eval-progress.png" />
</Frame>
