The evaluation framework is a tool included with the watsonx Orchestrate Agent Development Kit (ADK) that allows you to test, evaluate, and analyze the agents that you have created. To test your agents, you set up a file with the expected interaction responses and run the agent evaluation to check whether your agent matches those expectations.

Before you begin

  • Your environment must be properly configured with a .env file. To learn more about how to configure your .env file, see Setup the environment.

Validating your agent

The validation step checks whether your external agent works without a native agent. To learn more about native agents and external agents, see Creating Agents.

After you build your external agent, follow these steps to validate it:

  1. Prepare your user story.
  2. Run the validation command.
  3. Add your external agent as a collaborator to a native agent to validate its performance.
  4. Iterate and fix issues until you are satisfied with the results.
  5. Submit your results.

Preparing your user story

A user story describes the user's intent along with context information. You must provide all the relevant user information for the agent to process. For example:

You are John Doe and you want to look up your holiday information for the year 2025.

Prepare a .tsv file with two columns: the first column contains the user story, and the second column contains the expected summary or output. For an example file, see the external agent validation folder.

Do not include a column header in the .tsv file.
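
For illustration, the following is a minimal sketch that writes such a file with Python's csv module; the file name test.tsv and the row contents are hypothetical placeholders:

import csv

# Hypothetical user stories and expected outputs; replace with your own test data.
rows = [
    (
        "You are John Doe and you want to look up your holiday information for the year 2025.",
        "John Doe has 15 vacation days remaining for 2025.",
    ),
]

# Write a two-column .tsv file without a header row.
with open("test.tsv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f, delimiter="\t")
    writer.writerows(rows)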

Running the validation

The validate-external command validates your external agent against the chat completions schema for streamed events and stores the validation results, including the streamed events from the external agent, for later triaging and debugging.

You must use an external agent specification, such as the following:

sample.yaml
spec_version: v1
name: "QA_Agent"
title: "QA_Agent"
category: "agent"
kind: "external"
description: "Agent that answers questions about geography, governments, and world facts."
tags:
    - "productivity"
api_url: "https://<your_external_agent>/chat/completions/"
auth_scheme: "API_KEY"
auth_config:
    token: "123" 
provider: "external_chat"
chat_params:
    stream: true
config:
    hidden: false
    enable_cot: true   
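
As an optional sanity check before you run the command, you can confirm that the specification file contains the fields used in the example above. The following is a minimal sketch, assuming the file is named sample.yaml and that the PyYAML package is available:

import yaml  # PyYAML

# Fields used in the example specification above; adjust the list to your needs.
EXPECTED_FIELDS = [
    "spec_version", "name", "description", "api_url",
    "auth_scheme", "auth_config", "provider", "chat_params",
]

with open("sample.yaml", "r", encoding="utf-8") as f:
    spec = yaml.safe_load(f)

missing = [field for field in EXPECTED_FIELDS if field not in spec]
if missing:
    raise ValueError(f"External agent spec is missing fields: {missing}")
print(f"Specification for '{spec['name']}' looks complete.")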

Then, run the validation command:

orchestrate evaluations validate-external --tsv ./examples/evaluations/external_agent_validation/test.tsv --external-agent-config ./examples/evaluations/external_agent_validation/sample.yaml --credential "<API/BEARER TOKEN>"

You must provide valid credentials to connect to your external agent.

The validation results are saved to a validation_results subfolder under the path provided for the --output flag.

If you don't get the expected outputs, you can keep iterating and improving your agent to get better results.
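
If the validation reports schema errors, it can also help to call your external agent's chat completions endpoint directly and inspect the raw streamed events. The following is a minimal sketch using the requests package; the endpoint URL and token are placeholders from the sample specification, and the bearer-style Authorization header is an assumption that depends on your agent's auth scheme:

import requests

# Placeholder values; replace with your agent's endpoint and credential (see sample.yaml).
API_URL = "https://<your_external_agent>/chat/completions/"
TOKEN = "<API/BEARER TOKEN>"

payload = {
    "stream": True,  # matches chat_params.stream in the external agent spec
    "messages": [
        {
            "role": "user",
            "content": "You are John Doe and you want to look up your holiday information for the year 2025.",
        }
    ],
}

# The header format depends on your auth_scheme; a bearer token is shown as an assumption.
headers = {"Authorization": f"Bearer {TOKEN}", "Content-Type": "application/json"}

with requests.post(API_URL, json=payload, headers=headers, stream=True, timeout=60) as response:
    response.raise_for_status()
    for line in response.iter_lines(decode_unicode=True):
        if line:  # print each streamed event line for manual inspection
            print(line)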

Analyzing the validation results

The evaluation framework creates two files:

  1. sample_block_validation_results.json
  2. validation_results.json

For sample_block_validation_results.json, the evaluation framework prepends default messages to the user story. These messages act as context for the agent. Given n messages, the goal is to validate whether the external agent can properly handle an array of messages in which the first n - 1 messages provide the context and the nth message is the one that the external agent should respond to, given that context. The following default messages are prepended:

MESSAGES = [
    {"role": "user", "content": "what's the holiday is June 13th in us?"},
    {"role": "assistant", "content": "tool_name: calendar_lookup, args {\"location\": \"USA\", \"data\": \"06-13-2025\"}}"},
    {"role": "assistant", "content":"it's National Sewing Machine Day"}
]
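
In practice, this check combines the default context above with one of your user stories, roughly as in the following illustrative sketch (MESSAGES is the list above; the exact request construction inside the framework may differ):

# MESSAGES is the list of default context messages shown above (the n - 1 context messages).
user_story = "You are John Doe and you want to look up your holiday information for the year 2025."

# The nth message is the user story that the external agent should answer given that context.
request_messages = MESSAGES + [{"role": "user", "content": user_story}]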

The validation files contain the following fields:

  1. success: Boolean value that indicates if the events streamed back adhered to the expected schema.
  2. logged-events: Streamed events from the external agent.
  3. messages: The messages that were sent to the external agent.
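
For example, a minimal triaging sketch might load validation_results.json and print the logged events for any entry whose success field is false; the field names follow the list above, while the assumption that the file holds a list of such entries is illustrative:

import json

with open("validation_results.json", "r", encoding="utf-8") as f:
    results = json.load(f)

# Assumes the file contains a list of entries with the fields described above.
for entry in results:
    if not entry.get("success"):
        print("Messages sent:", entry.get("messages"))
        print("Streamed events:", entry.get("logged-events"))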

Evaluating the agent

After validation, you can evaluate the agent by using the same .tsv input that you prepared. This evaluation checks whether the external agent works when it is added as a collaborator to a native agent.

  1. Import the external agent to your tenant. For more information, see External agents. For example:

    orchestrate agents import -f ./examples/evaluations/external_agent_validation/sample.yaml 
    orchestrate agents list # external agent listed under `External Agents` table 
    
  2. Add the external agent as a collaborator to your native agent. See the documentation for an in-depth guide. The following is an example of the native agent specification; the collaborators entry must match the name field from the external agent specification (a cross-check sketch is shown after this list):

    native_agent.json
    {
        "spec_version": "v1",
        "style": "default",
        "llm": "watsonx/meta-llama/llama-3-405b-instruct",
        "name": "<provide name for native agent>",
        "description": "<provide description for your native agent>",
        "instructions": "<provide instructions for your native agent>",
        "collaborators": ["QA_Agent"] // must match the `name` field from external agent spec
    }
    
  3. Import the native agent:

    orchestrate agents import -f ./examples/evaluations/external_agent_validation/native_agent.json 
    orchestrate agents list # native agent listed under `Agents` table along with the external agent under the `Collaborators` column
    
  4. Run the evaluation:

    orchestrate evaluations validate-external --agent_name "<name of native agent>" --tsv "./examples/evaluations/external_agent_validation/test.tsv" --external-agent-config "./examples/evaluations/external_agent_validation/sample_agent.yaml"
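
Because the collaborators entry in the native agent specification must match the name field in the external agent specification, a quick cross-check of the two files before importing can save a debugging cycle. The following is a minimal sketch, assuming the example file names above and that the PyYAML package is available:

import json

import yaml  # PyYAML

with open("sample.yaml", "r", encoding="utf-8") as f:
    external_spec = yaml.safe_load(f)
with open("native_agent.json", "r", encoding="utf-8") as f:
    native_spec = json.load(f)

# The native agent's collaborators list must include the external agent's name.
if external_spec["name"] not in native_spec.get("collaborators", []):
    raise ValueError(f"{external_spec['name']!r} is missing from the native agent's collaborators")
print("Collaborator names match.")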
    

After the results are generated, you can review them and keep iterating until you get satisfactory results from the evaluation.

Submitting the results

To submit your results, you must include all the result files from the validation and evaluation stages. You must also provide a valid credential that can be used to test your agent.

Compress the results into a .zip file before you send them, as follows:

company-name-onboarding-validation.zip/
├── input_sample.tsv
├── evaluation/
│   ├── ... # other relevant files for the evaluation
│   ├── sample_agent.yaml
│   ├── knowledge_base_summary_metrics.json
│   └── summary_metrics.csv
└── validation_results/
    ├── sample_block_validation_results.json
    └── validation_results.json
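
A minimal sketch for creating the archive with the Python standard library, assuming a staging folder named company-name-onboarding-validation that already contains the layout above:

import shutil

# Zips the contents of the staging folder into company-name-onboarding-validation.zip.
shutil.make_archive("company-name-onboarding-validation", "zip", "company-name-onboarding-validation")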