The evaluation framework is a tool that is part of the watsonx Orchestrate Agent Development Kit (ADK) and allows you to test, evaluate, and analyze the agents that you have created. To test your agents, you set up a file with the expected interaction responses and run the agent evaluation to check whether your agent matches those expectations.

Before you begin

  • If you already have an agent, check whether it is:
    • External: hosted outside watsonx Orchestrate and connected through configuration.
    • Native: runs within watsonx Orchestrate and is built with the Agent Development Kit.
  • If your agent is external, you must prepare it to connect to the watsonx Orchestrate platform. For more information, see Connect to external agents.
  • Install the watsonx Orchestrate Agent Development Kit. For more information, see Installing the ADK.
  • Install the watsonx Orchestrate Developer Edition. For more information, see Installing the Developer Edition.
  • For more information about the evaluation framework, see the Evaluation framework overview.

Validating your external agent

External agents are agents that were not created with the watsonx Orchestrate Agent Development Kit or on the watsonx Orchestrate platform. This step checks:
  • Whether your external agent works without a native agent.
  • Whether the agent can ingest events and produce events that follow the watsonx Orchestrate API contract.
After the event validation, if there are errors, you can analyze the event logs from the external agent to remediate any issues. Only after the events pass validation can the framework analyze the external agent's behavior. After you build your external agent, follow these steps to validate it:
  1. Prepare your user story.
  2. Run the validation command.
  3. Connect your external agent to a collaborator agent to validate its performance.
  4. Iterate and fix issues until you are satisfied with the results.
  5. Submit your results.

Preparing your user story

A user story describes the intention of the user with context information. You must provide all the relevant user information for the agent to process. For example:
You are John Doe and you want to look up your holiday information for the year of 2025.
Prepare a .tsv file with two columns. The first column contains the user story and the second column contains the expected final output from the agent. Prepare at least one example per use case that your agent supports. For an example file, see the external agent validation folder example.
Do not include a column header in the .tsv file.
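For illustration only, a minimal two-column file for the John Doe story above might look like the following, with a tab separating the user story from the expected final output (the expected answers here are placeholders, not real data):
input_sample.tsv
You are John Doe and you want to look up your holiday information for the year of 2025.	John Doe has 12 vacation days remaining for 2025.
You are John Doe and you want to know which public holidays fall in June 2025 in the US.	The US public holiday in June 2025 is Juneteenth on June 19.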

Running the external agent validation

The validate-external command validates your external agent against the chat completions schema for streamed events and stores the validation results, including the streamed events from the external agent, for later triaging and debugging. You must provide the external agent JSON specification, such as the following:
sample.json
{
    "name": "<agent_name>_agent",
    "description": "Describe Agent briefly",
    "api_url": "https://partnerDomain/agentPath/v1/chat/completions",
    "auth_scheme": "API_KEY",
    "category": "agent",
    "kind": "external",
    "provider": "external_chat",
    "model": "watsonx/meta-llama/llama-3-2-90b-vision-instruct",
    "version": "1.0.1",
    "publisher": "your_company_name",
    "language_support": ["English"],
    "icon": "<svg>",
    "tags": [
        "Sales"
    ]
}
And then you can run the validation command:
orchestrate evaluations validate-external --tsv ./examples/evaluations/external_agent_validation/test.tsv --external-agent-config ./examples/evaluations/external_agent_validation/sample.json --credential "<API/BEARER TOKEN>"
You must provide valid credentials to connect to your external agent.
The validation results are saved to a validate_external subfolder under the path provided for the --output flag.
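If the validation reports schema errors, it can help to confirm manually that the endpoint accepts a chat-completions-style request with streamed events before rerunning the command. The following Python sketch is illustrative only; it assumes a bearer-style credential (the same value you pass to --credential) and the api_url from sample.json, so adjust the header to match your agent's auth_scheme:
check_endpoint.py
import requests  # third-party; install with: pip install requests

API_URL = "https://partnerDomain/agentPath/v1/chat/completions"  # api_url from sample.json
TOKEN = "<API/BEARER TOKEN>"  # credential used with --credential

payload = {
    "stream": True,  # the validation checks streamed events
    "messages": [
        {
            "role": "user",
            "content": "You are John Doe and you want to look up your holiday information for the year of 2025.",
        }
    ],
}

# Assumption: bearer-token auth; swap in an API-key header if your agent expects one.
resp = requests.post(
    API_URL,
    headers={"Authorization": f"Bearer {TOKEN}", "Content-Type": "application/json"},
    json=payload,
    stream=True,
    timeout=60,
)
resp.raise_for_status()

# Print each streamed event line so you can inspect the shape of the events.
for line in resp.iter_lines():
    if line:
        print(line.decode("utf-8"))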

Evaluating the agent

After validation, you can evaluate the agent by using the provided input. This evaluation checks whether the external agent works when added as a collaborator agent to a native agent. Running the following validation command will import the external agent, automatically create a native agent, and add the external agent as a collaborator to the native agent.
orchestrate evaluations validate-external --tsv "./examples/evaluations/external_agent_validation/test.tsv" --external-agent-config "./examples/evaluations/external_agent_validation/sample_external_agent_config.json" --perf
The automatically created native agent has a name in the format external_agent_validation_{external_agent_name}_{number}. You can see this native agent by running the list agents command. After validation, you can optionally remove the agents by using the remove agent command. After the results are generated, review them and keep iterating until the evaluation produces satisfactory results.

Analyzing the validation results

The validation_results.json file stores the results from the validation tests. The validation tests send sample inputs and record the responses. The "success" field indicates whether the external agent events matched the chat completions specification. The validation tests prepend default messages to the user story; these messages act as context for the agent. Given n messages, the goal is to validate whether the external agent can properly handle an array of messages in which the first n - 1 messages act as context and the nth message is the one the external agent must respond to, given that context. The following default messages are prepended:
[
    {"role": "user", "content": "what's the holiday is June 13th in us?"},
    {"role": "assistant", "content": "tool_name: calendar_lookup, args {\"location\": \"USA\", \"data\": \"06-13-2025\"}}"},
    {"role": "assistant", "content":"it's National Sewing Machine Day"}
]
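With a user story from your .tsv file appended as the final user message, the full array that is sent to the external agent looks similar to the following (illustrative only; the last entry is the nth message that the agent must answer):
[
    {"role": "user", "content": "what's the holiday is June 13th in us?"},
    {"role": "assistant", "content": "tool_name: calendar_lookup, args {\"location\": \"USA\", \"data\": \"06-13-2025\"}"},
    {"role": "assistant", "content": "it's National Sewing Machine Day"},
    {"role": "user", "content": "You are John Doe and you want to look up your holiday information for the year of 2025."}
]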
The validation files contain the following fields:
  1. success: Boolean value that indicates if the events streamed back adhered to the expected schema.
  2. logged-events: Streamed events from the external agent.
  3. messages: The messages that were sent to the external agent.
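To spot failing cases quickly, you can scan validation_results.json with a short script. The sketch below is only an example under assumptions: it treats the file as a list of records with the fields listed above and wraps a single record into a list if needed.
check_results.py
import json

# Assumes the default validate_external subfolder under your --output path.
with open("validate_external/validation_results.json", "r", encoding="utf-8") as f:
    results = json.load(f)

# Normalize to a list of records in case the file holds a single object.
if isinstance(results, dict):
    results = [results]

for i, record in enumerate(results):
    if not record.get("success", False):
        print(f"Test case {i} failed schema validation")
        print("  last message sent:", record.get("messages", [])[-1:])
        print("  number of logged events:", len(record.get("logged-events", [])))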

What to do next

You can now proceed to Preparing for submission.

Validating your native agent

Native agents refer to agents that were created with the watsonx Orchestrate Agent Development Kit or inside the watsonx Orchestrate platform. The validate-native command validates the native agent and its registered tools, collaborator agents, and knowledge bases against a set of inputs.

Running the validation

Prepare a TSV file with three columns:
  • The first column contains user stories.
  • The second column is the expected summary or output.
  • The third column is the name of the native agent that you want to validate.
For example:
example.tsv
My username is nwaters. I want to find out my timeoff schedule from: 2025-01-01 to: 2025-03-03.	Your timeoff schedule for 20250101 to 20250303 is: 20250105	hr_agent
The provided user stories and expected output are used to generate the JSON-formatted test cases that evaluate the agent. The generated test cases are saved at the path <output-folder>/native_agent_evaluations/generated_test_data. Run the command:
orchestrate evaluations validate-native -t <path to data file tsv> -o <output folder>
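If you build the test data programmatically, make sure that the file is tab-separated and has no header row. A minimal Python sketch, reusing the hypothetical hr_agent example above:
make_test_data.py
import csv

# Each row: user story, expected summary or output, name of the native agent to validate.
rows = [
    (
        "My username is nwaters. I want to find out my timeoff schedule from: 2025-01-01 to: 2025-03-03.",
        "Your timeoff schedule for 20250101 to 20250303 is: 20250105",
        "hr_agent",
    ),
]

with open("example.tsv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f, delimiter="\t")
    writer.writerows(rows)  # no header row, as required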

Preparing for submission

To prepare your agent for submission, you must include all the result files from the validation and evaluation stages. You must also provide a valid credential that can be used to test your agent. Compress the results into a zip file to send them, as follows:
External agents
company-name-onboarding-validation.zip/
├── evaluations/
│   ├── ... # other relevant files for the evaluation
│   ├── sample_agent.yaml
│   ├── knowledge_base_summary_metrics.json
│   └── summary_metrics.csv
└── validation_external/
    ├── input_sample.tsv
    └── validation_results.json
Native agents
company-name-onboarding-validation.zip/
├── native_agent_evaluations/
│   ├── generated_test_data/
│   │   ├── native_agent_evaluation_test_0.json
│   │   └── ...
│   ├── knowledge_base_summary_metrics/
│   ├── messages/
│   ├── summary_metrics.csv
│   └── ... # other relevant files for the evaluation
└── evaluations/
    ├── ... # other relevant files for the evaluation
    ├── sample_agent.yaml
    ├── knowledge_base_summary_metrics.json
    └── summary_metrics.csv
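You can create the zip file with any tool you prefer. As one option, the following Python sketch packages the folders shown above; adjust FOLDERS to whichever layout applies to your agent:
package_results.py
import os
import zipfile

# Use ["native_agent_evaluations", "evaluations"] for native agents.
FOLDERS = ["evaluations", "validation_external"]

with zipfile.ZipFile("company-name-onboarding-validation.zip", "w", zipfile.ZIP_DEFLATED) as zf:
    for folder in FOLDERS:
        for root, _, files in os.walk(folder):
            for name in files:
                path = os.path.join(root, name)
                zf.write(path, arcname=path)  # keep the folder structure inside the zip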

Next steps

After you prepare the evaluation files, you must package your agent if you haven't already. For more information, see Packaging your agent. You must put the validation results under the evaluations/ folder, for example:
.
├── agents/
│   └── my_agent.yaml
├── connections/
│   └── my_connections.yaml
├── offerings/
│   └── my_offering.yaml
├── tools/
│   └── sample_tool/
│       ├── tool.py
│       └── requirements.txt
└── evaluations/
    └── company-name-onboarding-validation.zip