# CLI Provider
The cli provider runs an arbitrary shell command per test case and captures its output as the target’s response. It’s the escape hatch that lets you evaluate anything that exposes a command-line entry point — your own agent, a third-party CLI, a stub that prints a fixed answer, a script that calls an in-house microservice, etc.
Because the contract is “we invoke a command and read a file,” almost any useful composition pattern (sanity-checking your grader against a known-good answer, diffing two implementations, driving a batch mode) can be built on top without any new primitives.
## Minimal example

```yaml
targets:
  - name: my_agent
    provider: cli
    command: python agent.py --prompt {PROMPT} --out {OUTPUT_FILE}
    grader_target: azure-base # required if your evals use LLM graders
```

Your `agent.py` reads the prompt, writes its response to the path passed as `--out`, and exits 0. That’s it.
## Command contract

Before each test case, AgentV renders the command template and spawns it as a shell process. The command has two responsibilities:

- Read the input via one of the placeholders below.
- Write the response to `{OUTPUT_FILE}` — AgentV reads that file, not your stdout.
When the process exits successfully, AgentV parses the contents of `{OUTPUT_FILE}` and treats it as the target’s response. Non-zero exits, timeouts, and unreadable output files are surfaced as test errors with the underlying stderr/exit code.
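A minimal `agent.py` satisfying this contract might look like the sketch below. The `--prompt`/`--out` flag names match the Minimal example; the canned reply is a stand-in for your real agent logic:

```python
import argparse
import sys


def run(prompt: str, out_path: str) -> int:
    """Produce a response for one test case and write it to out_path."""
    # Replace this with your real agent logic.
    response = f"You said: {prompt}"

    # AgentV reads this file, not stdout.
    with open(out_path, "w", encoding="utf-8") as f:
        f.write(response)
    return 0  # exit 0 signals success to AgentV


def main() -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument("--prompt", required=True)
    parser.add_argument("--out", required=True)
    args = parser.parse_args()
    sys.exit(run(args.prompt, args.out))


if __name__ == "__main__":
    main()
```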
## Template placeholders

Use these in `command`; AgentV substitutes them per test case.
| Placeholder | What it expands to |
|---|---|
| `{PROMPT}` | The test case’s input text, shell-escaped. |
| `{PROMPT_FILE}` | Path to a temp file containing the prompt (use this when the input is large enough to blow past shell argv limits). |
| `{OUTPUT_FILE}` | Path to a temp file the command must write to. Deleted after the run unless `keep_temp_files: true`. |
| `{FILES}` | Space-separated paths of any input files attached to the test case, formatted via `files_format`. |
| `{EVAL_ID}` | Unique identifier of the current test case — useful for logging or per-case scratch dirs. |
| `{ATTEMPT}` | Retry attempt number (0 on the first try). |
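Placeholders compose freely within one command. As an illustrative sketch (the script name, flag names, and `logs/` directory are hypothetical, and the repeated-flag use of `files_format` is an assumption based on the fields documented here):

```yaml
targets:
  - name: file_agent
    provider: cli
    # {FILES} expands each attached file via files_format, so this
    # would produce one "--file <path>" pair per attachment.
    command: python agent.py --prompt-file {PROMPT_FILE} {FILES} --log-dir logs/{EVAL_ID} --out {OUTPUT_FILE}
    files_format: "--file {path}"
```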
## Output file format

AgentV tries to parse `{OUTPUT_FILE}` as JSON first. If it parses and contains any of these keys, they’re picked up; if it doesn’t parse, the entire content is treated as the assistant’s message text.

```jsonc
{
  "output": [ // preferred: full message array
    { "role": "assistant", "content": "..." }
  ],
  "text": "...", // fallback: plain assistant text
  "token_usage": { "input": 123, "output": 456, "cached": 0 },
  "cost_usd": 0.0042,
  "duration_ms": 1800
}
```

For the common case, plain text is fine:
```sh
echo "Hello, world!" > {OUTPUT_FILE}
```

## Configuration fields

| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| `name` | string | yes | — | Target identifier used in eval configs. |
| `provider` | literal `"cli"` | yes | — | Selects this provider. |
| `command` | string | yes | — | Shell command template. |
| `timeout_seconds` | number | no | — | Kill the process if it runs longer than this. |
| `cwd` | string | no | eval dir | Working directory. Relative paths resolve against the eval file. |
| `files_format` | string | no | `{path}` | How each entry in `{FILES}` is formatted. Placeholders: `{path}`, `{basename}`. |
| `verbose` | boolean | no | `false` | Log the rendered command and cwd to stdout. Useful for debugging template substitution. |
| `keep_temp_files` | boolean | no | `false` | Preserve `{PROMPT_FILE}` / `{OUTPUT_FILE}` after the run — handy while iterating on your command. |
| `healthcheck` | object | no | — | Pre-run health check (HTTP or command); the eval aborts if it fails. |
| `workers` | number | no | — | Concurrent test-case executions against this target. |
| `provider_batching` | boolean | no | `false` | Run all cases in one command invocation — see Batching. |
| `grader_target` | string | no | — | LLM target used by this target’s LLM graders. Required if your evals use LLM-based graders. |
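Pulling several optional fields together, a fuller target declaration might look like this (the values are illustrative, not recommendations):

```yaml
targets:
  - name: my_agent
    provider: cli
    command: python agent.py --prompt {PROMPT} --out {OUTPUT_FILE}
    timeout_seconds: 120    # kill runaway cases after two minutes
    cwd: ./agent            # resolved against the eval file
    workers: 4              # run up to 4 cases concurrently
    keep_temp_files: true   # keep temp files while iterating
    verbose: true           # log the rendered command and cwd
    grader_target: azure-base
```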
## Batching

For targets where spin-up cost dominates per-case work (e.g. loading a model, authenticating), set `provider_batching: true`. AgentV invokes the command once, hands it a JSONL stream of cases, and expects a JSONL response keyed by each case’s id:

```yaml
targets:
  - name: batched_agent
    provider: cli
    provider_batching: true
    command: python agent.py --batch-in {PROMPT_FILE} --batch-out {OUTPUT_FILE}
```

`{PROMPT_FILE}` contains one JSON object per line with an `id` and the case’s inputs; your command writes one line per case to `{OUTPUT_FILE}`, each carrying the matching `id` plus the same output shape as the non-batched case.
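On the command side, a batch runner can be a small loop that reuses expensive state across cases. A sketch, assuming each input line carries its case text under an `input` key (the input shape beyond `id` isn’t specified on this page):

```python
import json
import sys


def run_batch(in_path: str, out_path: str) -> None:
    """Read JSONL cases, write one JSONL result per case keyed by id."""
    # Load expensive state (model, auth session, ...) once, here,
    # before the per-case loop — that is the point of batching.
    with open(in_path, encoding="utf-8") as fin, \
         open(out_path, "w", encoding="utf-8") as fout:
        for line in fin:
            case = json.loads(line)
            # Replace with real per-case work.
            answer = f"echo: {case.get('input', '')}"
            fout.write(json.dumps({"id": case["id"], "text": answer}) + "\n")


if __name__ == "__main__":
    run_batch(sys.argv[1], sys.argv[2])
```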
## Pattern: Oracle validation (sanity-check your grader)

A common question when building a new eval: “if my grader scores my agent poorly, is the agent wrong or is the grader wrong?” The classical testing answer is to run a known-correct reference (“the oracle”) through the same grader — if a perfect answer doesn’t pass, the grader is the bug.
AgentV has no dedicated “oracle” feature because the cli provider already composes into one. Declare a second target that prints your known-good answer into {OUTPUT_FILE}, run the same eval against it, and assert a perfect score:
```yaml
targets:
  - name: my_agent
    provider: cli
    command: python agent.py --prompt {PROMPT} --out {OUTPUT_FILE}
    grader_target: azure-base

  - name: oracle
    provider: cli
    command: cp fixtures/{EVAL_ID}.expected.txt {OUTPUT_FILE}
    grader_target: azure-base
```

```sh
# While iterating on your grader, run the oracle first.
# If it doesn't score 100%, fix the grader before trusting any agent results.
agentv eval my.EVAL.yaml --target oracle

# Then run the real target.
agentv eval my.EVAL.yaml --target my_agent
```

A few practical notes:
- `{EVAL_ID}` in the oracle command lets one target serve an entire eval suite — just ship one `fixtures/<id>.expected.txt` per case. Alternatively, read the expected output from wherever your rubric already keeps it.
- If the oracle doesn’t reach 100%, that’s the bug. Do not proceed to scoring real agents until it does.
- If the oracle does reach 100%, low scores on real agents are a signal about the agent, not the grader.
- The same composition works for other meta-tests: a “deliberately wrong” target that should score 0, a “mostly right” target pinned at a known partial score, etc.
The pattern needs no special config field, no directory convention, and no flag — it’s just a second target that happens to know the answer.
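The same shape covers the “deliberately wrong” meta-test mentioned above; the fixture naming here is illustrative:

```yaml
targets:
  - name: anti_oracle
    provider: cli
    # Prints an intentionally wrong answer; this target should score 0.
    # If it scores well, the grader is too lenient.
    command: cp fixtures/{EVAL_ID}.wrong.txt {OUTPUT_FILE}
    grader_target: azure-base
```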
## Debugging

When a `cli` target misbehaves:

- Set `verbose: true` to see the rendered command and cwd.
- Set `keep_temp_files: true` and inspect `{PROMPT_FILE}` / `{OUTPUT_FILE}` after the run.
- Run the rendered command by hand with those files and check it exits 0 and writes the expected output shape.
- If the output looks right but grading is off, check the JSON schema — a typo in `output` vs `output_messages` silently falls back to “treat whole file as plain text.”
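To check which branch your output file will take, you can mirror the documented parse rule in a few lines. This is inferred from the description on this page, not from the provider’s actual parser, so treat it as a sketch:

```python
import json

# Keys this page documents as recognized in a structured output file.
RECOGNIZED_KEYS = {"output", "text", "token_usage", "cost_usd", "duration_ms"}


def classify_output(raw: str) -> str:
    """Return 'structured' if raw would parse as recognized JSON,
    else 'plain text' (the silent fallback described above)."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return "plain text"
    if isinstance(data, dict) and RECOGNIZED_KEYS & data.keys():
        return "structured"
    return "plain text"
```

Note how a misspelled key (e.g. `output_messages`) parses as JSON but matches nothing, so the whole file degrades to plain text exactly as the last bullet warns.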