agentchat.contrib.agent_eval.agent_eval
generate_criteria
def generate_criteria(llm_config: Optional[Union[Dict, Literal[False]]] = None,
                      task: Task = None,
                      additional_instructions: str = "",
                      max_round=2,
                      use_subcritic: bool = False)
Creates a list of criteria for evaluating the utility of a given task.
Arguments:
- llm_config (dict or bool) - LLM inference configuration.
- task (Task) - The task to evaluate.
- additional_instructions (str) - Additional instructions for the criteria agent.
- max_round (int) - The maximum number of rounds to run the conversation.
- use_subcritic (bool) - Whether to use the subcritic agent to generate subcriteria.
Returns:
list - A list of Criterion objects for evaluating the utility of the given task.
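A minimal usage sketch follows. The import paths mirror the module path of this page; the Task constructor fields (name, description, successful_response, failed_response) and the Criterion attributes read at the end are assumptions based on typical AgentEval examples, not guarantees of this reference.

```python
from autogen.agentchat.contrib.agent_eval.agent_eval import generate_criteria
from autogen.agentchat.contrib.agent_eval.task import Task  # assumed sibling module

# Hypothetical LLM configuration; substitute your own model and credentials.
llm_config = {"config_list": [{"model": "gpt-4", "api_key": "YOUR_API_KEY"}]}

# Task fields below are assumed from typical AgentEval examples.
task = Task(
    name="Math problem solving",
    description="Given a math problem, produce the correct final answer.",
    successful_response="The answer is 5.",
    failed_response="I am unable to solve this problem.",
)

criteria = generate_criteria(
    llm_config=llm_config,
    task=task,
    additional_instructions="Focus on correctness and clarity.",
    max_round=2,
    use_subcritic=False,
)

# Each Criterion is assumed to expose name and accepted_values attributes.
for criterion in criteria:
    print(criterion.name, criterion.accepted_values)
```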
quantify_criteria
def quantify_criteria(llm_config: Optional[Union[Dict, Literal[False]]] = None,
                      criteria: List[Criterion] = None,
                      task: Task = None,
                      test_case: str = "",
                      ground_truth: str = "")
Quantifies the performance of a system using the provided criteria.
Arguments:
- llm_config (dict or bool) - LLM inference configuration.
- criteria ([Criterion]) - A list of criteria for evaluating the utility of a given task.
- task (Task) - The task to evaluate.
- test_case (str) - The test case to evaluate.
- ground_truth (str) - The ground truth for the test case.
Returns:
dict - A dictionary where the keys are the criteria and the values are the assessed performance based on the accepted values for each criterion.
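A follow-on sketch, reusing the llm_config, task, and criteria from the generate_criteria example above. The shape of test_case here is an assumption; in the AgentEval workflow it is often a serialized conversation or response to be judged.

```python
from autogen.agentchat.contrib.agent_eval.agent_eval import quantify_criteria

# A toy test case; in practice this is often a serialized conversation or response.
test_case = "Problem: What is 2 + 3? Agent answer: The answer is 5."
ground_truth = "5"

assessment = quantify_criteria(
    llm_config=llm_config,
    criteria=criteria,
    task=task,
    test_case=test_case,
    ground_truth=ground_truth,
)

# Prints the assessed performance against each criterion, as described above.
print(assessment)
```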