agentchat.contrib.agent_eval.agent_eval
generate_criteria
def generate_criteria(llm_config: Optional[Union[Dict, Literal[False]]] = None,
                      task: Task = None,
                      additional_instructions: str = "",
                      max_round=2,
                      use_subcritic: bool = False)
Creates a list of criteria for evaluating the utility of a given task.
Arguments:
- llm_config (dict or bool) - LLM inference configuration.
- task (Task) - The task to evaluate.
- additional_instructions (str) - Additional instructions for the criteria agent.
- max_round (int) - The maximum number of rounds to run the conversation.
- use_subcritic (bool) - Whether to use the subcritic agent to generate subcriteria.
Returns:
list - A list of Criterion objects for evaluating the utility of the given task.
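A minimal usage sketch follows. The import paths mirror the module path of this page; the Task constructor fields (name, description, successful_response, failed_response) and the Criterion attributes read at the end are assumptions based on typical AgentEval examples, not guarantees of this reference.

```python
from autogen.agentchat.contrib.agent_eval.agent_eval import generate_criteria
from autogen.agentchat.contrib.agent_eval.task import Task  # assumed sibling module

# Hypothetical LLM configuration; substitute your own model and credentials.
llm_config = {"config_list": [{"model": "gpt-4", "api_key": "YOUR_API_KEY"}]}

# Task fields below are assumed from typical AgentEval examples.
task = Task(
    name="Math problem solving",
    description="Given a math problem, produce the correct final answer.",
    successful_response="The answer is 5.",
    failed_response="I am unable to solve this problem.",
)

criteria = generate_criteria(
    llm_config=llm_config,
    task=task,
    additional_instructions="Focus on correctness and clarity.",
    max_round=2,
    use_subcritic=False,
)

# Each Criterion is assumed to expose name and accepted_values attributes.
for criterion in criteria:
    print(criterion.name, criterion.accepted_values)
```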
quantify_criteria
def quantify_criteria(llm_config: Optional[Union[Dict, Literal[False]]] = None,
                      criteria: List[Criterion] = None,
                      task: Task = None,
                      test_case: str = "",
                      ground_truth: str = "")
Quantifies the performance of a system using the provided criteria.
Arguments:
- llm_config (dict or bool) - LLM inference configuration.
- criteria ([Criterion]) - A list of criteria for evaluating the utility of a given task.
- task (Task) - The task to evaluate.
- test_case (str) - The test case to evaluate.
- ground_truth (str) - The ground truth for the test case.
Returns:
dict - A dictionary where the keys are the criteria and the values are the assessed performance based on the accepted values for each criterion.
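A follow-on sketch, reusing the llm_config, task, and criteria from the generate_criteria example above. The shape of test_case here is an assumption; in the AgentEval workflow it is often a serialized conversation or response to be judged.

```python
from autogen.agentchat.contrib.agent_eval.agent_eval import quantify_criteria

# A toy test case; in practice this is often a serialized conversation or response.
test_case = "Problem: What is 2 + 3? Agent answer: The answer is 5."
ground_truth = "5"

assessment = quantify_criteria(
    llm_config=llm_config,
    criteria=criteria,
    task=task,
    test_case=test_case,
    ground_truth=ground_truth,
)

# Prints the assessed performance against each criterion, as described above.
print(assessment)
```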