PawBench v1.0
PawBench is a general agent benchmark released by Tongyi Lab for personal-assistant and agent scenarios. It evaluates a foundation model and its harness together as a single system. PawBench v1.0 builds a suite of 150 real tasks and 4,050 test units. Through a 9-model × 3-harness cross evaluation, it identifies the best model-plus-harness combinations, helps harness developers pinpoint issues and validate improvements, and provides a quantifiable, reproducible technical baseline for co-evolving agent systems.
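
The 4,050 figure is consistent with running each of the 150 tasks once under every model-harness pair (150 × 9 × 3 = 4,050). The text does not spell out this breakdown, so the sketch below treats it as an assumption. The names `NUM_TASKS`, `NUM_MODELS`, `NUM_HARNESSES`, and `enumerate_test_units` are hypothetical and are not part of PawBench.

```python
from itertools import product

# Counts taken from the description above.
NUM_TASKS = 150
NUM_MODELS = 9
NUM_HARNESSES = 3


def enumerate_test_units(num_tasks, num_models, num_harnesses):
    """Return every (task, model, harness) index triple.

    Assumption: one test unit is one task run under one model-harness pair.
    """
    return list(product(range(num_tasks), range(num_models), range(num_harnesses)))


units = enumerate_test_units(NUM_TASKS, NUM_MODELS, NUM_HARNESSES)
total = len(units)
print(total)
```

Under this assumption the full grid has 27 model-harness combinations, and each contributes 150 test units.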