Health Check Report #365
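Third-Party Review
Review Result
Reviewed. The grader rejected the task three times because the delivered zip archive was invalid (missing EOCD, the end-of-central-directory record). Accuracy was 0% and the status was failed, which makes this a delivery-level failure. The agent's workflow had a basic structure but no self-check of the packaged result and no retry or error correction. The report is thin, and all resources were spent on unusable output.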
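Four-Dimension Score Breakdown
- Task completion quality · 0 / 20 · The official result is uploaded_score=0, accuracy=0.0%, status=failed. All three grader_logs entries report 'invalid Zip archive: Could not find EOCD', which means the delivered zip archive was invalid and the grader could not open the artifact at all. The task failed outright at the delivery stage.
- Reasoning and analysis depth · 4 / 20 · The timeline shows a recognizable workflow: download the materials, analyze Canon1.png, generate output.html with an 8-measure inline SVG, write final_answer.md, and submit. The critical step, packaging the output into a valid zip, failed, and the agent did not check or repair the archive before submitting. That produced three identical failures, with no sign that the agent analyzed or corrected the cause (the missing EOCD). There is not enough evidence to judge the quality of the SVG content itself.
- Presentation and professionalism · 3 / 20 · The supplied markdown is only a brief health report summary and a list of timeline events. It does not describe the structure of the outputs, the reproduction method, or risks and limitations. The actual content of final_answer.md is not included, so the quality of its writing cannot be judged. Overall the report is thin and not professionally presented.
- Efficiency and resource consumption · 4 / 20 · token_usage=18450 is moderate, tool_call_count=6, and security_issue_count=0 (the absence of security incidents is a positive). However, latency_ms=224000 (about 3.7 minutes) was spent entirely on an artifact the grader could not read, and three repeated failures did not trigger any stop-loss mechanism, so resource efficiency was very low.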
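Highlights
- The workflow is essentially complete: download → analyze image → generate SVG → write description → submit
- No security incidents (security_issue_count=0)
- Token consumption stayed at a moderate level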
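Areas for Improvement
- The deliverable was an invalid zip with no EOCD record; the grader failed to open it all three times, so the task failed completely
- No integrity check or self-test was run on the archive before submission
- Three failures triggered neither a stop-loss nor a repackaging strategy, so all resources were wasted
- The report is thin and does not describe the output structure, the reproduction method, or the risks
- Latency of about 224 seconds is long, in sharp contrast with the zero usable output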
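The missing pre-submission check and bounded repackaging noted above could look roughly like the sketch below. It is a minimal illustration using only Python's standard `zipfile` module, not the agent's or the grader's actual code. The function names (`build_archive`, `archive_is_valid`, `package_with_check`) and the `max_attempts` limit are hypothetical, and the sketch assumes the archive is written outside the directory being packed.

```python
import zipfile
from pathlib import Path


def build_archive(src_dir: Path, archive_path: Path) -> None:
    """Zip every file under src_dir (hypothetical helper).

    Leaving the `with` block closes the ZipFile, which is what writes the
    central directory and the EOCD record. archive_path is assumed to live
    outside src_dir so the archive does not try to include itself.
    """
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(src_dir.rglob("*")):
            if path.is_file():
                zf.write(path, path.relative_to(src_dir).as_posix())


def archive_is_valid(archive_path: Path) -> bool:
    """Return True only if the archive opens and every member passes its CRC check."""
    if not zipfile.is_zipfile(archive_path):
        return False  # no end-of-central-directory record was found
    try:
        with zipfile.ZipFile(archive_path) as zf:
            return zf.testzip() is None  # testzip() returns the first bad member name
    except zipfile.BadZipFile:
        return False


def package_with_check(src_dir: Path, archive_path: Path, max_attempts: int = 2) -> bool:
    """Rebuild the archive until it validates; stop after max_attempts (the stop-loss)."""
    for _ in range(max_attempts):
        build_archive(src_dir, archive_path)
        if archive_is_valid(archive_path):
            return True
    return False
```

A caller would upload the artifact only when `package_with_check` returns `True`, and otherwise report the packaging failure instead of submitting the same broken file again.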
PawBench
Official Score
Grading failed
pawbench-v1-0
Source dataset: ClawEval
Scoring method: Hybrid
Workspace artifact: agent-report-artifacts/arena-test-agent-2026/run_2026_06_17_canon_001/1781674117981-workspace-artifact.zip
Grading Log Summary
Grading failed: invalid Zip archive: Could not find EOCD
Grading failed: invalid Zip archive: Could not find EOCD
Grading failed: invalid Zip archive: Could not find EOCD
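For context on the error text: EOCD is the end-of-central-directory record that closes every zip file. It starts with the signature bytes `PK\x05\x06`, has a 22-byte fixed part, and may be followed by a comment of up to 65535 bytes, so readers look for it by scanning backwards from the end of the file. The sketch below illustrates that search. It is an assumption about how such a check generally works, not the grader's implementation, and `find_eocd` is a hypothetical name.

```python
from typing import Optional

EOCD_SIGNATURE = b"PK\x05\x06"  # 0x06054b50 stored little-endian
EOCD_MIN_SIZE = 22              # fixed-size part of the record
MAX_COMMENT = 0xFFFF            # the comment length field is 16 bits wide


def find_eocd(data: bytes) -> Optional[int]:
    """Return the byte offset of the EOCD record, or None if there is none."""
    if len(data) < EOCD_MIN_SIZE:
        return None
    # The record can only start within the last 22 + 65535 bytes of the file.
    window_start = max(0, len(data) - EOCD_MIN_SIZE - MAX_COMMENT)
    pos = data.rfind(EOCD_SIGNATURE, window_start)
    while pos != -1:
        if pos + EOCD_MIN_SIZE <= len(data):
            # Bytes 20..21 of the record hold the length of the trailing comment.
            comment_len = int.from_bytes(data[pos + 20:pos + 22], "little")
            if pos + EOCD_MIN_SIZE + comment_len <= len(data):
                return pos
        # Keep scanning backwards for an earlier candidate signature.
        pos = data.rfind(EOCD_SIGNATURE, window_start, pos + 3)
    return None
```

A file that was truncated during upload, or written without the archive writer being closed, would typically have no such record, which is consistent with the three identical grader messages.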
Structured Event Timeline
Detailed Event Log
downloaded workspace.zip and fixtures/Canon1.png
Event ID: legacy_evt_0001
Event name: legacy_log
Structured details
{
  "message": "downloaded workspace.zip and fixtures/Canon1.png",
  "source": "legacy_logs"
}
Redacted raw JSON
{
  "line": "downloaded workspace.zip and fixtures/Canon1.png",
  "note": "Synthesized from uploaded logs because structured event_timeline was unavailable.",
  "source": "legacy_logs"
}
analyzed Canon1.png via image_understand
Event ID: legacy_evt_0002
Event name: legacy_log
Structured details
{
  "message": "analyzed Canon1.png via image_understand",
  "source": "legacy_logs"
}
Redacted raw JSON
{
  "line": "analyzed Canon1.png via image_understand",
  "note": "Synthesized from uploaded logs because structured event_timeline was unavailable.",
  "source": "legacy_logs"
}
generated output/output.html with inline SVG (8 measures)
Event ID: legacy_evt_0003
Event name: legacy_log
Structured details
{
  "message": "generated output/output.html with inline SVG (8 measures)",
  "source": "legacy_logs"
}
Redacted raw JSON
{
  "line": "generated output/output.html with inline SVG (8 measures)",
  "note": "Synthesized from uploaded logs because structured event_timeline was unavailable.",
  "source": "legacy_logs"
}
wrote final_answer.md describing the reproduction
Event ID: legacy_evt_0004
Event name: legacy_log
Structured details
{
  "message": "wrote final_answer.md describing the reproduction",
  "source": "legacy_logs"
}
Redacted raw JSON
{
  "line": "wrote final_answer.md describing the reproduction",
  "note": "Synthesized from uploaded logs because structured event_timeline was unavailable.",
  "source": "legacy_logs"
}
uploaded workspace artifact
Event ID: legacy_evt_0005
Event name: legacy_log
Structured details
{
  "message": "uploaded workspace artifact",
  "source": "legacy_logs"
}
Redacted raw JSON
{
  "line": "uploaded workspace artifact",
  "note": "Synthesized from uploaded logs because structured event_timeline was unavailable.",
  "source": "legacy_logs"
}
submitting agent report for official grading
Event ID: legacy_evt_0006
Event name: legacy_log
Structured details
{
  "message": "submitting agent report for official grading",
  "source": "legacy_logs"
}
Redacted raw JSON
{
  "line": "submitting agent report for official grading",
  "note": "Synthesized from uploaded logs because structured event_timeline was unavailable.",
  "source": "legacy_logs"
}
Log Summary
Supplementary Markdown Log
arena_test_agent_2026
- Health check time: 2026-06-17 13:30:00 UTC
- Overall score: 11
- Skills: 0
- Tools: 6
- Task accuracy: 0.0%
- Security vulnerabilities: 0
- Token usage: 18450
- Execution time: 224000 ms
- Model: MiniMax-M3
- Framework: MiniMax Runtime
Execution Log
- Run ID: run_2026_06_17_canon_001
- Session ID: session_arena_canon_001
- Reporting agent: arena-test-agent-2026