Introducing EvoSkill: a framework that analyzes agent failures and automatically builds the missing skills, yielding rapid improvement on difficult benchmarks and skills that generalize across use cases.

+12.1% on SealQA
+7.3% on OfficeQA (SOTA)
+5.3% on BrowseComp via zero-shot transfer from SealQA

Read more below 🧵
2/ Agent skills are a powerful abstraction for solving long-horizon problems, but they don't scale easily.

Coding agents (Claude Code, Codex, OpenHands) are powerful general-purpose solvers. On specialized long-horizon tasks, however, errors compound without traceability, and domain-specific expertise is absent.

Skills have emerged as a powerful abstraction for improving agent performance on real-world tasks, but today's skills are painstakingly hand-crafted by experts. We have uncovered a path to reliably automating skill development.
3/ EvoSkill applies textual feedback descent to skill discovery.

The loop runs three specialized agents:
1. Executor: attempts a batch of tasks under the current skill configuration
2. Proposer: analyzes failed traces, cross-references a cumulative feedback history of prior proposals, and identifies the highest-impact capability gap
3. Skill Builder: materializes the proposal into a structured skill folder (SKILL.md, scripts, references, etc.)

A Pareto frontier of the top-N configurations governs selection: only skills that improve performance on the validation set survive.
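The loop above can be sketched in a few lines. This is a minimal illustration, not the EvoSkill implementation: the agent callables (`executor`, `proposer`, `builder`, `validate`) are hypothetical stand-ins for the three LLM agents and the validation run, and the frontier is reduced to a single validation score rather than a true multi-objective Pareto set.

```python
from dataclasses import dataclass

@dataclass
class Config:
    skills: tuple   # skill names active in this configuration
    score: float    # validation score under these skills

def select_frontier(candidates, top_n):
    """Single-score simplification of the thread's Pareto frontier:
    keep only the top-N configurations by validation score."""
    return sorted(candidates, key=lambda c: c.score, reverse=True)[:top_n]

def evolve(initial, iterations, executor, proposer, builder, validate, top_n=3):
    frontier = [initial]
    history = []  # cumulative feedback history of prior proposals
    for _ in range(iterations):
        best = frontier[0]
        failures = executor(best.skills)        # 1. Executor: run tasks, collect failed traces
        proposal = proposer(failures, history)  # 2. Proposer: pick the highest-impact gap
        history.append(proposal)
        skill = builder(proposal)               # 3. Skill Builder: materialize the skill
        new_skills = best.skills + (skill,)
        candidate = Config(new_skills, validate(new_skills))
        # Only configurations that stay on the top-N frontier survive.
        frontier = select_frontier(frontier + [candidate], top_n)
    return frontier[0]
```

With real agents, `executor` would run the benchmark tasks, `proposer` and `builder` would be LLM calls, and `validate` would score the configuration on a held-out split; here they can be stubbed with simple lambdas to exercise the selection logic.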
4/ EvoSkill achieves rapid performance gains using only a fraction of the benchmark data.

We tested performance across three benchmarks:
1. OfficeQA (reasoning over large corpora): 60.6% → 67.9% (+7.3%), SOTA across all systems
2. SealQA (search-augmented QA): 26.6% → 38.7% (+12.1%)
3. BrowseComp (open-web fact-seeking): 43.5% → 48.8% (+5.3%), zero-shot transfer from SealQA-evolved skills with no modification

The BrowseComp result came from skills evolved on SealQA (query reformulation, multi-source verification, structured search persistence) transferring zero-shot to a benchmark with different questions, difficulty distribution, and retrieval conditions. This suggests skill-level optimization produces domain-general capabilities rather than task-specific overfitting.
5/ Skill-level optimization is a better abstraction for producing transferable capabilities, more modular than prompts or code.

EvoSkill is fully open-source. We believe skills sit in a critical spot that prompts and code cannot reach: structured enough to encode multi-step procedures with branching logic and verification, yet readable enough that a developer can inspect them, edit them, and hand them to a different agent on a different model.

We are continuing this work across broader domains (coding, multimodal) in collaboration with Virginia Tech (@tuvllms, @noahpro99, Jaydon Bingham, and @WeiyuanChen01) and are open to collaboration with the broader research community.