1/ Introducing EvoSkill: a framework that analyzes agent failures and automatically builds the missing skills, yielding rapid improvement on difficult benchmarks and skills that generalize across use cases.
+12.1% on SealQA
+7.3% on OfficeQA (SOTA)
+5.3% on BrowseComp via zero-shot transfer from SealQA
Read more below 🧵

2/ Agent skills are a powerful abstraction for solving long-horizon problems, but they don't scale easily
Coding agents (Claude Code, Codex, OpenHands) are powerful general-purpose solvers. On specialized long-horizon tasks, however, errors compound without traceability, and domain-specific expertise is absent.
Skills have emerged as a powerful abstraction for improving agent performance on real-world tasks, but today's skills are painstakingly hand-crafted by experts.
We have uncovered a path to reliably automating skill development.
3/ EvoSkill applies textual feedback descent to skill discovery
The loop runs three specialized agents:
1. Executor: Attempts a batch of tasks under the current skill configuration
2. Proposer: Analyzes failed traces, cross-references a cumulative feedback history of prior proposals, and identifies the highest-impact capability gap
3. Skill Builder: Materializes the proposal into a structured skill folder (SKILL.md, scripts, references, etc.)
A Pareto frontier of the top-N configurations governs selection: only skills that improve validation performance survive.
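The loop above can be sketched as a minimal toy. Everything here is a stand-in assumption: in EvoSkill the Executor, Proposer, and Skill Builder are LLM agents and a skill is a structured folder, while in this sketch tasks and skills are plain strings and a task "succeeds" when a matching skill exists. The sketch only illustrates the control flow: run, analyze failures against history, build the highest-impact missing skill, keep it if validation improves.

```python
from dataclasses import dataclass

@dataclass
class SkillConfig:
    """A candidate skill set plus its measured validation score."""
    skills: frozenset
    score: float = 0.0

def executor(tasks, config):
    # Toy Executor: a task fails unless some skill covers it.
    return [t for t in tasks if t not in config.skills]

def proposer(failures, history):
    # Toy Proposer: pick the first capability gap not already proposed,
    # cross-referencing the cumulative feedback history.
    for gap in failures:
        if gap not in history:
            return gap
    return None

def skill_builder(proposal):
    # Toy Skill Builder: "materialize" the proposal (here, just a string;
    # in the real system, a SKILL.md folder with scripts and references).
    return proposal

def evolve(tasks, rounds=10, top_n=3):
    frontier = [SkillConfig(frozenset())]   # Pareto frontier of top-N configs
    history = set()                          # cumulative proposal history
    for _ in range(rounds):
        best = max(frontier, key=lambda c: c.score)
        failures = executor(tasks, best)
        if not failures:
            break
        proposal = proposer(failures, history)
        if proposal is None:
            break
        history.add(proposal)
        candidate = SkillConfig(best.skills | {skill_builder(proposal)})
        candidate.score = 1 - len(executor(tasks, candidate)) / len(tasks)
        # Selection: only skills that improve validation performance survive.
        if candidate.score > best.score:
            frontier = sorted(frontier + [candidate],
                              key=lambda c: c.score, reverse=True)[:top_n]
    return max(frontier, key=lambda c: c.score)
```

Each round grows the best surviving configuration by one skill, so progress is monotone on the validation signal while the frontier caps memory at the top-N configurations.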

4/ EvoSkill achieves rapid performance gains using only a fraction of the benchmark data
We tested performance across three benchmarks:
1. OfficeQA (reasoning over large corpora): 60.6% → 67.9% (+7.3%), achieving SOTA across all systems
2. SealQA (search-augmented QA): 26.6% → 38.7% (+12.1%)
3. BrowseComp (open-web fact-seeking): 43.5% → 48.8% (+5.3%); zero-shot transfer from SealQA-evolved skills, no modification
The BrowseComp result stemmed from skills evolved on SealQA (query reformulation, multi-source verification, structured search persistence) that transfer zero-shot to a benchmark with different questions, difficulty distribution, and retrieval conditions. This suggests skill-level optimization produces domain-general capabilities rather than task-specific overfitting.

5/ Skill-level optimization is a better abstraction: it produces transferable capabilities more modular than prompts or code
EvoSkill is fully open-source. We believe skills sit in a critical spot that prompts and code cannot reach: structured enough to encode multi-step procedures with branching logic and verification, yet readable enough that a developer can inspect them, edit them, and hand them to a different agent on a different model.
We are continuing this work across broader domains (coding, multimodal) in collaboration with Virginia Tech (@tuvllms, @noahpro99, Jaydon Bingham, and @WeiyuanChen01) and are open to collaboration with the broader research community.
