Crypto copilots need to reason under moving markets. That means tougher, production-grounded benchmarks.
CryptoAnalystBench helps advance reasoning for open-source AI by grading long-form crypto answers on relevance, temporal relevance, depth, and data consistency 🧵

2/ This benchmark is important because reasoning breaks in fast-changing conditions
Most evals check whether a model can fetch facts. In crypto, users need a coherent stance when signals conflict, time windows shift, and sources disagree. If you do not measure that synthesis, you ship copilots that sound plausible, then drift, contradict themselves, and mislead decisions.
CryptoAnalystBench scores long-form, analyst-style answers on relevance, depth, temporal relevance, and data consistency, giving teams a repeatable baseline for iteration and regression testing. It also surfaces where agents break in practice: stale framing, shallow synthesis, internal contradictions, and overconfident claims.
CryptoAnalystBench is designed to complement ground-truth suites like DMind and CryptoBench, with separate factuality checks for claim-level correctness.
3/ We built CryptoAnalystBench by distilling production traffic into a compact dataset
We started from a recent slice of Sentient Chat queries and removed prompts that were either too long to evaluate consistently or too short to reflect real intent.
Then we clustered the remainder into roughly 2,000 intent groups, defined 11 categories, and AI-tagged each query so coverage stays aligned with real user demand.
From there, we removed near duplicates within each category, pruned “easy” prompts that models can answer from training alone, and hand-curated a representative final snapshot for evaluation.
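For anyone wanting to replicate the shape of this pipeline, here is a minimal sketch of the length filter and intent clustering steps. The embedding model, thresholds, and helper structure are assumptions for illustration, not the production code.

```python
# Illustrative sketch of the distillation steps described above.
# Thresholds, the embedding model, and the cluster count are assumptions.
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

MIN_CHARS, MAX_CHARS = 20, 2000   # drop prompts too short to show intent or too long to judge
N_INTENT_GROUPS = 2000            # rough cluster count mentioned in the thread

def distill(queries: list[str]) -> dict[int, list[str]]:
    # 1) Length filter: keep prompts that reflect real intent and stay evaluable.
    kept = [q for q in queries if MIN_CHARS <= len(q) <= MAX_CHARS]

    # 2) Embed and cluster into intent groups.
    model = SentenceTransformer("all-MiniLM-L6-v2")   # assumed embedding model
    vectors = model.encode(kept, normalize_embeddings=True)
    labels = KMeans(n_clusters=N_INTENT_GROUPS, n_init="auto").fit_predict(vectors)

    # 3) Group queries by intent; category tagging and curation happen downstream.
    groups: dict[int, list[str]] = {}
    for query, label in zip(kept, labels):
        groups.setdefault(int(label), []).append(query)
    return groups
```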
4/ Our dataset design choices determine what failures you can find
Near duplicates inflate scores without improving coverage. Easy prompts hide tool and synthesis failures.
We designed CryptoAnalystBench to keep diversity, preserve real traffic proportions, and stay time-robust so it catches drift and regressions instead of rewarding memorization.
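The within-category near-duplicate pruning can be as simple as a greedy cosine-similarity filter. The sketch below is illustrative only; the 0.9 threshold is an assumption, not the value used in the benchmark.

```python
# Illustrative near-duplicate pruning for one category: greedily keep a query
# only if its cosine similarity to every already-kept query stays below a
# threshold. The 0.9 cutoff is an assumed value for the sketch.
import numpy as np

def prune_near_duplicates(embeddings: np.ndarray, threshold: float = 0.9) -> list[int]:
    """embeddings: (n, d) unit-normalized vectors for one category's queries."""
    kept: list[int] = []
    for i in range(len(embeddings)):
        sims = embeddings[kept] @ embeddings[i] if kept else np.array([])
        if sims.size == 0 or sims.max() < threshold:
            kept.append(i)
    return kept  # indices of queries that survive dedup
```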
5/ The evaluation loop is built for reproducible iteration
We score each answer with an LLM judge using a fixed rubric and JSON-only outputs, without revealing which system produced which response.
We chose DeepSeek v3.1 via Fireworks after bias testing, then controlled variance with balanced response order randomization and a shared judge conversation per query to reduce calibration drift.
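In rough pseudocode, one blinded judging call looks like the sketch below. Here call_judge is a placeholder for whatever client reaches the judge model, and the prompt wording, 1-10 scale, and label scheme are illustrative assumptions rather than the benchmark's exact rubric.

```python
# Illustrative sketch of one blinded judging call: responses are shuffled,
# system identities are replaced with neutral labels, and the judge must
# return JSON-only scores on the four rubric dimensions.
import json
import random

DIMENSIONS = ["relevance", "temporal_relevance", "depth", "data_consistency"]

def judge_query(query: str, responses: dict[str, str], call_judge) -> dict:
    # Randomize presentation order so position bias averages out across runs.
    systems = list(responses)
    random.shuffle(systems)
    blind_labels = {sys: f"Response {chr(65 + i)}" for i, sys in enumerate(systems)}

    prompt = (
        f"Query: {query}\n\n"
        + "\n\n".join(f"{blind_labels[s]}:\n{responses[s]}" for s in systems)
        + "\n\nScore each response 1-10 on: " + ", ".join(DIMENSIONS)
        + '. Reply with JSON only, e.g. {"Response A": {"relevance": 7, ...}, ...}.'
    )
    raw = call_judge(prompt)     # in practice, one shared judge conversation per query
    scores = json.loads(raw)     # JSON-only output keeps parsing deterministic

    # Map blinded labels back to the real system names for reporting.
    return {sys: scores[blind_labels[sys]] for sys in systems}
```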
The output is what dev teams need to iterate: per-dimension scores, per-query ranks, and category slices for regression testing and targeted fixes. It also makes the limitation explicit: high analyst quality can still hide hallucinated numerics or misattributed claims.
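Turning raw judge outputs into those artifacts is mostly a grouping exercise. The sketch below assumes a flat table of per-(query, system) scores with hypothetical column names; it is not the benchmark's reporting code.

```python
# Illustrative aggregation into per-dimension scores, per-query ranks, and
# category slices. Column names are assumptions about how scores are stored.
import pandas as pd

DIMS = ["relevance", "temporal_relevance", "depth", "data_consistency"]

def summarize(rows: list[dict]) -> dict[str, pd.DataFrame]:
    # rows: one dict per (query, system), e.g.
    # {"query_id": ..., "category": ..., "system": ..., "relevance": 7, ...}
    df = pd.DataFrame(rows)
    df["overall"] = df[DIMS].mean(axis=1)

    per_dimension = df.groupby("system")[DIMS].mean()                    # per-dimension scores
    per_query_rank = df.groupby("query_id")["overall"].rank(ascending=False)  # per-query ranks
    category_slices = df.groupby(["category", "system"])["overall"].mean().unstack()

    return {
        "per_dimension": per_dimension,
        "per_query_rank": df.assign(rank=per_query_rank),
        "category_slices": category_slices,
    }
```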
Next steps are to keep the benchmark fresh on a cadence and pair it with trace-based error localization plus evidence-bounded factuality checks.