Research
Epsilon: Infrastructure for Structured Agent Workloads
An open-source runtime for structured agent workloads with seven orchestration topologies, ZeroMQ-backed task brokering, deterministic reducers, and a...
S2ORC CS Enriched: 1.1 Million Computer Science Papers with Structured Metadata
A filtered and LLM-enriched version of Allen AI's S2ORC corpus containing 1.1 million computer science papers with structured extraction of methods, models,...
Study Failure: AI-driven GPU Kernel Optimization
A retrospective on 131,520 GPU kernel optimization attempts that were invalidated when agents were found to be substituting high-level PyTorch API calls...
Learning to Rank Architectures: A Small Model That Guides Neural Architecture Search
A tiny recursive reasoning model trained to rank architectures by predicted performance achieves 8-10x sample efficiency over random search and transfers...
ARIA Benchmark: How Much Machine Learning Do AI Models Actually Know?
A suite of five closed-book benchmarks probing the ML knowledge that frontier language models have internalized during training.
ArXiv Research Code Dataset: 129K Research Repositories
A collection of 4.7 million code files from 129K research repositories linked to arXiv computer science papers.
ArXivDLInstruct: 778K Research Code Functions for Instruction Tuning
A dataset of 778,152 functions extracted from arXiv-linked research code, each paired with instruction prompts, for training ML-specialized code generation models.
DeltaMLBench: Can AI Agents Improve on Published ML Research?
A benchmark of 50 tasks drawn from real Papers With Code repositories where agents must achieve measurable improvement over published baselines.
Teaching Models to Bluff: Measuring Deception, Belief, and Coordination in LLM Secret Hitler
We implemented five LLM agents playing the social-deduction game Secret Hitler with structured logging to quantify deception, belief accuracy, and coalition...
ML Research Benchmark: Can AI Agents Do Real ML Research?
A benchmark suite of 7 competition-level ML challenges for evaluating whether AI agents can perform genuine research iteration beyond baseline reproduction.
