Research
Study Failure: AI-driven GPU Kernel Optimization
I recently completed what I thought was a comprehensive study of AI-driven GPU kernel optimization. Over 131,520 optimization attempts across 137 kernels,...
Learning to Rank Architectures: A Small Model That Guides Neural Architecture Search
I trained a tiny recursive reasoning model to rank architectures by predicted performance, then used it to guide search. It achieved 8-10x sample efficiency...
ARIA Benchmark: How Much Machine Learning Do AI Models Actually Know?
A benchmark suite for evaluating AI agents on real machine learning research tasks — including task definitions, a baseline agent, and evaluation infrastructure.
ArXiv Research Code Dataset: 129K Research Repositories
ArXivDLInstruct: 778K Research Code Functions for Instruction Tuning
DeltaMLBench: Can AI Agents Improve on Published ML Research?
Teaching Models to Bluff: Measuring Deception, Belief, and Coordination in LLM Secret Hitler
I wired up five LLM agents to play the social-deduction game Secret Hitler with structured logging.
ML Research Benchmark: Can AI Agents Do Real ML Research?
A benchmark suite for evaluating AI agents on real machine learning research tasks — including task definitions, a baseline agent, and evaluation infrastructure.
