Study Failure: AI-driven GPU Kernel Optimization

I recently completed what I thought was a comprehensive study of AI-driven GPU kernel optimization. Over 131,520 optimization attempts across 137 kernels,...

gpu / optimization / machine learning

Learning to Rank Architectures: A Small Model That Guides Neural Architecture Search

I trained a tiny recursive reasoning model to rank architectures by predicted performance, then used it to guide search. It achieved 8-10x sample efficiency...

nas / architecture search / machine learning

ARIA Benchmark: How Much Machine Learning Do AI Models Actually Know?

A benchmark suite for evaluating AI agents on real machine learning research tasks — including task definitions, a baseline agent, and evaluation infrastructure.

agent-evaluation / benchmarks / python

ArXiv Research Code Dataset: 129K Research Repositories

agent-evaluation / benchmarks / python

ArXivDLInstruct: 778K Research Code Functions for Instruction Tuning

agent-evaluation / benchmarks / python

DeltaMLBench: Can AI Agents Improve on Published ML Research?

A benchmark suite for evaluating AI agents on real machine learning research tasks — including task definitions, a baseline agent, and evaluation infrastructure.

agent-evaluation / benchmarks / python

Teaching Models to Bluff: Measuring Deception, Belief, and Coordination in LLM Secret Hitler

I wired up five LLM agents to play the social-deduction game Secret Hitler with structured logging.

ai-research / agi / recursive-improvement

ML Research Benchmark: Can AI Agents Do Real ML Research?

A benchmark suite for evaluating AI agents on real machine learning research tasks — including task definitions, a baseline agent, and evaluation infrastructure.

agent-evaluation / benchmarks / python