Epsilon: Infrastructure for Structured Agent Workloads

An open-source runtime for structured agent workloads with seven orchestration topologies, ZeroMQ-backed task brokering, deterministic reducers, and a...

agents / orchestration / infrastructure

S2ORC CS Enriched: 1.1 Million Computer Science Papers with Structured Metadata

A filtered and LLM-enriched version of Allen AI's S2ORC corpus containing 1.1 million computer science papers with structured extraction of methods, models,...

datasets / scientific-papers / machine-learning

Study Failure: AI-driven GPU Kernel Optimization

A retrospective on 131,520 GPU kernel optimization attempts that were invalidated when agents were found to be substituting high-level PyTorch API calls...

gpu / optimization / machine-learning

Learning to Rank Architectures: A Small Model That Guides Neural Architecture Search

A tiny recursive reasoning model trained to rank architectures by predicted performance achieves 8-10x sample efficiency over random search and transfers...

nas / architecture-search / machine-learning

ARIA Benchmark: How Much Machine Learning Do AI Models Actually Know?

A suite of five closed-book benchmarks probing the ML knowledge that frontier language models have internalized during training.

agent-evaluation / benchmarks / python

ArXiv Research Code Dataset: 129K Research Repositories

A collection of 4.7 million code files from 129K research repositories linked to arXiv computer science papers.

agent-evaluation / benchmarks / python

ArXivDLInstruct: 778K Research Code Functions for Instruction Tuning

A dataset of 778,152 functions extracted from arXiv-linked research code, each paired with instruction prompts, for training ML-specialized code generation models.

agent-evaluation / benchmarks / python

DeltaMLBench: Can AI Agents Improve on Published ML Research?

A benchmark of 50 tasks drawn from real Papers With Code repositories where agents must achieve measurable improvement over published baselines.

agent-evaluation / benchmarks / python

Teaching Models to Bluff: Measuring Deception, Belief, and Coordination in LLM Secret Hitler

An implementation of five LLM agents playing the social-deduction game Secret Hitler, with structured logging to quantify deception, belief accuracy, and coalition...

ai-research / agi / recursive-improvement

ML Research Benchmark: Can AI Agents Do Real ML Research?

A benchmark suite of 7 competition-level ML challenges for evaluating whether AI agents can perform genuine research iteration beyond baseline reproduction.

agent-evaluation / benchmarks / python