We publish datasets, models, code, and evaluation resources to support research on how AI systems behave in practice. These artifacts are designed to surface failure modes that emerge when models interact with tools, operate over time, and rely on retrieval, compression, and imperfect objectives.

ML Research Benchmark
A benchmark suite for evaluating AI agents on real machine learning research tasks, with 7 competition-level challenges from NeurIPS, ICML, and CoNLL plus task definitions, a baseline agent, and evaluation infrastructure.
Five closed-book benchmarks probing the ML knowledge that frontier models have internalized during training, covering dataset recognition, model classification, and metric recall across the ML landscape.
A benchmark for measuring whether AI agents can improve on published ML research, focused on the gap between implementing known approaches and discovering better ones.
A recursive reasoning model that ranks candidate architectures by predicted performance and guides search, achieving an 8-10x improvement in sample efficiency over random search and transferring zero-shot across datasets.
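As a rough sketch of the guided-search loop (every name below is hypothetical, not the released model's API), a learned predictor can rank sampled candidates so that the expensive evaluation budget is spent only on the most promising architectures:

```python
import random

def sample_architecture(rng):
    # Hypothetical search space: depth, width, and a skip-connection flag.
    return {
        "depth": rng.choice([2, 4, 8, 16]),
        "width": rng.choice([64, 128, 256, 512]),
        "skip": rng.random() < 0.5,
    }

def predict_score(arch):
    # Stand-in for the learned ranker; a real predictor is trained on
    # (architecture, measured performance) pairs.
    return 0.5 + 0.01 * arch["depth"] + 1e-4 * arch["width"] + 0.02 * arch["skip"]

def evaluate(arch, rng):
    # Stand-in for an expensive real training run.
    return predict_score(arch) + rng.gauss(0, 0.02)

def guided_search(n_candidates=200, budget=20, seed=0):
    rng = random.Random(seed)
    candidates = [sample_architecture(rng) for _ in range(n_candidates)]
    # Rank cheaply with the predictor, then spend the real evaluation
    # budget only on the top-ranked candidates.
    ranked = sorted(candidates, key=predict_score, reverse=True)
    evaluated = [(evaluate(a, rng), a) for a in ranked[:budget]]
    return max(evaluated, key=lambda pair: pair[0])

best_score, best_arch = guided_search()
print(best_score, best_arch)
```

Random search is the same loop without the ranking step; the reported sample-efficiency gain is measured against that baseline.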
An open-source runtime for structured agent workloads with eight orchestration topologies, ZeroMQ-backed task brokering, deterministic reducers, and a population search mode for recursive improvement loops.
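A minimal sketch of ZeroMQ-backed task brokering with a deterministic reduce step, assuming pyzmq; the topology, endpoint, and message schema here are illustrative, not the runtime's actual wire format:

```python
import threading
import zmq

ADDR = "tcp://127.0.0.1:5557"  # illustrative endpoint

def broker(n_tasks):
    # Fan tasks out to connected workers over a PUSH socket (round-robin).
    push = zmq.Context.instance().socket(zmq.PUSH)
    push.bind(ADDR)
    for i in range(n_tasks):
        push.send_json({"task_id": i, "payload": i})
    push.close()

def worker(results, lock):
    # Pull tasks until the queue has been quiet for one second.
    pull = zmq.Context.instance().socket(zmq.PULL)
    pull.connect(ADDR)
    while pull.poll(timeout=1000):
        task = pull.recv_json()
        with lock:
            results.append((task["task_id"], task["payload"] ** 2))
    pull.close()

results, lock = [], threading.Lock()
threads = [threading.Thread(target=worker, args=(results, lock)) for _ in range(4)]
for t in threads:
    t.start()
broker(32)
for t in threads:
    t.join()

# Deterministic reducer: sort by task_id so the merged output does not
# depend on worker scheduling or network timing.
print([value for _, value in sorted(results)])
```

Sorting by task_id before reducing is what makes the merge deterministic: the output no longer depends on which worker finished first.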
A dataset for studying how models interpret, follow, and generalize technical instructions derived from real research code, with an emphasis on research and agentic workflows.
A large corpus of research code referenced directly in scientific papers, designed to study how models reason about, modify, and execute real-world code in complex software systems.
A collection of instruction-tuning datasets derived from scientific abstracts for studying how synthetic supervision shapes model behavior, overfitting, and generalization under distribution shift.
Question-answer datasets derived from arXiv for evaluating retrieval and search behavior in technical domains, including retrieval failures, false confidence, and compounding RAG errors.
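A hedged sketch of the kind of measurement such datasets support, assuming a hypothetical record format with gold arXiv IDs per question; a recall@k of zero flags the silent retrieval failures that later compound in a RAG pipeline:

```python
def recall_at_k(retrieved, relevant, k=5):
    """Fraction of gold documents appearing in the top-k retrieved results."""
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)

# Hypothetical records: gold arXiv IDs per question plus a retriever's
# ranked output. The IDs below are placeholders, not dataset contents.
examples = [
    {"relevant": ["1706.03762"], "retrieved": ["1706.03762", "1810.04805"]},
    {"relevant": ["2005.14165"], "retrieved": ["1910.10683", "2104.08691"]},
]

scores = [recall_at_k(e["retrieved"], e["relevant"]) for e in examples]
print(sum(scores) / len(scores))  # mean recall@5; here 0.5
# An example scoring zero is a silent retrieval failure: the pipeline
# still generates an answer, and the error compounds downstream.
```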
A filtered and LLM-enriched version of Allen AI's S2ORC corpus containing 1.1 million computer science papers with structured metadata for methods, models, datasets, metrics, compute usage, and limitations.
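As an illustration only (the actual field names and layout of the release may differ), records carrying this kind of structured metadata can be filtered without any NLP:

```python
import json

# Hypothetical record shape for one enriched paper.
example = {
    "paper_id": "0001",
    "title": "An Example Paper",
    "methods": ["contrastive learning"],
    "datasets": ["ImageNet"],
    "metrics": ["top-1 accuracy"],
    "compute": {"gpu_hours": 128},
    "limitations": "Single-domain evaluation only.",
}

def papers_using_dataset(jsonl_path, dataset_name):
    # Stream a JSONL dump and yield papers whose metadata lists the dataset.
    with open(jsonl_path) as f:
        for line in f:
            record = json.loads(line)
            if dataset_name in record.get("datasets", []):
                yield record["paper_id"], record["title"]
```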
A 34 GB dataset of AI research data for supervised fine-tuning, designed to train models that can reason about and generate AI research.
A dataset of 16k AI safety-relevant papers from arXiv, enriched with structured metadata.
Semantic search models trained on large-scale scientific corpora for studying retrieval quality, query formulation, and downstream agent behavior when retrieval errors propagate silently.
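A minimal sketch of the retrieval step such models implement, with stand-in random embeddings in place of a trained encoder; ranking is plain cosine similarity over L2-normalized vectors:

```python
import numpy as np

def top_k(query_vec, doc_vecs, k=3):
    # Cosine similarity reduces to a dot product after L2 normalization.
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    sims = d @ q
    idx = np.argsort(-sims)[:k]  # indices of the k most similar docs
    return idx, sims[idx]

# Stand-in embeddings; in practice these come from the trained encoder.
rng = np.random.default_rng(0)
corpus = rng.normal(size=(1000, 384))  # 1,000 docs, 384-dim vectors
query = rng.normal(size=384)
indices, similarities = top_k(query, corpus)
print(indices, similarities)
```

When the encoder mis-embeds a query, this step returns confidently ranked but irrelevant documents, which is exactly the silent error propagation these models are meant to help study.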
Models for summarizing full-length scientific papers from abstracts and source documents, useful for analyzing information loss, misrepresentation, and overconfidence introduced by compression.