Open Source

We publish datasets, models, code, and evaluation resources to support research on how AI systems behave in practice.

These artifacts are designed to surface the failure modes that emerge when models interact with tools, operate over long horizons, and depend on retrieval, compression, and imperfect objectives.

ML Research Benchmark

A benchmark suite for evaluating AI agents on real machine learning research tasks: seven competition-level challenges drawn from NeurIPS, ICML, and CoNLL, along with task definitions, a baseline agent, and evaluation infrastructure.
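
The released harness defines tasks, runs an agent against them, and scores the results. As an illustration only, here is a minimal sketch of that loop; the `Task`, `baseline_agent`, and `evaluate` names are hypothetical, not the benchmark's actual API.

```python
# Minimal sketch of an agent-evaluation loop. All names here are
# illustrative stand-ins, not the benchmark's real interfaces.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Task:
    name: str                       # e.g. one NeurIPS/ICML/CoNLL challenge
    prompt: str                     # task definition handed to the agent
    score: Callable[[str], float]   # maps an agent's submission to a score

def baseline_agent(prompt: str) -> str:
    """Placeholder agent: a real agent would plan, call tools, run code."""
    return "submission for: " + prompt

def evaluate(tasks: list[Task], agent: Callable[[str], str]) -> dict[str, float]:
    # Run the agent on every task and collect per-task scores.
    return {t.name: t.score(agent(t.prompt)) for t in tasks}

tasks = [Task("mini-challenge", "improve baseline accuracy",
              lambda s: float("submission" in s))]
print(evaluate(tasks, baseline_agent))  # {'mini-challenge': 1.0}
```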

ARIA Benchmark

Five closed-book benchmarks probing the ML knowledge that frontier models have internalized during training, covering dataset recognition, model classification, and metric recall across the ML landscape.
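
Closed-book here means the model answers from parametric knowledge alone, with no retrieval. A minimal sketch of one dataset-recognition probe scored by exact match, with a stubbed model call; the question format is illustrative, not ARIA's actual format.

```python
# Illustrative closed-book probe with exact-match scoring. `ask_model`
# is a stub standing in for a frontier-model API call.
def ask_model(question: str) -> str:
    return "MNIST"  # stand-in for the model's closed-book answer

probes = [
    ("Which dataset contains 70,000 28x28 grayscale images of handwritten digits?",
     "MNIST"),
]

# Score by case-insensitive exact match on the expected answer.
correct = sum(ask_model(q).strip().lower() == a.lower() for q, a in probes)
print(f"closed-book accuracy: {correct / len(probes):.2f}")
```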

Neural Architecture Search

A recursive reasoning model that ranks architectures by predicted performance and uses those predictions to guide search, achieving 8-10x the sample efficiency of random search, with zero-shot transfer across datasets.
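
The core idea is to spend cheap predictor calls ranking a large candidate pool and reserve the expensive training budget for the top few, which is where the sample-efficiency gain over random search comes from. A toy sketch, with a stand-in search space and surrogate predictor rather than the released model:

```python
# Predictor-guided search vs. random search: rank candidates with a cheap
# performance predictor, then train only the highest-ranked ones. The
# search space and scoring function below are toy stand-ins.
import random

def sample_architecture() -> dict:
    return {"depth": random.randint(2, 16), "width": random.choice([64, 128, 256])}

def predicted_score(arch: dict) -> float:
    # Toy surrogate: a real predictor is learned from past training runs.
    return arch["depth"] * 0.1 + arch["width"] / 256

candidates = [sample_architecture() for _ in range(200)]
# Rank the whole pool cheaply, then spend training compute on the top
# handful instead of on 200 random samples.
top_k = sorted(candidates, key=predicted_score, reverse=True)[:5]
print(top_k)
```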

Epsilon

An open-source runtime for structured agent workloads with eight orchestration topologies, ZeroMQ-backed task brokering, deterministic reducers, and a population search mode for recursive improvement loops.
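
For a sense of the task-brokering layer, here is a minimal PUSH/PULL pipeline written against pyzmq directly; it shows the underlying messaging pattern only and is not Epsilon's API.

```python
# Minimal ZeroMQ PUSH/PULL task-brokering sketch using pyzmq.
# Both ends live in one process here purely for demonstration.
import zmq

ctx = zmq.Context.instance()

broker = ctx.socket(zmq.PUSH)        # broker side: fans tasks out to workers
broker.bind("tcp://127.0.0.1:5555")

worker = ctx.socket(zmq.PULL)        # worker side: pulls tasks to execute
worker.connect("tcp://127.0.0.1:5555")

for i in range(3):
    broker.send_json({"task_id": i, "payload": f"step-{i}"})

for _ in range(3):
    task = worker.recv_json()
    print("worker got", task)
```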

ArXiv DL Instruct Dataset

A dataset for studying how models interpret, follow, and generalize technical instructions derived from real research code, with an emphasis on research and agentic workflows.
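
A hypothetical loading sketch using the Hugging Face `datasets` library; the dataset ID and the `instruction`/`response` field names are assumptions, so check the dataset card for the real schema.

```python
# Hypothetical access sketch; dataset ID and field names are placeholders.
from datasets import load_dataset

ds = load_dataset("example-org/arxiv-dl-instruct", split="train")  # hypothetical ID
example = ds[0]
print(example["instruction"][:200])  # a technical instruction from research code
print(example["response"][:200])     # the paired completion
```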

ArXiv Research Code Dataset

A large corpus of research code referenced directly in scientific papers, designed to study how models reason about, modify, and execute real-world code in complex software systems.

S2ORC CS Enriched

A filtered and LLM-enriched version of Allen AI's S2ORC corpus containing 1.1 million computer science papers with structured metadata for methods, models, datasets, metrics, compute usage, and limitations.
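
As an illustration of what the structured metadata covers, one enriched record might take roughly the following shape; the exact field names and types are assumptions, not the published schema.

```python
# Illustrative record shape mirroring the fields named above (methods,
# models, datasets, metrics, compute usage, limitations). The actual
# schema may differ -- consult the dataset card.
record = {
    "title": "An Example CS Paper",
    "methods": ["contrastive pretraining"],
    "models": ["ResNet-50"],
    "datasets": ["ImageNet"],
    "metrics": [{"name": "top-1 accuracy", "value": 76.1}],
    "compute": {"gpus": 8, "gpu_type": "V100", "hours": 72},
    "limitations": ["evaluated on a single domain"],
}
print(record["models"], record["metrics"])
```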

AI SFT Dataset

A 34 GB corpus of AI research data for supervised fine-tuning, designed to train models that can reason about and generate AI research.
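
At this size, streaming is usually preferable to a full download. A hypothetical access sketch with the Hugging Face `datasets` library; the dataset ID is a placeholder.

```python
# Stream records instead of downloading the full 34 GB corpus up front.
from datasets import load_dataset

ds = load_dataset("example-org/ai-sft-dataset", split="train", streaming=True)  # hypothetical ID
for record in ds.take(3):
    print(record)
```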