We publish datasets, models, code, and evaluation resources to support research on how AI systems behave in practice. These artifacts are designed to surface failure modes that emerge when models interact with tools, operate over time, and rely on retrieval, compression, and imperfect objectives.

ML Research Benchmark
A benchmark suite for evaluating AI agents on real machine learning research tasks, with 7 competition-level challenges from NeurIPS, ICML, and CoNLL plus task definitions, a baseline agent, and evaluation infrastructure.
Five closed-book benchmarks probing the ML knowledge that frontier models have internalized during training, covering dataset recognition, model classification, and metric recall across the ML landscape.
A benchmark for measuring whether AI agents can improve on published ML research, focused on the gap between implementing known approaches and discovering better ones.
A recursive reasoning model that ranks candidate architectures by predicted performance and guides search, achieving an 8-10x improvement in sample efficiency over random search and transferring zero-shot across datasets.
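As a rough sketch of the guided-search loop (every name below is hypothetical, not the released model's API), a learned predictor can rank sampled candidates so that the expensive evaluation budget is spent only on the most promising architectures:

```python
import random

def sample_architecture(rng):
    # Hypothetical search space: depth, width, and a skip-connection flag.
    return {
        "depth": rng.choice([2, 4, 8, 16]),
        "width": rng.choice([64, 128, 256, 512]),
        "skip": rng.random() < 0.5,
    }

def predict_score(arch):
    # Stand-in for the learned ranker; a real predictor is trained on
    # (architecture, measured performance) pairs.
    return 0.5 + 0.01 * arch["depth"] + 1e-4 * arch["width"] + 0.02 * arch["skip"]

def evaluate(arch, rng):
    # Stand-in for an expensive real training run.
    return predict_score(arch) + rng.gauss(0, 0.02)

def guided_search(n_candidates=200, budget=20, seed=0):
    rng = random.Random(seed)
    candidates = [sample_architecture(rng) for _ in range(n_candidates)]
    # Rank cheaply with the predictor, then spend the real evaluation
    # budget only on the top-ranked candidates.
    ranked = sorted(candidates, key=predict_score, reverse=True)
    evaluated = [(evaluate(a, rng), a) for a in ranked[:budget]]
    return max(evaluated, key=lambda pair: pair[0])

best_score, best_arch = guided_search()
print(best_score, best_arch)
```

Random search is the same loop without the ranking step; the reported sample-efficiency gain is measured against that baseline.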
An open-source runtime for structured agent workloads with eight orchestration topologies, ZeroMQ-backed task brokering, deterministic reducers, and a population search mode for recursive improvement loops.
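A minimal sketch of ZeroMQ-backed task brokering with a deterministic reduce step, assuming pyzmq; the topology, endpoint, and message schema here are illustrative, not the runtime's actual wire format:

```python
import threading
import zmq

ADDR = "tcp://127.0.0.1:5557"  # illustrative endpoint

def broker(n_tasks):
    # Fan tasks out to connected workers over a PUSH socket (round-robin).
    push = zmq.Context.instance().socket(zmq.PUSH)
    push.bind(ADDR)
    for i in range(n_tasks):
        push.send_json({"task_id": i, "payload": i})
    push.close()

def worker(results, lock):
    # Pull tasks until the queue has been quiet for one second.
    pull = zmq.Context.instance().socket(zmq.PULL)
    pull.connect(ADDR)
    while pull.poll(timeout=1000):
        task = pull.recv_json()
        with lock:
            results.append((task["task_id"], task["payload"] ** 2))
    pull.close()

results, lock = [], threading.Lock()
threads = [threading.Thread(target=worker, args=(results, lock)) for _ in range(4)]
for t in threads:
    t.start()
broker(32)
for t in threads:
    t.join()

# Deterministic reducer: sort by task_id so the merged output does not
# depend on worker scheduling or network timing.
print([value for _, value in sorted(results)])
```

Sorting by task_id before reducing is what makes the merge deterministic: the output no longer depends on which worker finished first.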
A dataset for studying how models interpret, follow, and generalize technical instructions derived from real research code, with an emphasis on research and agentic workflows.
A large corpus of research code referenced directly in scientific papers, designed to study how models reason about, modify, and execute real-world code in complex software systems.
A collection of instruction-tuning datasets derived from scientific abstracts for studying how synthetic supervision shapes model behavior, overfitting, and generalization under distribution shift.
Question-answer datasets derived from arXiv for evaluating retrieval and search behavior in technical domains, including retrieval failures, false confidence, and compounding RAG errors.
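A hedged sketch of the kind of measurement such datasets support, assuming a hypothetical record format with gold arXiv IDs per question; a recall@k of zero flags the silent retrieval failures that later compound in a RAG pipeline:

```python
def recall_at_k(retrieved, relevant, k=5):
    """Fraction of gold documents appearing in the top-k retrieved results."""
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)

# Hypothetical records: gold arXiv IDs per question plus a retriever's
# ranked output. The IDs below are placeholders, not dataset contents.
examples = [
    {"relevant": ["1706.03762"], "retrieved": ["1706.03762", "1810.04805"]},
    {"relevant": ["2005.14165"], "retrieved": ["1910.10683", "2104.08691"]},
]

scores = [recall_at_k(e["retrieved"], e["relevant"]) for e in examples]
print(sum(scores) / len(scores))  # mean recall@5; here 0.5
# An example scoring zero is a silent retrieval failure: the pipeline
# still generates an answer, and the error compounds downstream.
```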
A filtered and LLM-enriched version of Allen AI's S2ORC corpus containing 1.1 million computer science papers with structured metadata for methods, models, datasets, metrics, compute usage, and limitations.
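As an illustration only (the actual field names and layout of the release may differ), records carrying this kind of structured metadata can be filtered without any NLP:

```python
import json

# Hypothetical record shape for one enriched paper.
example = {
    "paper_id": "0001",
    "title": "An Example Paper",
    "methods": ["contrastive learning"],
    "datasets": ["ImageNet"],
    "metrics": ["top-1 accuracy"],
    "compute": {"gpu_hours": 128},
    "limitations": "Single-domain evaluation only.",
}

def papers_using_dataset(jsonl_path, dataset_name):
    # Stream a JSONL dump and yield papers whose metadata lists the dataset.
    with open(jsonl_path) as f:
        for line in f:
            record = json.loads(line)
            if dataset_name in record.get("datasets", []):
                yield record["paper_id"], record["title"]
```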
A 34 GB dataset of AI research data for supervised fine-tuning, designed to train models that can reason about and generate AI research.
A dataset of 16k AI safety-relevant papers from arXiv, enriched with structured metadata.
Semantic search models trained on large-scale scientific corpora for studying retrieval quality, query formulation, and downstream agent behavior when retrieval errors propagate silently.
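A minimal sketch of the retrieval step such models implement, with stand-in random embeddings in place of a trained encoder; ranking is plain cosine similarity over L2-normalized vectors:

```python
import numpy as np

def top_k(query_vec, doc_vecs, k=3):
    # Cosine similarity reduces to a dot product after L2 normalization.
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    sims = d @ q
    idx = np.argsort(-sims)[:k]  # indices of the k most similar docs
    return idx, sims[idx]

# Stand-in embeddings; in practice these come from the trained encoder.
rng = np.random.default_rng(0)
corpus = rng.normal(size=(1000, 384))  # 1,000 docs, 384-dim vectors
query = rng.normal(size=384)
indices, similarities = top_k(query, corpus)
print(indices, similarities)
```

When the encoder mis-embeds a query, this step returns confidently ranked but irrelevant documents, which is exactly the silent error propagation these models are meant to help study.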
Models for summarizing full-length scientific papers from abstracts and source documents, useful for analyzing information loss, misrepresentation, and overconfidence introduced by compression.