COLOSSEUM

The arena where
AI systems evolve

An AI system that investigates problems autonomously, fixes them in parallel across isolated workers, and evolves solutions through natural selection — learning from every run so the next one is smarter.

01
SimLink
Investigate
02
Atlas
Orchestrate
03
Soil
Remember
04
Colosseum
Evolve
Enter the Arena · vs. Claude Code / Codex
How It Works

You give it a problem.
It does the rest.

Colosseum is four layers that work together. Each one feeds the next — and the whole system gets smarter every time it runs.

01
Understand

Point it at anything — a codebase, a market, a research question — and get structured findings back.

SimLink decomposes problems into zones, investigates each autonomously, and surfaces contradictions, gaps, and confirmed patterns. It knows when to stop — no token limits, no manual cutoff.

findings → Atlas
02
Build

Turns findings into a dependency graph and dispatches parallel workers to execute the fixes.

Up to 8 workers run simultaneously, each in its own isolated git worktree with exclusive file ownership. No merge conflicts. Crash recovery built in. Hours of serial work done in minutes.

traces → Soil
03
Learn

Every action leaves a trace. The system remembers what worked, what failed, and what stalled.

Future runs read these traces to avoid bottlenecks, boost proven patterns, and co-schedule coupled files. Traces decay over time so stale knowledge fades. Like ant colonies coordinating through pheromone trails — no messages, just shared memory.

context → Colosseum
04
Evolve

When the answer space is too large for any prompt, breed solutions through natural selection.

Colosseum generates candidates via LLMs, evaluates them against real fitness functions, kills the weak, and mutates survivors. Runs locally first, escalates to frontier models on stall. Finds solutions that prompting alone never would.

← feeds back into Atlas + Soil

Most AI tools do one of these. Colosseum does all four — and the memory layer means run #10 is fundamentally better than run #1.

F(x) = Σ wᵢ · score(cᵢ, tᵢ) − λ · penalty(c) · select: P(survive) ∝ F(x) / ΣF · mutate: c′ = LLM(c, feedback, T=0.7) · niche: cluster(C) → argmax diversity · soil: s(t) = s₀ · e^(−λt) · UCB1(z) = μ_z + c√(ln N / n_z) · DAG: topo_sort(G) → parallel_dispatch · surrogate: ŷ = βᵀx · escape: touches ≥ 3 → colosseum
02 — Atlas

SimLink found the problems.
Atlas fixes them — in parallel.

Atlas breaks work into a dependency graph and dispatches it across up to 8 isolated workers — each in its own git worktree, each owning specific files. Two workers never touch the same file. Work that takes hours serially finishes in minutes.

ATLAS PARALLEL ORCHESTRATOR
WORKERS 5/8 · BUNDLES 6 · TASKS 18
Task DAG — Bundle 3
Task · Files · Intent · Status
refactor-auth · 3 · exclusive · ✓ done
split-utils · 5 · exclusive · ✓ done
update-imports · 12 · shared_read · running
add-tests · 4 · exclusive · running
verify-types · 8 · shared_read · blocked
integration · merge · pending
Active Workers
worker-1 · refactor-auth · done 4m
worker-2 · split-utils · done 6m
worker-3 · update-imports · 2m 14s
worker-4 · add-tests · 1m 38s
worker-5 · idle
Each worker → isolated git worktree
claude -p headless · 10-min stall kill
Activity
Worker dispatched — add-tests
split-utils complete (conf: 0.82)
refactor-auth complete (conf: 0.91)
Worker dispatched — update-imports
DAG resolved — 2 tasks ready
Atlas: 6 bundles dispatching
Plan accepted — 18 tasks
Worker active — cycle 14 · claude -p writing code in git worktree
COLOSSEUM / ATLAS — DAG DISPATCH + WORKERS
DAG-Based Work

Dependency-aware decomposition

Topological ordering with file intents (exclusive / shared_read). Ready tasks scored by downstream dependency count. Up to 5 tasks dispatched per batch, 1–8 concurrent workers.
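
As a minimal sketch, this dispatch policy (ready set from a topological frontier, scored by downstream dependency count, capped per batch) might look like the following. The function names and the sample DAG are illustrative, not the actual Atlas API:

```python
from collections import defaultdict

def ready_tasks(deps, done):
    """Tasks whose dependencies are all complete (the topological frontier)."""
    return [t for t, reqs in deps.items()
            if t not in done and all(r in done for r in reqs)]

def downstream_count(deps, task):
    """How many tasks transitively depend on `task` — a proxy for urgency."""
    dependents = defaultdict(set)
    for t, reqs in deps.items():
        for r in reqs:
            dependents[r].add(t)
    seen, stack = set(), [task]
    while stack:
        for child in dependents[stack.pop()]:
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return len(seen)

def next_batch(deps, done, batch_size=5):
    """Dispatch up to `batch_size` ready tasks, most-depended-on first."""
    ready = ready_tasks(deps, done)
    ready.sort(key=lambda t: downstream_count(deps, t), reverse=True)
    return ready[:batch_size]

# Sample DAG mirroring the bundle shown in the panel above.
deps = {
    "refactor-auth": [],
    "split-utils": [],
    "update-imports": ["refactor-auth", "split-utils"],
    "add-tests": ["refactor-auth"],
    "verify-types": ["update-imports"],
    "integration": ["add-tests", "verify-types"],
}
batch = next_batch(deps, done=set())
```

With nothing done yet, the two root tasks dispatch first; `refactor-auth` outranks `split-utils` because more work hangs off it.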

Crash-Proof

Two run modes

Ephemeral — in-memory, fast, gone when session ends.

Durable — SQLite WAL. Survives crashes. Pause mid-run, resume tomorrow.

Worker System

Isolated git worktrees

Each worker gets its own worktree. Runs claude -p headless — can't ask questions, forcing better specs. 10-min stall detection with auto-kill.

Safety Systems

Circuit breakers + cost controls

Per-bundle failure tracking (threshold: 3). Low-confidence (<0.65) triggers verification. Tasks block and wait rather than guess. 4-hour wall time.
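
The gating logic reduces to a small routing function. A sketch under the thresholds stated above; the function name and return labels are hypothetical:

```python
CONFIDENCE_FLOOR = 0.65   # below this, the result is sent to verification
FAILURE_THRESHOLD = 3     # per-bundle circuit breaker trips here

def route_result(confidence, bundle_failures):
    """Decide what happens to a completed task's result."""
    if bundle_failures >= FAILURE_THRESHOLD:
        return "circuit_open"   # stop dispatching into this bundle
    if confidence < CONFIDENCE_FLOOR:
        return "verify"         # low confidence triggers an independent check
    return "accept"
```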

Cross-Run Memory

The system gets smarter over time

Patterns, failures, decisions, and conventions persist in a vector-embedded store. 30-day confidence decay. New runs seeded with relevant memories from past work.

Soil-Aware Scheduling

Traces bias task ordering

Every completed task deposits pheromone traces into soil.db. The scheduler reads these signals to boost hot files, penalize stall-prone patterns, co-schedule coupled files, and avoid failure zones. ±0.5 soft adjustment — never overrides hard deps.
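
A sketch of that soft adjustment, assuming the per-signal values are already aggregated per task (function names are illustrative, not the soil.db API):

```python
def soil_adjust(hot=0.0, stall=0.0, coupling=0.0, failure=0.0, cap=0.5):
    """Sum trace signals into one soft nudge, clamped to ±cap (±0.5 by default)."""
    raw = hot - stall + coupling - failure
    return max(-cap, min(cap, raw))

def task_score(base_fanout, **signals):
    """Base priority (dependency fan-out) plus the soil nudge — hard deps untouched."""
    return base_fanout + soil_adjust(**signals)

# Example: base fan-out 3.0, hot-file boost +0.25, stall penalty −0.15,
# coupling bonus +0.10 → final score 3.20 (net soil adjustment +0.20).
score = task_score(3.0, hot=0.25, stall=0.15, coupling=0.10, failure=0.0)
```

The clamp is what makes the bias "soft": no amount of trace evidence can outweigh the dependency structure itself.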

Details
Activity
Memory
265 memories · 265/265 embedded · 1 cluster ONLINE
pattern
failure
decision
convention
CONVENTION

Major professional services firms are active competitors in the geostrategic analysis space

conf 1.00 · 11 obs
PATTERN

Large asset managers developed proprietary interactive dashboards for clients

conf 1.00 · 11 obs
COLOSSEUM — CROSS-RUN MEMORY
21
MCP Tools
8
Workers
4x
Speedup
4
SQLite Stores
8
LLM Backends
03 — Soil

The system remembers
what worked — and what didn't.

Every task completion, stall, and failure deposits traces into a shared memory layer. Future tasks read these traces and adapt — avoiding known bottlenecks, boosting proven patterns. Just as ant colonies coordinate through pheromone trails, workers coordinate through shared soil without ever talking to each other.

SOIL CROSS-RUN TRACE MEMORY
TRACES 847 · DECAY 7d · GC 0.01
File Trace Heatmap
success hot failure coupled
Soil Score — refactor-auth
Base fan-out · 3.00
+ Hot file boost · +0.25
− Stall penalty · −0.15
+ Coupling bonus · +0.10
− Failure penalty · 0.00
Final score · 3.20
Soil adjustment: +0.20 · Promoted in queue
Live Trace Feed
file_touch src/drive.ts 0.92
task_outcome split-utils success
coupling scheduler↔drive 0.78
stall verify-types 180s
file_touch src/schemas.ts 0.85
discovery "ESM imports required" 0.90
file_touch src/pool.ts 0.88
DECAY
T½ 7d · GC <0.01 · coupling 30d
COLOSSEUM / SOIL — CROSS-RUN TRACE MEMORY
6 Trace Kinds

Every action leaves a mark

file_touch — which files get modified and how often

task_outcome — success/failure per task with confidence

coupling — files that change together across runs

stall — tasks that hung or timed out

lock_contention — file lock conflicts between workers

discovery — insights workers surface during execution

Pheromone Dynamics

Strength decays. Relevance persists.

Traces start at strength 1.0 and decay exponentially — 7-day half-life for most traces, 30-day for coupling. Re-observation reinforces by +0.1. Garbage collection sweeps traces below 0.01. Dedup prevents duplicate deposits — same file+kind within 1 hour triggers reinforce instead.
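
Those dynamics fit in a few lines. A sketch assuming half-life decay of the form s(t) = s₀ · 2^(−t/T½); function names are illustrative:

```python
HALF_LIFE_DAYS = {"default": 7.0, "coupling": 30.0}
GC_FLOOR = 0.01

def strength(s0, age_days, kind="default"):
    """Exponential decay: s(t) = s0 * 2^(-t / half_life)."""
    half_life = HALF_LIFE_DAYS.get(kind, HALF_LIFE_DAYS["default"])
    return s0 * 2 ** (-age_days / half_life)

def reinforce(s, bump=0.1, ceiling=1.0):
    """Re-observation inside the dedup window bumps strength instead of re-depositing."""
    return min(ceiling, s + bump)

def sweep(traces, age_days):
    """Garbage-collect traces whose decayed strength fell below the floor."""
    return {name: s0 for name, s0 in traces.items()
            if strength(s0, age_days) >= GC_FLOOR}
```

A week-old default trace sits at exactly half strength; a week-old coupling trace has barely faded, which is why structural relationships outlive individual file activity.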

Ecosystem Fingerprinting

Cross-repo intelligence

Detects tech stack from package.json, Cargo.toml, go.mod, requirements.txt. Tags repos with ecosystem identifiers. Traces from similar repos (tag overlap > 0.4) contribute at 40% strength — so a React project learns from all React projects.
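
The overlap test is plausibly a Jaccard-style set comparison. A sketch under that assumption (the real metric and function names may differ):

```python
def tag_overlap(a, b):
    """Jaccard overlap between two repos' ecosystem tag sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a or b) else 0.0

def cross_repo_weight(a, b, threshold=0.4, discount=0.4):
    """Traces from a similar repo count at reduced strength; dissimilar repos contribute nothing."""
    return discount if tag_overlap(a, b) > threshold else 0.0
```

Two React + TypeScript repos that differ only in tooling overlap at 0.5, clearing the threshold, so their traces flow across at 40% strength.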

Worker Context Seeding

Every worker starts informed

Each dispatched worker receives natural-language soil summaries: file success rates, coupling relationships, stall warnings, relevant discoveries. Workers adapt behavior before writing a single line of code.

Validation Harness

Does soil actually help?

dispatch_log table records base_score, soil_adjust, and per-signal breakdown at dispatch time. Backfills outcome + duration at completion. Segments into promoted / penalized / neutral. Computes per-weight effectiveness and a soil maturity curve — success rate vs trace count, proving that more traces → better scheduling.

Soil → Colosseum Bridge

Evolution doesn't start cold

At Colosseum genesis, three soil channels seed initial populations: ecosystem hints (successful patterns from repos with matching tech stacks), file profile (target file history → aggressive / moderate / conservative mutation strategy), and coupling context (structurally linked files). Composes a natural language genesis prompt fragment that biases candidates toward what has worked before.

6
Trace Kinds
7d
Half-Life
8
Soil MCP Tools
±0.5
Score Range
WAL
SQLite Mode
"Stigmergy is the mechanism by which the products of previous actions stimulate subsequent actions, creating coherent group behaviour without direct communication." Pierre-Paul Grassé, La reconstruction du nid (1959)
04 — Colosseum

When prompting can't find
the answer — evolution can.

Some problems have too many possible solutions for any prompt to find the best one. Colosseum breeds candidate solutions, tests them against real fitness functions, kills the weak, and mutates the survivors. LLMs propose, natural selection disposes.
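
The loop itself (generate, evaluate, kill, mutate, repeat) can be sketched in a few lines. This is a toy illustration with a numeric fitness stand-in, not the actual engine:

```python
import random

def evolve(genesis, fitness, mutate, generations=30, survivors=2, seed=0):
    """Minimal generational loop: score, keep the fittest, refill with mutants."""
    rng = random.Random(seed)  # deterministic for illustration
    population = list(genesis)
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        parents = population[:survivors]               # survivors of the gauntlet
        offspring = [mutate(rng.choice(parents), rng)  # mutation fills the gap
                     for _ in range(len(population) - survivors)]
        population = parents + offspring
    return max(population, key=fitness)

# Toy arena: recover the peak of a concave fitness surface. A real run scores
# candidates against compiled WASM or a financial model instead.
fit = lambda x: -(x - 3.7) ** 2
best = evolve(genesis=[0.0, 1.0, 2.0, 5.0, 9.0],
              fitness=fit,
              mutate=lambda x, rng: x + rng.uniform(-0.5, 0.5),
              generations=40)
```

Because the fittest parents are carried forward unchanged, the best fitness never regresses; mutation only has to find improvements.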

PVLVINAR Evolution Control Panel
RUNNING
30
Generation
524.64
Best Fitness
30
Population
93%
Survived
418 OLLAMA
51 GEMINI
6 GROK
NUMERICAL
Fitness Curve — Best per gen
Population — Survived · Died · Apex
Arena Log
GEN 01 Genesis → 30 candidates. Best: 93.21
GEN 05 Fitness plateau. Gauntlet tightened. Best: 187.4
GEN 08 STALL DETECTED — 2 gens no improvement
GEN 09 FRONTIER ESCALATION → Ollama→Gemini
GEN 12 Breakout. New apex: 312.7. Gemini variant.
GEN 18 Population converging. Diversity injection.
GEN 24 FRONTIER ESCALATION → Gemini→Grok
GEN 28 NEW APEX PREDATOR fitness: 501.3
GEN 30 APEX PREDATOR CROWNED fitness: 524.64
APEX PREDATOR
{ base_rate: 208.6, market_leader_premium: 2.546, shoulder_min: 135.0, peak_cap: 489.2, demand_elasticity: -0.847 }
COLOSSEUM / PVLVINAR — EVOLUTION CONTROL PANEL
colosseum — 8 generations, 6.7 seconds
12 candidates at genesis · 5 killed in gen 1, 2 in gen 2, 1 in gen 3 · apex fitness 0.97 by gen 8. Population narrows as selection pressure increases — only the fittest DNA survives each gauntlet.
COLOSSEUM / CONVERGENCE — 8 GENERATIONS
colosseum — evolution architecture
Phase 1: Genesis · LLM ensemble (local Ollama + frontier) spawns N parallel variants; each compiles to WASM and validates.
Phase 2: Selection · Gauntlet α (blocking tests) → Gauntlet β (penalizing tests) → Gauntlet γ (shadow validation), with kills at each stage; niching selects for diversity; mutation (LLM + rule-based) loops back across generations 1..N.
Phase 3: Output · apex → Vault as WASM, git-tagged, deployed to the edge.js sandbox.
F(x) = Σw·score − λ·penalty · domain: code | numerical | any
COLOSSEUM / EVOLUTION ARCHITECTURE
colosseum — live evolution
[0.00] Genesis Spawning 50 variants via local Ollama...
[1.23] Compile 50/50 WASM validated
[1.24] Gauntlet α 18 killed (blocking) | 32 survive
[1.25] Gauntlet β Scoring... best: 93.21
[2.41] Mutate 5 parents → 30 offspring (LLM + rule-based)
[3.18] Gauntlet α 8 killed | 22 survive
[3.19] Gauntlet β best: 245.0 ← STALL DETECTED: escalating to Gemini
[4.55] Niche 3 niches, 2 apex candidates
[5.82] Gauntlet Gen 8... best: 478
[6.44] Niche Converged → 1 apex predator
[6.71] APEX PREDATOR CROWNED Fitness: 524.64 | Gens: 30 | LLM calls: 475
[6.72] Vault colosseum/specialist/f524.64 ✓ → deployed to edge
COLOSSEUM / LIVE EVOLUTION TRACE
GENESIS

Soil-seeded → 50 candidates

Ecosystem hints + file profile + coupling context from soil. DNA vault ancestry. Compiled in parallel via Rayon

GAUNTLET

Fitness evaluation

Weighted tests + shadow tiers. Physics validation: impossible = dead

BREED

Top 10% elite → crossover

LLM-guided mutation. Ollama local → frontier on stall. Rule-based fallback

REPEAT

Natural selection

The LLM proposes, evolution disposes. Solutions prompting alone would never find

5 Domains

Same engine, different arenas

code_transform — AST mutations via ast-grep. WASM sandbox.

real_estate — Unit mix, construction, LP/GP waterfall, DSCR.

hotel — ADR/RevPAR, promote tiers, exit valuation.

mezz_lending — Coverage ratios, recovery scenarios.

numerical — Parameter vectors against real data.

WASM Arena

Sandboxed code execution

Wasmtime + Cranelift JIT + SIMD. Fuel-limited — no infinite loops. Module reuse across evaluations.

Frontier escalation — Ollama handles 90%. Stall 2+ gens → Gemini/Grok → drops back. Near-zero API cost.

Surrogate Fitness Model

Pre-filter before the gauntlet

Polynomial regression trained on completed evaluations predicts fitness before running the full gauntlet. Candidates below μ − kσ are rejected instantly — saving 40%+ of evaluation cost. Staged complexity: linear below 200 samples, quadratic after. LOO-CV R² validates model quality. Reservoir sampling keeps the training cache representative as the population evolves. Compile failures excluded from training to prevent surrogate poisoning.
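
The rejection rule itself is simple. A sketch of the μ − kσ pre-filter over predicted scores (the predictor is passed in; names are illustrative, not the real interface):

```python
import statistics

def prefilter(candidates, predict, k=1.0):
    """Reject candidates whose predicted fitness falls below mu - k*sigma."""
    preds = [predict(c) for c in candidates]
    mu = statistics.mean(preds)
    sigma = statistics.pstdev(preds)   # population std dev of predictions
    cutoff = mu - k * sigma
    keep = [c for c, p in zip(candidates, preds) if p >= cutoff]
    return keep, cutoff

# With an identity predictor over 1..10, the bottom tail is dropped
# before any expensive gauntlet evaluation runs.
keep, cutoff = prefilter(list(range(1, 11)), predict=float)
```

The surviving candidates still face the full gauntlet; the surrogate only spares the engine from evaluating obvious losers.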

Monolith Auto-Escape

Stuck files evolve themselves

Fire-and-forget check after every task completion. When a file is touched 3+ times across tasks, exceeds 500 lines, shows <50% success rate or active stall signals, and has no recent discoveries — auto-spawns a Colosseum sub-run. Uses buildColosseumConfig() for real AST pattern detection and injects a [SWARM TARGET DETECTED] task into the active bundle.
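
The trigger reduces to a conjunction of the conditions listed above. A sketch; the predicate name and argument shapes are hypothetical:

```python
def should_escape(touches, lines, success_rate, stalling, recent_discoveries):
    """A file earns a Colosseum sub-run when it is hot, large, stuck, and quiet."""
    hot = touches >= 3             # touched 3+ times across tasks
    big = lines > 500              # exceeds 500 lines
    stuck = success_rate < 0.5 or stalling
    quiet = recent_discoveries == 0
    return hot and big and stuck and quiet
```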

Diversity-Aware Escalation

Stagnation triggers frontier models

detectEscalation() analyzes findings diversity ratio and empty run counts across the SimMap every cycle. When diversity drops below 35% with 3+ empty runs — or 2+ zones are stuck — escalates to frontier models (Claude API → xAI fallback), bypassing task affinity routing. Drops back to standard tier when diversity recovers.
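
The decision boils down to two conditions. A sketch of the trigger, assuming diversity ratio and counts are computed elsewhere (a simplified stand-in for detectEscalation()):

```python
def should_escalate(diversity_ratio, empty_runs, stuck_zones):
    """Escalate to frontier models on stagnation; the caller drops back when diversity recovers."""
    stagnant = diversity_ratio < 0.35 and empty_runs >= 3
    blocked = stuck_zones >= 2
    return stagnant or blocked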

$500K
discovered / year

Pointed at FY2025 revenue data, Colosseum evolved pricing parameters within 1% of actual revenue. Discovered systematic underpricing of $135–143/night in shoulder-season periods. Found optimal 2.546x peak-season competitor premium. No human specified those numbers.

10ms
3K Proformas
5
Domains
50
Initial Pop.
WASM
Sandbox
"Systems that improve through self-play, evolutionary search, or large-scale simulation can surpass human performance without imitation." Goldfeder, Wyder, LeCun, Shwartz-Ziv (arXiv:2602.23643, 2026)
Under the Hood

What powers it —
and why it gets smarter

8 LLM Backends

Claude · Anthropic · DeepSeek · xAI/Grok · Ollama · Ollama→Claude · Multi · BitNet

select, ripple → Ollama
perceive, investigate → Claude

4 Persistence Stores

atlas.db — runs, bundles, tasks, swarm DNA

memory.db — patterns + vector embeddings

soil.db — stigmergic traces, pheromone decay

simlink-runs.db — maps, zones, beats

Self-Optimization

Breeds its own PromptConfig. Surrogate fitness pre-filters 40%+ of candidates. Stigmergic soil biases scheduling. Validation harness proves soil works. Monolith auto-escape routes stuck files to evolution. Diversity-aware escalation brings frontier models when findings stagnate.

Security Hardened

Shell metacharacter rejection. Command allowlisting. Git allowlist. HTML stripping. Fuel-limited WASM.

How It's Different

Most AI coding tools are single-agent,
single-context, single-pass.

This Stack

Parallel execution · 8 isolated worktree workers
Autonomous investigation · UCB1 + SA gates + GWT
DAG decomposition · Topological ordering, file intents
Evolutionary optimization · LLM + genetic + frontier
Knows when it's done · Information gain, sentinel volatility
Cross-run memory · Vector embeddings, 30-day decay
Stigmergic soil · Pheromone traces bias scheduling
Surrogate fitness · Pre-filter rejects 40%+ before eval
Monolith auto-escape · Stuck files auto-spawn evolution
Diversity escalation · Stagnation → frontier models
Crash recovery · Durable mode, pause/resume
8 LLM backends · Task-aware routing
Self-optimization · Breeds its own prompts

Typical AI Coding Tools

Single agent, serial execution
Reactive — you ask, it answers
Flat task lists, no dependency awareness
Prompt-based generation only
Token limit or user stops it
Limited cross-session memory
No trace-based environmental learning
Evaluate every candidate equally
Same file, same approach each time
Fixed model routing
Limited crash recovery
Single or few providers
Manual prompt iteration
Real Results

Pointed at real problems.
These are the outcomes.

Financial Optimization

Revenue Parameter Discovery

Evolved pricing against FY2025 data. Within 1% of actual revenue. No human guidance.

$500K/yr · 2.546x premium
Stress Testing

Capital Structure Analysis

69-unit development. Breakeven points and cliff risks identified before capital committed.

12,935 runs · 7 variables
Codebase Refactoring

Monolith Decomposition

1,209-line component → 6 bundles → 5 parallel workers. All tests maintained.

4x faster than serial
Autonomous Research

Architecture Audit

Pointed at a repo with no instructions. Identified risks, coupling, and a refactor plan.

8 zones · 12 cycles
Competitive Intelligence

Market Landscape

Web-mode investigation mapped competitor positioning across a fragmented market.

265 memories · 11 obs
Learning Across Runs

Stall Avoidance

On run #4 of a project, Soil traces from prior runs automatically penalized task patterns that had stalled before — cutting queue time without any manual tuning.

847 traces · 0 repeated stalls
Smart Pre-Filtering

43% Fewer Wasted Evaluations

A lightweight regression model learned to predict which candidates would fail the gauntlet — rejecting them before the expensive evaluation, saving nearly half the compute.

43% cost reduction · R² > 0.85
Code Evolution

AST Transform Optimization

Bred WASM-sandboxed code transforms. 3,000+ variants in under 10 seconds.

Fitness 547 · WASM
Ecosystem Learning

Cross-Repo Pattern Transfer

TypeScript project inherited soil traces from 12 prior React projects — co-change patterns, stall warnings, and ESM conventions — before the first task dispatched.

40% cross-repo · 0.4 threshold
Built On Research

The science proved it works.
Colosseum makes it practical.

DeepMind proved LLMs combined with evolutionary search discover new knowledge. Google scaled it to TPU design. LeCun's team declared adaptation speed the metric that matters. Colosseum makes these ideas accessible as infrastructure.

Origin
FunSearch
Romera-Paredes et al.
Nature, Dec 2023
Scale Proof
AlphaEvolve
Novikov et al.
arXiv:2506.13131, 2025
Thesis
Superhuman AI (SAI)
Goldfeder, Wyder, LeCun et al.
arXiv:2602.23643, 2026
Coordination
Stigmergy
Grassé, P.-P.
Insectes Sociaux, 1959
Pre-Filtering
Surrogate-Assisted EAs
Jin, Y.
Springer, Swarm & Evo. Comp. 2011
Sandbox
WebAssembly for Safe Execution
Haas et al.
PLDI, ACM 2017
"The AI that folds our proteins should not be the AI that folds our laundry." LeCun et al. (2026)

See what it can build for you

Colosseum is built and maintained by Simeon Garratt — AI architect working across security, intelligence, and autonomous systems.

Get in Touch Explore Projects simeongarratt.com