An AI system that investigates problems autonomously, fixes them in parallel across isolated workers, and evolves solutions through natural selection — learning from every run so the next one is smarter.
Colosseum is four layers that work together. Each one feeds the next — and the whole system gets smarter every time it runs.
Point it at anything — a codebase, a market, a research question — and get structured findings back.
SimLink decomposes problems into zones, investigates each autonomously, and surfaces contradictions, gaps, and confirmed patterns. It knows when to stop — no token limits, no manual cutoff.
Turns findings into a dependency graph and dispatches parallel workers to execute the fixes.
Up to 8 workers run simultaneously, each in its own isolated git worktree with exclusive file ownership. No merge conflicts. Crash recovery built in. Hours of serial work done in minutes.
Every action leaves a trace. The system remembers what worked, what failed, and what stalled.
Future runs read these traces to avoid bottlenecks, boost proven patterns, and co-schedule coupled files. Traces decay over time so stale knowledge fades. Like ant colonies coordinating through pheromone trails — no messages, just shared memory.
When the answer space is too large for any prompt, breed solutions through natural selection.
Colosseum generates candidates via LLMs, evaluates them against real fitness functions, kills the weak, and mutates survivors. Runs locally first, escalates to frontier models on stall. Finds solutions that prompting alone never would.
Most AI tools do one of these. Colosseum does all four — and the memory layer means run #10 is fundamentally better than run #1.
Point SimLink at a codebase, research question, or competitive landscape. It decomposes the problem into zones, investigates each one autonomously, and delivers prioritized findings — contradictions, gaps, and confirmed patterns. Not chat. Investigation.
| Zone | Findings | Status | Cycle |
|---|---|---|---|
| WASM Execution Arena | 7 | understood | 6 |
| AST Mutation Tools | 5 | understood | 5 |
| Visual Node Generation | 5 | understood | 3 |
| Telemetry & Fitness | 5 | understood | 2 |
| Search Entry Points | 3 | understood | 4 |
| Configuration & CLI | 4 | understood | 1 |
Arena struct handles WASM compilation for testing, not visual rendering
Search exists for AST pattern matching (ast_grep_core), not visual node creation
No generate_node functions detected — searches do not trigger node generation
src/main.rs (300+ lines) monolithic — needs refactoring into modules
Splits the problem into 3-8 zones based on structure — files, market segments, competitor features, research areas
Balances exploring unknown areas against digging deeper into promising ones. Skips what's already understood
Each zone gets a 3-turn loop: plan what to look at, execute the reads and searches, synthesize findings
Findings propagate across zones. Contradictions get flagged. Duplicates are merged automatically
Smart Exploration — Balances investigating unknown areas against digging deeper into promising ones (UCB1 bandit algorithm).
Layered Understanding — Each cycle builds: perceive raw data, comprehend patterns, project implications (Endsley's SA model).
Breakthrough Detection — When findings converge from multiple zones, SimLink flags the insight automatically.
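The exploration balance named above can be illustrated with the textbook UCB1 rule. This is a minimal sketch, not SimLink's actual code: the function name `ucb1_pick`, the `(visits, total_reward)` layout, and the choice of reward (for example, the share of new findings per cycle, scaled to 0..1) are assumptions made for illustration.

```python
import math


def ucb1_pick(zones, c=math.sqrt(2)):
    """Return the name of the zone to investigate next.

    `zones` maps a zone name to (visits, total_reward). The reward scale
    is an assumption; UCB1 expects rewards in the range 0..1.
    """
    total_visits = sum(visits for visits, _ in zones.values())
    best_name, best_score = None, float("-inf")
    for name, (visits, total_reward) in zones.items():
        if visits == 0:
            return name  # unvisited zones are always tried first
        mean = total_reward / visits                             # exploitation term
        bonus = c * math.sqrt(math.log(total_visits) / visits)   # exploration term
        if mean + bonus > best_score:
            best_name, best_score = name, mean + bonus
    return best_name
```

A zone with few visits gets a large bonus, so it is picked even when its average reward is lower; as visits accumulate the bonus shrinks and the average dominates.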
Measures whether new cycles are still producing new insights. When findings stop changing, it completes — no arbitrary token limits. Monolithic files that resist analysis get routed to the Colosseum evolution engine.
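One way to picture this stopping rule is a novelty check over the most recent cycles. The sketch below is an assumption about how such a check could look: `has_converged`, the three-cycle window, and the 5% novelty floor are illustrative, not documented SimLink settings.

```python
def has_converged(history, window=3, min_novelty=0.05):
    """Return True when the last `window` cycles added almost nothing new.

    `history` holds one (new_findings, total_findings) pair per cycle.
    `window` and `min_novelty` are hypothetical tuning values.
    """
    if len(history) < window:
        return False  # too few cycles to judge
    recent = history[-window:]
    return all(total > 0 and new / total < min_novelty for new, total in recent)
```

The check depends only on how much each cycle adds, so a small problem can finish after a few cycles and a large one can keep going.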
Code — readFiles, grepRepo, gitQuery, listDir, buildRepoMap
Web — SerpAPI search, page fetch. Competitive intel, market research, marketing strategy, trend analysis
Auto — Heuristic mode detection. "Research competitors" → web. "Audit this repo" → code
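A heuristic mode detector of the kind described for Auto can be as small as a keyword vote. This sketch assumes hypothetical keyword sets and a hypothetical `detect_mode` function; the real heuristics are not documented here.

```python
import re

# Hypothetical hint words; the actual heuristics are not documented.
WEB_HINTS = {"competitor", "competitors", "market", "marketing", "trend", "trends"}
CODE_HINTS = {"repo", "repository", "codebase", "audit", "refactor", "file", "files"}


def detect_mode(prompt, default="code"):
    """Pick "web" or "code" by counting hint words found in the prompt."""
    words = set(re.findall(r"[a-z]+", prompt.lower()))
    web_votes = len(words & WEB_HINTS)
    code_votes = len(words & CODE_HINTS)
    if web_votes > code_votes:
        return "web"
    if code_votes > web_votes:
        return "code"
    return default  # a tie or no hints falls back to the default mode
```

With these sets, "Research competitors" resolves to web and "Audit this repo" resolves to code, matching the two examples above.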
Blocks when human input is needed. Operator nudges steer direction between cycles. Progress streams over SSE to the web UI on port 4200. Cross-run memory seeds future investigations.
Atlas breaks work into a dependency graph and dispatches it across up to 8 isolated workers — each in its own git worktree, each owning specific files. Two workers never touch the same file. Work that takes hours serially finishes in minutes.
| Task | Files | Intent | Status |
|---|---|---|---|
| refactor-auth | 3 | exclusive | ✓ done |
| split-utils | 5 | exclusive | ✓ done |
| update-imports | 12 | shared_read | running |
| add-tests | 4 | exclusive | running |
| verify-types | 8 | shared_read | blocked |
| integration | — | merge | pending |
Topological ordering with file intents (exclusive / shared_read). Ready tasks scored by downstream dependency count. Up to 5 tasks dispatched per batch, 1–8 concurrent workers.
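The scheduling rule above can be sketched as follows. This is an illustration under assumptions, not Atlas's scheduler: the `Task` shape, the transitive reading of "downstream dependency count", and the alphabetical tie-break are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    files: frozenset
    intent: str                      # "exclusive" or "shared_read"
    deps: frozenset = frozenset()    # names of tasks that must finish first


def downstream_count(name, tasks):
    """Count tasks that depend on `name`, directly or transitively."""
    seen, stack = set(), [name]
    while stack:
        current = stack.pop()
        for t in tasks:
            if current in t.deps and t.name not in seen:
                seen.add(t.name)
                stack.append(t.name)
    return len(seen)


def conflicts(a, b):
    """Tasks conflict when they share a file and either wants it exclusively."""
    return bool(a.files & b.files) and "exclusive" in (a.intent, b.intent)


def next_batch(tasks, done, running, max_batch=5, max_workers=8):
    """Pick ready, non-conflicting tasks, most-depended-on first."""
    slots = min(max_batch, max_workers - len(running))
    active = [t for t in tasks if t.name in running]
    ready = [t for t in tasks
             if t.name not in done and t.name not in running and t.deps <= done]
    ready.sort(key=lambda t: (-downstream_count(t.name, tasks), t.name))
    batch = []
    for t in ready:
        if len(batch) >= slots:
            break
        if any(conflicts(t, other) for other in active + batch):
            continue  # wait for the file to be released
        batch.append(t)
    return [t.name for t in batch]
```

Scoring by downstream count unblocks the widest part of the graph first, and the intent check lets two shared_read tasks run side by side while exclusive owners keep a file to themselves.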
Ephemeral — in-memory, fast, gone when session ends.
Durable — SQLite WAL. Survives crashes. Pause mid-run, resume tomorrow.
Each worker gets its own worktree and runs claude -p in headless mode. A headless worker cannot ask questions, which forces better specs. A worker that stalls for 10 minutes is killed automatically.
Failures are tracked per bundle (threshold: 3). Low confidence (<0.65) triggers verification. Tasks block and wait rather than guess. Wall-time limit: 4 hours.
Patterns, failures, decisions, and conventions persist in a vector-embedded store. 30-day confidence decay. New runs seeded with relevant memories from past work.
Every completed task deposits pheromone traces into soil.db. The scheduler reads these signals to boost hot files, penalize stall-prone patterns, co-schedule coupled files, and avoid failure zones. ±0.5 soft adjustment — never overrides hard deps.
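The ±0.5 soft adjustment can be pictured as a clamp applied to the summed soil signals. The sketch assumes a hypothetical `dispatch_score` function and signal names; it also assumes hard dependencies are filtered out before scoring, so the adjustment only reorders tasks that are already ready.

```python
def dispatch_score(base_score, signals, cap=0.5):
    """Combine a task's base score with a bounded soil adjustment.

    `signals` maps a signal name to a signed contribution (names are
    hypothetical). The total is clamped to +/- `cap`, so soil can nudge
    ordering among ready tasks but never dominate the base score.
    """
    raw = sum(signals.values())
    soil_adjust = max(-cap, min(cap, raw))
    return base_score + soil_adjust, soil_adjust
```

For example, signals summing to +0.7 are clamped to +0.5, and a stall penalty of -0.9 is clamped to -0.5.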
Major professional services firms are active competitors in the geostrategic analysis space
Large asset managers developed proprietary interactive dashboards for clients
Every task completion, stall, and failure deposits traces into a shared memory layer. Future tasks read these traces and adapt — avoiding known bottlenecks, boosting proven patterns. Like how ant colonies coordinate through pheromone trails, workers coordinate through shared soil without ever talking to each other.
file_touch — which files get modified and how often
task_outcome — success/failure per task with confidence
coupling — files that change together across runs
stall — tasks that hung or timed out
lock_contention — file lock conflicts between workers
discovery — insights workers surface during execution
Traces start at strength 1.0 and decay exponentially — 7-day half-life for most traces, 30-day for coupling. Re-observation reinforces by +0.1. Garbage collection sweeps traces below 0.01. Dedup prevents duplicate deposits — same file+kind within 1 hour triggers reinforce instead.
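The decay numbers above follow the standard half-life formula, strength = initial × 0.5^(age / half-life). The sketch below works through it; the function and constant names are assumptions, and it leaves out reinforcement and dedup, whose exact behavior (for instance, whether strength is capped) is not specified here.

```python
DEFAULT_HALF_LIFE_DAYS = 7.0
HALF_LIFE_DAYS = {"coupling": 30.0}  # every other trace kind uses the default
GC_FLOOR = 0.01


def decayed(strength, age_days, kind):
    """Strength of a trace after `age_days` of exponential decay."""
    half_life = HALF_LIFE_DAYS.get(kind, DEFAULT_HALF_LIFE_DAYS)
    return strength * 0.5 ** (age_days / half_life)


def is_garbage(strength):
    """True when a trace has faded below the garbage-collection floor."""
    return strength < GC_FLOOR
```

With these values an unreinforced 7-day trace starting at 1.0 drops below 0.01 after about 46.5 days (7 × log2(100)), while a coupling trace takes roughly 199 days.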
Detects tech stack from package.json, Cargo.toml, go.mod, requirements.txt. Tags repos with ecosystem identifiers. Traces from similar repos (tag overlap > 0.4) contribute at 40% strength — so a React project learns from all React projects.
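How tag overlap is measured is not stated; the sketch below assumes Jaccard similarity (shared tags divided by all tags) purely for illustration, with hypothetical function names. Only the 0.4 threshold and the 40% weight come from the text above.

```python
def tag_overlap(local_tags, source_tags):
    """Jaccard similarity between two tag sets (an assumed metric)."""
    a, b = set(local_tags), set(source_tags)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def cross_repo_strength(strength, local_tags, source_tags,
                        threshold=0.4, weight=0.4):
    """Strength a foreign trace contributes to the local repo."""
    if tag_overlap(local_tags, source_tags) > threshold:
        return strength * weight  # similar repos contribute at 40% strength
    return 0.0                    # dissimilar repos contribute nothing
```

Under this reading, two repos tagged {react, typescript, vite} and {react, typescript, webpack} overlap at 0.5, so a full-strength trace from one counts as 0.4 in the other.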
Each dispatched worker receives natural-language soil summaries: file success rates, coupling relationships, stall warnings, relevant discoveries. Workers adapt behavior before writing a single line of code.
dispatch_log table records base_score, soil_adjust, and per-signal breakdown at dispatch time. Backfills outcome + duration at completion. Segments into promoted / penalized / neutral. Computes per-weight effectiveness and a soil maturity curve — success rate vs trace count, proving that more traces → better scheduling.
At Colosseum genesis, three soil channels seed initial populations: ecosystem hints (successful patterns from repos with matching tech stacks), file profile (target file history → aggressive / moderate / conservative mutation strategy), and coupling context (structurally linked files). Composes a natural language genesis prompt fragment that biases candidates toward what has worked before.
Some problems have too many possible solutions for any prompt to find the best one. Colosseum breeds candidate solutions, tests them against real fitness functions, kills the weak, and mutates the survivors. LLMs propose, natural selection disposes.
Ecosystem hints + file profile + coupling context from soil. DNA vault ancestry. Compiled in parallel via Rayon
Weighted tests + shadow tiers. Physics validation: impossible = dead
LLM-guided mutation. Ollama local → frontier on stall. Rule-based fallback
The LLM proposes, evolution disposes. Solutions prompting alone would never find
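The loop behind these four steps can be shown on a toy problem. In this sketch, Gaussian noise stands in for LLM-guided mutation and a plain Python function stands in for the fitness gauntlet; `evolve` and its parameters are illustrative assumptions, not Colosseum's engine.

```python
import random


def evolve(fitness, genome_len, generations=30, pop_size=20, keep=5, seed=0):
    """Toy generate / evaluate / select / mutate loop over float vectors."""
    rng = random.Random(seed)
    population = [[rng.uniform(-1, 1) for _ in range(genome_len)]
                  for _ in range(pop_size)]
    history = []  # best fitness seen at each generation
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)   # evaluate and rank
        survivors = population[:keep]                # kill the weak
        history.append(fitness(survivors[0]))
        children = []
        while len(survivors) + len(children) < pop_size:
            parent = rng.choice(survivors)
            # Gaussian noise stands in for an LLM-proposed mutation.
            children.append([g + rng.gauss(0, 0.1) for g in parent])
        population = survivors + children            # survivors carry over
    return max(population, key=fitness), history
```

Because survivors are carried into the next generation unchanged, the best fitness never gets worse from one generation to the next.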
code_transform — AST mutations via ast-grep. WASM sandbox.
real_estate — Unit mix, construction, LP/GP waterfall, DSCR.
hotel — ADR/RevPAR, promote tiers, exit valuation.
mezz_lending — Coverage ratios, recovery scenarios.
numerical — Parameter vectors against real data.
Wasmtime + Cranelift JIT + SIMD. Fuel-limited — no infinite loops. Module reuse across evaluations.
Frontier escalation — Ollama handles 90%. Stall 2+ gens → Gemini/Grok → drops back. Near-zero API cost.
Polynomial regression trained on completed evaluations predicts fitness before running the full gauntlet. Candidates below μ − kσ are rejected instantly — saving 40%+ of evaluation cost. Staged complexity: linear below 200 samples, quadratic after. LOO-CV R² validates model quality. Reservoir sampling keeps the training cache representative as the population evolves. Compile failures excluded from training to prevent surrogate poisoning.
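Three pieces of this pre-filter are easy to sketch with the standard library: the μ − kσ cutoff, the staged model complexity, and the reservoir that keeps the training cache bounded. Names here are hypothetical, the regression fit itself is omitted, and the use of the population standard deviation over observed fitness values is an assumption.

```python
import random
import statistics


class Reservoir:
    """Fixed-size uniform sample of a stream (classic Algorithm R)."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.items = []
        self.seen = 0
        self.rng = random.Random(seed)

    def add(self, item):
        self.seen += 1
        if len(self.items) < self.capacity:
            self.items.append(item)
        else:
            j = self.rng.randrange(self.seen)  # uniform in [0, seen - 1]
            if j < self.capacity:
                self.items[j] = item           # replace a random slot


def rejection_threshold(observed_fitness, k=1.0):
    """mu - k*sigma over the fitness values observed so far."""
    mu = statistics.fmean(observed_fitness)
    sigma = statistics.pstdev(observed_fitness)
    return mu - k * sigma


def should_reject(predicted_fitness, observed_fitness, k=1.0):
    """Reject a candidate whose predicted fitness falls below the cutoff."""
    return predicted_fitness < rejection_threshold(observed_fitness, k)


def surrogate_degree(n_samples):
    """Staged complexity: linear below 200 samples, quadratic after."""
    return 1 if n_samples < 200 else 2
```

The reservoir gives every evaluation seen so far the same chance of being in the cache, so early generations do not dominate the training data as the population moves on.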
Fire-and-forget check after every task completion. When a file is touched 3+ times across tasks, exceeds 500 lines, shows <50% success rate or active stall signals, and has no recent discoveries — auto-spawns a Colosseum sub-run. Uses buildColosseumConfig() for real AST pattern detection and injects a [SWARM TARGET DETECTED] task into the active bundle.
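The trigger condition can be written as a single predicate. The sketch assumes one reading of the sentence above, in which low success rate and active stall signals are alternatives and the other conditions are all required; `FileSignals` and `should_auto_escape` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class FileSignals:
    touches: int             # times the file was touched across tasks
    line_count: int
    success_rate: float      # 0.0 .. 1.0
    active_stall: bool
    recent_discoveries: int


def should_auto_escape(s):
    """True when a file looks like a monolith that keeps resisting work."""
    struggling = s.success_rate < 0.5 or s.active_stall
    return (s.touches >= 3
            and s.line_count > 500
            and struggling
            and s.recent_discoveries == 0)
```

Requiring zero recent discoveries keeps the trigger from firing on a large file where workers are still learning something useful.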
detectEscalation() analyzes findings diversity ratio and empty run counts across the SimMap every cycle. When diversity drops below 35% with 3+ empty runs — or 2+ zones are stuck — escalates to frontier models (Claude API → xAI fallback), bypassing task affinity routing. Drops back to standard tier when diversity recovers.
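The escalation rule reduces to two thresholds. The sketch below is not the actual detectEscalation() implementation: the function names are hypothetical, and defining diversity as unique findings divided by total findings is an assumption.

```python
def diversity_ratio(findings):
    """Share of findings that are distinct (an assumed definition)."""
    if not findings:
        return 0.0
    return len(set(findings)) / len(findings)


def should_escalate(diversity, empty_runs, stuck_zones):
    """Escalate when findings stagnate or several zones are stuck."""
    stagnating = diversity < 0.35 and empty_runs >= 3
    return stagnating or stuck_zones >= 2


def pick_tier(diversity, empty_runs, stuck_zones):
    """Return "frontier" while escalation holds, otherwise "standard"."""
    if should_escalate(diversity, empty_runs, stuck_zones):
        return "frontier"
    return "standard"
```

Because the tier is recomputed every cycle, the drop back to the standard tier happens on its own once diversity recovers.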
Pointed at FY2025 revenue data, Colosseum evolved pricing parameters within 1% of actual revenue. Discovered systematic underpricing of $135–143/night in shoulder-season periods. Found optimal 2.546x peak-season competitor premium. No human specified those numbers.
Claude · Anthropic · DeepSeek · xAI/Grok · Ollama · Ollama→Claude · Multi · BitNet
select, ripple → Ollama
perceive, investigate → Claude
atlas.db — runs, bundles, tasks, swarm DNA
memory.db — patterns + vector embeddings
soil.db — stigmergic traces, pheromone decay
simlink-runs.db — maps, zones, beats
Breeds its own PromptConfig. Surrogate fitness pre-filters 40%+ of candidates. Stigmergic soil biases scheduling. Validation harness proves soil works. Monolith auto-escape routes stuck files to evolution. Diversity-aware escalation brings frontier models when findings stagnate.
Shell metacharacter rejection. Command allowlisting. Git allowlist. HTML stripping. Fuel-limited WASM.
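The first three controls can be combined into one validation step that runs before any command is executed. This is a sketch under assumptions: both allowlists and the metacharacter set below are hypothetical examples, not Colosseum's actual lists.

```python
import shlex

# Hypothetical lists for illustration only.
SHELL_METACHARACTERS = set(";&|`$<>(){}\\\n")
ALLOWED_COMMANDS = {"git", "ls", "cat", "grep"}
ALLOWED_GIT_SUBCOMMANDS = {"status", "log", "diff", "show", "blame"}


def validate_command(command):
    """Return the parsed argv if the command is allowed, else None."""
    if any(ch in SHELL_METACHARACTERS for ch in command):
        return None  # reject chaining, substitution, and redirection outright
    try:
        argv = shlex.split(command)
    except ValueError:
        return None  # unbalanced quotes
    if not argv or argv[0] not in ALLOWED_COMMANDS:
        return None
    if argv[0] == "git" and (len(argv) < 2 or argv[1] not in ALLOWED_GIT_SUBCOMMANDS):
        return None
    return argv
```

Returning an argv list rather than a string means the caller can run the command without a shell, so the metacharacter check is a second line of defence rather than the only one.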
Evolved pricing against FY2025 data. Within 1% of actual revenue. No human guidance.
$500K/yr · 2.546x premium
69-unit development. Breakeven points and cliff risks identified before capital committed.
12,935 runs · 7 variables
1,209-line component → 6 bundles → 5 parallel workers. All tests maintained.
4x faster than serial
Pointed at repo with no instructions. Identified risks, coupling, refactor plan.
8 zones · 12 cycles
Web-mode investigation mapped competitor positioning across a fragmented market.
265 memories · 11 obs
On run #4 of a project, Soil traces from prior runs automatically penalized task patterns that had stalled before, cutting queue time without any manual tuning.
847 traces · 0 repeated stalls
A lightweight regression model learned to predict which candidates would fail the gauntlet, rejecting them before the expensive evaluation and saving nearly half the compute.
43% cost reduction · R² > 0.85
Bred WASM-sandboxed code transforms. 3,000+ variants in under 10 seconds.
Fitness 547 · WASM
TypeScript project inherited soil traces from 12 prior React projects (co-change patterns, stall warnings, and ESM conventions) before the first task dispatched.
40% cross-repo · 0.4 threshold
DeepMind proved LLMs combined with evolutionary search discover new knowledge. Google scaled it to TPU design. LeCun's team declared adaptation speed the metric that matters. Colosseum makes these ideas accessible as infrastructure.
Colosseum is built and maintained by Simeon Garratt — AI architect working across security, intelligence, and autonomous systems.