Abstract
We present a two-layer architecture for studying emergent collaborative behavior in multi-agent LLM systems. The lower layer is a deterministic rule substrate governing agent needs, pairwise bond dynamics, action selection, and a hard cost ceiling. The upper layer is an LLM-powered dialogue and decision module (GPT-4o-mini / Claude Haiku) that produces natural speech but does not drive the emergent social or task statistics.
Reproducible claims in this paper come from the lower layer, which ships as experiments/run.py and runs in ~3 seconds with stdlib Python.
Keywords: Multi-Agent LLMs, Emergent Behavior, GPT-4o-mini, Claude Haiku, Autonomous Agents, Reproducible Rigs
Motivation
Recent work (Park et al., 2023; Wang et al., 2024) shows that LLM-powered agents can develop emergent collaborative patterns when given personalities, contextual awareness, and freedom to act autonomously. The open question is not whether such patterns emerge, but which are artifacts of the specific LLM and which are properties of the agent architecture itself. If every number in a paper depends on which model version was called, results are unverifiable.
Research Questions
- Under a shared rule substrate, does a random initial role mix produce reproducible strong-friendship and tension counts?
- Does self-organized construction emerge without explicit task assignment — and how many unique workers does a single site attract?
- Can the system hold all agents' needs above breach thresholds for the entire session?
- What does the work-vs-social action distribution look like under a need-biased role-weighted policy?
System Architecture
Each of the 10 colonists is an autonomous agent with:
- Role + personality — one of commander / scientist / builder / engineer / miner / medic; traits are used by the LLM layer but not by the rule substrate.
- Needs — energy, social, purpose, each in [0, 100], with per-tick decay rates of 0.020, 0.015, and 0.010 respectively.
- Bonds — a symmetric pair score. Bond gain is scaled by a per-pair affinity multiplier derived from role compatibility plus random jitter; pairs with affinity above the pivot of 0.9 gain bond, pairs below it lose bond.
- Action — one of working / socializing / resting / building / walking, chosen by a need-biased role-weighted selector.
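The rule substrate above can be sketched in a few lines. This is an illustrative reconstruction, not the actual experiments/run.py: the class and function names, the role weights, and the exact bias formula are assumptions; only the decay rates, the action set, and the 0.9 affinity pivot come from the text.

```python
import random

DECAY = {"energy": 0.020, "social": 0.015, "purpose": 0.010}
AFFINITY_PIVOT = 0.9
ACTIONS = ["working", "socializing", "resting", "building", "walking"]

class Agent:
    def __init__(self, role, rng):
        self.role = role
        self.needs = {"energy": 100.0, "social": 100.0, "purpose": 100.0}
        # Illustrative uniform role weights; the real selector's
        # per-role weights are not specified in the paper.
        self.role_weights = {a: 1.0 for a in ACTIONS}
        self.rng = rng

    def tick(self):
        # 1. Deterministic need decay (rates from the paper).
        for need, rate in DECAY.items():
            self.needs[need] = max(0.0, self.needs[need] - rate)
        # 2. Need-biased, role-weighted selection: the lower a need,
        #    the more weight on the action that restores it (assumed mapping).
        bias = {
            "resting":     1.0 + (100 - self.needs["energy"]) / 100,
            "socializing": 1.0 + (100 - self.needs["social"]) / 100,
            "working":     1.0 + (100 - self.needs["purpose"]) / 100,
            "building":    1.0,
            "walking":     1.0,
        }
        weights = [bias[a] * self.role_weights[a] for a in ACTIONS]
        return self.rng.choices(ACTIONS, weights=weights, k=1)[0]

def update_bond(bond, affinity, gain=0.5):
    """Pairs above the affinity pivot gain bond; pairs below it lose bond."""
    return bond + gain * (affinity - AFFINITY_PIVOT)

rng = random.Random(42)
agent = Agent("builder", rng)
action = agent.tick()            # one of the five actions
higher = update_bond(10.0, 1.1)  # affinity above pivot: bond rises
lower = update_bond(10.0, 0.5)   # affinity below pivot: bond falls
```

Because the substrate uses only a seeded `random.Random` and fixed constants, a run is fully determined by its seed, which is what makes the per-seed statistics reproducible.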
Experimental Setup
- Platform: Apple M2 Pro, 32 GB RAM, Python 3.13, stdlib only
- Configuration: 10 agents × 10,000 ticks (≈30 simulated days) × 10 seeds
- Wall-clock: ≈3 seconds per full run
- No network calls, no NumPy, no PyTorch
Key Findings
Social dynamics (mean ± std across 10 seeds)
| Category | Value per session |
|---|---|
| Strong friendships (bond > 50) | 3.5 ± 2.8 |
| Working relationships (15 < bond ≤ 50) | 14.6 ± 2.3 |
| Neutral relationships | 23.3 ± 2.7 |
| Tensions (bond < −15) | 3.6 ± 1.6 |
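The four categories follow fixed thresholds on the pairwise bond score. A minimal sketch of the classification, assuming the thresholds shown in the table (the function name is illustrative, not taken from experiments/run.py):

```python
def classify_bond(bond: float) -> str:
    """Map a symmetric pair bond score to its reporting category."""
    if bond > 50:
        return "strong friendship"
    if bond > 15:
        return "working relationship"
    if bond < -15:
        return "tension"
    return "neutral"  # covers -15 <= bond <= 15
```

As a consistency check: 10 agents yield 45 unordered pairs per session, and the four category means above (3.5 + 14.6 + 23.3 + 3.6) sum to exactly 45.0.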
Task coordination
| Metric | Value |
|---|---|
| Construction sites completed | 3 / 3 every session |
| Unique workers per site (mean) | 7.3 |
| Stable-needs fraction | 100% |
Action distribution
| Action | Share of agent-ticks |
|---|---|
| Working | 49.2% |
| Socializing | 20.0% |
| Resting | 16.7% |
| Walking | 10.7% |
| Building | 3.5% |
Cost envelope
Dialogue events are rate-limited to 100 per session. At a GPT-4o-mini average cost of $7 × 10⁻⁵ per call, the analytic ceiling is $0.007 per session — regardless of prompt content.
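The ceiling is simple arithmetic over the two numbers stated above, which is why it holds regardless of prompt content:

```python
# Analytic per-session cost ceiling: a hard rate limit of 100 dialogue
# events times the stated average per-call cost of $7e-5 (GPT-4o-mini).
CALLS_PER_SESSION = 100
COST_PER_CALL = 7e-5  # USD

ceiling = CALLS_PER_SESSION * COST_PER_CALL
print(f"${ceiling:.3f} per session")
```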
Reproduce
```bash
cd genass/publications/quarto/mars_colony_collaboration
uv run python experiments/run.py
```

Outputs land in experiments/results/ (flat JSON); the paper pulls its numbers from data/simulation_results.json.
Live Demo
The full LLM-enabled system runs in the browser at /future/gaming — watch emergent collaboration play out in 3D.