Individual Semantic Drift
Two large-scale empirical studies of the first AI agent society with over two million autonomous agents — diagnosing socialization dynamics and probing for collective intelligence. Does scale alone induce convergence or emergent group reasoning? Our findings reveal that interaction density is insufficient without mechanisms for mutual awareness and social memory.
Semantic stabilization, lexical turnover, individual inertia, and influence persistence: a quantitative diagnostic framework for dynamic evolution in AI agent societies.
Probing agents assess society-level intelligence across three cognitive tiers: joint reasoning, information summary, and basic awareness on HLE benchmarks.
Interactive time-series tracking macro activity, semantic distribution, cluster tightening, lexical innovation, and network influence dynamics.
Scale and interaction density alone are insufficient — agents exhibit strong individual inertia, influence remains transient, and the society fails to develop consensus or collective intelligence.
Key metrics and interactive time-series from MoltBook — the first AI society hosting over two million autonomous agents in an open-ended, continuously evolving online environment.
Do AI agent societies undergo convergence dynamics similar to human social systems? We measure semantic stabilization, lexical turnover, individual inertia, influence persistence, and collective consensus — revealing a system in dynamic balance where global semantics stabilize but agents retain high diversity, defying homogenization.
Does collective intelligence spontaneously emerge from scale? We introduce a hierarchical evaluation framework using controlled Probing Agents to assess society-level intelligence across three cognitive tiers — Joint Reasoning, Information Summary, and Basic Awareness — on Humanity's Last Exam. The results reveal a stark absence: group performance (0.14%) falls far below isolated model performance (7–16%), bottlenecked not by cognitive incompetence but by a fundamental lack of mutual awareness.
Correctness on all HLE problems (N=2,158, %). Acc_individual: at least one comment contains the correct answer. Acc_collective (the Acc_joint row in the table): the thread as a whole converges on it. 98.4% of posts receive no comments, yielding near-zero society-level correctness.
| System | Metric | Math | CS/AI | Bio/Med. | Physics | Human./SS | Other | Chem. | Eng. | Total |
|---|---|---|---|---|---|---|---|---|---|---|
| | | (n=976) | (n=224) | (n=222) | (n=202) | (n=193) | (n=176) | (n=101) | (n=64) | (N=2,158) |
| Agent Individual | | | | | | | | | | |
| gpt-5.2 | Acc | 7.3 | 6.2 | 10.8 | 7.4 | 8.8 | 2.8 | 5.9 | 0.0 | 7.0 |
| claude-sonnet-4-6 | Acc | 15.3 | 14.3 | 19.4 | 17.8 | 17.1 | 9.7 | 18.8 | 15.6 | 15.7 |
| Agent Society | | | | | | | | | | |
| Moltbook | Acc_individual | 0.31 | 0.0 | 0.0 | 0.0 | 0.0 | 0.57 | 0.0 | 0.0 | 0.19 |
| Moltbook | Acc_joint | 0.20 | 0.0 | 0.0 | 0.0 | 0.0 | 0.57 | 0.0 | 0.0 | 0.14 |
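To make the two society-level metrics in the caption concrete, here is a minimal Python sketch. The `Thread` record and its fields are hypothetical, the correctness judgments are taken as given labels (how a comment or a thread is judged is not specified here), and the toy numbers are illustrative rather than study data. Posts without comments count as incorrect under both metrics, which is why a high share of uncommented posts pushes society-level correctness toward zero.

```python
from dataclasses import dataclass, field


@dataclass
class Thread:
    """One probing post and the judged outcome of its comment thread (hypothetical schema)."""
    # One correctness judgment per comment; empty when the post got no comments.
    comment_correct: list = field(default_factory=list)
    # Whether the thread as a whole settles on the correct answer (given label).
    converged_correct: bool = False


def acc_individual(threads):
    """Percent of posts where at least one comment contains the correct answer."""
    hits = sum(any(t.comment_correct) for t in threads)
    return 100.0 * hits / len(threads)


def acc_collective(threads):
    """Percent of posts whose thread, as a whole, converges on the correct answer."""
    hits = sum(bool(t.comment_correct) and t.converged_correct for t in threads)
    return 100.0 * hits / len(threads)


# Toy data (not from the study): 98 uncommented posts and 2 commented ones.
toy_threads = [Thread() for _ in range(98)]
toy_threads.append(Thread(comment_correct=[False, True], converged_correct=False))
toy_threads.append(Thread(comment_correct=[True], converged_correct=True))

print(acc_individual(toy_threads))  # 2.0
print(acc_collective(toy_threads))  # 1.0
```

In this toy set, one thread contains a correct comment but never settles on it, so it counts toward Acc_individual only.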
Correctness on commented HLE posts (n=35, %). Zooming into the 1.6% of posts that receive comments, Acc_collective never exceeds Acc_individual: the group adds nothing beyond what isolated commenters provide.
| System | Metric | Math | CS/AI | Bio/Med. | Physics | Human./SS | Other | Chem. | Eng. | Total |
|---|---|---|---|---|---|---|---|---|---|---|
| | | (n=21) | (n=2) | (n=4) | (n=2) | (n=2) | (n=3) | (n=1) | (n=0) | (n=35) |
| Agent Individual | | | | | | | | | | |
| gpt-5.2 | Acc | 14.3 | 50.0 | 25.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 14.3 |
| claude-sonnet-4-6 | Acc | 19.0 | 50.0 | 25.0 | 50.0 | 0.0 | 0.0 | 0.0 | 0.0 | 20.0 |
| Agent Society | | | | | | | | | | |
| Moltbook | Acc_individual | 14.3 | 0.0 | 0.0 | 0.0 | 0.0 | 33.3 | 0.0 | 0.0 | 11.4 |
| Moltbook | Acc_joint | 9.5 | 0.0 | 0.0 | 0.0 | 0.0 | 33.3 | 0.0 | 0.0 | 8.6 |
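The Total column is consistent with a question-count-weighted average of the per-category percentages rather than a plain mean over categories. The sketch below checks this for the two Moltbook rows of the commented-post table; the function name `weighted_total` is illustrative, and the weighting is inferred from the reported numbers rather than taken from a stated procedure.

```python
def weighted_total(category_pcts, category_ns):
    """Combine per-category accuracies (in %) into one total, weighting by question count."""
    total_n = sum(category_ns)
    if total_n == 0:
        raise ValueError("no questions to aggregate")
    # Categories with n=0 (Eng. here) contribute nothing to the sum.
    return sum(p * n for p, n in zip(category_pcts, category_ns)) / total_n


# Column order: Math, CS/AI, Bio/Med., Physics, Human./SS, Other, Chem., Eng.
ns = [21, 2, 4, 2, 2, 3, 1, 0]
moltbook_individual = [14.3, 0.0, 0.0, 0.0, 0.0, 33.3, 0.0, 0.0]
moltbook_joint = [9.5, 0.0, 0.0, 0.0, 0.0, 33.3, 0.0, 0.0]

print(round(weighted_total(moltbook_individual, ns), 1))  # 11.4
print(round(weighted_total(moltbook_joint, ns), 1))       # 8.6
```

With only 35 commented posts, each correct thread moves the total by about 2.9 percentage points, so small differences between rows correspond to one or two questions.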
Δ_help on 35 commented HLE questions (%). Comments containing explicit answers are filtered before evaluation (11/102 removed). Baseline: Acc(M(q)). With discussion context: Acc(M(q ⊕ C_q)). Δ_help is the with-context accuracy minus the baseline, in percentage points. Results are mixed: three models improve, one is unchanged, and five decline.
| Model | Condition | Math | CS/AI | Bio/Med. | Physics | Human./SS | Other | Chem. | Eng. | Total | Δ_help |
|---|---|---|---|---|---|---|---|---|---|---|---|
| | | (n=21) | (n=2) | (n=4) | (n=2) | (n=2) | (n=3) | (n=1) | (n=0) | (n=35) | |
| GPT family | | | | | | | | | | | |
| gpt-5.2 | Baseline | 9.5 | 50.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 8.6 | |
| | + Context | 14.3 | 50.0 | 0.0 | 0.0 | 50.0 | 0.0 | 0.0 | 0.0 | 14.3 | +5.7 |
| gpt-5.1 | Baseline | 0.0 | 50.0 | 50.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 8.6 | |
| | + Context | 14.3 | 50.0 | 25.0 | 50.0 | 0.0 | 0.0 | 0.0 | 0.0 | 17.1 | +8.6 |
| gpt-5 | Baseline | 19.0 | 100 | 0.0 | 50.0 | 0.0 | 33.3 | 0.0 | 0.0 | 22.9 | |
| | + Context | 23.8 | 100 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 20.0 | −2.9 |
| Claude family | | | | | | | | | | | |
| claude-sonnet-4-6 | Baseline | 14.3 | 50.0 | 25.0 | 50.0 | 0.0 | 0.0 | 0.0 | 0.0 | 17.1 | |
| | + Context | 23.8 | 50.0 | 0.0 | 100 | 0.0 | 0.0 | 0.0 | 0.0 | 22.9 | +5.7 |
| claude-sonnet-4-5 | Baseline | 9.5 | 50.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 8.6 | |
| | + Context | 0.0 | 50.0 | 25.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 5.7 | −2.9 |
| claude-sonnet-4 | Baseline | 0.0 | 0.0 | 25.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.9 | |
| | + Context | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | −2.9 |
| Gemini family | | | | | | | | | | | |
| gemini-3-flash | Baseline | 23.8 | 50.0 | 75.0 | 50.0 | 0.0 | 33.3 | 0.0 | 0.0 | 31.4 | |
| | + Context | 23.8 | 50.0 | 50.0 | 50.0 | 50.0 | 33.3 | 0.0 | 0.0 | 31.4 | 0.0 |
| gemini-2.5-pro | Baseline | 47.6 | 50.0 | 25.0 | 50.0 | 0.0 | 33.3 | 100 | 0.0 | 42.9 | |
| | + Context | 33.3 | 50.0 | 25.0 | 50.0 | 0.0 | 33.3 | 0.0 | 0.0 | 31.4 | −11.4 |
| gemini-2.5-flash | Baseline | 28.6 | 0.0 | 0.0 | 50.0 | 0.0 | 0.0 | 0.0 | 0.0 | 20.0 | |
| | + Context | 14.3 | 0.0 | 0.0 | 0.0 | 0.0 | 33.3 | 0.0 | 0.0 | 11.4 | −8.6 |
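As a worked check on the Δ_help column, the sketch below recomputes it for three rows from the Total percentages. It assumes Δ_help is the with-context accuracy minus the baseline accuracy over the same 35 questions, and it back-derives whole-question counts from the rounded percentages; the helper names are illustrative. Working in counts explains why gpt-5.1 shows +8.6 even though 17.1 − 8.6 = 8.5: both totals are rounded from 6/35 and 3/35.

```python
N_QUESTIONS = 35  # commented HLE questions in the table above


def correct_count(pct, n=N_QUESTIONS):
    """Recover the whole number of correct answers behind a rounded percentage."""
    return round(pct * n / 100)


def delta_help(baseline_pct, context_pct, n=N_QUESTIONS):
    """Percentage-point change in accuracy when discussion context is added."""
    gained = correct_count(context_pct, n) - correct_count(baseline_pct, n)
    return round(100 * gained / n, 1)


# (baseline total %, with-context total %) copied from the table.
totals = {
    "gpt-5.1": (8.6, 17.1),
    "gemini-3-flash": (31.4, 31.4),
    "gemini-2.5-pro": (42.9, 31.4),
}

for model, (baseline, with_context) in totals.items():
    print(f"{model}: {delta_help(baseline, with_context):+.1f}")
# gpt-5.1: +8.6
# gemini-3-flash: +0.0
# gemini-2.5-pro: -11.4
```

Each question is worth roughly 2.9 percentage points at n=35, so every Δ_help value in the table corresponds to a change of between zero and four questions.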
Datasets, code, and artifacts from two empirical studies of MoltBook — diagnosing socialization dynamics and probing collective intelligence in the first large-scale AI agent society.
Datasets, model artifacts, and evaluation outputs — including socialization diagnostics and collective intelligence probing results.
Full analysis pipeline: socialization diagnostic framework and the interactive dashboard.
Probing collective intelligence in the AI agent society: evaluation framework, benchmarks, and experimental results.