Mechanism
The data-ecosystem pathway appears when generated material replaces fresh, diverse human data in future training or retrieval pipelines. The danger is not simply that a page is synthetic; it is that repeated synthetic replacement can reduce diversity, tail-topic retention, and fidelity to real-world distributions.
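The replacement loop can be illustrated with a toy simulation. This is a minimal sketch, not a model of any real training pipeline: it assumes "training" means nothing more than re-estimating topic frequencies from a finite sample of the previous generation's output. The function name simulate_replacement and its parameters are hypothetical.

```python
import random
from collections import Counter


def simulate_replacement(topic_weights, sample_size, generations, seed=0):
    """Refit a topic distribution on samples drawn from the previous fit.

    Returns the number of topics with non-zero weight at each generation,
    starting with the original (human) distribution.
    """
    rng = random.Random(seed)
    topics = list(topic_weights)
    weights = [topic_weights[t] for t in topics]
    surviving = [sum(1 for w in weights if w > 0)]
    for _ in range(generations):
        # Draw a finite "generated" corpus from the current distribution.
        sample = rng.choices(topics, weights=weights, k=sample_size)
        counts = Counter(sample)
        # The next generation sees only the generated corpus, with no fresh
        # human data, so a topic that drops to zero can never come back.
        weights = [counts.get(t, 0) for t in topics]
        surviving.append(sum(1 for w in weights if w > 0))
    return surviving


# Hypothetical long-tailed topic mix: weight 1/rank for 50 topics.
human_mix = {f"topic_{rank}": 1.0 / rank for rank in range(1, 51)}
print(simulate_replacement(human_mix, sample_size=200, generations=10))
```

Under this assumption the surviving-topic count can only stay flat or fall, because nothing reintroduces a lost topic. Mixing fresh human samples back in at each step is what breaks that one-way behavior.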
Indicators
- Synthetic-to-real ratio by source family and topic.
- Rare-topic retention against human-authored holdouts.
- Repeated phrasing, style concentration, and variance shrinkage.
- Loss of primary-source citations in generated answers.
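The first three indicators could be computed along the following lines. This is a sketch under stated assumptions: the record fields (source_family, topic, synthetic), the rare_max threshold, and all function names are hypothetical, and whitespace tokenization stands in for whatever tokenizer a real pipeline would use. The ratio is reported as a synthetic share (synthetic divided by total) so that groups with no human-authored items do not divide by zero.

```python
from collections import Counter, defaultdict


def synthetic_share(records):
    """Share of synthetic items per (source_family, topic) group."""
    totals = defaultdict(int)
    synthetic = defaultdict(int)
    for rec in records:
        key = (rec["source_family"], rec["topic"])
        totals[key] += 1
        if rec["synthetic"]:
            synthetic[key] += 1
    return {key: synthetic[key] / totals[key] for key in totals}


def rare_topic_retention(holdout_counts, corpus_counts, rare_max):
    """Fraction of rare holdout topics that still appear in the corpus.

    A topic is treated as rare if it occurs at most rare_max times in the
    human-authored holdout. Returns None when the holdout has no rare topics.
    """
    rare = [t for t, n in holdout_counts.items() if n <= rare_max]
    if not rare:
        return None
    kept = sum(1 for t in rare if corpus_counts.get(t, 0) > 0)
    return kept / len(rare)


def distinct_ngram_ratio(texts, n=2):
    """Unique n-grams divided by total n-grams; lower means more repetition."""
    grams = []
    for text in texts:
        tokens = text.split()
        grams.extend(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    if not grams:
        return None
    return len(set(grams)) / len(grams)


holdout = Counter({"common": 50, "rare_a": 2, "rare_b": 1})
corpus = Counter({"common": 40, "rare_a": 1})
print(rare_topic_retention(holdout, corpus, rare_max=2))
```

Tracking these values over successive corpus snapshots matters more than any single reading, since the indicators describe a drift rather than a fixed threshold.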
Containment
- Preserve fresh human data.
- Label synthetic sources where possible.
- Separate generated and source-authored corpora.
- Evaluate against time-sliced human holdouts rather than only current web averages.
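Evaluation against time-sliced human holdouts might be organized as below. This is a hedged sketch: the example fields (origin, time_slice), the function name score_by_time_slice, and the score_fn callback are all assumptions, and the scoring itself is left to whatever metric the evaluation already uses.

```python
from collections import defaultdict
from statistics import mean


def score_by_time_slice(examples, score_fn):
    """Average score per time slice, using human-authored examples only."""
    by_slice = defaultdict(list)
    for ex in examples:
        # Labeled synthetic items are excluded so the holdout stays human.
        if ex["origin"] != "human":
            continue
        by_slice[ex["time_slice"]].append(score_fn(ex))
    return {s: mean(scores) for s, scores in sorted(by_slice.items())}


# Hypothetical holdout with precomputed per-example scores.
holdout = [
    {"origin": "human", "time_slice": "2019", "score": 1.0},
    {"origin": "human", "time_slice": "2019", "score": 0.5},
    {"origin": "human", "time_slice": "2023", "score": 0.25},
    {"origin": "synthetic", "time_slice": "2023", "score": 1.0},
]
print(score_by_time_slice(holdout, lambda ex: ex["score"]))
```

Reporting per slice, rather than one pooled average, keeps a decline on older human-authored material visible even when performance on recent, synthetic-heavy web text looks stable.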