PSI PARASITICAI.com

Research Node

Synthetic Data Contamination and Model Collapse

Mechanism

The data-ecosystem pathway appears when generated material replaces fresh, diverse human data in future training or retrieval pipelines. The danger is not that any single page is synthetic; it is that repeated replacement, generation after generation, can shrink diversity, erode retention of rare (tail) topics, and weaken fidelity to real-world distributions.
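The tail-loss part of this mechanism can be illustrated with a toy simulation. The sketch below is not a model of any real training pipeline: it assumes a hypothetical Zipf-like distribution over 200 topics, and each "generation" re-estimates that distribution only from a finite sample drawn from the previous one. The function name, the topic count, and the sample size are all illustrative assumptions.

```python
import random
from collections import Counter

def simulate_replacement(weights, sample_size, generations, seed=0):
    """Re-estimate a topic distribution from its own samples, repeatedly.

    Returns the number of topics with nonzero weight at the start and
    after each generation. A topic that draws zero samples in one
    generation can never reappear, so the count cannot increase.
    """
    rng = random.Random(seed)
    topics = list(range(len(weights)))
    current = list(weights)
    surviving = [sum(1 for w in current if w > 0)]
    for _ in range(generations):
        # Draw a finite "synthetic corpus" from the current distribution.
        sample = rng.choices(topics, weights=current, k=sample_size)
        counts = Counter(sample)
        # The next generation sees only what was sampled.
        current = [counts.get(t, 0) for t in topics]
        surviving.append(sum(1 for w in current if w > 0))
    return surviving

# Hypothetical Zipf-like topic weights: a few common topics, a long tail.
weights = [1.0 / (rank + 1) for rank in range(200)]
history = simulate_replacement(weights, sample_size=500, generations=10)
print(history)
```

Because no fresh data enters the loop, the surviving-topic count is non-increasing by construction. Mixing a share of the original distribution back in at each generation would be the toy analogue of preserving fresh human data.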

Indicators

  • Synthetic-to-real ratio by source family and topic.
  • Rare-topic retention against human-authored holdouts.
  • Repeated phrasing, style concentration, and variance shrinkage.
  • Loss of primary-source citations in generated answers.
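As a rough sketch, three of these indicators can be computed from simple counts. The record fields (`source_family`, `topic`, `is_synthetic`) and all function names below are hypothetical, and the code assumes provenance labels already exist. It reports the synthetic share (synthetic over total) rather than a strict synthetic-to-real ratio, to avoid dividing by zero when a group has no human-authored documents. Citation loss is not covered here.

```python
from collections import defaultdict

def synthetic_share(docs):
    """Share of synthetic documents per (source_family, topic) group."""
    counts = defaultdict(lambda: [0, 0])  # key -> [synthetic, total]
    for doc in docs:
        key = (doc["source_family"], doc["topic"])
        counts[key][0] += 1 if doc["is_synthetic"] else 0
        counts[key][1] += 1
    return {key: synth / total for key, (synth, total) in counts.items()}

def rare_topic_retention(corpus_topics, holdout_rare_topics):
    """Fraction of rare holdout topics that still appear in the corpus."""
    rare = set(holdout_rare_topics)
    if not rare:
        return 1.0
    return len(rare & set(corpus_topics)) / len(rare)

def distinct_ngram_ratio(texts, n=3):
    """Unique n-grams over total n-grams; lower means more repeated phrasing."""
    total = 0
    unique = set()
    for text in texts:
        tokens = text.lower().split()
        for i in range(len(tokens) - n + 1):
            unique.add(tuple(tokens[i:i + n]))
            total += 1
    return len(unique) / total if total else 0.0

# Made-up records, for illustration only.
docs = [
    {"source_family": "forum", "topic": "gardening", "is_synthetic": True},
    {"source_family": "forum", "topic": "gardening", "is_synthetic": False},
    {"source_family": "news", "topic": "tides", "is_synthetic": False},
]
print(synthetic_share(docs))
print(rare_topic_retention(["gardening", "tides"], ["tides", "lichen"]))
print(distinct_ngram_ratio(["the cat sat on the mat", "the cat sat on the rug"]))
```

Whitespace tokenization is a deliberate simplification. A real pipeline would likely use its own tokenizer and track these values over time rather than as single snapshots.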

Containment

Preserve fresh human data, label synthetic sources where possible, keep generated and source-authored corpora separate, and evaluate against time-sliced human holdouts rather than only against averages of the current web.
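A minimal sketch of the separation and time-slicing steps, assuming each document carries a hypothetical `provenance` label and a `published` date. The field names, the one-year slice width, and the toy scoring function are assumptions for illustration; a real evaluation would substitute its own metric.

```python
from collections import defaultdict
from datetime import date

def split_by_provenance(docs):
    """Separate documents into human, generated, and unlabeled corpora."""
    corpora = {"human": [], "generated": [], "unlabeled": []}
    for doc in docs:
        label = doc.get("provenance")
        key = label if label in ("human", "generated") else "unlabeled"
        corpora[key].append(doc)
    return corpora

def scores_by_time_slice(holdout, score_fn):
    """Average score_fn over holdout documents, grouped by publication year."""
    buckets = defaultdict(list)
    for doc in holdout:
        buckets[doc["published"].year].append(score_fn(doc))
    return {year: sum(vals) / len(vals) for year, vals in sorted(buckets.items())}

def toy_score(doc):
    """Placeholder metric (token count); stands in for a real evaluation score."""
    return len(doc["text"].split())

# Made-up documents, for illustration only.
docs = [
    {"text": "a b c", "published": date(2019, 5, 1), "provenance": "human"},
    {"text": "a b", "published": date(2019, 9, 1), "provenance": "human"},
    {"text": "a b c d", "published": date(2023, 1, 15), "provenance": "human"},
    {"text": "x y", "published": date(2023, 2, 1), "provenance": "generated"},
    {"text": "z", "published": date(2023, 3, 1)},
]
corpora = split_by_provenance(docs)
print({name: len(items) for name, items in corpora.items()})
print(scores_by_time_slice(corpora["human"], toy_score))
```

Reporting one score per slice, instead of a single pooled average, makes it visible when performance holds on recent material but drifts on older human-authored data.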