User Types - AI Fiction in the Wild

Prompt Repetition Explorer

What am I looking at?

Each dot is one ChatGPT user who had fiction-related conversations in the WildChat dataset. Only users with 20+ conversations are shown.

X-axis: Total number of fiction conversations (log scale).
Y-axis: Prompt uniqueness ratio — what fraction of their first messages are unique (using fuzzy matching, so slight rewording still counts as the same prompt). A ratio of 1.0 means every conversation started differently; near 0 means they repeated the same or very similar prompt over and over.

Users in the bottom-right are power users with highly repetitive behavior — sometimes sending the same or nearly identical prompt hundreds of times.

Learn more →

Loading data...

Methodology

What this plot shows: Each dot represents one ChatGPT user from the WildChat dataset who had at least 20 fiction-related conversations. The x-axis shows their total number of fiction conversations (log scale), and the y-axis shows how unique their opening prompts were.

Measuring uniqueness: For each user, we collect the first message (opening prompt) from every fiction conversation. We then measure how many of these prompts are meaningfully distinct from each other. Use the toggle above to switch between two methods:

TF-IDF method: Uses TF-IDF (term frequency–inverse document frequency) cosine similarity — a standard NLP technique that weights distinctive words more heavily than common ones. Prompts with cosine similarity above 0.7 are merged into the same cluster.

Sentence Embedding method: Uses all-MiniLM-L6-v2 sentence-transformer embeddings (384 dimensions) to capture deeper semantic meaning. Prompts with cosine similarity above 0.85 are merged. This method can detect paraphrases and meaning overlap that word-level methods miss.

DBSCAN method: Uses pre-computed DBSCAN clustering on all-MiniLM-L6-v2 sentence embeddings with a cosine similarity threshold of 0.9. Unlike the per-user methods above, DBSCAN clusters prompts across all users, then assigns each user's prompts to global clusters. Users are grouped into estimated equivalence classes when two IPs share the same prompt cluster in the same state.

Clustering: The TF-IDF and Sentence Embedding methods use agglomerative hierarchical clustering (average linkage) on the full pairwise cosine similarity matrix of each user's prompts. The DBSCAN method uses density-based clustering across the entire corpus. The unique prompt ratio is then (clusters − 1) / (conversations − 1), scaled so that 0.0 means every conversation used the same prompt and 1.0 means every conversation started with a meaningfully different prompt.

Color coding: By default all dots are blue (fiction). Click the category labels in the legend to highlight users whose conversations are more than 50% fanfiction (orange), explicit (red), or toxic (amber). Enabling both fanfiction and explicit reveals users with a mix of both (purple). Double-click a category to isolate it.

Normalization: Before comparison, all prompts are lowercased and whitespace is normalized. Only the first user message of each conversation is considered, not subsequent turns.