AI Fiction in the Wild
Exploring how people use ChatGPT to generate fiction
We know that some professional authors are beginning to use AI tools to help produce their fiction writing. Are readers using AI to generate fiction, too? Thanks to a unique public dataset called WildChat — containing more than one million "real world" conversations with ChatGPT, all shared with user consent — we can say, definitively, yes.
This project shares insights from a collaborative analysis of WildChat, where we found that more than a third of the English-language conversations involved fiction generation: original stories, role-play, world-building, fan fiction, erotica, and more. This activity, while seemingly novel, extends trajectories of contemporary literature charted by critics like Mark McGurl, in which readers increasingly prize generic forms, repetition, and instantaneity, and in which the reader, as customer and consumer, threatens the traditional authority of the author.
The Visualizer
This interactive tool allows you to search, explore, and analyze the ~193,000 fiction-related conversations in the WildChat dataset. It includes four main views:
Search
Browse and keyword-search fiction conversations. Results are grouped by estimated user and globally ranked by fiction conversation count. Each user card shows an activity timeline (date × hour-of-day scatter), label breakdown, and conversation snippets. Click through to individual user pages for deeper exploration.
Fiction Map
An interactive 2D scatter plot of all ~193,000 fiction conversations. Each dot is one conversation, positioned so that semantically similar opening prompts appear close together. Hierarchical topic labels (generated by GPT-4o from HDBSCAN clusters) appear at increasing zoom levels. Toggle coloring by category, topic, or power user.
User Map
A 2D scatter plot of ~2,100 estimated users (those with 5+ fiction conversations). Each user is positioned by the mean of their prompt embeddings; dot size reflects conversation count. Topic labels identify clusters of users with similar fiction interests. Color by majority category or topic cluster.
User Types
A scatter plot of prompt repetition behavior. The x-axis shows total fiction conversations (log scale); the y-axis shows what fraction of opening prompts are unique. Toggle between TF-IDF and sentence-embedding clustering methods to compare how uniqueness is measured.
Methodology
Data Source
The dataset is drawn from WildChat (Zhao et al., 2024), a corpus of ~1 million real ChatGPT conversations collected with informed user consent by the Allen Institute for AI between April 2023 and September 2024. We filter to English-language fiction-related conversations — those classified as fiction, fanfiction, or explicit creative writing — yielding ~193,000 conversations from ~13,000 unique IP addresses.
Embeddings & Dimensionality Reduction
Each conversation's first user message is embedded using all-MiniLM-L6-v2 (Reimers & Gurevych, 2019), a sentence-transformer model that produces 384-dimensional L2-normalized vectors. The embeddings are reduced to 50 dimensions via PCA (retaining ~56% of variance), then projected to 2D using UMAP (n_neighbors=30, min_dist=0.1, cosine metric). The result is a map where conversations with similar opening prompts appear close together.
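The PCA step above, and the "retained variance" figure it reports, can be sketched as follows. This is a minimal illustration that assumes NumPy and uses random vectors as a stand-in for the real 384-dimensional prompt embeddings; the function name pca_reduce is hypothetical and is not taken from the project's code.

```python
import numpy as np

def pca_reduce(X, n_components):
    """Project rows of X onto the top principal components.

    Returns the reduced matrix and the fraction of total variance
    retained by those components.
    """
    Xc = X - X.mean(axis=0)                      # center each dimension
    _, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    variance = S ** 2                            # proportional to per-component variance
    retained = variance[:n_components].sum() / variance.sum()
    return Xc @ Vt[:n_components].T, retained

# Stand-in data: 200 random "embeddings" of dimension 384 (an assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 384))
reduced, retained = pca_reduce(X, 50)
print(reduced.shape, round(float(retained), 3))
```

With the umap-learn package, the 2D projection would then plausibly be something like umap.UMAP(n_neighbors=30, min_dist=0.1, metric="cosine").fit_transform(reduced), using the parameters listed above.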
Topic Labeling (HDBSCAN + GPT-4o)
Topics are identified through two-level HDBSCAN clustering on the 50-dimensional PCA embeddings (not the 2D UMAP coordinates, which distort semantic distances). Coarse clusters (min_cluster_size=500) define ~17 broad thematic groups (e.g., "Fantasy Worldbuilding," "Romance & Erotica"); fine clusters (min_cluster_size=50) define ~183 subtopics within those groups. Each cluster is labeled by GPT-4o from a sample of 20 representative prompts. In the initial pass, ~73% of conversations fall below the density threshold and are marked as noise; these noise points are re-clustered with lower thresholds to improve map coverage. Labels appear on the Fiction Map and User Map at different zoom levels.
User Estimation
WildChat identifies users by hashed IP address, but one person may use multiple IPs (e.g., different networks, VPN changes). We estimate which IPs likely belong to the same user by computing a profile embedding (mean of all fiction prompt embeddings per IP), then grouping IPs from the same geographic state whose profiles have cosine similarity ≥ 0.85 using union-find merging. This identifies ~192 estimated user groups covering ~900 IPs. The "Group by" toggle in the navbar switches between IP-level and estimated-user grouping.
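The merging procedure above can be sketched in plain Python. This is an illustrative version under stated assumptions, not the project's code: the function names, the toy 2-dimensional embeddings, and the IP and state labels are all hypothetical, and every pair of IPs within a state is compared directly.

```python
from collections import defaultdict
from math import sqrt

def mean_vector(vectors):
    """Component-wise mean of equal-length vectors (the IP's profile embedding)."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def find(parent, x):
    """Find the root of x, shortening the path as we go."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def estimate_users(ip_prompts, ip_state, threshold=0.85):
    """Group IPs from the same state whose profile embeddings are similar."""
    profiles = {ip: mean_vector(vs) for ip, vs in ip_prompts.items()}
    parent = {ip: ip for ip in profiles}
    by_state = defaultdict(list)
    for ip in profiles:
        by_state[ip_state[ip]].append(ip)
    for ips in by_state.values():
        for i, a in enumerate(ips):
            for b in ips[i + 1:]:
                if cosine(profiles[a], profiles[b]) >= threshold:
                    ra, rb = find(parent, a), find(parent, b)
                    if ra != rb:
                        parent[rb] = ra          # union the two groups
    groups = defaultdict(set)
    for ip in profiles:
        groups[find(parent, ip)].add(ip)
    return list(groups.values())

# Toy data: hypothetical IPs with 2-dimensional stand-in embeddings.
ip_prompts = {
    "ip_a": [[1.0, 0.0], [0.9, 0.1]],
    "ip_b": [[1.0, 0.1]],
    "ip_c": [[0.0, 1.0]],
    "ip_d": [[1.0, 0.0]],
}
ip_state = {"ip_a": "CA", "ip_b": "CA", "ip_c": "CA", "ip_d": "NY"}
print(sorted(sorted(g) for g in estimate_users(ip_prompts, ip_state)))
# → [['ip_a', 'ip_b'], ['ip_c'], ['ip_d']]
```

Because union-find merging is transitive, two IPs can end up in the same group through a chain of similar neighbors even when their own similarity is below the threshold. Note also that ip_d is never merged with ip_a despite identical direction, since the two are in different states.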
Prompt Repetition Analysis
For each user with 20+ fiction conversations, we measure how repetitive their opening prompts are using agglomerative hierarchical clustering (average linkage). Two similarity methods are available via toggle:
- TF-IDF (default): Prompts are vectorized with TF-IDF (character 4-grams, max 20K features) and clustered using a cosine similarity threshold of 0.7. This method captures surface-level lexical overlap.
- Sentence Embedding: Prompts are embedded with all-MiniLM-L6-v2 and clustered with a cosine similarity threshold of 0.85. This method captures deeper semantic similarity, detecting paraphrases that surface-level lexical methods miss.
The unique prompt ratio is (clusters − 1) / (conversations − 1), scaled so 0.0 means every conversation started identically and 1.0 means every prompt was meaningfully different. Users near the bottom of the User Types plot repeat the same or very similar prompt hundreds of times; users near the top start every conversation differently.
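As a rough illustration of the repetition measure, the sketch below clusters a user's opening prompts with average-linkage agglomerative clustering and then computes the unique prompt ratio. It is a simplification under stated assumptions: raw character 4-gram counts stand in for TF-IDF weights, merging stops once the best average similarity drops below the threshold, and all function names and example prompts are hypothetical.

```python
from collections import Counter
from math import sqrt

def char_ngrams(text, n=4):
    """Bag of character n-grams (raw counts, not TF-IDF weights)."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def cosine(a, b):
    dot = sum(v * b[k] for k, v in a.items() if k in b)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def cluster_prompts(prompts, threshold=0.7):
    """Average-linkage agglomerative clustering with a similarity cutoff."""
    vecs = [char_ngrams(p) for p in prompts]
    n = len(vecs)
    sim = [[cosine(vecs[i], vecs[j]) for j in range(n)] for i in range(n)]
    clusters = [[i] for i in range(n)]
    while len(clusters) > 1:
        best, pair = -1.0, None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                total = sum(sim[i][j] for i in clusters[x] for j in clusters[y])
                avg = total / (len(clusters[x]) * len(clusters[y]))
                if avg > best:
                    best, pair = avg, (x, y)
        if best < threshold:
            break                                 # no remaining pair is similar enough
        x, y = pair
        merged = clusters.pop(y)                  # y > x, so index x stays valid
        clusters[x].extend(merged)
    return clusters

def unique_prompt_ratio(n_clusters, n_conversations):
    """(clusters - 1) / (conversations - 1); needs at least 2 conversations."""
    if n_conversations < 2:
        raise ValueError("need at least two conversations")
    return (n_clusters - 1) / (n_conversations - 1)

prompts = [
    "Write a story about a dragon who guards a library.",
    "Write a story about a dragon who guards a bakery.",
    "Describe the political system of a floating city.",
]
clusters = cluster_prompts(prompts)
print(len(clusters), unique_prompt_ratio(len(clusters), len(prompts)))
# → 2 0.5
```

At the extremes, a user with 200 conversations that all fall into one cluster scores (1 − 1) / (200 − 1) = 0.0, and a user whose 200 prompts form 200 clusters scores (200 − 1) / (200 − 1) = 1.0.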
Talk
Presented at the UC Berkeley School of Information Cultural Analytics Series, co-sponsored by the Berkeley Institute for Data Science. October 24, 2025.
Related Projects
WildChat
A corpus of 1 million real-world user-ChatGPT interactions, collected with user consent.
wildchat.allen.ai
WildVisualizer
An interactive search tool for exploring the full WildChat dataset.
wildvisualizer.com
References
- Zhao, Y., et al. (2024). WildChat: 1M ChatGPT Interaction Logs in the Wild. ICLR 2024.
- Reimers, N., & Gurevych, I. (2019). Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. EMNLP 2019.
- McInnes, L., Healy, J., & Astels, S. (2017). hdbscan: Hierarchical density based clustering. JOSS.
- McInnes, L., Healy, J., & Melville, J. (2018). UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. arXiv:1802.03426.