An interactive visualization of publicly released emails connected to Jeffrey Epstein
This project is an interactive visualization of publicly released email records connected to Jeffrey Epstein. It maps communication patterns across the archive, including group conversations and one-to-one exchanges.
The goal is to make a large body of material more navigable while preserving its relational structure. The visualization emphasizes structural clarity rather than interpretation. It does not draw conclusions, and appearance in the dataset does not imply wrongdoing.
The archive contains correspondence among lawyers, journalists, assistants, financial advisors, and other professionals. Many of these interactions are routine. The visualization presents all of them without editorial filtering.
The source material is drawn from two public releases. The archive currently stands at:
| Metric | Count |
|---|---|
| Email threads processed | 22,707 |
| Individual messages | 39,987 |
| People tracked | 685 |
| Communication environments | 27 |
The dataset represents a subset of the total DOJ EFTA releases. Not all available datasets have been ingested yet, and the archive may grow as additional documents are processed. When viewing individual threads, the visualization provides source attribution with links to original documents.
Does appearing in this visualization imply wrongdoing?
No. Many people in this archive are lawyers, journalists, assistants, and other professionals who interacted with Epstein's orbit for entirely legitimate reasons. Presence in the dataset reflects only that a person's name appears in the released email records.
What are the "rooms"?
Rooms are groups of people who frequently appeared together in the same email conversations. They are generated algorithmically, not by editorial choice. A room represents a communication pattern, not a physical location or organizational unit.
What does "Talked About" mean?
The visualization distinguishes between people who directly sent or received emails (participants) and people whose names appear in email body text but who were not on the message (mentions). Being mentioned is not the same as being part of a conversation.
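The distinction can be sketched with a small classifier; the thread structure and field names below are illustrative, not the project's actual schema.

```python
# Minimal sketch of the participant/mention distinction.
# Field names ("senders", "recipients", "mentioned_in_body") are assumptions.

def classify_person(name, thread):
    """Return 'participant', 'mention', or 'absent' for a name in a thread."""
    on_message = set(thread["senders"]) | set(thread["recipients"])
    if name in on_message:
        return "participant"   # directly sent or received the email
    if name in thread["mentioned_in_body"]:
        return "mention"       # named in the body text only
    return "absent"

thread = {
    "senders": ["Alice"],
    "recipients": ["Bob"],
    "mentioned_in_body": ["Carol"],
}
```

Here Alice and Bob are participants, while Carol is only "Talked About".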
What do the thread badges mean?
Threads may carry badges: "Notable" for threads flagged by the scoring system, "Contradiction" for threads containing evidence that may conflict with someone's public claims, and "Revealing" for threads assessed as containing sensitive information. These are analytical markers, not editorial judgments.
How current is this data?
The dataset reflects a subset of documents available as of early 2026. Additional DOJ datasets are being processed over time. The visualization will be updated as new material is ingested.
Source documents are PDF files from the DOJ. Many are scanned images, so text is extracted using OCR where needed. The system then identifies email messages within each document and resolves sender/recipient identities — merging name variants, nicknames, and email addresses into single identities where possible.
The "rooms" are generated automatically through community detection: when the same group of people appears together across multiple email threads, the algorithm clusters them. These clusters are not hand-picked — they emerge from communication patterns in the data.
Individuals are flagged as "notable" through automated scoring based on Wikipedia presence, inferred role, mention frequency, and email volume. The system also evaluates threads for significance and checks for contradictions with participants' public statements.
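A weighted score over those four signals might look like the following sketch; the weights, caps, and threshold are invented for illustration and are not the project's actual values.

```python
# Illustrative notability scoring over the signals named above
# (Wikipedia presence, inferred role, mention frequency, email volume).
# All weights and the threshold are assumptions for this sketch.

def notability_score(person):
    score = 0.0
    if person.get("has_wikipedia"):
        score += 3.0
    if person.get("role") in {"lawyer", "journalist", "financier"}:
        score += 1.0
    score += min(person.get("mention_count", 0) / 50, 2.0)   # capped mention signal
    score += min(person.get("email_count", 0) / 100, 2.0)    # capped volume signal
    return score

def is_notable(person, threshold=3.5):
    return notability_score(person) >= threshold
```

Capping the frequency-based terms keeps a single prolific correspondent from dominating the categorical signals.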
AI disclosure. AI (Claude, by Anthropic) assists with entity extraction, name deduplication, role inference, topic labeling, and pattern detection. All AI outputs are treated as provisional.
Built with D3.js using SVG and HTML Canvas. All processing is client-side — the application loads pre-computed JSON files and renders them in the browser.
Entity resolution. Names are normalized via a curated alias table, heuristic nickname matching, and AI-assisted deduplication. Non-person entities are filtered out.
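The normalization step can be sketched as follows; the alias entries and heuristics here are invented examples, not the real curated table.

```python
# Hedged sketch of name normalization via an alias table plus simple
# heuristics. The alias map and the middle-initial rule are assumptions.

ALIASES = {"bill": "william", "bob": "robert"}  # hypothetical nickname map

def normalize_name(raw, alias_table=ALIASES):
    """Lowercase, strip punctuation, expand nicknames, drop single-letter initials."""
    parts = raw.lower().replace(".", "").replace(",", "").split()
    parts = [alias_table.get(p, p) for p in parts if len(p) > 1]
    return " ".join(parts)
```

Two variants of the same person then collapse to one key, e.g. "Bob Smith" and "Robert J. Smith" both normalize to "robert smith".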
Community detection. Clusters are identified using greedy modularity optimization on a co-presence graph, with edge weights reflecting interaction type (direct exchanges weighted highest).
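Assembling the weighted co-presence graph might look like this sketch; the specific weights are assumptions, and clustering the result would then fall to a greedy modularity routine such as networkx's greedy_modularity_communities.

```python
# Sketch of building the weighted co-presence graph described above.
# The weight values are invented; only the shape (direct exchanges
# weighted above mere co-presence) follows the description.

from collections import defaultdict
from itertools import combinations

WEIGHTS = {"direct": 3.0, "co_presence": 1.0}  # assumed weighting scheme

def build_copresence_graph(threads):
    """threads: dicts with 'participants' and optional 'direct_pairs'."""
    edges = defaultdict(float)  # (a, b) with a < b -> accumulated weight
    for t in threads:
        # everyone on the same thread gets a baseline co-presence edge
        for a, b in combinations(sorted(set(t["participants"])), 2):
            edges[(a, b)] += WEIGHTS["co_presence"]
        # direct sender/recipient pairs get the heavier weight on top
        for pair in t.get("direct_pairs", []):
            edges[tuple(sorted(pair))] += WEIGHTS["direct"]
    return dict(edges)
```

Repeated co-appearance across many threads accumulates weight on the same edges, which is exactly the signal modularity-based clustering picks up.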
Pipeline. Two-pass approach: an initial snapshot bootstraps the people index, enrichment scripts add Wikipedia data and AI analysis, then a second pass incorporates all signals for final scoring and community structure.
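The two-pass flow above can be reduced to a minimal skeleton; the stage functions here are placeholders standing in for the project's actual scripts, with enrichment stubbed out.

```python
# Skeleton of the two-pass pipeline shape. Stage names and the toy
# scoring rule are assumptions for this sketch.

def bootstrap_people_index(documents):
    """Pass 1: snapshot every distinct name seen in the documents."""
    return {name: {} for doc in documents for name in doc["names"]}

def enrich(people):
    """Enrichment: attach external signals (Wikipedia, AI analysis), stubbed."""
    return {name: dict(info, enriched=True) for name, info in people.items()}

def second_pass(documents, people):
    """Pass 2: final scoring once all signals are available (toy rule)."""
    for name in people:
        people[name]["score"] = sum(name in doc["names"] for doc in documents)
    return people

docs = [{"names": ["Alice", "Bob"]}, {"names": ["Alice"]}]
people = second_pass(docs, enrich(bootstrap_people_index(docs)))
```

The point of the structure is ordering: scoring runs only after the people index exists and enrichment has landed, so every signal is in place for the final pass.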