Our weekly Briefing is free, but you should upgrade to access all of our reporting, resources, and a monthly workshop.
This week on Indicator
Craig published the video, transcript, slides, and notes from our most recent monthly workshop. He shared tools and tips for uncovering corporate ownership. We also heard from special guest Stephen Abbott Pugh, who demonstrated some of the tools he’s built to investigate and visualize ownership information.
Alexios discovered his fake partner was fake cheating on him as part of an investigation into the murky world of “cheater buster” websites that promise to uncover evidence of your partner’s infidelities.
Alexios was also quoted in a Reuters Institute piece on AI and the fog of war.
Deception in the News

📍 Italian prime minister Giorgia Meloni issued a warning about AI-generated images that showed her in lingerie. She encouraged people to “verify before believing.” (She also joked that the images “improved me quite a bit.”) Meloni was previously the victim of a deepfaked sexually explicit video that led to an ongoing lawsuit in Sardinia.
📍 The judicial inquiry into Elon Musk and X led by the Paris prosecutor’s office is now a criminal investigation. CNBC reports the inquiry “has focused on complaints of algorithmic manipulation by X to influence and interfere in French politics, and allegations that Musk and X knowingly allowed users of the AI chatbot Grok to create and spread Holocaust denials and nonconsensual sexually explicit deepfake images on X.”
📍 AFP reports that South Korea’s government “has hired hundreds of staff to track and counter manipulated content ahead of local ballots on June 3.”
📍 Several British children told Internet Matters that they’ve been drawing facial hair on themselves to get around new age verification measures.
📍 A Canadian musician is suing Google, alleging an AI Overview search result falsely labeled him a sex offender.
📍 Also in legal news about AI deception: Pennsylvania is suing Character AI to prevent the company’s chatbots from roleplaying as doctors.
📍 For a few days this week, a photo of Snap CEO Evan Spiegel on Wikipedia was replaced with a pic of a Wired journalist. It was apparently an honest mistake, though a surprisingly high-profile one.
📍 The EU’s elaborate decision-making process took a big step towards banning AI nudifiers.
📍 X’s head of product appeared to threaten Mr Beast with demonetization over an engagement-baiting tweet.
📍 Twenty people were found guilty of murdering two men in northeastern India in 2018. The case was tied to false rumors of child abduction that had spread on WhatsApp, and which led the company to change its approach to message forwarding.
📍 A fake Muslim mayoral candidate in the German city of Bielefeld is generating real hate.
Sponsored
Tools & Tips

Image taken from the Epstein Photo Network
Decoherence Media, a non-profit investigative outlet, recently published the Epstein Photo Network, which it described as “the highest-quality publicly available facial recognition interface to the Epstein Library, with the most verified names and the fewest false positives.”
Along with creating the freely accessible site, Decoherence’s announcement included the methodology and code used to build the tool. The post also discussed how they dealt with false positives and incorporated manual review to try to ensure data quality. It’s a nice case study in building a facial recognition pipeline from publicly available images.
I reached out to Decoherence cofounder Tristan Lee to talk about the process and what they learned. Here’s a quick Q&A. — Craig
You published the methodology of how you built the pipeline. One thing I didn't see is a time/cost estimate. How long did it take, and how much did it cost?
This project took about two months of effort between building the pipeline, identifying faces, and manually reviewing images. We used AWS Rekognition as the core facial recognition model, and our AWS bill for the whole thing ended up being around $200 for processing 25,000 images. PimEyes was $30 per month, and Facecheck was $50 for two months of credits. So overall pretty reasonable; by far the biggest cost was my own time.
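Decoherence’s actual code is linked from their announcement post. Purely as an illustrative sketch of how an indexing pass with Rekognition face collections can work (the function names, collection ID, and quality settings here are assumptions, not their implementation):

```python
def face_ids_from(response: dict) -> list[str]:
    """Pull the assigned face IDs out of an IndexFaces response."""
    return [record["Face"]["FaceId"] for record in response.get("FaceRecords", [])]


def index_image(collection_id: str, image_bytes: bytes, image_id: str) -> list[str]:
    """Index every face Rekognition detects in one image into a collection."""
    import boto3  # deferred import so face_ids_from works without the AWS SDK

    client = boto3.client("rekognition")
    response = client.index_faces(
        CollectionId=collection_id,
        Image={"Bytes": image_bytes},
        ExternalImageId=image_id,  # lets every match be traced back to its source photo
        QualityFilter="AUTO",      # drop blurry or tiny detections up front
    )
    return face_ids_from(response)
```

Once all 25,000 images are indexed this way, each detected face can be searched against the collection to find the other photos it appears in.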
What was your approach to identifying and removing false positives?
The facial recognition model and parameters we used were on the conservative side, so we had far more false negatives than false positives—which is preferable for a project like this. It’s much better to default to not identifying someone than to identify them with the wrong name.
There were still a few false positives though, and the way I dealt with them was flagging lower confidence matches and manually reviewing a lot of pictures to make sure all the names and unique identifiers were correct.
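As a rough sketch of that triage step (the threshold values below are placeholders for illustration; Decoherence’s post documents their actual parameters), matches can be bucketed by similarity score so that only the uncertain middle band consumes reviewer time:

```python
# Assumed thresholds, chosen to err toward false negatives as described above.
AUTO_ACCEPT = 99.0   # similarity at or above this: very likely the same person
REVIEW_FLOOR = 90.0  # between the floor and auto-accept: queue for manual review


def triage_matches(matches: list[dict]) -> dict[str, list[dict]]:
    """Sort Rekognition SearchFaces-style matches into accept/review/reject buckets."""
    buckets = {"accept": [], "review": [], "reject": []}
    for match in matches:
        similarity = match["Similarity"]
        if similarity >= AUTO_ACCEPT:
            buckets["accept"].append(match)
        elif similarity >= REVIEW_FLOOR:
            buckets["review"].append(match)
        else:
            buckets["reject"].append(match)
    return buckets
```

Anything in the reject bucket defaults to “unidentified,” which is the failure mode the project preferred.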
Was there a part that you expected to automate but ended up needing more manual work than anticipated? Or vice versa?
I was hoping I wouldn’t have to manually look through 25,000 images, but there were enough consistency issues that this ended up being necessary. For example, two photos might show the same scene a few seconds apart, and a person would have their face detected in the first photo but not the second (for example, because their face is angled away from the camera). So I would then have to add a bounding box for the face in the second photo.
I created a few basic web apps to speed up these kinds of tasks, for example adding new faces, identifying near-duplicate images, and determining if a person should be included in or excluded from the network.
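The interview doesn’t say how their near-duplicate detection works, but a common approach for this task is a perceptual hash such as dHash, where near-identical frames produce hashes a small Hamming distance apart. A minimal, dependency-free sketch (the grid sizes and distance cutoff are assumptions):

```python
def dhash(gray: list[list[int]]) -> int:
    """Difference hash: each bit records whether a pixel is brighter than its
    right neighbor. Expects a small grayscale grid (e.g. 8 rows x 9 columns
    after downscaling), so small edits barely change the hash."""
    bits = 0
    for row in gray:
        for left, right in zip(row, row[1:]):
            bits = (bits << 1) | (1 if left > right else 0)
    return bits


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")


def near_duplicates(hashes: dict[str, int], max_distance: int = 6) -> list[tuple[str, str]]:
    """Brute-force pairwise comparison; a BK-tree or bucketing would be needed
    at much larger scale, but this is workable for a one-off batch job."""
    names = sorted(hashes)
    return [
        (a, b)
        for i, a in enumerate(names)
        for b in names[i + 1:]
        if hamming(hashes[a], hashes[b]) <= max_distance
    ]
```

Candidate pairs from a pass like this would then go to one of the review web apps for a human decision.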
You mentioned using PimEyes and Facecheck. What can you share about the tools' strengths and weaknesses?
In my experience, PimEyes has a very low false positive rate. If it returns a result there’s a good chance it’s the same person. Facecheck has a much higher false positive rate, but it seems to draw from a database that includes more social media content, like Instagram and Facebook photos. Their pricing models are also different: PimEyes is a flat monthly subscription (75 searches per day, a limit I’ve never actually hit), while with Facecheck you pay per credit, and credits expire after a period of time. In practice I usually went to PimEyes first, then tried Facecheck if PimEyes results weren’t helpful.
But as with any facial recognition model, it’s important to take the results with a grain of salt and corroborate them with additional sources of information, especially when it comes to the faces of non-white people. There are dozens of stories about police making a false arrest based on a “confident” face match, almost always targeting a Black or Brown person.
How did you consider and implement concerns around privacy and nudity?
The privacy of Epstein’s victims was at the front of my mind throughout this project; the last thing we wanted to do was further traumatize them. The Department of Justice did a shameful job with its redactions: the initial release included nearly 100 naked pictures of one prominent Epstein victim.
If there was any indication a woman was abused by Epstein (for example, court documents, victim testimony, or disturbing images), we removed her from every page on the site and removed all pictures containing her from our static storage.
In addition to that, there were many young women (often connected to the modeling industry) who appeared in photos, but didn’t have any other clear connection to Epstein. (For example being an employee or in a relationship.) In those cases we didn’t attempt to identify them, and added them to the “Excluded” category.
There were a handful of edge cases that we carefully considered. Three women we identified from Epstein’s inner circle have credible claims of being abused by him, but because of their key role in his operations (for example being a key assistant, recruiter, or facilitator), we decided to name them.
We used an AWS Rekognition API (DetectModerationLabels) to identify images that contained nudity, and then manually reviewed the photos of a few particular people that had a high percentage of nude images, since the nudity detection wasn’t 100% accurate.
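DetectModerationLabels returns a list of labels with confidence scores rather than a yes/no answer, so some decision logic sits on top of it. As a hedged sketch of how that screening step might look (the confidence thresholds and the label-matching rule are assumptions, not Decoherence’s code):

```python
def flag_nudity(response: dict, min_confidence: float = 80.0) -> bool:
    """True if any moderation label mentioning nudity clears the confidence bar.
    Flagged images would go to a manual-review queue rather than being trusted
    outright, since the detection isn't 100% accurate."""
    for label in response.get("ModerationLabels", []):
        name = f'{label.get("ParentName", "")} {label["Name"]}'
        if label["Confidence"] >= min_confidence and "nudity" in name.lower():
            return True
    return False


def moderate(image_bytes: bytes) -> dict:
    """Call the Rekognition moderation API on one image."""
    import boto3  # deferred import so flag_nudity is usable without the AWS SDK

    client = boto3.client("rekognition")
    return client.detect_moderation_labels(
        Image={"Bytes": image_bytes},
        MinConfidence=60.0,  # return borderline labels too; filter locally
    )
```

Requesting a lower MinConfidence from the API and filtering locally leaves room to tune the threshold later without re-running every image.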
Many people in the photos are still unidentified. What's the plan for those?
We’ll continue to work on identifying as many people in these photos as possible. And now that we’ve released this project, we have a larger community of researchers to help. We’ve already added more identities from tips that people submitted through our Google form or through Reddit comments.
Now that you've done it once, what would you do differently on a similar future project and why?
Being more meticulous from the beginning about documenting my process and archiving everything. It’s easy to get “in the zone” when doing this kind of research, find something significant, and then have to go back a few days later with fuzzy memory and write down the chain of evidence that led there.
There are also a lot of changes I needed to make to our database schema and processing pipelines over time. The source code is littered with the creation of new columns and indexes that would have been nice to include from the beginning.
📍 Henk van Ess launched version 2.0 of his image analysis tool, imagewhisperer.org. He said it’s faster, includes “a custom investigation plan after every verdict,” additional search options, and more.
📍 Sweden’s official company registrar released an API for accessing beneficial ownership information. (via Stephen Abbott Pugh)
📍 Logan Woodward shared a link to SHADOHDORKS, a site that can generate more than 1,000 search dorks for a single URL.
📍 Ines Narciso highlighted the open-source Social Media Data Toolkit (SMDT), “a lightweight toolkit designed for ingesting, normalizing, enriching, and analyzing social-media data.” She noted that it “gives messy social media datasets a common grammar: it maps them into a standard schema — communities, accounts, posts, actions and entities — and then adds anonymization and enrichment layers, including LLM-based analysis and network tools.”
📍 Quick Cache and Archive Search is a free tool from Cyber Detective that can find archived copies of a webpage. (via Mario Santella)
📍 Kirby Plessas outlined five tools you can use instead of Instant Data Scraper, the Chrome extension that we recently reported has a mysterious new owner, and which also appears to have new and concerning functionality.
📍 Aida Kokanovic wrote a detailed article for OSINT Team about how to “Turn any document into a Maltego graph.”
📍 OpenCorporates published the latest article in its series about legal entities, “Not all legal entities are created equal.”
📍 Aeon Flex wrote, “Discord Servers Are OSINT Goldmines. If You Know Where to Look.”
Reports & Research

📍 Researchers at ETH Zurich found in a preprint that slightly shifting pixels in images can have significant consequences for how LLMs understand them. For example, adjusting an iconic photo of the moon landing to match a visual representation of “fake news” led ChatGPT 5.4 Thinking to claim the photo was a fake.
📍 In an op-ed, misinformation researcher Lisa Fazio vowed to keep working on the topic despite the US administration’s efforts to defund and silence such work. “The work is vital, so we’ll continue to do it for as long as we can,” she wrote.
📍 The Bureau of Investigative Journalism found that a devout Pakistani Muslim is behind Facebook accounts that spread Islamophobic AI slop to users in the UK. “At the heart of the issue is the way Meta has incentivised the creation of hateful AI slop as it chases user engagement and ad revenue,” the article said.
📍 In a global survey of human rights defenders, activists, journalists, writers, and other public communicators, 6% reported having been targeted with deepfake abuse.
📍 Research from NewsGuard found that the Claude chatbot “has become more vulnerable to state disinformation campaigns, a finding that is consistent with more general recent complaints from Claude users that the popular chatbot has become less reliable.”
Want more studies on digital deception? Paid subscribers get access to our Academic Library with 75 categorized and summarized studies:
One More Thing
Synthetic audio is still prone to pretty garish hallucinations, especially in longer clips. A case in point:

Indicator is a reader-funded publication.
Please upgrade to access all of our content, including our how-to guides, Academic Library, and live monthly workshops.



