Privacy Guide | July 1, 2026 | 10 min read

How Data Brokers Sell Your Information to AI Companies

By Sarah Chen

Head of Privacy Research

There is a growing pipeline between data brokers and AI companies, and your personal information is flowing through it. This is not the familiar story of web scraping, where AI companies hoover up publicly available internet content. This is something different: a commercial data brokerage ecosystem where personal data — Slack messages, emails, phone recordings, social media posts, and detailed consumer profiles — is packaged and sold specifically for AI model training. The people whose data fills these datasets rarely know about it and almost never consent to it.

The New AI Training Data Supply Chain

AI companies need massive volumes of diverse, high-quality data to train their models. With much of the most accessible internet data already scraped, these companies are increasingly turning to commercial sources, and data brokers are eager to supply them.

Traditional data brokers have spent decades collecting, aggregating, and selling consumer information. They have databases containing names, addresses, phone numbers, purchase histories, browsing patterns, financial records, and hundreds of other data points on hundreds of millions of people. Now, these same brokers are pivoting to serve a new category of customer: AI companies willing to pay premium prices for structured, labeled data that can be used directly for model training.

Companies like RadHash have built business models around openly brokering structured data specifically for large language model training. They position themselves as intermediaries between data holders and AI companies, facilitating transactions for datasets that include consumer communications, behavioral data, and professional content.

SimpleClosure's Asset Hub: Selling the Dead

One of the most striking examples of this pipeline is SimpleClosure's Asset Hub, a platform where companies that are shutting down can sell their digital assets — including Slack archives, internal emails, code repositories, and customer data — to AI companies.

The platform has facilitated nearly 100 transactions, with individual company datasets selling for between $10,000 and $100,000. For AI companies, this is a bargain: they get access to authentic workplace communications, real-world code, and genuine customer interactions that make for high-quality training data. For the closing companies, it is a way to extract final value from their assets.

The people whose Slack messages, emails, and customer interactions fill those datasets are typically not informed that their communications have been sold, let alone asked for consent. An employee who spent years writing internal messages and emails at a startup may never know that their communications are now training an AI model.

Your Work Communications May Already Be AI Training Data

If you have ever worked at a company that shut down, your Slack messages, emails, and other internal communications may have been sold as AI training data through platforms like SimpleClosure's Asset Hub. There is currently no legal requirement in most states for companies to notify employees before selling this data, and typical employment agreements give the company ownership of communications sent through company systems.

Atlassian's Data Pipeline to AI

It is not just defunct companies selling data. Atlassian, which serves approximately 300,000 customers through products like Jira and Confluence, has announced that starting August 17, 2026, it will use customer data to train its AI assistant Rovo. Data from Jira tickets, Confluence pages, and other Atlassian products will flow to OpenAI as part of this training process.

Marc Rotenberg, president of the Center for AI and Digital Policy, raised alarms about this practice and sent a letter to the Senate Commerce Committee urging scrutiny. The concern is straightforward: businesses store proprietary strategies, internal discussions, product plans, customer information, and sensitive operational details in tools like Jira and Confluence. Routing that content to a third-party AI company for model training raises significant privacy, security, and intellectual property questions.

Atlassian customers who do not want their data used for AI training will need to proactively opt out before the August 17, 2026 start date. History suggests that most will not, either because they are unaware of the change or because opting out takes more effort than doing nothing.

What Kinds of Personal Data Are Being Sold?

The datasets flowing from brokers to AI companies are remarkably diverse:

  • Images and photos: Personal photos scraped from social media or purchased from photo platforms, used to train image generation and facial recognition models.
  • Phone recordings: Call center recordings and voice samples sold for speech recognition and voice synthesis training.
  • Social media posts: Content classified by sentiment, topic, and demographic data, sold as labeled datasets for natural language processing.
  • Consumer profiles: Detailed profiles combining demographic data, purchase history, and behavioral patterns, sold as structured data for recommendation and personalization models.
  • Professional communications: Emails, Slack messages, and internal documents from companies that have dissolved or sold their data assets.

The key difference between this and traditional web scraping is the commercial layer. These are not AI companies collecting freely available internet content — these are structured business transactions where data brokers are paid specifically to provide data for AI training purposes.

Why This Matters for Your Privacy

When your personal data is used to train an AI model, it cannot be deleted from that model the way a record can be deleted from a database. The data shapes the model's parameters, and there is currently no reliable way to fully reverse that influence short of retraining the model without it. This creates a one-way door: once your data has been used for training, the privacy harm is extremely hard to undo.

This is fundamentally different from traditional data broker concerns, where the remedy is deletion. If your personal information is on a people-search website, it can be removed. If your data has been baked into a large language model, the practical remedy is far more limited.

The implications include:

  • Loss of control: You have no way to know which AI models were trained on your data or how your information influences their outputs.
  • Amplified exposure: AI models trained on your data can generate outputs that reveal or approximate your personal information to anyone who queries the model.
  • No effective deletion: Even if a data broker removes your information from their database, copies of that data may already be embedded in AI models that were trained on earlier versions of the dataset.
  • Consent gaps: Most data broker terms of service predate the AI training use case. Your data may have been collected under one set of expectations and repurposed for an entirely different use.

Prevention Is the Only Real Protection

Because AI training is effectively irreversible, the most important time to act is before your data is sold to an AI company — not after. Removing your information from data broker sites now reduces the pool of data available for future AI training transactions. Every record that is removed before it enters an AI training pipeline is one that cannot be permanently embedded in a model.

What You Can Do

Remove Your Data From Broker Sites Now

The single most impactful step is to reduce the amount of personal data about you that is available for sale. Data brokers can only sell what they have. PrivacyOn removes your personal information from over 100 data broker sites and continuously monitors for re-listings. By keeping your data off these sites, you limit what can be packaged and resold to AI companies for model training.

Audit Your SaaS and Workplace Tools

Check the privacy policies and AI data-use disclosures of every SaaS platform you use — especially collaborative tools like Atlassian products, Slack, Notion, and Google Workspace. Look for opt-out mechanisms for AI training and exercise them before deadlines pass. If you are a business owner, review these settings on behalf of your entire organization.
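One way to keep an audit like this from stalling is to track it as a short checklist sorted by deadline. The sketch below is only an illustration: the `ToolAudit` record and both helper functions are hypothetical names, and the only date taken from this article is Atlassian's August 17, 2026 start date. Tools without an announced deadline are listed after the dated ones.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class ToolAudit:
    """One row of a personal or company-wide AI opt-out checklist (hypothetical)."""
    name: str
    opted_out: bool
    deadline: Optional[date] = None  # None: no announced deadline that we know of


def days_remaining(deadline: date, today: date) -> int:
    """Whole days from `today` until `deadline` (negative if it has passed)."""
    return (deadline - today).days


def pending_opt_outs(audits: List[ToolAudit]) -> List[ToolAudit]:
    """Tools not yet opted out, dated deadlines first (soonest first)."""
    pending = [a for a in audits if not a.opted_out]
    return sorted(pending, key=lambda a: (a.deadline is None, a.deadline or date.max))


if __name__ == "__main__":
    # Example entries; the opted_out values are made up for illustration.
    audits = [
        ToolAudit("Slack", opted_out=False),
        ToolAudit("Atlassian (Jira, Confluence)", opted_out=False,
                  deadline=date(2026, 8, 17)),  # date from this article
        ToolAudit("Notion", opted_out=True),
    ]
    today = date.today()
    for audit in pending_opt_outs(audits):
        if audit.deadline is None:
            print(f"{audit.name}: no known deadline, review settings")
        else:
            print(f"{audit.name}: {days_remaining(audit.deadline, today)} days left")
```

Sorting dated items first keeps the most time-sensitive opt-out at the top of the list, which matters most when one person is reviewing settings for a whole organization.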

Advocate for Stronger Protections

Contact your representatives and support legislation that requires explicit, informed consent before personal data can be sold for AI training. The Center for AI and Digital Policy and similar organizations are pushing for federal rules that would close the consent gaps that make this data pipeline possible.

Limit the Data You Generate

Use email aliases for online services, limit social media sharing, and be deliberate about what information you provide to any platform. The less data you generate, the less there is to be collected, brokered, and fed into AI systems.
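As a concrete illustration of the email alias advice, here is a minimal sketch that builds a per-service alias using plus-addressing (the `name+tag@domain` form). It assumes your mail provider delivers plus-addressed mail to your main inbox, which many do but not all; `make_alias` is a hypothetical helper, not part of any library. Plus-addressing does not hide your base address, since anyone can strip the tag, so it mainly helps you see which service passed your address along. A dedicated alias or forwarding service offers stronger separation.

```python
import re


def make_alias(address: str, service: str) -> str:
    """Return a plus-addressed alias such as jane+shop-save@example.com.

    Hypothetical helper: assumes the mail provider supports plus-addressing.
    """
    local, sep, domain = address.rpartition("@")
    if not sep or not local or not domain:
        raise ValueError(f"not a valid email address: {address!r}")

    # Reduce the service name to lowercase letters, digits, and hyphens.
    tag = re.sub(r"[^a-z0-9]+", "-", service.lower()).strip("-")
    if not tag:
        raise ValueError(f"service name has no usable characters: {service!r}")

    # Drop any tag already present so aliases are not stacked.
    base = local.split("+", 1)[0]
    return f"{base}+{tag}@{domain}"


if __name__ == "__main__":
    print(make_alias("jane@example.com", "Shop & Save"))
```

If mail later arrives at an alias from a sender you never gave it to, the tag tells you which signup the address came from.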

The data broker-to-AI pipeline is still in its early stages, but it is growing quickly. Acting now to remove your personal information from data broker sites is one of the most concrete steps you can take to prevent your data from being permanently embedded in AI models you will never see and cannot control.

Sarah Chen

Head of Privacy Research

CIPP/US Certified | IAPP Member | B.S. Computer Science

CIPP/US-certified privacy researcher with over a decade of experience helping consumers remove their personal information from data brokers.
