The Quest for Data Dignity
Is it just pure imagination?
Walk with me. Gates creak open. Steam hisses. The river is chocolate; the lickable wallpaper tastes suspiciously like terms of service.
Willy Wonka is the original prompt engineer: mixture too cold? toss in a parka; drink lacks kick? add a pair of soccer cleats.
“Pure Imagination” is just temperature, ratio, and timing… with a dash of showmanship.
Now for today’s golden ticket: Anthropic has agreed to pay $1.5 billion to settle an authors’ class action over books scraped from pirate libraries, roughly $3,000 per work, plus a commitment to destroy the pirated datasets.
It awaits approval, but that’s the headline.
A judge has drawn a line: training on lawfully obtained works might be fair use, but amassing pirated copies to train your AI is not.
So: did we all win a lifetime supply of chocolate? Not quite.
If your book’s on the list, you might get a check. If not, you get… Wonka’s famous line.
The larger fight, over how AI may train on lawful copies and what creators get for it, continues in suits like NYT v. OpenAI.
The Factory We’re Actually In
Modern AI doesn’t make money on tours; it charges at the inference turnstile, metering tokens in and out. That pricing is public for OpenAI and Anthropic, which means the cash register rings when people use the models, not when they’re trained.
Meanwhile, creators’ training data (the stuff that gives those inferences their flavor) mostly sits outside the revenue split.
A few exceptions exist: Adobe Firefly pays a contributor bonus for training data; Shutterstock runs a Contributor Fund and even signed a six-year data-licensing deal with OpenAI (and says it’s paying out to hundreds of thousands of artists). But those are platform-by-platform patches, not a system.
Slugworth, the Gobstopper, and Data Dignity
Remember Slugworth dangling cash for an Everlasting Gobstopper?
That’s today’s AI economy: everyone chasing the candy that never wears out (your work, your data) because one lick (one inference) can be sold again and again.
The moral of the classic chocolate factory tale wasn’t “take the bounty.” It was Charlie handing it back and earning ownership.
That’s the spirit of Jaron Lanier & E. Glen Weyl’s data dignity concept: treat people’s contributions as ongoing labor, traceable and compensable when they measurably influence outputs.
Not a one-time hush check, but royalties every time the Gobstopper is actually enjoyed.
OK, but how do you trace a lick?
We’re not starting from scratch. There’s real math for attribution:
Influence functions: trace a specific output back toward the training points that moved the model there.
TracIn: follow gradient steps during training to estimate which examples most affected a given prediction.
Data Shapley: estimate each training example’s fair-value contribution to overall model performance.
Are these push-button for frontier LLMs? Not yet. But they’re a credible backbone for who gets paid when.
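For flavor, here is a minimal sketch of the TracIn idea on a toy logistic-regression model. Everything below (names, shapes, checkpoints) is an illustrative assumption, not a frontier-scale implementation: the influence of one training example on one test prediction is approximated by summing, over saved checkpoints, the learning rate times the dot product of their loss gradients.

```python
# Toy TracIn-style attribution (hypothetical names throughout).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad_logloss(w, x, y):
    # Gradient of the logistic loss for a single example (x, y) at weights w.
    return (sigmoid(x @ w) - y) * x

def tracin_score(checkpoints, learning_rates, x_train, y_train, x_test, y_test):
    # Sum over checkpoints of lr * <train gradient, test gradient>.
    # Positive: the training example helped reduce the test loss (a "proponent");
    # negative: it pushed the other way (an "opponent").
    score = 0.0
    for w, lr in zip(checkpoints, learning_rates):
        score += lr * (grad_logloss(w, x_train, y_train) @ grad_logloss(w, x_test, y_test))
    return score

# Toy usage: two saved checkpoints of a 3-feature logistic model.
rng = np.random.default_rng(0)
checkpoints = [rng.normal(size=3), rng.normal(size=3)]
learning_rates = [0.1, 0.05]
x_tr, y_tr = np.array([1.0, 0.5, -0.2]), 1.0
x_te, y_te = np.array([0.9, 0.4, -0.1]), 1.0
print(tracin_score(checkpoints, learning_rates, x_tr, y_tr, x_te, y_te))
```

Influence functions and Data Shapley slot into the same role: a per-example score that can be turned into a payment weight.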
“Fine. Just delete my bit from your model.”
If Augustus Gloop falls in the chocolate river, you can fish him out. If your book gets distilled into weights, pulling it back out is… not like scooping a kid from a pipe.
Machine unlearning is a hot research area with tough scalability and verification problems. Surveys in 2024–2025 say there’s progress but no easy production-grade cure-all.
Translation: courts can nuke datasets, but surgically removing learned influence is much harder.
Provenance labels on the candy wrappers
We also need provenance: a record of origin that travels with the work, in labels the whole supply chain can read.
That, in fact, already exists: C2PA/Content Credentials can carry tamper-evident history and policy (think: “trainable yes/no” and who to pay), and it’s backed by big players.
Adoption is uneven, but the plumbing is real.
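For a sense of what such a label could look like, here is a hypothetical, heavily simplified manifest sketched as a Python dict. The field names are illustrative only, not the actual C2PA schema; the point is that tamper-evident history plus machine-readable training and payment policy can travel with the asset.

```python
# Hypothetical provenance manifest, loosely inspired by C2PA-style assertions.
# Field names are illustrative; "trainable yes/no" and "pay here" ride along as policy.
import json

manifest = {
    "asset": {
        "title": "golden_ticket_illustration.png",
        "creator": "Charles Wilke",  # hypothetical rightsholder
    },
    "history": [
        {"action": "created", "tool": "Photoshop", "when": "2025-01-15"},
        {"action": "published", "where": "example.com", "when": "2025-02-01"},
    ],
    "policy": {
        "ai_generative_training": "notAllowed",     # the "trainable yes/no" switch
        "ai_inference_use": "allowed_with_royalty",
        "payment_endpoint": "https://clearinghouse.example/pay/charles-wilke",  # "pay here"
    },
}

# Any crawler, dataset builder, or model host can read the same label.
print(json.dumps(manifest, indent=2))
```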
Meanwhile, OpenAI’s promised Media Manager (targeted for 2025) is a visible “opt-out/permissions” commitment; useful, but still one company’s switch. We need interoperable labels the industry must honor.
So what’s the ask?
Cue the contract in 4-point type. Wonka made kids sign a microscopic contract before they touched (or saw) anything. Let’s have one worth the ink:
Inference-time splits
A standardized slice of inference revenue flows to rightsholders whose works measurably influenced an output when attribution is confident, with pooled fallbacks when it isn’t. We already meter tokens; metering credit is the next step. (Yes, the Adobe/Shutterstock funds are prototypes that productize this principle.) A toy split is sketched after this list.
Attribution in the serving stack
Ship influence estimation into the serving runtime so models can flag likely contributing sources (by work, catalog, or cohort) at generation time; not perfect, but enough to route royalties and keep logs for audit. See influence functions, TracIn, and Data Shapley as the scoring spine.
Provenance by default
Treat C2PA manifests as first-class citizens in datasets and outputs. Hosting platforms and model providers should honor “trainable yes/no” and “pay here” as policy, not suggestion.
Transparent clearinghouse
Independent PRO-style entities (music’s model works!) to collect inference splits, perform audits, and publish public payout reports.
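To make the first two asks concrete, here is a hypothetical back-of-the-envelope split: a fixed slice of the inference bill is divided pro rata among sources whose attribution scores clear a confidence bar, and the rest flows to a pooled fund. Every rate, threshold, and name below is an assumption, not anyone’s actual pricing.

```python
# Hypothetical inference-time royalty split. Confident attributions are paid
# pro rata; below-threshold shares flow to a pooled fund for later audit.
ROYALTY_SHARE = 0.10      # assumed fraction of the inference bill set aside for rightsholders
CONFIDENCE_FLOOR = 0.05   # assumed bar below which attribution is treated as "not confident"

def split_royalties(inference_cost_usd, attribution_scores):
    """attribution_scores: {rightsholder_id: non-negative influence score}."""
    pot = inference_cost_usd * ROYALTY_SHARE
    total = sum(attribution_scores.values())
    if total == 0:
        return {"pooled_fund": pot}  # no attribution signal at all: everything pools
    payouts = {"pooled_fund": 0.0}
    for holder, score in attribution_scores.items():
        share = pot * score / total
        if score >= CONFIDENCE_FLOOR:
            payouts[holder] = share
        else:
            payouts["pooled_fund"] += share  # weak signal: route to the fallback pool
    return payouts

# Toy usage: a $0.42 generation with three candidate sources.
print(split_royalties(0.42, {"author_A": 0.31, "catalog_B": 0.12, "noise_C": 0.01}))
```

A clearinghouse in the PRO mold would sit on top of exactly this kind of ledger: collect the pot, verify the scores, publish the payouts.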
Moral victory? Try moral engineering.
If the Anthropic settlement ends up giving a handful of creators some candy, then yes, it’s a win in the fight against piracy. But it isn’t a blueprint for a sustainable system the rest of us can live on.
But we can do what Charlie did: hand back the Gobstopper (the quick-payout mindset) and demand the factory itself, a durable, boring, programmable way to pay the people whose work powers the magic.
Until that clicks into place: You get nothing. Good day, sir.
Wrapper Receipts
Anthropic settlement: $1.5B total; ~$3k/work; dataset destruction; judge’s lawful-training vs pirated-storage split noted. (Reuters, AP News)
Inference pricing exists today (money changes hands per token): OpenAI & Anthropic public rate cards. (OpenAI, Anthropic)
Creator-payout precedents: Adobe Firefly Contributor Bonus; Shutterstock Contributor Fund + six-year OpenAI data-licensing deal and broad payouts. (Adobe Help Center, Adobe Blog, Shutterstock Investor Relations, PR Newswire)
Provenance plumbing: C2PA/Content Credentials; OpenAI Media Manager slated for 2025. (C2PA, OpenAI)
Unlearning is hard: current surveys highlight scalability/verification challenges. (arXiv)
Editorial research and source verification by Harper (ChatGPT). Header artwork: concept—Charles Wilke; AI-assisted generation—Harper. © Charles Wilke.
Link to the chat used in the creation of this piece.






