Tumblr and WordPress.com are preparing to sell user data to Midjourney and OpenAI, according to a source with internal knowledge of the deals and internal documentation referring to them.

The internal documentation details a messy and controversial process within Tumblr itself. One internal post made by Cyle Gage, a product manager at Tumblr, states that a query made to prepare data for OpenAI and Midjourney compiled a huge number of user posts that it wasn’t supposed to. It is not clear from Gage’s post whether this data has already been sent to OpenAI and Midjourney, or whether Gage was detailing a process for scrubbing the data before it was to be sent.

I generally appreciate what Automattic does for the web as a whole. However, if these claims are true, it's unfortunate. I believe there's a way to opt out, but I'd love to learn more before jumping to conclusions.

That said, WordPress (.com, not .org) and Tumblr are platforms just like Reddit, Twitter, and Meta's set of offerings. I'm sure that somewhere in their Terms of Service there are clauses around their ownership of the data you publish on their platforms, and just as it's sold to data brokers and advertisers, it can also be sold to companies training AI models.

To counter these types of moves from platforms, I wish it were as easy as saying "build your own platform". Doing so can be as "simple" as setting up a website on your own domain. Unfortunately, today that's still not easy, and one of the products that helps you do it is WordPress. It's important to note the distinction there between WordPress the company and WordPress the technology. Another piece that complicates building your own site is that there are still other ways for companies training AI models to use data that's publicly available on the internet. These are the arguments currently being litigated in several legal cases. Maybe there are opportunities to explore a robots.txt for AI.
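Something like this already exists in a limited form: some AI companies publish crawler user-agent tokens that robots.txt can target. As a sketch, a site owner could add rules like the following (GPTBot is OpenAI's documented crawler token and Google-Extended is Google's AI training token; compliance is voluntary on the crawler's part, and other companies may not publish tokens at all):

```
# Ask OpenAI's web crawler not to index the site for training
User-agent: GPTBot
Disallow: /

# Ask Google not to use the site's content for its AI models
User-agent: Google-Extended
Disallow: /
```

This only helps against crawlers that choose to honor it, which is part of why a more formal "robots.txt for AI" keeps coming up.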

AI models need high-quality data that's representative and as close as possible to the real world in order to improve. There is a role here for synthetic data; high-quality synthetic data is behind groundbreaking models like Microsoft's Phi. My instinct is that synthetic data can only go so far, though, and real data is still needed. In that case, as an AI consumer who makes use of these models but doesn't want to contribute my data, do I have a responsibility to contribute it to improve the systems I use? Piracy aside, in some ways it reminds me of torrenting: you usually run into scenarios where many people are downloading a file, but only a handful of seeders who, having obtained the file, make it available for others to download. There are also additional considerations, such as how people are compensated for contributing their data to these systems. It's important to note that this is not a new problem, and people have been thinking about it, though in different contexts. Maybe it's time to reconsider ideas like inverse privacy and data dignity.

There are no clear answers here and there are a lot of things to consider. However, it's comforting that as a society we're having these conversations.
