If you run a WordPress site or have a blog on Tumblr, you’ve probably produced and published a significant amount of content there. Even though we all know that the Internet is not “private”, you probably posted these texts and images thinking that they belonged to you and that they would not be stolen by the companies you relied on to host them.
WordPress and Tumblr happen to be gearing up to do just that. As first reported by 404 Media, the parent company of both sites, Automattic, has struck a deal to sell user data from Tumblr and WordPress to artificial intelligence companies like Midjourney and OpenAI. Those companies intend to use the data to train their systems.
As if that wasn’t enough, the preparations for the sale appear to have gone badly: the data set swept in large categories of Tumblr posts that weren’t intended to be sold. These include:
- Private posts from public accounts
- Posts from deleted or suspended accounts
- Unanswered asks
- Private replies
- Explicit posts
- Posts from partner accounts, such as advertising campaigns, to which Tumblr does not have the rights. (Apple is specifically named here.)
It is possible that this data was not actually sent to OpenAI and Midjourney, but was merely identified and prepared for that purpose; 404 Media could not confirm this. They were, however, able to confirm that password-protected posts, direct messages, and media identified as CSAM were not in this group. So… that’s good.
This may not affect all WordPress sites
Automattic specifies that the data scraping affects only WordPress.com websites, not content created with the WordPress CMS, which can be used on a website hosted elsewhere. In theory, WordPress CMS sites that are not hosted by Automattic should be protected from this activity.
That said, 404 Media could not confirm whether using Automattic plug-ins such as Jetpack would subject a self-hosted site to Automattic’s questionable data sharing policies.
You don’t have to accept Automattic selling your data
A source tells 404 Media that Automattic will add a new setting to its services on Wednesday that lets users opt out of having their data sold and shared with third-party companies. The outlet received a copy of the new FAQ section, which explains that the opt-out will block bots from accessing your sites if you enable it “from the beginning.” If you decide to opt out later, Automattic will contact its partners and “ask” them to remove your content from their data and training collections.
That wording is not particularly encouraging. Still, whenever Automattic does release this opt-out option, I suggest using it on your Tumblr and WordPress sites anyway.
Following 404 Media’s article, Automattic published a statement claiming that it blocks major AI platform crawlers and updates its lists with new ones; that it has features that block search engines from indexing your sites, which may also discourage AI indexing; and that it shares only public content hosted on WordPress.com and Tumblr from sites that have not opted out. That said, the company acknowledges that there are no regulations requiring bots to respect these preferences, and says that it works with specific AI companies “as long as their plans align with what our community cares about: attribution, opt-out, and control.”
What will AI companies do with this data?
Companies like Midjourney and OpenAI require huge datasets to train their artificial intelligence systems. Programs like Midjourney and ChatGPT wouldn’t be possible without feeding them massive amounts of information: that’s how they “learn” to do what they do.
This means your WordPress blog posts filled with your favorite recipes can be fed to generative AI models to train them to “talk” about food (or anything at all), and your Tumblr photo dumps can teach models to recognize objects like a car or a bird. Data from all your sites and the sites of millions of other users is invaluable to AI companies, which makes it extremely valuable to the companies that own these sites and can sell it. Automattic will probably make a lot of money on this deal, just as Reddit will probably make a lot of money under its own AI content licensing agreement with Google.
Publishing and sharing online is great fun, but maybe it’s time to take back what’s yours: if you don’t own the platform where you share your original ideas, consider moving them to one you do before your ideas become training wheels for artificial intelligence.