Someone Made a Dataset of One Million Bluesky Posts for ‘Machine Learning Research’

The data isn’t anonymous. In the dataset, each post is listed alongside the users’ decentralized identifier, or DID…It’s also noteworthy that it’s a “snapshot” of time on Bluesky, meaning it could, and probably does, include since-deleted posts.

I mean, in a way, it’s similar to the Internet Archive. But they never really archived social media accounts, did they?

This dataset could be used for “training and testing language models on social media content, analyzing social media posting patterns, studying conversation structures and reply networks, research on social media content moderation, [and] natural language processing tasks using social media data,” the project page says.

This is literally made for training AI, right? I guess it’s just a matter of whether or not the resulting chatbot is publicly released, but then the actual data is already available anyway.

“A number of artists and creators have made their home on Bluesky, and we hear their concerns with other platforms training on their data. We do not use any of your content to train generative AI, and have no intention of doing so,” - Bluesky, official account

I don’t know. I can believe the devs aren’t personally training AI on it, but it’s definitely a thing you can feed to an LLM, regardless of who physically does it.

Personally? I don’t have a problem with AI scraping the dumb shit I post online. But I can understand why an artist or seasoned writer wouldn’t want generative AI learning off of their trademark style and then bottling it up for a monthly subscription fee.

I don’t think this is a Bluesky issue; it just got caught in the middle because it’s in the spotlight right now. I’m not a programmer or an AI whisperer, so I could totally be wrong, but couldn’t anyone create a dataset of a public social network’s content using its APIs and then train whatever they like on it?

It’s shitty, but the cat’s already out of the bag. If you’re posting anything online, it can be, and probably is being, used to train AI.


Update: Looks like they took down the data already:

I've removed the Bluesky data from the repo. While I wanted to support tool development for the platform, I recognize this approach violated principles of transparency and consent in data collection. I apologize for this mistake.

— Daniel van Strien (@danielvanstrien.bsky.social) November 26, 2024 at 9:19 PM