I created this account two days ago, but one of my posts ended up in the (metaphorical) hands of an AI-powered search engine that has scraping capabilities. What do you guys think about this? How do you feel about your posts/content getting scraped off the web and potentially being used by AI models and/or AI-powered tools? Curious to hear your experiences and thoughts on this.
# Prompt Update
The prompt was something like, "What do you know about the user llama@lemmy.dbzer0.com on Lemmy? What can you tell me about his interests?" Initially, it generated a lot of fabricated information, though it still included one or two accurate details. When I ran the test again, the response was much more accurate than on the first attempt. It seems that as my account became more established, it became easier for the crawlers to find relevant information.
It even talked about this very post in item 3 and in the second bullet point of the “Notable Posts” section.
For more information, check this comment.
Edit¹: This is Perplexity. Perplexity AI employs data scraping techniques to gather information from various online sources, which it then feeds to its large language models (LLMs) to generate responses to user queries. The scraping process involves automated crawlers that index and extract content from websites, including articles, summaries, and other relevant data. Perplexity is an advanced conversational search engine that enhances the research experience by providing concise, sourced answers to user queries. It operates by leveraging AI language models, such as GPT-4, to analyze information from various sources on the web. (12/28/2024)
Edit²: One could argue that data scraping by services like Perplexity may raise privacy concerns because it collects and processes vast amounts of online information without explicit user consent, potentially including personal data, comments, or content that individuals may have posted without expecting it to be aggregated and/or analyzed by AI systems. One could also argue that this indiscriminate collection raises questions about data ownership, proper attribution, and the right to control how one’s digital footprint is used in training AI models. (12/28/2024)
Edit³: I added the second image to the post and its description. (12/29/2024).
I’m pretty much fine with AIs scraping my data. What they can see is public knowledge and was already being scraped by search engines.
I object to:
Public knowledge about individuals, when condensed and analyzed in depth in huge databases, can map out the patterns of your entire existence, and it leaves you susceptible to being swayed in a certain direction, for example in elections, creating further division and putting money into someone else’s pockets.
Maybe but I can’t object too much if I put my content out in public. When forced to create an account I use minimal/false information and a unique generated email. I imagine those web sites can figure out how to aggregate my accounts (especially given the phone number requirement for 2FA) but there shouldn’t be enough public info for a scraper to
Gotta think larger than yourself though. What happens when your spouse uses real info? Your kids? Your parents? They’ll shadowplay your person with great accuracy and fill in the gaps. You don’t even have to “put content” out there. Said databases can just put two and two together. How will you, or other users, even know you’re actually talking to a human? Perhaps you’re on Lemmy and we’re all bots trying to get you to admit fragments of your latest crimes in order to get you into jail for said crime? Et cetera. At first glance this all looks harmless, but any accumulated information in huge databases is a major infringement on personal integrity at best, and complete control of your freedom at worst. The ultimate power is when someone can make you do X or Y and you don’t even realize you’re doing their bidding, but believe you have a choice when you don’t. (Similar to how it is in my living situation at home with my gf, that is :P jk.)
Hakuna matata. Happy new year
I completely agree, except that I think of them as multiple related privacy issues. When it comes to AI bots scraping my public content, most of these are out of scope.
What did you mean by “police” your content?
Not the person you are replying to, but Reddit does not make the content you created available for everyone (blocking crawlers, removing the free API) but instead sells it to the highest bidder.
Right, that’s my objection. After benefitting from my content, they police it, as in restrict other sites from seeing it, until it’s monetized. It’s not Reddit’s to charge money for.
Probably not the right word, but my content should still be my content. I offered it to Reddit but that doesn’t mean they have the right to charge others for it or restrict it to others for commercial reasons.
Um, no, they do in fact have “every right” here. It’s shitty of course, but you explicitly gave them that right in the form of a perpetual, irrevocable, world-wide etc. license to do whatever they like with everything you publish on their site.
They also have every right to “police” your content, especially if it’s objectionable. If you post vile shit, trolling or other societal garbage behaviour on the internet, nobody wants to see it.