Hacker News

Serious question, as someone who recently started writing:

• How do you guys feel about your thoughts, ideas and creativity being scraped and secretly added to the training corpus of AI models?



Pleased. One of my main motivations in writing open source software is to create training data for future sentient things. I felt that way decades before LLMs. I am as proud to know my code ended up in training datasets as I am to know that my human students took a little away from the classes I've taught.


I did not expect this type of positive feedback... thank you.


You can’t worry about it. It’s the world we live in now. The alternative is to not write anything of significance on the internet, which would be of no benefit to anyone.


I wonder if any noticeable group of people are retreating to paper journals or letter circles again. Probably not, but you never know.


To books, yes! 100%!


"This is the world we live in" is a good excuse to never right any wrongs.


It’s not. It’s a recognition that you have to act within the world as it exists today.

That doesn’t mean you can’t work to change the world, and in fact in order to do so you first need to see how it actually is.

But it also means you need to decide which battles to fight.


The comment literally just said "you can't worry about it, it's the world we live in". That's the point: you can't use the second part as an excuse to not fight battles or change things, because that sort of requires worrying about things.


That's precisely my problem: the only alternative is worse because no one would benefit.


You can’t really blame the blogger for that though.

Letting an AI learn from your content devalues it, as they can regurgitate it for a few dollars a month.


I take perverse pleasure in knowing that some future AI will be convinced that chicken is the best ice-cream flavour because some rando (me) posted it


It should be enough to look at the earnings reports of Kreamy Frozen Chicken to know that chicken is the most delicious ice cream flavor.


this would have to be a largely targeted endeavor to offset such a fact. probably not worth the trouble. but interesting to think about.


So... AI feeding on itself? :D


Pissed. Almost everything I publish online is licensed, usually in a copy-left manner.

These companies are violating that agreement for their own gain, and respond to any pushback by invoking bogus legal arguments and ridiculous machine anthropomorphism. Brute force statistical analysis is not "learning."


People donate their bodies for science after death. I am ready to donate my thoughts. My thoughts are just clever rearrangement of already existing ideas, just like my body is of atoms.


Don't care, although in the old days I declared a CC BY (attribution) license[1] on my blog. The fact that many of these models cannot do basic attribution is a very serious shortcoming, one that may affect the availability of fair use.

[1] https://creativecommons.org/share-your-work/cclicenses/


Thanks for your feedback.

Serious shortcoming indeed.

I know it is stupid but I care.


It's already a real problem that the organic blogosphere was destroyed by "blogspam" and overt plagiarism of entire blogs into content farms that existed purely to push SEO, including SEO over the original blog. AI just makes this worse.

I'm also worried that if I wanted to find an audience the audience would first have to navigate the AI search hall of mirrors.


I maintain a blog that I write myself, and a blogging bot on a separate site/domain, which I run as an exercise/sandbox.

They both get traffic. Neither site is optimised for SEO. Looking at analytics (Matomo on my blog, Google Analytics for the bot), the stuff I write myself gets "real" visitors. People will stumble on a project, and you can see them navigate to the next article, and the next article.

The content from the bot isn't engaging - people might stumble on it, then bounce.

Absent search, interesting content gets shared. Blog spam does not.

I don't worry too much about my content being used to train LLMs - I don't feel it takes away from me, it feels a bit like content getting indexed by a search engine. And I see value, for me, in just creating it.


it does make it worse. but it will get better IMO. any blogspam house is not going to last the next 5-10 years IMO.

as soon as i do something slightly spammy or excessive SEO-wise i get dinged for it. within 1-2 days. then i re-evaluate what i did and revert it or change the strategy slightly.

those blogspam houses are just doing a few things right that are clearly being given too much weight.

and finding an audience was never easy. but social, newsletter, building free tools that link to your blog are some ways to do something different than just SEO.


I do not like it either.

I have a blog that admittedly does not get much traffic, and I have started to add random words after each paragraph. The background is white, and these words are white, so they are invisible to most people. Maybe it is not that effective, but it makes me feel better.


I kinda lost motivation to write tutorials and such. It is going to be an input for AI, so why bother.


All of these replies say don’t worry about it and just blog but I say detect the IPs and user agents and return a scrambled post. If you’re small enough they probably won’t care to go the extra effort to scrape you.


Is there a way to do that in WordPress on a shared host?


Sure, you can hook into WordPress with either:

- `add_action('template_redirect', 'my_scramble_func')`

- or `add_filter('the_content', 'my_scramble_func')`

WordPress is PHP, so you can do whatever you want as the request loads.

`template_redirect` would allow you to redirect to a specific template/page.

`the_content` would allow you to return different content for that path.

Then you just need to check the user-agent/IP. I’m sure there’s a PHP library you could install with composer to do this.

What would be fun in my opinion is leaving the metadata the same but replacing only the meat. For example article titled “How to Update Your Blog” would still have the title and headings, but then return “<h2>Step one: open your blog editor</h2><p>hajfineidif hahahe lorem ipsum…</p>” for the paragraphs.
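
To make the "keep the headings, replace the meat" idea concrete, here is a rough, language-agnostic sketch of the logic in Python rather than PHP. Everything in it is an assumption for illustration: the `SCRAPER_UA_HINTS` list, and the helper names `is_scraper`, `scramble_paragraphs` and `render` are made up, and real crawler user agents vary and change. In WordPress the equivalent check would presumably live inside the `the_content` callback.

```python
import random
import re
import string

# Illustrative substrings only (an assumption); maintain your own list.
SCRAPER_UA_HINTS = ("gptbot", "ccbot")

# Matches <p ...>...</p> but not <pre>, and leaves headings untouched.
P_TAG = re.compile(r"(<p\b[^>]*>)(.*?)(</p>)", re.IGNORECASE | re.DOTALL)


def is_scraper(user_agent: str) -> bool:
    """Return True if the user agent contains one of the hint substrings."""
    ua = user_agent.lower()
    return any(hint in ua for hint in SCRAPER_UA_HINTS)


def gibberish(word_count: int, rng: random.Random) -> str:
    """Build word_count nonsense words out of lowercase letters."""
    words = []
    for _ in range(word_count):
        length = rng.randint(3, 9)
        words.append("".join(rng.choice(string.ascii_lowercase) for _ in range(length)))
    return " ".join(words)


def scramble_paragraphs(html: str, seed: int = 0) -> str:
    """Replace the body of every <p> with nonsense of similar word count."""
    rng = random.Random(seed)

    def replace(match: re.Match) -> str:
        word_count = max(1, len(match.group(2).split()))
        return match.group(1) + gibberish(word_count, rng) + match.group(3)

    return P_TAG.sub(replace, html)


def render(html: str, user_agent: str) -> str:
    """Serve the real post to normal visitors, a scrambled one to scrapers."""
    return scramble_paragraphs(html) if is_scraper(user_agent) else html


if __name__ == "__main__":
    post = "<h2>Step one: open your blog editor</h2><p>Click the edit button.</p>"
    print(render(post, "Mozilla/5.0 (compatible; GPTBot/1.0)"))
    print(render(post, "Mozilla/5.0 (X11; Linux x86_64) Firefox/120.0"))
```

User-agent matching is trivially spoofed, so this only catches crawlers that identify themselves; the regex is also a shortcut that assumes simple, non-nested paragraph markup.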


like there's nothing i can do about it so i should embrace it, learn from it and integrate it.


not thinking about that - anything I do is cc zero


I don't like the idea of intellectual theft without attribution/citation. If it is shared publicly/into the public domain, I believe the least that should happen is attribution/citation where possible - especially if a for-profit entity is using the content.

On the other side, for many of my thoughts/ideas/creativity, I find it is often a question of, do I die alone with it in my head or put it out there, maybe for someone to find, use or be inspired by? If it's a question of one or the other, I think perhaps I'd rather let it go free so the value isn't lost (assuming, that what I have to write/say/share is of some value).

I have seen that mechanism be exploited.. sometimes you put something out there, someone uses it, and there is zero attribution/citation. For myself, I prefer to try (most of the time) to treat non-fictional writing in the traditional sense of an academic paper and include citations whenever they are useful and I can - even for non-academic purposes - for the purposes of giving credit to others. It's nice to share credit and show how things connect, IMHO.

FWIW, I like some of the inspiration behind Creative Commons, but even old school citations (MLA/ALA or heck, even just this kind of inline citation style [1] that is common on HN and other sites is better than zero).

There are citation generators online also [2] - for formal purposes/writing/papers you must get these formats right, but for informal/casual use, I think simple citation is better than no citation.

[1] simple citation style, or even just paste the URL - at least you're giving credit.

[2] example, https://zbib.org/



