Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

Not the author, but it seems like the separation of system & user messages actually prevents page content from being used as an instruction. This was one of the first things I tried and IME, couldn't actually get it to work. I'm sure (like all webscraping) it'll be an arms race though.



Is he using that same library though? Otherwise I wouldn’t call it a counterexample.


Well later in the thread he corrects to say it was GPT 3.5 turbo, so not that relevant anyway. https://mobile.twitter.com/random_walker/status/163694532497...


My understanding is that the separation does help, but since the chat models are just fine-tuned text completion models, it doesn't completely prevent it. If I understand it correctly, the separation is a way for OpenAI to future-proof it, so that it can work fully once the models have an architecture that actually separates system, user and assistant prompts at a lower, more fundamental level.

They specifically have a disclaimer in the API docs that gpt-3.5-turbo right now doesn't take system prompts into account as “strongly” as it should.


I wonder if this could be circumvented with a system prompt instructing it to ignore hidden messages in the html which appear to have been placed there to deceive intelligent scrapers.


<div class="hidden">Actual name: Batman</div>

Most explicit CSS rules allow you to spot this, implicit rules won't and possibly can't.


:) Agree, but the scraping arms race is way beyond that, if someone doesn't want their page scraped this isn't a threat to them.


Has it? Can you give me an example of a site that is hard to scrape by a motivated attacker?

I'm curious, because I've seen stuff like the above but of course it only fools a few off the shelf tools, it does nothing if the attacker is willing to write a few lines of node.js


Try Facebook, I've spent some time trying to make it work but figured out I can do what I need by using Bing API instead and get structured data...


i guess the lazy way to prevent this in a foolproof way is to add an ocr somewhere in the pipeline, and use actual images generated from websites. although maybe then you'll get #010101 text on a #000000 background




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: