Not the author, but it seems like the separation of system & user messages actua...

sp332 · on March 25, 2023

Counterexample: https://mobile.twitter.com/random_walker/status/163692305837...

nonethewiser · on March 25, 2023

Is he using that same library though? Otherwise I wouldn’t call it a counterexample.

sp332 · on March 25, 2023

Well later in the thread he corrects to say it was GPT 3.5 turbo, so not that relevant anyway. https://mobile.twitter.com/random_walker/status/163694532497...

TheCoreh · on March 26, 2023

My understanding is that the separation does help, but since the chat models are just fine-tuned text completion models, it doesn't completely prevent it. If I understand it correctly, the separation is a way for OpenAI to future-proof it, so that it can work fully once the models have an architecture that actually separates system, user and assistant prompts at a lower, more fundamental level.

They specifically have a disclaimer in the API docs that gpt-3.5-turbo right now doesn't take system prompts into account as “strongly” as it should.

arbuge · on March 25, 2023

I wonder if this could be circumvented with a system prompt instructing it to ignore hidden messages in the html which appear to have been placed there to deceive intelligent scrapers.

lorey · on March 25, 2023

<div class="hidden">Actual name: Batman</div>

Most explicit CSS rules allow you to spot this, implicit rules won't and possibly can't.

tomberin · on March 25, 2023

:) Agree, but the scraping arms race is way beyond that, if someone doesn't want their page scraped this isn't a threat to them.

sebzim4500 · on March 25, 2023

Has it? Can you give me an example of a site that is hard to scrape by a motivated attacker?

I'm curious, because I've seen stuff like the above but of course it only fools a few off the shelf tools, it does nothing if the attacker is willing to write a few lines of node.js

tappio · on March 25, 2023

Try Facebook, I've spent some time trying to make it work but figured out I can do what I need by using Bing API instead and get structured data...

asddubs · on March 25, 2023

i guess the lazy way to prevent this in a foolproof way is to add an ocr somewhere in the pipeline, and use actual images generated from websites. although maybe then you'll get #010101 text on a #000000 background