
I was just watching a science-related video containing math equations. I wondered how soon I will be able to ask the video player "What am I looking at here? Describe the equations" and have it OCR the frames, analyze them, and explain them to me.

It's only a matter of time before "browsing" means navigating HTTP sites via LLM prompts. That said, I think it is critical that LLM input NOT be restricted to verbal cues. Not everyone is an extrovert who longs to hear the sound of their own voice. A lot of human communication is non-verbal.

Once we get over the privacy implications (and I do believe this can only be done by worldwide legislative efforts), I can imagine looking at a "website" or video, and my expressions, mannerisms and gestures will be considered prompts.

At least that is what I imagine the tech would evolve into in 5+ years.



Now? OK, you need to screencap and upload it to an LLM, but that's well-established tech by now. (Where by "well established", I mean at least 9 months old ;)
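For the "screencap and upload" flow, something like the sketch below already works with off-the-shelf pieces: grab the screen, OCR it, and hand the text to whatever chat model you like. This is just an illustration of the pipeline; ask_llm() is a hypothetical placeholder for that last step, not a real API.

    # Hedged sketch: Pillow + pytesseract for the OCR step; ask_llm() is a
    # hypothetical stand-in for whichever LLM API you actually call.
    from PIL import ImageGrab      # pip install pillow
    import pytesseract             # pip install pytesseract (plus the tesseract binary)

    def explain_current_frame() -> str:
        frame = ImageGrab.grab()                    # capture the screen, i.e. the paused video frame
        text = pytesseract.image_to_string(frame)   # OCR the visible equations/text
        prompt = f"Explain what these equations mean:\n\n{text}"
        return ask_llm(prompt)                      # hypothetical: send the prompt to your LLM of choice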

Same goes for "navigating HTTP sites via LLM prompts". Most LLMs have web search integration, and the "Deep Research" variants do more complex navigation.

Video chat is there partially, as well. It doesn't really pay much attention to gestures & expressions, but I'd put the "earliest possible" threshold for that a good chunk closer than 5 years.


Yeah, all these things are possible today, but getting them well polished and integrated is another story. Imagine all this being supported by "HTML6" lol. When Apple gets around to making this part of Safari, then we'll know it's ready.


That's a great upper-bound estimator ;)

But kidding aside - I'm not sure people want this supported by web standards. We could be a huge step closer to that future had we decided to actually take RDF/Dublin Core/Microdata seriously. (LLMs perform a lot better with well-annotated data.)
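To make the "well-annotated data" point concrete, here's a rough sketch (my own illustration, with made-up field values): the kind of schema.org microdata publishers could have shipped, parsed with the extruct library so it comes out as structured items rather than prose to be OCR'd or scraped.

    # Illustrative only: schema.org microdata for a video, parsed with extruct
    # (pip install extruct). The markup and field values are invented for the example.
    import extruct

    html = """
    <div itemscope itemtype="https://schema.org/VideoObject">
      <h1 itemprop="name">Deriving the wave equation</h1>
      <span itemprop="description">Step-by-step derivation with worked examples.</span>
    </div>
    """

    data = extruct.extract(html, syntaxes=["microdata"])
    print(data["microdata"])  # structured items a crawler or LLM can consume directly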

The unanimous verdict across web publishers was "looks like a lot of work, let's not". That is, ultimately, why we need to jump through all the OCR hoops. Not only did the world not annotate the data, it then proceeded to remove as many traces of machine readability as possible.

So the gating factor probably isn't Apple & Safari & "HTML6" (shudder!)

If I venture my best bet on what's preventing polished integration: it's really hard to do with foundation models alone, and the number of people who want deep, well-informed conversations badly enough to pay for a polished app that delivers them is low enough that it's not the hot VC space. (Yet?)

Crystal ball: some OSS project will probably get within spitting distance of something really useful, but also probably flub the UX. Somebody else will pick up the ideas while they're hot and polish them in a startup. So, 18-36 months for an integrated experience from here?


Good lord, I dearly hope not. That sounds like a coddled hellscape world, something you'd see made fun of in Disney's Wall-E.


Hence my comment about privacy and the need for legislation :)

It isn't the tech that's the problem but the people who will abuse it.


While those are concerns, my point was that having everything on the internet navigated to, digested and explained to me sounds unpleasant and overall a drain on my ability to think and reason for myself.

It is specifically the way you describe using the tech that provokes a feeling of revulsion in me.


Then I think you misunderstand. The ML system would know when you want things digested for you and when you don't. Right now companies are assuming this and forcing LLM interaction. But when properly done, the system would know, based on your behavior or explicit prompts, what you want, and provide the service. If you're staring at a paragraph intently and confused, it might start highlighting common phrases or parts of the text/picture that might be hard to grasp, and based on your reaction to that, it might start describing things via audio, tooltips, a side pane, etc.

In other words, if you don't like how and when you're interacting with the LLM ecosystem, then that is an immature and failing ecosystem. In my vision this would be a largely solved problem, like how we interact with keyboards, mice, and touchscreens today.


No, I fully understand.

I am saying that this type of system, that deprives the user of problem solving, is itself a problem. A detriment to the very essence of human intelligence.


I just look at it as allowing the user to focus on problems that aren't already easily solved. Like using a calculator instead of calculating manually on paper.


But the scenario you described is one in which you need an equation explained to you. That is exactly the kind of scenario where it's important to do the calculation yourself to understand it.

If you are expecting problems to be solved for you, you are not learning, you're just consuming content.


explained != solved


> I wondered how soon I will be able to ask the video player "What am I looking at here? Describe the equations" and have it OCR the frames, analyze them, and explain them to me.

Seems like https://aiscreenshot.app might fit the bill.



