Hacker News | guilamu's comments

Most people, including me, beg to disagree. Better Call Saul was a masterpiece.

https://www.metacritic.com/tv/better-call-saul/


https://openrouter.ai/openai/gpt-5.5-pro

30/180 USD on OpenRouter. Did I miss something?


I think that's Pro. Regular 5.5 is 2x regular 5.4.


Just tested it on my homemade WordPress + Gravity Forms benchmark, and it's one of the worst models on the leaderboard performance-wise and the worst value-wise: https://github.com/guilamu/llms-wordpress-plugin-benchmark

I know it's only a single benchmark, but I don't understand how it can be so bad...


gemma4-e4b is 50% better than gemma4-26b in your benchmark, something's wrong


Yes, those two models were tested on my own PC (local inference using my own CPU/GPU), so something may be bugged in my setup. gemma4-26b should be far better than gemma4-e4b.


Sounds like the bigger model may be using worse quantization? Quantization matters a lot for quality; basically anything below Q8 is borderline unusable. If it isn't already specified in a benchmark, it probably should be.


The early quants for Gemma4 26b had issues and needed to be updated; might be worth checking.


A junior tinkering in their garage, in domains where they have little experience, executed a flawed test and decided to call it a benchmark. It's extremely common nowadays because words don't mean anything anymore. The forums that used to be filled with technical people doing real work are now filled with the masses of vibe researchers doing this kind of stuff. This is what happens when anything goes over some popularity threshold.

HN is the last bastion of serious inquiry these days. But it's not immune, as OP's comment proves.


You're right, I've certainly been a bit presumptuous to call this 'a benchmark'. It is indeed a flawed test. Yet it has given me the chance to try some open source models, and for my workflow, some of them are incredibly competitive with SOTA closed source models.


Your benchmark has Opus 4.7 performing significantly worse than Sonnet 4.6. Even if true on your benchmark, that is not representative of the overall performance of the models.


Yes, Opus 4.7 fast (no reasoning) did a worse job than Sonnet 4.6 high (with reasoning) according to the Gemini 3.1 Pro evaluation.


Your table doesn't indicate reasoning vs. non-reasoning, or the reasoning level.


When nothing is noted, it's max reasoning (xhigh in Copilot Chat in VS Code, if available).

The models not available on Copilot were tested through opencode (max reasoning), and DeepSeek v4 was tested through Cline (with max reasoning too).


You even traveled in time to deliver us this benchmark.

I really like this benchmarking. Have you evaluated the judge somehow? I'd love to set up my own similar benchmark.


Haha, just fixed the date!

I haven't evaluated the judge. You have everything needed in the repo to do so though, so be my guest. It took me a bit of time to put all this together, and I won't have much more time to dedicate to it for a couple of weeks.

BTW, if you explore the repo, sorry for all the French files...


Seems like a benchmark for how good a model is at vibe coding.

Your prompt is extremely slim, yet you score it on a bunch of features.


Yes, the prompt is slim by design. I might be wrong, but the point was to see what the model can do "on its own".

The eval prompt is quite extensive: https://github.com/guilamu/llms-wordpress-plugin-benchmark/b...


That’s the thing: not everyone values a model based on that. But I guess it works for you, and the benchmark measures it.

I personally develop with a very detailed spec, and I want nothing more and nothing less than the spec.

I found 5.4/5.5 much better at following a spec, while Opus makes some things up. That aligns with your benchmark, but it makes 5.4/5.5 better for me and worse for you.


Yeah, as I said, this is a benchmark for my use case only, a single use case, which is obviously not representative of everybody's needs.

What strikes me as very strange, though, is that not a single model was able to just use the search input already present on the Gravity Forms forms list page; they all created a second input.

Also, I know it's not in the prompt, but adding a ctrl+f shortcut to a search input? Is that that crazy? I don't know.


Good point! I think I won't change anything right now or I'll have to remake all tests... I'll use your input for the Level 2 task I plan on working on.


Indeed. I'm still sad though :(


"Proton Mail, one of the services he moved to, is ultimately controlled by the US Gov,"

Would you mind elaborating, pretty please?


"Controlled" is a bit hyperbolic, but there's a collaboration agreement between the US government and the Swiss government, so Proton has to comply with requests from, for example, the FBI. Quoting a comment by Proton staff on Reddit:

> First, let's correct the headline: Proton did not provide information to the FBI. What happened is that the FBI submitted a Mutual Legal Assistance Treaty (MLAT) request, which was processed by the Swiss Federal Department of Justice and Police. Proton operates exclusively under Swiss law, and we only respond to legally binding orders from Swiss authorities, after all Swiss legal checks have been passed. This is an important distinction.

> [...]

> The only information Proton could provide was a payment identifier because the user chose to pay with a credit card. This is information the user themselves provided to us through their choice of payment method. Proton also accepts cryptocurrency and cash payments, which would not have been linkable to an identity.

So basically, don't trust Proton with information unless you want the FBI to know it.


"So basically": what a weird conclusion to draw from it. Just don't pay with your credit card for services you can pay for with cash or crypto.


Sorry, perhaps the takeaway is clearer when you see the full quote [0]. I omitted it for space; here's the relevant part:

> Third, let's talk about what was actually disclosed. No emails were handed over. No message content. No metadata about who the user communicated with. The only information Proton could provide [...]

Yes, paying by crypto prevents Proton from disclosing your identity that way. Is there anything preventing Proton from disclosing the email content or metadata? Do they claim they won't disclose that? Clearly they do allow themselves to disclose metadata [1]:

> For example, in ransomware cases, we can preserve information about which victims contacted the suspect, so that victims can be notified.

So, "just don't pay with a credit card" comes with the additional caveat of "don't email somebody you don't want the FBI to know you emailed". Whether you also need to "don't write anything you don't want the FBI to know", I haven't investigated further, but you could perhaps look that up yourself. I will just assume that to be the case based on what I've seen.

[0] https://www.reddit.com/r/privacy/comments/1rltej7/comment/o8... [1] https://proton.me/legal/law-enforcement


There are limits to what you can encrypt. In all the cases where Proton has been criticized for its compliance with the law, I haven't seen any instance of them being able to disclose email content. Metadata, on the other hand, is "who we're sending email to", which is, I assume, not encryptable if you want a usable service. That being said, the quote does make your POV clearer, thank you for that.


> Is there anything preventing Proton from disclosing the email content or metadata?

Mmh... The fact that it is encrypted client-side? I mean, the code is open source, fgs. [0][1][2]

[0] https://github.com/ProtonMail/android-mail [1] https://github.com/ProtonMail/ios-mail [2] https://github.com/ProtonMail/WebClients


Yeah, if you trust that they will never push a backdoor to your client on the request of Swiss law enforcement. It's a web app "fgs".

They also admit to scanning all mail to and from non-Proton accounts "for spam". So what's stopping them from one day adding a small if statement that just writes that data to disk, for specific "interesting" users?

Regarding metadata, I sure hope you have nothing to hide in the below emphasized:

> Account Activity: Due to limitations of the SMTP protocol, we have access to the following email metadata: *sender and recipient email addresses, the IP address incoming messages originated from, attachment name, message subject, and message sent and received times*. We do NOT have access to encrypted message content, but unencrypted messages sent from external providers to your Account, or from Proton Mail to external unencrypted email services, are scanned for spam and viruses to pursue the legitimate interest of protecting the integrity of our Services and users. Such inbound messages are scanned for spam in memory, and then encrypted and written to disk. We do not possess the technical ability to scan the content of the messages after they have been encrypted. We also have access to the following records of Account activity: number of messages sent, amount of storage space used, total number of messages, last login time. User data is never used for advertising purposes.



Please quote where in that document the answer to my question is:

> Is there anything preventing Proton from disclosing the email content or metadata?

Also please link me to the source code of Proton's server-side code, so I can audit their scanning of all incoming and outgoing mail, to verify it's not logging them. What you linked above is just the clients.


That's why they have independent audits.


Use LTSC. It'll fix all the issues you are mentioning here.


Seconding LTSC. Look into it; once you try it, you will never go back. It's available from various resellers nowadays. It is what Windows should be sold as.


AFAIK Office isn't supported on LTSC (fka LTSB).

Installed LTSB for a conservative superior. He just wanted to work, without changes. I supported that happily. Until we had to start using Office 365.

Or did they revert that restriction?


Unfortunately, LTSC cannot be bought by regular customers. Legally, regular customers are only allowed to use the enshittified version.


You can get access to it, but it's a quest. You need to buy a volume license, and this requires at least 5 licenses (about $300). Then you'll be eligible to buy an LTSC version.

It doesn't require a corporation or anything, you can do that as a private person. But it IS annoying.


Why not just get the ISO, install, activate with massgravel and be done for life?


Because it's illegal, and that matters to some people.


That's true indeed, but Microsoft is not giving us any other option, so why not use the good version at home? I mean, what is the risk, really?


MS has always wanted (and probably still wants) you to pirate Windows rather than jump to Linux or Mac.


For those interested, I just made a quick guide to migrating from SwiftKey to HeliBoard: https://github.com/guilamu/SwiftKey2HeliBoard


This is Google. Just change the default launcher and you're good.


Nova Launcher just added advertisements, unless you buy Pro. Ads come for everyone.


Try https://github.com/spocky/miproja1, it's awesome and will never get any ads.


Can confirm, it works very well. You can set it as the default launcher, and never have an issue.


That's because Nova Launcher was sold to new owners (whose only goal is presumably to serve ads).


I asked 6 LLMs "What do you think of Grokipedia as a factual source of information?". Results: https://pastebin.com/cuxfHAr4

I then asked Claude Opus to summarize: https://markdownpastebin.com/?id=aa29d92662ac4a9ea7f9b3c1d9a...

Bottom line, all LLMs agree: Grokipedia is useful for quick orientation but unreliable for serious research, especially on political, controversial, or current event topics. Wikipedia remains the more trustworthy alternative.


Why should we care what LLMs say about other LLMs?


Because (1) so many people use them uncritically, and (2) they're elite projects in a field whose foundation of "what even are bugs here?" includes, amongst its narratives, stories of how elites can abuse them for personal gain.

