
guilamu · 2026-04-27T09:44:37 1777283077

Most people, including me, beg to disagree. Better Call Saul was a masterpiece.

https://www.metacritic.com/tv/better-call-saul/

guilamu · 2026-04-24T20:41:49 1777063309

https://openrouter.ai/openai/gpt-5.5-pro

30/180 usd on Openrouter. Did I miss something?

languid-photic · 2026-04-24T21:00:05 1777064405

I think that's Pro. Regular 5.5 is 2x regular 5.4.
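For a sense of scale, here is a minimal sketch of what the 30/180 figure means per request. It assumes those numbers are USD per million input and output tokens respectively, which the comment above does not state explicitly. The token counts and the function name `request_cost_usd` are hypothetical, chosen only for illustration.

```python
def request_cost_usd(input_tokens, output_tokens, input_price_per_m, output_price_per_m):
    """Cost of one request, given prices in USD per million tokens (assumed unit)."""
    return (input_tokens * input_price_per_m + output_tokens * output_price_per_m) / 1_000_000

# Hypothetical request: 20k tokens of prompt, 5k tokens of completion,
# priced at the 30 / 180 figures quoted in the thread.
pro_cost = request_cost_usd(20_000, 5_000, 30.0, 180.0)
print(f"Estimated cost: ${pro_cost:.2f}")
```

Because output is priced six times higher than input under this assumption, the completion dominates the bill even though it is a quarter of the prompt's size here.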

guilamu · 2026-04-24T20:05:07 1777061107

Just tested it on my homemade WordPress+GravityForms benchmark, and it's one of the worst models on the leaderboard performance-wise and the worst value-wise: https://github.com/guilamu/llms-wordpress-plugin-benchmark

I know it's only a single benchmark, but I don't understand how it can be so bad...

goldenarm · 2026-04-24T20:27:06 1777062426

gemma4-e4b is 50% better than gemma4-26b in your benchmark, something's wrong

guilamu · 2026-04-24T20:30:14 1777062614

Yes, those two models were tested on my own PC (local inference using my own CPU/GPU), so something may be bugged in my setup. gemma4-26b should be far better than gemma4-e4b.

embedding-shape · 2026-04-24T21:08:34 1777064914

Sounds like maybe you're using worse quantization on the bigger model? Quantization matters a lot for quality; basically anything below Q8 is borderline unusable. If it isn't already specified in the benchmark, it probably should be.
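To illustrate why bit width matters, here is a minimal sketch of plain symmetric round-to-nearest quantization at 8 and 4 bits. It is a simplified stand-in, not what local inference runtimes actually do (real quant formats typically use block-wise scales and other refinements), and the synthetic Gaussian "weights" are an assumption for illustration only. It shows that fewer bits means coarser rounding; it does not establish where the usability threshold lies.

```python
import random

def quantize_roundtrip(weights, bits):
    """Quantize floats to signed integers of the given width, then map back to floats."""
    levels = 2 ** (bits - 1) - 1              # 127 for 8-bit, 7 for 4-bit
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / levels if max_abs else 1.0
    return [round(w / scale) * scale for w in weights]

def rms_error(original, restored):
    """Root-mean-square difference between two equal-length lists."""
    n = len(original)
    return (sum((a - b) ** 2 for a, b in zip(original, restored)) / n) ** 0.5

random.seed(0)
# Synthetic stand-in for a weight tensor (hypothetical data, not from any model).
weights = [random.gauss(0.0, 0.02) for _ in range(4096)]

errors = {bits: rms_error(weights, quantize_roundtrip(weights, bits)) for bits in (8, 4)}
for bits, err in errors.items():
    print(f"{bits}-bit RMS rounding error: {err:.6f}")
```

The 4-bit grid has roughly 18 times fewer levels per unit range than the 8-bit one, so its rounding error is correspondingly larger. Whether that translates into worse benchmark answers depends on the model and the quant scheme.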

data-ottawa · 2026-04-25T12:24:35 1777119875

The early quants for Gemma4 26b had issues and needed to be updated, might be worth checking

Art9681 · 2026-04-25T13:54:53 1777125293

A junior tinkering in their garage, in domains where they have little experience, executed a flawed test and decided to call it a benchmark. It's extremely common nowadays because words don't mean anything anymore. The forums that used to be filled with technical people doing real work are now filled with the masses of vibe researchers doing this kind of stuff. This is what happens when anything goes over some popularity threshold.

HN is the last bastion of serious inquiry these days. But it's not immune, as OP's comment proves.

guilamu · 2026-04-27T10:02:45 1777284165

You're right, I've certainly been a bit presumptuous to call this 'a benchmark'. It is indeed a flawed test. Yet it has given me the chance to try some open-source models, and for my workflow some of them are incredibly competitive with SOTA closed-source models.

ac29 · 2026-04-24T20:14:27 1777061667

Your benchmark has Opus 4.7 performing significantly worse than Sonnet 4.6. Even if true on your benchmark, that is not representative of the overall performance of the models.

guilamu · 2026-04-24T20:22:53 1777062173

Yes, Opus 4.7 fast (no reasoning) did a worse job than Sonnet 4.6 high (with reasoning), according to the Gemini 3.1 Pro evaluation.

ac29 · 2026-04-24T20:32:32 1777062752

Your table doesn't indicate reasoning vs non-reasoning, or reasoning level

guilamu · 2026-04-24T20:36:16 1777062976

When nothing is noted, it's max reasoning (xhigh in Copilot Chat in VS Code, if available).

The models not available on Copilot were tested through opencode (max reasoning), and deepseek v4 was tested through Cline (with max reasoning too).

mosselman · 2026-04-24T20:18:08 1777061888

You even traveled in time to deliver us this benchmark.

I really like this benchmarking. Have you evaluated the judge benchmark somehow? I'd love to set up my own similar benchmark.

guilamu · 2026-04-24T20:25:31 1777062331

Haha, just fixed the date!

I haven't evaluated the judge benchmark. You have everything needed in the repo to do so though, so be my guest. It took me a bit of time to put all this together, and I won't have much more time to dedicate to it for a couple of weeks.

BTW, if you explore the repo, sorry for all the French files...

DrProtic · 2026-04-24T20:26:43 1777062403

Seems like a benchmark for how good a model is at vibe coding.

Your prompt is extremely slim yet you score it on a bunch of features.

guilamu · 2026-04-24T20:28:33 1777062513

Yes, the prompt is slim by design. I might be wrong, but the point was to see what the model can do "on its own".

The eval prompt is quite extensive: https://github.com/guilamu/llms-wordpress-plugin-benchmark/b...

DrProtic · 2026-04-24T20:57:27 1777064247

That’s the thing, not everyone wants and values the model based on that. But I guess it works for you, and that benchmark achieves it.

I personally develop with a very detailed spec, and I want nothing more and nothing less than the spec.

I found 5.4/5.5 much better at following spec while Opus makes some things up, which aligns with your benchmark but that makes 5.4/5.5 better for me while worse for you.

guilamu · 2026-04-24T22:10:00 1777068600

Yeah, as I said, this is a benchmark for my use case only, a single use case, which is obviously not representative of everybody's needs.

What strikes me as very strange, though, is that zero models were able to just use the search input already present on the GravityForms forms list page; all of them created a second input.

Also, I know it's not in the prompt, but adding a ctrl+f shortcut to a search input? Is that that crazy? I don't know.

guilamu · 2026-04-24T06:51:33 1777013493

Good point! I think I won't change anything right now or I'll have to remake all tests... I'll use your input for the Level 2 task I plan on working on.

guilamu · 2026-04-20T19:14:24 1776712464

Indeed. I'm still sad though :(

guilamu · 2026-04-03T10:08:54 1775210934

"Proton Mail, one of the services he moved to, is ultimately controlled by the US Gov,"

Would you mind elaborating, pretty please?

streetfighter64 · 2026-04-03T10:18:32 1775211512

"Controlled" is a bit hyperbolic, but there's a collaboration agreement between the US government and the Swiss government, so Proton has to comply with requests from, for example, the FBI. Quoting a comment by Proton staff on Reddit:

> First, let's correct the headline: Proton did not provide information to the FBI. What happened is that the FBI submitted a Mutual Legal Assistance Treaty (MLAT) request, which was processed by the Swiss Federal Department of Justice and Police. Proton operates exclusively under Swiss law, and we only respond to legally binding orders from Swiss authorities, after all Swiss legal checks have been passed. This is an important distinction.

> [...]

> The only information Proton could provide was a payment identifier because the user chose to pay with a credit card. This is information the user themselves provided to us through their choice of payment method. Proton also accepts cryptocurrency and cash payments, which would not have been linkable to an identity.

So basically, don't trust Proton with information unless you want the FBI to know it.

Yiin · 2026-04-03T10:37:03 1775212623

"So basically" is a weird conclusion to draw from it. Just don't pay with your credit card for services you can pay for with cash or crypto.

streetfighter64 · 2026-04-03T10:47:30 1775213250

Sorry, perhaps the takeaway is clearer when you see the full quote [0]. I omitted it for space; here's the relevant part:

> Third, let's talk about what was actually disclosed. No emails were handed over. No message content. No metadata about who the user communicated with. The only information Proton could provide [...]

Yes, paying by crypto prevents Proton from disclosing your identity that way. Is there anything preventing Proton from disclosing the email content or metadata? Do they claim they won't disclose that? Clearly they do allow themselves to disclose metadata [1]:

> For example, in ransomware cases, we can preserve information about which victims contacted the suspect, so that victims can be notified.

So, "just don't pay with a credit card" comes with the additional caveat of "don't email somebody you don't want the FBI to know you emailed". Whether you also need to "don't write anything you don't want the FBI to know", I haven't investigated further, but you could perhaps look that up yourself. I will just assume that to be the case based on what I've seen.

[0] https://www.reddit.com/r/privacy/comments/1rltej7/comment/o8... [1] https://proton.me/legal/law-enforcement

Yiin · 2026-04-03T10:53:44 1775213624

There are limits to what you can encrypt. In all of the cases where Proton has been criticized for its compliance with the law, I haven't seen any instance of them being able to disclose email content. The metadata is "who we're sending email to", which is, I assume, not encryptable if you want a usable service. That being said, the quote does make your POV clearer, thank you for that.

redat00 · 2026-04-03T13:40:33 1775223633

> Is there anything preventing Proton from disclosing the email content or metadata?

Mmh... The fact that it is encrypted client-side? I mean, the code is open source, fgs. [0][1][2]

[0] https://github.com/ProtonMail/android-mail [1] https://github.com/ProtonMail/ios-mail [2] https://github.com/ProtonMail/WebClients

streetfighter64 · 2026-04-03T17:36:02 1775237762

Yeah, if you trust that they will never push a backdoor to your client on the request of Swiss law enforcement. It's a web app "fgs".

They also admit to scanning all mail to and from non-Proton accounts "for spam". So what's stopping them from one day adding a small if statement that just writes that data to disk, for specific "interesting" users?

Regarding metadata, I sure hope you have nothing to hide in the below emphasized:

> Account Activity: Due to limitations of the SMTP protocol, we have access to the following email metadata: *sender and recipient email addresses, the IP address incoming messages originated from, attachment name, message subject, and message sent and received times*. We do NOT have access to encrypted message content, but unencrypted messages sent from external providers to your Account, or from Proton Mail to external unencrypted email services, are scanned for spam and viruses to pursue the legitimate interest of protecting the integrity of our Services and users. Such inbound messages are scanned for spam in memory, and then encrypted and written to disk. We do not possess the technical ability to scan the content of the messages after they have been encrypted. We also have access to the following records of Account activity: number of messages sent, amount of storage space used, total number of messages, last login time. User data is never used for advertising purposes.
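As a rough illustration of the split the quoted policy describes, here is a minimal sketch using Python's standard `email` package. It assumes a conventionally encrypted message where only the body is ciphertext; the addresses, subject, and placeholder ciphertext are all hypothetical, and this is not a model of Proton's actual pipeline.

```python
from email.message import EmailMessage

# Build a message whose body is already encrypted by the sender's client.
msg = EmailMessage()
msg["From"] = "alice@example.org"           # hypothetical addresses
msg["To"] = "bob@example.com"
msg["Subject"] = "Quarterly numbers"
msg["Date"] = "Fri, 03 Apr 2026 17:36:02 +0000"
msg.set_content(
    "-----BEGIN PGP MESSAGE-----\n"
    "(ciphertext placeholder)\n"
    "-----END PGP MESSAGE-----\n"
)

# What any server handling the message can read without a key:
visible_metadata = {name: msg[name] for name in ("From", "To", "Subject", "Date")}
body = msg.get_content()

for name, value in visible_metadata.items():
    print(f"{name}: {value}")
print(body)
```

The header fields stay readable because mail servers need them to route and deliver the message, while the body is opaque to anyone without the recipient's key. That matches the list of metadata in the quote, and it is also why the remaining question in this thread is about trust in the client and server code rather than about the wire format.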

redat00 · 2026-04-03T19:23:14 1775244194

https://datatracker.ietf.org/doc/html/rfc5321

streetfighter64 · 2026-04-03T20:29:09 1775248149

Please quote where in that document the answer to my question is:

> Is there anything preventing Proton from disclosing the email content or metadata?

Also please link me to the source code of Proton's server-side code, so I can audit their scanning of all incoming and outgoing mail, to verify it's not logging them. What you linked above is just the clients.

Yiin · 2026-04-04T09:52:39 1775296359

that's why they have independent audits.

guilamu · 2026-03-27T16:57:11 1774630631

Use LTSC. It'll fix all the issues you are mentioning here.

pomian · 2026-03-27T17:30:29 1774632629

Seconding LTSC. Look into it; once you try it you will never go back. It's available from various resellers nowadays. It is what Windows should be sold as.

GuestFAUniverse · 2026-03-27T20:34:55 1774643695

AFAIK Office isn't supported on LTSC (fka LTSB).

Installed LTSB for a conservative superior. He just wanted to work, without changes. I supported that happily. Until we had to start using Office 365.

Or did they revert that restriction?

Krssst · 2026-03-27T19:01:02 1774638062

LTSC cannot be bought as a regular customer unfortunately. Legally, regular customers are only allowed to use the enshittified version.

cyberax · 2026-03-28T07:02:38 1774681358

You can get access to it, but it's a quest. You need to buy a volume license, and this requires at least 5 licenses (about $300). Then you'll be eligible to buy an LTSC version.

It doesn't require a corporation or anything, you can do that as a private person. But it IS annoying.

guilamu · 2026-03-28T12:29:04 1774700944

Why not just get the iso, install, activate with massgravel and be done for life?

majorchord · 2026-03-28T20:18:08 1774729088

Because it's illegal and that matters to some people

guilamu · 2026-03-28T07:05:53 1774681553

That's true indeed, but Microsoft is not giving us any other option so why not use the good version at home? I mean what is the risk really?

userbinator · 2026-03-28T09:11:56 1774689116

MS has always wanted (and probably still wants) you to pirate Windows instead of jumping to Linux or Mac.

guilamu · 2026-03-18T19:36:01 1773862561

For those interested, I just made a quick guide to migrating from SwiftKey to HeliBoard: https://github.com/guilamu/SwiftKey2HeliBoard

guilamu · 2026-02-01T15:17:23 1769959043

This is Google. Just change the default launcher and you're good.

paulryanrogers · 2026-02-01T16:06:10 1769961970

Nova Launcher just added advertisements, unless you buy Pro. Ads come for everyone.

guilamu · 2026-02-01T16:40:42 1769964042

Try https://github.com/spocky/miproja1, it's awesome and will never get any ads.

JoshTriplett · 2026-02-01T18:30:47 1769970647

Can confirm, it works very well. You can set it as the default launcher, and never have an issue.

SECProto · 2026-02-01T16:37:32 1769963852

That's because Nova launcher sold to new owners (whose presumed only goal is to serve ads)

guilamu · 2026-01-25T09:28:49 1769333329

I asked 6 LLMs "What do you think of Grokipedia as a factual source of information?". Results: https://pastebin.com/cuxfHAr4

I then asked Claude Opus to sum up: https://markdownpastebin.com/?id=aa29d92662ac4a9ea7f9b3c1d9a...

Bottom line: all LLMs agree that Grokipedia is useful for quick orientation but unreliable for serious research, especially on political, controversial, or current-event topics. Wikipedia remains the more trustworthy alternative.

gizzlon · 2026-01-25T09:34:04 1769333644

Why should we care what LLMs say about other LLMs?

ben_w · 2026-01-25T09:41:41 1769334101

Because of (1) all the people using them uncritically, (2) that they're elite projects in a field whose foundation of "what even are bugs here?" includes amongst its narratives stories of how elites can abuse them for personal gain