Yes, those two models were tested on my own PC (local inference on my own CPU/GPU), so something may be bugged in my setup. gemma4-26b should be far better than gemma4-e4b.
Sounds like the bigger model might be running with worse quantization? Quantization matters a lot for quality; basically anything below Q8 is borderline unusable. If it isn't already specified in a benchmark, it probably should be.
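To make the quantization point concrete, here is a toy sketch of how round-trip error grows as the bit width shrinks. It assumes plain symmetric round-to-nearest quantization over random Gaussian "weights", which is much cruder than the block-wise schemes real Q4/Q8 model files use, and the helper names are made up for illustration. It says nothing about where output quality actually becomes unusable; it only shows the direction of the effect.

```python
import random


def quantize_roundtrip(weights, bits):
    """Quantize to `bits` bits (symmetric, round-to-nearest), then dequantize."""
    levels = 2 ** (bits - 1) - 1  # 127 for 8 bits, 7 for 4 bits, 1 for 2 bits
    scale = max(abs(w) for w in weights) / levels
    return [round(w / scale) * scale for w in weights]


def mean_abs_error(weights, bits):
    """Average absolute difference between original and round-tripped weights."""
    restored = quantize_roundtrip(weights, bits)
    return sum(abs(a - b) for a, b in zip(weights, restored)) / len(weights)


random.seed(0)
# Hypothetical stand-in for a weight tensor: 10,000 standard normal samples.
weights = [random.gauss(0.0, 1.0) for _ in range(10_000)]

errors = {bits: mean_abs_error(weights, bits) for bits in (8, 4, 2)}
for bits, err in errors.items():
    print(f"{bits}-bit mean abs error: {err:.4f}")
```

The error rises as bits drop, which is why two runs of "the same" model at different quantization levels are not really comparable unless the level is reported.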
A junior tinkering in their garage, in domains where they have little experience, executed a flawed test and decided to call it a benchmark. It's extremely common nowadays because words don't mean anything anymore. The forums that used to be filled with technical people doing real work are now filled with the masses of vibe researchers doing this kind of stuff. This is what happens when anything goes over some popularity threshold.
HN is the last bastion of serious inquiry these days. But it's not immune, as OP's comment proves.
You're right, I've certainly been a bit presumptuous to call this "a benchmark". It is indeed a flawed test. Yet it has given me the chance to try some open-source models, and for my workflow some of them are incredibly competitive with SOTA closed-source models.
Your benchmark has Opus 4.7 performing significantly worse than Sonnet 4.6. Even if true on your benchmark, that is not representative of the overall performance of the models.
I haven't evaluated the judge benchmark. You have everything needed in the repo to do so though, so be my guest. It took me a bit of time to put all this together, and I won't have much more time to dedicate to it for a couple of weeks.
BTW, if you explore the repo, sorry for all the French files...
That’s the thing, not everyone wants and values the model based on that. But I guess it works for you, and that benchmark achieves it.
I personally develop with a very detailed spec, and I want nothing more and nothing less than the spec.
I found 5.4/5.5 much better at following spec while Opus makes some things up, which aligns with your benchmark but that makes 5.4/5.5 better for me while worse for you.
Yeah, as I said, this is a benchmark for my use case only, a single use case, which is obviously not representative of everybody's needs.
What strikes me as very strange, though, is that zero models were able to just use the search input already present on the Gravity Forms forms list page; they all created a second input.
Also, I know it's not in the prompt, but adding a ctrl+f shortcut to a search input? Is that that crazy? I don't know.
Good point! I think I won't change anything right now or I'll have to remake all tests... I'll use your input for the Level 2 task I plan on working on.
"Controlled" is a bit hyperbolic, but there's a collaboration agreement between the US government and the Swiss government, so Proton has to comply with requests from, for example, the FBI. Quoting a comment by Proton staff on Reddit:
> First, let's correct the headline: Proton did not provide information to the FBI. What happened is that the FBI submitted a Mutual Legal Assistance Treaty (MLAT) request, which was processed by the Swiss Federal Department of Justice and Police. Proton operates exclusively under Swiss law, and we only respond to legally binding orders from Swiss authorities, after all Swiss legal checks have been passed. This is an important distinction.
> [...]
> The only information Proton could provide was a payment identifier because the user chose to pay with a credit card. This is information the user themselves provided to us through their choice of payment method. Proton also accepts cryptocurrency and cash payments, which would not have been linkable to an identity.
So basically, don't trust Proton with information unless you want the FBI to know it.
Sorry, perhaps the takeaway is clearer when you see the full quote [0]. I omitted it for space; here's the relevant part:
> Third, let's talk about what was actually disclosed. No emails were handed over. No message content. No metadata about who the user communicated with. The only information Proton could provide [...]
Yes, paying by crypto prevents Proton from disclosing your identity that way. Is there anything preventing Proton from disclosing the email content or metadata? Do they claim they won't disclose that? Clearly they do allow themselves to disclose metadata [1]:
> For example, in ransomware cases, we can preserve information about which victims contacted the suspect, so that victims can be notified.
So, "just don't pay with a credit card" comes with the additional caveat of "don't email somebody you don't want the FBI to know you emailed". Whether you also need to "don't write anything you don't want the FBI to know", I haven't investigated further, but you could perhaps look that up yourself. I will just assume that to be the case based on what I've seen.
There are limits to what you can encrypt. In all of the cases where Proton has been critiqued for its compliance with the law, I haven't seen any instance of them being able to disclose email content. The metadata here is "who we're sending email to", which is, I assume, not encryptable if you want a usable service. That being said, the quote does make your POV clearer, thank you for that.
Yeah, if you trust that they will never push a backdoor to your client on the request of Swiss law enforcement. It's a web app "fgs".
They also admit to scanning all mail to and from non-Proton accounts "for spam". So what's stopping them from one day adding a small if statement that just writes that data to disk, for specific "interesting" users?
Regarding metadata, I sure hope you have nothing to hide in the emphasized part below:
> Account Activity: Due to limitations of the SMTP protocol, we have access to the following email metadata: *sender and recipient email addresses, the IP address incoming messages originated from, attachment name, message subject, and message sent and received times*. We do NOT have access to encrypted message content, but unencrypted messages sent from external providers to your Account, or from Proton Mail to external unencrypted email services, are scanned for spam and viruses to pursue the legitimate interest of protecting the integrity of our Services and users. Such inbound messages are scanned for spam in memory, and then encrypted and written to disk. We do not possess the technical ability to scan the content of the messages after they have been encrypted. We also have access to the following records of Account activity: number of messages sent, amount of storage space used, total number of messages, last login time. User data is never used for advertising purposes.
Please quote where in that document the answer to my question is:
> Is there anything preventing Proton from disclosing the email content or metadata?
Also please link me to the source code of Proton's server-side code, so I can audit their scanning of all incoming and outgoing mail, to verify it's not logging them. What you linked above is just the clients.
You can get access to it, but it's a quest. You need to buy a volume license, and this requires at least 5 licenses (about $300). Then you'll be eligible to buy an LTSC version.
It doesn't require a corporation or anything, you can do that as a private person. But it IS annoying.
Bottom Line
All LLMs agree: Grokipedia is useful for quick orientation but unreliable for serious research, especially on political, controversial, or current event topics. Wikipedia remains the more trustworthy alternative.
Because of (1) all the people using them uncritically, (2) that they're elite projects in a field whose foundation of "what even are bugs here?" includes amongst its narratives stories of how elites can abuse them for personal gain
https://www.metacritic.com/tv/better-call-saul/