Hacker Newsnew | past | comments | ask | show | jobs | submit | osti's commentslogin

Not true. Geekbench, especially single threaded benchmark, is probably the best we got, it has a bunch of workloads, unlike many other benchmarks like cinebench for example. And they publish all the results on their website, so you can dig into each individual workload and find the ones that apply to you.

And like the other poster mentioned, it correlates well with SPEC, so it's basically a easily accessible SPEC. These days the only benchmark I use to quickly judge some CPU is geekbench.


May I suggest the one I use (I wrote it), which also correlates well with SPEC & Geekbench 5, but also runs the benchmarks on all cores if you want to so you get both max single-thread and max multi-thread: https://github.com/dkechag/dkbench-docker . You basically run 'docker run -it --rm dkechag/dkbench'.

I took a look, it's not bad but it seems to contain too many micro benchmarks like regex or primes. Geekbench at least has clang which is a subscore that I always look at.

The primes one is my least favourite one indeed, I left it in just because I happened to include it in the very first version and I am thinking it just counts for 5% in the end... The regex ones are "micro" yet quite important, dkbench it's a Perl (and C)-based benchmark (reflects our main code), and the regex engine is the most highly optimized part of the language so regex speed is a good representation of text processing speed in Perl. As I said, the overall score correlates well to SPEC/Geekbench so as a suite it works well. For compiler comparisons I usually compile a language like python or perl as a test, but I did not want to add something like that, to keep it fast with many smaller benchmarks.

It's only that one number that is for sonnet.

except for the webarena-verified

Their company is called Anthropic after all.


Anthslopic is more like it.


ByteDance never really open sourced their models though. But I agree, they will only open source when it doesn't really matter.


That's what I found with some of these LLM models as well. For example I still like to test those models with algorithm problems, and sometimes when they can't actually solve the problem, they will start to hardcode the test cases into the algorithm itself.. Even DeepSeek was doing this at some point, and some of the most recent ones still do this.


I have asked GLM4.7 in opencode to make an application to basically filter a couple of spatial datasets hosted at a url I provided it, and instead of trying to download read the dataset, it just read the url, assumed what the datasets were (and got it wrong) is and it's shape (and got it wrong) and the fields (and got it wrong) and just built an application based on vibes that was completely unfixable.

It wrote an extensive test suite on just fake data and then said the app is perfectly working as all tests passed.

This is a model that was supposed to match sonnet 4.5 in benchmarks. I don't think sonnet would be that dumb.

I use LLMs a lot to code, but these chinese models don't match anthropic and openai in being able to decide stuff for themselves. They work well if you give them explicit instructions that leaves little for it to mess up, but we are slowly approaching where OpenAI and anthropic models will make the right decisions on their own


this aligns perfecly with my experience, but of course, the discourse on X and other forums are filled with people who are not hands on. Marketing is first out of the gate. These models are not yet good enough to be put through a long coding session. They are getting better though! GLM 4.7 and Kimi 2.5 are alright.


It really is infuriatingly dumb; like a junior who does not know English. Indeed, it often transitions into Chinese.

Just now it added some stuff to a file starting at L30 and I said "that one line L30 will do remove the rest", it interpreted 'the rest' as the file, and not what it added.


Sounds exactly what a junior-dev would do without proper guidance. Could better direction in the prompts help? I find I frequently have to tell it where to put what fixes. IME they make a lot of spaghetti (LLMs and juniors)


wtf kinda juniors are you interacting with


Lots of self-taught; looking for an entry level.


I'm self-taught and I've always understood that adjusting tests to cheat is a fail.


So now you don't want capitalism?


Okay, lets do your job and career next! Just capitalism bro.


There was a debate with Mike Dukakis when one of the moderators asked if he would want the death penalty for someone who killed his wife. He gave some cold blooded answer.

The real answer was probably - I shouldn't be the one who decides what happened to the person who killed my wife.

In the same way - It shouldn't be up to me if I get fired or my job gets shipped overseas or done by an AI. If we're in a free market it's up to all the people who are buying what I'm making. If there's a cheaper way why wouldn't they take it?


Somehow regresses on SWE bench?


I don't know how these benchmarks work (do you do a hundred runs? A thousand runs?), but 0.1% seems like noise.


That benchmark is pretty saturated, tbh. A "regression" of such small magnitude could mean many different things or nothing at all.


i'd interpret that as rounding error. that is unchanged

swe-bench seems really hard once you are above 80%


it's not a great benchmark anymore... starting with it being python / django primarily... the industry should move to something more representative


Openai has; they don't even mention score on gpt-5.3-codex.

On the other hand, it is their own verified benchmark, which is telling.


Social "science" be social science.


I saw many complaints about assetto corsa evo early access about the slow pace of development after EA release. So I'm not sure if I wanna "beta test" this one.


AC rally is developed by another studio (Supernova) though, not Kunos. From online and youtube reviews, it seems to be pretty good in terms of visuals, realism and FFB. Might have less bugs.


I saw the suspension behavior and I can’t necessarily agree with the realism statement. Some mild bumps that a million dollar rally car would absorb no problem send the car flying as if it was a 1995 Civic DX going down the road.

I own the original AC and I just can’t get over how bad the audio is. Sounds like a simple pitch change on a static mp3 file that’s not even that accurate at idle. An E30 sounds like a synth.

I thought ok, it’s an old game now, I get it, they must have fixed that in ACC, nope. They must have fixed it in AC Evo… based on the videos I’ve seen, nope.

Maybe rally does it better? Does anyone know?


That's why reading comments about geopolitics on the Internet is largely useless. Big news! A country's population supports its own country on international stage! If you go on Chinese social media, it'll be mostly about how awful the Americans are, and vice versa if you are on Reddit for example. So what is even the point of reading them, anywhere..


I think you and I are on very different Reddits, if you're using it as an example of pro-American social media.

Fully agree that reading either for geopolitical opinions is useless.


Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: