Hacker News | hipmanbro's comments

There were a few reasoning benchmarks where I noticed they omitted a direct comparison, presumably because they weren't as competitive with GPT-4, and instead opted to show benchmarks comparing the model only to other versions of PaLM or other language models:

HellaSwag: GPT-4: 95.3%, PaLM 2-L: 86.8%

MMLU: GPT-4: 86.4%, Flan-PaLM 2-L: 81.2%

ARC: GPT-4: 96.3%, PaLM 2-L: 89.7%

(from: GPT-4 paper: https://arxiv.org/pdf/2303.08774.pdf)


It's a good proxy for both. Both popularity and volatility have misleading statistical failure modes if measured purely by pull requests.

The author did narrow it down to what it most accurately represents, though:

> Nevertheless, in my view, the number of pull requests is an important indicator of how much people are willing and capable of contributing to your software in the open source domain.


There has to be something to contribute to first -- i.e., new features or bug fixes. Once software reaches a level of stability, there aren't new features to be built, or bugs to be fixed. That doesn't mean there aren't contributors out there, willing to contribute to something new.


One of the worries I have is that the grand prize is so top-weighted. It feels daunting to attempt something like this when there's only one winner. If my team isn't the first to win, but we contribute an interesting method, the most we can get is a $2,000 open source prize?


We're thinking about this too. The overall prize pool was about 6x smaller a week ago so we are still digesting this rapid influx of sponsorship.

The grand prize goes to the _first_ team to read 4 passages from the scrolls. But we could, for example, award something to the second team to do so. Or, we could award something to the team that reads the _most_ passages by the end of the year.

We deliberately did not allocate all of the recent sponsorships to the grand prize so we can solve for this exact challenge. So, we have about $500k in unallocated prize money, and might use a good chunk of it towards something like this. We're open to ideas, and consulting with experts from Xprize etc.


It's not the same in that you can steelman a position and come up with brand new arguments that are better than what the other side is saying. "Reconstructing" doesn't necessitate the strongest form of the other argument.


I mean, that is what it's for. It's about making someone else's argument fit into your own system without misrepresenting it or making it unclear.

I suppose having "steelman" lets you relate it to "strawman" and "weakman" which can be an advantage, but knowing the existing term lets you read the existing literature.


> making someone else's argument fit into your own system without misrepresenting it or making it unclear

I think steelmanning is a superset of this: it includes the reconstruction definition, but it also includes making a whole new set of arguments entirely unrelated to your own argument or the other person's. That is, steelmanning can involve coming up with novel arguments for the other side.


