While this action may indeed cause the DoD to blacklist Anthropic from doing business w/the government, they probably were being as careful as they could be not to double down on the nose-thumbing.
I don't think it's addressed to Hegseth, but to anyone who might be sympathetic to Hegseth. Which I think actually strengthens your point: the goal appears to be to make it so that the only possible complaint about the letter, for someone sympathetic to the administration, is "but mass domestic surveillance / fully autonomous weapons are legal" and not "look at this lunatic leftist who calls it the department of defense".
For your example videos that you made with Cardboard: can you also put up the raw material that went into those videos? Just looking at the output doesn't tell me anything. :thanks:!
Sure! Will share the raw material for all the videos.
For some of the examples we shared though, we've created sample projects right within the product itself. They contain the raw assets and the exact prompts used to create the videos. You can try them out directly at https://demo.usecardboard.com and see the whole process!
I misread the title as "The Missuses of the University" and thought this might be the next iteration on the "Real Housewives" franchise: "Real Housewives of the University".
Sorry, didn't mean to distract from the serious topic at hand.
No, you're correct, but I've experienced a bug with older Workspace business accounts where you can't reach the screen for user instructions. It just stays blank.
Not Windows: Operating systems. We did get more capable operating systems. The point of the quote is "this is the worst the SOTA will ever be".
If Windows XP were fully supported today I still wouldn't use it, personally, despite having respect for it in its era. The core technology (how the OS sandboxing, security, memory, driver, etc. stacks are implemented) has vastly improved in newer OSes.
Of course not. But I believe your Windows example was implying fundamental tech got worse.
The original "worst" quote is implying SOTA either stays the same (we keep using the same model) or gets better.
People have been predicting that progress will halt for many years now, just like the many years of Moore's law. By all indications AI labs are not running short of ideas yet (even judging purely by externally-visible papers being published and model releases this week).
We're not even throwing all of what is possible on current hardware technology at the issue (see the recent demonstration chips fabbed specifically for LLMs, rather than general purpose, doing 14k tokens/s). It's true that we may hit a fundamental limit with current architectures, but there's no indication that current architectures are at a limit yet.
You are definitely going to have to drive it there—unless you want to put it in neutral and push!
While 200 feet is a very short and easy walk, if you walk over there without your car, you won't have anything to wash once you arrive. The car needs to make the trip with you so it can get the soap and water.
Since it's basically right next door, it'll be the shortest drive of your life. Start it up, roll on over, and get it sparkling clean.
Would you like me to check the local weather forecast to make sure it's not going to rain right after you wash it?
Some people are suggesting that this might actually be in the training set. Since I can't rule that out, I tried a different version of the question, with an elephant instead of a car:
> It's a hot and dusty day in Arizona and I need to wash my elephant. There's a creek 300 feet away. Should I ride my elephant there or should I just walk there by myself?
Gemini said:
That sounds like quite the dusty predicament! Given the options, you should absolutely ride (or lead) your elephant to the creek.
Here is the straightforward reality of the situation:
The Logistical Nightmare of Walking Alone: If you walk to the creek by yourself, your elephant stays at your starting point. Elephants require a massive amount of water for a bath. Unless you are planning an extreme workout carrying hundreds of buckets of water 300 feet back through the Arizona heat, you're going to have a very bad time—and the elephant will stay dusty.
The Perks of Going Together: 300 feet is a very short, easy stroll for an elephant. Plus, elephants generally love water and mud! Letting your elephant get right into the creek will make the washing process much easier for you and much more enjoyable for them. They can even use their trunk to help hose themselves down.
Grab some elephant-safe soap and head down to the water together!
Would you like some tips on the best methods and tools for scrubbing down an elephant?
I think that semantically this question is too similar to the car wash one. Changing subjects from car to elephant and car wash to creek does not change the fact that they are subjects. The embeddings will be similar in that dimension.
Every word and every hierarchy of words in natural language is understood by LLMs as embeddings (vectors).
Each vector has many, many dimensions, and when we train the LLMs, their internal understanding of those vectors covers all sorts of dimensions. A simple way to visualize this is a word's vector being <1, 180, 1, 3, ...>, where each entry is the value along that dimension. In this example, say the dimensions are <gender, height in cm, kindness, social title/job, ...>. In this case, our example LLM could have learned that the word I gave is <Woman, 180, 100% kind, politician, ...>. The vectors undergo some transformation, so every dimension is not that discretely clear-cut in practice.
In this case, elephant and car both end up semantically close to "vehicle"; most of their vector dimensions would be very similar.
See this article. It shows that once you train an LLM and assign an embedding vector to each token, you can see how the model distinguishes the relationship between king and queen, man and woman.
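A toy sketch of that king/queen analogy, with hand-made 4-dimensional vectors (not real learned embeddings; the dimension names and values are made up for illustration):

```python
import math

# Hypothetical dimensions: <royalty, gender, power, age>
emb = {
    "king":  [0.9,  0.9, 0.8, 0.6],
    "queen": [0.9, -0.9, 0.8, 0.6],
    "man":   [0.1,  0.9, 0.2, 0.4],
    "woman": [0.1, -0.9, 0.2, 0.4],
}

def cosine(a, b):
    # Cosine similarity: dot product over the product of vector norms.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

# The classic analogy: king - man + woman should land closest to queen.
target = [k - m + w for k, m, w in zip(emb["king"], emb["man"], emb["woman"])]
best = max(emb, key=lambda word: cosine(emb[word], target))
print(best)  # queen
```

Real embedding spaces have hundreds or thousands of dimensions and the axes aren't individually interpretable, but the vector arithmetic works the same way.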
LLMs are great at knowledge transfer, the real question is how well can they demonstrate intelligence with "unknown unknown" types of questions. This model has the benefit of being released after that issue became public knowledge, so it's hard to know how it would've performed pre-hoc.
You should definitely ride the elephant (or at least lead it there)!
Here is the logic:
If you walk there by yourself, you will arrive at the creek, but the dirty elephant will still be 300 feet back where you started. You can't wash the elephant if it isn't with you!
Plus, it is much easier to take the elephant to the water than it is to carry enough buckets of water 300 feet back to the elephant.
Would you like another riddle, or perhaps some actual tips on how to keep cool in the Arizona heat?
Ok, but the point of the logical question is about the connection. If you really think it's answering logically with reasoning, there should be zero priming.
it's not primed to help, it's primed to confuse. models want to be good, responsible people who care about the environment and don't waste fuel. that primes it to want to walk, and it has to use "reasoning" to break out of that. that's what makes it harder: it has to fight between the logical answer and the 'responsible' answer. with the elephant question there is no such conflict.
GPT-OSS-120b, a (downloadable) model released more than half a year ago also gets that right, I'm not sure this is such a great success.
> Would you like me to check the local weather forecast to make sure it's not going to rain right after you wash it?
Regardless of what I do, on the days I decide to wash my car it ALWAYS rains the day after, even if I look at the weather report and it says sunny. Cute that Gemini thinks it can predict it better than the universe's will.
> Regardless of what I do, the days I decide to wash my car, it ALWAYS rains the day after
Undeniable universal truth. I sometimes find myself making plans based on the fact that the most annoying possible outcome is also the most likely one.
Non-car person here. Why does that matter? It's not like rain means you didn't have to go to the wash; it rains often enough here that there wouldn't be any car wash places left near me if it did, but there are plenty.
> Why does that matter? It's not like rain means you didn't have to go to the wash
The car gets dirty again when it rains and then dries. I guess dust, salt, pollution and more get mixed in and deposited on the chassis as the rain falls from roofs and splashes, but I can't say I've investigated deeply. Not the end of the world, just annoying that it keeps happening.
Many people avoid washing cars just before rain to avoid water spots, etc. Phoenix, as an extreme example: it rarely rains there, and when it does, it leaves everything filthy afterwards.
The question of course is, did it get the car wash question right because it is "the car wash question" or because it could actually infer why the car needed to be there?
Wasn't that "twoot" (or whatever Mastodon calls them) made just a week ago? Unlikely to have been in the training dataset of a model becoming available for public use today, unless Google made some serious advancements on the training front.
Previous models from competitors usually got that correct, and the reasoning versions almost always did.
This kind of reflexive criticism isn't helpful; it's closer to a fully generalized counter-argument against LLM progress, whereas it's obvious to anyone that models today can do things they couldn't do six months ago, let alone two years back.
I'm not denying any progress, I'm saying that reasoning failures that are simple which have gone viral are exactly the kind of thing that they will toss in the training data. Why wouldn't they? There's real reputational risks in not fixing it and no costs in fixing it.
Given that Gemini 3 Pro already performed solidly on that test, what exactly did they improve? Why would they bother?
I double checked and tested on AI Studio, since you can still access the previous model there:
>You should drive.
>If you walk there, your car will stay behind, and you won't be able to wash it.
Thinking models consistently get it correct and did when the test was brand new (like a week or two ago). It is the opposite of surprising that a new thinking model continues getting it correct, unless the competitors had a time machine.
Why would they bother? Because it costs essentially nothing to add it to the training data. My point is that once a reasoning example becomes sufficiently viral, it ceases to be a good test because companies have a massive incentive to correct it. The fact some models got it right before (unreliably) doesn't mean they wouldn't want to ensure that the model gets it right.
I think we need to reevaluate what purpose these sorts of questions serve and why they're important in regards to judging intelligence.
The model getting it correct or not in any given instance isn't the point; the point is that if the model ever gets it wrong, we can assume its output still has some stochasticity, given that a model is essentially static once it is released.
Additionally, they don't learn post-training (except in context, which I think counts as learning to some degree, albeit transient). If it hypothetically answers incorrectly 1 in 50 attempts, and I explain in that one failed attempt why it is wrong, there will still be a 1-in-50 chance it gets it wrong in a new instance.
This differs from humans. Say I give an average person the "what do you put in a toaster" trick and they fall for it; I can be pretty confident that if I try that trick again 10 years later they will probably not fall for it. You can't really say that for a given model.
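A quick simulation of that point, assuming a hypothetical frozen model with a fixed 1-in-50 failure rate: because the weights never change, each new conversation is an independent draw from the same distribution, no matter what was explained in an earlier one.

```python
import random

random.seed(0)
FAIL_P = 1 / 50  # hypothetical per-attempt failure rate of a frozen model

def ask_model():
    # A static model samples from the same distribution every time;
    # nothing said in a previous conversation changes the weights.
    return random.random() >= FAIL_P  # True = correct answer

trials = 100_000
correct = sum(ask_model() for _ in range(trials))
print(f"observed accuracy: {correct / trials:.3f}")  # prints something close to 0.980
```

The observed accuracy hovers around 98% across any number of fresh conversations, unlike a human who (hopefully) stops falling for the same trick.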
They're important but not as N=1. It's like cherry picking a single question from SimpleQA and going aha! It got it right! Meanwhile it's 8% lower score than some other model when evaluated on all questions.
Makes me wonder what people would consider better, a model that gets 92% of questions right 100% of the time, or a model that gets 95% of the questions right 90% of the time and 88% right the other 10%?
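For what it's worth, the expected score of the second hypothetical model comes out higher; quick arithmetic:

```python
# Expected accuracy of the two hypothetical models from the comment above.
model_a = 0.92                      # 92% of questions, every run
model_b = 0.9 * 0.95 + 0.1 * 0.88  # 95% on 90% of runs, 88% on the other 10%

print(f"A: {model_a:.3f}, B: {model_b:.3f}")  # A: 0.920, B: 0.943
```

So B wins on expectation (94.3% vs 92%), but A is the safer pick if you care about worst-case consistency.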
I think that's why benchmarking is so hard for me to fully get behind, even if we do it over say, 20 attempts and average it. For a given model, those 20 attempts could have had 5 incredible outcomes and 15 mediocre ones, whereas another model could have 20 consistently decent attempts and the average score would be generally the same.
We at least see variance in public benchmarks, but in the internal examples that's almost never the case.