There are A LOT of misconceptions about LLMs, and the biggest one is that they are not deterministic. They are 100% deterministic, and temperature has nothing to do with it. You WILL get exactly the same result every single time (at ANY temperature) as long as you use the same sampling parameters (including the seed) and the same server config parameters. What causes variance in LLMs is server-side behavior like batch processing and caching, among possibly a few other things, with batching being responsible for most of the issues. The reason batching is used is that large providers serve multiple customers per GPU, and breaking up the VRAM is tricky and causes drift. If you start llama.cpp, for example, with a single slot and batching off, you will always get the same results every time, even at temperature 1.2 or whatever other parameters, because you are using one GPU per inference call, so no fucky business there. The reason most people are unaware of this is that most people only have experience with APIs instead of working with the actual inference engine itself, so this goddamn myth keeps spreading. My video for reference is here, where you can download it and try for yourself: https://www.youtube.com/watch?v=EyE5BrUut2o
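To make the "temperature has nothing to do with it" point concrete, here's a toy numpy sketch (not llama.cpp code, just an illustration): temperature sampling is a pure function of the logits, the temperature, and the RNG seed, so the same inputs give the same token every run.

```python
import numpy as np

def sample_token(logits, temperature, seed):
    """Temperature sampling: deterministic given the same logits, temperature, and seed."""
    rng = np.random.default_rng(seed)
    scaled = np.asarray(logits, dtype=np.float64) / temperature
    scaled -= scaled.max()                          # numerical stability
    probs = np.exp(scaled) / np.exp(scaled).sum()   # softmax
    return int(rng.choice(len(probs), p=probs))

logits = [2.0, 1.0, 0.5, -1.0]
# Same logits, same temperature, same seed -> same token, every run, even at 1.2.
print([sample_token(logits, temperature=1.2, seed=42) for _ in range(5)])
```

What batching changes in a real server is the order (and shape) of the floating-point work that produces the logits in the first place, and that is where the drift comes from.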
Thanks so much for this! I still haven't got around to building my own language model yet, so I'm a bit fuzzy on the details, but if I imagined a thought experiment where I did all the math by hand on paper, I just couldn't see how I would end up with a different output each time given the same inputs. Finding out that the variance other people are seeing comes from the server/hardware stuff clears that up.
This is a surprisingly annoying question to Google. A lot of articles give the reason that softmax returns a probability distribution, as if the presence of the word "probability" means the tokens will be different every time.
There's still a lot of low-hanging fruit left IMO. Good find, and rather funny to think about: someone can simply clone the various layers multiple times and, instead of spending millions of dollars retraining the model, increase performance significantly with "this one trick".
The other interesting point is that right now I'm copy-pasting the layers, but a patch in llama.cpp could make the same model behave better simply by following a different "flow" through the existing layers, without needing more VRAM...
If this is validated enough, it could eventually lead to shipping some kind of "mix" architecture, with layers executed in whatever order fits a certain "vibe"?
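For anyone who wants to play with the copy-the-layers part outside llama.cpp, here's a rough sketch of what a self-merge could look like with Hugging Face transformers. The model name and the block being repeated are placeholders for illustration, not what was actually used above:

```python
import copy
import torch
from transformers import AutoModelForCausalLM

model_name = "mistralai/Devstral-Small-2505"  # placeholder; any Llama-style model works
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

layers = model.model.layers           # nn.ModuleList of decoder layers
repeat_start, repeat_end = 16, 24     # arbitrary middle block, purely for illustration
new_layers = torch.nn.ModuleList(
    list(layers[:repeat_end])
    + [copy.deepcopy(l) for l in layers[repeat_start:repeat_end]]  # the duplicated block
    + list(layers[repeat_end:])
)
# Re-index so each layer writes to its own KV-cache slot (attribute name can vary
# between transformers versions).
for i, layer in enumerate(new_layers):
    if hasattr(layer, "self_attn") and hasattr(layer.self_attn, "layer_idx"):
        layer.self_attn.layer_idx = i

model.model.layers = new_layers
model.config.num_hidden_layers = len(new_layers)
model.save_pretrained("devstral-self-merge")  # convert back to GGUF afterwards if needed
```

Same weights, more forward passes through the repeated block; no retraining, just more compute per token.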
Devstral was the first one I tried. I optimized for math/equations, but that didn't result in any better model; then I added the reasoning part, and that resulted in a "better" model.
I used the Devstral with vibe.cli and it looked sharp to me; the thing didn't fail. I also used the chat to "vibe"-check it and it looked OK to me.
The other thing is that I picked a particular circuit and it was "good", but I don't know if it was a local maximum. I think I ran just around 10 sets through the "fast test harness" and picked the config that gave the highest score... once I had that, I took that model and ran it against llm_eval limited to only 50 tests... again for the sake of speed; I didn't want to wait a week to discover the config was bad.
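In pseudocode, the selection loop is basically this (fast_harness_score and run_llm_eval are hypothetical stand-ins for whatever scoring scripts were actually used; only the two-stage logic mirrors the workflow above):

```python
def pick_best_config(candidate_configs, fast_harness_score, run_llm_eval, subset_size=50):
    """Two-stage search: cheap harness to rank candidates, limited llm_eval to validate the winner."""
    # Stage 1: score every candidate layer layout on the fast test harness.
    scored = [(fast_harness_score(cfg), cfg) for cfg in candidate_configs]
    best_score, best_cfg = max(scored, key=lambda pair: pair[0])

    # Stage 2: validate only the winner on a 50-test slice of llm_eval,
    # so a bad pick surfaces in hours instead of a week.
    eval_result = run_llm_eval(best_cfg, num_tests=subset_size)
    return best_cfg, best_score, eval_result
```

The obvious caveat (as noted) is that ~10 candidates ranked by a small harness can easily land on a local maximum.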
I've been running my own (admittedly naïve) experiments with new, wacky ideas for both LLMs (well, SLMs) and for image super-resolution models.
I'm just trying different kinds of attention mechanisms, different configurations of the network, adding loops... all kinds of wacky ideas. And the really weird thing is that 99% of the ideas I try work at all.
Got mine after my first acid trip (still don't know if it was real acid). It's not debilitating for me, just annoying. So yeah, be careful out there, folks. The acid trip was very cerebral though, and I consider it an important experience in my life, so I'm kind of on the fence about whether it might have been worth the trade-off....
Personally, what I am more interested in is the effective context window. I find that when using Codex 5.2 high, I preferred to start compaction at around 50% of the context window, because I noticed degradation around that point. Though as of about a month ago, that point is now below that, which is great. Anyways, I feel that I will not be using that 1 million context at all in 5.4, but if the effective window is something like 400k context, that by itself is already a huge win. That means longer sessions before compaction, and the agent can keep working on complex stuff for longer. But then there is the question of 5.4's intelligence. If it's as good as 5.2 high, I am a happy camper; I found 5.3 anything... lacking, personally.
Not sure how accurate this is, but found contextarena benchmarks today when I had the same question.
It appears only Gemini has actual context == effective context, going by those numbers. Although I wasn't able to test this in either the Gemini CLI or Antigravity with my Pro subscription because, well, it appears nobody actually uses these tools at Google.
I've been working on building my own voice agent as well for a while and would love to talk to you and swap notes if you have the time. I have many things I'd like to discuss, but mainly, right now, I'm trying to figure out how a full-duplex pipeline like this could fit into an agentic framework. I've had no issues with the traditional route of an STT > LLM > TTS pipeline, as that naturally lends itself to any agentic behavior like tool use, advanced context management systems, RAG, etc... I separate the human-facing agent from the subagent to reduce latency and context bloat, and it works well. While I am happy with the current pipeline, I do always keep an eye out for full-duplex solutions, as they look interesting and naturally feel more dynamic because of the architecture, but every time I visit them I can't wrap my head around how you would even begin to implement one as part of a voice agent. I mean, sure, you have text input and output channels in some of these things, but even then, with their own context limitations, it feels like they could never be anything other than a fancy mouthpiece. But this feels like I'm possibly looking at it from ignorance. Anyways, I would love to talk on Discord with a like-minded fella. Cheers.
For my framework, since I am using it for outgoing calls, what I am thinking is maybe I will add a tool command call_full_duplex(number, persona_name) that will get personaplex warmed up and connected and then pause the streams, then connect the SIP call, attach the audio I/O streams to the call, and return to the agent. Then send the Deepgram and personaplex text in as messages during the conversation and tell it to call a hangup() command when personaplex says goodbye or gets off track, and otherwise just wait(). It could also use speak() commands to take over with TTS if necessary, maybe with a shutup() command first. This needs a very fast and smart model for the agent monitoring the call.
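Roughly, the monitoring loop I have in mind would look like the sketch below. The tool names (call_full_duplex, wait, speak, shutup, hangup) are the ones from the comment above; the transcript queue, the Decision type, and the stubs passed in as arguments are just illustrative glue, not a real API:

```python
import queue
from dataclasses import dataclass

@dataclass
class Decision:
    action: str          # "wait" | "speak" | "hangup"
    text: str = ""       # only used when action == "speak"

# Deepgram (caller side) and personaplex (agent side) text both land here as plain strings.
transcript: "queue.Queue[str]" = queue.Queue()

def monitor_call(agent_llm, call_full_duplex, speak, shutup, hangup, wait,
                 number: str, persona_name: str) -> None:
    """A fast/smart model supervises the full-duplex call and only intervenes when needed."""
    call = call_full_duplex(number, persona_name)   # warm up personaplex, connect SIP, attach audio
    while call.active:
        utterance = transcript.get()                # next line of text from either side
        decision: Decision = agent_llm.decide(utterance)
        if decision.action == "hangup":             # goodbye reached, or conversation off track
            hangup(call)
        elif decision.action == "speak":            # take over with TTS
            shutup(call)                            # silence personaplex first
            speak(call, decision.text)
        else:
            wait()                                  # let the full-duplex model keep talking
```

The supervising model only sees text, so it can stay small and fast; the heavy lifting stays in the full-duplex audio path.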
what's your use case and what specific LLMs are you using?
I'm using stt > post-trained models > tts for the education tool I'm building, but full STS would be the end-game. e-mail and discord username are in my profile if you want to connect!
For me at least it's an interesting project I can take apart and build on top of. I've built my own agent frameworks 100% from scratch and have learned a lot from them. There is something to be said for learning from others' projects as well, and because it's an ever-evolving project with so many contributors, whatever fork of your own you go with, there's a good chance the new goodies will work with your own modified version. For example, I'm looking into LCM right now, and wouldn't you know it, someone ported it to openclaw. But nanobot doesn't have it, so I'm considering working on the LCM port to that. If I succeed, I will learn a lot and also contribute to progress in my own little way.
How does the whole KV cache situation work for diffusion models? Like, are there latency and computation/monetary savings for caching? Is the curve similar to autoregressive caching options? Or maybe such things don't apply at all and you can just mess with the system prompt and dynamically change it every turn because there are no savings to be had? Or maybe you can make dynamic changes to the head but also get cache savings because of the diffusion-based architecture?... so many ideas...
As we approach the singularity, things will get noisier and make less and less sense, because rapid change can look like chaos from inside the system. I recommend folks just take a deep breath and take a look around. Regardless of your stance on whether the singularity is real, or whether AI will revolutionize everything or not, just forget all that noise. Just look around you and ask yourself: do things seem more or less chaotic? Are you able to predict better or worse what is going to happen? How far can your predictions land you now versus, let's say, 10 or 20 years ago? Conflicting signals are exactly how all of this looks: one account says it's the end of the world, another says nothing ever changes and everything is the same as it always was....