When Cosine fine-tunes a model to be generally better at coding than GPT-4o, that leaves me vaguely confused about the role each part plays in building the best LLMs (and I would not be shocked to learn that this confusion is also SOTA)
I don’t think it’s all that surprising. VRAM is much smaller than all the data the bot was trained on, so some stuff gets dropped or “lossily compressed”. In the case of foundation models, this is likely a bit of everything.
Fine-tunes take that and change what gets dropped or compressed. Rather than forgetting things somewhat evenly, a fine-tune forgets very little of the knowledge it targets (like coding) and in exchange drops much more of “everything else”.
I don’t believe it literally works this way, but in a sense it’s as if GPT-4o is a 400B-parameter model that only devotes 20B parameters to coding because the rest are taken up by Wikipedia, knowing French, and whatnot. A 70B coding fine-tune might be able to devote more than 20B parameters to coding, in exchange for only speaking English, having very little encyclopedic knowledge, etc.
It’s kind of like CPUs vs. ASICs: CPUs (GPT-4o) perform better on average across random tasks, while ASICs (fine-tunes) do dramatically better on the task they were built for, even with a lower transistor count.
I am not an expert or even particularly knowledgeable about this, so treat the following as “armchair speculation by an enthusiast”.
Wikipedia says that as of 2023, each expert is ~10 billion parameters, so they’re still much smaller than something like Deepseek Coder 70B.
I’m not sure why they’re so small, though. I don’t know whether there’s some architectural issue or superlinear scaling somewhere preventing them from growing, or if they’re just trying to keep inference costs down.
GPT-4o has seen a lot of examples of what people write about coding on the web, what code exists, and what tasks people want to do with it. But that general data doesn't include the full process of coding: get a bug report, look at the codebase, make changes, test them and see exactly what bugs remain, and iterate. That process is what SWE-bench tests.
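As a toy sketch of that edit-test-iterate loop, here everything (the "codebase", the "model") is a stand-in I made up, not how SWE-bench or any real agent is implemented:

```python
# Toy sketch of the bug-report -> edit -> test -> iterate loop that
# SWE-bench-style benchmarks exercise. The "codebase" is a dict of
# source strings and the "model" is a hard-coded stub (assumptions).

def run_tests(codebase: dict) -> list:
    """Pretend test runner: returns a list of failure messages."""
    scope = {}
    exec(codebase["math_utils.py"], scope)  # load the "module"
    failures = []
    if scope["add"](2, 3) != 5:  # our one "test case"
        failures.append("add(2, 3) != 5")
    return failures

def propose_patch(codebase, bug_report, failures):
    """Stand-in for the model: a real agent would read the report,
    browse the files, and emit a diff; this one just knows the fix."""
    return "math_utils.py", "def add(a, b):\n    return a + b\n"

codebase = {"math_utils.py": "def add(a, b):\n    return a - b\n"}
bug_report = "add() returns the wrong sum"

for attempt in range(5):
    failures = run_tests(codebase)
    if not failures:
        break  # tests pass, we're done
    path, new_source = propose_patch(codebase, bug_report, failures)
    codebase[path] = new_source  # apply the patch and loop again
```

General web data shows the model lots of finished code, but very few traces of this loop, which is why extra fine-tuning on process data helps.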
It's possible OpenAI did some coding fine-tuning themselves; Meta's Llama 3 paper [0], section 4.3.1, describes what sort of work is needed. However, anything OpenAI did is based on their own tooling and assumptions: how the existing code is fed into the LLM, what set of actions the LLM can take (e.g. look up documentation), what language the output code is written in, etc. Cosine's LLM framework may do things differently and have different features, so you'd need to fine-tune the LLM to take maximum advantage of the framework.
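To make the "different framework, different assumptions" point concrete, here's a made-up illustration: two hypothetical frameworks expose the same action (read a file) but expect the model to emit it in different formats, so a model fine-tuned on one format will waste effort (or fail outright) on the other:

```python
# Hypothetical action formats; both are invented for illustration and
# don't correspond to any real framework's actual syntax.
import json

action = {"tool": "read_file", "path": "src/main.py"}

# Framework A: the model must emit single-line JSON tool calls.
framework_a = json.dumps(action)

# Framework B: the model must emit a bespoke tagged text format.
framework_b = "<action name={}>{}</action>".format(action["tool"], action["path"])

print(framework_a)  # {"tool": "read_file", "path": "src/main.py"}
print(framework_b)  # <action name=read_file>src/main.py</action>
```

Same underlying capability, different surface syntax; fine-tuning teaches the model the exact syntax (and action vocabulary) its harness will parse.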
It's like dropping the LLM down in front of Vim when it has only ever used or even heard of Notepad (or even Emacs); there needs to be some training to make it work well with the new tools it has.