Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

XML is a bit more special/first class to Claude because it uses XML for tool calling:

    <antml:invoke name="Read">                                                    
      <antml:parameter name="file_path">/path/to/file</antml:parameter>             
      <antml:parameter name="offset">100</antml:parameter>                          
      <antml:parameter name="limit">50</antml:parameter>                            
    </antml:invoke>
I'm sure Claude can handle any delimiter and pseudo markup you throw at it, but one benefit of XML delimiters over quotation marks is that you repeat the delimiter name at the end, which I'd imagine might help if its contents are long (it certainly helps humans).


the antml: namespace prefix is doing extra work here too -- even if user input contains invoke tags, they won't collide with tool calls because the namespace differs. not just xml for structure but namespaced xml for isolation.


Cannot believe it's efficient. XML is the most verbose and inefficient of communicating anything. The only benefit of XML was to give lifetime work to an army of engineers. The next news will be "Why DTD is so fundamental to Claude".


The point isn't to be efficient. If you train an LLM on code with an example execution trace written in the comments, the LLM gains a better understanding due to the additional context in the data. LLMs don't have a real world model. For them, the token space is the real world. All the information needs to be present in the training data and XML makes it easy because it is verbose and explicit about everything.


When you're tokenizing it does not matter really what you use (how you translate that token to-from a text string), the main thing is the overall number of tokens. XML is particularly amenable to tokenization because it is trivial to represent entire tags as a single token (or a pair of tokens, one for the open tag, one for the close).

It gets a bit muddier with attributes, but you can still capture the core semantics of the tag with a single token. The model will learn that tag's attributes through training on usages of the tag.


How well do we understand the tokenization for Claude? I'd posit that the exact human-representation of this markup is likely irrelevant if it's all being converted into a single token.


"<" ">" and "/>" are indeed single tokens.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: