Look at this: "<p<a href="/">first part of the text</> second part". This is a valid document fragment in HTML 4.01 because HTML 4.01 is defined as an SGML application.
Writing a correct XML parser is much easier than writing a correct SGML parser, and what's more important, it's much easier to recognize errors.
I agree with OP that HTML5 should have been XML from the start. Nowadays, you hardly write any HTML by hand and even if you do, it's easy to write syntactically correct XML.
It's true that you can convert any HTML into XML with ease but it's still a stupid, unnecessary step.
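To make the "easier to recognize errors" point concrete, here is a minimal sketch using Python's standard `xml.etree.ElementTree`. The `EXPANDED` string is my own spelled-out reading of the SGML shorthand above (unclosed start tag, empty end tag, omitted `</p>`), and `is_well_formed_xml` is a hypothetical helper name, not something from an existing tool.

```python
import xml.etree.ElementTree as ET

# The SGML shorthand quoted in the comment above.
SHORTHAND = '<p<a href="/">first part of the text</> second part'

# Assumed XML spelling of the same fragment, with every tag written out.
EXPANDED = '<p><a href="/">first part of the text</a> second part</p>'


def is_well_formed_xml(text: str) -> bool:
    """Return True if an XML parser accepts the text, False otherwise."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        # An XML parser must stop at the first well-formedness error.
        return False


if __name__ == "__main__":
    for label, fragment in (("shorthand", SHORTHAND), ("expanded", EXPANDED)):
        print(label, is_well_formed_xml(fragment))
```

The XML parser rejects the shorthand outright instead of guessing at what was meant, which is the kind of error reporting an SGML parser cannot give you as cheaply.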
> I agree with OP that HTML5 should have been XML from the start.
The key requirement for HTML5, and the reason it succeeded where XHTML had limited success, was that existing HTML docs had to work with it. That is why it has both an HTML and an XML format.
It was not wrong for it not to be pure XML; it was absolutely necessary.
> You could write XHTML 1.0 documents that were backwards compatible to browsers that only understood HTML 4.01.
You could and a lot of people _tried_, or at least pretended to. But the vast majority of documents that tried to do this failed to actually be well-formed XML, for various reasons... In practice, even restricting parsing as XML to cases when the page was explicitly sent with the application/xhtml+xml MIME type would leave a browser with problems when sites sent non-well-formed XML with that MIME type. This was a pretty serious problem for Gecko back in the day when we attempted to push XHTML usage (e.g. by putting "application/xhtml+xml" ahead of "text/html" in the Accept header). So we stopped pushing that, since it was actively harming our users...
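As a rough illustration of the failure mode (not how Gecko or any particular server worked), here is a sketch of a server-side guard that only answers with application/xhtml+xml when the client ranks it ahead of text/html and the body actually parses as XML. The function names and the simplified Accept parsing are my own assumptions.

```python
import xml.etree.ElementTree as ET

XHTML = "application/xhtml+xml"
HTML = "text/html"


def is_well_formed(body: str) -> bool:
    """Return True if the body parses as XML."""
    try:
        ET.fromstring(body)
        return True
    except ET.ParseError:
        return False


def preferred_types(accept_header: str) -> list[str]:
    """List media types from an Accept header, highest q-value first.

    Simplified: ties keep header order, and entries with q=0 are dropped.
    """
    ranked = []
    for position, part in enumerate(accept_header.split(",")):
        fields = [field.strip() for field in part.split(";")]
        media_type = fields[0].lower()
        if not media_type:
            continue
        quality = 1.0
        for param in fields[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    quality = float(value)
                except ValueError:
                    quality = 0.0
        if quality > 0:
            ranked.append((-quality, position, media_type))
    return [media_type for _, _, media_type in sorted(ranked)]


def choose_content_type(accept_header: str, body: str) -> str:
    """Pick XHTML only if the client prefers it and the body is well-formed."""
    types = preferred_types(accept_header)
    prefers_xhtml = XHTML in types and (
        HTML not in types or types.index(XHTML) < types.index(HTML)
    )
    if prefers_xhtml and is_well_formed(body):
        return XHTML
    return HTML
```

Sites that skipped a check like this and sent broken markup with the XML MIME type are exactly the ones that produced parse errors for users whose browser asked for XHTML first.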
The point is that this hasn't happened; not back in XML's heyday, and much less today. Now you can bemoan XML's demise until the end of time, or you can fall back to XML's big sister SGML. As I said, SGML has lots of features over XML that are in fact desirable for an authoring format, such as Wiki syntaxes, type-safe/injection-free templating, stylesheets, etc. on top of being able to parse HTML. Many of these features are being reinvented in modern file-based CMSs and static site generators, so there's definitely a use case for this. Whereas editing XML (a delivery rather than authoring format) by hand is quite cumbersome, verbose and redundant, yet still doesn't help at all in how text content is actually created on the web.
Is SGML even still used? The only use case I remember besides HTML is DocBook, and that of course has also had an XML variant for a long time.
SGML is needlessly complex as an authoring format. Even HTML was considered too complex and that's why we got lightweight markup languages like Markdown and AsciiDoc.
I would be very surprised if we ever turn back to something like SGML. Especially as there are well-designed LMLs such as AsciiDoc or reStructuredText.
To give you an idea of what SGML is capable of, see my tutorial at [1]. It implements a lightweight content app where markdown syntax is parsed and transformed into HTML via SGML short references, then gets HTML5 sectioning elements inferred (i.e. the HTML5 outlining algorithm is implemented in SGML), then gets rendered as a page with a table of contents nav-list linking to full body text, and with HTML boilerplate added, all without procedural code.
SGML was in fact designed to be typed by hand, as an evolution of earlier mainframe markup languages at IBM. The idiosyncratic shortcut features are supposed to reduce the number of keystrokes needed for entering text.