<li>item<li> Is not valid HTML, it's merely valid grammar syntax for a loose par...

kristoff_it · on Sept 16, 2024

The example omits the context around those <li>s but you can assume they're inside a <ul>. That's semantically valid HTML because </li> can be omitted.

But despite being valid and unambiguous for the parser, it can still lead to confusing problems for unsuspecting developers:

    <ul>
      <li>item1</li>
      <li>item2<li>
      <script id="this-script">
         let ul = document.getElementById("this-script").parentElement;
         console.log(ul.tagName);  // prints "LI"
         // *confused screams by the developer*
      </script>
    </ul>
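To make the surprise above concrete, here is a rough sketch of the one rule involved: a new <li> start tag implicitly closes an <li> that is still open. This is not the spec's tree-construction algorithm, just a toy that handles the tags in the snippet; `buildTree`, `findById` and the token shapes are names made up for illustration.

```javascript
// Toy tree builder: a simplified sketch, NOT the full HTML parsing algorithm.
// It only understands start tags, end tags and text for this one example.
function buildTree(tokens) {
  const root = { tag: "#root", id: null, children: [], parent: null };
  const stack = [root]; // stack of open elements
  const current = () => stack[stack.length - 1];

  for (const tok of tokens) {
    if (tok.type === "start") {
      // The rule being illustrated: a new <li> closes an <li> that is still open.
      if (tok.tag === "li" && current().tag === "li") stack.pop();
      const node = { tag: tok.tag, id: tok.id ?? null, children: [], parent: current() };
      current().children.push(node);
      stack.push(node);
    } else if (tok.type === "end") {
      // Pop back to the matching open element, closing anything left open inside it.
      const idx = stack.map((n) => n.tag).lastIndexOf(tok.tag);
      if (idx > 0) stack.length = idx;
    }
    // Text tokens do not affect the tree shape here, so they are ignored.
  }
  return root;
}

// Depth-first search for a node with the given id.
function findById(node, id) {
  if (node.id === id) return node;
  for (const child of node.children) {
    const hit = findById(child, id);
    if (hit) return hit;
  }
  return null;
}

// Hand-written token stream for the snippet above (assumed, not produced by a real tokenizer).
const tokens = [
  { type: "start", tag: "ul" },
  { type: "start", tag: "li" }, { type: "text", data: "item1" }, { type: "end", tag: "li" },
  { type: "start", tag: "li" }, { type: "text", data: "item2" },
  { type: "start", tag: "li" }, // the stray <li>: closes "item2" and opens a third item
  { type: "start", tag: "script", id: "this-script" },
  { type: "end", tag: "script" },
  { type: "end", tag: "ul" },
];

const tree = buildTree(tokens);
const script = findById(tree, "this-script");
console.log(script.parent.tag.toUpperCase()); // prints "LI"
```

The stray <li> does not close anything by itself being an end tag; it opens a third list item, and the script ends up as that item's child.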

tuyiown · on Sept 16, 2024

OK, so that's a specific rule for <li>, one of several rules for omitting closing tags. That was ignorance on my part: I overlooked the "sometimes" in the "It's valid HTML because the spec allows you to omit closing tags sometimes" comment.

dmsnell · on Sept 16, 2024

> Is not valid HTML, it's merely valid grammar syntax for a loose parser.

It's an incredible journey writing a spec-compliant HTML parser. One of the things that stands out from the very first steps is that the "loose parser" is kind of a myth.

Parsing HTML is fully-specified. The syntax is full of surprises with their own legacy, but every spec-compliant parser will produce the same result from the same input. HTML is, in a sense, a shorthand notation for a DOM tree - it is not the tree itself.

The term "invalid HTML" also winds up fairly meaningless, as HTML errors are mainly there as warnings for HTML validators and are unnecessary for general parsing and rendering.

And these are things we can't easily say about XML parsers. There are certain errors from which XML processors are allowed to recover, but which ones those are depends on which parser is run.

---

> I do like adding restrictions on confusing patterns with no known legitimate use cases or better alternatives.

HTML was based loosely on SGML, a language designed to encode structure in a way that humans could easily type. Particular care was taken in SGML to allow syntax "minimizations" (omitted tags, for example), so that humans would overcome the effort to encode the required structure. It was noted in the spec that if people had to type every single tag they would likely give up. They did.

But SGML also had well-specified content models in the DTD, formalizing features like optional tags, short tags, tags derived from content templates, default attribute values. Any compliant SGML parser could reconstruct the missing syntax by parsing against that DTD.
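As a rough sketch of what "reconstructing the missing syntax" from a content model can look like: the `dtd` table below is a hypothetical, hand-written stand-in for a real DTD (two elements only), and `normalize` is a made-up helper, not any SGML parser's API. The idea is that when the next token is not allowed inside the current element, and that element's end tag is declared omissible, the parser closes it on the author's behalf.

```javascript
// Hypothetical mini "DTD": which children each element allows, and whether
// its end tag may be omitted. Real SGML DTDs are far richer than this.
const dtd = {
  ul: { children: ["li"], endOmissible: false },
  li: { children: ["#text", "ul"], endOmissible: true },
};

// Takes a flat list of tags/text and returns markup with omitted end tags restored.
function normalize(tokens) {
  const open = []; // names of currently open elements
  const out = [];
  const top = () => open[open.length - 1];

  for (const tok of tokens) {
    if (tok.startsWith("</")) {
      const tag = tok.slice(2, -1);
      // Close any omissible elements still open inside the one being ended.
      while (open.length && top() !== tag) {
        if (!dtd[top()].endOmissible) throw new Error(`missing </${top()}>`);
        out.push(`</${open.pop()}>`);
      }
      if (!open.length) throw new Error(`unexpected ${tok}`);
      open.pop();
      out.push(tok);
    } else {
      const name = tok.startsWith("<") ? tok.slice(1, -1) : "#text";
      // If the current element cannot contain this token, infer its end tag.
      while (open.length && !dtd[top()].children.includes(name) && dtd[top()].endOmissible) {
        out.push(`</${open.pop()}>`);
      }
      if (open.length && !dtd[top()].children.includes(name)) {
        throw new Error(`${tok} not allowed inside <${top()}>`);
      }
      out.push(tok);
      if (name !== "#text") open.push(name);
    }
  }
  return out.join("");
}

const restored = normalize(["<ul>", "<li>", "item1", "<li>", "item2", "</ul>"]);
console.log(restored); // prints "<ul><li>item1</li><li>item2</li></ul>"
```

The point of the sketch is only that the rules live in data (the content model), so any parser reading the same table arrives at the same fully tagged document.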

HTML missed out on this and effectively the DTD was externalized in the browsers. An effort was made to produce a proper SGML DTD for HTML, but it was too late. Perhaps if there had been a widely available SGML spec and parsers at the time HTML was created, the story would be different.

Needless to say, these patterns are the result of formal systems taking human factors into their designs. XML came later as a way to make software parsers easier, largely abandoning the human factors and use-case of people writing SGML/HTML/XML in a text editor.

SGML is still rather fun to write and many of these minimization features are far more ergonomic than they might seem at first. If you have a parser that properly understands them, they are basically just convenient macros for writing XML.

tuyiown · on Sept 17, 2024

Yes, thanks for pointing out that "valid" should not be thrown around too easily. And it turns out that I made a mistake and the snippet is actually valid, a pattern shared with a small set of other exceptions, exactly as you point out!

Thanks for pointing out key aspects of the story, I had a loose knowledge about it.