6th March, 2007

It's Just Not XHTML

Tuesday, 9:03 pm in CodeGirl

So as usual the other day I was watching my GARO, catchin’ my Pokemans and thinking about web standards.  Specifically about XHTML.  Now, I suppose if you’re a regular reader of this site (do such things exist? Hah!) you might have gotten the impression that I don’t like XHTML.  You also might be wrong about that.  It’s true that I do have a problem with unthinking standards Nazis, but that’s true of anyone, anywhere who is uncritical[1] of anything.  I’ve said it before and I’ll say it again; XHTML is a good idea which unfortunately suffers from bad genes and a dodgy current implementation.  Realistically, I think that some of the things XHTML is trying to achieve simply aren’t going to happen as long as legacy support for HTML is still required.  What XHTML really needed was to be re-written as an entirely new language; something which was more-or-less unrealistic and would have had the same teething problems as, say, the jump between IPv4 and IPv6.  That doesn’t mean that XHTML has failed, though, only that it has some issues.

So what is the point of XHTML?  And what does it do that HTML doesn’t?

Validation

HTML is a messy syntax; it’s also extremely forgiving.  Ironically, this ‘problem’ with HTML is also probably its greatest strength since it is the one single thing that makes it extremely accessible.  Should you quote attributes in HTML or not?  Should you close a <p> tag or not?  How about a <br>?  Uppercase or lowercase?

This laxness is almost unique to HTML.  In most programming languages[2], your syntax is either correct or it is not.  If it is not correct – if your variables are named wrong, if you’ve forgotten a semi-colon, if you’ve misspelt while() – then your compiler is going to give you an error and your program will simply fail.  Similarly, algorithmic and architectural failures also often produce measurable errors, albeit often less obvious ones (memory leaks, security flaws, et cetera).  HTML is traditionally much more relaxed with regard to its syntax, and is very accepting of variants.  Variant syntax under HTML was, more often than not, a personal stylistic choice.  Most browsers didn’t care whether you quoted attributes or not.  Most browsers didn’t even care if you nested tags correctly or not.  And most browsers didn’t care whether all your inline elements were correctly nested inside a block element.
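To see that forgiveness in action, here is a minimal sketch using Python’s standard html.parser module as a stand-in for a browser’s HTML parser.  That is an assumption for illustration only; real browsers do far more error recovery than this.  TagCollector and collect_start_tags are names made up for the example.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Records every start tag the lenient HTML parser reports."""

    def __init__(self):
        super().__init__()
        self.start_tags = []

    def handle_starttag(self, tag, attrs):
        # The parser hands us lower-cased tag and attribute names.
        self.start_tags.append((tag, dict(attrs)))


def collect_start_tags(markup):
    parser = TagCollector()
    parser.feed(markup)
    parser.close()
    return parser.start_tags


# Upper-case tags, an unquoted attribute, misnested <B>/<I>, unclosed <P> and <br>.
SOUP = "<P CLASS=intro>Hello <B><I>world</B></I><br>"

for tag, attrs in collect_start_tags(SOUP):
    print(tag, attrs)
```

No exception is raised anywhere; the parser shrugs, normalises what it can and carries on, which is the behaviour described above.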

XHTML does care, due to its roots in XML.  XML and HTML share the same ancestor, SGML; however, XML was an attempt to enforce much stricter syntactic rules on documents, primarily so they would be easier for a non-human parser to read.  Something like XML-RPC (sending remote commands between machines using XML; I use it to mirror void-star.net to my LiveJournal, but things like Trackbacks also work on a similar concept) would not be possible if XML were as variant as HTML, simply because the parser would have to be vastly more complex.  Think of an average HTML parser (your browser) compared to an average XML parser (the two PHP files I use at sk.log).  An XML document is either well-formed or it is not; unlike HTML, there is no ‘guessing’ that is done for alternate or ambiguous syntax.

In this vein, XHTML attempts to take a lot of the syntactic ambiguity out of HTML.  Tags and attributes are case sensitive (and should be lower case).  All attributes must be quoted.  All tags must be closed.  Tags must be correctly nested.  Strict syntax means that XHTML browsers can be much, much smaller than traditional HTML-based browsers, and therefore better suited to PEDs such as my Nintendo DS.
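The rules above can be poked at with any XML parser.  The following is a rough sketch using Python’s standard xml.etree.ElementTree; is_well_formed and the sample strings are invented for this example.  Note that it only checks XML well-formedness, not validity against the XHTML DTD, so it will not complain about upper-case tag names as long as they match.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup):
    """Return True if an XML parser accepts the markup, False otherwise."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True


SAMPLES = {
    "quoted, closed, nested": '<p class="intro">Hello<br/></p>',
    "unquoted attribute": "<p class=intro>Hello</p>",
    "unclosed <br>": "<p>Hello<br></p>",
    "misnested tags": "<p><b><i>Hi</b></i></p>",
    "case mismatch": "<P>Hi</p>",
}

for label, markup in SAMPLES.items():
    print(label, "->", is_well_formed(markup))
```

Only the first sample survives; every other one is rejected outright rather than guessed at.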

There’s a downside to this, of course.  Under the real XHTML spec (I’ll get to that in a minute), an XHTML browser is supposed to critically fail if it encounters a malformed document.  There are a lot of things that can malform documents.  For those of you who use logware clients like WordPress.  For those of you who allow user input.  For those of you who post quizzes in your blog (do people still do that?  I assume so).  XSS attacks relying on dodgy user input are bad enough, but it’s easy to foresee a situation where a new type of XSS/DoS attack – using malformed XHTML to completely shut down websites – starts becoming more and more popular.
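As a hedged illustration of how little it takes, the sketch below drops a (made-up) hostile comment into a page fragment twice: once raw, once escaped with Python’s html.escape.  render_comment and is_well_formed are hypothetical helpers, and escaping alone is not a complete defence (stray control characters, for one, can still upset an XML parser).

```python
import html
import xml.etree.ElementTree as ET


def is_well_formed(markup):
    """Return True if an XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True


def render_comment(user_text):
    """Wrap untrusted text in a <p>, escaping markup characters first."""
    return "<p>" + html.escape(user_text, quote=True) + "</p>"


hostile = "lol <b>unclosed & broken"

raw_fragment = "<p>" + hostile + "</p>"      # what a careless script would emit
safe_fragment = render_comment(hostile)      # &, < and > turned into entities

print(is_well_formed(raw_fragment))
print(is_well_formed(safe_fragment))
```

Under a forgiving HTML parser the raw version merely looks ugly; under a strict XHTML parser it is enough to take the whole page down.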

Form & Function

The web was always visual, but it wasn’t always aesthetic.  As modems became faster, content became richer – images, embedded media, Flash and Shockwave – and the default, browser-dictated appearance of tags was hardly what people were looking for.  ‘Context’ tags fell out of favour; who wanted a default serif 14pt bold Heading 1 when you could have a bright pink Verdana heading in wide-spaced caps?  Everything was <p>, <font> and table-spliced layouts.

I’m internet-old enough to remember the first time I ever saw a style sheet; it was on a Chaos! Comics fansite by a woman named Chaosmom and she used it to change her navigation text on mouse-over to be yellow and bold.  For a long time, that sort of thing was the primary use of CSS; visual ‘tricks’ like making hover changes on links or changing the mouse cursor or site scrollbars.  Of course, none of that was really what CSS was actually for.

What CSS was really designed to do was attempt to provide a layer of separation – common in enterprise-level software development – between content and display.  In other words, CSS provides the look while HTML/XHTML provides the content.  Anyone who’s played with a layout at somewhere like Gaia Online[3] will tell you that you can do some truly amazing transformations.  Despite the various… eccentricities of this style of website development (which you’ve all heard me rage about previously I’m sure), it’s a truly powerful tool not just visually but also with regard to accessibility.  Load up void-star.net without a stylesheet, and it strips itself down to the extreme bare-bones of raw content.  Incidentally, the internet proxy where I work is extremely flaky, and has a habit of throwing 504 Gateway Timeout errors; the net result is it often loads pages without the requisite stylesheets.  There’s something so incredibly guiltily pleasurable about looking at such ‘naked’ websites… or maybe that’s just me.

The growing adoption of CSS has, unsurprisingly, coincided with the re-emergence in popularity of true markup tags (as opposed to ‘display’ tags like <font>).  ‘Raw’ content-based markup might not look pretty, but it does wonders for accessibility, and as more and more devices from mobile phones to gaming consoles get stripped-down embedded browsers that’s only going to become more relevant, not less so.

But, wait a moment.  Aren’t we forgetting something?

HTML vs XHTML

But with all this going on, HTML is not yet dead.  Actually, it’s probably less dead than ever and I’m going to share a secret with you about why.  It’s something you might not know.  I certainly didn’t know it, and it’s an easy titbit to blink and miss.  It’s also scandalous; viciously so.

Your XHTML is not valid.

I don’t care what your w3 validator link says.  I don’t care about your doctype.  I don’t care about any of these things, you see, because you are not using XHTML at all.  You are using HTML.  You are using HTML’s native syntactic flexibility to attempt to display XHTML, but what you’re actually displaying is HTML.  And you’re displaying it wrongly; force-validate your ‘XHTML’ site as HTML 4.01 Strict and wince at the errors it throws.  Those are all the errors your browser and every single browser that visits your site is having to deal with; silently and internally.  You know how you’re using your oh-so XHTML <img /> tags?  I wonder if people realise that non-content tags are supposed to be written like <this/>, not <this />?  The reason we use the latter is a hack; because your browser is displaying your page as HTML, not XHTML, it essentially treats the ‘space-slash’ as a tag attribute.  Since it is an unknown attribute, it is ignored.  You read that right; it’s ignored.  Your trailing slash has no effect whatsoever, aside from causing silent internal errors in your browser.

Your browser doesn’t give a shit about your doctype.  Your browser cares about your site’s mime-type, and your site’s mime-type is saying that it’s HTML.  Invalid HTML.

So what’s a mime-type?  Very simply, it’s a label sent along with your file (in the HTTP Content-Type header) that tells the browser what kind of file it is.  Generally, this is automatically provided for you by the webserver, though if you’ve done work with scripting you should know you can force a mime-type with an appropriate function call (indeed, in some older languages you have to manually send appropriate mime-types for your scripts; newer ones like PHP do it automatically for you).  This is how you do things like generate images and PDFs with PHP files.  The ‘default’ mime-type for HTML is text/html, and by default it’s the header that’s sent with .php and .html files.  When a browser sees a document with text/html it fires up its HTML parser; and that’s the content you’re used to seeing.

‘Real’ XHTML, on the other hand, uses the application/xhtml+xml mime-type.  When your browser sees this, it fires up a different parser; the XHTML parser.  If it has one; as of IE6, Microsoft had no support for XHTML at all, and any document served with that particular mime-type would prompt the user for a download.  HTML and XHTML parsers are fundamentally different on an OS level, and to be honest XHTML parsers are still simplistic at best (they’re arguably where browsers were in the late ’90s).  Firefox’s XHTML parser, for example, doesn’t support incremental page loading; it’s the whole page or none of the page.  How do you tell if you’re seeing a page parsed as HTML or XHTML?  Well, you don’t.  Unless there’s an error; an XHTML-over-HTML page will simply ignore errors, while real XHTML will throw an error and die (similar to the error you’ll get in a malformed XML page).
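The decision described above can be caricatured in a few lines.  This is a toy model only, not how any real browser is written; parser_for and its three return values are invented for the example.  The point is that the choice hangs on the Content-Type header and nothing in the document itself.

```python
def parser_for(content_type):
    """Pick a parser from a Content-Type header value; the doctype never enters into it."""
    # Drop parameters such as "; charset=utf-8" and normalise case.
    mime = content_type.split(";", 1)[0].strip().lower()
    if mime == "text/html":
        return "html"        # forgiving tag-soup parser
    if mime in ("application/xhtml+xml", "application/xml", "text/xml"):
        return "xml"         # strict parser: the first error is fatal
    return "download"        # type not understood: offer to save the file


print(parser_for("text/html; charset=utf-8"))
print(parser_for("application/xhtml+xml"))
```

A browser with no XHTML support at all behaves as if the second branch did not exist, which is how you end up with the download prompt.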

And none of this is even going into the issue of deprecated JavaScript elements in true XHTML (bye bye document.write()).

You can certainly force your webserver to display your XHTML using the correct mime-type.  There’s a .htaccess fix:

AddType application/xhtml+xml .xhtml .php

But of course that will cut out anyone using IE and most older alternate browsers.  There’s also a scripting option:

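// Serve real XHTML only to browsers whose Accept header advertises support for it.
if( isset( $_SERVER["HTTP_ACCEPT"] ) && stripos( $_SERVER["HTTP_ACCEPT"], "application/xhtml+xml" ) !== false )
  header( "Content-Type: application/xhtml+xml" );
else
  // Everyone else (IE included) gets the same markup as plain HTML.
  header( "Content-Type: text/html" );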

Which won’t.  Of course, the real question is why do this at all?  Why are we so obsessed with XHTML?  Undoubtedly it’s the direction that the web is heading in – that’s fine – but, like the fight between IPv6 and IPv4, currently the only options for early adopters of the new technology are to either cut out the vast majority of people who don’t comply or to rely on substandard hacks like XHTML-over-HTML.
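One caveat on the scripting option: it is a plain substring match, so a browser that sent application/xhtml+xml;q=0 (meaning ‘please don’t’) would still be handed XHTML.  Below is a slightly more careful sketch of the same idea, written in Python for illustration; accepts_xhtml and choose_content_type are hypothetical names, and this is not a full Accept-header parser.

```python
def accepts_xhtml(accept_header):
    """True if the header lists application/xhtml+xml with a q-value above zero."""
    for item in accept_header.split(","):
        parts = [p.strip() for p in item.split(";")]
        if parts[0].lower() != "application/xhtml+xml":
            continue
        q = 1.0  # a missing q parameter means full preference
        for param in parts[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # unparseable q-value: treat as "not accepted"
        return q > 0
    return False


def choose_content_type(accept_header):
    """Mirror the if/else above: XHTML for those who ask for it, HTML otherwise."""
    if accept_header and accepts_xhtml(accept_header):
        return "application/xhtml+xml"
    return "text/html"


print(choose_content_type("text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"))
print(choose_content_type("text/html, */*"))
```

Either way the fallback is the same: anything that does not explicitly ask for XHTML gets text/html.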

Of course, it would all go away if only we could bring ourselves to go back to HTML 4.01.  Maybe, in this case, the way forward really is to go back.  At least for now.

  1. Let’s get this straight. Not being ‘uncritical’ does not equate to disliking everything, and no-one of reasonable intelligence over the age of 7 should think that it does.  I’m sure that most of us love our parents but by the time we reach a certain age we stop being uncritical of them; it’s called growing up, and everyone does it. ^
  2. HTML is a mark-up language, not a programming language.  Very loosely, programming is for defining what stuff does, mark-up is for defining what stuff is. ^
  3. Erm, let’s ignore the ‘Interests’ section.  Seems they’ve changed the classes a bit since I did the layout. ^

Comments

  1.

    Thanks for this, I didn’t know about all that. I did know that you aren’t “properly” serving XHTML without sending the application/xhtml+xml header (and sending it as a meta does not count!), but I didn’t know it was so bad for browsers. I’ve successfully got mine to echo the correct headers now (I’m too used to XHTML to go back to HTML, unfortunately… That and I use a lot of nl2br() in PHP, which automatically adds <br />, so that would invalidate everything. Plus I don’t *really* want to go back and edit all my stuff again) but it’s certainly something I will be thinking about carefully in the future.

    To me it seems like XHTML was never quite finished; the W3C sort of started on it in 1999 or whenever it was, and never really got round to doing much with it. I hope the long-awaited release of XHTML 2.0 does at least attempt to do something a bit better than XHTML 1.x has so far done (break browsers and confuse people).

  2.

    but I didn’t know it was so bad for browsers

    It isn’t.  That’s the whole point; the HTML parser has been deliberately designed to not care about syntactic variations.  Serving things as application/xhtml+xml isn’t the ‘answer’ either.  Mozilla specifically recommends against it.  And so do various other authors, while simultaneously pointing out the serious pitfalls of even thinking of switching.

    To be honest with you, I don’t really give a crap about web standards myself; I think they’re a mess and until (hah) they get cleaned up I don’t think they’re really worth adopting as anything other than a curio/challenge.  The fact of the matter is there’s nothing XHTML does that strict HTML can’t.  Real HTML has the same validation (it’s not browser-enforced, but neither is most people’s ‘XHTML’) and the same separation of function and form.  These are the two things most of the XHTML bleaters get excited about, and I just find it hilariously amusing that the whole time they’re going on and on and on about standards (and, let’s face it, often attacking other websites that aren’t ‘compliant’) their own pages are horrible, HTML-invalid tag soup messes.  Because, when it comes down to it, most people who pretend they’re experts in this sort of thing don’t know jack.  I didn’t; and now that I do, I’m not sure what to do with it.  (I’m honest-to-gods considering re-writing the whole backend of sk.log to detect whether a user’s browser has an XHTML parser or not, and serving either XHTML or HTML content as appropriate.  The one problem being that I’m now not 100% sure of the wisdom of using XHTML at all.)

    To me it seems like XHTML was never quite finished

    From a design specification point of view, XHTML is fine.  Where it falls down is in implementation, since browser support is either substandard (Mozilla) or non-existent (IE). XHTML 2.0 will be no different; like I said, the design specifications for the technologies aren’t the problem.  But the W3C doesn’t make browsers, and it can’t force either browser-makers or users to adopt its specifications.  The same thing, incidentally, goes on elsewhere.  The whole XHTML/HTML thing reminds me very much of IPv6/IPv4.

    I’m not sure if you know the history, but about ten years ago people started panicking that we would run out of IP addresses.  This mostly had to do with the fact that there were suddenly exponentially more devices attempting to connect to networks than the designers had imagined, and the fact that large swathes of addresses had been assigned to single entities (e.g. Xerox).  Now, according to IPv4 router specifications, every machine on a network should have a unique address.  But there weren’t, and aren’t, enough addresses (4,294,967,296 to be exact) and so ‘interim’ measures such as NAT and dynamic IPs were invented.  NAT is the system pretty much everyone in the universe now uses where you have one ‘outward’ facing machine (the gateway) which has your internet IP, and an internal network using one of the reserved private networks (usually either 192.168.xxx.xxx or 10.xxx.xxx.xxx).  Now, this technically breaks IPv4 specifications but it works, and everyone uses it.
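    For anyone who wants to check the arithmetic, here is a small sketch using Python’s standard ipaddress module.  RFC1918 and is_rfc1918 are names made up for the example; the three ranges listed are the reserved private blocks.

```python
import ipaddress

# IPv4 addresses are 32 bits wide, so the whole space is 2**32 addresses.
ipv4_total = ipaddress.ip_network("0.0.0.0/0").num_addresses
print(f"{ipv4_total:,}")

# The reserved private blocks that NATed internal networks draw from.
RFC1918 = [
    ipaddress.ip_network(block)
    for block in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")
]


def is_rfc1918(address):
    """True if the address sits inside one of the private blocks above."""
    ip = ipaddress.ip_address(address)
    return any(ip in network for network in RFC1918)


print(is_rfc1918("192.168.0.10"))
print(is_rfc1918("8.8.8.8"))
```

    Any number of separate LANs can reuse the same private addresses at once, which is exactly why NAT stretches the public pool so far.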

    Meanwhile, some busy bees were working on IPv6, which uses longer addresses (128 bits rather than 32) and therefore has more potential addresses.  The problem with IPv6 is that it is not compatible with IPv4.  At all.  People started to realise that to implement it, you’d need to have two separate internets; the IPv4 internet and the IPv6 internet.  The Catch-22 was that no-one would shift address spaces unless there was content in the new internet, and there’d be no content in the new internet unless people switched address spaces.  So various tunnelling technologies were invented (passing IPv6 packets through IPv4 networks and vice-versa), which brought a whole plethora of other issues.  Most hardware nowadays supports both types of packets, but IPv6 is still not widely adopted.  The biggest push, incidentally, is coming from China with the US government (specifically defence-related areas) a little bit further back.
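    To put a number on ‘more potential addresses’, here is a quick sketch with the same standard ipaddress module.  The two sample addresses come from the ranges reserved for documentation, and address_space is a helper invented for the example.

```python
import ipaddress

v4 = ipaddress.ip_address("192.0.2.1")      # documentation-only IPv4 address
v6 = ipaddress.ip_address("2001:db8::1")    # documentation-only IPv6 address


def address_space(ip):
    """Number of distinct addresses in the protocol this address belongs to."""
    return 2 ** ip.max_prefixlen


for ip in (v4, v6):
    print(f"IPv{ip.version}: {ip.max_prefixlen} bits, {address_space(ip):,} addresses")

# How many whole IPv4 internets would fit inside the IPv6 address space.
print(address_space(v6) // address_space(v4) == 2 ** 96)
```

    The two address formats are also different types entirely, which is a fair picture of the incompatibility: nothing written for one understands the other without translation.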

    Meanwhile, everyone else was used to NATing and dynamic IPs.  Some people are now starting to question whether they’d even want every machine on their local network to have a universally unique IP.  Local networks give LAN machines a measure of anonymity, and most network security in this area specifically relies on gateways breaking public and private networks.  Not to mention what’s the point of having unique IPs on, say, an air-gapped network?[1]

    The point being, it’s more or less the same argument as is currently going on about HTML/XHTML.  Sure, standards are great and whatever, but how much point is there if they don’t actually work in the Really Real World?

    … plus I also just like kicking iconoclasts in the shins.  It’s a bad habit of mine.

    1. A network that is not connected to anything other than itself.  If you connect two GameBoys together via a cable in order to share Pokemon, this is an air-gapped network.  (Okay, actually it’s not, but it’s a good enough example) ^
  • v-s.net v0.6 and all content (unless noted) © Dee.