bculkin2442's comments

bculkin2442 · on Feb 21, 2013

IMO the probability of a given page being well-formed, or being malformed in a way that will not completely screw up the parser, is very low. Therefore, you should use a dedicated HTML parser for parsing HTML.
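
For illustration only, here is a minimal sketch of what "a dedicated HTML parser" could look like using Python's standard-library `html.parser`. The names `LinkCollector` and `extract_links` and the sample markup are hypothetical, not anything from the thread; the point is just that the parser copes with an unclosed tag and an unquoted attribute.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags, tolerating sloppy markup."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased; attrs is a list of
        # (name, value) pairs, where value may be None for bare attributes.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def extract_links(markup):
    """Return every href found on an <a> start tag, in document order."""
    collector = LinkCollector()
    collector.feed(markup)
    collector.close()
    return collector.links


# Hypothetical sample: unclosed <p>, uppercase tag, unquoted attribute value.
sample = '<p>See <a href="/one">one</a> and <A HREF=two.html>two</a>'
print(extract_links(sample))
```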

johndcook · on Feb 21, 2013

I agree as far as parsing most HTML. I use regular expressions all the time when I'm working with HTML I've written by hand because I know what's there.

I just find it interesting that the efficacy of regular expressions can be framed as a computer science question, a practical question, and a statistical question.

fuzzix · on Feb 21, 2013

In practical terms, how often are you actually parsing HTML? Building a DOM, rendering...

A knee-jerk reaction to 'HTML' and 'regular expression' being used in the same sentence is like someone seeing 'goto' without understanding the context and shouting "goto considered harmful!"

I recently used some combination of Perl's split() and regexes to trivially pull all links from a piece of markup. I had to suppress the "Don't parse HTML with regular expressions!" voices echoing in my head the whole time I was writing it. I'm OK with the code now, of course.

johndcook · on Feb 21, 2013

Goto statements are a good example. It's easier to recite "goto is harmful" than to say "In most situations, other control structures are more expressive and easier to maintain than goto statements. However, there may be rare occasions, particularly in low-level system programming, where goto statements could be preferable."

"Don't parse HTML with regex" is good general advice, but no more an absolute than avoiding goto statements.

Dylan16807 · on Feb 21, 2013

>rare occasions, particularly in low-level system programming, where goto statements could be preferable

Or something as simple as "break 2;"

simonster · on Feb 21, 2013

In my experience, an XPath (or CSS selector) to get relevant information is both easier to construct and easier to understand than a regular expression, and the resulting code is more likely to continue to work as the website changes. I don't see why you would subject yourself to using regular expressions unless your environment doesn't have an HTML parser.
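
As a rough sketch of the selector approach, the snippet below uses the limited XPath subset in Python's standard-library `xml.etree.ElementTree`. It assumes the markup is well-formed XML, which real-world HTML often is not; a third-party HTML parser would be needed for that. The function name `content_links` and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page.
SAMPLE = (
    "<html><body>"
    '<div id="nav"><a href="/home">Home</a></div>'
    '<div id="content"><a href="/post/1">Post</a><a name="top">no href</a></div>'
    "</body></html>"
)


def content_links(markup):
    """Return hrefs of <a> elements directly inside <div id="content">."""
    root = ET.fromstring(markup)
    # The path states intent directly: anchors with an href, in the content div.
    anchors = root.findall(".//div[@id='content']/a[@href]")
    return [a.get("href") for a in anchors]


print(content_links(SAMPLE))
```

The path expression reads as a description of where the data lives, which is the maintainability argument being made here.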

fuzzix · on Feb 21, 2013

In this particular instance, because of a self-imposed constraint to favour ease of deployment and use by the target audience, I have no parser.

Using xpath (or just about anything else) would be smashing.

The markup is unlikely to change to the point where the code breaks; the target pages are open source mirrors exposed over HTTP. It even winds up being nicer code than the FTP handler.

RyanZAG · on Feb 21, 2013

Depends entirely on what you are doing. If you need to parse a page and be sure that you have found every single link in that page - even javascript - then a regex is going to fall flat.

If you need to collect a number of links from the web using a spider, and you don't really mind whether you get them all, then regex is perfectly fine. You could do a regex to match the first href="" following "<a" and stopping when it hits a "/>". This will give you at least 75% of the links on the internet, and it does so with far less processing power or RAM than a full HTML parser.
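
One loose reading of that regex, sketched in Python. The pattern and the name `quick_links` are assumptions for illustration, not the exact expression described above; it only handles quoted href values and deliberately misses everything else.

```python
import re

# Match the first quoted href value inside each <a ...> start tag.
# Deliberately naive: unquoted values, commented-out markup, and links
# built by JavaScript are all ignored or mishandled.
A_HREF = re.compile(
    r'<a\b[^>]*?\bhref\s*=\s*(["\'])(.*?)\1',
    re.IGNORECASE | re.DOTALL,
)


def quick_links(markup):
    """Return the quoted href values the pattern happens to catch."""
    return [m.group(2) for m in A_HREF.finditer(markup)]


# Hypothetical sample: two quoted links and one unquoted link that is missed.
sample = (
    '<a class="x" href="http://example.com/a">A</a> '
    "<A HREF='b.html'>B</A> "
    "<a href=c.html>C</a>"
)
print(quick_links(sample))
```

The unquoted third link is skipped, which is the "don't really mind if you get them all" trade-off in action.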

As always, a chainsaw is not necessarily better than a butter knife for cutting down trees, if the 'tree' in question is just a 5cm shoot. In fact, using a chainsaw for that would be mighty funny.

blaabjerg · on Feb 21, 2013

Would that really count as parsing? Honest question.

RyanZAG · on Feb 21, 2013

I'm using the term 'parsing' because it's the one used by the article. What else would you use regex and HTML together for? You only have two options: extracting data from HTML using regex, or modifying HTML using regex. Modifying using regex is just string replacement. For example, replacing all hrefs in an HTML document with new links. You'd use the same method as described above, with a similar 75% of all links replaced.
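
A minimal sketch of that replacement case, assuming quoted href values only. The pattern, the name `rewrite_links`, and the `/old/` to `/new/` mapping are hypothetical.

```python
import re

# Capture the part of the <a> tag before the value, the quote character,
# and the URL itself, so that only the URL is swapped out.
HREF_VALUE = re.compile(
    r'(<a\b[^>]*?\bhref\s*=\s*)(["\'])(.*?)\2',
    re.IGNORECASE | re.DOTALL,
)


def rewrite_links(markup, mapper):
    """Replace each quoted href value with mapper(old_value)."""

    def _swap(match):
        prefix, quote, url = match.group(1), match.group(2), match.group(3)
        return f"{prefix}{quote}{mapper(url)}{quote}"

    return HREF_VALUE.sub(_swap, markup)


# Hypothetical sample: the unquoted second link is left untouched.
sample = '<a href="/old/page">x</a> <a href=/old/raw>y</a>'
print(rewrite_links(sample, lambda u: u.replace("/old/", "/new/")))
```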

nmcfarl · on Feb 21, 2013

Good question - but 90% of what I see people using HTML parsers for is grabbing the content of a single tag from a page.

Which is why I often recommend "parsing" HTML with a regex. Particularly on throwaway projects. The overhead of a real parser is a waste.

And if the complexity grows - the sucker will need a rewrite anyhow.

bculkin2442 · on Nov 15, 2012

As an involved member of the Minecraft community, I can tell you that there is work being done on a Minecraft API, but I have no clue about an ETA. Also, I don't think they really have any plans for how to monetize things.