I've actually been using this to convert large PDF files to HTML to be displayed...

roel_v · on May 6, 2013

"* HTML semantics are non-existent

These are all relatively easy to fix, I believe."

How? For example, how would you tell which <span>s (or whatever elements this converter emits) are headers, page headers/footers, a ToC, or a preface? IMO this is an AI-hard problem, and even the 'simple' approximation (statistics) is very hard because the inputs vary so widely: a model trained on a corpus of multi-column journal articles will most likely not work at all for books (although I haven't tried and would love to be proven wrong).
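To make the 'simple' approximation concrete, here is a minimal sketch of the kind of heuristic I mean. It is not what pdf2htmlEX does. It assumes you already have each page as a list of (text, font_size) lines in reading order, and the function name, labels, and thresholds (`classify_lines`, `repeat_ratio`, `heading_scale`) are all made up for illustration.

```python
from collections import Counter

def classify_lines(pages, repeat_ratio=0.6, heading_scale=1.2):
    """Label each line as 'page-header-footer', 'heading', or 'body'.

    pages: list of pages, each a list of (text, font_size) tuples
    in reading order. Thresholds are illustrative guesses.
    """
    # Treat the most common font size in the document as body text.
    sizes = Counter(size for page in pages for _, size in page)
    body_size = sizes.most_common(1)[0][0]

    # Text that recurs as the first or last line of many pages is
    # probably a running header or footer.
    edge_counts = Counter()
    for page in pages:
        if page:
            edge_counts.update({page[0][0], page[-1][0]})
    min_repeats = max(2, int(repeat_ratio * len(pages)))
    running = {text for text, n in edge_counts.items() if n >= min_repeats}

    labeled = []
    for page in pages:
        out = []
        for i, (text, size) in enumerate(page):
            at_edge = i == 0 or i == len(page) - 1
            if at_edge and text in running:
                label = "page-header-footer"
            elif size >= heading_scale * body_size:
                label = "heading"
            else:
                label = "body"
            out.append((text, label))
        labeled.append(out)
    return labeled

# Hypothetical three-page document.
pages = [
    [("My Book", 9), ("Chapter 1", 18), ("Some text.", 11),
     ("More text.", 11), ("Even more.", 11)],
    [("My Book", 9), ("Still more.", 11), ("Section 1.1", 14),
     ("Text again.", 11)],
    [("My Book", 9), ("Closing text.", 11), ("The end.", 11)],
]
labeled = classify_lines(pages)
```

Even this toy shows the problem: page numbers change on every page, so the repeat check misses them, and a fixed `heading_scale` that suits a book will mislabel a journal article whose section titles are bold at body size.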

Use case: a working (i.e., preserving semantics) pdf-to-epub converter. This would, imho, be a killer product / service.

coolwanglu · on May 6, 2013

Hey thanks for the info!

The 2nd and 3rd are planned for the future, as I'm still working on accuracy and speed. Issue #115 (https://github.com/coolwanglu/pdf2htmlEX/issues/115) covers the 2nd one.

As for the first one, I don't have an elegant solution yet. Maybe a CSS file per page?

Please file new issues on GitHub if you think it's necessary :)

acmecorps · on May 6, 2013

I love this! Kudos for this awesome app.