I've actually been using this to convert large PDF files to HTML to be displayed in-browser. It's for my work, so I don't feel comfortable posting a link to the demo instance here.
It is definitely the best solution I've found so far. The output HTML / CSS / images look almost identical to the source PDF. That being said, there are still a few issues:
* One gigantic (600 KB) CSS file from a single PDF
* Hundreds of individual fonts
* HTML semantics are non-existent
These are all relatively easy to fix, I believe. I have found my own solutions to most of the issues in post-processing.
Kudos to you, coolwanglu. Also, I'd like to get in touch with you about lending a hand to fix some of the issues I've encountered.
"These are all relatively easy to fix, I believe."
How? For example, how would you tell which <span>s (or whatever elements this converter emits) are headings, page headers/footers, a ToC, or a preface? IMO this is an AI-hard problem, and even the 'simple' approximation (statistics) is very hard because the inputs vary so widely: a model trained on a corpus of multi-column journal articles will most likely not work at all for books (although I haven't tried and would love to be proven wrong).
Use case: a working (i.e., semantics-preserving) PDF-to-EPUB converter. This would, imho, be a killer product / service.
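To make the 'simple' statistical approximation concrete, here is a rough sketch of one naive heuristic: treat the font size that covers the most characters as body text, and call anything noticeably larger a heading. Everything in it is an assumption for illustration: the input is a hypothetical list of (text, font size) pairs already pulled out of the converter's output, and `guess_heading_levels`, `min_ratio`, and `max_levels` are made-up names, not part of any real tool. It does nothing for page headers/footers, a ToC, or a preface, and thresholds that suit one kind of document may well fail on another.

```python
from collections import Counter


def guess_heading_levels(spans, min_ratio=1.15, max_levels=3):
    """Label (text, font_size) pairs as 'h1'..'hN' or 'p'.

    Hypothetical helper: `spans` is assumed to be a list of
    (text, font_size) tuples extracted from the converted HTML.
    """
    if not spans:
        return []

    # Weight each font size by how many characters are set in it;
    # the most heavily used size is assumed to be body text.
    weight = Counter()
    for text, size in spans:
        weight[size] += len(text.strip())
    body_size = weight.most_common(1)[0][0]

    # Sizes at least `min_ratio` times the body size are heading
    # candidates; the largest becomes h1, the next h2, and so on.
    larger = sorted(
        {size for _, size in spans if size >= body_size * min_ratio},
        reverse=True,
    )
    level_of = {size: min(i + 1, max_levels) for i, size in enumerate(larger)}

    return [
        (text, f"h{level_of[size]}" if size in level_of else "p")
        for text, size in spans
    ]


if __name__ == "__main__":
    sample = [
        ("Chapter 1", 24.0),
        ("Some body text that is fairly long.", 10.0),
        ("1.1 Background", 14.0),
        ("More body text follows here.", 10.0),
    ]
    for text, tag in guess_heading_levels(sample):
        print(tag, text)
```

Font size alone says nothing about where a span sits on the page, so a running header set in the body font would still come out as a plain paragraph here.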
Thanks for a cool piece of software!