Sunday, January 06, 2013

Blog vs Epub, round 1

My fairy suggested a very nice Christmas gift for my (not so lil' anymore) nephews: she owns a thermal book binder and went "oh, but you could print some story of yours and bind it into a book with my machine..." My audio processors catalyzed that into "how about stripping some meaningful text out of your blog and printing it on 8x8" pages?", and I immediately thanked her for that marvelous idea.

Of course, that first required a major upgrade of my "blogpressing" tools. The "images scanner" got complemented with list-post.pl, which extracts posts carrying a certain tag and formats them into an HTML document. From there, I could import the document into OpenOffice, export it to ODT format and start adjusting image sizes and other formatting annoyances to fit everything into two ~50-page illustrated texts. The fight to get that printed on my fairy's HP all-in-one printer is for another post.
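
In case anyone wonders, the core of list-post.pl boils down to something like the sketch below -- re-sketched in Python rather than the actual Perl, and assuming a Blogger Atom export as input, which is only an approximation of what the real script consumes:

#!/usr/bin/env python
# Sketch: pull every post carrying a given label out of a Blogger Atom export
# and dump them into a single HTML document. The input format and the tag
# handling are assumptions; the real list-post.pl is a Perl script.
import sys
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"

def posts_with_tag(export_file, tag):
    """Yield (title, html_content) for every post labelled with `tag`."""
    root = ET.parse(export_file).getroot()
    for entry in root.iter(ATOM + "entry"):
        labels = [c.get("term") for c in entry.iter(ATOM + "category")]
        if tag in labels:
            yield (entry.findtext(ATOM + "title") or "(untitled)",
                   entry.findtext(ATOM + "content") or "")

if __name__ == "__main__":
    export, tag = sys.argv[1], sys.argv[2]
    print("<html><head><meta charset='utf-8'/></head><body>")
    for title, content in posts_with_tag(export, tag):
        print("<h2>%s</h2>" % title)
        print(content)
    print("</body></html>")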

Suffice it to say that I decided to avoid printing for my own reading needs and to take advantage of my Cybook instead.

I had no luck with the graphical front-end of Calibre this time, so I dug around the web a bit and figured out that I could use the command-line approach:

ebook-convert tagtionary.html test.epub --breadth-first --max-levels=8 --margin-left=2 --margin-right=2 --verbose

Even then, Calibre gave me a hard time. I guess running Lucid Lynx in 2013 is the root of all my problems, so I'll have to upgrade sooner than I wished. Btw:

Non-ASCII characters in URLs aborted the HTML-to-EPUB conversion with a mysterious "ascii codec can't decode" exception (and no offending URL/file mentioned), and so did %-escaping in filenames. I had to interrupt the conversion manually after it had spent about half an hour in conversion attempts, with a 3GB resident set and up to 7GB of virtual address space.
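
If that happens a lot, a small pre-pass over the HTML can rename the offending files to plain ASCII and patch the references accordingly. A sketch of the idea only -- the src="..." regex and the in-place rewrite are simplifications of my actual layout, and this is not something ebook-convert offers by itself:

# Sketch: make local image references ASCII-only before feeding the HTML to
# ebook-convert, so the converter never sees accented or %-escaped paths.
import os
import re
import unicodedata
from urllib.parse import unquote   # Python 3; urllib.unquote on Python 2

def ascii_safe(name):
    """Undo %-escaping and strip anything non-ASCII from a filename."""
    name = unquote(name)
    name = unicodedata.normalize("NFKD", name)   # e.g. 'é' -> 'e' + combining accent
    return name.encode("ascii", "ignore").decode("ascii")

def sanitize(html_file):
    text = open(html_file, encoding="utf-8").read()
    for src in set(re.findall(r'src="([^"]+)"', text)):
        safe = ascii_safe(src)
        if safe != src and os.path.exists(unquote(src)):
            os.rename(unquote(src), safe)                             # rename the file...
            text = text.replace('src="%s"' % src, 'src="%s"' % safe)  # ...and the reference
    open(html_file, "w", encoding="utf-8").write(text)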

It got to the "Creating EPUB Output" stage, and most pages said "No large tree found", then "splitting on page-break". All fine. A few pages had "large tree #0" with a split point defined (I have no idea why "Split point: {http://www.w3.org/1999/xhtml}h3 /*/*[2]/*[663]" is mentioned there). Even the largest file, "english.html", got happily split into 6 parts. Then, for some curious reason, "mybrew.html" entered an endless series of "splitting... split tree still too large: 464KB".
From there on, it consumed more and more memory, obviously leaking all the prior attempts.
(edit: after dropping the offending mybrew.html, I managed to get the 54MB epub file. Checking it on the Odyssey ASAP.)
(edit++: the Calibre distributed with the latest LTS handled mybrew.html out of the (virtual) box =:)


It could be that I'm missing some h2 level in the generated documents, but still... the "english" page seems larger than "mybrew" by every metric... I just don't get it. A quick look at the stack trace generated when I pressed CTRL+C explains a lot of the memory issues, though:

  File "/usr/bin/ebook-convert", line 19, in
    sys.exit(main())
  File "/usr/lib/calibre/calibre/ebooks/conversion/cli.py", line 254, in main
    plumber.run()
  File "/usr/lib/calibre/calibre/ebooks/conversion/plumber.py", line 886, in run
    self.opts, self.log)
  File "/usr/lib/calibre/calibre/ebooks/epub/output.py", line 169, in convert
    split(self.oeb, self.opts)
  File "/usr/lib/calibre/calibre/ebooks/oeb/transforms/split.py", line 56, in __call__
    self.split_item(item)
  File "/usr/lib/calibre/calibre/ebooks/oeb/transforms/split.py", line 66, in split_item
    self.max_flow_size, self.oeb, self.opts)
  File "/usr/lib/calibre/calibre/ebooks/oeb/transforms/split.py", line 188, in __init__
    self.split_to_size(tree)
  File "/usr/lib/calibre/calibre/ebooks/oeb/transforms/split.py", line 402, in split_to_size
    self.split_to_size(t)
  File "/usr/lib/calibre/calibre/ebooks/oeb/transforms/split.py", line 402, in split_to_size
    self.split_to_size(t)
  File "/usr/lib/calibre/calibre/ebooks/oeb/transforms/split.py", line 402, in split_to_size
    self.split_to_size(t)
  File "/usr/lib/calibre/calibre/ebooks/oeb/transforms/split.py", line 402, in split_to_size
    self.split_to_size(t)
  File "/usr/lib/calibre/calibre/ebooks/oeb/transforms/split.py", line 402, in split_to_size
    self.split_to_size(t)
  File "/usr/lib/calibre/calibre/ebooks/oeb/transforms/split.py", line 402, in split_to_size
    self.split_to_size(t)
  File "/usr/lib/calibre/calibre/ebooks/oeb/transforms/split.py", line 402, in split_to_size
    self.split_to_size(t)
  File "/usr/lib/calibre/calibre/ebooks/oeb/transforms/split.py", line 402, in split_to_size
    self.split_to_size(t)
  File "/usr/lib/calibre/calibre/ebooks/oeb/transforms/split.py", line 402, in split_to_size
    self.split_to_size(t)
  File "/usr/lib/calibre/calibre/ebooks/oeb/transforms/split.py", line 402, in split_to_size
    self.split_to_size(t)

Recursive retries are just a bad idea here, I'm afraid. Plus, each attempt creates two deep copies of the document before some nodes get selectively deleted in the copies. What I don't get is why sub-division keeps generating trees of the same size... It looks like the "split point" for the misbehaving document simply cannot be found 0_o.
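
Checking that kind of suspicion is not too hard: drop a breakpoint right before the self.split_to_size(tree) call the traceback shows at line 188 of split.py, then poke at the tree from the (Pdb) prompt. Plain lxml calls only, and assuming that tree is the usual lxml document Calibre works on -- take this as an illustration rather than a recipe:

# In /usr/lib/calibre/calibre/ebooks/oeb/transforms/split.py, right before the
# self.split_to_size(tree) call the traceback points at (line 188 on my install):
import pdb; pdb.set_trace()

# then at the (Pdb) prompt -- ordinary lxml inspection, nothing Calibre-specific:
# (Pdb) len(tree.xpath('/*/*[2]/*'))          # direct children under the body element
# (Pdb) len(tree.xpath('/*/*[2]/*[663]/*'))   # children of the reported split point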

After reminding myself to use pdb.set_trace() at critical points, I used some in-evaluator debugging and figured out that the document to be split has a curious structure. Granted, the split_point element (the year 2010 marker) can be found, but the node has no children. All the content of the document sits as direct children of the root node >_<. If this is a feature, I fail to see how it's supposed to work.
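
If I wanted to patch the document rather than Calibre, a pre-processing pass like the one below could re-nest that flat list of siblings into one section per heading, so the splitter would get real subtrees to cut. A sketch only: it assumes lxml is available, and "h3" as the section marker is a guess based on the split point Calibre reported:

# Sketch: wrap everything that follows each <h3> into its own <div>, so the
# document body is no longer one flat list of children and the EPUB split
# transform has subtrees it can actually carve out.
from lxml import html

def nest_sections(infile, outfile, marker="h3"):
    doc = html.parse(infile)
    body = doc.getroot().body
    section = None
    for node in list(body):              # iterate over a snapshot: we re-parent nodes as we go
        if node.tag == marker:
            section = html.Element("div")
            body.replace(node, section)  # the new <div> takes the heading's place...
            section.append(node)         # ...and the heading moves inside it
        elif section is not None:
            section.append(node)         # flat siblings move under the current section
    doc.write(outfile, method="html", encoding="utf-8")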

Unfortunately, my installed Ubuntu systems are too old to get any support for that. I'll have to migrate first... Happy new year :P

1 comment:

PypeBros said...

Calibre did:
- create the manifest and the TOC tables,
- compile a stylesheet,
- replace <br/> tags with <p> elements that enclose a UTF-8 nbsp character,
- remove inline image styles, replacing them with class="calibreX" attributes,
- (same for font-style or color: inline styles),
- keep <b> and <i> tags, but with a class assigned to them as well.