Spanish ebook title garbled — problem solved

We generate our ERG2012: Quick Lookup ebook using a combination of Ruby scripts and the Calibre ebook-convert tool. The last step for each ebook is to launch ebook-convert.exe from a Ruby script, passing it a long list of commandline options.

This process was working well for us right up to the point where we began to generate review copies of the Spanish language version, GRE 2012: Guía de Referencia Rápida. We could generate the ebook just fine and it had all the right contents, but the embedded title was garbled.

It took me a long time to track down what was really going wrong. The title was right in the localized text file it came from. It looked right in the printed version of the commandline that we were logging. But it was wrong in the title text embedded in the book.

A word of warning: Gory technical details lie beyond this point. If you aren’t interested in automated ebook generation or multi-lingual ebooks or how Ruby kernel methods work on Windows, you may wish to turn back now.

I was able to prove pretty early on that Calibre didn’t have any problem consuming the correct title string. I could manually generate the same book, using the Calibre graphical user interface and copying the string into the input field, and the title would come out fine. Given how global the Calibre community is, it would have astounded me if this was not true.

So I went to work on the idea that either the code that generated the commandline was getting the character set wrong or that it was specifically the ebook-convert commandline interface that was the problem. But, in the end, by capturing output at various points, turning it into little batch files and/or hand-feeding strings to ebook-convert, I eliminated those possibilities, too.

The only thing left was to suspect the Ruby system call — which was and should always be a dumb, last resort idea. Claiming you’re blocked by a defect in your underlying runtime library is a game for fools and novices, almost every time.

But not this time. Here’s the trivial line of code that invoked the book converter:
system(command_line)
where command_line is a string variable of a few hundred characters that looks something like:
ebook-convert infile.html outfile.epub --title="GRE 2012: Guía de Referencia Rápida" . . .

Why would a simple call to the underlying OS garble a well-formed commandline? Turns out it’s a Ruby-on-Windows problem. When I finally began to suspect the call itself, I found this StackOverflow discussion:
http://stackoverflow.com/questions/11768374/ruby-system-doesnt-accept-utf-8

To summarize the discussion there: Ruby 1.9.3, the version I’m using, and presumably any older version, implements system() on Windows using a Win32 API call that effectively only supports ANSI strings. When that call is made, my nice title string is coerced from UTF-8 to ANSI and the í and á are trashed. (If you are still reading but that last line makes absolutely no sense to you, please leave this post now and go directly to Joel Spolsky’s hilarious but informative treatise on character encodings. I’m still trying to really grok some of his details and I’ve been re-reading it off and on for years, but it is THE place to start if you are trying to understand encodings, aka char sets.)
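
To see roughly what that kind of coercion does to the title, here is a small Ruby sketch. It assumes the ANSI code page is Windows-1252 (common on US and Western European Windows installs, but that is my assumption, not something I verified on the build machine). It only simulates UTF-8 bytes being misread as ANSI; it doesn't call system() at all.

```ruby
# encoding: utf-8

title = "GRE 2012: Guía de Referencia Rápida"

# Reinterpret the UTF-8 bytes as if they were Windows-1252 (assumed ANSI
# code page), then convert back to UTF-8 so the result can be printed.
garbled = title.dup.force_encoding("Windows-1252").encode("UTF-8")

puts title.bytesize   # more bytes than characters: í and á take two bytes each
puts garbled          # each accented letter turns into two junk characters
```

Each accented letter is two bytes in UTF-8, and each of those bytes gets read as its own ANSI character, so "Rápida" comes out as "RÃ¡pida".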

The obvious solution, the one I hoped to use and that would work with some applications, would have been to read all the parameters in from a file, rather than on the commandline. Not only would that have resolved this issue but the file would have been easier to proofread and validate than my god-awful-long commandline is. But, no good, ebook-convert doesn’t have an option to read parameters from a file.

However, it IS a Windows problem and that means one can always just fall back to the primitive. I wrote the commandline out to a batch file and invoked the batch file with the system() call.
system("cmd /C ebook_convert_epub.bat")
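
For anyone who wants to try the same workaround, here is a minimal sketch of the idea, not my exact five lines. The helper names write_batch_file and run_batch_file are made up for illustration, and the chcp 65001 line is an assumption: it tells cmd.exe to read the batch file as UTF-8, which may or may not be what your setup needs.

```ruby
# encoding: utf-8

# Hypothetical helper: write the command line to a batch file so the
# non-ASCII text never passes through the arguments of Ruby's system().
def write_batch_file(command_line, path = "ebook_convert_epub.bat")
  File.open(path, "w:UTF-8") do |f|
    f.puts "@echo off"
    # Assumption: the command line is UTF-8, so switch the console
    # code page to UTF-8 (65001) before cmd.exe reads the next line.
    f.puts "chcp 65001 > nul"
    f.puts command_line
  end
  path
end

# Hypothetical helper: the string handed to system() is now plain ASCII.
def run_batch_file(path)
  system("cmd /C #{path}")
end
```

Usage would be something like run_batch_file(write_batch_file(command_line)). One thing to watch: inside a batch file a literal % has to be written as %%, so a title containing a percent sign would need escaping first.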

Easy peasy, problem solved. Just took a couple of weeks of thinking about it in background mode, a couple of days off and on to actually figure out what was wrong, and a few attempts that failed before I came up with a five-line fix. Sheesh.
