Word docx to ebook — generating a clean docx file

RunawayCoverAmazon

(Second post in a series that begins here.)

I’ve got two files that Mom mailed me when we first agreed to publish these two short works in the Kindle and Nook marketplaces.  Both are .doc files, probably from Microsoft Word 2003, definitely from before MS Word 2007 when Microsoft switched to the .docx format.

There are two options for converting the files to .docx. If the formatting in the files were already clean, my guess is that the two would give equivalent results.

  • Import the .doc to my Google Docs account, letting Google convert it to its native format, then export it as a .docx.  (But note that this only works for relatively small files; there’s currently a 2 MB size limit.)
  • Open the .doc in a current version of Word and resave it as .docx.

I’m going to do the latter.  Since I think the internal formatting for these two documents needs to be cleaned up to make Calibre’s conversion go more smoothly, I want to work with them in Word anyway.  However, the process I’m going to describe should work equally well via Google Docs for a short manuscript.  In other words, if you have an old .doc file (or even an old WordPerfect or Ami Pro file) and don’t have a current copy of Word, don’t despair.  Free and open software can come to your rescue.
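If you want to go the free-software route from a script, here is a minimal sketch.  It assumes LibreOffice is installed and that its soffice command is on your PATH; LibreOffice is my suggestion here, not something I used for these files, and the file and folder names are placeholders.

```python
import shutil
import subprocess
from pathlib import Path


def build_convert_command(doc_path, out_dir, soffice="soffice"):
    """Return the LibreOffice command line that converts doc_path to .docx."""
    return [
        soffice,
        "--headless",            # run without opening a window
        "--convert-to", "docx",  # target format
        "--outdir", str(out_dir),
        str(doc_path),
    ]


def convert_to_docx(doc_path, out_dir="."):
    """Run the conversion and return the expected path of the new file."""
    if shutil.which("soffice") is None:
        raise FileNotFoundError("soffice not found on PATH (assumed install)")
    subprocess.run(build_convert_command(doc_path, out_dir), check=True)
    return Path(out_dir) / (Path(doc_path).stem + ".docx")


if __name__ == "__main__":
    # "Runaway.doc" and "converted" are placeholder names.
    print(build_convert_command("Runaway.doc", "converted"))
```

The converted file will still carry all of the original's typed formatting, so the cleanup described in the rest of this post applies either way.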

Please do try this at home.  Please do NOT try this on the only copy of a file that you have.  BEFORE you start this process, make sure you have made a safe copy, other than the one you are about to work on. Pretty please?
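If you like to script your safety nets, a timestamped copy is easy to make.  This is only a sketch: the function name and the backups folder are my own placeholders, and copying the file by hand works just as well.

```python
import shutil
from datetime import datetime
from pathlib import Path


def make_backup(path, backup_dir="backups"):
    """Copy path into backup_dir under a timestamped name; return the copy's path."""
    source = Path(path)
    target_dir = Path(backup_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    target = target_dir / f"{source.stem}-{stamp}{source.suffix}"
    shutil.copy2(source, target)  # copy2 also preserves file timestamps
    return target
```

Calling make_backup("Runaway.doc") before touching anything leaves the original where it was and puts a dated copy beside it in the backups folder.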

Here’s the first page of the Runaway .doc file, opened in Word 2010:

FirstOpenOfDocFileAfterRename

Do a Save As and choose the type *.docx.  Leave the “Maintain compatibility” check box unchecked.

SaveAsDocxNoMaintainCompat 

Blow through the compatibility warning message:

CompatibilityWarning

Yes, folks, we are leaving the Word 2003 and earlier compatibility zone behind.  It is, after all, a whole new decade.  Be brave and OK on.

Now that we have a docx file, we get to the fun part for geeks like me:  finding and cleaning up the internal formatting issues.  Your mileage may vary on this sort of task, and your manuscript is sure to differ enormously from this simple one and be harder to fix.  I’m just going to show you what I did in this one case and leave your own cleanup as an exercise for the reader.

I know my mom pretty well and was recently privileged to hang out with her and provide some technical support while she finished her newest novel, Red Man Down, which will be out in a few months.  So I had a sense of what I’d find when I hit the “Show Formatting” button.

If you don’t know our friend the Show Formatting button, you should.  She’s the key to solving any number of pesky Word-won’t-do-what-I-want problems.  She’s not all powerful; there is a lot of what I consider formatting that she will not show us.  But today, with this file, she is exactly what we need.

InitialFormattingAllDoneWithTabsShortened

And what we see here is just a lot of tabs (the arrows) and carriage returns (the backwards-P paragraph symbols).

My mom, when I was a kid, did her writing — the little she ever had time for — on a small, portable, manual typewriter.  It was a sleek little model, very advanced for its time, that came in its own little suitcase.  That’s only relevant because Mom uses Word the same way she typed:

She centers a title by tabbing over to the middle of the page:

TabsUsedToCenterTheTitle

She starts a paragraph with the tab key.

TabToStartParagraph

She starts a new page by hitting the Enter key until the text spills onto the next one.

You get the picture. All the auto-formatting that Word would be happy to do for her, she simply ignores.
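If you would rather get a quick count than eyeball the arrows, a .docx is just a zip file with the text stored in word/document.xml, so Python's standard library can tally typed tabs and blank paragraphs.  This is a sketch under that assumption; the function names are mine, and it ignores text boxes, headers, and footers.

```python
import zipfile
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}
TEXT_TAG = f"{{{W}}}t"


def formatting_report(document_xml):
    """Count paragraphs, typed tabs, and blank paragraphs in document.xml."""
    root = ET.fromstring(document_xml)
    paragraphs = root.findall(".//w:p", NS)
    # A typed tab is a <w:tab/> directly inside a run (<w:r>).  Tab-stop
    # definitions sit under <w:tabs> instead and are not counted.
    tabs = root.findall(".//w:r/w:tab", NS)
    # "Blank" here means the paragraph has no visible text at all.
    blank = [
        p for p in paragraphs
        if not "".join(t.text or "" for t in p.iter(TEXT_TAG)).strip()
    ]
    return {
        "paragraphs": len(paragraphs),
        "tabs": len(tabs),
        "empty_paragraphs": len(blank),
    }


def report_for_file(docx_path):
    """Open a .docx and report on its main document part."""
    with zipfile.ZipFile(docx_path) as archive:
        return formatting_report(archive.read("word/document.xml"))
```

A manuscript typed the typewriter way shows up as a tab count close to the paragraph count, plus long runs of blank paragraphs wherever a page break was faked with Enter.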

It doesn’t take much formatting to help Calibre make a simple ebook like this look good.  But all those tabs make ME itchy and I do not want to risk confusing Calibre with them.  

Tabs are “special characters” to Word, which makes sense. They are invisible most of the time, after all. So to find them with the standard Word Find and Replace dialog, you have to get to the Special Characters menu by using the “More>>” option.

ReplaceMore

ReplaceTabChars
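For a whole stack of files, the same replace-tabs-with-nothing step can be scripted.  The sketch below is not what I did for this manuscript (I used the dialog above).  It assumes Word stored each typed tab as a bare <w:tab/> element in word/document.xml, which is what I would expect from a file saved by Word, and it writes a new file rather than touching the original.  The function name is my own.

```python
import re
import zipfile

# Matches only attribute-less <w:tab/> elements, i.e. typed tabs.
# Tab-stop definitions such as <w:tab w:val="left" w:pos="720"/> are left alone.
TYPED_TAB = re.compile(rb"<w:tab\s*/>")


def strip_typed_tabs(src_path, dst_path):
    """Write a copy of the .docx at src_path with typed tabs removed.

    Returns the number of tabs removed from the main document part.
    """
    removed = 0
    with zipfile.ZipFile(src_path) as src, \
         zipfile.ZipFile(dst_path, "w", zipfile.ZIP_DEFLATED) as dst:
        for item in src.infolist():
            data = src.read(item.filename)
            if item.filename == "word/document.xml":
                data, removed = TYPED_TAB.subn(b"", data)
            # Every other part (styles, settings, images) is copied unchanged.
            dst.writestr(item, data)
    return removed
```

Open the result in Word before trusting it; a tab that was doing real work, in a table of contents for example, would be stripped along with the rest.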

Once I found all the tabs and replaced them with nothing, I deleted some extra lines at the top and bottom of the file.  Then I tried applying a couple of Word’s standard styles.  I know my mom likes indented paragraphs with some space in between.  I might be able to control both features with Calibre, but the only one I know for certain I can add there is the spacing between paragraphs.  So I just looked for a style that gave me indented paragraphs (Style Traditional) and called it good.

StyleTraditional

Then I bolded the title and added an italicized, centered byline and called it quits.

FirstPageAfterFormattingShortened

Easy, peasy.  If I hadn’t been taking screenshots as I went, I’d have been done with the cleanup of this small manuscript in ten minutes or less.

(My colleague Bret pointed out to me a bit later that I really should have added some front matter — a title page with a copyright, at least — before I quit.  He’s right but I didn’t think of it and want to push on to ebook conversion and the iBooks store.  If that works, I’ll come back and do an update and fix how resume is spelled, too.)

Our next step will be to go through the actual Calibre conversion and see what iBooks says.  Check it out here.

Questions or suggestions?  Please leave a comment below.
