Clean HTML from Word: a Hack

We’ve been exploring ebook production here at Sherprog, and it seems like the best way to produce a high-quality EPUB or MOBI ebook using consumer tools involves starting from HTML. I’ve written about the inadequacies of InDesign’s export to HTML and EPUB, which is a problem many small publishers undoubtedly face. But individuals who publish or self-publish ebooks are probably working from Microsoft Word or another common word processor. Unfortunately, Word also tends to produce messy HTML via its native “Save As” HTML function. So how do you get clean HTML from Word? I would like to present a life hack.

The Problem

Not very clean HTML from Word

Classes and spans and styles oh MY

When you export HTML from Microsoft Word, what you tend to get is a class and a span with styling info for every paragraph. This is highly unnecessary, and frustrating if you are trying to control your text’s formatting for web or ebook publication. In order to clean up this HTML to any web developer’s reasonable standards, you would have to remove all these tiresome span tags and CSS declarations just so you could do the same work with a handful of human-designed CSS declarations for paragraph style. I have done this before, and manually. It took several hours to work through a novel-sized manuscript using search and replace to knock out these span tags one related group at a time in a text/code editor.
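If you would rather script that cleanup than do it by hand, here is a minimal Python sketch of the idea using only the standard library. It is not the process I used, and the names `WordCleaner`, `clean_word_html`, and the `DROPPED_ATTRS` list are my own assumptions for illustration; real Word exports contain more clutter than this handles, and comments in the source are simply dropped.

```python
from html import escape
from html.parser import HTMLParser

# Assumption: these are the attributes worth dropping; adjust for your export.
DROPPED_ATTRS = {"class", "style", "lang"}


class WordCleaner(HTMLParser):
    """Re-emit markup, skipping <span> tags and presentational attributes."""

    def __init__(self):
        # Keep entities such as &amp; as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.parts = []

    def _open_tag(self, tag, attrs, closer=">"):
        kept = "".join(
            f' {name}="{escape(value or "")}"'
            for name, value in attrs
            if name not in DROPPED_ATTRS
        )
        return f"<{tag}{kept}{closer}"

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            self.parts.append(self._open_tag(tag, attrs))

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br />.
        if tag != "span":
            self.parts.append(self._open_tag(tag, attrs, closer=" />"))

    def handle_endtag(self, tag):
        if tag != "span":
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        self.parts.append(data)

    def handle_entityref(self, name):
        self.parts.append(f"&{name};")

    def handle_charref(self, name):
        self.parts.append(f"&#{name};")


def clean_word_html(fragment: str) -> str:
    """Return the fragment without spans or class/style/lang attributes."""
    cleaner = WordCleaner()
    cleaner.feed(fragment)
    cleaner.close()
    return "".join(cleaner.parts)
```

Calling `clean_word_html` on a paragraph wrapped in Word-style spans should leave the `<p>`, `<i>`, and `<b>` tags in place and remove the rest. Because every span is dropped, any italics or bold that Word expressed only through a span's style attribute would be lost too.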

Necessity is the mother of hack.

That is when I noticed that I routinely pasted Microsoft Word content into WordPress and could hit publish and magically get a reasonable webpage every time. I went to a blog post and used my browser’s “view source” option to take a look at exactly what was happening and, bingo-automattico, there was beautiful, simple HTML with every paragraph in a nice <p> tag and not a lot else going on!

Go ahead: try it. Try “view source” on this very post. Some of this text was pasted right from Word.

Of course, WordPress and most other online text editors will allow you the option to view the HTML as you are composing. In WP, this is the “Text” tab. Unfortunately, you cannot just use this to get your HTML, because it omits p tags and divs. You could go ahead and scrape from here and get all your <b> and <i>, but you would have to go in and add all the paragraph tags. So here is the process I uncovered based on these observations.

Do This:

Use WordPress to get clean HTML from Word!

  1. Select your whole text in Word. Press Ctrl/Cmd+C.
  2. With your cursor in the WordPress composition window, press Ctrl/Cmd+V.*
  3. Save draft.
  4. Preview.**
  5. While looking at the post preview, use your browser’s “view source” option. In Safari, this is under “View” and brings up a pop-up with the page’s code. In Chrome, this is under View>Developer. In Firefox, you are looking for Tools>Web Developer>Page Source.
  6. You will likely have to scroll down beyond all sorts of styling, header, and script information to find the content. You can use Ctrl/Cmd+F to search for a line you expect in the text. Select all of the desired content, including the enclosing p tags. Press Ctrl/Cmd+C.
  7. Go to your text/code editor and paste into your working file. You will need to have your HTML document head in place, and you will need to add your own CSS to control the appearance. Make sure your content is enclosed in body tags, of course.
  8. Format and style as normal HTML according to your intended use. Like I said, I used this process for ebook creation, but you could conceivably use it for any HTML application.
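Steps 6 and 7 can also be scripted once you have saved the page source to a file. The following is a rough Python sketch, not the manual process described above: the function names, the sample CSS rule, and the idea of locating the content by its first and last phrases are all my own assumptions. It also assumes the nearest `<p` before the first phrase really is a paragraph tag.

```python
from html import escape


def extract_content(page_source: str, first_phrase: str, last_phrase: str) -> str:
    """Return markup from the <p> holding first_phrase to the </p> after last_phrase."""
    first = page_source.index(first_phrase)
    last = page_source.index(last_phrase, first)
    # Walk back to the paragraph tag that opens the first paragraph.
    start = page_source.rindex("<p", 0, first)
    # Walk forward to the tag that closes the last paragraph.
    end = page_source.index("</p>", last) + len("</p>")
    return page_source[start:end]


def wrap_in_document(content: str, title: str) -> str:
    """Put the extracted paragraphs inside a minimal HTML document."""
    return (
        "<!DOCTYPE html>\n"
        '<html lang="en">\n'
        "<head>\n"
        '<meta charset="utf-8">\n'
        f"<title>{escape(title)}</title>\n"
        "<style>\n"
        "p { margin: 0; text-indent: 1.5em; }\n"  # placeholder rule, replace with your own
        "</style>\n"
        "</head>\n"
        "<body>\n"
        f"{content}\n"
        "</body>\n"
        "</html>\n"
    )
```

You would pass in the first and last few words of your manuscript, then write the result of `wrap_in_document` to your working file. If either phrase is missing from the page, `index` raises a `ValueError`, which is a useful signal that the phrase was split by an inline tag such as `<i>`.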

I already had admin access to several WordPress accounts, so that is what I used, but I imagine that this technique can work on other platforms as well. While I didn’t test the process all the way through in any other editor, I used view source on a Google Sites web page and verified that the WYSIWYG editor there had produced clean HTML. I also tried scraping from a random Blogger page and found that it used div tags and <br /> in place of actual paragraph tags, which is not optimal but may be good enough for your purposes.
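If you are stuck with that kind of br-separated markup, one possible workaround is to convert it to paragraphs yourself. This Python sketch is untested against real Blogger output; it assumes you have already pulled the text out of its enclosing div and that every run of `<br>` tags marks a paragraph break. The function name `br_to_paragraphs` is made up for the example.

```python
import re

# One or more <br>, <br/>, or <br /> tags, with any whitespace around them.
BR_RUN = re.compile(r"(?:\s*<br\s*/?>\s*)+", re.IGNORECASE)


def br_to_paragraphs(fragment: str) -> str:
    """Split on runs of <br> tags and wrap each non-empty piece in <p> tags."""
    pieces = BR_RUN.split(fragment)
    return "\n".join(f"<p>{piece.strip()}</p>" for piece in pieces if piece.strip())
```

Inline tags such as `<b>` and `<i>` pass through untouched, since the pattern only matches br tags.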

  • Will work: WordPress
  • Will probably work: Google Sites***
  • Will not work, or will not work very well: Blogger, Google Docs, Medium, MailChimp (Email Previewer)

The longest text I have tried this with is an 80,000-word novel. As far as I know there is no size constraint on WordPress posts.

Notes:

*I have found that using “paste from Word” will carry over some of the unnecessary markup we are trying to strip. On the other hand, using “paste plain text” will strip all of your formatting (italics, bold, etc.). So use a regular paste from the clipboard right into the editor.

**This way you don’t have to publish the content on your blog unless you intend to.

***I have a suspicion you might have to strip out an empty paragraph between each paragraph of content. Google Sites seems to use empty paragraphs to create line spaces for block paragraph formatting. Other than this drawback, I think Google Sites will work. I suppose you don’t necessarily have to strip out the blank lines, either; it depends on your purposes.
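If you do want to strip those empty paragraphs, a short Python sketch along these lines might do it. I have not run it against real Google Sites output, so the pattern is an assumption about what “empty” looks like there: a paragraph containing nothing but whitespace, `&nbsp;`, or `<br>` tags.

```python
import re

# A <p> element whose only contents are whitespace, &nbsp;, or <br> tags,
# plus any whitespace that follows it.
EMPTY_PARAGRAPH = re.compile(
    r"<p\b[^>]*>(?:\s|&nbsp;|<br\s*/?>)*</p>\s*",
    re.IGNORECASE,
)


def strip_empty_paragraphs(fragment: str) -> str:
    """Remove paragraphs that hold no visible content."""
    return EMPTY_PARAGRAPH.sub("", fragment)
```

Paragraphs with any real text in them are left alone, so it should be safe to run over a whole manuscript.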
