Monday, February 14, 2011

Legacy word processor documents

WOW! 5 years since my last entry in this blog.

I hope you weren't holding your breath or anything!

(I guess this title/entry is appropriate for a blog that hasn't been touched for 5 years, huh?)

This morning my dad wrote with a problem. He's got many, many documents that he first created on a CP/M machine (yes, that's before MS-DOS), then converted to WordStar (do I hear a nostalgic sigh from some of you?) and added more documents, and then converted to WordPerfect (more nostalgic sighs) and added still more. In the last couple of days he has tried to open some of those files in MS Word and discovered that WordPerfect filters are no longer generally available, so before they go away entirely (?) he needs to go through the conversion process again. He's understandably getting a little tired of doing conversions, and he asks whether there isn't some format he could move to that would stick around for a while, so he could avoid more iterations of this process.

In my oh-so-humble opinion I thought my email response to him might be of help to others as well, thus prompting this visit back to the long-deserted blog... Here's what I wrote:

How complicated is the formatting in these documents?

Text files worked just as well in CP/M as they do in Windows 7, but their formatting is basically zilch. Buck Rogers will probably still be using them in the 25th century as long as he doesn't have formatting needs.

One step up from straight txt is rtf, or rich text format. It won't hold fancy formatting, but it'll do a lot of formatting and has been around just about since the beginning. From my perspective it's never made it mainstream and so there's not going to be as much pressure to keep it available as a legacy format. However, it's *so* simple that I'd be surprised if any word processor ever made the decision not to support it. Buck may have to download a filter or two, but he'll probably still be able to handle rtf files as well.

If you save things to a single-file html format, that is probably about as ubiquitous as it comes, and we're going to have to have access to html files for about a millennium, from the looks of it... (Of course, I would have said the same thing for WordPerfect in the early '90s as well...) (Interestingly, the latest & greatest in terms of ebook formats [albeit unsupported on the Kindle but that's Amazon's hubris], the epub format, is simply a set of HTML files zipped up with a few other things thrown in.)

Do you need to edit these documents? Or is viewing sufficient? If the latter, PDF files are a very good solution... Perfect formatting and can still read files that were painted on rocks in caves... There are some ways to make them where the text is subsequently selectable/copy-able and other ways to make them where those capabilities are not available, so be aware of that. But if you need to get the text out and re-use it in a different format or edit them or something like that then PDF is somewhat questionable.

Simple doc files are going to stick around for a while just because of the same problem you have: there are about a gazillion of them out there as legacy documents, so MS (and everybody else, for that matter) is going to get in pretty hot water with their clients if they stop making filters to import them. On the other hand, they are already on their way out, in that they are no longer the default format for up-to-date word processors. Docx files are at least on their way *in* instead of on their way *out*, and they also have the advantage that they are pretty close to simple text in their inner guts (a zip archive of XML files, to be technical), so in a worst-case scenario I think you could do a bit of uncompressing and still get hold of your data via a text editor.
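To make that worst-case scenario a bit more concrete, here is a rough Python sketch of the "uncompress it and read the XML" idea. It assumes the main body text lives in `word/document.xml` inside the archive and uses the standard WordprocessingML namespace; the function name `docx_to_text` is just something I made up for illustration. It ignores headers, footnotes and all formatting, so treat it as a rescue tool rather than a converter.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by WordprocessingML elements inside word/document.xml.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_to_text(source):
    """Return the plain text of a .docx, one line per paragraph.

    `source` may be a file path or a binary file-like object.
    (Hypothetical helper; formatting, headers and footnotes are ignored.)
    """
    # A .docx is a zip archive; the main body is stored as XML.
    with zipfile.ZipFile(source) as archive:
        xml_bytes = archive.read("word/document.xml")

    root = ET.fromstring(xml_bytes)
    paragraphs = []
    # <w:p> is a paragraph; the visible text sits in <w:t> elements inside it.
    for para in root.iter(W_NS + "p"):
        pieces = [node.text or "" for node in para.iter(W_NS + "t")]
        paragraphs.append("".join(pieces))
    return "\n".join(paragraphs)
```

Calling `docx_to_text("letter.docx")` (file name made up) should hand back the text as one string, which you can then save as a plain txt file.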

OpenOffice is another contender. Open source projects tend to keep supporting older formats for a *long* time because there's always somebody with the need to access their older documents, and since it's open source, if they have the technical skills they just solve it themselves. (In fact, if you are constructing a batch file to do the conversion you may well find that OpenOffice Writer has better capabilities in this direction than MS Word does.)

If I were you this is what I would do:

* If you are working with more than 20 files then definitely figure out how to do all this in a batch file or at least from the command line -- otherwise you are going to take hours.
* I would convert each file into 2 or even 3-4 different formats (see above). This way if you start getting difficulties with one you've always got another alternative to try.
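For what it's worth, here is a minimal sketch of what such a batch job might look like in Python. It rests on several assumptions that are mine, not tested against your files: that you have an office suite whose `soffice` program accepts the `--headless`, `--convert-to` and `--outdir` options (LibreOffice, the OpenOffice fork, does; older OpenOffice releases may not), that it can actually import your legacy format, and that the format names in `TARGET_FORMATS` are ones your install recognizes. The `*.wpd` pattern and the function names are made up for illustration.

```python
import subprocess
from pathlib import Path

# Formats to produce for every document (an assumed list; adjust to taste).
TARGET_FORMATS = ["pdf", "rtf", "txt", "html"]


def build_commands(source_dir, output_dir, pattern="*.wpd", soffice="soffice"):
    """Return one soffice command (as an argument list) per file and format."""
    commands = []
    for path in sorted(Path(source_dir).glob(pattern)):
        for fmt in TARGET_FORMATS:
            commands.append([
                soffice, "--headless",
                "--convert-to", fmt,
                # Each format gets its own subfolder, e.g. out/pdf, out/rtf.
                "--outdir", str(Path(output_dir) / fmt),
                str(path),
            ])
    return commands


def convert_all(source_dir, output_dir, pattern="*.wpd"):
    """Run every conversion in turn, stopping at the first failure."""
    for fmt in TARGET_FORMATS:
        (Path(output_dir) / fmt).mkdir(parents=True, exist_ok=True)
    for cmd in build_commands(source_dir, output_dir, pattern):
        subprocess.run(cmd, check=True)
```

I'd print the list from `build_commands` and try one command by hand on a single file before letting `convert_all` loose on the whole folder.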

Remember that space is *cheap* when you are talking about documents. You can probably save all your documents in 5 different formats and they will take up the space of less than 50 digital pictures... And it's a lot less time invested to output several different formats at once when you've already got your system in place as opposed to starting and then re-starting and then re-starting again a few years later...

If storing in all the different formats seems like it'll take too much time, the order I would probably choose is PDF and/or RTF first, followed by TXT, then HTML, and then docx.

Hmmm... Another fascinating possibility just occurred to me. Put them all in Google Docs in the cloud. You've got access to them from wherever, and if Google ever upgrades its software they are going to do whatever is necessary to keep the legacy documents readable. So you could leave the entire problem on their shoulders.

Be aware that this is a problem you can solve by throwing money at it if time is an issue. There are companies out there that do document conversion as their primary business. These guys would be willing to convert from even the oldest format, probably, and so an alternative would be to do nothing but then when you need the document just send it their direction and let them handle it. Of course there's the issue of their turnaround time as well as the cost involved...

So what did I miss? Is there a clear winner that I haven't emphasized enough? Any thoughts on the preferred tool for batch converting these files? Maybe you've already put together a batch file to do this... Let me know in the comments.

(Of course I'm expecting that after 5 years of silence you all will be bursting with desire to spend the next few days commenting non-stop ... oh, wait, probably nobody'll ever see this... Oh, well.)
