
Saturday, 22 March 2008


DOC files are often incompatible between different versions of MS Word, let alone reliably convertible to other word processors. They can also carry MS Word viruses and tend to be bloated due to a feature of Word that I recommend turning off--"Fast Saves."

I usually tell my students to convert their DOC files to RTF format, if they want to send me files. That keeps them smaller for e-mail, safe from viruses, and more readable by more word processors, so I don't have to worry about compatibility issues between their version of Word and mine.

I keep a copy of Word for dealing with files that people send me and for preparing manuscripts for publication, but I really write in Nota Bene, which is still the best software for academic writing around (www.notabene.com for more info on the current version).

Sorry David, but I have to disagree. The .doc format is pretty much ubiquitous. If your version of a .doc format is unrecognizable in one version of Word, then that means the .doc format is more than three versions older than the one you are currently using, as Microsoft has legacy support back three generations. (So, 2008 will recognize XP versions of Office.) There are also converter packs you can get for earlier versions.

I have worked in several universities and only once have I come across a professor using Nota Bene. The entire Helpdesk staff chortled at his request for support. Nota Bene is NOT a mainstream program, nor in my experience is it very common even in higher education. The schools that I have had some exposure to include Alfred University, SUNY Cortland (and by extension all SUNY and CUNY schools), Skidmore College, Gettysburg College, Suffolk County Community College, University of Missouri, Washington University, University of Colorado, Denver University, St. Clair County Community College, University of Toledo, College of Charleston, Charleston University, Vanderbilt, Appalachian State, UNC Chapel Hill, Clemson, USC, and probably a few more. The one instance of Nota Bene was at CofC and the state of SC is woefully ranked either 49th or 50th in education depending on which poll you reference.

The .doc format is probably the closest to a ubiquitous formatted editor there is given the saturation of the Microsoft Office applications through the PC marketplace. If you wanted to pick a format that is entirely ubiquitous, then go with the txt format.

For anybody tackling a project like this again, check out this free program, 'Double Killer'. It searches for and displays duplicate files based on name, size, date, modified date and a checksum it generates (or any combination of the above).

It then displays a list of the doubles it finds so you can confirm its findings before deleting the doubles.

It won't do ALL of the work for you, but might take care of a large chunk of the more obvious ones. I'm bad with keeping things organized, so find that running this once in a while keeps things a bit cleaner.
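
For anyone who would rather script the same idea, here is a minimal Python sketch of checksum-based duplicate detection. It is not how Double Killer itself works internally (I don't know that); it only shows the general approach of comparing file sizes first and then checksums. The function names and the choice of SHA-256 are my own assumptions.

```python
import hashlib
import os
from collections import defaultdict


def file_checksum(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(root):
    """Group files under root that share both size and checksum."""
    by_size = defaultdict(list)
    for folder, _subdirs, names in os.walk(root):
        for name in names:
            path = os.path.join(folder, name)
            if os.path.isfile(path):
                by_size[os.path.getsize(path)].append(path)

    groups = []
    for paths in by_size.values():
        if len(paths) < 2:
            continue  # a file with a unique size cannot have a duplicate
        by_hash = defaultdict(list)
        for path in paths:
            by_hash[file_checksum(path)].append(path)
        groups.extend(sorted(g) for g in by_hash.values() if len(g) > 1)
    return sorted(groups)
```

Like the program, this only reports the groups; nothing is deleted, so you can still confirm each match by eye before removing anything.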


You should look into a program called Dup Detector. It can find duplicate images based on the actual content of the image. After building a catalog, it will present pairs of images to you that have x percent correspondence (with x being a value you can set). It's really great for detecting identical and similar images on your computer.
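
To give a rough feel for what "x percent correspondence" can mean, here is a small Python sketch of one common technique, an average hash. This is an assumption on my part for illustration, not a description of the program's actual algorithm. It also assumes each image has already been decoded and shrunk to a small grid of grayscale values, which a real tool would do with an image library.

```python
def average_hash(pixels):
    """Return a tuple of bits: 1 where a pixel is brighter than the mean."""
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    return tuple(1 if value > mean else 0 for value in flat)


def similarity_percent(hash_a, hash_b):
    """Percentage of bit positions on which two equal-length hashes agree."""
    if len(hash_a) != len(hash_b):
        raise ValueError("hashes must have the same length")
    matches = sum(1 for a, b in zip(hash_a, hash_b) if a == b)
    return 100.0 * matches / len(hash_a)


def similar_pairs(catalog, threshold):
    """Yield (name_a, name_b, percent) for pairs at or above threshold."""
    names = sorted(catalog)
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            percent = similarity_percent(catalog[first], catalog[second])
            if percent >= threshold:
                yield first, second, percent
```

The catalog here is just a dictionary mapping a file name to its hash, and the threshold plays the role of the adjustable x.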

I didn't claim that NotaBene was widely used or mainstream, just that it is the only serious academic word processor out there, even if most of my colleagues don't realize it. Word is widely used, but is more oriented toward business applications. Try to call Microsoft to help you sort out your footnote formatting problem or to make your document conform to an academic style manual, and they'll generally have no idea what you're talking about or why it's important. I've never run into Bill Gates at the MLA Conference, but I have spoken with Steve Seibert, the President of NotaBene, there, and the company's help desk has always been helpful when I've called.

If the DOC format is only supported for three generations, well, then it's a moving target, and is in no way a long term archival solution, nor can it be called a single file format. Which DOC format is ubiquitous? There are several of them. Usually the problem is that one version of Word will open a DOC file from another version of Word, but there will be lots of garbage appended to the file, and the formatting won't be as the author intended. This never happens with RTF. The RTF format is much more stable, and it's easy to convert to it in Word.

I've long relied on d'peg to remove duplicates.

It has color content and pattern matching to find duplicates of the same image, despite rotation, cropping, color correction, borders, text added, you name it.

I seem to be happier with my older version than the latest, but in case anyone is interested:


Having said the above, I forgot completely the main point of the article which was to save things in a logical format and try to use descriptive file names and current software. You not only will be doing yourselves a favor in file management, but eventually your heirs as well. My heart goes out to Ctein for the extensive work he had to put in on this project. Great idea also to use Bridge to view a thumbnail of everything - had never thought to use it for these purposes, but it makes perfect sense. Now, excuse me while I go organize my own hard drive(s)! :)

Guys, enough with the software debate. This isn't a software forum.


Mike J.

Dear David,

In your zeal to damn Microsoft, you've missed the point, and you're wrong in detail.

The DOCs are compatible with every version of Word as far back as 6. So many documents exist in this one format that readers *will* be available into the indefinite future even if Microsoft disappeared tomorrow. That's all that's important.

The virus argument is ludicrous. Maclink Pro doesn't attach viruses to files. The Fast Save argument is equally silly, for much the same reason.

I don't use Word either; I hate it. I happen to much prefer Nisus. So what? This isn't about me. It isn't even about you.

Please save the rants for appropriate venues.

Hopefully this will get to be the last word on this unfortunate diversion.

pax / Ctein

Dear Folks,

Double Killer, Dupe Detector and d'peg all sound seriously worthwhile. I will investigate them. Thanks so much for the suggestions! It's great when I can write a column like this and learn so much.

pax / Ctein

Ms. Corinne's fans are very fortunate to have someone of Ctein's calibre to save her work from being lost in obscurity.

The vast bulk of this century's digital image archives on personal computers will likely be lost in oblivion because there won't be anyone with enough skill or caring among the departed's friends and associates who could make the time and effort to wade through the digital swamp.

I make a print of whatever I think is a 'good' photo in my own archives, so that subsequent to my death someone can find them easily. I wouldn't wish it on anyone to sit at my workstation and trawl through the tens of thousands of images (and perhaps hundreds of thousands if I'm fortunate to live that long).

I think the core lesson here is about organizing your own data. Not for your executor - most of us will not merit that - but because I do not think many folks are really able to make good use of their digital collections. Sloppy data habits that were limited by how much film you could shoot become nightmarish with 8 gig memory cards. Fred Picker advised throwing away marginal negatives once a year so that you would not waste time with them, rather than shooting and moving forward. While I do not do that, the wisdom in it is more acute for the digital world. I do ruthlessly prune my digital files, then use a filing program to categorize them and include context info.

We have an interest around here in things that are archive-quality (or "archival", an adjective I'm not fond of).

For physical objects, we have some idea what makes things age-resistant. Pigments instead of dyes; acid-free paper; proper storage.

For information, we are still groping our way around. It's as much a problem of library science as of computer science ... but for librarians and computer scientists both, it's uncharted territory to some degree.

However, there are some aspects of data conservation that are well understood due to their similarities to other interoperability problems. A good file format for conservation is described by an *open specification*. This specification has *no intellectual property encumbrances* and exists in *multiple independent implementations*.

RTF fits the bill. Word does not. This is not about Microsoft-bashing; if we were talking about AppleWorks or WordPerfect files, I'd say the same thing.

I think people were unfairly grumpy in David's direction. The concern he brought up was relevant, and the solution he proposed was easy, cheap, and, I believe, correct.

Sorry for the off-topic comments about word processing software. I think you're assimilating my skepticism about the archival suitability of DOC files to the general anti-Microsoft sentiment out there or the PC/Mac wars. I've done all of my writing on a PC since around 1985 after a year or so of using a Xerox 3030.

My point is that RTF (Rich Text Format), which is a standard developed by Microsoft, is much more stable than DOC, which changes with every version of Word. RTF files are stored in ASCII format and are readable in any text editor, and are easily convertible to XML-based formats, which are the real library standard for digital archives of formatted text documents.
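
To make the "plain ASCII" point concrete, here is a minimal Python sketch that wraps plain text in a bare-bones RTF document. The function name and the font table entry are arbitrary choices of mine, and the sketch covers only a tiny subset of RTF (escaped braces and backslashes, paragraph breaks, and \u escapes for non-ASCII characters); it is nowhere near a full RTF writer.

```python
def to_rtf(text):
    """Wrap plain text in a minimal RTF document using only ASCII characters."""
    parts = []
    for char in text:
        code = ord(char)
        if char in "\\{}":
            parts.append("\\" + char)   # escape RTF's own special characters
        elif char == "\n":
            parts.append("\\par\n")     # paragraph break
        elif code < 128:
            parts.append(char)          # plain ASCII passes through unchanged
        elif code <= 0xFFFF:
            # \uN takes a signed 16-bit decimal; "?" is the fallback character
            signed = code - 65536 if code > 32767 else code
            parts.append("\\u%d?" % signed)
        else:
            parts.append("?")           # beyond the 16-bit range: not handled here
    header = "{\\rtf1\\ansi\\deff0{\\fonttbl{\\f0 Times New Roman;}}\\f0 "
    return header + "".join(parts) + "}"
```

The result can be opened in any text editor, and everything in it, markup included, is ordinary readable characters.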

Only one nit to pick, really. Why on earth didn't you use ISO dates?


Also, and more a personal preference, I'd put them at the start of the filename, eg.


and so on. This means that things will appear neatly in chronological order without any special effort.
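
A quick Python sketch of why this works: ISO 8601 dates (YYYY-MM-DD) sort chronologically under a plain alphabetical sort, while month-first dates do not. The file names and dates below are hypothetical, made up purely for illustration, and are not taken from the article.

```python
from datetime import date


def iso_prefixed(name, day):
    """Prefix a file name with an ISO 8601 date (YYYY-MM-DD)."""
    return f"{day.isoformat()}_{name}"


# Hypothetical file names and dates, for illustration only.
files = [
    ("letter.rtf", date(2008, 3, 22)),
    ("outline.rtf", date(1999, 12, 1)),
    ("chapter.rtf", date(2008, 1, 5)),
]

# Alphabetical order equals chronological order with ISO prefixes.
iso_names = sorted(iso_prefixed(name, day) for name, day in files)

# A month-first prefix sorts by month, so the 1999 file lands last.
us_names = sorted(f"{day.strftime('%m-%d-%Y')}_{name}" for name, day in files)
```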

The date format you have selected shows significant lack of foresight.

Dear folks,

Re: file formats, we're just going to have to agree to disagree. There's nothing wrong with RTF but I do not accept that DOC is any less useful. What will determine the readability of formats far into the future is not whether or not they conform to an open standard but how much information of import is stored in them. There are a number of open-standard data formats out there which have turned out to be so little-used that I am certain that most computers will be unable to read them in 30 years. While open code allows for the writing of translators at any time, the practical reality is that those activities consume time and money and those are always in short supply. Regardless of the future of Microsoft, the existence of a gazillion important documents in DOC form ensures widespread readability for a very long time.

The same can be said of RTF. But this does not make the use of DOC inappropriate. It is merely that there is more than one way to address this problem satisfactorily. I could actually trot out some arguments about why DOC is better for my purposes, but frankly I think they boil down to a kind of nit-picking that I find both boring and useless. So I won't. Because both formats work. Pick the one that you think will work better for you. You can't guess wrong in this case.

Re: date codes, I did put some substantial thought into the question of whether the date should be at the beginning or end of the file name. I decided end of file name was better because it was not as important to be able to sort the documents into chronological order easily as it was to be able to easily collect documents that belonged to the same project. For example a book manuscript of Tee's spanned a considerable number of years in various versions, addenda and subsections. They tended to fall nicely together if organized by original filename. Sorted by date, though, they were dispersed among the myriad other documents that had been created over the same period. Since scholars would also have access to Tee's original data and could look at the chronology of her file and folder structures there, I decided it was not important to emphasize this in my translations, rather it was more important to make it easy for scholars to find all the documents that might belong in a particular group of interest.

But one could just as easily argue it the other way under slightly different circumstances. I'm only saying that this was not a thoughtless decision.
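
For the curious, here is a small Python sketch of the trade-off. With the date at the end of the name, a plain alphabetical sort keeps each project together, and a chronological view is still recoverable by sorting on the date part. The names and the "project_YYYY-MM-DD.ext" pattern are hypothetical, chosen only for illustration; they are not the actual names or date format used in the archive.

```python
import re

# Assumed pattern: "<project>_<YYYY-MM-DD>.<ext>" (hypothetical).
DATE_SUFFIX = re.compile(r"^(?P<stem>.+)_(?P<date>\d{4}-\d{2}-\d{2})\.(?P<ext>\w+)$")


def split_name(name):
    """Return (stem, date) for a name that matches the assumed pattern."""
    match = DATE_SUFFIX.match(name)
    if match is None:
        raise ValueError(f"unexpected file name: {name}")
    return match.group("stem"), match.group("date")


def by_project(names):
    """Plain alphabetical sort: versions of one project stay together."""
    return sorted(names)


def by_date(names):
    """Chronological sort using the date suffix, with the name as tie-breaker."""
    return sorted(names, key=lambda name: (split_name(name)[1], name))


# Hypothetical file names, for illustration only.
names = [
    "novel_ch1_2001-05-02.rtf",
    "letters_1999-11-30.rtf",
    "novel_ch1_1998-07-14.rtf",
    "novel_notes_2000-01-09.rtf",
]
```

Sorted by project, the two versions of the chapter sit next to each other; sorted by date, they end up at opposite ends of the list.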

~ pax \ Ctein
[ please excuse any word salad. MacSpeech in training! ]
-- Ctein's online Gallery http://ctein.com
-- Digital restorations http://photorepair.com
