Saturday, April 28, 2012

Regarding My Title Change for the Article Secrets of Word DOCX File Recovery

I recently changed the title and contents of my post on recovering corrupt docx documents. Before the change, I had generalized the subject, without good reason, to cover docx, xlsx and pptx files. I have since come to see that Microsoft Office handles file corruption differently in each of the three programs, so the article is now solely about docx corruption and what to do about it.

Also note that the article is meant to describe how to recover both the text and the formatting of corrupt docx files. In other articles and elsewhere on the Net I have provided links to my own GUI software, Corrupt DOCX Salvager and Corrupt MS Office 2007/2010 Extractor, as well as to command line tools, my own and others': Command Corrupt OfficeOpen2txt, Silvercoder's DocToText and Sandeep Kumar's docx2txt. These do a pretty good job of recovering just the text, not the formatting.

Even though all these text-extracting programs work pretty well on corrupt docx files, I have found that they all benefit from passing the file through a good zip repair program first, such as Info-ZIP's zip.exe with its -FF option. For our purposes here, a typical repair command line would be: zip.exe -FF input_file.docx --out input_file_zip_repaired.docx. The input file does not need a zip extension, and you don't need to give the output one either.
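
For anyone who wants to script this repair step, here is a minimal Python sketch that builds and runs the same zip -FF command line shown above. The function names and the default executable name "zip" are my own placeholders, not part of any of the tools mentioned. Feeding "y" on standard input is also an assumption, in case your build of zip asks a question before repairing.

```python
import subprocess
from pathlib import Path


def build_zip_repair_command(input_path, output_path=None, zip_exe="zip"):
    """Return the argument list for: zip -FF <input> --out <output>."""
    src = Path(input_path)
    if output_path is None:
        # Mirror the naming used above: name.docx -> name_zip_repaired.docx
        output_path = src.with_name(f"{src.stem}_zip_repaired{src.suffix}")
    return [zip_exe, "-FF", str(src), "--out", str(output_path)]


def repair_docx_zip(input_path, output_path=None, zip_exe="zip"):
    """Run the repair command and return the CompletedProcess.

    Does not raise on a non-zero exit code; inspect .returncode,
    .stdout and .stderr yourself.
    """
    cmd = build_zip_repair_command(input_path, output_path, zip_exe)
    # Assumption: answer "y" if zip prompts before attempting the repair.
    return subprocess.run(cmd, input="y\n", capture_output=True, text=True)
```

On Windows you would pass zip_exe="zip.exe" (or its full path) if it is not already on your PATH. The repaired copy can then be handed to whichever text extractor you prefer.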

Anyway, my new open source app, Savvy Corrupt DOCX File Recovery, automatically repairs the zip as a first step and then tries to recover the formatting too. If that is unsuccessful, it falls back to a plan B and a plan C that recover just the text, so with the new app you may get the best of both worlds.

Postscript: I now have a docx file in which document.xml and styles.xml are both corrupt. This causes Savvy DOCX Recovery to choke. Until we fix it, if Savvy DOCX Recovery chokes, try manual recovery and delete all corrupt XML files other than document.xml. You can tell an XML file is corrupt by following the method described in my previous post: when 7z.exe is used as the unzipper, the command line output will show "data error" next to the unzipped file.
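
If you would rather not read through the 7z.exe output by eye, a rough alternative is to let Python's standard zipfile module try to read every member of the docx and report the ones that fail. This is only a sketch and not the 7z.exe method from my previous post: it assumes the zip structure is intact enough to open (for example after a zip -FF repair), and the function name is a placeholder of my own.

```python
import zipfile
import zlib


def find_corrupt_members(docx_path):
    """Return the names of zip members that cannot be read back cleanly.

    docx_path may be a file name or a binary file-like object.
    """
    corrupt = []
    with zipfile.ZipFile(docx_path) as zf:
        for name in zf.namelist():
            try:
                # Reading the whole member decompresses it and checks its CRC-32.
                zf.read(name)
            except (zipfile.BadZipFile, zlib.error, EOFError):
                corrupt.append(name)
    return corrupt
```

Any name in the returned list other than word/document.xml is a candidate for deletion during manual recovery, as described above.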

