Friday, March 23, 2012

Secrets of Recovering Corrupt Word DOCX Files

The first Word file corruption message you'll see.
The first error you'll see if you can't open a Word file due to corruption.
  • DOCX files, the format used by Microsoft Word 2007 and later (as opposed to the older DOC format of Word 97-2003), are in reality conventionally zipped collections of mostly XML sub-files. The main body text of a DOCX file is all stored in just one sub-file, word/document.xml (the slash direction doesn't matter), where the "word" sub-folder is in the root of the zip file structure.
A screenshot of a typical unzipped root of a DOCX file.
Root zip structure of the DOCX file showing the [Content_Types].xml 
file and the other XML subfile containing folders.
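As a rough illustration of that structure, the sketch below uses Python's standard zipfile module to list the sub-files of a DOCX and pull out word/document.xml. The function names are my own for this example. It assumes the zip structure is still readable and that the XML is stored as UTF-8, so a badly damaged file will simply raise an error here.

```python
import zipfile


def list_parts(docx_path):
    """Return the names of every sub-file stored in the DOCX zip."""
    with zipfile.ZipFile(docx_path) as archive:
        return archive.namelist()


def read_document_xml(docx_path):
    """Return the raw XML of word/document.xml as text (UTF-8 assumed)."""
    with zipfile.ZipFile(docx_path) as archive:
        # Member names inside a zip always use forward slashes.
        return archive.read("word/document.xml").decode("utf-8")
```

Calling list_parts("gwn.docx") on a healthy file should show [Content_Types].xml and the word/ sub-files seen in the screenshot.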
  • One kind of problem with the document.xml sub-file is that one or a few tags get out of order. Since XML is purposefully designed to be intolerant of errors (in contrast to the related HTML, XML was designed not to have the same room for differences of interpretation), once there is an error like this, Word will refuse to open the file. Even though it is sometimes capable of doing so, it will often also refuse, during a second pass, to extract the text without formatting. This usually occurs because the document's author has unknowingly placed a field inside another field and then edited it, for instance a text box placed inside a math equation. Here is some further information regarding at least Math tag order issues.
An example of invalid tag order on the left and its valid correction on the right.
An example of invalid and valid math tag sequences.
The invalid one is on the left, the valid is on the right.
  • In addition to tag mistakes made by Word, there is a much more serious type of corruption where, instead of full or almost full recovery being possible, the best that can be done is to truncate the document.xml file and reconstruct legitimate ending tags for the sub-file. The document.xml sub-file is usually the largest one in the zip/docx, so when a power outage occurs while the file is being written, say to an external USB memory stick, text and data will be lost and the Word file won't open. Salvaging what still exists of the file involves manually or programmatically finding the last good tags, truncating the file there and reconstructing good XML completion tags for what remains. You can get more information regarding this issue, as well as some more on tag order issues, here.
  • A command line program called xmllint, using the --recover option, can sometimes in one operation find the last good XML, truncate the file and add appropriate tags to the end so that it passes as well-formed XML. However, I have found that this is not always reliable: despite the supposedly common interpretation of what counts as valid XML, what xmllint says is good XML is not necessarily what Microsoft Word will accept. Instead, what is often necessary is to load the document.xml file in a text editor, find the error, backtrack to where the tag began, truncate the rest and only then use xmllint to finish the file with good XML. In the rest of the article I will describe how I think it is best to do this manual type of repair.
  • One last point: in addition to the conventional causes of Word and other file corruption, like program bugs, power outages, viruses and overheating computers, there unfortunately exists a pernicious problem that I recently became aware of: "fake memory." Fake memory is USB or flash memory that reports one size to Windows, say 16GB, when in reality the device has a much smaller amount of memory available, say 2GB. These pieces of memory are labeled with the major brands but sold at cut-rate prices.
  • The problem comes when the memory's real limit has been reached. At this point, it either throws out the part of the data being written that pushed it over the top, or it overwrites the oldest files with the new data. Either way the result is corrupt files. There are some estimates that a third of the memory sold under some brand names is fake in this way. You can test your memory for this issue and find out more information here.
Depiction of the 2 Corrupting strategies fake memory uses with overflow data.
What the two kinds of fake memory do to data when they reach capacity.
  • A Warning: MAKE COPIES OF YOUR FILE. DO NOT WORK ON THE ORIGINAL, IN CASE THE TREATMENTS BELOW MAKE YOUR FILE'S CORRUPTION WORSE!
The most effective scheme for recovering corrupt DOCX files I found is as follows:
  1.  First you will want to know whether you have a tag order issue or whether your document.xml or styles.xml (those are pretty much the only two files which will prevent a DOCX from opening) is truncated and will need an amputation of some kind.
    • I believe that most of the time the quickest way to determine what is needed is to see if the zip file structure remains intact. A DOCX file that produces an error when Word tries to open it, but which still has an intact zip structure, is most likely one that needs tag reordering or a tag set removed. Luckily these issues can often be fixed with one of these free programs: the Microsoft Mr. Fixit for Math Tags, Tony Jollans' Word Add-In (maybe also just for math tags), Word Corrupt Document Checker or my program, Savvy DOCX Recovery. Additionally, these two threads on the Microsoft Support Forums have experts who will look at your corrupt DOCX files (whether they need tag reordering or even XML file truncation) and sometimes give timely free help: here and here. I hang out on the first one and sometimes give out free help too.
Word Corruption Checker and Savvy DOCX Recovery's interfaces side by side.
Word Corruption Checker is on the left and Savvy 
DOCX Recovery's main interface is on the right
    • To test whether the zip structure is still intact, either change the extension from .docx to .zip or simply add .zip to the end of the file name, and then try to unzip the file in Windows. Note: if you can't see an extension on your file, right click your Word document and choose Properties. Then, at the top of the first tab, General, simply add .zip to the file name.
      General Tab of the Properties of a typical DOCX file where you can change the extension.
      Adding the ".zip" extension to the file name of a DOCX document.
    • If you get a message from Windows that the zip structure is corrupt, then your issue is probably not a tag order one and you will need to proceed with zip file repair in step 2. These repairs below can be done even with the .docx extension, so after your unzipping test, change back the file to a .docx extension or remove .zip again.
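If you would rather not rename the file, the same kind of check can be sketched with Python's standard zipfile module. This is only an approximation of the Windows test described above: zip_structure_intact is a name I made up for this example, and Python's idea of "intact" (the central directory is readable and every member passes its CRC check) may not match Windows' verdict in every case.

```python
import zipfile
import zlib


def zip_structure_intact(docx_path):
    """Return True if the DOCX opens as a zip and every member passes its CRC check."""
    try:
        with zipfile.ZipFile(docx_path) as archive:
            # testzip() returns the name of the first bad member, or None.
            return archive.testzip() is None
    except (zipfile.BadZipFile, zlib.error, EOFError):
        # The archive could not be read at all, or a member's data was damaged.
        return False
```

A result of False suggests going on to the zip repair in step 2, while True points toward a tag order problem.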
  2.  Next, repair the zip structure of the file with the -FF option of InfoZip's command line program Zip.exe (you can launch a command prompt from any folder from the Explorer File menu in Windows 10, and I believe in Windows 8 and 8.1 too).

    • For a command line, if for example your file was called gwn.docx, you might do: 
zip.exe -FF gwn.docx --out gwn_repaired.docx

Results of using InfoZip's Zip.exe's -FF repair command on a corrupt DOCX file.
Hopefully you will see results similar to the above when
using InfoZip's command line Zip.exe's -FF command.
  3. Try to open your zip-repaired DOCX in Word. You might get lucky and find that this repair is enough to get the file open.
    • If you do get an error and are then prompted with a second message that says "Word found unreadable content in [name of your file]. Do you want to recover the contents of this document? If you trust the source of this document, click Yes.", say Yes to try to salvage the text.
Query message after closing the first error message in Word, asking if you want to recover/salvage the contents of the file at least - that is without formatting.
Text content recovery notice mentioned in step 3.
  4. If Word next tells you it can't open the file because of an XML parsing error, click the Details button and carefully note which XML file, line and column the program is choking on. For instance, the message below indicates that the first bit of file corruption is on line 2 of the document.xml file, at the 31,589th character.
    The details of the second error from a typical corrupt DOCX file.
    The Details notice mentioned in step 4. Note that the error location is line 2, column
    31589, which just means the 31589th character in the second line of the XML file.
    • If no XML file is referenced in this Details window, usually you won’t be able to recover anything. However give my open source software, Savvy Corrupt Word DOCX File Recovery, mentioned before, a chance, as it occasionally fixes those. By the way in addition to tag order issues it tries to programmatically reproduce the manual steps in this article for files that need truncation as well. The program is a bit of a work in progress though and manual repair is sometimes needed.
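Once the sub-file has been extracted (see the next step), the location of the first error can also be found without Word. The sketch below uses the expat parser from Python's standard library. first_xml_error is a hypothetical helper name, and expat may not always report exactly the same line and column as Word, so treat its answer as a cross-check rather than a replacement for the Details window.

```python
from xml.parsers import expat


def first_xml_error(xml_bytes):
    """Return (line, column, message) for the first parse error, or None if well formed."""
    parser = expat.ParserCreate()
    try:
        parser.Parse(xml_bytes, True)
    except expat.ExpatError as error:
        # expat counts lines from 1 and columns from 0; report both from 1.
        return error.lineno, error.offset + 1, expat.ErrorString(error.code)
    return None
```

To use it, read the extracted sub-file in binary mode, for example open("gwn_repaired_output/word/document.xml", "rb").read(), and pass the bytes in.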
  5. Next, extract your zip-repaired DOCX file with the x command of 7zip's 7z.exe command line program. A possible hang-up here is that the name of the folder into which you wish to extract your zipped DOCX must follow the -o parameter directly, with no space in between.

    • So a command might be:
7z.exe x gwn_repaired.docx -ogwn_repaired_output
      Output from using 7z.exe to extract a zip repaired corrupt file.
      Hopefully what you'll see in the command line results of using 7z.exe on a repaired docx file.
      The program knows there is still a problem with document.xml but it extracted it anyway.
    • Note: I found 7z.exe more effective than InfoZip's unzip.exe for recovering XML files from partially corrupt Office Open XML files. I even tried multiple other command line unzip programs, but found 7z.exe was the best at extracting the most data from damaged files. Note also that if any file other than document.xml is referenced in the error message details of the second attempt to open your DOCX file in Word, for instance word/styles.xml, a shortcut to a quick recovery is simply to remove that XML file from your 7z.exe output and then go to Step 9 and rezip the contents. Word will generate a dummy styles.xml file, and possibly other missing files, on its own with apparently little harm to the document other than perhaps a loss of formatting that can be easily recreated.
  6. Now, in your extraction folders, find the XML file referenced in the error details in step 4. If it is the word/styles.xml file, you can first try to repair it by following what is outlined below and, as mentioned, if that doesn't work you can always delete it.
    • Open the XML file with the freeware Notepad++ or some other XML or programmer-oriented text editor that shows XML line and column numbers. Locate the error referenced and then remove the rest of the XML from the error to the end of the file. A lot of the time what you'll see after the error is more of your text, or repeated portions of your text, mixed in with all kinds of broken tags. Of course if you are knowledgeable about XML, feel free to try to correct tags where they start going bad. That way you may be able to recover more of your text. However, I have found this can quickly become tedious and unrewarding, and your time is better spent simply recreating the content you are unable to recover with this amputation method.
    • Since most of the document.xml file is stored on one line of XML, usually line 2, a possibly easier way to find where your XML error starts is to pretty print the code. This means putting each XML tag (similar to HTML tags) on its own line. Once you then rezip the sub-files and change the zip file extension to docx, the error that Word reports will be on a line other than line 2, say line 749. Pretty printing XML does not affect the content or formatting of your Word file as displayed in Word (or printed out, for that matter).
The details of the second Word error from the same file where the document.xml file has been pretty printed.
More useful error detail after document.xml was pretty
printed and all the sub-files rezipped into a DOCX
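For those who prefer to stay offline, a very rough stand-in for pretty printing can be sketched in Python. It does not indent anything and it is not what any particular beautifier does. It only starts a new line wherever one tag is immediately followed by another, which also works on broken XML that a real XML parser would reject. One caveat I am aware of: an empty element written as an open and close pair gains a line break between the two tags. one_tag_per_line is a name made up for this example.

```python
import re


def one_tag_per_line(xml_text):
    """Insert a line break between directly adjacent tags; text inside tags is untouched."""
    # ">" followed immediately by "<" marks the boundary between two tags.
    return re.sub(r">(?=<)", ">\n", xml_text)
```

Read document.xml as text, pass it through the function and save the result back over the extracted file before rezipping.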
  7. You can pretty print your XML by copying all of it from your text editor, pasting it here: http://www.cleancss.com/xml-beautify/ and pressing the Format Code button.
    Document.xml without pretty printing formatting on the left, with pretty printing formatting on the right.
    The image on the left is the document.xml before beautifying/
    pretty printing. The image on the right is afterwards.
    • After pretty printing your XML sub-file and copying it back to your text editor (and saving the change to the file!), rezip all the sub-files originally found in your corrupt file back into a DOCX as described in step 9. When you try to open the file again, Microsoft Word will now give you a more informative line number for the error, which is easier to use.
    • Note also that once you locate the error, you are going to want to go a little upstream from where it is indicated, so that the file ends only in a complete tag, text, formula or data. That probably means getting rid of the whole line where the error is indicated and everything beyond it, even though Word will say that the error begins at, for example, the 10th column (10th character) of line 749. The column is only the point at which Word realizes there is an error; the real error begins at the start of the line, which is the start of the tag. So it is better to get rid of everything from the start of that line onward, so that you don't go into the next step with an open tag.
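The amputation just described can be sketched in a few lines of Python. This assumes the file has already been pretty printed so that every line starts with a tag, and that the line number is counted from 1 as in Word's Details window. keep_lines_before is a hypothetical name for this example.

```python
def keep_lines_before(xml_text, error_line):
    """Drop the line on which the error was reported and everything after it."""
    lines = xml_text.split("\n")
    # Lines are numbered from 1, so the first error_line - 1 lines are kept.
    return "\n".join(lines[: error_line - 1])
```

With an error reported on line 749, keep_lines_before(text, 749) keeps lines 1 through 748 and leaves the missing closing tags to the next step.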
  8. Now use xmllint with the --recover option on the truncated XML file.

    • An example command might be:
xmllint --recover gwn_repaired_output/word/document.xml -o gwn_repaired_output/word/document.xml
      Xmllint output without manual truncation first is shown on the left and the output after manual truncation is on the right.
      The left shows xmllint's screen output where it found the error, truncated the file and added valid ending XML tags all by itself. Word did not, however, accept the result and the file would not open. The right shows xmllint's output where it was used just to add valid end tags, after the first XML error had been located and the rest truncated manually as described. When zipped up with its accompanying XML sub-files, this version opened in Word.
    • A fast way to get xmllint may be to download it from here. Otherwise, you can install xmllint and its dll support files by installing the free Strawberry Perl. Xmllint usually gets installed into the C:\strawberry\c\bin directory, from where you can copy it, along with the support files libiconv-2_.dll, libXML2-2_.dll and libz_.dll, to the directory where you are trying to repair your file. However, I wouldn't bother most of the time, because I think the C:\strawberry\c\bin folder gets added to the PATH environment variable (the one shown in the Control Panel's Windows System app) during Strawberry Perl installation, after which you can use the xmllint command from any folder. There are other ways to install xmllint which you can find by Googling. Here's one that might be good, for instance: http://flowingmotion.jojordan.org/2011/10/08/3-steps-to-download-xmllint/.
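If xmllint is not available, the idea of adding the missing end tags to an already truncated file can be illustrated in plain Python. To be clear, this is not how xmllint works internally and it is much cruder: it only tracks which tags are still open and appends their closing tags in reverse order. It assumes the file was cut at a clean tag boundary, has no tag order errors, and contains no raw < or > characters inside attribute values. close_open_tags is a name invented for this sketch.

```python
import re

# Matches opening, closing and self-closing element tags. Declarations such as
# <?xml ...?> and comments are skipped because their names do not start with a letter.
TAG = re.compile(r"<(/?)([A-Za-z_][\w:.\-]*)[^<>]*?(/?)>")


def close_open_tags(truncated_xml):
    """Append closing tags for every element still open at the end of the text."""
    open_tags = []
    for match in TAG.finditer(truncated_xml):
        closing, name, self_closing = match.groups()
        if closing:
            # Only pop when the closing tag matches the innermost open element.
            if open_tags and open_tags[-1] == name:
                open_tags.pop()
        elif not self_closing:
            open_tags.append(name)
    return truncated_xml + "".join("</%s>" % name for name in reversed(open_tags))
```

For a document.xml cut off inside the body, this typically appends something like the closing tags for w:p, w:body and w:document. Whether Word accepts the result is a separate question, as the xmllint comparison above shows.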
  9. So now we can rezip all our XML files and change the extension back to DOCX. Rezip all the files within the folder you made with the 7z.exe extraction in step 5, but don't zip the larger folder itself, just its contents. Change the extension of the rezipped contents to DOCX and try to open the file in Word. If you still get an error, it may be time to contact me at socrtwo@s2services.com or to post your issue to one of the forums mentioned in the introduction, which I'm listing again here and here.
A redisplay of the structure of the root of the zip file from which the file must be rezipped after document.xml or styles.xml repair.
Rezip all the files and folders at the root here, but don't go up one level and zip up the one
folder that contains these files and folders, as Word will not be able to open a file made from such a zip.
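The rezipping can also be sketched with Python's standard zipfile module, which makes it easy to store paths relative to the extraction folder so that [Content_Types].xml ends up at the root of the archive. rezip_as_docx is a hypothetical name, and writing [Content_Types].xml first is simply a cautious choice on my part rather than something I know Word requires. Make sure the output file is not saved inside the folder being zipped.

```python
import os
import zipfile


def rezip_as_docx(extracted_folder, docx_path):
    """Zip the contents of extracted_folder (not the folder itself) into docx_path."""
    members = []
    for root, _dirs, files in os.walk(extracted_folder):
        for file_name in files:
            full_path = os.path.join(root, file_name)
            # Store each path relative to the extraction folder, with forward slashes.
            relative = os.path.relpath(full_path, extracted_folder)
            members.append((relative.replace(os.sep, "/"), full_path))
    # Put [Content_Types].xml first, then everything else in name order.
    members.sort(key=lambda item: (item[0] != "[Content_Types].xml", item[0]))
    with zipfile.ZipFile(docx_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for arcname, full_path in members:
            archive.write(full_path, arcname)
```

For the running example this would be rezip_as_docx("gwn_repaired_output", "gwn_fixed.docx"), where gwn_fixed.docx is just an illustrative output name.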
So that is it. I'm working to make this process automatic in my free open source DOCX recovery program, Savvy Corrupt Word DOCX Recovery, and in a newer program I haven't released yet. If you would like an exhaustive list of things you can do to recover or repair corrupt DOC and DOCX files, both those that will still open and those that won't, check out my ridiculously exhaustive new post in this blog here.

12 comments:

waker477 said...

hi,,
If file system get corrupted under any circumstances, results in data loss which is always unbearable by computer user. But nowadays it is possible to recover your all corrupted files in any format like doc, word, jpg, jpeg etc by the use of any recovery tools that can be word file recovery. To get more info and trial version of this tool follow the below mentioned link.

http://www.officefilerecovery.com/word-file-recovery.html
http://www.remosoftware.com/remo-recover-windows.exe

James Tornton said...

pdfrecovery recovers text, graphics, hyperlinks and object forms used in the .pdf document reads/analyzes and repairs the data from the source PDF file without changing its original structure.

Paul D Pruitt said...

Another easy to do often successful step is to just try to use WordPad to open the docx file.

Shadab Samer Ahmad said...

Thanks dear
Plz Visit http://iamsamer.blogspot.com

link submit said...

Really good information, if you want to easily recover your deleted data then see in post, How to do it: http://msofficerecovery.blogspot.com/2014/04/how-to-recover-corrupt-word-documents.html

Daisy Zoe said...

Good job guys! Really very informative for all users. But you can also use Zip recovery software, it is a special tool made for archive files, like RAR files, ZIP files, also can repair the damaged data.

aalia lyon said...

That’s nice blog , its help you to remove rundll error , click this link and get a free from rundll error .
How To Remove Rundll Error
Thank you
Aalia lyon

Johnson Smith said...

Great posting and it can help to recover files of word. You can also recover the password of word with the help of microsoft office password recovery tool.

john phlip said...

I refer an excellent or effective word repair software, you may try Kernel for Word Repair Tool. This software quickly recover all lost corrupt data. You can also download free trial version : http://www.wordrecovery.org

John Brad said...

Really it's very informative blog post. I would like to share one of my best experience which I have got recently. Few days back my MS word data file got corrupted due to unscheduled shut down of my laptop. I was very worried to get my original word data because it was hard to loose that important data. One of my friend recommended me to try this Word Recovery Software to get the original word data. I tried this software and got the success.

Thanks a ton N. Bela Blogger to be with us all the time to recover our corrupted data.

Henry Gill said...

Really losing important word file frustrate you. I too encounter same issues but thanks to Word File Recovery Tool using which I was able to get them back

Damian Higgins said...

I would like to suggest Word File Recovery application for recover all data from corrupted or damaged Word document without any error.
