|The first error you'll see if you can't open a Word file due to corruption.|
- Files in the DOCX format used by Microsoft Word 2007 and later (as opposed to the older DOC format of Word 97-2003) are in reality conventionally zipped collections of mostly XML sub-files. The main text of a DOCX file is all stored in just one sub-file, word/document.xml (the slash direction doesn't matter), where the "word" sub-folder sits in the root of the zip file structure.
|Root zip structure of the DOCX file showing the [Content_Types].xml file and the other folders containing XML sub-files.|
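To make the structure concrete, here is a minimal Python sketch that lists the parts of a zip archive laid out like a DOCX. The tiny archive built by `make_demo_docx` is a hypothetical stand-in (it is not a file Word would open); with a real document you would open the .docx path directly with `zipfile.ZipFile`.

```python
import io
import zipfile


def make_demo_docx() -> bytes:
    """Build a tiny stand-in archive with DOCX-like part names (not a real Word file)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("[Content_Types].xml", "<Types/>")
        zf.writestr("word/document.xml", "<w:document/>")
        zf.writestr("word/styles.xml", "<w:styles/>")
    return buf.getvalue()


def list_parts(docx_bytes: bytes) -> list:
    """Return the names of all sub-files stored in the archive."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as zf:
        return zf.namelist()


parts = list_parts(make_demo_docx())
print(parts)
```

The part names always use forward slashes inside the archive, whatever the operating system shows in its file browser.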
- One kind of problem with the document.xml sub-file is that one or a few tags get out of order. XML is purposefully designed to be intolerant of errors (in contrast to the related HTML, XML was designed not to leave the same room for differences of interpretation). Once there is an error like this, Word will refuse to open the file and, on a second pass, will usually also refuse to extract the text without formatting, even though it is sometimes capable of doing so. This usually occurs because the document's author has unknowingly placed a field inside another field and then edited it, for instance a text box placed inside a math equation. Here is some further information regarding at least Math tag order issues.
|An example of invalid and valid math tag sequences. The invalid one is on the left, the valid one is on the right.|
- In addition to tag mistakes made by Word, there is a much more serious type of corruption where full or almost full recovery is not possible, and the best that can be done is to truncate the document.xml file and reconstruct legitimate ending tags for the sub-file. The document.xml sub-file is usually the largest one in the zip/docx, so when a power outage occurs while the file is being written, say to an external USB memory stick, text/data will be lost and the Word file won't open. Salvaging what still exists of the file involves manually or programmatically finding the last good tags, truncating the file there and reconstructing good XML completion tags for what remains. You can get more information regarding this issue as well as some more on tag order issues here.
- A command line program called xmllint, using the --recover option, can sometimes in one operation find the last good XML, truncate the file and add appropriate tags to the end so that it validates as well-formed XML. However, I have found that this is not always reliable: despite the supposedly common interpretation of what counts as valid XML, what xmllint accepts as good XML is not necessarily what Microsoft Word will accept. Instead, what is often necessary is to load the document.xml file in a text editor, find the error, backtrack to where the tag began, truncate the rest and only then use xmllint to finish the file with good XML. In the rest of the article I will describe how I think it is best to do this manual type of repair.
- One last point: in addition to the conventional causes of Word and other file corruption, like program bugs, power outages, viruses, and overheating computers, there unfortunately exists a pernicious problem that I recently became aware of: "fake memory." Fake memory is USB or flash memory that reports one size to Windows, say 16GB, while in reality the device has a much smaller amount of memory available, say 2GB. These pieces of memory are labeled with the major brands but sold at cut-rate prices.
- The problem comes when the memory's real limit has been reached. At this point, it either throws out the part of the data being written that pushed it over the top, or it overwrites the oldest files with the new data. Either way the result is corrupt files. There are some estimates that 1/3 of the memory sold under some brands is bad in this way. You can test your memory for this issue and find out more information here.
|What the two kinds of fake memory do to data when they reach capacity.|
- A Warning: MAKE COPIES OF YOUR FILE, DO NOT WORK ON THE ORIGINAL IN CASE THE TREATMENTS BELOW MAKE YOUR FILE'S CORRUPTION WORSE!
- First you will want to know whether you have a tag order issue or whether your document.xml or styles.xml (those are pretty much the only two files which will prevent a DOCX from opening) is truncated and will need an amputation of some kind.
- I believe that most of the time the quickest way to determine what is needed is to see if the zip file structure remains intact. A DOCX file that produces an error when Word tries to open it, but which still has an intact zip structure, is most likely one which will need tag reordering or a tag set removed. Luckily these issues can often be fixed with one of these free programs: the Microsoft Mr. Fixit for Math Tags, Tony Jollans Word Add-In (maybe also just for math tags), Word Corrupt Document Checker or my program: Savvy DOCX Recovery. Additionally, these two threads on Microsoft Support Forums have experts that will look at your corrupt DOCX files (with tag order or even XML file truncation needs) and sometimes give timely free help: here and here. I hang out on the first one and give out free help too sometimes.
|Word Corruption Checker is on the left and Savvy DOCX Recovery's main interface is on the right.|
- To test if the zip structure is still intact, either change the extension from .docx to .zip or simply add .zip after the existing extension, and then try to unzip the file in Windows. Note: if you can't see an extension on your file, right click your Word document and choose Properties. Then up top on the first General tab, simply add .zip to the file name.
- If you get a message from Windows that the zip structure is corrupt, then your issue is probably not a tag order one and you will need to proceed with zip file repair in step 2. The repairs below can be done even with the .docx extension, so after your unzipping test, change the file back to a .docx extension or remove .zip again.
|Adding the ".zip" extension to the file name of a DOCX document.|
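The same intactness test can be scripted instead of renaming the file. The following is a minimal sketch using Python's standard `zipfile` module; the function name `zip_is_intact` and the in-memory demo archive are illustrative assumptions, and Python's verdict may not always match what Windows or Word reports.

```python
import io
import zipfile
import zlib


def zip_is_intact(source) -> bool:
    """Return True if the archive opens and every member passes its CRC check.

    `source` may be a path to a .docx file or a binary file-like object.
    """
    try:
        with zipfile.ZipFile(source) as zf:
            # testzip() returns the name of the first bad member, or None
            return zf.testzip() is None
    except (zipfile.BadZipFile, zlib.error, EOFError):
        return False


# Demo with an in-memory archive and a copy cut off halfway through,
# roughly what an interrupted write leaves behind.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("word/document.xml", "<w:document>" + "text " * 200 + "</w:document>")
good = buf.getvalue()
truncated = good[: len(good) // 2]

good_ok = zip_is_intact(io.BytesIO(good))
truncated_ok = zip_is_intact(io.BytesIO(truncated))
print(good_ok, truncated_ok)
```

For a file on disk you would call it as `zip_is_intact("copy_of_my_file.docx")`, always on a copy.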
- Next, repair the zip structure of the file with the -FF command of InfoZip's command line program Zip.exe (you can launch a command prompt from any folder from the Explorer File Menu in Windows 10, and I believe in Windows 8 and 8.1 too).
- For a command line, if for example your file was called gwn.docx, you might do:
|Hopefully you will see results similar to the above when using the -FF command of InfoZip's command line Zip.exe.|
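As a sketch of what such a repair call could look like, the snippet below assembles the argument list in Python. It assumes Info-ZIP Zip 3.0, where the fixed archive is written to a separate file with `--out`; the output name gwn_fixed.docx and the helper name are hypothetical.

```python
def zip_fix_command(damaged: str, repaired: str) -> list:
    """Build an Info-ZIP command line that rescans a damaged archive.

    -FF tells zip to scan the whole file for entries instead of trusting
    the (possibly broken) central directory; --out names the repaired copy.
    """
    return ["zip", "-FF", damaged, "--out", repaired]


cmd = zip_fix_command("gwn.docx", "gwn_fixed.docx")
print(" ".join(cmd))  # the line you would type at the command prompt
```

The list can also be handed to `subprocess.run(cmd)`, though zip may ask an interactive question about single-disk archives, so running it by hand in a command prompt is the simpler route.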
- Try to open your zip repaired DOCX in Word. You might get lucky that this repair is enough to get the file open.
- If you do get an error and are prompted with a second message that says: "Word found unreadable content in [name of your file]. Do you want to recover the contents of this document? If you trust the source of this document, click Yes", say Yes to try to salvage the text.
|Text content recovery notice mentioned in Step 2.|
- If Word next tells you it can't open the file because of an XML parsing error, click on the Details button and carefully make note of which XML file and what line and column the program is choking on. For instance, the message below indicates that at least the first bit of file corruption is on line 2 of the document.xml file at the 31,589th character.
|More details notice also mentioned in Step 2. The error location is line 2, column 31589, which just means the 31589th character in the second line of the XML file.|
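If you would rather find the error location without going through Word, a parser can report the same kind of line and column. This is a minimal sketch using Python's built-in expat parser; the function name `first_xml_error` is an assumption for illustration, and since expat is not the parser Word uses, the position and wording it reports may differ somewhat from Word's Details window.

```python
import xml.parsers.expat


def first_xml_error(xml_bytes: bytes):
    """Return (line, column, message) for the first well-formedness error,
    or None if the data parses cleanly. Line and column are 1-based."""
    parser = xml.parsers.expat.ParserCreate()
    try:
        parser.Parse(xml_bytes, True)
    except xml.parsers.expat.ExpatError as err:
        # err.offset is a 0-based column, so add 1 to match editor numbering
        return err.lineno, err.offset + 1, xml.parsers.expat.ErrorString(err.code)
    return None


# A small damaged sample: <c> is never closed before </a> appears.
sample = b'<?xml version="1.0"?>\n<a><b>text</b><c>broken</a>'
location = first_xml_error(sample)
print(location)

clean = first_xml_error(b"<a><b>text</b></a>")
print(clean)
```

With an extracted file you would pass `open("word/document.xml", "rb").read()` to the function.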
- If no XML file is referenced in this Details window, usually you won’t be able to recover anything. However give my open source software, Savvy Corrupt Word DOCX File Recovery, mentioned before, a chance, as it occasionally fixes those. By the way in addition to tag order issues it tries to programmatically reproduce the manual steps in this article for files that need truncation as well. The program is a bit of a work in progress though and manual repair is sometimes needed.
- Next, extract your zip repaired DOCX file with the x command of 7zip's command line program 7z.exe. A possible hang-up here is that the name of the folder you wish to extract your zipped DOCX into must follow the -o parameter directly, with no space in between.
- So a command might be:
- Note, I found 7z.exe more effective than InfoZip's unzip.exe for recovering XML files from partially corrupt Office Open files and even tried multiple other command line unzip programs, but found 7z.exe was the best for extracting the most data from damaged files. Note also, if any file other than document.xml is referenced in the error message details of the second attempt to open your DOCX file in Word, for instance word/styles.xml, a shortcut to get a quick recovery result is simply to remove that XML file from your 7z.exe output and then go to Step 9 and rezip the contents. Word will generate a dummy styles.xml file and possibly other missing files on its own with apparently little harm to the document other than perhaps a loss of formatting that can be easily recreated.
|Hopefully what you'll see in the command line results of using 7z.exe on a repaired docx file. The program knows there is still a problem with document.xml but it extracted it anyway.|
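A sketch of the extraction call, again assembled in Python so the no-space rule for `-o` is explicit. The archive and folder names are hypothetical, and it assumes 7z.exe is reachable from the folder you are working in.

```python
def sevenzip_extract_command(archive: str, out_dir: str) -> list:
    """Build a 7z command line that extracts with full paths.

    The output folder is glued directly onto -o: "-oFolder", never "-o Folder".
    """
    return ["7z", "x", archive, "-o" + out_dir]


cmd = sevenzip_extract_command("gwn_fixed.docx", "gwn_extracted")
print(" ".join(cmd))
```

If you run it through `subprocess.run(cmd, check=False)`, leaving `check` off matters: 7z returns a non-zero exit code when it notices damage, even though it may still have extracted usable files.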
- Now in your extraction folders, find the XML file referenced in the more detail error in step 2. If it is the word/styles.xml file, you can first try to repair it by following what is outlined below and as mentioned, if it doesn't work you can always delete it.
- Open the XML file with the freeware Notepad++ or some other XML or programmer geared text editor that gives XML line and column numbers. Locate the error referenced and then remove the rest of the XML from the error to the end of the file. A lot of the time, what you'll see after the error is more of your text, or repeated portions of it, mixed in with all kinds of broken tags. Of course, if you are knowledgeable about XML, feel free to try to correct tags where they start going bad. That way you may be able to recover more of your text. However, I have found this can quickly become tedious and unrewarding, and your time is better spent simply recreating the content that you are unable to recover and open in Word with this amputation method.
- Since most of the document.xml file is stored on one line of XML, usually line 2, a possibly easier way to find where your XML error starts is to pretty print the code. This means putting each XML tag (similar to HTML tags) on its own line. Once you then rezip the sub-files and change the zip file extension to docx, the error that Word reports will be on a line other than line 2, say line 749. Pretty printing XML does not affect the content or formatting of your Word file as displayed in Word (or printed out, for that matter).
|More useful error detail after document.xml was pretty printed and all the sub-files rezipped into a DOCX.|
- You can pretty print your XML by copying all of it from your text editor and pasting it here: http://www.cleancss.com/xml-beautify/ and pressing the Format Code button.
|The image on the left is the document.xml before beautifying/pretty printing. The image on the right is afterwards.|
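As an offline alternative to a web formatter, a very rough line-splitter can be written in a few lines of Python. This is only a sketch under stated assumptions: it is not a real pretty printer (no indentation), it works on plain text so it does not care that the XML is damaged, and it only starts a new line before opening tags so that no whitespace is added inside text elements such as an empty `<w:t></w:t>`.

```python
import re


def break_before_open_tags(xml_text: str) -> str:
    """Insert a newline wherever one tag is immediately followed by an
    opening tag, so each element starts on its own line."""
    # ">" followed by "<" and anything except "/" means an opening tag follows
    return re.sub(r">(?=<[^/])", ">\n", xml_text)


sample = (
    "<w:p><w:r><w:t>Hi</w:t></w:r></w:p>"
    "<w:p><w:r><w:t></w:t></w:r></w:p>"
)
pretty = break_before_open_tags(sample)
print(pretty)
```

Removing the added newlines gives back the original text exactly, which is an easy sanity check before you rezip.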
- After pretty printing your XML sub-file and copying it back to your text editor (and saving the change to the file!), rezip all the sub-files originally found in your corrupt file back into a DOCX as described in step 9. When you try to open the file again, Microsoft Word will now give you a more informative line number for the error, which is easier to use.
- Note also that once you locate the error, you will want to go a little upstream from where it is indicated, so that the file ends only in a complete tag, text, formula or data. That probably means getting rid of the whole line where the error is indicated and everything beyond it, even though Word will say that the error begins at, for example, the 10th column (the 10th character) of line 749. Word is reporting the point at which it knows something is wrong, but the real error begins at the start of the line, which is the start of the tag. So it is better to get rid of everything from the start of that line onward, so you don't go into the next step with an open tag.
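The "cut from the start of the error line" rule can be sketched as a small Python helper. It assumes the XML has already been pretty printed so that each tag starts its own line, and the function name `drop_from_line` is hypothetical.

```python
def drop_from_line(xml_text: str, error_line: int) -> str:
    """Keep everything before the given 1-based line and drop the rest,
    including the whole line on which the error was reported."""
    offset = 0
    for _ in range(error_line - 1):
        newline_at = xml_text.find("\n", offset)
        if newline_at == -1:
            return xml_text  # fewer lines than expected: leave the text alone
        offset = newline_at + 1
    return xml_text[:offset]


# Word (or another parser) reported the error somewhere on line 3.
sample = "<a>\n<b>ok</b>\n<c>bro<ken\n<d>more</d>\n"
kept = drop_from_line(sample, 3)
print(repr(kept))
```

What remains ends in a complete tag but still has unclosed elements, which is what the next step is meant to finish off.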
- Now use the xmllint --recover command on the truncated XML file.
- An example command might be:
- A fast way to get xmllint may be to download it from here. Otherwise, you can install xmllint and its dll support files by installing the free Strawberry Perl. Xmllint usually gets installed into the C:\strawberry\c\bin directory, from where you can copy it along with the support files libiconv-2_.dll, libXML2-2_.dll and libz_.dll to the directory where you are trying to repair your file. However, I wouldn't bother most of the time, because I think the C:\strawberry\c\bin folder gets added to the Path environment variable (in the Control Panel's Windows System app) during Strawberry Perl installation, after which you can use an xmllint command from any folder. There are other ways to install xmllint which you can find by Googling. Here's one that might be good, for instance: http://flowingmotion.jojordan.org/2011/10/08/3-steps-to-download-xmllint/.
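A sketch of what the recovery call could look like, assembled in Python. The file names are hypothetical; `--recover` and `--output` are standard xmllint options, and writing to a new file keeps the truncated input available in case the result is not what Word wants.

```python
def xmllint_recover_command(truncated_xml: str, repaired_xml: str) -> list:
    """Build an xmllint command line that closes any open tags in a
    truncated file and saves the result under a new name."""
    return ["xmllint", "--recover", "--output", repaired_xml, truncated_xml]


cmd = xmllint_recover_command("document.xml", "document_fixed.xml")
print(" ".join(cmd))
```

After it runs, the repaired file would need to be renamed back to document.xml inside the word folder before rezipping.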
- So now we can rezip all our XML files and change the extension back to DOCX. Rezip all the files within the folder you made with 7z.exe extraction in step 5., but don't zip the larger folder itself, just the contents. Change the extension of the rezipped contents to DOCX and try to open the file in Word. If you still get an error, it may be time to try to contact me at firstname.lastname@example.org or post your issue to one of the forums, which I'm listing here and here again, mentioned in the introduction.
|Rezip all the files and folders at the root here, but don't go up one level and zip up the one folder that contains these files and folders, as Word will not be able to make sense of such a zip.|
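Rezipping the contents rather than the containing folder is easy to get wrong by hand, so here is a minimal Python sketch of it. The function name and the demo folder layout are assumptions for illustration; the output file must live outside the folder being zipped.

```python
import os
import tempfile
import zipfile


def rezip_folder_contents(folder: str, docx_path: str) -> None:
    """Zip everything inside `folder` so that its files and sub-folders sit
    at the root of the archive (the folder itself is not included)."""
    with zipfile.ZipFile(docx_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for root, _dirs, files in os.walk(folder):
            for name in files:
                full = os.path.join(root, name)
                # Path relative to the extraction folder, with forward slashes
                arcname = os.path.relpath(full, folder).replace(os.sep, "/")
                zf.write(full, arcname)


# Demo on a throwaway folder that mimics an extracted DOCX.
with tempfile.TemporaryDirectory() as work:
    extracted = os.path.join(work, "gwn_extracted")
    os.makedirs(os.path.join(extracted, "word"))
    with open(os.path.join(extracted, "[Content_Types].xml"), "w") as fh:
        fh.write("<Types/>")
    with open(os.path.join(extracted, "word", "document.xml"), "w") as fh:
        fh.write("<w:document/>")

    rebuilt = os.path.join(work, "gwn_rebuilt.docx")
    rezip_folder_contents(extracted, rebuilt)
    with zipfile.ZipFile(rebuilt) as zf:
        names = sorted(zf.namelist())

print(names)
```

Note that no entry starts with the extraction folder's own name, which is the mistake the caption above warns about.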