Sunday, June 08, 2014

Possible Solution to Recovering Binary Files After Opening and Then Saving the Raw Binary Code in NotePad or Another Text Editor (That Alters Binary Code)

I sometimes receive files from clients for my manual repair service through my online form. I charge $22 for the service. Currently the site says it's $22 if I succeed; however, I very rarely do, because I'm the last resort. I should really charge whether I succeed or not, and I think I did charge a customer or two recently by accident. All the other commercial services charge for manual repair, and charge more.

Anyway, I received a corrupted file that seemingly had a lot of zip structure left: the file had many instances of the text "PK" with XML file names nearby, which are the indicators of the individually zipped subfiles that make up the larger zip structure that constitutes a DOCX file. The only problem was that the various parts of the code were missing all the null characters that usually surround the PK marker and the subfile name. (PK is a leftover designation from the first zip program, PKZip, and is, I believe, the initials of the zip file format's developer, Phil Katz.) See the two screenshots below.

How the file looked.
How the file should have looked.
So anyway, I suspected the file had been corrupted to begin with, and that someone had then opened it in NotePad or Word as raw binary in an attempt to fix the corruption, and saved it from within NotePad or some text editor other than NotePad++. I later did an experiment, and indeed NotePad replaces the null characters with spaces.
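
For anyone who wants to see the effect without risking a real document, here is a minimal sketch in Python. It assumes the only change the text editor makes is turning every 00 byte into a 20 byte, which is what the experiment above suggests; a real editor may well do more. The function names and the tiny stand-in archive are made up for illustration.

```python
import io
import zipfile


def notepad_like_damage(data: bytes) -> bytes:
    """Assumption: the editor's only change is 0x00 -> 0x20."""
    return data.replace(b"\x00", b"\x20")


def marker_summary(data: bytes) -> dict:
    """Count the 'PK' markers, null bytes, and space bytes in a blob."""
    return {
        "pk": data.count(b"PK"),
        "nulls": data.count(b"\x00"),
        "spaces": data.count(b"\x20"),
    }


# Build a tiny zip in memory as a stand-in for a DOCX (not a valid Word file).
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as z:
    z.writestr("word/document.xml", "<w:document/>")

healthy = buf.getvalue()
damaged = notepad_like_damage(healthy)

print("healthy:", marker_summary(healthy))
print("damaged:", marker_summary(damaged))
```

The "PK" markers and the subfile names survive, which is why the damaged file still looks like a zip at a glance, but every null in the headers is gone.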

So my question was: is there a way to reliably reverse this? I did find a possible answer in a program I reviewed earlier in this blog called fixgz. fixgz.exe, or its Linux version, can take a gzip file that has been transferred via FTP as text instead of in the proper binary mode and fix the file. I had high hopes that maybe this was the cause of the corruption. However, the program didn't work for me, so clearly either fixgz only works on gzip files, not ordinary zip ones, or transferring the DOCX via FTP in text mode was not the issue.

I then found this interesting post: So this gave me the idea that I really should open the file in a hex viewer and try to find instances of 0d 0a hex byte pairs, which indicate Windows line returns, possibly added as if the file were text rather than binary. This may really be wrongheaded, and 0d 0a hex byte pairs may be the way all Windows files, both binary and text, indicate line returns. However, that notion didn't stop me from trying.
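
The same search can be done without a hex viewer. This is a small Python sketch that lists where 0d 0a pairs occur in a block of bytes; the function name and the `limit` parameter are just illustrative choices.

```python
def crlf_offsets(data: bytes, limit: int = 10) -> list:
    """Return the offsets of the first `limit` 0d 0a byte pairs in data."""
    offsets = []
    pos = data.find(b"\x0d\x0a")
    while pos != -1 and len(offsets) < limit:
        offsets.append(pos)
        pos = data.find(b"\x0d\x0a", pos + 2)
    return offsets


# A text-like sample has pairs; a bare zip signature has none.
print(crlf_offsets(b"line one\r\nline two\r\n"))
print(crlf_offsets(b"PK\x03\x04"))
```

To check a real file, read it in binary mode first (for example with `open(path, "rb").read()`) and pass the bytes in.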

Anyway, here are screenshots of the corrupted file and a good one in the HxD editor:

Corrupted file in hex editor.
Good file in hex editor.
My little notion quickly fell to the ground when I was not able to find a single instance of a 0d 0a hex byte pair in either my corrupt file or a healthy DOCX. So then I decided, well, maybe I could fix the file myself by looking at the difference between a healthy file and one that had apparently been opened and saved in NotePad. Indeed, it was pretty obvious from looking at the two files that all the 00 hex "null" characters had been replaced by 20 hex "blank space" characters. I was pretty sure that there were other substitutions, or that the 20 character had been substituted for several different characters, in which case the file would never be fully recoverable. Still, it seemed worth trying to replace all the 20 hex characters with 00 and then repair the zip structure. The latter is usually the first step in recovering a corrupt DOCX file.
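
A hex editor's search and replace does this fine, but here is a rough Python sketch of the same substitution for anyone who prefers a script. The function names are made up, and it always writes to a separate output file so the original stays untouched. Note that it cannot tell a space that used to be a null from a space that was always a space, so genuine 20 bytes get turned into 00 as well.

```python
from pathlib import Path


def byte_share(data: bytes, value: int) -> float:
    """Fraction of the bytes in data that equal `value` (0..255)."""
    return data.count(bytes([value])) / len(data) if data else 0.0


def swap_spaces_for_nulls(src: str, dst: str) -> int:
    """Copy src to dst with every 0x20 byte replaced by 0x00.

    Returns the number of bytes that were changed.
    """
    data = Path(src).read_bytes()
    Path(dst).write_bytes(data.replace(b"\x20", b"\x00"))
    return data.count(b"\x20")
```

Comparing `byte_share(data, 0x00)` and `byte_share(data, 0x20)` for a healthy file and a suspect one is a quick way to spot this kind of damage: in the suspect file the share of 00 bytes drops to zero.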

So that's what I did. See below for my screenshots of the original file in ZipRepair Pro and of the file with the 20 characters replaced by 00 characters. The substituted file is the second screenshot.
With the original, ZipRepair Pro wants to skip all the files because it can't recover any.

In contrast, ZipRepair Pro thinks it can recover some of the subfiles in the 00 character substituted version of the target file.
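
I don't know how ZipRepair Pro works internally, but the general idea behind this kind of salvage can be sketched in Python: scan for the local file header signature (PK followed by the bytes 03 04), read the subfile name from the header, and try to inflate whatever follows. This is only a rough illustration under those assumptions, with a made-up function name, and it ignores encryption, ZIP64, and checksum verification.

```python
import struct
import zlib

LOCAL_SIG = b"PK\x03\x04"
# Fixed 30-byte local file header: signature, version, flags, method,
# mod time, mod date, CRC-32, compressed size, uncompressed size,
# file name length, extra field length.
HEADER = struct.Struct("<4sHHHHHIIIHH")


def salvage_entries(data: bytes) -> dict:
    """Return {subfile name: bytes} for every entry that can be read."""
    recovered = {}
    pos = data.find(LOCAL_SIG)
    while pos != -1:
        if pos + HEADER.size <= len(data):
            fields = HEADER.unpack_from(data, pos)
            method, csize = fields[3], fields[7]
            name_len, extra_len = fields[9], fields[10]
            name_start = pos + HEADER.size
            body_start = name_start + name_len + extra_len
            raw_name = data[name_start:name_start + name_len]
            name = raw_name.decode("utf-8", errors="replace")
            body = None
            try:
                if method == 8:
                    # Raw deflate stops at its own end-of-stream marker, so
                    # the size fields (often zero in DOCX files) are not needed.
                    inflater = zlib.decompressobj(-zlib.MAX_WBITS)
                    body = inflater.decompress(data[body_start:])
                elif method == 0:
                    # Stored without compression: trust the size field.
                    body = data[body_start:body_start + csize]
            except zlib.error:
                body = None  # damaged stream, skip this entry
            if body:
                recovered[name] = body
        pos = data.find(LOCAL_SIG, pos + len(LOCAL_SIG))
    return recovered
```

Because it never looks at the central directory at the end of the archive, a sketch like this can still pull out the leading subfiles of a file whose tail is missing.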

So ZipRepair Pro actually was able to recover some of the XML subfiles from the 00 character substituted file, including the all-important word/document.xml, where all the text is stored in a normal DOCX file. I then tried to rezip these files and open the zip-repaired version in Word; however, it was missing too many parts and Word refused, even with the "Open and Repair" routine. I then made a blank DOCX file and replaced its parts with the ones I had recovered. Again, however, I came up empty.
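
The part swapping can be done by hand in any zip tool, but here is a minimal Python sketch of the same idea using the standard zipfile module. The function name is made up. It only overwrites parts that already exist in the blank file; adding extra parts would also mean editing the content types and relationship files, which this sketch does not attempt.

```python
import zipfile


def graft_parts(blank_docx, recovered: dict, out_docx) -> list:
    """Copy blank_docx to out_docx, swapping in recovered parts by name.

    `recovered` maps part names such as "word/document.xml" to bytes.
    Returns the names of the parts that were replaced.
    """
    replaced = []
    with zipfile.ZipFile(blank_docx) as src, \
            zipfile.ZipFile(out_docx, "w", zipfile.ZIP_DEFLATED) as dst:
        for name in src.namelist():
            if name in recovered:
                dst.writestr(name, recovered[name])
                replaced.append(name)
            else:
                dst.writestr(name, src.read(name))
    return replaced
```

Both `blank_docx` and `out_docx` can be file paths or open binary file objects. As the paragraph above says, Word may still refuse the result if the recovered document.xml is itself truncated or damaged.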

However, now I could look at the document.xml file and use the trick of changing the extension to html to see the text. Clearly there was some text recoverable, but not very much. This is typical of DOCX file recoveries. Users often save files to external drives, and the write operation gets interrupted when the drive is yanked out without being "ejected" safely or the computer is turned off first. The document.xml file is often the largest subfile by far, so it is often the one that gets corrupted in an interrupted write.
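
As an alternative to the html renaming trick, the text can be pulled out with a short script. This Python sketch uses a regular expression rather than an XML parser, because a truncated document.xml is usually not well-formed and a parser would reject it. It assumes the usual `w:` prefix that Word writes for its text elements; the function name is made up.

```python
import html
import re

# Matches <w:t>...</w:t> and <w:t xml:space="preserve">...</w:t>,
# but not other tags that merely start with "w:t" such as <w:tab/>.
_TEXT_RUN = re.compile(r"<w:t(?:\s[^>]*)?>([^<]*)</w:t>")


def extract_paragraphs(xml_bytes: bytes) -> list:
    """Return the text of each paragraph found in a (possibly cut-off) document.xml."""
    xml = xml_bytes.decode("utf-8", errors="replace")
    paragraphs = []
    for chunk in xml.split("</w:p>"):
        text = "".join(_TEXT_RUN.findall(chunk))
        if text:
            paragraphs.append(html.unescape(text))
    return paragraphs
```

Text inside a run that was cut off before its closing tag is dropped, so the result is a lower bound on what survived.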

I was, however, determined now to get this file opened as a full Word file, even though I knew that it wasn't really worth it for so little text. So I fired up my new version of Savvy DOCX Recovery, which I haven't released yet, and lo and behold, my Frankenstein combination of the blank file with the recoverable parts opened up, and there was some formatting to be had to boot. I blurred out the text and file name to protect the privacy of my client.

I'm not sure whether the currently released version of Savvy DOCX Recovery would also do the trick, but clearly I have stumbled on a new-to-me method for recovering corrupt DOCX files. I will incorporate this in my new version of Savvy DOCX Recovery. As usual, the program will first attempt to repair the zip structure. If this is unsuccessful, which might be a sign that the file was opened and saved in NotePad, it will then try replacing all instances of the hex character 20 with the 00 character.
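
The order of operations described above can be sketched in a few lines of Python. This is not the code in Savvy DOCX Recovery; `repair_zip` is a hypothetical stand-in for whatever zip repair routine is available, assumed to take the file's bytes and return a dictionary of recovered subfiles (empty on failure).

```python
from typing import Callable, Dict, Tuple


def recover_parts(
    data: bytes,
    repair_zip: Callable[[bytes], Dict[str, bytes]],
) -> Tuple[Dict[str, bytes], bool]:
    """Try a normal zip repair first, then retry with 0x20 swapped to 0x00.

    Returns (recovered parts, whether the substitution was needed).
    """
    parts = repair_zip(data)
    if parts:
        return parts, False
    # Nothing recovered: possibly a NotePad-style save, so undo it and retry.
    swapped = data.replace(b"\x20", b"\x00")
    return repair_zip(swapped), True
```

Trying the untouched bytes first matters, because the substitution destroys every genuine space in a file that was never opened in a text editor.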

I may have been correct that some of the subfiles were unrecoverable because the NotePad treatment had also replaced other characters with character 20. However, it appears to be enough to reverse the treatment by replacing all the hex 20 characters with 00, copying the recovered parts to a blank file, and then recovering from the original corruption by running a respectable DOCX recovery program against the substituted hybrid file.
