Saturday, February 28, 2009

Extract Text from Corrupt DOCX, XLSX, and PPTX Office 2007 Files

An earlier post described how to extract text from corrupt Word 2007
files when Word 2007 itself refuses to recover it. This post takes a
broader view of Office 2007 files, including Excel and PowerPoint 2007
files, and shows how to extract the strings from them when file
corruption occurs and the particular Office program balks at
extracting the text. This Office 2007 "last resort" repair failure is
probably due to the file corruption causing malformed XML in the XML
files where the text strings are stored. Luckily, the open source
program Tidy HTML will extract text from malformed XML files.
Office 2007 files are compound zip files of XML and images or other
multimedia. The text is in the document.xml part of a Word 2007
compound file, in the sharedStrings.xml part of an Excel 2007 compound
file, and in the numbered slide XML parts of a PowerPoint 2007
compound file, such as slide1.xml, slide2.xml, or slide3.xml.
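
As a rough illustration of where those parts live, here is a minimal Python sketch that lists the text-bearing parts of a package. The function names are made up for this example, and the folder paths (word/, xl/, ppt/slides/) assume the standard Office Open XML package layout; it also assumes the zip is readable, that is, already repaired if it was damaged.

```python
import re
import zipfile

# Assumed part locations in a standard Office 2007 package.
TEXT_PART_PATTERNS = {
    ".docx": re.compile(r"^word/document\.xml$"),
    ".xlsx": re.compile(r"^xl/sharedStrings\.xml$"),
    ".pptx": re.compile(r"^ppt/slides/slide\d+\.xml$"),
}


def _natural_key(name):
    # Sort slide2.xml before slide10.xml by comparing digit runs as numbers.
    return [int(tok) if tok.isdigit() else tok for tok in re.split(r"(\d+)", name)]


def text_parts(member_names, extension):
    """Return the member names that hold the document text."""
    pattern = TEXT_PART_PATTERNS[extension.lower()]
    return sorted((n for n in member_names if pattern.match(n)), key=_natural_key)


def list_text_parts(path_or_file, extension):
    """Open a (readable) package and list its text-bearing parts."""
    with zipfile.ZipFile(path_or_file) as package:
        return text_parts(package.namelist(), extension)
```

For example, list_text_parts("deck.pptx", ".pptx") would return the slide parts in slide order, where "deck.pptx" is just a placeholder file name.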
To extract the strings from these files even if they are corrupt, you
can first repair the zip structure of the docx, xlsx, or pptx file,
then extract just the XML files mentioned in the second paragraph, and
then feed those XML files through Tidy HTML. Tidy HTML will make nice
Web pages of the extracted text even if the original XML is corrupt
and no longer well formed. The text will appear as one large text
block, but that is preferable to retyping. Some malformed XML will be
returned in the section of text where the XML was malformed, but with
a little effort even this can be separated from the surrounding actual
text.
One can easily make a web service that does those three steps: 1.
Repair the zip. 2. Extract the relevant text-containing XML file. 3.
Run Tidy HTML to extract the text even if the file is no longer well
formed XML, then return the Tidy HTML results as a Web page with the
text in one block.
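
The core of such a service might look something like the following Python sketch. It is not Tidy HTML: it swaps in a plain regular expression that drops anything shaped like a tag, which is a much cruder form of tolerance. It also covers only steps 2 and 3 and assumes the zip has already been repaired by a separate tool, since Python's zipfile module reads archives but does not repair them. The function names and part paths are assumptions made for this example.

```python
import html
import io
import re
import zipfile

# Assumed locations of the text-bearing parts in an Office 2007 package.
TEXT_PART = re.compile(r"^(word/document|xl/sharedStrings|ppt/slides/slide\d+)\.xml$")
TAG = re.compile(r"<[^<>]*>")
# End of a Word paragraph, a slide paragraph, or an Excel shared string.
BLOCK_END = re.compile(r"</(?:w:p|a:p|si)>")


def strip_markup(xml_text):
    """Drop anything that looks like a complete tag and decode entities.

    Works on malformed XML too; a tag cut off mid-way is left in the output.
    """
    xml_text = BLOCK_END.sub("\n", xml_text)
    return html.unescape(TAG.sub("", xml_text)).strip()


def office_text_as_html(package_bytes):
    """Steps 2 and 3: read the text parts and return one simple Web page."""
    blocks = []
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as package:
        for name in package.namelist():
            if TEXT_PART.match(name):
                raw = package.read(name).decode("utf-8", errors="replace")
                blocks.append(strip_markup(raw))
    body = html.escape("\n".join(blocks))
    return "<html><body><pre>" + body + "</pre></body></html>"
```

As with the Tidy HTML approach, leftover fragments of broken markup can show up in the damaged section of the text and have to be removed by hand.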

Monday, February 23, 2009

Good Outlook Express Info - "dbt
files are the temporary file created for dbx files. If for due to any
reason Microsoft Outlook Express crashes and you cannot find a
required dbx file or recover an important email from the dbx file,
then try and locate dbt files on your local hard disk using Windows
Search, rename their extensions to .dbx."

Thursday, February 19, 2009

Possibly Extract Text from Corrupt docx Files With This Freeware When Word Won't

PFCEx - AOL Personal Filing Cabinet info Extractor - "Simple tool to extract
favorites / URLs from AOL Personal Filing Cabinet (PFC) files. It
generate a PFCEx.html file with all the URLs collected. Works also on
corrupted PFC files, so it could be used as a rescue utility to
recover favorites and other URLs!"

[Data Recovery Freeware] BitmapRip - Bitmap Ripper - "Simple tool to extract
embedded bitmap (JPEG, PNG, GIF) files from a given file. It search
for bitmap's headers / signatures, and create a new file for every
data block that "seems" a valid image." You can extract images from,
say, a PowerPoint file or a camera memory image file.
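
To give an idea of what scanning for headers / signatures means, here is a small Python sketch of the general technique. It is only an illustration and not BitmapRip's actual method, which is not documented here. The signatures are the well-known leading bytes of JPEG, PNG, and GIF files; the function names and the crude rule that each candidate runs up to the next signature are assumptions of this example.

```python
# Well-known leading bytes ("magic numbers") of the three image formats.
SIGNATURES = [
    ("jpeg", b"\xff\xd8\xff"),
    ("png", b"\x89PNG\r\n\x1a\n"),
    ("gif", b"GIF87a"),
    ("gif", b"GIF89a"),
]


def find_signatures(data):
    """Return sorted (offset, kind) pairs for every signature hit in data."""
    hits = []
    for kind, magic in SIGNATURES:
        pos = data.find(magic)
        while pos != -1:
            hits.append((pos, kind))
            pos = data.find(magic, pos + 1)
    return sorted(hits)


def carve_candidates(data):
    """Split data into candidate images, each running to the next hit.

    This is deliberately crude: a real tool would parse each format to
    find where the image truly ends.
    """
    hits = find_signatures(data)
    ends = [pos for pos, _ in hits[1:]] + [len(data)]
    return [(kind, data[pos:end]) for (pos, kind), end in zip(hits, ends)]
```

Each candidate can then be written to its own file and opened in an image viewer to see whether it really "seems" a valid image.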

[Data Recovery Freeware] Advanced PSD Repair(APSR) is Actually Freeware - " Advanced PSD Repair(APSR) is a
powerful and free tool to recover corrupt Photoshop image(PSD, PDD)
files. It uses advanced technologies to scan the corrupt or damaged
Photoshop image files and recover your data in them as much as
possible, so to minimize the loss in file corruption.
Main Features in Advanced PSD Repair v1.4
Support to recover PSD and PDD files produced by all versions of
Adobe Photoshop.
Support to recover the image as well as the separate layers.
Support to recover pixels, dimension, color depth and the palette
of the image and layers.
Support to recover uncompressed and RLE compressed images.
Support to recover PSD image with depth of 1, 8, 16, 32 bits per
Support to recover PSD image with color mode of bitmap, grayscale,
indexed, RGB, CMYK, mutlichannel, duotone, lab.
Support to repair PSD and PDD files on corrupted medias, such as
floppy disks, Zip disks, CDROMs, etc.
Support to repair a batch of Photoshop image files.
Support integration with Windows Explorer, so you can repair a
Photoshop image file with the context menu of Windows Explorer easily.
Support to find and select the PSD and PDD files to be repaired on
the client computer.
Support drag & drop operation.
Support command line parameters.

Friday, February 13, 2009

Instructions for Recovering the Text from a Corrupt .docx Word 2007 Document

1. Change the docx extension to zip. Docx and all the new MS Office
2007 format documents are made up of many XML documents in one zip
file.
2. Repair the file with a zip repair program. Document corruption
often seems to be caused by corruption of the zip file itself. One of
the best I have found is the zip repair program that is part of the zip suite
Ccy's HaHaZip -
3. Look in the "word" folder of the zip file and extract the file
"document.xml", or extract the whole archive. The text of your original
document, though, will probably be found exclusively in document.xml.
4. Extract the XML from the document.xml file at Use
the HTML output. If you get an error message about malformed XML, you
have more work to do.
5. The best program I have found for fixing malformed XML is a
combination of Microsoft Expressions or FrontPage, and XMLShell found
at: Opening the
document.xml file in Microsoft Expressions I or II will highlight
malformed XML in yellow. What will probably happen is that a section
toward the end of what the zip repair recovered, or a whole middle
section of the XML file, will be malformed. Excise this until you no
longer see the yellow highlighting that indicates illegal formatting.
After this, open the file in XMLShell. It will immediately tell you
the first XML error it encounters. If you are lucky, it will indicate
only which XML elements are missing to properly close up the file. If
you then type the characters "</", it will start finishing the
elements with what is missing to close up the file, such as </w:rPr>,
</w:r>, </w:p>, </w:body> and </w:document>. If you are unlucky, you
may be stuck with a cryptic XML error. If you can't figure out how to
fix it, you might cut the entire section from the error to the end of
the file and turn just that section into a new XML file in FrontPage
or MS Expressions to experiment with.
6. Hopefully you now have a file in XMLShell that won't give
errors. You can check this for sure by choosing "Check well formed"
on the Tools menu of XMLShell. Once you have a well formed XML
document, try step 4 again. If that doesn't work, e-mail me the file
at I charge $22 for the text extraction.
Another possibility is to open the well formed XML in Excel, copy
the text column (usually the 10th or so column) to Word with a paste
special of just text with no formatting, and then remove the line
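
The "close up the file" part of step 5 can also be sketched in a few lines of Python. This is not how XMLShell works internally; it is just a stack-based illustration, with made-up function names, of working out which closing tags a truncated document.xml is missing. It only handles the lucky case where the file was cut off at the end and everything before the cut is intact.

```python
import re

# Matches an opening, closing, or self-closing element tag.
# Declarations such as <?xml ...?> do not match and are skipped.
TAG = re.compile(r"<(/?)([A-Za-z_][\w:.-]*)[^<>]*?(/?)>")


def missing_closing_tags(xml_text):
    """Return the closing tags for every element that is still open."""
    open_elements = []
    for match in TAG.finditer(xml_text):
        closing, name, self_closing = match.groups()
        if self_closing:
            continue
        if closing:
            # Only pop when the close matches the innermost open element.
            if open_elements and open_elements[-1] == name:
                open_elements.pop()
        else:
            open_elements.append(name)
    return "".join("</%s>" % name for name in reversed(open_elements))


def close_truncated_xml(xml_text):
    """Drop a tag that was cut off mid-way, then append the missing closers."""
    last_open = xml_text.rfind("<")
    if last_open > xml_text.rfind(">"):
        xml_text = xml_text[:last_open]
    return xml_text + missing_closing_tags(xml_text)
```

The result can then be checked for well-formedness in XMLShell or any XML parser before trying step 4 again.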

Hasleo Data Recovery FreeV3.2 - Free as in Freeware - Permanently from Hasleo Software "Hasleo Data Recovery FreeV3.2 100% Free Data Recovery Software...