Thursday, June 18, 2009

Recreating XML files from fragments

I'm working on an interesting problem right now.  Occasionally I acquire fragments of files that I would like to re-create as much as possible.  Many of these are Microsoft Word 2007 files.  Word 2007's format is XML-based, so it should be possible to parse a fragment and detect closing tags that have no matching opening tag (because the beginning of the file was cut off).
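To make the idea concrete, here is a minimal sketch of that detection step in plain Perl, with no XML module involved.  The function name orphan_closers and the sample fragment are made up for illustration, and the regex is a rough tokenizer rather than a real XML parser: it ignores comments, CDATA sections, and processing instructions, so treat it as a starting point only.

```perl
use strict;
use warnings;

# Return the names of closing tags whose opening tag is missing,
# in the order they appear in the fragment.
# (orphan_closers is a hypothetical helper, not part of any library.)
sub orphan_closers {
    my ($xml) = @_;
    my (@stack, @orphans);

    # Walk every tag-like token: optional slash, tag name, then attributes.
    while ($xml =~ m{<(/?)([A-Za-z_][\w:.-]*)((?:"[^"]*"|'[^']*'|[^>"'])*)>}g) {
        my ($closing, $name, $rest) = ($1, $2, $3);
        if ($closing) {
            if (@stack && $stack[-1] eq $name) {
                pop @stack;             # properly matched pair
            } else {
                push @orphans, $name;   # the opener was cut off
            }
        } elsif ($rest !~ m{/\s*$}) {
            push @stack, $name;         # opening tag that is not self-closing
        }
    }
    return @orphans;
}

# Made-up fragment that starts in the middle of an attribute value.
my $fragment = 'lost">tail</w:t></w:r><w:r><w:t>next</w:t></w:r></w:p></w:body>';
print join(", ", orphan_closers($fragment)), "\n";   # w:t, w:r, w:p, w:body
```

The orphans come out innermost first, which is the reverse of the order in which their opening tags would have to be restored.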

I figured that I'm probably not the first one to think about this problem, so I went trolling the intertubes for ready-made solutions.  Since Perl is my glueware language of choice, I searched until I found the following handy snippet on prlmnks.org:
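
    use strict;
    use warnings;
    use XML::LibXML;

    # Recover mode tells the parser to keep going past well-formedness errors
    my $parser = XML::LibXML->new();
    $parser->recover(1);
    my $doc = $parser->parse_file($ARGV[0]);
    print $doc->toString(1);    # 1 = indent the output
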
Very, very nice!  Now I am part of the way there.  Next, I took a pre-existing MS Word document of similar make and model, and prepended it to the fragment.  With a little manual massaging, I got the script above to parse the result, and even pretty-print it (a nice bonus).  Unfortunately, Microsoft Word still doesn't like the resultant "document."
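Some of that manual massaging could probably be scripted.  As a rough sketch (not what I actually did), the helper below takes a list of orphaned closing-tag names, innermost first, and builds the matching run of opening tags to prepend.  The name opening_prefix and the sample tag list are hypothetical; the w: namespace URI is the WordprocessingML one.  It makes no attempt to restore attributes or anything that sat above the outermost orphan.

```perl
use strict;
use warnings;

# Build a run of opening tags for a list of orphaned closers.
# Closers are seen innermost first, so the openers go in reverse order.
# $ns maps each prefix to a namespace URI, declared on the outermost opener.
# (opening_prefix is a hypothetical helper for illustration.)
sub opening_prefix {
    my ($orphans, $ns) = @_;
    my @names = reverse @$orphans;
    return '' unless @names;

    my $decls = join '', map { qq{ xmlns:$_="$ns->{$_}"} } sort keys %$ns;
    my $first = shift @names;
    return "<$first$decls>" . join('', map { "<$_>" } @names);
}

# Example input: the orphan list is assumed, not taken from a real file.
my $prefix = opening_prefix(
    [qw(w:t w:r w:p w:body)],
    { w => 'http://schemas.openxmlformats.org/wordprocessingml/2006/main' },
);
print "$prefix\n";
```

The prefix can then be concatenated with the fragment and handed to the parser's parse_string method, still in recover mode, instead of pasting in the head of another document by hand.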

I'm still working on this problem, but that's decent progress for an hour of work.
