Splitting large XML files with xml_split and sed (preserving root element and namespace declaration)
Feb 08, 2012
Slicing up XML files is best done with an XML parser. (Regular expressions, csplit, etc. are too easily confused by arbitrary strings in CDATA sections.) xml_split (available from CPAN as part of XML::Twig) mostly does the trick. Given a file like:
<?xml version="1.0" encoding="UTF-8"?>
<foo:Root xmlns:foo="http://www.foo.bar/fnarf/foo">
  <foo:child> ... </foo:child>
  <foo:child> ... </foo:child>
</foo:Root>
…xml_split can create many files, each containing:
<?xml version="1.0" encoding="UTF-8"?>
<foo:child> ... </foo:child>
However, this loses the namespace declaration and the enclosing root element. Luckily, a little sed magic can bring those back:
find . -name '*.xml' | xargs -n1 sed -e '1 a\
<foo:Root xmlns:foo="http://www.foo.bar/fnarf/foo">
' -e '$ a\
</foo:Root>
' -i ''
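find lists all the files, xargs invokes sed on them one at a time (-n1), and sed appends the opening tag with its namespace declaration after the first line (1 a) and the closing tag after the last line ($ a). The -i '' option makes sed edit each file in place, with an empty backup suffix so that no backup copy is kept. Now each file looks like this: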
<?xml version="1.0" encoding="UTF-8"?>
<foo:Root xmlns:foo="http://www.foo.bar/fnarf/foo">
  <foo:child> ... </foo:child>
</foo:Root>
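The -i '' spelling used above is the BSD/macOS form; GNU sed expects a bare -i and would treat the empty string as a file name. For anyone who needs the same result on both, here is a minimal sketch that avoids sed -i altogether. The function name wrap_in_root and the OPEN_TAG/CLOSE_TAG variables are made up for this illustration, and the sketch assumes, as in the example above, that the first line of each file is the XML declaration.

```shell
#!/bin/sh
# Hypothetical helper: put the root element back around one split file.
# OPEN_TAG and CLOSE_TAG mirror the example document; adjust for real data.
OPEN_TAG='<foo:Root xmlns:foo="http://www.foo.bar/fnarf/foo">'
CLOSE_TAG='</foo:Root>'

wrap_in_root() {
    file=$1
    [ -f "$file" ] || return 1
    tmp=$(mktemp) || return 1
    {
        sed -n '1p' "$file"          # line 1: assumed to be the XML declaration
        printf '%s\n' "$OPEN_TAG"    # opening tag with the namespace declaration
        sed '1d' "$file"             # everything after the declaration
        printf '%s\n' "$CLOSE_TAG"   # closing tag
    } > "$tmp" || { rm -f "$tmp"; return 1; }
    # Copy the contents back instead of mv, so the file keeps its permissions.
    cat "$tmp" > "$file"
    status=$?
    rm -f "$tmp"
    return "$status"
}
```

It can be called for each split file in a loop such as: for f in ./*.xml; do wrap_in_root "$f"; done. Like the sed version, it is not idempotent: running it twice on the same file wraps the content twice.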