Today I faced a task where I had to parse huge XML files. And when I say huge, I mean 6-14 GB. My weapon of choice is Python, since I’m comfortable with it. However, I had never parsed XML with it before. Because of the size of the files, it was infeasible to load an entire file into memory, and for my purposes that was not necessary either.
After Googling for a while, I found that many people recommend the ElementTree module and its C equivalent, cElementTree. The function iterparse proved to be a real lifesaver. By iterating through the element tree and deleting elements as you go, you only consume small amounts of memory. The following snippet is more or less taken from the documentation.
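Roughly, the pattern looks like this. Treat it as a minimal sketch rather than a definitive version: the namespace URI, the `row` tag, and the `count_rows` helper are made up for illustration, and it assumes the interesting elements are direct children of the root. On Python 3, `xml.etree.ElementTree` uses the C accelerator automatically when it is available, so a separate cElementTree import is not needed there.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical namespace and tag name, purely for illustration.
NS = "{http://example.com/ns}"
ROW_TAG = NS + "row"


def count_rows(source):
    """Stream through `source` (a path or file object) and count ROW_TAG elements."""
    count = 0
    # Ask for "start" events as well, so the root element can be grabbed first.
    context = iter(ET.iterparse(source, events=("start", "end")))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag == ROW_TAG:
            count += 1    # the real per-element work would go here
            root.clear()  # drop finished children so memory use stays flat
    return count


# Tiny in-memory stand-in for a multi-gigabyte file.
sample = io.BytesIO(
    b'<data xmlns="http://example.com/ns">'
    b"<row>1</row><row>2</row><row>3</row>"
    b"</data>"
)
print(count_rows(sample))
```

For a real file you would pass the path instead of the `BytesIO` object. Clearing the root, rather than only the current element, also removes the empty leftover elements that would otherwise pile up under it.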
I don’t like the way I had to specify the namespaces, though I guess there’s a better way of doing it. When running this on a 6.4 GB XML file (161 million rows, 81 million elements), the code above never consumed more than 15 MB of memory. I don’t remember how long it took, but it was reasonably fast.
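One possible way around the namespace clutter is to compare only the local part of each tag. The sketch below is just that, a sketch: `local_name` and `iter_local` are hypothetical helpers, not part of ElementTree, and ignoring the namespace is only safe if the file does not use the same tag name in two different namespaces.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Return the tag without its '{namespace}' prefix, if it has one."""
    return tag.rpartition("}")[2]


def iter_local(source, wanted):
    """Yield the text of every element whose namespace-free tag equals `wanted`."""
    for _, elem in ET.iterparse(source, events=("end",)):
        if local_name(elem.tag) == wanted:
            yield elem.text
            elem.clear()  # frees the element's content once it has been used


sample = io.BytesIO(
    b'<data xmlns="http://example.com/ns"><row>1</row><row>2</row></data>'
)
print(list(iter_local(sample, "row")))
```

Note that `elem.clear()` alone leaves an empty element attached to its parent, so for very large files it is worth combining this with clearing the root as well.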