Thursday, December 19, 2013

Combining the contents of multiple Word files with a win32com Python script

A former member of the lab used to keep his lab notebook as Word files: one file per day, for 8 years. Searching through these files was a real pain, so I decided to combine them all into a single massive file that will hopefully be easier to search and to convert to other formats.


Fortunately, the files are nicely arranged and named: a folder for each year, and within each folder Word files named by date as MM-DD-YY. Some are .doc and some are .docx.
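Because the date is encoded in each file name, the files can be put in chronological order explicitly rather than relying on the order the file system returns them. Here is a minimal sketch of that idea; the helper names `notebook_date` and `sort_by_date` are my own inventions for illustration, and it assumes every notebook file name contains a valid MM-DD-YY date.

```python
import os
import re
from datetime import datetime

# MM-DD-YY somewhere in the file name, e.g. '03-14-09.docx'
DATE_RE = re.compile(r'\d{2}-\d{2}-\d{2}')

def notebook_date(path):
    """Return the date encoded in the file name, or None if there isn't a valid one."""
    m = DATE_RE.search(os.path.basename(path))
    if m is None:
        return None
    try:
        return datetime.strptime(m.group(0), '%m-%d-%y').date()
    except ValueError:
        # looks like a date but isn't one (e.g. month 13)
        return None

def sort_by_date(paths):
    """Drop paths without a usable date and return the rest in chronological order."""
    dated = [(notebook_date(p), p) for p in paths]
    return [p for d, p in sorted(pair for pair in dated if pair[0] is not None)]
```

Calling `sort_by_date` on the collected list of paths would make the ordering independent of how the folders happen to be listed.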

I have a Windows computer with Word 2010, so I can drive Word through its COM interface using win32com.


First, I installed the win32com Python library (part of the pywin32 package). I'm sure there are better ways to do this, but the strategy I used was to open each of the notebook files one by one and copy-paste its contents into a new document. One side effect is that I couldn't use the Windows clipboard while the program was running; otherwise it would have pasted the wrong stuff into the new document (oh well, it was a good excuse to go get a milkshake).


import os
import win32com.client as win32
import re

#configuration
root_path = 'E:\\Lab Notebooks'
outname = 'combined.docx'

#open word
word = win32.gencache.EnsureDispatch('Word.Application')
word.Visible = False #do this in the background

#find all the right files
date_pattern = re.compile(r'[0-9]{2}-[0-9]{2}-[0-9]{2}')
files2 = list()
for root, dirs, names in os.walk(root_path):
    for name in names:
        # check the file name only, so a date-like folder name can't cause a false match
        if name.endswith(('.docx', '.doc')) and date_pattern.search(name):
            files2.append(os.path.join(root, name))
# in most cases, you'd probably want to sort the file names (os.walk makes no promise about order),
# but I was fortunate that the directories and files are named so that os.walk happened to
# collect them in the correct order

newdoc = word.Documents.Add()
for f in files2:
    doc = word.Documents.Open(f)
    doc.Select() #select everything
    word.Selection.Copy()
    newdoc.Activate()
    word.Selection.Paste()
    doc.Close()
    print(f) #progress report

newdoc.SaveAs(os.path.join(root_path, outname))
newdoc.Close()
word.Quit() #otherwise the hidden Word instance keeps running
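One way to sidestep the clipboard problem would be Word's `Range.InsertFile` method, which pulls a file's contents straight into a document. I have not run this against the notebook files, so treat it as a sketch under assumptions: `append_files` is a hypothetical helper name, `newdoc` is the document created by `word.Documents.Add()` in the script above, and I'm assuming the inserted formatting comes out close enough to what copy-paste produces.

```python
WD_COLLAPSE_END = 0  # numeric value of Word's wdCollapseEnd constant

def append_files(newdoc, paths):
    """Append each file to newdoc with Range.InsertFile, leaving the clipboard alone."""
    for path in paths:
        rng = newdoc.Content            # range covering the whole document
        rng.Collapse(WD_COLLAPSE_END)   # shrink it to an insertion point at the end
        rng.InsertFile(path)            # insert the file's contents at that point
```

With this, the copy-paste loop would become a single call such as `append_files(newdoc, files2)`, and the source documents would never need to be opened and selected one at a time.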

Edit: There must be a better way to do this. I ended up having to break this up into subtasks because of memory limitations, and it ran slower than seems reasonable. It did work, though, just not as slickly as I would prefer.
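For what it's worth, one natural way to split the job is by year folder, producing one combined document per year. The sketch below only does the grouping; `group_by_folder` is a hypothetical helper name, and splitting per folder is just one possible choice of subtask, not necessarily the split I actually used.

```python
import os
from collections import OrderedDict

def group_by_folder(paths):
    """Group file paths by their parent folder, keeping first-seen folder order."""
    groups = OrderedDict()
    for p in paths:
        groups.setdefault(os.path.dirname(p), []).append(p)
    return groups
```

Each value in the result could then be fed through the same open, copy, paste loop with a fresh `word.Documents.Add()` and a per-folder output name, so that no single document has to hold all 8 years at once.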

references:
http://stackoverflow.com/questions/5817209/browse-files-and-subfolders-in-python
http://www.blog.pythonlibrary.org/2010/07/16/python-and-microsoft-office-using-pywin32/
http://www.galalaly.me/index.php/2011/09/use-python-to-parse-microsoft-word-documents-using-pywin32-library/
