Fortunately, the files are nicely arranged and named: a folder for each year, and within each folder, Word files named by date (MM-DD-YY). Some are .doc and some are .docx.
I have a Windows computer with Word 2010, so I can use the win32 COM service.
First, I installed pywin32, the Python library that provides the win32com module. I'm sure there are better ways to do this, but the strategy I used was to open each of the notebook files one by one and copy-paste its contents into a new document.
One side effect is that while the program was running, I couldn't use the Windows clipboard; otherwise it would have pasted the wrong stuff into the new document (oh well, it was a good excuse to go get a milkshake).

    import os
    import re
    import win32com.client as win32

    # configuration
    root_path = 'E:\\Lab Notebooks'
    outname = 'combined.docx'

    # open Word
    word = win32.gencache.EnsureDispatch('Word.Application')
    word.Visible = False  # do this in the background

    # find all the Word files under root_path
    files = [os.path.join(root, name)
             for root, dirs, filenames in os.walk(root_path)
             for name in filenames
             if name.endswith(('.docx', '.doc'))]

    # keep only the files with a MM-DD-YY date in their path
    files2 = list()
    for f in files:
        m = re.search(r'[0-9]{2}-[0-9]{2}-[0-9]{2}', f)
        if m:
            files2.append(f)

    # in most cases, you'd probably want to sort the file names, but I was
    # fortunate: the directories and files are named so conveniently that
    # os.walk collects them in the correct order

    newdoc = word.Documents.Add()

    for f in files2:
        doc = word.Documents.Open(f)
        doc.Select()  # select everything
        word.Selection.Copy()
        newdoc.Activate()
        word.Selection.Paste()
        doc.Close()
        print(f)

    newdoc.SaveAs(os.path.join(root_path, outname))
    newdoc.Close()
    word.Quit()  # otherwise the hidden Word instance keeps running
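If os.walk does not happen to return the files in date order, the sorting step mentioned in the comment could look something like the sketch below. The function name find_notebook_files is made up for illustration, and it assumes every notebook file has its MM-DD-YY date somewhere in the file name. Note that Python's %y treats 69-99 as 1969-1999 and 00-68 as 2000-2068.

```python
import os
import re
from datetime import datetime

# Hypothetical helper, not part of the original script.
DATE_RE = re.compile(r'\d{2}-\d{2}-\d{2}')


def find_notebook_files(root_path):
    """Return .doc/.docx paths under root_path, sorted by the date in the name."""
    dated = []
    for folder, _dirs, filenames in os.walk(root_path):
        for name in filenames:
            if not name.lower().endswith(('.doc', '.docx')):
                continue
            m = DATE_RE.search(name)
            if m is None:
                continue  # no MM-DD-YY date in the name, so not a notebook file
            try:
                when = datetime.strptime(m.group(0), '%m-%d-%y')
            except ValueError:
                continue  # digits that are not a real date, e.g. 13-45-09
            dated.append((when, os.path.join(folder, name)))
    dated.sort()  # oldest first; ties fall back to comparing the paths
    return [path for _when, path in dated]
```

The returned list could be used in place of files2 in the script above.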
Edit: There must be a better way to do this. I ended up having to break this up into subtasks because of memory limitations, and it went slower than seems reasonable. It did work, though, just not as slickly as I would prefer.
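One candidate for a better way is Word's InsertFile method, which pulls a file straight into a range without opening it in a window or touching the clipboard. The sketch below is untested against my notebook files, so treat it as an assumption rather than a verified fix: the function names are made up, I don't know whether it avoids the memory problem, and the formatting may come through differently than with copy-paste.

```python
import os

WD_COLLAPSE_END = 0  # numeric value of Word's wdCollapseEnd constant


def combine_documents(word, paths, out_path):
    """Append each file in paths to a new document using Range.InsertFile."""
    newdoc = word.Documents.Add()
    for path in paths:
        end = newdoc.Content            # range covering the whole document
        end.Collapse(WD_COLLAPSE_END)   # shrink it to the end of the document
        end.InsertFile(os.path.abspath(path))
    newdoc.SaveAs(out_path)
    newdoc.Close()


def run_with_word(paths, out_path):
    """Start a hidden Word instance, combine the files, then quit Word."""
    import win32com.client as win32  # needs pywin32 and Word, Windows only
    word = win32.gencache.EnsureDispatch('Word.Application')
    word.Visible = False
    try:
        combine_documents(word, paths, out_path)
    finally:
        word.Quit()
```

Since nothing goes through the clipboard, it should stay free while this runs. Passing the Word object in as an argument also makes it easy to run the loop on one year's folder at a time if memory is still a problem.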
references:
http://stackoverflow.com/questions/5817209/browse-files-and-subfolders-in-python
http://www.blog.pythonlibrary.org/2010/07/16/python-and-microsoft-office-using-pywin32/
http://www.galalaly.me/index.php/2011/09/use-python-to-parse-microsoft-word-documents-using-pywin32-library/