Dumping CONTENTdm collections for migration — the basics

These are methods for scraping files out of CONTENTdm (CDM); they (partially) solve another specific problem. I began with:

  • a couple thousand images stored in Midd’s CDM collection with good metadata
  • full-resolution archival masters in local storage
  • additional metadata in an Excel spreadsheet that I wanted to include

I wanted to migrate the items to Internet Archive, with a little manual metadata cleanup in between. However, because CDM uses its own internal IDs and filenames, there was no good way to associate exported metadata records with either the original files or the spreadsheet entries containing the extra metadata.

It seemed to me that the easiest solution was to rip the files out of CDM en masse so that they would retain the filenames referenced in the CDM metadata, perform any necessary cleanup, and repackage everything for Internet Archive by pasting columns from the CDM metadata spreadsheet into the Internet Archive template.
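As a rough sketch of that repackaging step, the column-pasting can also be scripted; the filenames and column names below are hypothetical and would need to match the actual CDM export and IA template:

import csv

# Hypothetical sketch: merge the CDM metadata export with the extra-metadata
# spreadsheet (both saved as CSV) on a shared filename column, producing rows
# ready to paste into the Internet Archive template. Column names are made up.
extra = {}
with open('extra_metadata.csv') as f:
    for row in csv.DictReader(f):
        extra[row['filename']] = row

with open('cdm_export.csv') as f, open('ia_upload.csv', 'wb') as out:
    reader = csv.DictReader(f)
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames + ['notes'])
    writer.writeheader()
    for row in reader:
        row['notes'] = extra.get(row['filename'], {}).get('notes', '')
        writer.writerow(row)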

The downside to the following methods is that, in mixed collections containing both single objects like images and compound objects like monographs, one ends up with individual images and metadata records for every page of the compound object. There’s not a straightforward way to disambiguate, say, a single image from an image representing a page in a compound object…

Step 1: OAI-PMH

Harvest the collection metadata in some reasonably structured way. The key thing here is being able to associate metadata records with the arbitrary filenames CDM assigns objects in storage.

https://github.com/vphill/pyoaiharvester is absurdly handy. The following returns collection metadata in an XML file:

python pyoaiharvest.py -l http://server.domain.edu:port/oai/oai.php -o outfile.xml -m collectionID
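
For what it's worth, the scripts below assume each <identifier> in the harvested XML looks roughly like oai:server.domain.edu:collectionID/123 (a made-up example; the host and alias will differ). The collection alias and CDM's internal pointer get pulled apart like this:

import urlparse

# hypothetical OAI identifier of the shape CONTENTdm exposes
identifier = 'oai:server.domain.edu:collectionID/123'

# urlparse treats everything after the 'oai:' scheme as the path;
# rsplit on ':' isolates 'collectionID/123', split on '/' separates the two
pathchunks = urlparse.urlparse(identifier).path.rsplit(':')[1].split('/')
print pathchunks  # ['collectionID', '123']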

Step 2: image dump

Usage: python script.py cdm_export.xml {cdm_collectionID} output_directory

from sys import argv
from urllib2 import urlretrieve
import urlparse
import xml.etree.ElementTree as ET

xmlfile = argv[1]
collection_to_dump = argv[2]
outdir = argv[3] #### note: no error checking for if outdir exists; TODO

# maximum width/height (in pixels) to request from the CDM image server
max_x = 10000
max_y = 10000

#Extract URLs

parsed = ET.parse(xmlfile)

for elem in parsed.iter(tag='identifier'):
    # identifiers look like oai:server.domain.edu:collectionID/pointer;
    # pull out the collection alias and the item's internal pointer
    pathchunks = urlparse.urlparse(elem.text).path.rsplit(':')[1].split('/')
    if pathchunks[0] == collection_to_dump:
        url = ('http://server.domain.edu/utils/ajaxhelper/?CISOROOT='
               + pathchunks[0] + '&CISOPTR=' + pathchunks[1]
               + '&action=2&DMSCALE=100&DMWIDTH='
               + str(max_x) + '&DMHEIGHT=' + str(max_y)
               + '&DMX=0&DMY=0')
        urlretrieve(url, outdir + '/' + pathchunks[1] + '.jpg')
        print 'Retrieved ' + pathchunks[1] + '.jpg'

The code above dumps a CDM collection into a directory as .jpg images, at sizes up to the values of max_x and max_y, plus a bunch of crap (e.g., empty files where the ID number references a compound object). The images retain the same filenames referenced in the exported metadata, however, so they can be matched up even when no cross-reference exists between the images' filenames and those in a remote store.
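
One low-effort way to deal with the crap is to sweep the output directory afterwards and delete the zero-byte files (a minimal sketch, assuming the empty files really are the compound-object placeholders described above):

import os

outdir = 'output_directory'  # same directory the dump script wrote into

# remove zero-byte .jpg files left where the pointer referenced a compound
# object rather than a single image
for name in os.listdir(outdir):
    path = os.path.join(outdir, name)
    if name.endswith('.jpg') and os.path.getsize(path) == 0:
        os.remove(path)
        print 'Removed empty file ' + name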

Step 3: PDF dump

Usage: python script.py cdm_export.xml {cdm_collectionID} output_directory

from sys import argv
from urllib2 import urlretrieve
import urlparse
import xml.etree.ElementTree as ET

xmlfile = argv[1]
collection_to_dump = argv[2]
outdir = argv[3] #### note: no error checking for if outdir exists; TODO

#Extract URLs
parsed = ET.parse(xmlfile)

for elem in parsed.iter(tag='identifier'):
    # identifiers look like oai:server.domain.edu:collectionID/pointer;
    # pull out the collection alias and the item's internal pointer
    pathchunks = urlparse.urlparse(elem.text).path.rsplit(':')[1].split('/')
    if pathchunks[0] == collection_to_dump:
        url = ('http://server.domain.edu/utils/getfile/'
               + pathchunks[0] + '/id/' + pathchunks[1] + '/')
        urlretrieve(url, outdir + '/' + pathchunks[1] + '.pdf')
        print 'Retrieved ' + pathchunks[1] + '.pdf'

This works to get PDFs out of CDM for collections that are mostly compound objects.
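
For mixed collections it may also be worth checking that what came back actually is a PDF; real PDFs always begin with the bytes '%PDF', so a quick sanity check might look like this (a sketch, not part of the scripts above):

import os

outdir = 'output_directory'  # same directory the dump script wrote into

# flag any downloaded file that doesn't begin with the '%PDF' magic bytes;
# for simple objects, getfile may have returned something other than a PDF
for name in os.listdir(outdir):
    if not name.endswith('.pdf'):
        continue
    with open(os.path.join(outdir, name), 'rb') as f:
        if f.read(4) != '%PDF':
            print name + ' does not look like a PDF'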