Dumping CONTENTdm collections for migration — the basics 

These are methods for scraping files from CONTENTdm (CDM). They (partially) solve another specific problem of mine. I began with:

  • a couple thousand images stored in Midd’s CDM collection with good metadata
  • full-resolution archival masters in local storage
  • additional metadata in an Excel spreadsheet that I wanted to include

I wanted to migrate the items to Internet Archive, with a little manual metadata cleanup in between. However, because CDM uses its own internal IDs and filenames, there was no good way to associate exported metadata records with either the original files or the spreadsheet entries with extra metadata.

It seemed to me that the easiest solution was to rip the files out of CDM en masse so that they would retain the filenames referenced in the CDM metadata, perform any necessary cleanup, and repackage everything for Internet Archive by pasting columns from the CDM metadata spreadsheet into the Internet Archive template.
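
The column-pasting step could also be scripted. Here is a minimal sketch in Python, assuming both the CDM metadata export and the extra spreadsheet have been saved as CSV and share a filename column. Everything named here is an assumption for illustration: the input columns (`filename`, `title`, `photographer`), the `identifier` and `file` output columns, and the choice to use the filename without its extension as the identifier. None of it reflects the actual CDM export or Internet Archive template layout.

```python
import csv
import io
from pathlib import Path


def merge_metadata(cdm_rows, extra_rows, key="filename"):
    """Attach extra spreadsheet fields to each CDM record sharing the same filename."""
    extra_by_file = {row[key]: row for row in extra_rows}
    merged = []
    for cdm in cdm_rows:
        extra = extra_by_file.get(cdm[key], {})
        record = {
            # Hypothetical identifier scheme: the filename without its extension.
            "identifier": Path(cdm[key]).stem,
            "file": cdm[key],
        }
        # Copy CDM fields first, then any extra fields CDM does not already have.
        for source in (cdm, extra):
            for field, value in source.items():
                if field != key:
                    record.setdefault(field, value)
        merged.append(record)
    return merged


def to_csv(records):
    """Serialize merged records as CSV text, one column per field seen."""
    fieldnames = []
    for record in records:
        for field in record:
            if field not in fieldnames:
                fieldnames.append(field)
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=fieldnames, restval="", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


# Stand-in data; a real run would read both files with csv.DictReader.
cdm_rows = [
    {"filename": "101.jpg", "title": "Old Chapel"},
    {"filename": "102.jpg", "title": "Library"},
]
extra_rows = [{"filename": "101.jpg", "photographer": "Unknown"}]

records = merge_metadata(cdm_rows, extra_rows)
print(to_csv(records))
```

When both sources define the same field, the CDM value wins, and records with no match in the extra spreadsheet simply get blank cells, so they are easy to spot during manual cleanup.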

The downside to the following methods is that, in mixed collections containing both single objects like images and compound objects like monographs, one ends up with individual images and metadata records for every page of the compound object. There’s not a straightforward way to disambiguate, say, a single image from an image representing a page in a compound object…