Image collection migration: Internet Archive to Omeka

Here’s a little experiment, wherein I use the good ol’ internetarchive Python library to create a CSV suitable for Omeka’s CSV Import plugin, with the goal of migrating an image collection.  Omeka’s CSV importer allows files stored online to be automatically imported by URL, so one can just point to the IA download link for the image rather than deal with local files.

Transporting large batches of content and metadata from IA to Omeka could have a number of uses. Beyond migrating one’s own collections across platforms, this could be a quick way to build thematic Omeka collections from free materials across IA. To do that, integrate this code with the more advanced search functionality of the internetarchive package (or use parts of the script to process results from IA’s advanced search tool).
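To give a feel for the thematic-collection idea, here is a rough sketch of building a broader search query instead of the plain collection lookup. The build_query and search helpers are my own hypothetical names, not part of the internetarchive package, and the subject value is made up. The only library call used is internetarchive.search_items, which takes a query string in IA’s usual field:value syntax.

```python
def build_query(collection=None, mediatype="image", subject=None):
    """Build an Internet Archive search query string from optional filters."""
    clauses = []
    if collection:
        clauses.append("collection:" + collection)
    if mediatype:
        clauses.append("mediatype:" + mediatype)
    if subject:
        # Quote the subject so a multi-word phrase is matched as a unit.
        clauses.append('subject:"' + subject + '"')
    return " AND ".join(clauses)


def search(query):
    """Run the query with the internetarchive package (needs network access)."""
    import internetarchive  # imported lazily so build_query works without it
    return internetarchive.search_items(query)


# Hypothetical theme: still images tagged with a particular subject.
query = build_query(subject="botanical illustration")
print(query)
```

The results of search(query) could then be handed to the rest of the script in place of the collection search.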

The CSV generator

First, it’s necessary to create a CSV from an Internet Archive collection metadata dump that the Omeka CSV importer can handle.  Not too hard.
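For orientation, this is roughly the shape of CSV the script aims for: one header row, then one row per item, with multiple subjects packed into a single semicolon-separated field. The sketch below writes to memory rather than to a file, and every value in the sample item is made up for illustration.

```python
import csv
import io

HEADERS = ('url', 'identifier', 'title', 'creator', 'description',
           'subject', 'date', 'type', 'format', 'source', 'rights')

# Hypothetical item: all of these values are invented for the example.
sample = {
    'url': 'https://archive.org/download/example-item/example-item.jpg',
    'identifier': 'example-item',
    'title': 'Survey map, north sheet',
    'creator': 'Unknown',
    'subject': 'maps; surveying',
    'date': '1901',
}

buffer = io.StringIO()
writer = csv.writer(buffer, delimiter=',', quotechar='"',
                    quoting=csv.QUOTE_MINIMAL)
writer.writerow(HEADERS)
# Fields the item does not have are written as empty strings.
writer.writerow([sample.get(header, '') for header in HEADERS])

csv_text = buffer.getvalue()
print(csv_text)
```

With QUOTE_MINIMAL, the title is wrapped in quotes because it contains a comma, while the other fields are written bare.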

Available on GitHub

Requires: internetarchive

Usage: python {collection ID} {outfile}

Important presumption: The Omeka CSV importer wants only items of a single type (e.g., still images, documents, etc).  This script presumes all items in a collection are of mediatype:image (i.e., still images in jpg format).
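Since the script leans on that presumption, it may be worth seeing what a more defensive file picker could look like. This is only a sketch: first_jpeg is a hypothetical helper of mine, and the file names are invented. It works on plain file name strings, so it can be tried without touching IA at all.

```python
import fnmatch


def first_jpeg(filenames):
    """Return the first file name that looks like a JPEG, or None."""
    for name in filenames:
        lowered = name.lower()  # treat .JPG and .jpg alike
        if fnmatch.fnmatch(lowered, '*.jpg') or fnmatch.fnmatch(lowered, '*.jpeg'):
            return name
    return None


# Hypothetical file listings for two items.
print(first_jpeg(['photo_meta.xml', 'photo.JPG', 'photo_thumb.jpg']))
print(first_jpeg(['scan.pdf']))
```

Returning None for an item with no JPEG lets the caller skip it instead of crashing partway through a large collection.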


import internetarchive
import json
import sys
import csv

argv = sys.argv
outfile = argv[2]
csv_out = csv.writer(open(outfile, 'w'), delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)

def main(argv, csv_out):
  rowheaders = ('url', 'identifier', 'title', 'creator', 'description', 'subject', 'date', 'type', 'format', 'source', 'rights')
  # Write the header row first so the column names are available when mapping fields in Omeka
  csv_out.writerow(rowheaders)
  collection = argv[1]
  search_collection = internetarchive.search_items('collection:' + collection)
  getMetadata(collection, outfile, csv_out, search_collection)

def writeout(csv_out, dl_url, item_identifier, collection, title, creator, description, subjects, date, itemtype, itemformat, source, rights):
    # IA returns 'subject' as a list when an item has several subjects;
    # join them with semicolons so Omeka can split them into tags
    if isinstance(subjects, list):
      subject_field = '; '.join(subjects)
    else:
      subject_field = subjects
    rowdata = (dl_url, item_identifier, title, creator, description, subject_field, date, itemtype, itemformat, source, rights)
    csv_out.writerow(rowdata)

def getMetadata(collection, outfile, csv_out, search_collection):
  print(str(search_collection.num_found) + " items in collection")
  for result in search_collection:
    title = subjects = creator = description = date = itemtype = itemformat = source = rights = ''
    item_identifier = result['identifier']
    item = internetarchive.get_item(item_identifier)
    metadata = item.item_metadata['metadata']
    print("Downloading " + item_identifier + " ...")
    if 'title' in metadata:
      title = metadata['title']
    if 'subject' in metadata: 
      subjects = metadata['subject']
    if 'creator' in metadata:
      creator = metadata['creator']
    if 'description' in metadata:
      description = metadata['description']
    if 'date' in metadata:
      date = metadata['date']
    if 'type' in metadata:
      itemtype = metadata['type']
    if 'format' in metadata:
      itemformat = metadata['format']
    if 'rights' in metadata:
      rights = metadata['rights']
    if 'source' in metadata:
      source = metadata['source']
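    # Use the first jpg in the item as the image to import
    fnames = [f.name for f in internetarchive.get_files(item_identifier, glob_pattern='*jpg')]
    if not fnames:
      print("No jpg found for " + item_identifier + ", skipping")
      continue
    imageid = fnames[0]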
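    # IA serves item files at https://archive.org/download/{identifier}/{filename}
    dl_url = 'https://archive.org/download/' + item_identifier + '/' + imageid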
    writeout(csv_out, dl_url, item_identifier, collection, title, creator, description, subjects, date, itemtype, itemformat, source, rights)

main(argv, csv_out)

The Omeka Importer

This part should be mostly self-explanatory after looking at the finished CSV.  Select the CSV for import, set the “Tag delimiter” field to a semicolon [;], and map the metadata fields appropriately.  When mapping fields, make sure to check “files” for the URL field and “tags” for the subject field.  Depending on how your IA item metadata is formatted, you may also want to check the “use HTML” checkbox for some fields — I typically do for Description and Rights, which may contain formatting or links.
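Before uploading, it can help to preview what the importer is about to receive. The sketch below reads a CSV and reports, per row, whether there is a file URL and which tags a semicolon delimiter would produce. The preview function and the sample rows are hypothetical, and trimming whitespace around tags is my assumption about sensible behaviour, not something I have verified against Omeka.

```python
import csv
import io


def preview(csv_file, tag_delimiter=';'):
    """Summarise each row: identifier, whether a URL is present, and its tags."""
    report = []
    for row in csv.DictReader(csv_file):
        url = row.get('url') or ''
        subject = row.get('subject') or ''
        # Split on the delimiter and drop empty pieces.
        tags = [t.strip() for t in subject.split(tag_delimiter) if t.strip()]
        report.append({
            'identifier': row.get('identifier'),
            'has_url': url.startswith(('http://', 'https://')),
            'tags': tags,
        })
    return report


# Made-up sample: one good row and one row with no URL or subjects.
sample = io.StringIO(
    "url,identifier,subject\n"
    "https://archive.org/download/example-item/example-item.jpg,example-item,maps; surveying\n"
    ",broken-item,\n"
)
for entry in preview(sample):
    print(entry)
```

To run it on a real export, pass an open file object for the CSV in place of the in-memory sample.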

That should be it.  Magic!