Bulk management for archive.org: generate a CSV list of files by item identifier and file extension

When prepping items for the Internet Archive, I try to create identifiers that include a short prefix denoting the collection. For example, photos added to the Middlebury College News Bureau collection from CONTENTdm received the prefix “mnb_” in their identifiers (the rest is an abbreviated identifier for the original photo and scans).

This script provides an easy way to search for files by some string in their identifier and spit out a file list as CSV.
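
The matching boils down to two tests: a substring check on the item identifier and a glob pattern on each file name. Here is a minimal offline sketch of that logic, using the standard-library fnmatch module rather than the internetarchive library. The function name and the sample data are hypothetical, made up for illustration:

```python
from fnmatch import fnmatch


def matching_files(items, search_key, file_ext):
    """Return {identifier: [file names]} for identifiers containing search_key.

    `items` is a hypothetical dict mapping identifiers to lists of file
    names, standing in for what a lookup on archive.org would return.
    """
    pattern = '*.' + file_ext
    return {
        identifier: [name for name in names if fnmatch(name, pattern)]
        for identifier, names in items.items()
        if search_key in identifier
    }


# Made-up sample data, not real archive.org items
sample = {
    'mnb_0001': ['mnb_0001.jpg', 'mnb_0001_meta.xml'],
    'mnb_0002': ['mnb_0002.jpg', 'mnb_0002_copy.jpg'],
    'other_0003': ['other_0003.jpg'],
}
result = matching_files(sample, 'mnb_', 'jpg')
```

In this sample, only the two “mnb_” items survive the identifier check, and only their .jpg files survive the extension check.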

I wrote this to help correct an error, wherein duplicate identifiers in the metadata CSV [note: super annoying; DO use unique identifiers] caused some newly created IA items to have two images. In a panic, I came up with this way to find out which items in the collection had one JPEG associated with them, and which had two. Correcting the actual problem was more convoluted, and brought Excel into the mix (and there are still some items in IA that have two identical images associated with them), but this is a start.

Available on GitHub

Requires: internetarchive

Usage: python script.py {collection_to_search} {identifier_fragment} {file_extension}
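
If you want friendlier errors when an argument is missing, the three positional arguments could be parsed with the standard-library argparse module instead of reading sys.argv directly. This is only a sketch: the parse_args helper and the sample values are hypothetical, not part of the script.

```python
import argparse


def parse_args(argv=None):
    # Three positional arguments, mirroring the usage line above
    parser = argparse.ArgumentParser(
        description='List files in a collection by identifier fragment and extension.')
    parser.add_argument('collection', help='collection to search')
    parser.add_argument('identifier_fragment',
                        help='string to look for in item identifiers')
    parser.add_argument('file_extension',
                        help='file extension without the dot, e.g. jpg')
    return parser.parse_args(argv)


# Hypothetical values, for illustration only
args = parse_args(['some_collection', 'mnb_', 'jpg'])
```

Calling parse_args() with no list reads the real command line.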

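# Written for Python 3
import csv
import sys

import internetarchive

# Command-line arguments: collection, identifier fragment, file extension
collection = sys.argv[1]
search_key = sys.argv[2]
file_ext = sys.argv[3]

search_collection = internetarchive.search_items('collection:' + collection)
search_ext = '*.' + file_ext

print(str(search_collection.num_found) + ' items in collection')
count = 0
# newline='' lets the csv module control line endings
with open('filelist.csv', 'w', newline='') as csv_file:
  csv_out = csv.writer(csv_file, delimiter=',',
                                 quotechar='"',
                                 quoting=csv.QUOTE_MINIMAL)
  for result in search_collection:
    item_identifier = result['identifier']
    # Only look at items whose identifier contains the search string
    if search_key in item_identifier:
      fnames = [f.name for f in internetarchive.get_files(item_identifier,
                                                          glob_pattern=search_ext)]
      # One row per item: the identifier, then every matching file name,
      # so items with two images (or none) are easy to spot
      csv_out.writerow([item_identifier] + fnames)
      count += 1
print(str(count) + ' items found.')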
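
Once filelist.csv exists, the one-image and two-image items can be separated by counting the file names on each row. Here is a rough sketch, assuming each row holds the identifier followed by every matching file name (a row that carries only the first file name would always count as one). The function name and the sample rows are hypothetical:

```python
import csv
import io


def split_by_file_count(rows):
    """Split rows of [identifier, file1, file2, ...] into two identifier lists."""
    single, multiple = [], []
    for row in rows:
        if not row:
            continue  # skip blank lines
        identifier, fnames = row[0], row[1:]
        if len(fnames) > 1:
            multiple.append(identifier)
        elif len(fnames) == 1:
            single.append(identifier)
    return single, multiple


# Made-up CSV content standing in for filelist.csv
sample_csv = io.StringIO(
    'mnb_0001,mnb_0001.jpg\n'
    'mnb_0002,mnb_0002.jpg,mnb_0002_copy.jpg\n'
)
single, multiple = split_by_file_count(csv.reader(sample_csv))
```

To run it against the real output, pass csv.reader(open('filelist.csv', newline='')) in place of the sample.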