Fetching subject tags from an Internet Archive Collection
This script creates a CSV file compatible with the internetarchive CLI tool. It is especially useful for updating subject tags for large batches of items in IA. (More …)
Methods for scraping files from CONTENTdm
It (partially) solves another specific problem. I began with:
I wanted to migrate the items to Internet Archive, with a little manual metadata cleanup in between. However, because CDM uses its own internal IDs and filenames, there was no good way to associate exported metadata records with either the original files or the spreadsheet entries with extra metadata.
It seemed to me that the easiest solution was to rip the files out of CDM en masse so that they would retain the filenames referenced in the CDM metadata, perform any necessary cleanup, and repackage everything for Internet Archive by pasting columns from the CDM metadata spreadsheet into the Internet Archive template.
The downside to the following methods is that, in mixed collections containing both single objects like images and compound objects like monographs, one ends up with individual images and metadata records for every page of the compound object. There’s not a straightforward way to disambiguate, say, a single image from an image representing a page in a compound object… (More …)
Adapted from: https://internetarchive.readthedocs.io/en/latest/
This represents a basic use of the internetarchive Python library, which is the backbone of all the IA scripts I use. The script does nothing except populate some variables, but it's a kernel for building lots of other things.
import internetarchive as ia
from sys import argv

# The collection identifier is the first command-line argument
collection = argv[1]
search_collection = ia.search_items('collection:' + collection)
print(str(search_collection.num_found) + " items in collection")

for result in search_collection:
    item_identifier = result['identifier']
    # Fetch the item's full metadata record
    item = ia.get_item(item_identifier)
    print("Fetching metadata for " + item_identifier + " ...")
    title = item.item_metadata['metadata']['title']
    # Not every item has subject tags, so avoid a KeyError here
    subjects = item.item_metadata['metadata'].get('subject')
    # ... and so on...
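As one way to build on that kernel, here is a minimal sketch of turning the collected subjects into a CSV of the kind described in the first post. It is not the original script. The function names, the sample identifiers, and the choice to split a single string on semicolons are all my own assumptions. The `subject[0]`, `subject[1]` column naming is how I understand the ia tool's spreadsheet mode to handle repeated fields, so check the internetarchive documentation before relying on it. No network calls are made here; the (identifier, subjects) pairs would come from the loop above.

```python
import csv
import io


def normalize_subjects(raw):
    """Return subjects as a list of stripped, non-empty strings.

    The subject field can come back as a single string or a list,
    or be missing entirely (None).
    """
    if raw is None:
        return []
    if isinstance(raw, str):
        # Assumption: one string may hold several tags separated by ';'
        raw = raw.split(";")
    return [s.strip() for s in raw if s and s.strip()]


def subjects_to_csv(records, out):
    """Write one row per item: identifier, subject[0], subject[1], ..."""
    rows = [(ident, normalize_subjects(raw)) for ident, raw in records]
    # Pad every row to the width of the item with the most subjects
    width = max((len(subjects) for _, subjects in rows), default=0)
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["identifier"] + ["subject[%d]" % i for i in range(width)])
    for ident, subjects in rows:
        writer.writerow([ident] + subjects + [""] * (width - len(subjects)))


# Hypothetical sample data standing in for values gathered in the loop
sample = [
    ("example-item-1", ["Maps", "Ohio"]),
    ("example-item-2", "Letters; Diaries"),
    ("example-item-3", None),
]
buffer = io.StringIO()
subjects_to_csv(sample, buffer)
print(buffer.getvalue())
```

Writing to an in-memory buffer keeps the sketch easy to check by eye; passing an open file handle instead would produce the CSV on disk for manual cleanup before handing it to the CLI tool.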