Digitized monographs: packaging for Internet Archive

This is a draft workflow for preparing scanned monographs and other multipage items with mediatype:texts for bulk uploading to Internet Archive. [1]

The tools:

Preparing the files

Ensure a master copy of all scanned documents exists in the local archive; make a local working copy of the archival master.

Never work on or from original files; always make a complete working copy of the files/directories to be uploaded before proceeding.  Ensure there is no confusion with the archival masters during the workflow process.
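
One way to make the working copy and confirm it is complete before touching anything is sketched below. The function name make_working_copy and both paths are hypothetical, not part of this workflow's tooling; treat it as a minimal example rather than the required method.

```shell
# Hypothetical helper: copy an archival master directory to a separate
# working location, then confirm the copy matches the master.
make_working_copy() (
    master=$1      # path to the archival master (only ever read)
    working=$2     # path for the new working copy (must not exist yet)

    if [ -e "$working" ]; then
        echo "refusing to overwrite existing path: $working" >&2
        exit 1
    fi

    # -R copies recursively; -p preserves timestamps and permissions.
    cp -Rp "$master" "$working" || exit 1

    # diff -r exits non-zero if any file differs or is missing.
    diff -r "$master" "$working" >/dev/null
)
```

Usage would look like make_working_copy /path/to/masters /path/to/working; a non-zero exit status means the copy should not be trusted.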

Collocate page scans in a single working directory for each item.

  • !!! Directory names should be the same as the item’s identifier followed by “_images” !!!
    Alternatively, the “_images” suffix can be added to the archive name at compression time instead of renaming the directories first:
     for i in */; do zip -r "${i%/}_images.zip" "$i"; done
  • Use a standard identifier found via existing archival finding aids, or invent a reasonable and descriptive identifier based on the title, creator, and/or date of the work.  Best practice is to follow international standards (maybe obvious, but not always practical, so do your best).
  • Identifiers should be formatted consistently between items in a single thematic collection.
    • The naming is essentially arbitrary and there are different views on what is the best practice, but having some consistent “flag” in the identifier really, really, really helps to make changes and updates to the objects and metadata via the API. I tend to use a short string prefix followed by an underscore:
    • E.g., from the Abernethy Pamphlets Collection:
      • aberpa_longfellowhw.1845.nightvoices_images/
      • aberpa_whittierjg.1874.agassiz_images/
    • E.g., from the Vermont Rare Books & Manuscripts Collection:
      • vtrbms_SVSC_images/
      • vtrbms_JCC1827_images/
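
If the item directories are already named with their identifiers, the “_images” suffix can be appended in bulk. The sketch below assumes every subdirectory of the given parent is an item directory; the function name add_images_suffix is made up for illustration.

```shell
# Hypothetical helper: append "_images" to every item directory under a
# parent directory, skipping any that already carry the suffix.
add_images_suffix() (
    cd "$1" || exit 1
    status=0
    for d in */; do
        [ -d "$d" ] || continue              # parent has no subdirectories
        name=${d%/}
        case $name in
            *_images) continue ;;            # already named correctly
        esac
        if [ -e "${name}_images" ]; then
            echo "target already exists, skipping: ${name}_images" >&2
            status=1
            continue
        fi
        mv -- "$name" "${name}_images" || status=1
    done
    exit "$status"
)
```

Run it against the working copy only, e.g. add_images_suffix /path/to/working.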

Rename files inside your working copy to allow automated processing

  • All files should be clearly named with the item’s standard identifier followed by an underscore (_) and number with leading zeros sufficient to sort them alphanumerically.
    • {filename}_{0xxx}.tif
  • For preparing files on Windows computers, Bulk Rename Utility (http://www.bulkrenameutility.co.uk/Main_Intro.php) is a free, easy-to-use tool for this purpose.  A wide variety of scripts and tools are available to accomplish this in a Linux environment; Thunar file manager’s batch rename utility is a good GUI tool.
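
For a scripted alternative on Linux, the renaming can be sketched as a small shell function. The name rename_pages is hypothetical, and the sketch assumes the existing .tif filenames already sort in page order and that four digits are enough for the longest item.

```shell
# Hypothetical helper: rename every .tif in a directory to
# {identifier}_{0001..}.tif, numbering in the shell's sorted glob order.
rename_pages() (
    dir=$1          # directory holding one item's page scans
    id=$2           # the item's identifier, e.g. "walden"
    cd "$dir" || exit 1
    n=0
    for f in *.tif; do
        [ -f "$f" ] || continue                  # no .tif files present
        n=$((n + 1))
        new=$(printf '%s_%04d.tif' "$id" "$n")   # zero-padded page number
        if [ "$f" = "$new" ]; then
            continue                             # already named correctly
        fi
        if [ -e "$new" ]; then
            echo "refusing to overwrite $new" >&2
            exit 1
        fi
        mv -- "$f" "$new" || exit 1
    done
)
```

Note that names such as scan2.tif and scan10.tif do not sort in page order, so check the ordering before running something like rename_pages walden_images walden.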

Convert TIF images to JPG

  • ImageMagick (http://www.imagemagick.org) is the tool of choice for batch conversions.
  • To convert images in a single directory:
    • find . -maxdepth 1 -name '*.tif' -exec sh -c 'mogrify -format jpg -quality 100 "$1" && rm "$1"' _ {} \; (removes each .tif original automatically once it has been converted)
    • In special cases where .tif files are needed for another process, use mogrify -format jpg -quality 100 *.tif (retains the .tif originals, which must still be deleted before the directory is zipped; running rm *.tif in the images’ directory will suffice.)
  • To convert images in nested directories:
    find . -name '*.tif' -exec sh -c 'mogrify -format jpg -quality 100 "$1" && rm "$1"' _ {} \; (automatically converts each .tif to .jpg and deletes the original, leaving other directory and naming structures intact)
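
When the .tif originals are kept through conversion and deleted as a separate step, it is worth confirming that every original has a converted counterpart first. The check below is only a sketch: list_unconverted is a hypothetical name, it does not call ImageMagick, and it assumes filenames contain no newlines.

```shell
# Hypothetical helper: print every .tif under a directory that lacks a
# non-empty .jpg with the same base name. No output means every original
# has a converted counterpart.
list_unconverted() {
    find "$1" -type f -name '*.tif' | while IFS= read -r tif; do
        jpg=${tif%.tif}.jpg
        # -s is true only if the file exists and is larger than zero bytes.
        [ -s "$jpg" ] || printf '%s\n' "$tif"
    done
}
```

If list_unconverted . prints nothing, the remaining .tif files can be removed; otherwise re-run the conversion for the files it lists.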

Delete superfluous Windows files

  • For single directories, delete the “Thumbs.db” file if it exists
  • If working with many directories run the following from the parent directory (the match is case-insensitive, so both “Thumbs.db” and “thumbs.db” are removed):
    find . -type f -iname 'thumbs.db' -exec rm -f {} \;

Zip the directory

File compression is required even for single-page items; the {filename}_images.zip naming convention triggers Internet Archive’s automated derivation process and ensures thumbnails, OCR transcripts, and alternative formats are generated without further intervention.

  • Compress the containing folder using ZIP compression
    • {filename}_images.zip
  • 7-Zip (http://www.7-zip.org) is the utility of choice; compressing files over 4 GB might be problematic with other software.
  • To compress several directories at once on Linux, run the following from the parent directory (the directories must already be named {filename}_images for the archives to be named correctly):
    for i in */; do zip -r "${i%/}.zip" "$i"; done
  • Final results should be a single .zip file for each book or item, containing numbered page scans in .jpg format.  For example:
    • walden_images.zip
      • walden_0001.jpg
      • walden_0002.jpg
      • walden_0003.jpg
      • …… and so on.
  • Collocate all zip files in a single directory, along with the metadata CSV file.
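
Before uploading, the staging directory can be checked against the naming rules above. This is a minimal sketch with a hypothetical function name, check_upload_dir; it inspects filenames only and does not open the archives or validate the CSV contents.

```shell
# Hypothetical helper: confirm a staging directory holds at least one
# archive, that every archive is named *_images.zip, and that a metadata
# CSV file is present.
check_upload_dir() (
    cd "$1" || exit 1
    status=0

    set -- *.zip                     # unmatched glob stays literal
    if [ ! -e "$1" ]; then
        echo "no .zip files found" >&2
        exit 1
    fi
    for z in "$@"; do
        case $z in
            *_images.zip) ;;         # follows the naming convention
            *) echo "unexpected archive name: $z" >&2; status=1 ;;
        esac
    done

    set -- *.csv
    if [ ! -e "$1" ]; then
        echo "no metadata CSV found" >&2
        status=1
    fi
    exit "$status"
)
```

A zero exit status from check_upload_dir /path/to/staging means the names look right; any problems are reported on standard error.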

Notes:

  1. More information is available at: