Updates from Patrick Wallace

  • Patrick Wallace 12:41 on 2019-03-05
    Tags: comms

    Tips for librarians : searching Internet Archive 

    Internet Archive help documentation – including walkthroughs for many common tasks – can be found here.

    A tale of two search methods

    “It was the best of UIs, it was the worst of UIs…”

    The “details” page for a Collection or User on archive.org has two search boxes, and each box works somewhat differently.

    • Both search query boxes can be set to search either metadata records or the fulltext of OCR’d content from “text” type objects.
    • The search box in the site banner searches all of Internet Archive and is located at the upper-right of the page.  In addition to metadata and fulltext searches, it can also be set to search closed caption text in IA’s television archives.
    • The search box in the left-nav bar, above the facet filters, searches only within the current collection or user library. While fulltext searches from this box function identically to those from the global (upper-right) search box, the metadata search works on a totally different principle.

    To help users find information in IA effectively, it is critically important for librarians to understand the difference between these two “search” mechanisms.

    • Queries entered into the top-right search bar (as well as all fulltext searches) are “true” searches.
      • They pass the query to a search engine and return a search results page to the user.
      • Although they look similar to a collection, search results cannot be browsed or filtered by facet in the same way.
      • Search result filters are limited to: Collection, Subject, Creator, and Language.
    • In contrast, metadata (but not fulltext) searches within a collection create custom filters based on the query string.
      • Instead of passing the query to a search engine and then sending the user to a “results” page, the system applies a filter to the collection display page.
      • Despite appearing very similar to search results pages, collection display pages can be browsed by applying additional facet filters.
    • Collection and search results pages also differ in appearance:
      • Search results pages have a full-width search bar with light gray background at the top, directly under the banner.
      • Collection pages (filtered or not) show the collection’s thumbnail, title, and “About” blurb in the same location, directly under the banner.
    • The most reliable way to determine whether you’re looking at a search result page or a facet-filtered Collection page is by looking at the URL.
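
    The URL check described in the last bullet can be sketched in a few lines of Python. This is only an illustration: the function name `classify_ia_url` is hypothetical, and the path patterns (`/search...` for search results, `/details/<identifier>` for collection pages, with any filters carried in the query string) are assumptions about how archive.org lays out its URLs, not something documented in this post.

```python
from urllib.parse import urlparse


def classify_ia_url(url: str) -> str:
    """Guess the page type from an archive.org URL path (assumed patterns)."""
    path = urlparse(url).path
    # Assumption: search results pages live under /search or /search.php.
    if path.startswith("/search"):
        return "search results"
    # Assumption: collection pages, filtered or not, live under /details/<identifier>.
    if path.startswith("/details/"):
        return "details page"
    return "unknown"


print(classify_ia_url("https://archive.org/search.php?query=cheese"))
# The "and[]" filter parameter below is an assumed example of a filtered collection URL.
print(classify_ia_url("https://archive.org/details/middleburynewspapers?and[]=cheese"))
```

    Only the path is inspected, so a filtered collection page is still reported as a details page no matter how many filters are applied.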

    Example – metadata vs. fulltext search in a Collection (left-nav search box):

    These two searches use the left-nav “search this collection” box from the Middlebury College Library main collection page (go/ia). The query text is “cheese”.

    Example – facet filtering of a collection vs. querying subject term + collection name in global IA search (upper-right search box):

    These examples show the difference between search and filtering. Both methods yield the same results by leveraging the same metadata elements. However, the latter method sends the user to a search results page, while the former simply filters the Collection details page.
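
    As a rough sketch, the global-search side of this comparison amounts to a field-qualified query that names both the collection and the subject term. The helper below builds such a string. `build_query` and the identifier `example_collection` are hypothetical, and the `field:value AND field:value` form is assumed from the operator examples on the Advanced Search page.

```python
def build_query(collection: str, subject: str) -> str:
    """Combine a collection identifier and a subject term into one query string."""
    # Quote the subject so that multi-word terms are treated as a phrase.
    return f'collection:{collection} AND subject:"{subject}"'


# "example_collection" is a placeholder, not a real collection identifier.
query = build_query("example_collection", "cheese")
print(query)
```

    Pasting the resulting string into the upper-right search box would send the user to a search results page, whereas picking the same subject from the facet list keeps them on the Collection page.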

    Searching Metadata Records

    Metadata searches are the default search type, and work similarly to an OPAC. They look only at the contents of object/collection metadata, not their fulltext.

    Searching fulltext

    Fulltext searching allows you to search inside books and other text items, based on their OCR text.

    • As on all search results pages on IA, only a few filters can be used to limit results: Collection, Subject, Creator, and Language.
      • Protip: Creative use of subject tags to create pseudo-collections goes a long way in expanding the usefulness of fulltext searching. Example:
        • The problem – Both Middlebury Magazine and the Campus are part of the same collection, and thus a fulltext search within “middleburynewspapers” will yield results from both publications.
        • The solution – Periodicals are tagged with a uniform title in the Subject field. To see results from only Middlebury Magazine in a fulltext search for the phrase “Starr Library”: first do a fulltext search for “Starr Library” from the left-nav box on the middleburynewspapers collection page, then use the Subject filter “Middlebury College Magazine”.
    • Filtering out items inside a certain date range cannot be combined with fulltext searching; however, completing a fulltext search inside a limited-scope collection (such as College Newspapers) and sorting the results by date should make the task more manageable.
    • Fulltext searches use objects’ OCR’d text. They will not return results from objects with a mediatype other than “text”, nor will they return results from texts with bad or missing OCR.
    • Fulltext searches will search all texts on the site if completed from the top-right search bar, and inside a collection from the left-nav search bar.
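
    The “filter by Subject, then sort by date” workflow from the bullets above can be pictured with a small Python sketch. It runs on made-up records rather than live IA results: the identifiers, dates, and the subject “The Middlebury Campus” are invented for illustration, and `narrow` is a hypothetical helper.

```python
from datetime import date

# Made-up records standing in for the hits of a fulltext search in one collection.
results = [
    {"identifier": "item_a", "subject": "Middlebury College Magazine", "date": date(1985, 3, 1)},
    {"identifier": "item_b", "subject": "The Middlebury Campus", "date": date(1972, 10, 5)},
    {"identifier": "item_c", "subject": "Middlebury College Magazine", "date": date(1961, 6, 1)},
]


def narrow(records, subject):
    """Keep only records with the given subject tag, oldest first."""
    matching = [r for r in records if r["subject"] == subject]
    return sorted(matching, key=lambda r: r["date"])


magazine_hits = narrow(results, "Middlebury College Magazine")
print([r["identifier"] for r in magazine_hits])  # ['item_c', 'item_a']
```

    The subject tag does the work of a sub-collection, and sorting by date makes it practical to scan for a period of interest even though a date-range filter is not available.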

    Advanced Search

    Advanced search lets users create complex queries to search item records.  Currently, Advanced Search does not support fulltext searching.

    The Advanced Search page has three main sections (note – there are two different forms on the same page, with separate “submit” buttons):

    1. At the top of the page is the regular Advanced Search form.
    2. Below that is a second form, which allows users to receive search results in one of several file formats – including CSV, JSON, & XML.
      • The query string must be entered into the text box for the second form; anything in text boxes above the heading “Advanced Search returning JSON, XML, and more” is ignored.
      • Users can enter the number of files they want returned, and how many results are in each file – the default setting is 50 results in one file; setting a value larger than the number of results actually returned will simply return all results (i.e., set this as high as necessary to ensure all results are included).
      • Although it searches through an item’s entire metadata record, the results data will only contain one metadatum per object by default: the Identifier.
        • Additional metadata fields can be selected from the list under the query box. Ctrl+mouseclick to select more than one at a time.
        • For CSV formatted results (the most commonly used), selected fields will be columns, while rows represent items.
      • Search results will not be displayed in the browser; instead, clicking “submit” initiates download of the search result data file.
    3. The bottom section, which runs a long way down the page, gives a wide range of examples of how to use search operators.
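
    For readers who would rather script the second form than fill it in by hand, here is a hedged Python sketch. It assumes the form submits to `https://archive.org/advancedsearch.php` with the parameters `q`, `fl[]`, `rows`, and `output`. The helper `build_export_url`, the identifier `example_collection`, and the sample CSV text are all made up for illustration, and nothing is fetched over the network.

```python
import csv
import io
from urllib.parse import urlencode

# Assumed endpoint behind the "Advanced Search returning JSON, XML, and more" form.
BASE_URL = "https://archive.org/advancedsearch.php"


def build_export_url(query, fields=("identifier",), rows=50, output="csv"):
    """Build a download URL; the defaults mirror the form (identifier only, 50 rows)."""
    params = [("q", query)]
    params += [("fl[]", field) for field in fields]  # one fl[] entry per selected field
    params += [("rows", rows), ("output", output)]
    return f"{BASE_URL}?{urlencode(params)}"


url = build_export_url("collection:example_collection", fields=("identifier", "title"))
print(url)

# Made-up CSV text in the shape described above: fields are columns, items are rows.
sample_csv = "identifier,title\nexample_item_1,First title\nexample_item_2,Second title\n"
items = list(csv.DictReader(io.StringIO(sample_csv)))
print(items[0]["identifier"])
```

    Raising `rows` well above the expected number of hits follows the advice above about making sure that all results land in the file.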
     
  • Patrick Wallace 15:24 on 2017-03-15
    Tags: aws, ec2, fedora, file systems, interface, islandora, linux, projects, repository, s3

    Setting up Islandora + Fedora 3 on AWS EC2 and S3 storage: Progress Report #1 

    Note: this was a fun experiment that we never quite got working well enough for production, for a number of reasons (mostly S3 bandwidth limits).

    “They said we were insane, that it couldn’t be done…  but we’ll show them…  we’ll show them all! (via as much open documentation as I can provide, of course.)” – me.

    Leading Middlebury’s nascent digital repository project has been one of the biggest challenges of my career, in terms of scale as well as the administrative and technical complexity involved in getting it off the ground.  I’m sure I’ll go into more detail about the project & process soon, but here’s an overview of what we put together over recent months:

    The goal – to build a Swiss-Army, Do-It-All, Master Brain for the library’s digital content.

    Our challenge was to create a single, cloud-hosted repository to serve four distinct use cases with substantially different requirements while reducing our repository budget (currently dedicated to a much smaller CONTENTdm instance) as much as possible.

    The repository of our dreams would be, at turns:

    • An institutional Open Access repository for faculty preprints and honors theses;
    • A full-service platform for Special Collections & Archives’ digitized and born-digital content;
    • A science data repository capable of handling very large datasets with complex permissions & embargo requirements;
    • A DPLA Service Hub for the state of Vermont.

    After all the long hours of research, committee deliberations, proposals, reports, and other preparatory work, we decided on Islandora + Fedora 3 over the close runner-up, Hydra + Fedora 4.  There were a number of reasons we chose to go this route, but the main ones were that Wendy and I were already very familiar with Islandora (Wendy had already put two years of work into prototyping an Islandora-based data repository before the scope was expanded), that it fit in well with existing systems, and that it is significantly more stable than Hydra + Fedora 4 at present.

    Object storage is not block storage, but it sure is cheap.

    In terms of raw size, we have lots of content, especially when it comes to science data.  Side scan survey data, for example, can push some larger datasets above the 10 TB mark.  Making archival video and – to a slightly lesser extent – 3D scan data freely available also depends on having quite a bit of storage behind the repository.  With hosted storage costs per TB per year generally sitting in the mid-to-low four figure range and first-year storage demand around 10-20 TB, our only apparent recourse was the cheapest possible cloud storage: Amazon S3, MS Azure, or an equivalent.

    The problem is that object storage is not block storage, and Fedora doesn’t necessarily play nice with S3.  We knew that setting up Islandora to push backups to S3 had been implemented at other institutions, and that using S3 to hold Fedora’s datastreamStore (which accounts for most of Fedora’s storage use) had been tried with varying degrees of success.  Admittedly, we initially underestimated just how tenuous the results of those prior attempts had been, but it sounded like a good idea.

    Amazon S3 only supports a limited number of operations – GET, PUT, and DELETE. It is far from POSIX-compliant, and even with an interface layer like s3fs-fuse, has a number of limitations when set against local block storage (list adapted from s3fs readme):

    • random writes or appends to files require rewriting the entire file
    • network latency is an issue
    • no renames of files or directories
    • no coordination between multiple clients mounting the same bucket
    • no hard links

    S3 interface layers: many approaches, few solutions for Islandora.

    The core idea for making our EC2 / S3 based Islandora instance work was to find an appropriate interface layer that would allow Fedora’s datastreamStore directory to sit on S3, with Fedora accessing the datastreamStore via a symlink in the server install directory.
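
    The symlink arrangement can be demonstrated in miniature with Python’s standard library. This sketch uses throwaway temporary directories in place of the real paths: `s3_mount` stands in for wherever the interface layer mounts the bucket and `fedora/data` for the Fedora install directory, and both names are assumptions rather than our actual layout. It assumes a Linux-like system where creating symlinks needs no special privileges.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as root:
    # Hypothetical stand-ins for the mounted bucket and the Fedora data directory.
    mount_point = os.path.join(root, "s3_mount", "datastreamStore")
    fedora_data = os.path.join(root, "fedora", "data")
    os.makedirs(mount_point)
    os.makedirs(fedora_data)

    # Fedora would see an ordinary "datastreamStore" directory in its data folder...
    link_path = os.path.join(fedora_data, "datastreamStore")
    os.symlink(mount_point, link_path)

    # ...but anything written through the link physically lands on the mount.
    with open(os.path.join(link_path, "probe.txt"), "w") as handle:
        handle.write("ok")
    landed_on_mount = os.path.exists(os.path.join(mount_point, "probe.txt"))
    print(landed_on_mount)
```

    The link itself is the easy part. Whether the application behind it tolerates the mount’s behavior is a separate question.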

    s3fs-fuse was the first interface we tried, being the most common and well-developed interface layer for the task.  Unfortunately, though setting up s3fs to convince Ubuntu that our s3 bucket was “just another directory” was easy as pie, Fedora did not play nicely with the setup.  A number of errors and failures when trying to rebuild the datastreamStore led to a disappointing call from DiscoveryGarden (whom we contracted to complete the Islandora install), who said that using s3fs was a non-starter and recommended that we abandon the s3 idea and resign ourselves to a much more modest repository on local block device storage.

    yas3fs was the second-tier candidate, mostly because Brad Spry at UNC-Charlotte reported a working install of Islandora using this technique in a substantially more complicated setup than we were imagining for Midd.  Though Brad later told me the project was abandoned in favor of a different hosting/storage setup due to speed & performance issues, we abandoned our own work with yas3fs quickly due to an almost complete lack of documentation.  Frankly, after more than a full workday of grinding, I could not figure out how to pass credentials to s3 using yas3fs, and thus never successfully mounted the bucket as a directory on the server running Fedora. Not great signs when choosing an enterprise solution.

    Then s3ql came along, and it seems to be working well.  Though it follows an interface paradigm similar to that of s3fs and yas3fs, s3ql gains the edge with a bit more POSIX compliance and decent speeds for disk operations that do not require actually reading/writing to stored files.

    The road ahead: testing and QA

    So, Middlebury now has an Islandora/Fedora3 repository up and running on EC2, with its datastreamStore living more-or-less comfortably in an s3 bucket.  If the setup continues to perform well enough to be usable in the interim before Islandora Claw/Fedora4 (hopefully) make cloud-based repositories a bit more robust, we will have shaved many thousands of dollars off of our potential storage budget and successfully built a repository meeting our needs while ringing up significantly cheaper than our (now officially obsolete) local CONTENTdm install.

    There’s still a lot to do:

    • Monitor the number of GET/PUT transactions between s3 and EC2.  Amazon charges for transactions, and any pseudo-block interface for s3 seems likely to increase the overall number of transactions for object uploads and downloads, along with incidental transaction costs for most uses that touch Fedora’s datastreamStore.  How many? We are not sure yet.
    • Get a static IP and choose a domain name.
    • Figure out how to support batch application of pending updates, and how to schedule OS & security updates.
    • Review security.
    • Skin, brand, and make other setup tweaks.
    • Migrate objects and metadata from CONTENTdm and elsewhere.
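
    For the transaction-monitoring item, a back-of-the-envelope cost estimate is simple arithmetic once request counts are known. The Python sketch below is illustrative only: `monthly_request_cost` is a hypothetical helper, the request counts are invented, and the per-1,000-request prices are placeholder values that should be replaced with figures from Amazon’s current price list.

```python
def monthly_request_cost(puts, gets, put_price_per_1000, get_price_per_1000):
    """Estimate request charges from request counts and per-1,000-request prices."""
    return (puts / 1000) * put_price_per_1000 + (gets / 1000) * get_price_per_1000


# All four numbers are placeholders, not measurements or quoted prices.
cost = monthly_request_cost(
    puts=2_000_000,
    gets=10_000_000,
    put_price_per_1000=0.005,
    get_price_per_1000=0.0004,
)
print(f"Estimated request charges: ${cost:.2f}")
```

    The open question from the first bullet is the counts themselves, which depend on how many requests the interface layer issues per Fedora operation.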
     
  • Patrick Wallace 15:09 on 2016-06-01
    Tags: audio, realaudio, realmedia, ripping, streaming media, transcoding, video

    Scraping and converting RealMedia streams to .mp4 

    This handy little script is a way to convert RealMedia streams (.rm, .ram, etc) to .mp4.

    The problem this script – crudely – solved began with a video server containing dozens of recorded lectures in RealMedia format, which were proving beastly in conversion. Some were created and served by Midd’s own streaming media tool, while others were created by an Accordent system from long ago.  Just downloading the .rm files from the server and running them through ffmpeg (or RealPlayer Converter for Windows) left the audio and video tracks wildly out of sync [1]; endless tinkering with codecs and settings didn’t improve the situation. Left with few other options, I wrote this script to dump the A/V streams in real time and convert them to .mp4 via ffmpeg.

    Note: the scripted method is SLOOOOOOW and wonky and resource-heavy.  The time to download and convert a .rm video is around 130% of the playback duration.
    (More …)
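
    To put the 130% figure in planning terms, the total processing time for a batch is roughly the summed playback time multiplied by 1.3. A minimal Python sketch follows, with invented lecture durations and a hypothetical helper name:

```python
def estimated_processing_minutes(playback_minutes, overhead_factor=1.3):
    """Estimate total download-and-convert time from playback durations."""
    # 1.3 reflects the "around 130% of the playback duration" observation above.
    return sum(playback_minutes) * overhead_factor


lectures = [50, 75, 60]  # made-up playback durations, in minutes
total = estimated_processing_minutes(lectures)
print(f"About {total:.0f} minutes for {len(lectures)} lectures")
```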

    Notes:

    1. My beloved hometown’s own Walker Art Center has documented a Mac-based workflow that runs into the same sync issues, with a different solution, here.
     
  • Patrick Wallace 14:52 on 2016-05-21

    Bulk management for archive.org: generate a CSV list of files by item identifier and file extension 

    When prepping items for Internet Archive, I try to create identifiers that include a short prefix denoting the collection. For example, photos added to the Middlebury College News Bureau collection from CONTENTdm received the prefix “mnb_” to their identifiers (the rest is an abbreviated identifier for the original photo and scans).

    This script provides an easy way to search for files by some string in their identifier and spit out a file list as CSV. (More …)
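
    The general shape of such a listing can be sketched in Python. This is not the script from the post: it works on a made-up list of (identifier, filename) pairs instead of querying archive.org, and `files_to_csv` and the sample names are hypothetical.

```python
import csv
import io


def files_to_csv(files, identifier_contains, extension):
    """Return CSV text listing files whose identifier and extension both match."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["identifier", "filename"])
    for identifier, filename in files:
        # Match on a substring of the identifier and a case-insensitive extension.
        if identifier_contains in identifier and filename.lower().endswith(extension.lower()):
            writer.writerow([identifier, filename])
    return buffer.getvalue()


# Made-up (identifier, filename) pairs; real ones would come from archive.org.
sample_files = [
    ("mnb_example_001", "mnb_example_001.jpg"),
    ("mnb_example_001", "mnb_example_001_meta.xml"),
    ("other_example_002", "other_example_002.jpg"),
]
csv_text = files_to_csv(sample_files, "mnb_", ".jpg")
print(csv_text)
```

    Because every identifier in a collection shares a prefix such as “mnb_”, a plain substring match is enough to pull one collection’s files out of a larger listing.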

     
  • Patrick Wallace 18:21 on 2016-05-19
    Tags: ebooks, monographs, uploading

    Digitized monographs: packaging for Internet Archive 

    This is a draft workflow for preparing scanned monographs and other multipage items with mediatype:texts for bulk uploading to Internet Archive. [1]

    The tools:

    (More …)

    Notes:

    1. More information is available at:

     