Tagged: internetarchive

  • Patrick Wallace 12:41 on 2019-03-05
    Tags: comms, internetarchive

    Tips for librarians: searching Internet Archive

    Internet Archive help documentation – including walkthroughs for many common tasks – can be found here.

    A tale of two search methods

    “It was the best of UIs, it was the worst of UIs…”

    The “details” page for a Collection or User on archive.org has two search boxes, and each box works somewhat differently.

    • Both search query boxes can be set to search either metadata records or the fulltext of OCR’d content from “text” type objects.
    • The search box in the site banner searches all of Internet Archive and is located at the upper-right of the page.  In addition to metadata and fulltext searches, it can also be set to search closed caption text in IA’s television archives.
    • The search box in the left-nav bar, above the facets, searches only within the current collection or user library. While fulltext searches from this box function identically to those from the global (upper-right) search box, the metadata search works on a totally different principle.

    To help users find information in IA effectively, it is critically important for librarians to understand the difference between these two “search” mechanisms:

    • Queries entered into the top-right search bar (as well as all fulltext searches) are “true” searches.
      • They pass the query to a search engine and return a search results page to the user.
      • Although they look similar to a collection, search results cannot be browsed or filtered by facet in the same way.
      • Search result filters are limited to: Collection, Subject, Creator, and Language.
    • In contrast, metadata (but not fulltext) searches within a collection create custom filters based on the query string.
      • Instead of passing the query to a search engine and then sending the user to a “results” page, the system applies a filter to the collection display page.
      • Despite appearing very similar to search results pages, collection display pages can be browsed by applying additional facet filters.
    • Collection and search results pages also differ in appearance:
      • Search results pages have a full-width search bar with light gray background at the top, directly under the banner.
      • Collection pages (filtered or not) show the collection’s thumbnail, title, and “About” blurb in the same location, directly under the banner.
    • The most reliable way to determine whether you’re looking at a search result page or a facet-filtered Collection page is by looking at the URL.
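
    The URL check can be sketched in a few lines of Python. This is a minimal illustration, not an official rule: it assumes search results pages live under a path beginning with "/search", collection pages under "/details/<identifier>", and that facet filters on a collection page appear as "and[]" query parameters. The function name and the identifier "examplecollection" are hypothetical.

```python
from urllib.parse import urlparse, parse_qs


def classify_ia_url(url: str) -> str:
    """Guess which kind of archive.org page a URL points to.

    Assumptions (verify against the live site): search results use a
    path starting with "/search"; collection pages use "/details/<id>";
    facet filters show up as "and[]" query parameters.
    """
    parts = urlparse(url)
    if parts.path.startswith("/search"):
        return "search results"
    if parts.path.startswith("/details/"):
        # parse_qs treats "and[]" as an ordinary key name.
        if "and[]" in parse_qs(parts.query):
            return "filtered collection page"
        return "collection page"
    return "unknown"


print(classify_ia_url("https://archive.org/search.php?query=cheese"))
print(classify_ia_url("https://archive.org/details/examplecollection?and[]=cheese"))
```

    Note that "/details/" is also used for individual item pages, so the sketch only distinguishes search results from details-style pages.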

    Example – metadata vs. fulltext search in a Collection (left-nav search box):

    These two searches use the left-nav “search this collection” box from the Middlebury College Library main collection page (go/ia). The query text is “cheese”.

    Example – facet filtering of a collection vs. querying subject term + collection name in global IA search (upper-right search box):

    These examples show the difference between search and filtering. Both methods yield the same results by leveraging the same metadata elements. However, the latter method sends the user to a search results page, while the former simply filters the Collection details page.
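
    The global-search half of this comparison boils down to a query string that names both the collection and the subject term. A minimal sketch, assuming IA's Lucene-style field:value query syntax with the "collection" and "subject" metadata fields; the function name and the identifier "examplecollection" are hypothetical placeholders, not the real Middlebury identifiers.

```python
def scoped_subject_query(collection_id: str, subject: str) -> str:
    """Build a query string limiting results to one collection and one subject.

    Assumes field:value syntax; the subject is quoted so that
    multi-word terms are treated as a single phrase.
    """
    escaped = subject.replace('"', '\\"')
    return f'collection:{collection_id} AND subject:"{escaped}"'


# Paste the resulting string into the global (upper-right) search box.
print(scoped_subject_query("examplecollection", "Cheese"))
# → collection:examplecollection AND subject:"Cheese"
```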

    Searching Metadata Records

    Metadata searches are the default search type, and work similarly to an OPAC. They look only at the contents of object/collection metadata, not their fulltext.

    Searching fulltext

    Fulltext searching allows you to search inside books and other text items, based on their OCR text.

    • As on all search results pages on IA, only a few filters can be used to limit results: Collection, Subject, Creator, and Language.
      • Protip: Creative use of subject tags to create pseudo-collections goes a long way in expanding the usefulness of fulltext searching. Example:
        • The problem – Both Middlebury Magazine and the Campus are part of the same collection, and thus a fulltext search within “middleburynewspapers” will yield results from both publications.
        • The solution – Periodicals are tagged with a uniform title in the Subject field. To see results from only Middlebury Magazine in a fulltext search for the phrase “Starr Library”: first run a fulltext search for “Starr Library” from the left-nav box on the middleburynewspapers collection page, then apply the Subject filter “Middlebury College Magazine”.
    • Date-range filters cannot be combined with fulltext searching; however, running a fulltext search inside a limited-scope collection (such as College Newspapers) and sorting the results by date should make the task more manageable.
    • Fulltext searches use an object’s OCR’d text. They will not return results from objects with a mediatype other than “text”, nor from texts with bad or missing OCR.
    • Fulltext searches will search all texts on the site if completed from the top-right search bar, and inside a collection from the left-nav search bar.

    Advanced Search

    Advanced search lets users create complex queries to search item records.  Currently, Advanced Search does not support fulltext searching.

    The Advanced Search page has three main sections (note – there are two different forms on the same page, with separate “submit” buttons):

    1. At the top of the page is the regular Advanced Search form.
    2. Below that is a second form, which allows users to receive search results in one of several file formats – including CSV, JSON, & XML.
      • The query string must be entered into the text box for the second form; anything in text boxes above the heading “Advanced Search returning JSON, XML, and more” is ignored.
      • Users can enter the number of files they want returned, and how many results are in each file – the default setting is 50 results in one file; setting a value larger than the number of results actually returned will simply return all results (i.e., set this as high as necessary to ensure all results are included).
      • Although it searches through an item’s entire metadata record, the results data will only contain one metadatum per object by default: the Identifier.
        • Additional metadata fields can be selected from the list under the query box. Ctrl+mouseclick to select more than one at a time.
        • For CSV formatted results (the most commonly used), selected fields will be columns, while rows represent items.
      • Search results will not be displayed in the browser; instead, clicking “submit” initiates download of the search result data file.
    3. The rest of the page, which runs a long way down, gives a wide range of examples of how to use search operators.
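
    The second form submits an ordinary GET request, so the same download can be described as a URL. The sketch below is an illustration under assumptions, not official documentation: it assumes the endpoint is advancedsearch.php and that its parameters are named q, fl[], rows, page, and output. The function names and the sample data are hypothetical. It also shows how a CSV result maps onto rows (items) and columns (selected fields).

```python
import csv
import io
from urllib.parse import urlencode

# Assumed endpoint for the "returning JSON, XML, and more" form.
ADVANCED_SEARCH_URL = "https://archive.org/advancedsearch.php"


def build_csv_search_url(query, fields=("identifier",), rows=50, page=1):
    """Build a URL asking for search results as CSV.

    fields: metadata fields to return as columns (default: identifier only).
    rows:   number of results per file; set it high to get everything.
    """
    params = [("q", query)]
    params += [("fl[]", field) for field in fields]
    params += [("rows", rows), ("page", page), ("output", "csv")]
    return f"{ADVANCED_SEARCH_URL}?{urlencode(params)}"


def parse_csv_results(csv_text):
    """Turn downloaded CSV text into one dict per item, keyed by field name."""
    return list(csv.DictReader(io.StringIO(csv_text)))


url = build_csv_search_url(
    "collection:examplecollection", fields=("identifier", "title"), rows=500
)
print(url)

# Hypothetical sample of what a downloaded file might contain.
sample = '"identifier","title"\n"mnb_0001","Example photo"\n'
for item in parse_csv_results(sample):
    print(item["identifier"], item["title"])
```

    Building the URL by hand is mainly useful for repeating the same export later without refilling the form.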
  • Patrick Wallace 14:52 on 2016-05-21
    Tags: internetarchive

    Bulk management for archive.org: generate a CSV list of files by item identifier and file extension 

    When prepping items for Internet Archive, I try to create identifiers that include a short prefix denoting the collection. For example, photos added to the Middlebury College News Bureau collection from CONTENTdm received the prefix “mnb_” on their identifiers (the rest is an abbreviated identifier for the original photo and scans).

    This script provides an easy way to search for files by some string in their identifier and spit out a file list as CSV. (More …)
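
    The script itself sits behind the “More” link and is not reproduced here. As a rough, independent sketch of the idea only (not the author's code), the filtering and CSV steps might look like the following. It assumes the file listing has already been obtained as (identifier, filename) pairs; the function name and sample data are hypothetical.

```python
import csv
import io


def file_list_csv(files, identifier_substring, extension):
    """Return CSV text listing files whose item identifier contains
    identifier_substring and whose name ends with the given extension.

    files: iterable of (identifier, filename) pairs, however obtained.
    """
    ext = "." + extension.lower().lstrip(".")
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["identifier", "filename"])
    for identifier, filename in files:
        if identifier_substring in identifier and filename.lower().endswith(ext):
            writer.writerow([identifier, filename])
    return out.getvalue()


# Hypothetical listing: two files in an "mnb_" item, one in another item.
sample = [
    ("mnb_0001", "mnb_0001.tif"),
    ("mnb_0001", "mnb_0001_meta.xml"),
    ("other_0002", "other_0002.tif"),
]
print(file_list_csv(sample, "mnb_", "tif"))
```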

  • Patrick Wallace 18:21 on 2016-05-19
    Tags: ebooks, internetarchive, monographs, uploading

    Digitized monographs: packaging for Internet Archive 

    This is a draft workflow for preparing scanned monographs and other multipage items with mediatype:texts for bulk uploading to Internet Archive. 1

    The tools:

    (More …)


    1. More information is available at:
