Setting up Islandora + Fedora 3 on AWS EC2 and S3 storage: Progress Report #1

Note: this was a fun experiment that we never quite got working well enough for production, for a number of reasons (mostly S3 bandwidth limits).

“They said we were insane, that it couldn’t be done…  but we’ll show them…  we’ll show them all! (via as much open documentation as I can provide, of course.)” – me.

Leading Middlebury’s nascent digital repository project has been one of the biggest challenges of my career, in terms of scale as well as the administrative and technical complexity involved in getting it off the ground.  I’m sure I’ll go into more detail about the project & process soon, but here’s an overview of what we put together over recent months:

The goal – to build a Swiss-Army, Do-It-All, Master Brain for the library’s digital content.

Our challenge was to create a single, cloud-hosted repository to serve four distinct use cases with substantially different requirements while reducing our repository budget (currently dedicated to a much smaller CONTENTdm instance) as much as possible.

The repository of our dreams would be, at turns:

  • An institutional Open Access repository for faculty preprints and honors theses;
  • A full-service platform for Special Collections & Archives’ digitized and born-digital content;
  • A science data repository capable of handling very large datasets with complex permissions & embargo requirements;
  • A DPLA Service Hub for the state of Vermont.

After all the long hours of research, committee deliberations, proposals, reports, and other preparatory work, we decided on Islandora + Fedora 3 over the close runner-up, Hydra + Fedora 4.  There were a number of reasons we chose to go this route, but the main ones were that Wendy and I were already very familiar with Islandora (Wendy had already put two years of work into prototyping an Islandora-based data repository before the scope was expanded), that it fit in well with existing systems, and that it is significantly more stable than Hydra + Fedora 4 at present.

Object storage is not block storage, but it sure is cheap.

In terms of raw size, we have lots of content, especially when it comes to science data.  Side-scan survey data, for example, can push some larger datasets above the 10 TB mark.  Making archival video and – to a slightly lesser extent – 3D scan data freely available also depends on having quite a bit of storage behind the repository.  With hosted storage costs generally sitting in the mid-to-low four figures per TB per year and first-year storage demand around 10–20 TB, our only apparent recourse was the cheapest possible cloud storage: Amazon S3, MS Azure, or an equivalent.
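For a rough sense of the economics, here is a back-of-envelope sketch. Both rates are illustrative assumptions, not actual quotes: roughly $0.03/GB-month for S3 standard storage, and a hypothetical $3,000/TB-year for conventional managed hosting.

```shell
# Back-of-envelope storage cost comparison. The two rates below are
# placeholder assumptions for illustration only.
TB=15   # midpoint of the 10-20 TB first-year estimate

# S3: TB -> GB, times $/GB-month, times 12 months
s3_annual=$(awk -v tb="$TB" 'BEGIN { printf "%.0f", tb * 1024 * 0.03 * 12 }')
# Managed hosting: flat $/TB-year
hosted_annual=$(awk -v tb="$TB" 'BEGIN { printf "%.0f", tb * 3000 }')

echo "S3 (storage fees only): ~\$${s3_annual}/year"
echo "Managed hosted storage: ~\$${hosted_annual}/year"
```

Note that the S3 figure covers storage fees only; per-request and data-transfer charges come up again below.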

The problem is that object storage is not block storage, and Fedora doesn’t necessarily play nice with S3.  We knew that setting up Islandora to push backups to S3 had been implemented at other institutions, and that using S3 to hold Fedora’s datastreamStore (which accounts for most of Fedora’s storage use) had been tried with varying degrees of success.  It sounded like a good idea, though, admittedly, we initially underestimated just how tenuous those prior results had been.

Amazon S3 supports only a limited set of whole-object operations – chiefly GET, PUT, LIST, and DELETE.  It is far from POSIX-compliant, and even with an interface layer like s3fs-fuse it has a number of limitations compared to local block storage (list adapted from the s3fs README):

  • random writes or appends to files require rewriting the entire file
  • network latency is an issue
  • no renames of files or directories
  • no coordination between multiple clients mounting the same bucket
  • no hard links
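The whole-object nature of the API is easy to see from the aws CLI (bucket and key names below are placeholders):

```shell
# Every S3 interaction is a whole-object request: you PUT, GET, or DELETE an
# entire object. There is no seek, no append, no in-place partial write.
echo "hello" > demo.txt
aws s3api put-object    --bucket my-repo-bucket --key demo.txt --body demo.txt
aws s3api get-object    --bucket my-repo-bucket --key demo.txt /tmp/demo.txt
aws s3api delete-object --bucket my-repo-bucket --key demo.txt
```

Anything that looks like a filesystem on top of this has to fake the rest, which is exactly where the trouble starts.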

S3 interface layers: many approaches, few solutions for Islandora.

The core idea for making our EC2/S3-based Islandora instance work was to find an appropriate interface layer that would allow Fedora’s datastreamStore directory to sit on S3, with Fedora accessing the datastreamStore via a symlink in the server install directory.
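In sketch form, and assuming a stock Fedora 3 layout under /usr/local/fedora (adjust the paths to your install), the idea looks like this:

```shell
# Move the datastreamStore onto the S3-backed mount and leave a symlink in
# its place, so Fedora keeps reading and writing its usual path.
# /mnt/s3-datastore is assumed to already be an S3 bucket mounted via the
# chosen interface layer (s3fs, yas3fs, s3ql, ...).
FEDORA_DATA=/usr/local/fedora/data
mv "$FEDORA_DATA/datastreamStore" /mnt/s3-datastore/datastreamStore
ln -s /mnt/s3-datastore/datastreamStore "$FEDORA_DATA/datastreamStore"
```

The appeal of this approach is that Fedora itself needs no reconfiguration; all the S3 awareness lives in the mount.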

s3fs-fuse was the first interface we tried, being the most common and well-developed interface layer for the task.  Unfortunately, though setting up s3fs to convince Ubuntu that our S3 bucket was “just another directory” was easy as pie, Fedora did not play nicely with the setup.  A number of errors and failures while trying to rebuild the datastreamStore led to a disappointing call from DiscoveryGarden (whom we contracted to complete the Islandora install) saying that s3fs was a non-starter and recommending we abandon the S3 idea and resign ourselves to a much more modest repository on local block storage.
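For reference, the “easy as pie” part looks roughly like this (bucket name and keys are placeholders):

```shell
# s3fs reads credentials from a mode-600 password file in
# ACCESS_KEY_ID:SECRET_ACCESS_KEY form.
echo 'AKIAXXXXXXXX:xxxxxxxxxxxxxxxx' > /etc/passwd-s3fs   # placeholder keys
chmod 600 /etc/passwd-s3fs

mkdir -p /mnt/s3-datastore
s3fs midd-repo-bucket /mnt/s3-datastore \
    -o passwd_file=/etc/passwd-s3fs \
    -o allow_other        # let the tomcat/fedora user traverse the mount
```

The mount itself was never the problem; the failures showed up once Fedora started hammering the mount with real I/O.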

yas3fs was the second-tier candidate, mostly because Brad Spry at UNC-Charlotte reported a working install of Islandora using this technique in a substantially more complicated setup than we were imagining for Midd.  Though Brad later told me the project was abandoned in favor of a different hosting/storage setup due to speed & performance issues, we abandoned our own work with yas3fs quickly due to an almost complete lack of documentation.  Frankly, after more than a full workday of grinding, I could not figure out how to pass credentials to S3 using yas3fs, and thus never successfully mounted the bucket as a directory on the server running Fedora.  Not a great sign when choosing an enterprise solution.

Then s3ql came along, and it seems to be working well.  Though it follows an interface paradigm similar to s3fs and yas3fs, s3ql gains the edge with a bit more POSIX compliance and decent speeds for disk operations that do not require actually reading/writing stored files.
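The s3ql setup we landed on looks approximately like the following (bucket name is a placeholder, and the syntax is per the s3ql 2.x documentation; newer releases may differ):

```shell
# s3ql takes credentials from ~/.s3ql/authinfo2 rather than the command line:
#
#   [s3]
#   storage-url: s3://midd-repo-bucket
#   backend-login: <access key id>
#   backend-password: <secret access key>

mkfs.s3ql s3://midd-repo-bucket            # one-time: create the filesystem
mkdir -p /mnt/s3-datastore
mount.s3ql s3://midd-repo-bucket /mnt/s3-datastore

# ... and on shutdown, unmount cleanly so the local cache is flushed:
umount.s3ql /mnt/s3-datastore
```

Unlike s3fs, s3ql stores data in its own block format inside the bucket (with caching, deduplication, and compression), which is a big part of why metadata-heavy operations feel so much snappier; the trade-off is that objects in the bucket are no longer readable as plain files by other S3 clients.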

The road ahead: testing and QA

So, Middlebury now has an Islandora/Fedora 3 repository up and running on EC2, with its datastreamStore living more-or-less comfortably in an S3 bucket.  If the setup continues to perform well enough to be usable in the interim before Islandora CLAW/Fedora 4 (hopefully) make cloud-based repositories a bit more robust, we will have shaved many thousands of dollars off of our potential storage budget and successfully built a repository that meets our needs while ringing up significantly cheaper than our (now officially obsolete) local CONTENTdm install.

There’s still a lot to do:

  • Monitoring the number of GET/PUT transactions between S3 and EC2.  Amazon charges per request, and any pseudo-block interface for S3 seems likely to increase the overall number of transactions for object uploads and downloads, along with incidental transaction costs for most uses that touch Fedora’s datastreamStore.  How many? We are not sure yet.
  • Getting a static IP and choosing a domain name.
  • Figuring out how to batch-apply pending updates, and how to schedule OS & security updates.
  • Reviewing security.
  • Skinning, branding, and other setup tweaks.
  • Migrating objects and metadata from CONTENTdm and elsewhere.
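For the transaction-monitoring item, CloudWatch’s S3 metrics are one place to start (bucket name and dates below are placeholders; note that per-request metrics such as AllRequests are opt-in per bucket and are themselves billed, while the daily storage metrics are free):

```shell
# Daily object-count trend for the bucket. A pseudo-block interface that
# multiplies requests will show up indirectly here, and directly in the
# opt-in S3 request metrics once those are enabled on the bucket.
aws cloudwatch get-metric-statistics \
    --namespace AWS/S3 \
    --metric-name NumberOfObjects \
    --dimensions Name=BucketName,Value=midd-repo-bucket \
                 Name=StorageType,Value=AllStorageTypes \
    --start-time 2016-06-01T00:00:00Z \
    --end-time   2016-06-08T00:00:00Z \
    --period 86400 \
    --statistics Average
```

Watching these numbers for a month or two of real use should tell us whether the request charges stay in the noise or need to be designed around.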