
Website Performance: Pressflow, Varnish, Oh-My!

Categories: Midd Blogosphere

Executive summary:

We’ve migrated from core Drupal-6 to Pressflow, a back-port of Drupal-7 performance features. Using Pressflow allows us to cache anonymous web-requests (about 77% of our traffic) for 5 minutes and return them right from memory. While this vastly improves the amount of traffic we can handle as well as the speed of anonymous page-loads, it does mean that anonymous users may not see new versions of content for up to 5 minutes. Traffic for logged-in users will continue to flow directly through to Drupal/Pressflow and will always be up-to-the-instant fresh.

Read on for more details about what has changed and where we are with regard to website performance.


Background

When we first launched the new Drupal website back in February we went through some growing pains that necessitated code fixes (Round 1 and Round 2) as well as the addition of an extra web-server host and database changes (Round 2).

These improvements brought our site up to acceptable performance levels, but I was concerned that we might run into performance problems if the college ended up in the news and thousands of people suddenly went to view our site.

At DrupalCon a few weeks ago I attended a Drupal Performance Workshop where I learned a number of techniques that can be used to scale Drupal sites to be able to handle internet-scale traffic — not Facebook or Google-level traffic, but that of The Grammys, Economist, or World Bank.

Since before the launch of the new site we have been using opcode caching via APC to speed code execution and data caching with Memcache to reduce the load on the database. This system architecture is far more performant than a baseline setup, but we could still only handle a sustained average of 20 requests each second before the web-host became fully loaded. While this is double our normal average of 10 requests per second, it is not nearly enough headroom to feel safe from traffic spikes.

Request flow through our Drupal web-host prior to May 13th, using normal Drupal page-caching stored in Memcache.

Switching to Pressflow

Last week we switched from the standard Drupal-6.16 to Pressflow-6.16.77, a version of Drupal 6 that has had a number of the performance-related improvements from Drupal-7 back-ported to it. Code changes in Pressflow such as dropping legacy PHP4 support and using only MySQL enable Pressflow to execute about 27% faster than Drupal, a useful improvement but not enough to make a huge difference were we to get double or triple our normal traffic.

For us, the most important difference between Pressflow and Drupal-6 is that sessions are ‘lazily’ created. This means that rather than creating a new ‘session’ on the server to hold user-specific information on the first page each user sees on the website, Pressflow instead only creates the session when the user hits a page (such as the login page) that actually has user-specific data to store. This change makes it very easy to differentiate between anonymous requests (no session cookies) and authenticated requests (that have session cookies) and enables the next change, Varnish page caching.
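To make the idea of lazy session creation concrete, here is a minimal sketch in Python. It is not Pressflow's code (Pressflow is PHP); the class name, the method names, and the cookie name SESS are all assumptions made for illustration. The point it shows is only that no session ID, and therefore no cookie, exists until something user-specific is stored.

```python
import secrets


class LazySession:
    """Illustrative session that gets an ID (and cookie) only once data is stored."""

    def __init__(self):
        self.session_id = None  # nothing allocated for a plain anonymous visit
        self.data = {}

    def set(self, key, value):
        # First write: allocate the ID that would back the session cookie.
        if self.session_id is None:
            self.session_id = secrets.token_hex(16)
        self.data[key] = value

    def cookie_header(self):
        # "SESS" is a made-up cookie name; anonymous visitors get no cookie at all.
        if self.session_id is None:
            return None
        return f"SESS={self.session_id}"
```

A visitor who only reads pages never triggers set(), so cookie_header() stays None and the request remains recognizable as anonymous.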

Varnish Page Caching

Varnish is a reverse-proxy server that runs on our web hosts and can return pages and images from its own in-memory cache so that they don’t have to execute in Drupal/Pressflow every single time. The default rule in Varnish is that if there are any cookies in the request, then the request is for a particular user and should be transparently passed through to the back-end (Drupal/Pressflow). If there are no cookies in the request, then Varnish assumes correctly that it is an anonymous request and tries to respond from its cache without bothering the back-end.
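The cookie rule described above can be sketched in a few lines of Python. This is a toy model rather than Varnish itself or our actual configuration: the class name, the "pass"/"hit"/"miss" labels, and the injectable clock are assumptions for illustration, and the 5-minute lifetime is the one mentioned in this post.

```python
import time

CACHE_TTL_SECONDS = 5 * 60  # the 5-minute cache lifetime described in this post


class TinyPageCache:
    """Toy reverse-proxy cache: requests with cookies bypass it, others are cached."""

    def __init__(self, backend, clock=time.monotonic):
        self.backend = backend  # callable taking a URL and returning a response body
        self.clock = clock      # injectable so the expiry logic can be exercised
        self.store = {}         # url -> (expires_at, body)

    def handle(self, url, cookies):
        # Any cookie means the request may be user-specific: pass it straight through.
        if cookies:
            return self.backend(url), "pass"
        now = self.clock()
        entry = self.store.get(url)
        if entry is not None and entry[0] > now:
            return entry[1], "hit"
        # Missing or expired: fetch from the back-end and remember the result.
        body = self.backend(url)
        self.store[url] = (now + CACHE_TTL_SECONDS, body)
        return body, "miss"
```

Only the "pass" and "miss" paths ever reach the back-end, which is why a spike in anonymous traffic adds almost no load behind the cache.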

Request flow through our Drupal/Pressflow web-host after May 13th, using the Varnish proxy-server for caching.

Since about 77% of our traffic is non-authenticated traffic, Varnish only sends about 30% of the total requests through to Apache/PHP/Drupal: all authenticated requests, plus anonymous requests for pages whose cached copy is missing or more than 5 minutes old. Were we to have a large spike in anonymous traffic, virtually all of this increase would be served directly from Varnish’s cache, preventing any load-increase on Apache/PHP/Drupal or the back-end MySQL database. In my tests against our home page, Varnish was able to easily handle more than 10,000 requests each second, with the limiting factor being network speed rather than Varnish.
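As a rough back-of-the-envelope check, the two approximate figures above (77% anonymous, 30% reaching the back-end) imply how often anonymous requests are answered from cache. The function name below is made up, and because both inputs are rounded estimates the result is only indicative.

```python
def split_backend_traffic(anonymous_share, backend_total_share):
    """Break the back-end share into authenticated traffic and anonymous cache misses."""
    authenticated = 1 - anonymous_share
    anonymous_misses = backend_total_share - authenticated
    anonymous_hit_rate = 1 - anonymous_misses / anonymous_share
    return authenticated, anonymous_misses, anonymous_hit_rate


authenticated, misses, hit_rate = split_backend_traffic(0.77, 0.30)
print(f"authenticated: {authenticated:.0%}, anonymous misses: {misses:.0%}, "
      f"anonymous hit rate: {hit_rate:.0%}")
```

With these inputs, roughly 23% of all requests are authenticated, roughly 7% are anonymous cache misses, and about nine in ten anonymous requests never reach Drupal.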

A histogram of requests to the website. Y-axis is the number of requests, X-axis is the time to return requests, '|' requests were handled by Varnish's cache and '#' were passed through to Drupal. The majority of our requests are being handled quickly by Varnish while a smaller portion are being passed-through to Drupal.

MySQL Improvements

During the scheduled downtime this past Sunday, Mark updated our MySQL server and installed the InnoBase InnoDB Plugin, a high-performance storage engine for MySQL that can provide twice the performance of the built-in InnoDB engine in MySQL for the types of queries done by Drupal.

Last week Mark and I also went through our database configuration and verified that the important parameters were tuned correctly.

As the MySQL database is not currently the bottleneck that limits our site performance, these improvements will likely have a minor (though wide-spread) effect. Were our authenticated traffic to increase further (due to more people editing, for instance), these improvements would become more important.

Where We Are Now

At this point the website should be able to handle at least 20,000 requests/second from anonymous users (10,000 on each of two web-hosts) at the same time that it is handling up to 40 requests/second from authenticated users (20 on each of two web-hosts).

While it is impossible to accurately translate these request rates into the number of users we can support visiting the site, a very rough estimation would be to divide the number of requests/second by 10 (a guess at the average number of requests needed for each page view) to get a number of page-views that can be handled each second. (1)
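Applied to the figures above, that rough estimate works out as follows. The divisor of 10 is the guess stated in this post, and the function name is purely illustrative.

```python
REQUESTS_PER_PAGE_VIEW = 10  # the rough guess from this post, not a measured value


def page_views_per_second(requests_per_second, requests_per_page=REQUESTS_PER_PAGE_VIEW):
    """Very rough conversion from a request rate to a page-view rate."""
    return requests_per_second / requests_per_page


anonymous = page_views_per_second(20_000)   # both web-hosts, served from Varnish
authenticated = page_views_per_second(40)   # both web-hosts, served by Drupal
print(anonymous, authenticated)  # prints 2000.0 4.0
```

So under this guess the site could serve on the order of 2,000 anonymous page-views and 4 authenticated page-views each second.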

In addition to how many requests can be handled, how fast the requests are returned is also important. Our current response times for un-cached pages usually fall between 0.5 seconds and 2 seconds. If pages take much longer than 2 seconds, the site can “feel slow”. For anonymous pages cached in Varnish, response times range from 0.001 seconds to 0.07 seconds, much faster than Apache/Drupal can do and more than fast enough for anything we need.

The last performance metric that we are concerned with is the time it takes for a page to become usable by the viewer. Even if they receive all of the files for a page in only 0.02 seconds, it may still take their browser several seconds to parse these files, execute javascript code, and render the page. Due to these factors, my testing has shown that most pages on our site take between 1 and 3 seconds before users feel that they are loaded. For authenticated users, this stretches to 2-4 seconds.

Finally, please be aware that anonymous users see pages that may be cached for up to 5 minutes. While this is fine for the vast majority of our content, there are a few cases where we may need the content shown to be up-to-the-second fresh. We will address these few special cases over the coming months.

Future Performance Directions

Now that we have our caching system in place our system architecture is relatively complete for our current performance needs. While we may do a bit of tuning on various server parameters, our focus now shifts to PHP and Javascript code optimization to further improve server-side and client-side performance respectively.

One big impact on javascript performance (and hence perceived load-time) is that we currently have to include two separate versions of the jQuery Javascript Library due to different parts of the site relying on different versions. Phasing out the older version will reduce almost by half the amount of code that the browser has to parse.

Additional Notes

(1) As people browse the site their browser needs to load the main HTML page as well as make separate requests for Javascript files, style-sheet (CSS) files, and every image. After these have been loaded the first time, [most] browsers will cache these files locally and only request them again after 5 minutes or if the user clears their browser cache. CSS files and images that haven’t been seen before will need to be loaded as new pages are browsed to. For example, the first time someone loads the Athletics page, it requires about 40 requests to the server for a variety of files. A subsequent click on the Arts page would require an additional 13 requests, while a click back to the Athletics page would require only 1 additional request as the images would still be cached in the browser.

MiddLab

Categories: Midd Blogosphere

http://go.middlebury.edu/middlab

MiddLab is a new section of Middlebury’s website with no precedent: an academic network, uniting all of the… blah, blah blah.

Truth is, MiddLab has been hard for us to explain ever since we heard the idea. A research network featuring discussions and blogs, and linking together disciplinary themes? How does that work? Rather than write a manifesto, here is what we’re trying to accomplish with MiddLab.

Our Goals

  • Make research easy to discover. If you want to know what student and faculty research is going on in a department, you shouldn’t have to know where their papers are published or the address of the project’s web site. Instead, these should be one or two clicks from our home page.
  • Show connections between research. Whether researching the population growth of trees in Biology or the population density of people in Geography, projects share themes, and people interested in a topic should be able to easily explore both.
  • Start a discussion. We encourage and recommend that you add comments to the projects on this site. Ask questions, suggest new research, or explain why you disagree with the conclusions. You can add your thoughts to any project page on MiddLab, explore the individual blogs for some projects, or contact the researchers directly.
  • Provide space for research and the sciences on our site. We’ll be expanding this site to feature more presentations from the Spring Research Symposium and research projects in our science departments. Though MiddLab is open to any student, faculty or staff projects, these are areas where we know we’re not offering enough information on our site and would like to use MiddLab to expand.

Your Feedback

We aren’t sure these are the right goals for our site. We’d like to hear from people: what would you like to see in MiddLab? What parts of this site work toward these goals and which don’t? Leave your thoughts by commenting on this page.

Oh, and if you would like us to feature your project in MiddLab, send an email to middlab@middlebury.edu.

LIS Organization Chart

Categories: Midd Blogosphere

Library and Information Services Staff Meeting  August 30, 2005

Vermont Parks Free June 12th and 13th

Categories: Midd Blogosphere

VERMONT DAYS is coming up June 12th and 13th. All Vermont State Park day areas, state-owned historic sites, and Vermont’s History Museum will be open and free to the public. Saturday, June 12th is free fishing day – the one day in the year when residents and non-residents may go fishing without a license.

Welcome Dennis Hadley!

Categories: Midd Blogosphere

Please join me in welcoming Dennis Hadley to LIS.  Dennis started with us on April 12th and has assumed the position of Senior Technology Specialist at the Technology Support Helpdesk.  Dennis comes to us from Paul Smith’s College in NY and brings a wealth of experience in higher-ed support and service to our helpdesk group and to our users.  He has already taken the plunge into many support efforts and is adding more value to our group every day!  Please stop by the Technology Helpdesk and say hi to Dennis.

 

James Beauchemin
Technology Support Helpdesk

Introducing: The Identity Management Project

Categories: Midd Blogosphere

The Identity Management Project kicked off in December of 2009. The current project team (small ‘t’) is Tom Cutter, Adam Franco, Mike Lynch, Chris Norris, Carol Peddie, Mark Pyfrom, Jeff Rehbach, Mike Roy, and Marcy Smith.

The Identity Management (IDM) project seeks to organize our concept of a “person” or “identity” among our various systems (including Banner, the Active Directory, web-applications, hosted systems, and others). This project focuses on three facets of each identity:

Unique identifier:
Every identity would have a unique identifier. Currently, only people in Banner have one of its identifiers (guests and vendor-staff aren’t in Banner) and only people in AD have log-in names (alumni, parents, and others aren’t in the AD).
Unified Properties:
Each identity will have a set of properties (name, email, address, title, department, etc) that is consistent and available to all of our applications. Currently user properties may be different or unavailable depending on which source of user information is used; a person’s title is a good example of this inconsistency.
Roles:
Identities will gain zero or more “roles” that can be used to grant or deny access to our systems and services. We currently have no consistent way (in AD or web applications) of determining if a person is a current student, faculty, staff, or other role — the best we can do now is to look at membership in certain mailing lists like “All_Faculty”. With the IDM project, we will be able to access an authoritative list of the current roles for a person (visitors would have no roles) and will be able to ensure that access to services properly matches an individual’s relationship to the college.
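The three facets above map naturally onto a small data structure. The sketch below is only an illustration of the concept, not the project's design: the class, field, and function names, the sample ID, and the sample property values are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Identity:
    """One person, as the three facets describe: identifier, properties, roles."""
    unique_id: str                                   # facet 1: unique identifier
    properties: dict = field(default_factory=dict)   # facet 2: unified properties
    roles: set = field(default_factory=set)          # facet 3: zero or more roles


def has_any_role(identity, allowed_roles):
    """True if the identity holds at least one of the roles a service allows."""
    return bool(identity.roles & set(allowed_roles))


# Hypothetical examples: a staff member and a visitor with no roles.
staff = Identity("id-0001", {"name": "Pat Example", "title": "Librarian"}, {"staff"})
visitor = Identity("id-0002", {"name": "Sam Example"})

print(has_any_role(staff, {"faculty", "staff"}))    # prints True
print(has_any_role(visitor, {"faculty", "staff"}))  # prints False
```

Because access decisions read the roles directly, nothing needs to be inferred from mailing-list membership.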

In addition to organizing and improving the properties and roles of our current set of users (current students, faculty, staff, emeriti, vendors, spouses, and limited guests), the IDM project will also enable us to expand the number of usable (authenticate-able) accounts to include alumni, prospective students, and visitors. As well, we gain the potential to include users from other institutions via federated authentication systems such as Shibboleth.

Here is a list of a few things that will become possible with completion of the IDM project:

  • Rather than accounts being immediately deleted upon graduation, they would instead lose the “student” role and gain the “alumnus” role. These users would continue to use their same log-in credentials to access alumni-only and public resources (e.g. commenting on blogs, renewing library books), but would lose access to student-only resources (e.g. course websites, JSTOR and other subscription library materials).
  • We will be able to grant access (individually or in groups) to many of our online systems for guests, alumni, emeriti, visitors, vendors, prospective students, and others with loose affiliations with the college.
  • Inter-institutional projects will be able to make use of any of our online systems as collaboration platforms.
  • A fan of Middlebury Hockey could create a visitor account to use for purchasing panther gear from the college book store, then come back and log in with the same account to purchase tickets from the box office, make comments on the coach’s blog, and fill out a form to sign up their kids for participation in the Winter Carnival ice show. Their name, email, mailing address, and other properties would be available to all of the systems.
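The graduation example in the list above amounts to swapping one role for another while the account itself stays put. A minimal sketch, assuming a made-up table of resources and the roles allowed to use them (neither the table nor the function names come from the project):

```python
# Hypothetical resources and the roles allowed to use them, for illustration only.
RESOURCE_ROLES = {
    "course websites": {"student", "faculty"},
    "blog comments": {"student", "alumnus"},
}


def graduate(roles):
    """Return a new role set with "student" swapped for "alumnus"."""
    return (set(roles) - {"student"}) | {"alumnus"}


def accessible(roles):
    """List the resources that at least one of the given roles unlocks."""
    return sorted(name for name, allowed in RESOURCE_ROLES.items() if allowed & set(roles))


before = accessible({"student"})
after = accessible(graduate({"student"}))
print(before)  # prints ['blog comments', 'course websites']
print(after)   # prints ['blog comments']
```

The log-in credentials never change; only the role set does, and access follows from it.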

Please note that some of these examples will require additional changes and development projects beyond the IDM project itself. However, all require aspects of the IDM project to be possible.

“Display Name” Updating Automatically

Categories: Midd Blogosphere

The “display name” (or alpha name) shown in the web directory, Outlook address book, etc. had in the past been set so that it could only be corrected or changed manually. This process has now been modified so name changes entered into the Banner database are flowing through the Active Directory table and are subsequently updating the web directory, Outlook, Segue, etc. in the format Last, First Middle.

One enhancement which should help people out as a result of this change involves the “Preferred First Name” field in the Banner database. For records that have had this data entered into Banner, the “Preferred First Name” is now appended to the “display name” and enclosed in parentheses in the format Last, First Middle (Preferred First).
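The two formats described above can be expressed as a small function. This is only an illustration of the formatting rule, not the actual Banner or Active Directory process; the function name and the sample names are made up.

```python
def display_name(last, first, middle="", preferred_first=""):
    """Build "Last, First Middle", appending "(Preferred First)" when one is set."""
    given = " ".join(part for part in (first, middle) if part)
    name = f"{last}, {given}"
    if preferred_first:
        name += f" ({preferred_first})"
    return name


print(display_name("Example", "Jonathan", "Q", "Jon"))  # prints Example, Jonathan Q (Jon)
print(display_name("Example", "Jane"))                  # prints Example, Jane
```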

Requests for “display name” changes should now be sent to the following:

  • Middlebury college staff or academic year faculty:  hr@middlebury.edu
  • Middlebury college undergrad students:  Commons coordinators
  • MIIS staff or faculty: hrmiis@middlebury.edu
  • MIIS students:  Seamus Dorrian
  • BLSE students or faculty:  Susan Holcomb
  • MMLA staff or faculty:  Michelle Davis
  • MMLA students:  Jessie Jerry
  • Language School summer staff or faculty:  Sandy Bonomo
  • Language School students: Kara Genarelli

 It would be helpful to include the words “change display name” in the subject line of the e-mail message.