Tag Archives: MIIS

MiddLab Discussion Sessions Follow-up

Thank you to everyone who was able to attend either of my discussion meetings for MiddLab last week! There were a lot of great ideas for the site and upcoming projects. You can now see one of those ideas added to the site in the new Research Centers page which shows a map of all the MiddLab projects. We’re going to continue adding features to the site throughout the semester, so stay tuned.

While we were not able to record the sessions due to some technical difficulties, I have prepared a guide to adding your project to MiddLab. Feel free to edit that page to add your own tips on creating a successful project description or send an email to middlab@middlebury.edu if anything is unclear. I will host another meeting to discuss MiddLab during the Spring semester, for those who were not able to attend, but I’m also more than happy to meet individually with Faculty, Staff, Students, departments, and offices.

Publications Database

During a discussion with Bob Cluss and Colleen Converse, we came up with an idea for a sub-site in MiddLab that serves as a portal to discover publicly available academic publications from our faculty and students. I’ll be working on adding that this semester and welcome you to send documents or (preferably) links to these papers in public databases to middlab@middlebury.edu. If the document is larger than 10MB, please send it to website@middlebury.edu instead. If you already have a site that lists these documents that you’d like to be included, you can also just send that link and I’ll take care of the rest.

Look for this information to be added to MiddLab shortly, giving people both on and off-campus another easy way to find information on the active and ongoing research at Middlebury.

Working More Closely with You

I also want to make you aware of a small change in policy about the inclusion of content in MiddLab. Due to some concerns about the unfortunate rules surrounding some academic publications, and to ensure that all research collaborators are willing to be included, we’ll now ask that every person involved in a research project agree to have it hosted in MiddLab before it is put up. I can also remove content from the site where your name appears if you would not like it published in this manner. You can see any mention of your work in MiddLab by browsing the People page. Please address any concerns to middlab@middlebury.edu.

Website Performance: Pressflow, Varnish, Oh-My!

Executive summary:

We’ve migrated from core Drupal-6 to Pressflow, a back-port of Drupal-7 performance features. Using Pressflow allows us to cache anonymous web requests (about 77% of our traffic) for 5 minutes and return them right from memory. While this vastly improves the amount of traffic we can handle, as well as the speed of anonymous page loads, it does mean that anonymous users may not see new versions of content for up to 5 minutes. Traffic for logged-in users will always continue to flow directly through to Drupal/Pressflow and will always be up-to-the-instant fresh.

Read on for more details about what has changed and where we stand with regard to website performance.


When we first launched the new Drupal website back in February we went through some growing pains that necessitated code fixes (Round 1 and Round 2) as well as the addition of an extra web-server host and database changes (Round 2).

These improvements brought our site up to acceptable performance levels, but I was concerned that we might run into performance problems if the college ended up in the news and thousands of people suddenly went to view our site.

At DrupalCon a few weeks ago I attended a Drupal Performance Workshop where I learned a number of techniques that can be used to scale Drupal sites to be able to handle internet-scale traffic — not Facebook or Google-level traffic, but that of The Grammys, Economist, or World Bank.

Since before the launch of the new site we were already making use of opcode caching via APC to speed code execution and were doing data caching with Memcache to reduce the load on the database. This system architecture is far more performant than a baseline setup, but we still could only handle a sustained average of 20 requests each second before the web host became fully loaded. While this is double our normal average of 10 requests per second, it is not nearly enough headroom to feel safe from traffic spikes.


Request flow through our Drupal web-host prior to May 13th; using normal Drupal page-caching stored in Memcache.

Switching to Pressflow

Last week we switched from the standard Drupal-6.16 to Pressflow-6.16.77, a version of Drupal 6 that has had a number of the performance-related improvements from Drupal 7 back-ported to it. Code changes in Pressflow, such as dropping legacy PHP4 support and using only MySQL, enable Pressflow to execute about 27% faster than Drupal — a useful improvement, but not enough to make a huge difference were we to get double or triple our normal traffic.

For us, the most important difference between Pressflow and Drupal-6 is that sessions are ‘lazily’ created. This means that rather than creating a new ‘session’ on the server to hold user-specific information on the first page each user sees on the website, Pressflow instead only creates the session when the user hits a page (such as the login page) that actually has user-specific data to store. This change makes it very easy to differentiate between anonymous requests (no session cookies) and authenticated requests (that have session cookies) and enables the next change, Varnish page caching.

Varnish Page Caching

Varnish is a reverse-proxy server that runs on our web hosts and can return pages and images from its own in-memory cache so that they don’t have to execute in Drupal/Pressflow every single time. The default rule in Varnish is that if there are any cookies in the request, then the request is for a particular user and should be transparently passed through to the back-end (Drupal/Pressflow). If there are no cookies in the request, then Varnish correctly assumes that it is an anonymous request and tries to respond from its cache without bothering the back-end.

Request flow through our Drupal/Pressflow web-host after May 13th; using the Varnish proxy-server for caching.

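The routing decision Varnish makes can be sketched as a tiny shell function (a hypothetical illustration only; in reality this rule is expressed in Varnish’s VCL configuration, and the cookie name shown is just an example):

```shell
# Hypothetical sketch of Varnish's default routing rule: requests without
# cookies are candidates for the cache; requests with cookies pass through.
route_request() {
  cookies="$1"
  if [ -z "$cookies" ]; then
    echo "cache"    # anonymous: try to serve from Varnish's in-memory cache
  else
    echo "backend"  # has a session cookie: pass through to Drupal/Pressflow
  fi
}

route_request ""                 # anonymous request
route_request "SESS1234=abcdef"  # authenticated request
```

Because Pressflow only creates sessions lazily, the "no cookies" branch really does correspond to anonymous visitors, which is what makes this simple rule safe.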

Since about 77% of our traffic is non-authenticated, Varnish only sends about 30% of the total requests through to Apache/PHP/Drupal: all authenticated requests, plus anonymous requests where the cache hasn’t been refreshed in the past 5 minutes. Were we to have a large spike in anonymous traffic, virtually all of the increase would be served directly from Varnish’s cache, preventing any load increase on Apache/PHP/Drupal or the back-end MySQL database. In my tests against our home page, Varnish was able to easily handle more than 10,000 requests each second, with the limiting factor being network speed rather than Varnish.

A histogram of requests to the website. Y-axis is the number of requests, X-axis is the time to return requests, '|' requests were handled by Varnish's cache and '#' were passed through to Drupal. The majority of our requests are being handled quickly by Varnish while a smaller portion are being passed-through to Drupal.

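A back-of-the-envelope check of that pass-through figure, using only the numbers from the paragraph above (these are rough estimates from this post, not new measurements):

```shell
# Share of requests reaching the back-end, per the figures in this post.
authenticated=23      # 100% - 77% anonymous; always passes through
total_passthrough=30  # observed share of requests reaching Apache/PHP/Drupal
anon_misses=$(( total_passthrough - authenticated ))
echo "${anon_misses}% of all requests are anonymous cache misses"
```

In other words, only a small slice of anonymous traffic ever reaches the back-end, which is why a spike in anonymous visitors barely moves the load on Apache/PHP/Drupal.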

MySQL Improvements

During the scheduled downtime this past Sunday, Mark updated our MySQL server and installed the Innobase InnoDB Plugin, a high-performance storage engine for MySQL that can provide twice the performance of the built-in InnoDB engine for the types of queries done by Drupal.

Last week Mark and I also went through our database configuration and verified that the important parameters were tuned correctly.

As the MySQL database is not currently the bottleneck that limits our site performance, these improvements will likely have a minor (though widespread) effect. Were our authenticated traffic to increase further (due to more people editing, for instance), these improvements would become more important.

Where We Are Now

At this point the website should be able to handle at least 20,000 requests/second from anonymous users (10,000 on each of our two web hosts) at the same time that it is handling up to 40 requests/second from authenticated users (20 on each).

While it is impossible to accurately translate these request rates into the number of users we can support visiting the site, a very rough estimation would be to divide the number of requests/second by 10 (a guess at the average number of requests needed for each page view) to get a number of page-views that can be handled each second. (1)
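Applying that rule of thumb to the figures above (both numbers are the rough estimates from this post, not measurements):

```shell
# Rough capacity estimate: cached requests/second divided by a guessed
# average number of requests per page view.
anon_rps=20000        # anonymous requests/second across both web hosts
reqs_per_pageview=10  # rough guess at requests needed per page view
echo "$(( anon_rps / reqs_per_pageview )) anonymous page views/second (rough estimate)"
```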

In addition to how many requests can be handled, how fast the requests are returned is also important. Our current response times for un-cached pages usually fall between 0.5 seconds and 2 seconds. If pages take much longer than 2 seconds, the site can “feel slow”. For anonymous pages cached in Varnish, response times range from 0.001 seconds to 0.07 seconds — much faster than Apache/Drupal can manage and more than fast enough for anything we need.

The last performance metric that we are concerned with is the time it takes for the page to be usable by the viewer. Even if they receive all of the files for a page in only 0.02 seconds, it may still take their browser several seconds to parse these files, execute javascript code, and turn them into a displayable graphic. Due to these factors, my testing has shown that most pages on our site take between 1 and 3 seconds for users to feel that our pages are loaded. For authenticated users, this stretches to 2-4 seconds.

Finally, please be aware that anonymous users see pages that may be cached for up to 5 minutes. While this is fine for the vast majority of our content, there are a few cases where we may need the content shown to be up-to-the-second fresh. We will address these few special cases over the coming months.

Future Performance Directions

Now that we have our caching system in place, our system architecture is relatively complete for our current performance needs. While we may do a bit of tuning on various server parameters, our focus now shifts to PHP and Javascript code optimization to further improve server-side and client-side performance respectively.

One big impact on javascript performance (and hence perceived load time) is that we currently have to include two separate versions of the jQuery Javascript Library, due to different parts of the site relying on different versions. Phasing out the older version will reduce by almost half the amount of code that the browser has to parse.

Additional Notes

(1) As people browse the site their browser needs to load the main HTML page as well as make separate requests for Javascript files, style-sheet (CSS) files, and every image. After these have been loaded the first time, [most] browsers will cache these files locally and only request them again after 5 minutes or if the user clears their browser cache. CSS files and images that haven’t been seen before will need to be loaded as new pages are browsed to. For example, the first time someone loads the Athletics page, it requires about 40 requests to the server for a variety of files. A subsequent click on the Arts page would require an additional 13 requests, while a click back to the Athletics page would require only 1 additional request, as the images would still be cached in the browser.



MiddLab is a new section of Middlebury’s website with no precedent: an academic network, uniting all of the… blah, blah blah.

Truth is, MiddLab has been hard for us to explain ever since we heard the idea. A research network featuring discussions and blogs, and linking together disciplinary themes? How does that work? Rather than write a manifesto, here is what we’re trying to accomplish with MiddLab.

Our Goals

  • Make research easy to discover. If you want to know what student and faculty research is going on in a department, you shouldn’t have to know where their papers are published or the address of the project’s web site. Instead, these should be one or two clicks from our home page.
  • Show connections between research. Whether researching the population growth of trees in Biology or the population density of people in Geography, projects share themes and people interested in the topic can easily explore both.
  • Start a discussion. We encourage you to add comments to the projects on this site. Ask questions, suggest new research, or explain why you disagree with the conclusions. You can add your thoughts to any project page on MiddLab, explore the individual blogs for some projects, or contact the researchers directly.
  • Provide space for research and the sciences on our site. We’ll be expanding this site to feature more presentations from the Spring Research Symposium and research projects in our science departments. Though MiddLab is open to any student, faculty or staff projects, these are areas where we know we’re not offering enough information on our site and would like to use MiddLab to expand.

Your Feedback

We aren’t sure these are the right goals for our site. We’d like to hear from people: what would you like to see in MiddLab? What parts of this site work toward these goals and which don’t? Leave your thoughts by commenting on this page.

Oh, and if you would like us to feature your project in MiddLab, send an email to middlab@middlebury.edu.

Monterey Terrorism Research and Education Program

One of the Internet resources featured in the latest issue of the Internet Scout Report, at http://scout.wisc.edu/Reports/ScoutReport/2010/scout-100402.html, is the Monterey Terrorism Research & Education Program, at http://www.miis.edu/academics/researchcenters/terrorism.  Based at the Monterey Institute of International Studies, the Monterey Terrorism Research & Education Program (MonTREP) “conducts in-depth research, assesses policy options, and engages in public education on issues relating to terrorism and international security.”  Their team of scholars looks at violence-prone extremist groups and their historical evolution, organizational structure, and operational methods. Most people will want to look at their Islam, Islamism, and Politics in Eurasia Reports (IIPER). The IIPER is a bimonthly compendium of news and analysis on politics involving Islam in the former Soviet Union. The reports are written and edited by Dr. Gordon M. Hahn, and the series also accepts independent submissions. Visitors are welcome to browse through the reports here, and they may end up forwarding them to friends and associates. Finally, the site also includes a “News & Student Stories” area which reports on the activities of current members of the team, alumni, and students.

Website Improvements #5: Search

When Middlebury first started using a Content Management System to organize its site in 2003, we added a local search engine for the site, operated by Atomz. This search engine wasn’t very popular; people weren’t finding the information they needed. At a meeting a couple years later, Barbara Merz remarked, “Why don’t we just get Google!?” So we purchased a Google Search Appliance (GSA) and set that up as our local search engine. Going into the Web Makeover Project, we thought we were safe on this subject. After all, the GSA was a Google product, it indexed all of our site’s content, and we had put in Key Matches for the most relevant pages, so people must be satisfied with this as our search engine.


The Strategy

After “the font is too small” and “it’s too hard to edit”, search results were the top complaint about our old site during the web makeover’s requirements gathering phase. We heard that people got better results about our site from Google.com than they did from the GSA. The designers we worked with to build the new site proposed a solution in three parts:

  1. For some searches, you want to craft a hand-written response. If someone searches for “natatorium hours”, tell them “The pool is open right now! Here’s the full schedule…”. This also includes ambiguous searches like “summer”. We have a lot going on in the summer: Language Schools, two Bread Loaf programs, etc., so one Key Match isn’t going to cut it. We need to show a list of the top things having to do with “summer” at Middlebury.
  2. For other searches, there’s no need to display a search results page. If you search for “webmail”, you probably don’t want to read articles about webmail being upgraded last year; you just want to check your email on the web. For these, we should deliver the user directly to the page.
  3. If the search doesn’t fall into either of these categories, we should show a list of search results, but if people say that the search results from Google.com are better than those from the GSA, then why not just show them the results from Google.com? Also, we should provide some results from other databases like our Directory or Course Catalog.

Fortunately, these recommendations were easy to implement. For the first class, the custom search result pages, I developed a template that can be applied to a page like any other theme on our site. If a page uses this theme, it becomes the search result for any search matching its URL. For example, there are both men’s and women’s hockey teams at Middlebury, so if you search for “hockey” it’s not always clear which you want. The custom search result page for “hockey” lists the scores for both teams, links to the team pages, a link to order tickets, a link to the page about our hockey rink, and a link to the intramural team. Barbara has put together several of these custom search result pages based on data we’ve gathered about the most popular searches on our site.

The next class of search results, the automatic redirects, was also easy to manage. We’ve compiled a large list of URLs and quick terms referring to those URLs over the last couple years: the GO database. If you search for a GO shortcut, you’ll be automatically taken to the page for that GO shortcut. For the large majority of GO shortcuts, this works very well. If you search for “bannerweb”, you’ll be taken to go/bannerweb; searching for “eres” brings you to the e-reserves site. There are a minority of searches where this doesn’t work as well: “german” takes you to the German department’s site, but you might have been looking for the German language school or several other possibilities. I’ll describe how we’ve addressed this issue in a bit.
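The redirect rule described above can be sketched roughly like this (a hypothetical illustration; the real lookup is a query against the GO database, and the shortcut names here are just the examples from this post):

```shell
# Sketch: if the search term exactly matches a GO shortcut, redirect the
# user straight to it; otherwise fall through to the search results page.
go_redirect() {
  case "$1" in
    bannerweb|eres|webmail) echo "redirect to http://go.middlebury.edu/$1" ;;
    *)                      echo "show search results for '$1'" ;;
  esac
}

go_redirect "bannerweb"
go_redirect "natatorium hours"
```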

Google.com and the 404 page

The last category of search results got us into some trouble. When we first launched our new site, the standard search results were coming from Google.com, but Google hadn’t updated its search index to reflect the contents or structure of our new site. I had thought, based on experience with the MIIS site, that it would take Google 2-3 days to index our site and search would mostly be normalized after that. This actually did happen: Google’s index of our new site was generally complete in that timeframe. However, the new pages were listed much lower in the results than links to the old pages. Since most people click the first few links in results, they were only seeing the 404 page, getting frustrated and leaving search before finding the working links further down.

I overlooked two differences between Middlebury and MIIS that made a big difference here.

  1. Middlebury’s website is linked to from many more pages than MIIS’s site, both internal links (we have many more pages) and external links (peer institutions, etc.). Google’s search algorithm is weighted to push up pages that are linked to more frequently. Since other sites haven’t updated their links to Middlebury, Google assumes that those are the right links since there are a lot more of them, and pushes them up in the results. This was less of a factor for MIIS because MIIS is linked to less frequently.
  2. We have kept around paths to sites at Middlebury for over 10 years. All of the old /~department_name, /offices/department_name, and /depts/department_name paths, paths on cat.middlebury.edu, etc. from 1997-2009 still worked in January, 2010. These paths were created before Google even existed.
    Google.com in 1998

    Google’s index has never really been updated to reflect changes in our information architecture. We wanted to move away from this practice because:

  • It produces multiple results listings in search. If you searched for the Bread Loaf Writers’ Conference, you’d get a result with a link to its homepage at /~blwc, then another with a link to its homepage at /academics/blwc, then another with a link to its homepage at /depts/blwc, and so on. These all go to the same page and push other relevant results for that search further down the page. Ideally, the homepage should be the first result, with other pages related to the program following it. By removing the old IA, we minimize the number of duplicate results.
  • We are now allowing you to control the IA of the site, which we didn’t do before. On MCMS, I had a slick rewrite rule that let me redirect requests for academic department sites because we required that they be named the same as they were in the old IA:

    RewriteRule ^/~?(depts/)?((?:alc|art|bio|chem|chinese|classic|cs|dance|econ|english|es|filmvid|french|geog|geol|german|russian|soca|spanish|teach|theatre|ws)(?:[/\\\?].*)?)$ /academics/ump/majors/$2 [R]

    So if you went to /depts/filmvid you were taken to /academics/ump/majors/filmvid. I can’t do stuff like this anymore because the departments can now change the path to their sites without alerting me to the change. It gets even hairier for sub-pages of those sites. It would be a logistical nightmare to maintain automated redirects for all the variations. I think allowing departments to add pages to their site without submitting a Helpdesk ticket is a fair tradeoff here.
  • Some portions of our new IA overlap parts of our old IA, like /offices and /admissions. There were going to be broken links in these areas no matter what I did.
  • A really nit-picky point, but reducing the number of paths to a site improves the responsiveness of the site. Every time the server redirects you, a full request-response chain is generated. It’s faster to go right to the final URL than bounce between all these alternatives. We’re talking about milliseconds of difference here, but hey, every bit counts.

What I should have done was begin to phase out the old URLs last year, starting with the /~department_name addresses and working forward in the IA timeline. This would have reduced the shock on launch day and sped up conversion to the new IA. This is a lesson I’ll take to future projects of this nature.

That’s what happened, now here’s what we did.

Solving the 404 issues

Indexing Speed

Google offers a service named Google Webmaster Tools where you can see information about your site and control some of the ways that Google interacts with the site. The first thing we did was to double the speed at which Google crawls the site. Google finds out about the information on your site by automating typical user interaction on the site: a program they run will request your homepage, then request every page linked to from your homepage, and so on. The faster Google indexed this content, the faster information about our new site would be available in their index. While we were still having performance issues with the site, we needed to decrease this indexing speed, but were able to increase it again as we solved those issues.

Sitemap Files

Our next step was to create a sitemap file. Since Google’s crawler only looks at pages that are linked to from other pages, it might miss some content that isn’t linked to from anywhere, or from very few places. A sitemap file is a really simple text document that tells search engines about every page on your site so that they have a base to check their index against. Again, this was done to make sure that the search engines had as much information about our new site as we could provide. At the same time, Adam and Chris worked to block search engines from looking at our old site, or portions of our new site that we don’t want indexed, by making entries in our robots.txt file, which tells search engines which paths they should ignore.
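For reference, here is roughly what those two files look like; the URLs and paths are illustrative, not our actual entries (the full sitemap format is documented at sitemaps.org):

```shell
# Sketch of a minimal sitemap file; the real file lists every page on the site.
cat > sitemap.xml <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://www.middlebury.edu/academics</loc></url>
</urlset>
EOF

# A robots.txt entry that hides a path from crawlers is just as simple.
cat > robots.txt <<'EOF'
User-agent: *
Disallow: /oldsite/
EOF

grep -c "<loc>" sitemap.xml   # count of pages listed in the sitemap
```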

Removing Links from the Index

Google also offers you the option of requesting that a URL be removed from its search index for six months, after which we can assume that the index will have been updated to reflect that the page is permanently gone. We were able to retrieve a list of the broken URLs (about 100,000 of them) from Google’s Webmaster Tools, and started to run through the list. The problem with the URL removal tool that Google offers is that it only lets you request one page removal at a time. A developer at another college noticed this problem too and wrote an application that fills out the removal request form for you over and over again to remove the tedium from the process.

I started using this to remove some of the URLs from Google’s results and noticed that I was only able to submit 1000 URLs per day from an account. It also took about a day for the new requests to be processed. For a time, I was submitting multiple thousands of broken URLs through this tool using multiple accounts, but that too stopped working, I guess because someone at Google noticed what I was up to. I now take a more targeted approach to the situation.

Each morning I run the following script, inspired by this Drupal blog post, on each of our front-end webservers:

gawk '{ print $9,$7,$11,$4 }' /var/log/httpd/www.middlebury.edu-access_log | grep ^404 | grep google.com/search > 404.txt

This produces a report of all of the requests coming from Google searches that result in a 404 page. I then combine all of the reports and submit the non-duplicate pages from it through the URL removal tool. This turns out to only be a hundred unique pages per day, since we have eliminated most of the top level pages and are working on the “long tail” of search results.
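The combining step can be sketched like this; the sample lines below stand in for the real 404.txt reports (status, path, referer, time, per the gawk command above), and the filenames are hypothetical:

```shell
# Sketch: merge the per-host 404 reports and keep one line per unique path.
printf '404 /old/a ref t\n404 /old/b ref t\n' > host1-404.txt
printf '404 /old/a ref t\n404 /old/c ref t\n' > host2-404.txt

# Pull out the path field and de-duplicate across hosts.
cat host1-404.txt host2-404.txt | awk '{ print $2 }' | sort -u
```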

I can then use this command to find out how many requests from Google.com to our site result in a 404 page each day:

grep " 404 " /var/log/httpd/www.middlebury.edu-access_log* | grep google.com/search | grep "09/Mar/2010" | wc -l

I’ve been recording the results for a couple weeks now and have noticed that this method appears to be working very well at reducing the number of bad landing pages on our site.


I noticed one more thing about 404 pages in the logs from this morning. I had already submitted, and processed, the removal of the old address for the Bread Loaf School of English several times, but it kept appearing as broken in these logs. Looking at the search results page for BLSE I noticed that there is a Maps result. This isn’t part of the normal Google search index so the broken link wasn’t being removed by my request.

To update this, you need to change the record in the Google Local Business Listing administration interface. This is actually a pretty neat tool. It lets you list the location, phone number, email address, and other information about your business to add to the Google Maps interface. You can also upload images and videos about your business. I added all of the information I could about the BLSE except for the screen where it asked if we had any current coupons, though that’s not a bad idea – 10% off your Masters perhaps? Google called the BLSE office and gave them a PIN, which I entered into the interface and now their listing in Google’s search results is better than ever.

Also, since it is integrated with Google Maps, we get some interesting information about the people who search for the BLSE. For instance, people from Springfield, MA need more help getting to the campus:

Where driving directions requests come from:

1. Springfield 01118 5
2. Cedar Beach 05445 4
3. Burlington 05401 2
4. Concord 01742 2
5. Washington 20006 2
6. Brattleboro 05301 1
7. Bristol 05443 1
8. Mansfield 44902 1
9. Newton 02459 1
10. Rutland 05701 1

Top Internal Searches

We’ve also been looking at the list of top searches on our internal search interface and adding GO shortcuts for all the items where there’s only one page you’d want for that search or custom search results page for the more ambiguous items. These results come from one month of Google Analytics information on our site.

Interface Improvements

Using GO for Search Results

I knew that automatically forwarding people to the page of a GO shortcut if they searched for one would be controversial. Everyone agreed with the concept of forwarding certain searches to their final destination, like “bannerweb” or “menu”, but people were alarmed at the extent to which I suggested we take this feature. However, after looking at the list of internal searches, it became clear to me that our top search terms were already GO shortcuts and were shortcuts for which there was only one logical destination.

Still, I am sympathetic to the issue I raised before about certain searches, like “german” going to a department page when there are a lot of other relevant pages for that term. Ideally, these searches would have a custom search result page, and we will likely build one for each of the terms, but those take a while to develop. Instead, we now use a really old feature of HTML, frames, to show a banner at the top of a page you’ve been forwarded to so that you can click back to the full search results if you didn’t find what you were looking for. My original idea was to just have this display in our Drupal site using the themes native to that platform. Adam suggested extending it to any site using frames.

What you see if you search for "banner" on our site.


Frames are a bit of a controversial feature of HTML. Few people consider using them any more, as layout based on Cascading Style Sheets has replaced both tables and HTML frames as the preferred method for laying out a web page. Still, they do have some valid uses and I’d contend that this is one of them. What frames do is split your browser window into multiple windows. A classic example is the Java API documentation, where you’re looking at three windows in one.

For the GO search results, we use two frames: one on top that links you back to the full search result and one on the bottom that shows you the search results page.

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html lang="en" xml:lang="en" xmlns="http://www.w3.org/1999/xhtml">
<?php if ($go_url): ?>
 <head><title>Search Results</title></head>
 <frameset rows="30px,*">
  <frame marginheight="0" marginwidth="0" scrolling="no" noresize="noresize"
         src="<?php print $base_path . drupal_get_path('theme', 'midd-search'); ?>/searchbar.php?search=<?php print $q2; ?>" />
  <frame id="contentFrame" src="<?php print $go_url; ?>" />
 </frameset>
<?php else:  // print the full search results ?>
<?php endif; ?>
</html>

The searchbar.php script itself is a really simple page that just displays a message and a link:

<?php $search = str_replace("+", " ", $_GET['search']); ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html lang="en" xml:lang="en" xmlns="http://www.w3.org/1999/xhtml">
 <head>
  <style type="text/css">
  #searchbar { margin:0px; padding:0px 18px; width: 100%; height:30px; line-height:30px; font-size: 1em;
      font-family:Verdana,"Lucida Grande",Lucida,sans-serif; background-color: #BEDA90; color:#003468; }
  .closeBar {position:absolute; right:0;}
  .closeBar a {padding-right: 18px; text-decoration:none;}
  </style>
  <script type="text/javascript">
  var mainloc = parent.document.getElementById('contentFrame').src;
  function closeFrame() { window.top.document.location = mainloc; }
  </script>
 </head>
 <body>
  <div id="searchbar"><span class="closeBar"><b><a href="javascript:closeFrame()">X</a></b></span>We think this is the
    right page for your search of <b><?php print $search; ?></b>, but if it's not, you can <b>
    <a href="http://www.middlebury.edu/search?q2=<?php print $_GET['search']; ?>&amp;nocustom=true"
    target="_top">view all the results</a></b>.</div>
 </body>
</html>

You can click the “X” in the upper right to close the frame and keep browsing. One limitation of this approach is that frames from different domains can’t communicate with each other, thanks to the browser’s same-origin policy. If they could, a site could create a really small frame with malicious code and a really big frame showing any site on the internet, then have the small frame execute its malicious code against the otherwise secure big frame. Because they can’t communicate, closing the top frame takes you to the location the bottom frame was at when you first saw it. So if you browse around for a bit in the bottom frame and then close the top frame, you’ll be taken back to your original search result page.

GO terms in the Search drop-down

Right before our site launched, we noticed that the constituent landing pages like Current Students and Faculty & Staff had search boxes on them with the label “go” in front of them. The idea was to let people search the database of GO shortcuts. We didn’t have any way to do this at the time, so Adam developed a little Drupal module that queries the GO database for matching terms and uses the jQuery autocomplete plugin to return the results to the user in real time.

function go_fetch_url ($name) {
  if (!is_string($name))
    throw new InvalidArgumentException('$name must be a string.');

  if (!strlen($name))
    return array();

  $pdo = go_pdo();

  if ($inst = variable_get('go_scope_institution', '')) {
    $stmt = $pdo->prepare("SELECT code.url FROM code LEFT JOIN alias ON (code.name = alias.code)
      WHERE (code.name=:name1 AND code.institution=:inst1)
      OR (alias.name=:name2 AND alias.institution=:inst2)");
    $stmt->bindValue(":inst1", $inst);
    $stmt->bindValue(":inst2", $inst);
  } else {
    $stmt = $pdo->prepare("SELECT code.url FROM code LEFT JOIN alias ON (code.name = alias.code)
      WHERE code.name=:name1 OR alias.name=:name2");
  }
  // Both prepared statements share these placeholders.
  $stmt->bindValue(":name1", $name);
  $stmt->bindValue(":name2", $name);
  $stmt->execute();

  $row = $stmt->fetch(PDO::FETCH_ASSOC);
  if (!$row)
    throw new Exception('No result matches.');

  return $row['url'];
}

I took this and applied it to the search boxes throughout the site, making a couple of modifications. The GO boxes on the constituent pages assumed you only wanted to search GO and would complete your search term for you. On the site-wide search, we know that not every search will be covered by a GO shortcut, so I left that out. I also added “go/” to the beginning of each of the results, using the autocomplete plugin’s formatItem option, so that people were more aware of what the suggestions meant:

 { max: 30,
   width: 200,
   autoFill: false,
   selectFirst: false,
   formatItem: function(row) {
     return "go/" + row[0];
   }
 }
Results from the Google Search Appliance

Though we had initially wanted to move away from using the Google Search Appliance (GSA) for search results, so many of the links to our site on Google.com were broken that we reindexed the new site using the GSA and added its results to the search results page. This involved requesting results from the GSA in a way that we hadn’t done before. We used to have the GSA serve as the search front end using the XSLT style sheet interface that the server provides, but doing that would bypass all of the GO shortcut and custom search result page work we’d done, as well as leave out results from the Directory and Course Catalog.

Instead, I found in the GSA documentation that you can make a request to the service and have it return an XML document of results, using this URL for our search engine:


For this to work, you need to replace SEARCH_QUERY with your search and SEARCH_COLLECTION with one of the collections that we maintain to segment search results. For example, there is a search collection named “Middlebury” that has all of our sites, but also one named “Blogs” that has only pages on our WordPress instance. Here is an example of what is returned by a search for “Google” on our Blogs server.
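For the curious, here’s a rough sketch of how that request URL gets assembled. The appliance hostname and front-end name below are placeholders, not our real ones; q, site, client, and output are the GSA’s standard request parameters:

```javascript
// Sketch of building a GSA XML results request. The hostname and
// 'default_frontend' client name are placeholders for this example.
function gsaResultsUrl(query, collection) {
  const params = new URLSearchParams({
    q: query,             // SEARCH_QUERY
    site: collection,     // SEARCH_COLLECTION
    client: 'default_frontend',
    output: 'xml_no_dtd', // return raw XML instead of the styled page
  });
  return 'http://search.example.edu/search?' + params.toString();
}
```

Requesting output=xml_no_dtd is what makes the appliance hand back raw XML results instead of its own styled results page.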

I then parse these results and display them on the search results page. Not wanting to discard work that was already there, and because I know we will want to switch back to using Google.com for our primary search results once the issue of 404 pages has been resolved, I added a tabbed interface on the search results page that lets you alternate between the two search collections. Just click on the tabs to see results from the other service.

Asynchronous Results

The search results page was one of the slowest-loading pages on our site, taking between 6 and 15 seconds to load. The reason is that it needs to make requests to a lot of services to gather all of the information:

  1. Check to see if there are custom search pages
  2. Check to see if there is a GO shortcut
  3. Get the results from Google.com
  4. Get the results from the GSA
  5. Get the results from the Directory
  6. Get the results from the Course Catalog

Steps 1 & 2 still happen before the page loads, since we might need to redirect you based on their results, but steps 3-6 now happen after the page loads. While you’re viewing the page, we’re requesting results from all of those services in the background. When the results come in, the page displays them using JavaScript. You might see the results from the GSA immediately and then results from the Course Catalog a second or two later. This gives the illusion that the page loads faster than it actually does and, if the thing you’re looking for appears early on, lets you skip waiting on results from services you don’t need.
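In sketch form, the background loading works something like this. The endpoint paths and function names are illustrative, not our actual code:

```javascript
// Each result panel is filled in independently as its service responds.
// The endpoint paths here are made up for this example.
const sources = [
  { id: 'google',    url: '/search/google' },
  { id: 'gsa',       url: '/search/gsa' },
  { id: 'directory', url: '/search/directory' },
  { id: 'catalog',   url: '/search/catalog' },
];

// Turn one service's results into the markup for its panel.
function renderPanel(id, results) {
  const items = results
    .map((r) => '<li><a href="' + r.url + '">' + r.title + '</a></li>')
    .join('');
  return '<ul id="' + id + '-results">' + items + '</ul>';
}

// Fire all requests at once; each panel is inserted as soon as its own
// service answers, so one slow service never blocks the others.
function loadAllPanels(fetchResults, insertPanel) {
  return Promise.all(
    sources.map(({ id, url }) =>
      fetchResults(url).then((results) => insertPanel(id, renderPanel(id, results)))
    )
  );
}
```

Because every request is issued up front and handled as it completes, the slowest service only delays its own panel rather than the whole page.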

Upcoming Improvements

Improving search has been our secondary focus (after site performance) since launching our new site and a very important part of our work. We really want to get this right, so we’ll be adding in more and more of these types of improvements around the search results as time goes on. We’ll next be looking at statistics on how well our strategy of using GO shortcuts to deliver people directly to result pages works based on click patterns and, once we solve the 404 issue, how well Google.com does at providing basic search for our site.

Faceted Search

The next area of work is to figure out segmented search. We have a number of collections of highly structured content like HR Job Descriptions or Undergraduate Research Opportunities. We want to be able to build search interfaces for these collections so that people can search for, say, all of the jobs on campus that have a job level of Specialist 3 or all of the Research Opportunities in East Asia.

To do this, we’re setting up a local copy of the Apache Solr search engine. There is a Drupal module for this search engine that allows it to build filters based on content types. Job descriptions and research opportunities are content types and each of their fields could then be used as a filter in the faceted search results. I’m still in the preliminary stages of setting up this service, but am hoping to have a rough prototype done in April of how this will work.
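As a rough sketch of the kind of request this enables: the core path and field names (like job_level) are placeholders, but q, fq, facet, and facet.field are standard Solr query parameters:

```javascript
// Sketch of a faceted Solr request. The '/solr/select' path and the
// field names are assumptions for illustration.
function solrFacetUrl(query, facetField, filters) {
  const params = new URLSearchParams({
    q: query,
    facet: 'true',
    'facet.field': facetField, // ask Solr to count values of this field
  });
  // Each fq filter narrows the result set without affecting relevance.
  for (const [field, value] of Object.entries(filters)) {
    params.append('fq', field + ':"' + value + '"');
  }
  return '/solr/select?' + params.toString();
}

// e.g. all jobs with a level of Specialist 3:
// solrFacetUrl('*:*', 'job_level', { job_level: 'Specialist 3' })
```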

Search My Site

Another use for the Apache Solr search engine would be to provide URL filtering for search results. We can do this by setting up collections in the GSA, but we don’t necessarily want to create a collection for every sub-site or maintain all of those filtering rules. Instead, we want to use Apache Solr’s flexible query syntax to let us find documents whose URL paths match patterns like “http://www.middlebury.edu/academics/lib” by passing that as a parameter to the search engine that we can alter if needed. This will also help us to add search to areas of the site like news archives.
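Here’s a hedged sketch of what that could look like, assuming the indexed documents store their address in a url field (the field name and core path are assumptions for illustration):

```javascript
// Sketch of scoping a search to one section of the site by filtering
// on a stored "url" field with a prefix match. Solr query
// metacharacters such as ':' and '/' must be escaped first.
function searchMySiteUrl(query, urlPrefix) {
  const escaped = urlPrefix.replace(/([+\-!(){}\[\]^"~*?:\\\/])/g, '\\$1');
  const params = new URLSearchParams({
    q: query,
    fq: 'url:' + escaped + '*', // prefix match on the document URL
  });
  return '/solr/select?' + params.toString();
}

// e.g. search only the library's section of the site:
// searchMySiteUrl('hours', 'http://www.middlebury.edu/academics/lib')
```

Because the prefix is just a parameter, we could alter it as needed, which is what makes this useful for adding search to areas like news archives.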

Excluding GO Addresses from Search

There are times when a custom search page is not appropriate for a search term and we don’t want to go directly to the URL of the GO shortcut when you search for that item. For example, Adam had set up “go/adam” to go to his personal blog, but there are a lot of people at Middlebury named Adam, and someone searching might be looking for a different one. We’ll add an option in the GO administration interface that lets you exclude a GO shortcut from being used in search results.

User Preferred Search Engine

We’ve already got results from Google.com and the GSA and are adding the Apache Solr search engine to our site. Why get results from only one of these? Why not all of them? Why not Yahoo and Bing as well? Why not let people pick which one they want to use? We’ll add a field to the user profile pages in Drupal so that you can set your own preferred search engine and have it provide the default results for all your searches. This will also help us see which search tools you prefer and gauge which give better results.

What else?

Do you have other ideas for things we could do? Is there something that I glossed over that you have more questions about? Please let us know by adding your comments to this post.

Introducing: The Identity Management Project

The Identity Management Project kicked off in December of 2009. The current project team (small ‘t’) is Tom Cutter, Adam Franco, Mike Lynch, Chris Norris, Carol Peddie, Mark Pyfrom, Jeff Rehbach, Mike Roy, and Marcy Smith.

The Identity Management (IDM) project seeks to organize our concept of a “person” or “identity” among our various systems (including Banner, the Active Directory, web-applications, hosted systems, and others). This project focuses on three facets of each identity:

Unique identifier:
Every identity would have a unique identifier. Currently, only people in Banner have one of its identifiers (guests and vendor-staff aren’t in Banner) and only people in AD have log-in names (alumni, parents, and others aren’t in the AD).
Unified Properties:
Each identity will have a set of properties (name, email, address, title, department, etc) that is consistent and available to all of our applications. Currently user properties may be different or unavailable depending on which source of user information is used; a person’s title is a good example of this inconsistency.
Roles:
Identities will gain zero or more “roles” that can be used to grant or deny access to our systems and services. We currently have no consistent way (in AD or web applications) of determining whether a person is a current student, faculty member, staff member, or something else; the best we can do now is to look at membership in certain mailing lists like “All_Faculty”. With the IDM project, we will be able to access an authoritative list of the current roles for a person (visitors would have no roles) and will be able to ensure that access to services properly matches an individual’s relationship to the college.

In addition to organizing and improving the properties and roles of our current set of users (current students, faculty, staff, emeriti, vendors, spouses, and limited guests), the IDM project will also enable us to expand the number of usable (authenticate-able) accounts to include alumni, prospective students, and visitors. We also gain the potential to include users from other institutions via federated authentication systems such as Shibboleth.

Here is a list of a few things that will become possible with completion of the IDM project:

  • Rather than accounts being deleted immediately upon graduation, they would instead lose the “student” role and gain the “alumnus” role. These users would continue to use their same log-in credentials to access alumni-only and public resources (i.e. commenting on blogs, renewing library books), but would lose access to student-only resources (i.e. course websites, JSTOR and other subscription library materials).
  • We will be able to grant access (individually or in groups) to many of our online systems for guests, alumni, emeriti, visitors, vendors, prospective students, and others with loose affiliations with the college.
  • Inter-institutional projects will be able to make use of any of our online systems as collaboration platforms.
  • A fan of Middlebury Hockey could create a visitor account to use for purchasing Panther gear from the college book store, then come back and log in with the same account to purchase tickets from the box office, comment on the coach’s blog, and fill out a form to sign up their kids for participation in the Winter Carnival ice show. Their name, email, mailing address, and other properties would be available to all of these systems.

Please note that some of these examples will require additional changes and development projects beyond the IDM project itself. However, all require aspects of the IDM project to be possible.

“Display Name” Updating Automatically

The “display name” (or alpha name) shown in the web directory, Outlook address book, etc. could in the past only be corrected or changed manually. This process has now been modified so that name changes entered into the Banner database flow through to the Active Directory and subsequently update the web directory, Outlook, Segue, etc. in the format Last, First Middle.

One enhancement that should help people as a result of this change involves the “Preferred First Name” field in the Banner database. For records that have this data entered in Banner, the “Preferred First Name” is now appended to the “display name” and enclosed in parentheses, in the format Last, First Middle (Preferred First).
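As an illustration of the format described above (the function name and sample names here are made up):

```javascript
// Illustrative helper producing "Last, First Middle" with an optional
// "(Preferred First)" suffix, per the display-name format above.
function displayName(last, first, middle, preferredFirst) {
  let name = last + ', ' + first;
  if (middle) name += ' ' + middle;
  if (preferredFirst) name += ' (' + preferredFirst + ')';
  return name;
}

// displayName('Smith', 'Jonathan', 'Q', 'Jon') → "Smith, Jonathan Q (Jon)"
```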

Requests for “display name” changes should now be sent to the following:

  • Middlebury college staff or academic year faculty:  hr@middlebury.edu
  • Middlebury college undergrad students:  Commons coordinators
  • MIIS staff or faculty: hrmiis@middlebury.edu
  • MIIS students:  Seamus Dorrian
  • BLSE students or faculty:  Susan Holcomb
  • MMLA staff or faculty:  Michelle Davis
  • MMLA students:  Jessie Jerry
  • Language School summer staff or faculty:  Sandy Bonomo
  • Language School students: Kara Genarelli

It would be helpful to include the words “change display name” in the subject line of the e-mail message.