Website Improvements #5: Search – The Middlebury Sites Network

When Middlebury first started using a Content Management System to organize its site in 2003 we added a local search engine for the site, operated by Atomz. This search engine wasn’t very popular, people weren’t finding the information they needed. At a meeting a couple years later, Barbara Merz remarked, “Why don’t we just get Google!?” So we purchased a Google Search Appliance (GSA) and set that up as our local search engine. Going into the Web Makeover Project, we thought we were safe on this subject. After all, the GSA was a Google project, it indexed all of our site’s content, we had put in Key Matches for the most relevant pages, people must be satisfied with this as our search engine.

Nope.

The Strategy

After “the font is too small” and “it’s too hard to edit”, search results were the top complaint about our old site during the web makeover’s requirements gathering phase. We heard that people got better results about our site from Google.com than they did from the GSA. The designers we worked with to build the new site proposed a solution in three parts:

For some searches, you want to craft a hand-written response. If someone searches for “natatorium hours”, tell them “The pool is open right now! Here’s the full schedule…”. This also includes ambiguous searches like “summer”. We have a lot going on in the summer: Language Schools, two Bread Loaf programs, etc., so one Key Match isn’t going to cut it. We need to show a list of the top things having to do with “summer” at Middlebury.
For other searches, there’s no need to display a search results page. If you search for “webmail”, you probably do want to read articles about webmail being upgraded last year, you just want to check your email on the web. For these, we should deliver the user directly to the page.
If the search doesn’t fall into either of these categories, we should show a list of search results, but if people say that the search results from Google.com are better than those from the GSA, then why not just show them the results from Google.com? Also, we should provide some results from other databases like our Directory or Course Catalog.

Fortunately, these recommendations were easy to implement. For the first class, the custom search result pages, I developed a template that can be used like any other theme on our site for a page. If a page is using this theme, then it will be the search result for any search of its URL. For example, there is both a men’s and women’s hockey team at Middlebury, so if you search for “hockey” it’s not always clear what you want. The custom search result page for “hockey” lists the scores for both teams, links to the team pages, a link to order tickets, a link to the page about our hockey rink, and a link to the intramural team. Barbara has put together several of these custom search result pages based on data we’ve gathered about the most popular searches on our site.

The next class of search results, the automatic redirects, were also easy to manage. We’ve compiled a large list of URLs and quick terms referring to those URLs over the last couple years: the GO database. If you search for a GO shortcut, you’ll be automatically taken to the page for that GO shortcut. For the large majority of GO shortcuts, this works very well. If you search for “bannerweb”, you’ll be taken to go/bannerweb, searching for “eres” brings you to the e-reserves site. There are a minority of searches where this doesn’t work as well: “german” takes you to the German department’s site, but you might have been looking for the German language school or several other possibilities. I’ll describe how we’ve addressed this issue in a bit.

Google.com and the 404 page

The last category of search results got us into some trouble. When we first launched out new site, the standard search results were coming from Google.com, but Google hadn’t updated its search index to reflect the contents or structure of our new site. I had thought, based on experience with the MIIS site, that it would take Google 2-3 days to index our site and search would mostly be normalized after that. This actually did happen. Google’s index of our new site was generally complete in that timeframe. However, the new pages were listed much lower in the results than links to the old pages. Since most people click the first few links in results, they wer only seeing the 404 page, getting frustrated and leaving search before finding the working links further down.

I overlooked two differences between Middlebury and MIIS that made a big difference here.

Middlebury’s website is linked to from many more pages than MIIS’s site, both internal links (we have many more pages) and external links (peer institutions, etc.). Google’s search algorithm is weighted to push up pages that are linked to more frequently. Since other sites haven’t updated their links to Middlebury, Google assumes that those are the right links since there’s a lot more of them and pushes them up in the results. This was less of a factor for MIIS because MIIS is linked to less frequently.
We have kept around paths to sites at Middlebury for over 10 years. All of the old /~department_name, /offices/department_name, /depts/department_name, paths on cat.middlebury.edu, etc. from 1997-2009 still worked in January, 2010. These paths were created before Google even existed.

Google.com in 1998

Google’s index has never really been updated to reflect changes in our information architecture. We wanted to move away from this practice because:

It produces multiple results listings in search. If you searched for the Bread Loaf Writer’s Conference, you’d get a result with a link to its homepage at /~blwc, then another with a link to its homepage at /academics/blwc, then another with a link to its homepage at /depts/blwc, and so on. These all go to the same page and push other relevant results for that search further down the page. Ideally, the homepage should be the first result and other pages related to the program following it. By removing the old IA, we minimize the number of duplicate results.
We are now allowing you to control the IA of the site, which we didn’t do before. On MCMS, I had a slick rewrite rule that allowed me to direct requests for academic departments sites because we required that they be named the same as they were in the old IA:RewriteRule ^/~?(depts/)?((?:alc|art|bio|chem|chinese|classic|cs|dance|econ|english|es|filmvid|french|geog|geol|german
|haa|hist|ipe|is|italian|japanese|math|mbb|music|neuro|philo|physics|portuguese|ps|psych|rel
|russian|soca|spanish|teach|theatre|ws)(?:[/\\\?].*)?)$ /academics/ump/majors/$2 [R]So if you went to /depts/filmvid you were taken to /academics/ump/majors/filmvid. I can’t do stuff like this anymore because the departments can now change the path to their sites without alerting me to the change. It gets even harrier for sub-pages of those sites. It would be a logistical nightmare to maintain automated redirects for all the variations. I think allowing departments to add pages to their site without submitting a Helpdesk ticket is a fair tradeoff here.
Some portions of our new IA overlap ports of our old IA, like /offices and /admissions. There were going to be broken links in these areas no matter what I did.
A really nit-picky point, but reducing the number of paths to a site improves the responsiveness of the site. Every time the server redirects you, a full request-response chain is generated. It’s faster to go right to the final URL than bounce between all these alternatives. We’re talking about milliseconds of difference here, but hey, every bit counts.

What I should have done was begin to phase out the old URLs last year. Starting with the /~department_name addresses and workign forward in the IA timeline. This would have reduced the shock on launch day and sped up conversion to the new IA. This is a lesson I’ll take to future projects of this nature.

That’s what happened, now here’s what we did.

Solving the 404 issues

Indexing Speed

Google offers a service named Google Webmaster Tools where you can see information about your site and control some of the ways that Google interacts with the site. The first thing we did was to double the speed at which Google crawls the site. Google finds out about information our your site by automating typical user interaction on the site: a program they run will request your homepage, then request every page linked to from your homepage, and so on. The faster Google indexed this content, the faster information about our new site would be available in their index. While we were still having performance issues with the site, we needed to decrease this indexing speed, but were able to increase it again as we solved those issues.

Sitemap Files

Our next step was to create a sitemap file. Since Google’s crawler only looks at pages that are linked to from other pages, it might miss some content that isn’t linked to from anywhere, or very few places. A sitemap file is a really simple text document that tells search engines about every page on your site so that they have a base to check their index against. Again, this was done to make sure that the search engines had as much information about our new site as we could provide. At the same time, Adam and Chris worked to block search engines from looking at our old site or portions of our new site that we don’t want indexed by their engine and making those entries in our robots.txt file which tells search engines which paths they should ignore.

Removing Links from the Index

Google also offers you the option of requesting that a URL is removed from its search index for six months, after which we can assume that the index will have updated to reflect that the page is permanently gone. We were able to retrieve a list of the broken URLs (about 100,000 of them) from Google’s Webmaster Tools, and started to run through the list. The problem with the URL removal tool that Google offers is that it only lets you request one page removal at a time. A developer at another college noticed this problem too and wrote an application that fills out the removal request form for you over and over again to remove the tedium from the process.

I started using this to remove some of the URLs from Google’s results and noticed that I was only able to submit 1000 URLs per day from an account. It also took about a day for the new requests to be processed. For a time, I was submitting multiple thousands of broken URLs through this tool using multiple accounts, but that too stopped working, I guess because someone at Google noticed what I was up to. I now take a more targeted approach to the situation.

Each morning I run the following script, inspired by this Drupal blog post, on each of our front-end webservers:

gawk ‘{ print $9,$7,$11,$4 }’ /var/log/httpd/www.middlebury.edu-access_log | grep ^404 | grep google.com/search > 404.txt

This produces a report of all of the requests coming from Google searches that result in a 404 page. I then combine all of the reports and submit the non-duplicate pages from it through the URL removal tool. This turns out to only be a hundred unique pages per day, since we have eliminated most of the top level pages and are working on the “long tail” of search results.

I can then use this command to find out how many requests from Google.com to our site result in a 404 page each day:

grep ” 404 ” /var/log/httpd/www.middlebury.edu-access_log* | grep google.com/search | grep “09/Mar/2010″ | wc -l

I’ve been recording the results for a couple weeks now and have noticed that this method appears to be working very well at reducing the number of bad landing pages on our site.

I noticed one more thing about 404 pages in the logs from this morning. I had already submitted, and processed, the removal of the old address for the Bread Loaf School of English several times, but it kept appearing as broken in these logs. Looking at the search results page for BLSE I noticed that there is a Maps result. This isn’t part of the normal Google search index so the broken link wasn’t being removed by my request.

To update this, you need to change the record in the Google Local Business Listing administration interface. This is actually a pretty neat tool. It lets you list the location, phone number, email address, and other information about your business to add to the Google Maps interface. You can also upload images and videos about your business. I added all of the information I could about the BLSE except for the screen where it asked if we had any current coupons, though that’s not a bad idea – 10% off your Masters perhaps? Google called the BLSE office and gave them a PIN, which I entered into the interface and now their listing in Google’s search results is better than ever.

Also, since it is integrated with Google Maps, we get some interesting information about the people who search for the BLSE. For instance, people from Springfield, MA need more help getting to the campus:

Where driving directions requests come from:

1.	Springfield 01118	5
2.	Cedar Beach 05445	4
3.	Burlington 05401	2
4.	Concord 01742	2
5.	Washington 20006	2
6.	Brattleboro 05301	1
7.	Bristol 05443	1
8.	Mansfield 44902	1
9.	Newton 02459	1
10.	Rutland 05701	1

Interface Improvements

Using GO for Search Results

I knew that automatically forwarding people to the page of a GO shortcut if they searched for one would be controversial. Everyone agreed with the concept of forwarding certain searches to their final destination, like “bannerweb” or “menu”, but people were alarmed at the extent to which I suggested we take this feature. However, after looking at the list of internal searches, it became clear to me that our top search terms were already GO shortcuts and were shortcuts for which there was only one logical destination.

Still, I am sympathetic to the issue I raised before about certain searches, like “german” going to a department page when there are a lot of other relevant pages for that term. Ideally, these searches would have a custom search result page, and we will likely build one for each of the terms, but those take a while to develop. Instead, we now use a really old feature of HTML, frames, to show a banner at the top of a page you’ve been forwarded to so that you can click back to the full search results if you didn’t find what you were looking for. My original idea was to just have this display in our Drupal site using the themes native to that platform. Adam suggested extending it to any site using frames.

What you see if you search for "banner" on our site.

Frames are a bit of a controversial feature of HTML. Few people consider using them any more as layout based on Cascading Style Sheets has replaced both tables and HTML frames as the preferred method for laying out a web page. Still, they do have some valid uses and I’d contend that this is one of them. What frame do is split your browser window into multiple windows. A classic example is the Java API documentation where you’re looking at three windows in one.

For the GO search results, we use two frames: one on top that links you back to the full search result and one on the bottom that shows you the search results page.

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html lang="en" xml:lang="en" xmlns="http://www.w3.org/1999/xhtml">
<?php if ($go_url): ?>
 <head><title>Search Results</title></head>
 <frameset rows="30px,*">
 <frame marginheight="0" marginwidth="0" scrolling="no" noresize="noresize"
        src="<?php print $base_path . drupal_get_path('theme', 'midd-search'); ?>/searchbar.php?search=
        <?php print $q2; ?>" />
 <frame id="contentFrame" src="<?php print $go_url; ?>" />
 </frameset>
</html>
<?php else:  // print the full search results ?>
<?php endif; ?>

The searchbar.php script itself is a really simple page that just displays a message and a link:

<?php $search = str_replace("+", " ", $_GET['search']); ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html lang="en" xml:lang="en" xmlns="http://www.w3.org/1999/xhtml">
 <head>
 <style type="text/css">
 #searchbar { margin:0px; padding:0px 18px; width: 100%; height:30px; line-height:30px; font-size: 1em;
     font-family:Verdana,"Lucida Grande",Lucida,sans-serif; background-color: #BEDA90; color:#003468; }
 .closeBar {position:absolute; right:0;}
 .closeBar a {padding-right: 18px; text-decoration:none;}
 </style>
 <script type="text/javascript">
 var mainloc = parent.document.getElementById('contentFrame').src
 function closeFrame() { window.top.document.location = mainloc; }
 </script>
 </head>
 <body>
 <div id="searchbar"><span><b><a href="javascript:closeFrame()">X</a></b></span>We think this is the
   right page for your search of <b><?php print $search; ?></b>, but if it's not, you can <b>
   <a href="http://www.middlebury.edu/search?q2=<?php print $_GET['search']; ?>&nocustom=true"
   target="_top">view all the results</a></b>.</div>
 </body>
</html>

You can click on the “X” in the upper right to close the frame and keep browsing. One of the limitations of this approach is that frames can’t actually communicate with each other for security reasons. If they could, a site could create a really small frame with malicious code and then a really big frame with any site on the internet, then have the small frame execute its malicious code on the previously secure big frame. Since they can’t communicate with each other, when you close the top frame it will take you to the location that the bottom frame was at when you first saw it. So if you browse around for a bit in the bottom frame, then close the top frame, you’ll be taken back to your original search result page.

GO terms in the Search drop-down

Right before our site launched we noticed that the constituent landing pages like Current Students and Faculty & Staff had these search boxes on them with the label “go” in front of them. The idea was to let people search the database of GO shortcuts. We didn’t have any way to do this at the time, so Adam developed a little module for Drupal that made a request to the GO database to conduct searches of terms and used the jQuery autocomplete plugin to make it so that the results were returned to the user in real-time.

function go_fetch_url ($name) {
 if (!is_string($name))
 throw new InvalidArgumentException('$name must be a string.');

 if (!strlen($name))
 return array();

 $pdo = go_pdo();

 if ($inst = variable_get('go_scope_institution', '')) {
 $stmt = $pdo->prepare("SELECT code.url FROM code LEFT JOIN alias ON (code.name = alias.code)
   WHERE (code.name=:name1 AND code.institution=:inst1)
   OR (alias.name=:name2 AND alias.institution=:inst2)");
 $stmt->bindValue(":inst1", $inst);
 $stmt->bindValue(":inst2", $inst);
 } else {
 $stmt = $pdo->prepare("SELECT code.url FROM code LEFT JOIN alias ON (code.name = alias.code)
   WHERE code.name=:name1 OR alias.name=:name2");
 }
 $stmt->bindValue(":name1", $name);
 $stmt->bindValue(":name2", $name);
 $stmt->execute();

 $row = $stmt->fetch(PDO::FETCH_ASSOC);
 if (!$row)
 throw new Exception('No result matches.');

 return $row['url'];
}

I took this and applied it to the search boxes throughout the site, making a couple modifications. The GO boxes on the constituent page assumed you only wanted to search GO and would complete your search term for you. On the site-wide search, we know that not every search will be covered by a GO shortcut and left that out. I also added “go/” to the beginning of each of the results so that people were more aware of what they meant using the autocomplete plugin’s formatItem option:

$('.go_query').autocomplete(
 url,
 { max: 30,
 width: 200,
 autoFill: false,
 selectFirst: false,
 formatItem: function(row) {
 return "go/"+row[0];
 }
});

Results from the Google Search Appliance

Though we had initially wanted to move away from using the Google Search Appliance (GSA) for search results, because so many of the links to our site on Google.com were broken, we reindexed the new site using the GSA and added its results to the search results page. This involved requesting results from the GSA in a way that we hadn’t done before. We used to just have the GSA serve as the search front end using the XSLT style sheet interface that the server provides, but doing that would bypass all of the GO shortcut and custom search result page work that we’d done, as well as leave out results from the Directory and Course Catalog.

Instead, I found in the GSA documentation that you can make a request to the service and have it return an XML document of results, using this URL for our search engine:

http://search.middlebury.edu/cluster?q=SEARCH_QUERY&site=SEARCH_COLLECTION&coutput=xml&btnG=Google+Search&access=p&entqr=0&ud=1&sort=date%3AD%3AL%3Ad1&output=xml_no_dtd&oe=UTF-8&ie=UTF-8&client=default_frontend&%20proxystylesheet=default_frontend

For this to work, you need to replace SEARCH_QUERY with your search and SEARCH_COLLECTION with one of the collections that we maintain to segment search results. For example, there is a search collection named “Middlebury” that has all of our sites, but also one named “Blogs” that has only pages on our WordPress instance. Here is an example of what is returned by a search for “Google” on our Blogs server.

I then parse these results and display them on the search results page. Not wanting to get rid of work that was already there, and because I know we will want to switch back to using Google.com for our primary search results once the issue of 404 pages has been resolved, I added a tabbed interface on the search results page that lets you alternate between the two search collections. Just click on the tabs to see results from the other service.

Asynchronous Results

The search results page was one of the slowest loading pages on our site, taking between 6-15 seconds to load. The reason for this is that it needs to make requests to a lot of services to get all of the information:

Check to see if there are custom search pages
Check to see if there is a GO shortcut
Get the results from Google.com
Get the results from the GSA
Get the results from the Directory
Get the results from the Course Catalog

Steps 1 & 2 still happen before the page loads, since we might need to redirect you based on their results, but steps 3-6 now happen after the page loads. While you’re viewing the page, we’re requesting results from all of those services in the background. When the results come in, the page displays them using JavaScript. You might see the results from the GSA immediately and then results from the Course Catalog a second or two later. This gives the illusion of the page loading faster than it actually is and, if the thing you’re looking for appears early on, allows you to skip loading results from services you don’t need.

Upcoming Improvements

Improving search has been our secondary focus (after site performance) since launching our new site and a very important part of our work. We really want to get this right, so we’ll be adding in more and more of these types of improvements around the search results as time goes on. We’ll next be looking at statistics on how well our strategy of using GO shortcuts to deliver people directly to result pages works based on click patterns and, once we solve the 404 issue, how well Google.com does at proving basic search for our site.

Faceted Search

The next area of work is to figure out segmented search. We have a number of collections of highly structured content like HR Job Descriptions or Undergraduate Research Opportunities. We want to be able to build search interfaces for these collections so that people can search for, say, all of the jobs on campus that have a job level of Specialist 3 or all of the Research Opportunities in East Asia.

To do this, we’re setting up a local copy of the Apache Solr search engine. There is a Drupal module for this search engine that allows it to build filters based on content types. Job descriptions and research opportunities are content types and each of their fields could then be used as a filter in the faceted search results. I’m still in the preliminary stages of setting up this service, but am hoping to have a rough prototype done in April of how this will work.

Search My Site

Another use for the Apache Solr search engine would be to provide URL filtering for search results. We can do this by setting up collections in the GSA, but we don’t necessarily want to create a collection for every sub-site or maintain all of those filtering rules. Instead, we want to use Apache Solr’s flexible query syntax to let us find documents whose URL paths match patterns like “http://www.middlebury.edu/academics/lib” by passing that as a parameter to the search engine that we can alter if needed. This will also help us to add search to areas of the site like news archives.

Excluding GO Addresses from Search

There are some times when a custom search page is not appropriate for a search term and we don’t want to go directly to the URL for the GO shortcut when you search for that item. For example, Adam had set up “go/adam” to go to his personal blog, but there are a lot of people at Middlebury named Adam and people might be looking for a different Adam. We’ll add an option in the GO administration interface to allow you to exclude a GO shortcut from being used in search results, if you don’t want it to be.

User Preferred Search Engine

We’ve already got results from Google.com and the GSA and are adding the Apache Solr search engine to our site. Why just get results from one of these? Why not all of them? Why not Yahoo and Bing as well? Why not let people pick which one they want to use. We’ll add a field to the user profile pages in Drupal so that you can set your own preferred search engine and have it provide the default results for all your searches. This will also help us to look at which search tools you prefer and gauge which are giving better results.

What else?

Do you have other ideas for things we could do? Is there something that I glossed over that you have more questions about? Please let us know by adding your comments to this post.