Tag Archives: website

Does tagging content make it easier to find with search? No.

I’ve received this question from several people now. Below are two videos from Matt Cutts who works on Google’s Webspam team explaining how tagging content mostly does not affect their search results. This also means that tagging largely will not affect how results appear on Middlebury’s site, since we use Google to provide our search results.

Tags

Tag Clouds

This does not mean that you shouldn’t tag content at all. Tags can still be useful for humans who want to find other posts and pages on a topic. However, if you want your page to be easier to find, your time is better invested in making sure that the content is well written, structured and relevant to a particular topic.

Site Search Satisfaction Survey

Please take a few minutes and let us know whether our website’s search feature is working: http://go.middlebury.edu/search/feedback.

Forward this exciting link to your friends and co-workers so that they may weigh in as well. If you think we’re missing an important question on the survey, leave a comment here and I’ll add it in.

Thank you for your time!

Website Performance: Pressflow, Varnish, Oh-My!

Executive summary:

We’ve migrated from core Drupal-6 to Pressflow, a back-port of Drupal-7 performance features to Drupal 6. Using Pressflow allows us to cache anonymous web-requests (about 77% of our traffic) for 5 minutes and return them right from memory. While this vastly improves the amount of traffic we can handle, as well as the speed of anonymous page-loads, it does mean that anonymous users may not see new versions of content for up to 5 minutes. Traffic for logged-in users will always continue to flow directly through to Drupal/Pressflow and will always be up-to-the-instant fresh.

Read on for more details about what has changed and where we stand with regard to website performance.


Background

When we first launched the new Drupal website back in February we went through some growing pains that necessitated code fixes (Round 1 and Round 2) as well as the addition of an extra web-server host and database changes (Round 2).

These improvements brought our site up to acceptable performance levels, but I was concerned that we might run into performance problems if the college ended up in the news and thousands of people suddenly went to view our site.

At DrupalCon a few weeks ago I attended a Drupal Performance Workshop where I learned a number of techniques that can be used to scale Drupal sites to be able to handle internet-scale traffic — not Facebook or Google-level traffic, but that of The Grammys, Economist, or World Bank.

Even before the launch of the new site we were making use of opcode caching via APC to speed code execution and were doing data caching with Memcache to reduce the load on the database. This system architecture is far more performant than a baseline setup, but we still could only handle a sustained average of 20 requests each second before the web-host started becoming fully loaded. While this is double our normal average of 10 requests per second, it is not nearly enough headroom to feel safe from traffic spikes.

Request flow through our Drupal web-host prior to May 13th, using normal Drupal page-caching stored in Memcache.

Switching to Pressflow

Last week we switched from the standard Drupal-6.16 to Pressflow-6.16.77, a version of Drupal 6 that has had a number of the performance-related improvements from Drupal-7 back-ported to it. Code changes in Pressflow, such as dropping legacy PHP4 support and supporting only MySQL, enable Pressflow to execute about 27% faster than Drupal: a useful improvement, but not enough to make a huge difference were we to get double or triple our normal traffic.

For us, the most important difference between Pressflow and Drupal-6 is that sessions are ‘lazily’ created. This means that rather than creating a new ‘session’ on the server to hold user-specific information on the first page each user sees on the website, Pressflow instead only creates the session when the user hits a page (such as the login page) that actually has user-specific data to store. This change makes it very easy to differentiate between anonymous requests (no session cookies) and authenticated requests (that have session cookies) and enables the next change, Varnish page caching.
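
To illustrate the idea, here is a conceptual sketch of lazy session creation (this is not Pressflow’s actual implementation, just the pattern it follows): a session, and therefore a session cookie, is only created the first time the application has something user-specific to remember.

function remember($key, $value) {
  // Start a session (and send the session cookie) only when first needed.
  if (session_id() === '') {
    session_start();
  }
  $_SESSION[$key] = $value;
}

// A purely anonymous page view never calls remember(), so no cookie is set
// and the response stays cacheable for everyone.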

Varnish Page Caching

Varnish is a reverse-proxy server that runs on our web hosts and can return pages and images from its own in-memory cache so that they don’t have to execute in Drupal/Pressflow every single time. The default rule in Varnish is that if there are any cookies in the request, then the request is for a particular user and should be transparently passed through to the back-end (Drupal/Pressflow). If there are no cookies in the request, then Varnish assumes correctly that it is an anonymous request and tries to respond from its cache without bothering the back-end.
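
In VCL, Varnish’s configuration language, the core of that rule can be expressed roughly as follows. This is a simplified sketch in Varnish 2.1-era syntax, not our actual configuration:

sub vcl_recv {
  if (req.http.Cookie) {
    # Cookies present: a logged-in or otherwise stateful user, so hand the
    # request straight through to Apache/Pressflow.
    return (pass);
  }
  # No cookies: anonymous request, try to serve it from the cache.
  return (lookup);
}

sub vcl_fetch {
  if (!req.http.Cookie) {
    # Cache anonymous responses for five minutes.
    set beresp.ttl = 5m;
  }
  return (deliver);
}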

Request flow through our Drupal/Pressflow web-host after May 13th, using the Varnish proxy-server for caching.

Since about 77% of our traffic is non-authenticated traffic, Varnish only sends about 30% of the total requests through to Apache/PHP/Drupal: all authenticated requests, plus anonymous requests where the cache hasn’t been refreshed in the past 5 minutes. Were we to have a large spike in anonymous traffic, virtually all of the increase would be served directly from Varnish’s cache, preventing any load increase on Apache/PHP/Drupal or the back-end MySQL database. In my tests against our home page, Varnish was able to easily handle more than 10,000 requests each second, with the limiting factor being network speed rather than Varnish.
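
For reference, this kind of anonymous-traffic test can be generated with a load-testing tool such as ApacheBench; the command below is illustrative of the approach rather than the exact test I ran:

# 20,000 requests to the home page, 200 at a time, all anonymous (no cookies).
ab -c 200 -n 20000 http://www.middlebury.edu/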

A histogram of requests to the website. Y-axis is the number of requests, X-axis is the time to return requests; '|' requests were handled by Varnish's cache and '#' were passed through to Drupal. The majority of our requests are being handled quickly by Varnish while a smaller portion are passed through to Drupal.

MySQL Improvements

During the scheduled downtime this past Sunday, Mark updated our MySQL server and installed the InnoBase InnoDB Plugin, a high-performance storage engine for MySQL that can provide twice the performance of the built-in InnoDB engine in MySQL for the types of queries done by Drupal.
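
For those curious, enabling the plugin is a my.cnf change along these lines. This is the general recipe from the InnoDB Plugin documentation, not necessarily our exact configuration:

[mysqld]
# Disable the older built-in InnoDB and load the plugin version instead.
ignore_builtin_innodb
plugin-load=innodb=ha_innodb_plugin.so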

Last week Mark and I also went through our database configuration and verified that the important parameters were tuned correctly.
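
These are the kinds of parameters involved; the values shown below are illustrative, not our production settings:

innodb_buffer_pool_size        = 2G    # keep the working set of InnoDB data in memory
query_cache_size               = 128M  # cache repeated SELECT results
innodb_flush_log_at_trx_commit = 2     # trade a little durability for write throughput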

As the MySQL database is not currently the bottleneck that limits our site performance, these improvements will likely have a minor (though widespread) effect. Were our authenticated traffic to increase further (due to more people editing, for instance), these improvements would become more important.

Where We Are Now

At this point the website should be able to handle at least 20,000 requests/second of anonymous users (10,000 on each of two web-hosts) at the same time that it is handling up to 40 requests/second from authenticated users (20 on each of two web-hosts).

While it is impossible to accurately translate these request rates into the number of users we can support visiting the site, a very rough estimation would be to divide the number of requests/second by 10 (a guess at the average number of requests needed for each page view) to get the number of page-views that can be handled each second (for example, 20,000 requests/second divided by 10 is roughly 2,000 anonymous page-views per second). (1)

In addition to how many requests can be handled, how fast the requests are returned is also important. Our current response times for un-cached pages usually fall between 0.5 seconds and 2 seconds. If pages take much longer than 2 seconds, the site can “feel slow”. For anonymous pages cached in Varnish, response times range from 0.001 seconds to 0.07 seconds, much faster than Apache/Drupal can manage and more than fast enough for anything we need.

The last performance metric that we are concerned with is the time it takes for the page to be usable by the viewer. Even if the browser receives all of the files for a page in only 0.02 seconds, it may still take several seconds to parse these files, execute javascript code, and turn them into a displayable page. Due to these factors, my testing has shown that most pages on our site take between 1 and 3 seconds to feel loaded. For authenticated users, this stretches to 2-4 seconds.

Finally, please be aware that anonymous users see pages that may be cached for up to 5 minutes. While this is fine for the vast majority of our content, there are a few cases where we may need the content shown to be up-to-the-second fresh. We will address these few special cases over the coming months.

Future Performance Directions

Now that we have our caching system in place our system architecture is relatively complete for our current performance needs. While we may do a bit of tuning on various server parameters, our focus now shifts to PHP and Javascript code optimization to further improve server-side and client-side performance respectively.

One big impact on javascript performance (and hence perceived load-time) is that we currently have to include two separate versions of the jQuery Javascript Library because different parts of the site rely on different versions. Phasing out the older version will cut the amount of code the browser has to parse almost in half.

Additional Notes

(1) As people browse the site their browser needs to load the main HTML page as well as make separate requests for Javascript files, style-sheet (CSS) files, and every image. After these have been loaded the first time, [most] browsers will cache these files locally and only request them again after 5 minutes or if the user clears their browser cache. CSS files and images that haven’t been seen before will need to be loaded as new pages are browsed to. For example, the first time someone loads the Athletics page, it requires about 40 requests to the server for a variety of files. A subsequent click on the Arts page would require an additional 13 requests, while a click back to the Athletics page would require only 1 additional request, as the images would still be cached in the browser.

Website Improvements #5: Search

When Middlebury first started using a Content Management System to organize its site in 2003, we added a local search engine for the site, operated by Atomz. This search engine wasn’t very popular; people weren’t finding the information they needed. At a meeting a couple of years later, Barbara Merz remarked, “Why don’t we just get Google!?” So we purchased a Google Search Appliance (GSA) and set that up as our local search engine. Going into the Web Makeover Project, we thought we were safe on this subject. After all, the GSA was a Google product, it indexed all of our site’s content, and we had put in Key Matches for the most relevant pages; surely people must be satisfied with this as our search engine.

Nope.

The Strategy

After “the font is too small” and “it’s too hard to edit”, search results were the top complaint about our old site during the web makeover’s requirements gathering phase. We heard that people got better results about our site from Google.com than they did from the GSA. The designers we worked with to build the new site proposed a solution in three parts:

  1. For some searches, you want to craft a hand-written response. If someone searches for “natatorium hours”, tell them “The pool is open right now! Here’s the full schedule…”. This also includes ambiguous searches like “summer”. We have a lot going on in the summer: Language Schools, two Bread Loaf programs, etc., so one Key Match isn’t going to cut it. We need to show a list of the top things having to do with “summer” at Middlebury.
  2. For other searches, there’s no need to display a search results page. If you search for “webmail”, you probably don’t want to read articles about webmail being upgraded last year, you just want to check your email on the web. For these, we should deliver the user directly to the page.
  3. If the search doesn’t fall into either of these categories, we should show a list of search results, but if people say that the search results from Google.com are better than those from the GSA, then why not just show them the results from Google.com? Also, we should provide some results from other databases like our Directory or Course Catalog.

Fortunately, these recommendations were easy to implement. For the first class, the custom search result pages, I developed a template that can be used like any other theme on our site for a page. If a page is using this theme, then it will be the search result for any search of its URL. For example, there is both a men’s and women’s hockey team at Middlebury, so if you search for “hockey” it’s not always clear what you want. The custom search result page for “hockey” lists the scores for both teams, links to the team pages, a link to order tickets, a link to the page about our hockey rink, and a link to the intramural team. Barbara has put together several of these custom search result pages based on data we’ve gathered about the most popular searches on our site.

The next class of search results, the automatic redirects, was also easy to manage. We’ve compiled a large list of URLs and quick terms referring to those URLs over the last couple of years: the GO database. If you search for a GO shortcut, you’ll be automatically taken to the page for that GO shortcut. For the large majority of GO shortcuts, this works very well. If you search for “bannerweb”, you’ll be taken to go/bannerweb; searching for “eres” brings you to the e-reserves site. There are a minority of searches where this doesn’t work as well: “german” takes you to the German department’s site, but you might have been looking for the German language school or several other possibilities. I’ll describe how we’ve addressed this issue in a bit.
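
Conceptually, the redirect step looks something like this simplified sketch (not the exact module code), using the go_fetch_url() helper shown later in this post:

// If the search term is a GO shortcut, skip the results page and send the
// visitor straight to the shortcut's destination.
$term = strtolower(trim($_GET['q2']));
try {
  $url = go_fetch_url($term);   // helper shown later in this post
  drupal_goto($url);            // issue the redirect and stop rendering
}
catch (Exception $e) {
  // No shortcut with this name; fall through to the normal results page.
}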

Google.com and the 404 page

The last category of search results got us into some trouble. When we first launched our new site, the standard search results were coming from Google.com, but Google hadn’t updated its search index to reflect the contents or structure of our new site. I had thought, based on experience with the MIIS site, that it would take Google 2-3 days to index our site and that search would mostly be normalized after that. This actually did happen: Google’s index of our new site was generally complete in that timeframe. However, the new pages were listed much lower in the results than links to the old pages. Since most people click the first few links in results, they were only seeing the 404 page, getting frustrated, and leaving search before finding the working links further down.

I overlooked two differences between Middlebury and MIIS that made a big difference here.

  1. Middlebury’s website is linked to from many more pages than MIIS’s site, both internal links (we have many more pages) and external links (peer institutions, etc.). Google’s search algorithm is weighted to push up pages that are linked to more frequently. Since other sites haven’t updated their links to Middlebury, Google assumes that the old links are the right ones, since there are a lot more of them, and pushes them up in the results. This was less of a factor for MIIS because MIIS is linked to less frequently.
  2. We have kept around paths to sites at Middlebury for over 10 years. All of the old /~department_name, /offices/department_name, /depts/department_name, paths on cat.middlebury.edu, etc. from 1997-2009 still worked in January, 2010. These paths were created before Google even existed.
    Google.com in 1998

    Google’s index has never really been updated to reflect changes in our information architecture. We wanted to move away from this practice because:

  • It produces multiple results listings in search. If you searched for the Bread Loaf Writer’s Conference, you’d get a result with a link to its homepage at /~blwc, then another with a link to its homepage at /academics/blwc, then another with a link to its homepage at /depts/blwc, and so on. These all go to the same page and push other relevant results for that search further down the page. Ideally, the homepage should be the first result and other pages related to the program following it. By removing the old IA, we minimize the number of duplicate results.
  • We are now allowing you to control the IA of the site, which we didn’t do before. On MCMS, I had a slick rewrite rule that allowed me to redirect requests for academic departments’ sites because we required that they be named the same as they were in the old IA:

    RewriteRule ^/~?(depts/)?((?:alc|art|bio|chem|chinese|classic|cs|dance|econ|english|es|filmvid|french|geog|geol|german
    |haa|hist|ipe|is|italian|japanese|math|mbb|music|neuro|philo|physics|portuguese|ps|psych|rel
    |russian|soca|spanish|teach|theatre|ws)(?:[/\\\?].*)?)$ /academics/ump/majors/$2 [R]

    So if you went to /depts/filmvid you were taken to /academics/ump/majors/filmvid. I can’t do stuff like this anymore because the departments can now change the path to their sites without alerting me to the change. It gets even hairier for sub-pages of those sites. It would be a logistical nightmare to maintain automated redirects for all the variations. I think allowing departments to add pages to their site without submitting a Helpdesk ticket is a fair tradeoff here.
  • Some portions of our new IA overlap parts of our old IA, like /offices and /admissions. There were going to be broken links in these areas no matter what I did.
  • A really nit-picky point, but reducing the number of paths to a site improves the responsiveness of the site. Every time the server redirects you, a full request-response chain is generated. It’s faster to go right to the final URL than bounce between all these alternatives. We’re talking about milliseconds of difference here, but hey, every bit counts.

What I should have done was begin to phase out the old URLs last year, starting with the /~department_name addresses and working forward in the IA timeline. This would have reduced the shock on launch day and sped up conversion to the new IA. This is a lesson I’ll take to future projects of this nature.

That’s what happened, now here’s what we did.

Solving the 404 issues

Indexing Speed

Google offers a service named Google Webmaster Tools where you can see information about your site and control some of the ways that Google interacts with the site. The first thing we did was to double the speed at which Google crawls the site. Google finds out about the information on your site by automating typical user interaction with the site: a program they run will request your homepage, then request every page linked to from your homepage, and so on. The faster Google indexes this content, the faster information about our new site becomes available in its index. While we were still having performance issues with the site, we needed to decrease this indexing speed, but we were able to increase it again as we solved those issues.

Sitemap Files

Our next step was to create a sitemap file. Since Google’s crawler only looks at pages that are linked to from other pages, it might miss some content that isn’t linked to from anywhere, or from very few places. A sitemap file is a really simple text document that tells search engines about every page on your site so that they have a base to check their index against. Again, this was done to make sure that the search engines had as much information about our new site as we could provide. At the same time, Adam and Chris worked to block search engines from looking at our old site, and at portions of our new site that we don’t want indexed, by making entries in our robots.txt file, which tells search engines which paths they should ignore.
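
Both files have very simple formats. The excerpts below are illustrative examples of the two formats, not our actual files:

# robots.txt: tell crawlers which paths to skip, and where the sitemap lives.
User-agent: *
Disallow: /admin/
Sitemap: http://www.middlebury.edu/sitemap.xml

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.middlebury.edu/academics</loc>
    <changefreq>daily</changefreq>
  </url>
</urlset>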

Removing Links from the Index

Google also offers you the option of requesting that a URL is removed from its search index for six months, after which we can assume that the index will have updated to reflect that the page is permanently gone. We were able to retrieve a list of the broken URLs (about 100,000 of them) from Google’s Webmaster Tools, and started to run through the list. The problem with the URL removal tool that Google offers is that it only lets you request one page removal at a time. A developer at another college noticed this problem too and wrote an application that fills out the removal request form for you over and over again to remove the tedium from the process.

I started using this to remove some of the URLs from Google’s results and noticed that I was only able to submit 1000 URLs per day from an account. It also took about a day for the new requests to be processed. For a time, I was submitting multiple thousands of broken URLs through this tool using multiple accounts, but that too stopped working, I guess because someone at Google noticed what I was up to. I now take a more targeted approach to the situation.

Each morning I run the following script, inspired by this Drupal blog post, on each of our front-end webservers:

gawk '{ print $9,$7,$11,$4 }' /var/log/httpd/www.middlebury.edu-access_log | grep ^404 | grep google.com/search > 404.txt

This produces a report of all of the requests coming from Google searches that result in a 404 page. I then combine all of the reports and submit the non-duplicate pages from it through the URL removal tool. This turns out to only be a hundred unique pages per day, since we have eliminated most of the top level pages and are working on the “long tail” of search results.
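
The combining step is nothing fancy; it amounts to something along these lines (file names here are illustrative):

# Merge the per-host reports; field 2 of each line is the requested path,
# so this yields one line per unique broken URL.
cat 404-web1.txt 404-web2.txt | awk '{ print $2 }' | sort -u > 404-unique.txt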

I can then use this command to find out how many requests from Google.com to our site result in a 404 page each day:

grep " 404 " /var/log/httpd/www.middlebury.edu-access_log* | grep google.com/search | grep "09/Mar/2010" | wc -l

I’ve been recording the results for a couple weeks now and have noticed that this method appears to be working very well at reducing the number of bad landing pages on our site.


I noticed one more thing about 404 pages in the logs from this morning. I had already submitted, and processed, the removal of the old address for the Bread Loaf School of English several times, but it kept appearing as broken in these logs. Looking at the search results page for BLSE I noticed that there is a Maps result. This isn’t part of the normal Google search index so the broken link wasn’t being removed by my request.

To update this, you need to change the record in the Google Local Business Listing administration interface. This is actually a pretty neat tool. It lets you list the location, phone number, email address, and other information about your business to add to the Google Maps interface. You can also upload images and videos about your business. I added all of the information I could about the BLSE except for the screen where it asked if we had any current coupons, though that’s not a bad idea – 10% off your Masters perhaps? Google called the BLSE office and gave them a PIN, which I entered into the interface and now their listing in Google’s search results is better than ever.

Also, since it is integrated with Google Maps, we get some interesting information about the people who search for the BLSE. For instance, people from Springfield, MA need more help getting to the campus:

Where driving directions requests come from:

1. Springfield 01118 5
2. Cedar Beach 05445 4
3. Burlington 05401 2
4. Concord 01742 2
5. Washington 20006 2
6. Brattleboro 05301 1
7. Bristol 05443 1
8. Mansfield 44902 1
9. Newton 02459 1
10. Rutland 05701 1

Top Internal Searches

We’ve also been looking at the list of top searches on our internal search interface and adding GO shortcuts for all the items where there’s only one page you’d want for that search or custom search results page for the more ambiguous items. These results come from one month of Google Analytics information on our site.

Interface Improvements

Using GO for Search Results

I knew that automatically forwarding people to the page of a GO shortcut if they searched for one would be controversial. Everyone agreed with the concept of forwarding certain searches to their final destination, like “bannerweb” or “menu”, but people were alarmed at the extent to which I suggested we take this feature. However, after looking at the list of internal searches, it became clear to me that our top search terms were already GO shortcuts and were shortcuts for which there was only one logical destination.

Still, I am sympathetic to the issue I raised before about certain searches, like “german” going to a department page when there are a lot of other relevant pages for that term. Ideally, these searches would have a custom search result page, and we will likely build one for each of the terms, but those take a while to develop. Instead, we now use a really old feature of HTML, frames, to show a banner at the top of a page you’ve been forwarded to so that you can click back to the full search results if you didn’t find what you were looking for. My original idea was to just have this display in our Drupal site using the themes native to that platform. Adam suggested extending it to any site using frames.

What you see if you search for "banner" on our site.

Frames are a bit of a controversial feature of HTML. Few people consider using them any more, as layout based on Cascading Style Sheets has replaced both tables and HTML frames as the preferred method for laying out a web page. Still, they do have some valid uses and I’d contend that this is one of them. What frames do is split your browser window into multiple windows. A classic example is the Java API documentation, where you’re looking at three windows in one.

For the GO search results, we use two frames: one on top that links you back to the full search result and one on the bottom that shows you the search results page.

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html lang="en" xml:lang="en" xmlns="http://www.w3.org/1999/xhtml">
<?php if ($go_url): // a GO shortcut matched, so frame its destination ?>
 <head><title>Search Results</title></head>
 <frameset rows="30px,*">
 <!-- Top frame: the "view all the results" banner rendered by searchbar.php -->
 <frame marginheight="0" marginwidth="0" scrolling="no" noresize="noresize"
        src="<?php print $base_path . drupal_get_path('theme', 'midd-search'); ?>/searchbar.php?search=<?php print $q2; ?>" />
 <!-- Bottom frame: the GO shortcut's destination page -->
 <frame id="contentFrame" src="<?php print $go_url; ?>" />
 </frameset>
</html>
<?php else:  // no shortcut matched; print the full search results ?>
<?php endif; ?>

The searchbar.php script itself is a really simple page that just displays a message and a link:

<?php
// Recover the human-readable search phrase from the query string.
$search = str_replace("+", " ", $_GET['search']);
?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html lang="en" xml:lang="en" xmlns="http://www.w3.org/1999/xhtml">
 <head>
 <style type="text/css">
 #searchbar { margin:0px; padding:0px 18px; width: 100%; height:30px; line-height:30px; font-size: 1em;
     font-family:Verdana,"Lucida Grande",Lucida,sans-serif; background-color: #BEDA90; color:#003468; }
 .closeBar {position:absolute; right:0;}
 .closeBar a {padding-right: 18px; text-decoration:none;}
 </style>
 <script type="text/javascript">
 // Remember where the bottom frame started so closing the banner keeps you there.
 var mainloc = parent.document.getElementById('contentFrame').src;
 function closeFrame() { window.top.document.location = mainloc; }
 </script>
 </head>
 <body>
 <div id="searchbar"><span class="closeBar"><b><a href="javascript:closeFrame()">X</a></b></span>We think this is the
   right page for your search of <b><?php print htmlspecialchars($search); ?></b>, but if it's not, you can <b>
   <a href="http://www.middlebury.edu/search?q2=<?php print urlencode($_GET['search']); ?>&amp;nocustom=true"
   target="_top">view all the results</a></b>.</div>
 </body>
</html>

You can click on the “X” in the upper right to close the frame and keep browsing. One of the limitations of this approach is that frames showing pages from different sites can’t actually communicate with each other, for security reasons (the browser’s same-origin policy). If they could, a site could create a really small frame with malicious code and then a really big frame with any site on the internet, then have the small frame execute its malicious code on the previously secure big frame. Since they can’t communicate with each other, when you close the top frame it will take you to the location that the bottom frame was at when you first saw it. So if you browse around for a bit in the bottom frame, then close the top frame, you’ll be taken back to your original search result page.

GO terms in the Search drop-down

Right before our site launched we noticed that the constituent landing pages like Current Students and Faculty & Staff had these search boxes on them with the label “go” in front of them. The idea was to let people search the database of GO shortcuts. We didn’t have any way to do this at the time, so Adam developed a little module for Drupal that made a request to the GO database to conduct searches of terms and used the jQuery autocomplete plugin to make it so that the results were returned to the user in real-time.

/**
 * Look up the destination URL for a GO shortcut name.
 */
function go_fetch_url ($name) {
  if (!is_string($name))
    throw new InvalidArgumentException('$name must be a string.');

  if (!strlen($name))
    return array();

  $pdo = go_pdo();

  // If this Drupal instance is scoped to one institution, only match its codes.
  if ($inst = variable_get('go_scope_institution', '')) {
    $stmt = $pdo->prepare("SELECT code.url FROM code LEFT JOIN alias ON (code.name = alias.code)
      WHERE (code.name=:name1 AND code.institution=:inst1)
      OR (alias.name=:name2 AND alias.institution=:inst2)");
    $stmt->bindValue(":inst1", $inst);
    $stmt->bindValue(":inst2", $inst);
  } else {
    $stmt = $pdo->prepare("SELECT code.url FROM code LEFT JOIN alias ON (code.name = alias.code)
      WHERE code.name=:name1 OR alias.name=:name2");
  }
  $stmt->bindValue(":name1", $name);
  $stmt->bindValue(":name2", $name);
  $stmt->execute();

  $row = $stmt->fetch(PDO::FETCH_ASSOC);
  if (!$row)
    throw new Exception('No result matches.');

  return $row['url'];
}

I took this and applied it to the search boxes throughout the site, making a couple of modifications. The GO boxes on the constituent pages assumed you only wanted to search GO and would complete your search term for you. On the site-wide search, we know that not every search will be covered by a GO shortcut, so we left that behavior out. I also added “go/” to the beginning of each of the results, so that people were more aware of what the suggestions meant, using the autocomplete plugin’s formatItem option:

$('.go_query').autocomplete(
  url,
  { max: 30,
    width: 200,
    autoFill: false,
    selectFirst: false,
    formatItem: function(row) {
      return "go/" + row[0];
    }
  });

Results from the Google Search Appliance

Though we had initially wanted to move away from using the Google Search Appliance (GSA) for search results, because so many of the links to our site on Google.com were broken, we reindexed the new site using the GSA and added its results to the search results page. This involved requesting results from the GSA in a way that we hadn’t done before. We used to just have the GSA serve as the search front end using the XSLT style sheet interface that the server provides, but doing that would bypass all of the GO shortcut and custom search result page work that we’d done, as well as leave out results from the Directory and Course Catalog.

Instead, I found in the GSA documentation that you can make a request to the service and have it return an XML document of results, using this URL for our search engine:

http://search.middlebury.edu/cluster?q=SEARCH_QUERY&site=SEARCH_COLLECTION&coutput=xml&btnG=Google+Search&access=p&entqr=0&ud=1&sort=date%3AD%3AL%3Ad1&output=xml_no_dtd&oe=UTF-8&ie=UTF-8&client=default_frontend&proxystylesheet=default_frontend

For this to work, you need to replace SEARCH_QUERY with your search and SEARCH_COLLECTION with one of the collections that we maintain to segment search results. For example, there is a search collection named “Middlebury” that has all of our sites, but also one named “Blogs” that has only pages on our Wordpress instance. Here is an example of what is returned by a search for “Google” on our Blogs server.

I then parse these results and display them on the search results page. Not wanting to get rid of work that was already there, and because I know we will want to switch back to using Google.com for our primary search results once the issue of 404 pages has been resolved, I added a tabbed interface on the search results page that lets you alternate between the two search collections. Just click on the tabs to see results from the other service.
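
Here is a rough sketch of the parsing step (not our production code). It assumes the GSA's standard XML layout, in which results live in <GSP><RES><R> elements, each with <U> (URL), <T> (title) and <S> (snippet) children:

// Build a trimmed-down version of the request URL shown above.
$gsa_url = 'http://search.middlebury.edu/cluster?q=' . urlencode($_GET['q2'])
         . '&site=Middlebury&output=xml_no_dtd&client=default_frontend';

$xml = simplexml_load_file($gsa_url);
$results = array();
if ($xml && isset($xml->RES->R)) {
  foreach ($xml->RES->R as $r) {
    $results[] = array(
      'url'     => (string) $r->U,
      'title'   => (string) $r->T,
      'snippet' => (string) $r->S,
    );
  }
}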

Asynchronous Results

The search results page was one of the slowest loading pages on our site, taking between 6 and 15 seconds to load. The reason for this is that it needs to make requests to a lot of services to get all of the information:

  1. Check to see if there are custom search pages
  2. Check to see if there is a GO shortcut
  3. Get the results from Google.com
  4. Get the results from the GSA
  5. Get the results from the Directory
  6. Get the results from the Course Catalog

Steps 1 & 2 still happen before the page loads, since we might need to redirect you based on their results, but steps 3-6 now happen after the page loads. While you’re viewing the page, we’re requesting results from all of those services in the background. When the results come in, the page displays them using JavaScript. You might see the results from the GSA immediately and then results from the Course Catalog a second or two later. This gives the illusion of the page loading faster than it actually does and, if the thing you’re looking for appears early on, lets you skip waiting for results from services you don’t need.
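
A simplified sketch of the idea (the URLs and element IDs here are illustrative, not the actual markup): after the results page loads, ask each backend for its results separately and drop them into the page as they arrive.

jQuery(function ($) {
  var query = $('#search-input').val();
  $.each(['gsa', 'directory', 'catalog', 'google'], function (i, service) {
    $.get('/search/' + service, { q2: query }, function (html) {
      $('#results-' + service).html(html);   // fill in this service's panel
    });
  });
});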

Upcoming Improvements

Improving search has been our secondary focus (after site performance) since launching our new site and a very important part of our work. We really want to get this right, so we’ll be adding more and more of these types of improvements around the search results as time goes on. We’ll next be looking at statistics on how well our strategy of using GO shortcuts to deliver people directly to result pages works based on click patterns and, once we solve the 404 issue, how well Google.com does at providing basic search for our site.

Faceted Search

The next area of work is to figure out segmented search. We have a number of collections of highly structured content like HR Job Descriptions or Undergraduate Research Opportunities. We want to be able to build search interfaces for these collections so that people can search for, say, all of the jobs on campus that have a job level of Specialist 3 or all of the Research Opportunities in East Asia.

To do this, we’re setting up a local copy of the Apache Solr search engine. There is a Drupal module for this search engine that allows it to build filters based on content types. Job descriptions and research opportunities are content types and each of their fields could then be used as a filter in the faceted search results. I’m still in the preliminary stages of setting up this service, but am hoping to have a rough prototype done in April of how this will work.
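
As a rough illustration of what a faceted query against Solr looks like (the field names here are hypothetical, since the index isn't built yet):

http://localhost:8983/solr/select?q=*:*&fq=type:job_description&fq=job_level:%22Specialist%203%22&facet=true&facet.field=job_level&facet.field=department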

Search My Site

Another use for the Apache Solr search engine would be to provide URL filtering for search results. We can do this by setting up collections in the GSA, but we don’t necessarily want to create a collection for every sub-site or maintain all of those filtering rules. Instead, we want to use Apache Solr’s flexible query syntax to let us find documents whose URL paths match patterns like “http://www.middlebury.edu/academics/lib” by passing that as a parameter to the search engine that we can alter if needed. This will also help us to add search to areas of the site like news archives.
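
In Solr's query syntax, that kind of path restriction can be written as a filter on the indexed URL; assuming, hypothetically, a field named url (the colons need escaping, and the whole expression is URL-encoded when sent as a query parameter):

fq=url:http\://www.middlebury.edu/academics/lib*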

Excluding GO Addresses from Search

There are some times when a custom search page is not appropriate for a search term and we don’t want to go directly to the URL of the GO shortcut when you search for that term. For example, Adam had set up “go/adam” to go to his personal blog, but there are a lot of people at Middlebury named Adam and people might be looking for a different Adam. We’ll add an option in the GO administration interface that lets you exclude a GO shortcut from being used in search results.

User Preferred Search Engine

We’ve already got results from Google.com and the GSA and are adding the Apache Solr search engine to our site. Why just get results from one of these? Why not all of them? Why not Yahoo and Bing as well? Why not let people pick which one they want to use? We’ll add a field to the user profile pages in Drupal so that you can set your own preferred search engine and have it provide the default results for all your searches. This will also help us to look at which search tools you prefer and gauge which are giving better results.

What else?

Do you have other ideas for things we could do? Is there something that I glossed over that you have more questions about? Please let us know by adding your comments to this post.

Supported Web Browsers

I was asked as a member of the LIS Website Team to put together a quick post on supported web browsers for our site. In general our guideline for supporting a browser is to keep support for it for as long as the browser’s manufacturer is supporting it. This means we will try our best to resolve issues with any browser that you can readily download from a manufacturer’s site, except for beta and pre-release versions.

These guidelines apply only to services supported by the Web Application Development workgroup. Other workgroups may have their own guidelines, for example Internet Explorer 7 is the supported browser for Internet Native Banner users.

These are the versions we support at the time of this post:

With the exception of Internet Explorer, each of these browsers has both Mac and PC versions.

I’m using Internet Explorer. Which version should I use?

We recommend that none of our users browse the site with IE 6, but the browser still accounts for about 6.5% of our site’s traffic and we try to make sure that the site is mostly working for these users. If you are on-campus, your machine should have received an automatic update recently to move you to IE7, if you hadn’t applied that upgrade already. If you are on-campus and still using IE 6, contact the Helpdesk so your machine can be updated.

Most of the site’s features and visuals are the same in IE 7 and IE 8, but IE 8 does have a better rendering engine and will be able to support more features going forward. Users of Internet Native Banner should stay on IE 7, since that is the most recent version of Internet Explorer supported for use with INB. Others may choose to upgrade to IE 8.

What about older versions of Firefox?

The Mozilla Foundation makes available all older versions of the Firefox browser, but after a certain time it stops applying security and stability updates to them. When that happens, it makes sense for us to stop supporting the browser for viewing and editing the site. Firefox is updated more frequently and more iteratively than Internet Explorer, making the changes between its versions less severe and allowing site functionality to continue working in most cases. For this reason, we recommend always applying the updates to Firefox and sticking with the most recently released version.

There are specific issues with Firefox 3.0 that we know about on the site and are unlikely to resolve. If you’re using Firefox 3.0, please upgrade to one of the more recent versions.

Is there a different list of supported browsers for editing www.middlebury.edu?

In theory, no. We would very much like the editing experience to be the same across all of the browsers listed above. However, we are beholden to using a WYSIWYG editor that is known to have a few quirks in certain browsers. We are planning to upgrade the version of this editor shortly to address some of these issues, but need to make certain that modifications to it to allow you to browse for files in the site still work in the new version.

We don’t block you from using any browser to edit the site, but some people have noticed intermittent quirks when editing in Internet Explorer and Safari. At this time, we recommend that editors use Firefox since we have not heard of editing issues with this browser and it’s part of the default distribution package.

What about beta and pre-release browser versions?

You’re welcome to use these, and they may work, but we will not respond to bug reports about site functionality not working in a beta version of a browser. These are often caused by issues with the browser that are addressed before its final version is released and third-party systems like Wordpress and Drupal will often release their own fixes to these issues when the final version of a browser is released. It’s not efficient for us to spend time addressing these issues as well.

This recently came up because the Wordpress editing interface didn’t work in a development version of Google Chrome. The issue was resolved several days later in a new development build of the browser and is likely not something we would have been able to resolve. In circumstances like these, we recommend using one of the supported browser versions instead until the development version is updated to fix the issue.

I’m using one of the supported versions, but there’s an issue. What can I do?

People with a Middlebury College account can submit a bug report. This system allows us to communicate back-and-forth with you and gives you a view of the issue through a web interface. If you don’t have a Middlebury account, you can submit the Web Feedback form and we’ll get in touch with you via email.

If I haven’t answered your question here, leave a comment.

Website Improvements #4: Previews

As I said at the start of this series, I aim to do at least one thing each week that improves our website for someone. Last week we had a number of improvements to the performance of the site that had a dramatic effect for everyone. Not every update to the site is quite that exciting. This update might not seem as significant, but it will help out people editing our site.

I also wanted to use this series to give more of a back-end explanation of what goes into these changes. This gives other schools and organizations running Drupal an opportunity to see what we’re up to, use our solutions to fix similar problems, and offer suggestions on how we could do this even better. I don’t want these posts to just be, “Yup, I added a button. Problem solved.” *dusts hands off*

But I do realize that those details might bore some people. So, if you want to know what’s changed:

  • You can check the Preview Live Site box in the Edit Console to hide all of the Edit, Delete, etc. links, hidden menu items and other content available only to editors.
  • There is a Preview button at the bottom of the editing form. Click it to see what your updates will look like before saving the node.

Preview Live Site

I’ve added a checkbox to the Edit Console labeled Preview Live Site. Clicking this will show or hide all of the links and content that are only visible to editors. These links are necessary to edit the site, but sometimes you want to be able to browse the site as a normal visitor would see it, so you can make sure that padding around images is correct and unpublished content isn’t being shown.

Here is the Library web site with all the editing links shown:


And here it is in Preview Live Site mode:


In order to give you the option to browse around the site with this option either off or on, I set a cookie when you click on the checkbox. A browser cookie is a text file your browser creates on your machine and sends back to the site that created it whenever you visit the site. In the case of this browser cookie, you tell Middlebury that the value of “midd_live_preview” is “preview”. As long as your browser retains that cookie, you’ll see the site in “live mode”. Unchecking the checkbox clears the cookie.

This requires the jQuery Cookie plugin. I added a checkbox with the id “livepreview” to our “Edit Console”, which is a floating tab of options for editing the site using Monster Menus. And this is the jQuery-enabled JavaScript that makes this work:

$(function() { // on DOM ready
  $('#livepreview').click(function() {
    var options = { path: '/', expires: 10 };
    if ($('#livepreview').is(':checked')) {
      $.cookie("midd_live_preview","preview", options);
      $('.mm-block-links,div.links,.hidden-cat,.recycle-bin,.preview').hide();
    } else {
      $.cookie("midd_live_preview", null, options);
      $('.mm-block-links,div.links,.hidden-cat,.recycle-bin,.preview').show();
    }
  });
  if ($.cookie("midd_live_preview") == "preview") {
    $('#livepreview').attr('checked', true);
    $('.mm-block-links,div.links,.hidden-cat,.recycle-bin,.preview').hide();
  } else {
    $('#livepreview').attr('checked', false);
    $('.mm-block-links,div.links,.hidden-cat,.recycle-bin,.preview').show();
  }
});

Preview Button for Editors

I’ve added a button next to the Save button at the bottom of the node edit form for you to preview the change. Actually “added” is the wrong word, since the Preview button is always supposed to be there. There must have been issues with an earlier version of the editor that caused our colleagues at Amherst to hide this option, since they wrote:

mm_ui.inc:2335:    // TinyMCE screws up the body in previews, so remove this button for now
mm_ui.inc:2336:    unset($form['buttons']['preview']);

I was not able to detect any issues with the preview option, so I added this back in. Be sure to let me know if you notice anything awry. We will be moving very soon to a newer version of TinyMCE, which is the WYSIWYG (what-you-see-is-what-you-get) editor we use. This will accompany the addition of the WYSIWYG Drupal module, which will let you choose whether you want to use TinyMCE or the FCKEditor. Stay tuned for more on this in a future update.

However, the normal Drupal preview mode shows you both the “teaser” and full versions of the node you’re posting. We use teaser versions in very few places on our site, making this potentially confusing to editors. Fortunately, Drupal lets me override the output of content through its theming system, and there’s a theme function for the preview mode of nodes. I added this quick function to our template:

function midd_node_preview($node) {
  $output = "";
  if ($node->body) {
    $output .= node_view($node, 0, FALSE, 0);
  }

  return $output;
}

The node_view function takes the node as its first argument, whether to display the teaser version as the second, whether to display the node as its own page as the third, and whether to display editing links as the fourth. We just tell it to show the node as it would be displayed to a site visitor, and we’re done.

Website Improvements #3: Better Performance [Extended Edition]

Here are some additional notes about the update Adam gave last week on how we were able to improve performance sitewide.

Attack of the Search Bots

Adam discovered that there was a page on our site that was displaying a linked tree of the permission groups for the site – all of them. A couple search spiders found this page and started browsing its sub-pages. There are hundreds of thousands of permissions groups, up to four for every course taught at Middlebury going back years, mailing lists, etc. It’s important to note right here that only the name of the permission group is displayed through this interface, not its members. Still, having search bots crawling all of these pages slowed our site down to a crawl, and there’s no reason we’d want this content indexed anyhow.

Adam added a rule to the robots.txt file for our site telling search bots to ignore this path, which will also remove it from their indexes. He also placed the pages behind authentication so that other users wouldn’t slow the site down by looking through there.

Reducing Hits on the Database

We also began to look at how many requests to the database were being made each time a page on the site was loaded. Pages like the Student Life home page require over 500 database queries to run before they are loaded. Most of this overhead is fetching various pieces of information about menu items, display settings, permissions, and the many fields that some nodes on our site use. There were two things that stood out in these results.

1. There were many queries related to the Workflow and Locale modules. Workflow allows us to create edit -> review -> publish workflows for content approval, but we haven’t set any of these up yet. Still, just having the module enabled requires Drupal to check the database for each node to see what its current workflow state is. The Locale module lets us provide multi-language versions of content and display a version in the local language of a person visiting the site. Since none of our content has been translated, this isn’t useful for our site. I disabled both modules as one step in improving site performance.

2. The left-hand menu requires many queries to load. On the home page, even though we aren’t displaying the menu, it was still loading all sub-pages of the home page, just in case we did want to display them. I hid all sub-pages of the home page in its menu. This reduced the number of database requests to load the home page from 100 to around 30. The home page is the page most often requested on our site, so making it as fast as possible improves performance site-wide.

Moving Wordpress

We had originally hoped that we could set up a “high availability” MySQL server that would provide database space for Drupal, Wordpress, MediaWiki and other applications we consider part of our “core” supported web applications. This desire came up last Spring when an update to the main database server caused an issue in a little-used, old, third-party application to spiral out of control and disrupt services on several of these high-use applications.

Unfortunately, it appears that the combination of WordpressMU and the Drupal instances of both www.middlebury.edu and www.miis.edu generates so much database activity that MySQL couldn’t keep cached versions of these queries in its memory. Cached queries are really important for performance. When you make a request to the database like “give me all of the stories on the Midd home page” it will read from the server’s disks to find the answer, then keep a copy of the answer in active memory. The next time the database is asked that question, it can skip the step where it looks up the information on the disk. Retrieving information from active memory is several orders of magnitude faster than reading from disk, but it’s a much more limited resource.

With both applications running on the same database server, the query cache would quickly fill up, overflow and empty out, meaning that most requests to the database were being served from the disk. After Adam and Mark worked to get Drupal on its own database server, without Wordpress, over 90% of the queries sent to the database are served from the cache, greatly improving the responsiveness of the machine.
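
For anyone who wants to check their own server (this is a standard way to inspect the cache, not necessarily how we measured it), MySQL exposes query cache counters:

SHOW GLOBAL STATUS LIKE 'Qcache%';
SHOW GLOBAL STATUS LIKE 'Com_select';
-- The hit rate is roughly Qcache_hits / (Qcache_hits + Com_select).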

Removing Locks

On Thursday, February 11 at 10:07AM, the site crashed and was down for about 15 minutes. The database had stopped processing new requests and needed to be restarted. Before doing that, we looked at the last database query in the queue. It was a request to create a new user group in the groups table. This happens whenever you save a page after adding a single user to the page permissions. To simplify permissions requests, Monster Menus groups all single users assigned to a page together and creates a pseudo-group in the database. So if you add me, Adam and Mark to be able to edit a page, Monster Menus will create a group with each of our user objects in it and assign that group permission to edit the page, rather than each of us individually.

In order to keep these separate from the rest of the groups, they are inserted into the groups table with a negative ID. So a group with a positive ID will be something like “All LIS Staff”, but a group with a negative ID might be “temp group of Ian, Adam and Mark”. The database engine is designed to handle the case where two positive ID group creation requests occur at the same time, but not two negative ID group creation requests. In that case, which request gets assigned which ID?

This problem can be solved by placing what is called a “lock” on the database table until the current request is done processing:

db_query('LOCK TABLES {mm_group} WRITE');
$gid = db_result(db_query('SELECT LEAST(-1, MIN(gid)-1) FROM {mm_group}'));
if (!$gid) $gid = -1;
db_query('INSERT INTO {mm_group} (gid, uid) VALUES(%d, %d)', $gid, $uid);
db_query('UNLOCK TABLES');

This says, “keep other requests out of the group table, give me the next lowest ID# from this table (in other words – the most negative ID#), create my new group, then let other requests have access”. The problem with this is: if the database fails to process the INSERT request for whatever reason, no other requests can access the table, new requests will pile up and the server will die.

There is a setting on the database that prevents this from happening called table_lock_wait_timeout. By default, this was set to a very high value, somewhere around 36,000 seconds (10 hours). We changed this to 30 seconds, which should give the server enough time to process the request or, if it can’t, let someone else have a chance.

The Access Denied page

The path to log into the site is http://www.middlebury.edu/cas (the last part is “cas” because we’re using the Central Authentication System to provide single-sign-on to Drupal, Segue, Wordpress, and MediaWiki). I had put the path to Drupal’s 403 (Access Denied) page in the configuration as this path, figuring that if people hit a page they could not view, it would direct them to the sign on page. However, because of the way Drupal draws its access denied page, it was actually bouncing people infinitely between the sign on page and the page to which they didn’t have access and not giving them a chance to sign on.

Several people had this happen to them, waited patiently – too patiently – for it to be resolved and caused a large number of requests to be generated against the site, decreasing site performance. Adam noticed this and set up an appropriate Access Denied page that resolves this issue.

That’s it for this week

If you have any questions, we’re always happy to answer them here and remember that we’re still taking feedback via the Web Feedback form. If I haven’t responded to your question through that form, let me know.