DrupalCon 2010 Trip Report – Day 3

After attending a conference, I usually think, “Wow, we’re so far ahead here at Middlebury!” Not this time! DrupalCon was incredibly helpful in demonstrating all of the ways we can improve our site with better performance, better search, better content, and better code. I’m also really excited about the upcoming release of Drupal 7 and both confident we can move our site onto this new version and eager to use all the new features.

Here are the highlights from the last day:

  • All of the conference sessions are now available to watch online.
  • Librarians might be particularly interested in Shh! This is a (Drupal-powered) Library Website, though I suggest skipping the first 27 minutes as those presenters cover topics not relevant to our setup.
  • The tech team at the White House is contributing back to the community the work they’ve done to make Drupal perform better under high load, and the NY State Senate is working to provide an out-of-the-box Drupal configuration for state governments. Whatever your personal politics, I hope you’ll agree that it’s neat to see the government using free resources and then improving them for the rest of the community.
  • The node access system in Drupal 7 includes a new hook that will allow us to remove the “core hack” from Monster Menus, which will let MM become an accepted module in the Drupal community and help it get adopted by other Drupal users.
  • Drupal 7 allows you to build form fields with “states”. For instance, a group of fields for collecting credit card information could have the state “expanded” if a checkbox labeled “pay online” is checked. This will help us build easier-to-use interfaces.
  • We can combine the Apache Solr project with the Apache Nutch project to create a local search crawler and indexer, much like the Google Search Appliance, but with far more room for us to expand and far less configuration needed to provide faceted search.
  • The new database abstraction layer in Drupal 7 uses PHP’s PDO library, which will support MySQL, PostgreSQL, SQLite, MSSQL, and Oracle as database back ends. There are also huge improvements in database replication, allowing us to have a true “hot standby” server, support for prepared statements, transactions, a query builder, and a lot of other stuff that our Banner programmers take for granted.
  • While still in its very early stages, there are a number of Drupal projects working on support for various NoSQL platforms, including MongoDB. These systems promise to improve performance by removing some of the technical limitations imposed by storing information in “tables” in databases. Not quite ready for wide use in production, but expect to see a lot more of this in a few years.
  • Much of the HTML printed by core modules in Drupal 7 will use RDF markup. This provides additional information about the elements on the page that browsers and search engines can use. For example, the text “Price: $9.99” won’t get picked up by a search engine as a price, but “Price: $<span class="field-item even" property="product:listPrice">9.99</span>” allows the search engine to display that information as a price next to the page listing.

Read on for more notes on each of these points.

Node Access in Drupal 7 (Notes by Ian)

Watch the presentation

In the current version of Drupal, changes to the permissions a user has on a piece of content can only truly happen through the core of the software, meaning that our custom Drupal modules are limited in what they can do to control access. With Drupal 7, we get a new hook, hook_node_access(), that allows any module to define CRUD permissions (Create, Read, Update, Delete) on any node, and allows any other module to then modify those permissions. I’ll back up for a second and explain that when a Drupal page loads, the software core gets first crack at doing whatever it needs to do and then all of the modules get a chance to modify it, using hooks, in an order defined by the site administrator. Any module that implements hook_node_access() will run that function on all nodes and, if it is the last module to implement that function, will have the final say on who can do what with a node.
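To make that concrete, here’s a minimal sketch of what an implementation might look like, using a made-up module name (mymodule) that lets a node’s author edit or delete their own content. This is just an illustration of the hook, not Monster Menus’ actual code:

/**
 * Implements hook_node_access().
 */
function mymodule_node_access($node, $op, $account) {
  // Only weigh in on edit and delete operations.
  if ($op == 'update' || $op == 'delete') {
    // Allow the node's author; explicitly deny everyone else.
    if (is_object($node) && $node->uid == $account->uid) {
      return NODE_ACCESS_ALLOW;
    }
    return NODE_ACCESS_DENY;
  }
  // For everything else, let other modules (or core) decide.
  return NODE_ACCESS_IGNORE;
}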

This is very important for us, because this is exactly the type of thing Monster Menus needs to do. MM introduces “pages” to Drupal, which the core of the software doesn’t know anything about. In MM, all nodes are assigned to a page and then permissions are set on the page. So in Drupal 7, MM will be able to run hook_node_access() to tell the system, “these nodes being loaded right now belong to this page and here’s who has each of these permissions on them”. Right now this is all done with a large amount of heavy lifting that taxes the system. The hope is that, by opening up access to these functions, Drupal 7 will improve site performance and let us do more.

There are some costs to doing this. The query builder in Drupal 7 doesn’t know that queries need to be run through the node access logic, so it is a requirement that you add a tag of “node_access” to any query that runs against the node table. Failure to do this is considered a security violation in Drupal and will get your module flagged by the security team. This seems a bit silly: since we need to do this for every query against the node table, why doesn’t the system just add it for us? The presenter said that a patch to Drupal to provide that would likely get approved, but the design philosophy behind the decision is that it’s “not the API’s responsibility to tell you how to code correctly”. The new node_access hook also only works on single nodes; you can’t run a whole list of them through it, so there is a chance of poor performance on pages that need to load a bunch of nodes and execute these functions on all of them.
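For example, a node listing built with the new query builder has to carry that tag so the node access system can add its restrictions. A quick sketch, with conditions that are just for illustration:

// Load the IDs of published nodes, tagged so node access rules apply.
$nids = db_select('node', 'n')
  ->fields('n', array('nid'))
  ->condition('n.status', 1)
  ->addTag('node_access')
  ->execute()
  ->fetchCol();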

Instant Dynamic Forms with #states (Notes by Ian)

Watch the presentation

Drupal 7 adds the “#states” property to the form API, which allows us to define behavior for a form based on its current “state”. For instance, a field could become required depending on whether a checkbox elsewhere on the form is checked or unchecked. You can try out an example of this behavior.
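Here’s a small sketch of what that looks like in Form API code, using a hypothetical “pay online” checkbox and credit card field of my own invention rather than anything shown in the session:

$form['pay_online'] = array(
  '#type' => 'checkbox',
  '#title' => t('Pay online'),
);
$form['card_number'] = array(
  '#type' => 'textfield',
  '#title' => t('Credit card number'),
  // Show and require this field only while the checkbox is checked.
  // Drupal generates the JavaScript that watches the checkbox for us.
  '#states' => array(
    'visible' => array(
      ':input[name="pay_online"]' => array('checked' => TRUE),
    ),
    'required' => array(
      ':input[name="pay_online"]' => array('checked' => TRUE),
    ),
  ),
);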

Naturally, these examples are really trivial, but this is a framework with a lot of power in it. The other nice thing about using #states to build forms in Drupal 7 is that all of the JavaScript is created by the forms engine. This makes forms easier to build and less prone to bugs. I hate having to debug JavaScript, so I’m happy to hand off that responsibility to the software.

One potential issue is that the form is rebuilt by the server after every action by the user. If you click a checkbox, the server gets a notice, rebuilds the entire form, and sends it back to you. This makes the form more secure, since only the server can decide what elements are part of the form, but it is potentially very resource intensive. Our Page Settings form already takes a while to load the first time. If it has to load multiple times in between user actions, I can see people starting to get frustrated. We’ll have to wait and see.

Web Crawling and Search with Nutch and Solr (Notes by Ian)

How to build a Jobs Aggregation Search Engine with Nutch, Apache Solr and Views 3 in about an hour

Solr-Nutch Architecture (Diagram by Adam)

Apache Solr is a search engine and Apache Nutch is a crawler that can be used to populate that search engine. Using these tools together, we can build a local search repository that indexes all the same sites our Google Search Appliance does, but allows us to extend the search experience by adding facets and localized search. Both Solr and Nutch are also available in cloud configurations, meaning that we can offload the processing of these actions if they grow beyond what our local staff and servers can manage.

The biggest advantage of using Solr as a search engine with Drupal is that the two are closely integrated through the Apache Solr Drupal module. Rather than use a search crawler that goes through the site, Drupal periodically sends highly structured data to the Solr search repository, including all of the metadata associated with a node. For instance, we have a node in the site for every staff job description, with fields for department, position number, and level. A crawler just sees these as plain text, but Drupal sends each field off to Solr as part of the index. So when you search for “Programmer” you can filter by level, department, or location. Though it isn’t mentioned anywhere in the documentation, I learned that these filters are automatically set up for any content type field that is a radio button, checkbox, or drop-down menu. For other fields, we can build our own filters.
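As a rough illustration, a faceted query against Solr’s standard select handler might look something like the line below; the host and field names are just assumptions about how our job-description fields could be exposed, not our actual schema:

http://search.example.edu:8983/solr/select?q=programmer&facet=true&facet.field=department&facet.field=level&facet.field=location

The response includes counts for each department, level, and location value alongside the matching documents, which is what drives the filter links in a faceted search interface.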

It’s true that these services emulate what we already have with our Google Search Appliance. We’ll be meeting later this week to discuss our search strategy and determine whether it will be beneficial to set up this infrastructure, extend how we use the GSA, both, or neither. I’m glad I attended this session before trying to make that decision!

Databases: the Next Generation (Notes by Ian)

Watch the presentation

The big announcement in this session was that the database abstraction layer in Drupal 7 will use PHP’s PDO libraries, meaning that Drupal 7 can run with MySQL, PostgreSQL, SQLite, MSSQL, or Oracle as a database backend. Currently, it can run using MySQL or (with some reservations) PostgreSQL. Microsoft also announced at the conference that they have supplied a beta driver for MSSQL to the PHP project and are engaged in improving the driver to access their database, rather than relying on the volunteer community to provide it.

This opens up a lot of features that I bet our Banner programmers would be surprised to learn we didn’t have in Drupal, since these have been common in the PL/SQL environment forever (a quick sketch of a few of them follows the list):

  • Prepared statements
  • Transactions
  • Named placeholders for variables in statements
  • Merge and truncate
  • Return a result set as any object type
  • Multi-insert statements
  • Full master/slave replication support for multiple failover servers
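A rough sketch of a few of these using Drupal 7’s database API, with table and field names that are made up for illustration:

// Named placeholders; the query runs as a prepared statement.
$title = db_query('SELECT title FROM {node} WHERE nid = :nid', array(':nid' => 42))
  ->fetchField();

// Wrap related changes in a transaction and roll back on failure.
$txn = db_transaction();
try {
  db_insert('mymodule_log')
    ->fields(array('nid' => 42, 'message' => 'Updated'))
    ->execute();

  // Merge: insert the row if it doesn't exist, otherwise increment it.
  db_merge('mymodule_counter')
    ->key(array('name' => 'updates'))
    ->fields(array('value' => 1))
    ->expression('value', 'value + 1')
    ->execute();
}
catch (Exception $e) {
  $txn->rollback();
  throw $e;
}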

The importance of that last point can’t be overstated. Right now, if our primary database server fails, we can switch over to our backup server. However, MySQL servers can only replicate data in one direction, so we can’t allow you to modify data on the backup server because it would never be written back to the primary server. The current database abstraction layer is so basic that our code can’t differentiate between a query that writes to the database and one that only reads from it. So we solved this by denying our web server permission to execute write statements on the backup server. When the failover occurs, you can try to write data, but you’ll get a warning and our error log will start piling up errors too. This isn’t ideal, and the added functionality in Drupal 7’s database abstraction layer solves the problem.
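With Drupal 7’s layer, a read-only query can state that intent directly. A minimal sketch; “slave” is the target name Drupal 7 uses for replica connections defined in settings.php:

// Send this read-only query to a replica if one is configured;
// Drupal falls back to the primary server when no slave target exists.
$count = db_query('SELECT COUNT(*) FROM {node}', array(), array('target' => 'slave'))
  ->fetchField();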

This was also one of those sessions where I was impressed by the amount of knowledge in the room. A fairly esoteric question was asked about prepared statements: where are they prepared? In the code? By the database? It turns out that, in PHP, the answer is different for each database driver. The MySQL driver prepares the query before passing it off to the database, so it’s done in C code on the web server. The Oracle driver passes the query to the database to prepare, and the parameters are then fetched from the web server. The Oracle approach is more precise and less error-prone since the database is doing the preparation, but it introduces latency since another round trip to the web server is needed to get the parameter information.

MongoDB – Humongous Drupal (Notes by Ian)

Watch the presentation

MongoDB is a key-value index database that can be used to improve performance for very, very high volume sites. The claim is that storing information on the filesystem using key-value pairs allows it to be accessed more quickly than storing information in tables, as RDBMSs (MySQL, MSSQL, Oracle) do. Since there is no schema, data can be added easily (just append) and indexed on any key. Since it’s just files, the database can be replicated easily as well. Here’s an example of a few documents from a MongoDB collection:

{"name" : "mongo" , "_id" : ObjectId("497cf60751712cf7758fbdbb")}
{"x" : 3 , "_id" : ObjectId("497cf61651712cf7758fbdbc")}
{"x" : 4 , "j" : 1 , "_id" : ObjectId("497cf87151712cf7758fbdbd")}
{"x" : 4 , "j" : 2 , "_id" : ObjectId("497cf87151712cf7758fbdbe")}

The same structure in an RDBMS might use three separate tables, one to store x, one for j, and one for name. Depending on how x and j are related, there might be a fourth table involved. This is important for Drupal because all the fields used by content types are stored in separate tables. For our job descriptions, we have a table to store the title of the description, another to store the department, another to store the level, and so on. When the node is printed to the page, all of these tables need to be accessed by joining them together, which can become a resource-intensive task. Under high load, this causes problems, particularly in MySQL.
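For illustration, here’s a rough sketch of how one of our job-description nodes might look as a single MongoDB document, with every field embedded in one record instead of spread across per-field tables; the field names are hypothetical, not Drupal’s actual storage schema:

{"_id" : ObjectId("4b9f1c2e51712cf7758fbdcc") , "nid" : 1234 , "type" : "job_description" , "title" : "Programmer/Analyst" , "department" : "LIS" , "position_number" : "A12345" , "level" : "SP3"}

A single lookup by nid returns everything needed to render the node, with no joins required.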

The work being done on MongoDB for Drupal is still in its very early stages and probably won’t be ready for widespread use for over a year. Right now they have some of the query functions used in Drupal core implemented, but not all of them, and not necessarily through the database abstraction layer, meaning that any module needs to be rewritten to use MongoDB with Drupal at this time. For those few sites that have enough visitors to warrant it, that might be an acceptable trade-off, but not for us. “NoSQL” systems like MongoDB will be one of the most interesting developments in web software this decade. We’ll have to see how they develop in parallel with the traditional RDBMS systems.

Scalable infrastructure for Whitehouse.gov (Notes by Adam)

Frank Febbraro of Phase2 Technology, part of the team that developed Whitehouse.gov, gave a session titled Providing a Scalable Infrastructure for Whitehouse.gov. The session covered the techniques used to ensure that the site could handle a huge visitor load while remaining highly available and extremely secure against defacement.

  • Infrastructure build-out took a team of the same size and the same amount of time as the development work.
  • Tested by turning off servers and services to see how the system reacts.
  • They found that targeting read and write queries to different database servers with MySQL_Proxy worked great in development, but fell down under their heavy load testing. They ended up having to patch Drupal core (by running PressFlow) to enable retargeting from within Drupal.
  • They run two complete data centers: a production data center and a disaster-recovery data center. The disaster-recovery data center also includes development environments and complete data replication, so that they can continue operations even after a complete loss of the production data center.
  • The hosts are all RHEL and provisioned with Puppet (60+ servers). They use SELinux to lock down access to files and executables within the hosts and use AIDE to report on unauthorized file access. Puppet allows each type of server to be spawned exactly the same (ensuring that new servers are in audit compliance just like existing ones).
  • The Akamai content delivery network is used for three services: Site Accelerator (a reverse proxy for handling page caching), Net Storage for file serving, and Live Streaming. 90% of all traffic hits Akamai and doesn’t need to go through Drupal. Since they have under a hundred authenticated Admin and Editor users (no public users can log in) they have very low authenticated traffic and don’t need to scale authenticated access much.
  • PHP code and user files are all served from a NAS system mounted via NFS.
  • They run Memcache with the consistent hashing strategy, which allows a cache node to fail while the cache continues to operate (see the sketch after this list).
  • The database backend is MySQL Enterprise with InnoDB. They also use a RAM-based filesystem for temp tables to improve the performance of file-sort operations.
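As a generic sketch of the consistent-hashing idea mentioned above (not their actual configuration, and Drupal sites usually set this up through the memcache module in settings.php rather than directly in code), the PECL Memcached extension can be told to use consistent hashing so that losing one cache server only invalidates that server’s share of the keys:

// Hypothetical pool of cache servers using consistent (ketama) hashing.
$mc = new Memcached();
$mc->setOption(Memcached::OPT_DISTRIBUTION, Memcached::DISTRIBUTION_CONSISTENT);
$mc->setOption(Memcached::OPT_LIBKETAMA_COMPATIBLE, TRUE);
$mc->addServers(array(
  array('cache1.example.gov', 11211),
  array('cache2.example.gov', 11211),
  array('cache3.example.gov', 11211),
));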
Database Replication:
They use both master-master replication and master-slave replication. The second master is a hot standby for the primary master and also handles all of the slave replication. This means that the primary master only has to handle some reads, all writes, and replication to a single ‘slave’.
Monitoring:
MySQL Enterprise Monitor for the databases, Nagios for infrastructure monitoring, and Cacti for graphs.
Replication Monitoring:
A file listing the pool of active slaves is constantly updated, allowing PHP to switch where reads are directed. Custom scripts remove slaves from the pool when replication fails, rebuild them, then move them back into the pool once they are repaired.
Environmental sync:
Changes to servers and files automatically sync to Akamai as well as to the disaster-recovery site.
Hardware scaling:
The goal is to scale horizontally quickly. Puppet handles the provisioning details; new web servers and database slaves can be brought online in minutes.
Data scaling:
They receive 15,000+ Webform submissions every day. These are stored outside of the main Drupal database so that site restores are much quicker, since they don’t require rebuilding many GB of data.
Development Process:
They create a branch per issue and a branch per release, merging completed issue fixes into the release branch.
Release Process:
They run a full-featured staging environment that allows testing of all aspects of changes.
For published data, such as White House visitor records, data files are imported using the Drush command-line tools.

RDF in Drupal 7 (Notes by Adam)

Exclamations of how the “semantic web” is going to revolutionize the internet as we know it have been voiced for a decade. The session “The story of RDF in Drupal 7 and what it means for the Web at large” described how, in the upcoming Drupal 7, semantic tags will be added to Drupal markup as HTML attributes (RDFa). These attributes make it easier for machines to understand the meaning and context of words on the page. One example used in the session is a bit of price text carrying an RDFa attribute, <div class="field-item even" property="product:listPrice">9.99</div>, enabling search engines to display that price in their results rather than treating it as unstructured text. In my view, this change won’t be earth-shattering for quite a while (another few years), but it will make some things we do now (such as search and data re-purposing) work better without as much server-side programming. In the long run there may be bigger implications.
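For example, and hedged a bit since the exact vocabulary and markup depend on the module producing it, a fuller RDFa snippet might look like the following, with the product prefix (a placeholder namespace here) declared on a wrapper element:

<div xmlns:product="http://example.com/ns/product#" about="/store/widget">
  Price: $<span class="field-item even" property="product:listPrice">9.99</span>
</div>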

The second neat thing was a mention of an RDF Proxy module that will enable Drupal to fetch and display content from remote sites via RDF queries. This is sort of like displaying RSS feeds on steroids. The big difference is that rather than being limited to a feed that has one or more items of a similar format, the RDF queries can grab individual or multiple data elements from a remote page or many remote pages. Even better, these queries can be run against the normal human-readable web pages rather than requiring a separate RSS or XML feed to be generated on the source site.

This session also showed the use of Sindice Inspector, a tool for viewing, navigating, and graphing the RDF data in a web page. They showed a cool example of a blog post with graphs linking the RDF data on multiple sites, but I wasn’t able to find a URL with such complex RDF data.

2 thoughts on “DrupalCon 2010 Trip Report – Day 3”

  1. Forest Mars

    (MongoDB is not a key-value index database.)

    Key-value caches include APC and Memcached, and key-value stores include Voldemort and SimpleDB, both of which are “eventually consistent” (though this is not a key-value store requirement).

    MongoDB, however, like CouchDB, is a document store, which is a very different model. The document DB model gives you performance approaching what you might achieve with key-value, but with object modeling and querying beyond what you can get from an RDBMS.

    This sophistication in object modeling can lead to a tighter coupling between database and code, even as functions themselves become more loosely coupled, both within the limits of what the CAP theorem allows.

    MongoDB is one of the most exciting things happening in Drupal now, storing every field of an entity in its corresponding document as an AJAX-ready JSON (BSON) object. I have seen the future of fields, and it is Mongo.
