This post was originally published on Cooper Hewitt Labs on Dec 21, 2011.
When I first started working at the Cooper-Hewitt in June of 2010, I was really interested in making our web presence more searchable. This is no simple task, and it is one I am still working on today. So, I thought I might document some of the projects we have going on with regard to “search” at the Cooper-Hewitt.
The Google Search Appliance
One of the nice things about being part of the larger Smithsonian Institution is that we have some amazing resources. When I first started working here I took a day trip down to Washington DC to meet with Smithsonian’s CTO and some of the folks who run the Smithsonian data center in Herndon, Virginia. They took me on a great tour of the facilities and talked about many of the unique resources I had at my fingertips. One such resource was our clustered Google Search Appliance. It’s no big surprise that Google, one of the biggest names in the search business, offers its technology to the enterprise.
Our setup provides the entire Smithsonian Institution with an enterprise-class web crawler and index database. Shown in the picture below, it consists of two separate clusters of five 2U Google Search Appliances. One cluster of five is simply a hot spare in case the other goes down.
The Google Search Appliance works in a similar way to Google itself. It constantly crawls across all of Smithsonian’s web properties, updating its index and providing a front-end for search. In fact, you can try it out yourself at http://search.si.edu where you can search across our entire network, or simply select the Smithsonian unit you are interested in. As you can see, results are displayed by relevance and follow a format similar to Google.com.
The Google Search Appliance is a great tool and we now have it integrated with our main website. If you want to try it out, go to http://cooperhewitt.org/search/gsa. This should give you results from the Google Search Appliance that span most of our web properties.
Of course, the GSA comes with a few strings attached. First, it’s a web crawler, so it indexes web properties in an automated way: you tell it where to start, and it finds pages by following each link from one page to the next. This works fine in most cases, but you really don’t have much control over how it crawls and what it finds. It’s also a device shared by the entire Smithsonian Institution, so it comes with some restrictions on how it can be utilized and customized. As well, it’s a completely off-the-shelf solution based on proprietary code. In other words, it’s not open source, and thus can’t be hacked or altered or customized to do fun stuff! Lastly, these devices are pretty expensive. We are lucky to have them in our data center, but many of you reading this post will need a less costly option.
Google Custom Search
One such option might be Google’s Custom Search. This solution simply leverages Google’s existing infrastructure to create a site-specific search. Of course, you have little control over how and when your site gets indexed, but it’s free and easy to set up. I’ve used Google Custom Search for a few personal sites that needed a hosted search solution.
You can try out our custom search site here. This one searches our main site as well as a number of other web properties we have. One nice thing about the GCS is that it can be hosted on Google’s site, or embedded on your own site through a number of layout options.
Drupal and WordPress
Nearly all Content Management Systems come with some type of search feature. The index is simply updated for you automatically whenever you author a new post; there is no web crawler involved because none is necessary. However, this type of search comes with many caveats. First, you can only index the site managed by the CMS, so unlike the Google Search Appliance, you can’t use it to search multiple web properties on multiple domains. Also, the search baked into many popular CMSes tends to be pretty limited in how it functions on the front end. WordPress is notoriously bad at displaying search results out of the box. It simply displays them in alphabetical order (not too helpful!).
Drupal is a little better, with a decent advanced search page and the ability to sort results by relevance. But it is still pretty basic.
On the other hand, this type of manual indexing is very important because you can ensure that every page you author gets indexed, as opposed to just hoping a web crawler finds your pages. Another advantage with Drupal is that the search back-end is fairly extensible and can be modified with modules. For example, our Google Search Appliance is easily integrated with our main website through this contributed module for Drupal.
While thinking about ways to improve search, we realized that we really needed to do our own thing. We needed both modes of indexing (web crawling and manual indexing). We also needed something we had full control over, and something that was open source and enterprise class.
We settled on Apache Solr. In the words of the project’s website, Solr is “the popular, blazing fast open source enterprise search platform from the Apache Lucene project. Its major features include powerful full-text search, hit highlighting, faceted search, dynamic clustering, database integration, rich document (e.g., Word, PDF) handling, and geospatial search. Solr is highly scalable, providing distributed search and index replication, and it powers the search and navigation features of many of the world’s largest internet sites.”
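To make the “full-text search” part concrete, here is a minimal sketch of talking to Solr’s select handler over HTTP. The host, core name (“objects”), and field names are all hypothetical, and no server is assumed: the sketch only builds the request URL and parses a response in the JSON shape Solr’s select handler returns.

```python
import json
from urllib.parse import urlencode

def solr_select_url(base="http://localhost:8983/solr/objects", q="*:*",
                    rows=10, fl="id,title"):
    """Build a Solr /select URL. The host and core name are hypothetical."""
    params = {"q": q, "rows": rows, "fl": fl, "wt": "json"}
    return "{}/select?{}".format(base, urlencode(params))

# A response in the shape Solr's JSON response writer produces (sample data).
sample = json.loads("""{
  "responseHeader": {"status": 0},
  "response": {"numFound": 2, "start": 0, "docs": [
    {"id": "18704235", "title": "Poster, Solr"},
    {"id": "18704236", "title": "Drawing, Nutch"}
  ]}
}""")

def titles(resp):
    """Pull document titles out of a parsed select response."""
    return [doc["title"] for doc in resp["response"]["docs"]]

print(solr_select_url(q="title:poster"))
print(titles(sample))  # → ['Poster, Solr', 'Drawing, Nutch']
```

In a real deployment you would fetch that URL and feed the JSON straight into your front end; faceting and highlighting are just extra request parameters on the same handler.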
And Nutch is a complementary project to Solr that provides an enterprise-class web crawler. Nutch can help us crawl across multiple domains and web properties. It can also scale nicely using another Apache project known as Hadoop.
With Solr and Nutch (and maybe a little Hadoop) we should be able to build a pretty sophisticated platform for search. In fact, this has already been done at the Smithsonian in their Collections Search project, where you can search across nearly every Smithsonian unit.
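The “manual indexing” half of the plan boils down to POSTing documents to a Solr core’s update handler. A sketch of building that request follows; the documents and field names are made up for illustration, and the endpoint follows Solr’s JSON update API rather than anything specific to our setup.

```python
import json
from urllib.request import Request

def solr_add_request(base, docs):
    """Build an HTTP request that adds documents to a Solr core.
    Solr's JSON update handler accepts a list of documents in the body;
    commit=true makes them searchable immediately."""
    body = json.dumps(docs).encode("utf-8")
    return Request(base + "/update?commit=true", data=body,
                   headers={"Content-Type": "application/json"},
                   method="POST")

# Hypothetical documents; these field names are illustrative, not a real schema.
docs = [{"id": "obj-1", "title": "Sample poster", "unit": "cooper-hewitt"}]
req = solr_add_request("http://localhost:8983/solr/objects", docs)
print(req.full_url)
print(req.data.decode("utf-8"))
```

Sending `req` with `urllib.request.urlopen` (against a running Solr) would index the documents, while Nutch would feed the same core from its crawl, giving both modes of indexing a single search back-end.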
In my next post I will dig in deep and show you how we get these things up and running.