Why Google Currently Rules The Web Search Roost (Big-Time)
Dec 18, 2007 at 4:32 AM Thread Starter Post #1 of 6
[size=medium]As many of you know, Head-Fi.org suffered its worst outage since its 2001 founding, on November 10, 2007, remaining down for about two weeks (coming back online early in the morning of November 25, 2007). Long story short, a storage appliance's malfunction (both hardware- and software-related) corrupted a lot of data, primary and backups. The two weeks of downtime was a result of the time it took to reconstruct the data as best we could, with some difficult decisions to make along the way. One of the toughest decisions was the primary goal--what to focus the most on in the site's restoration. In one way, this was the easiest decision: Get as much of the post/thread/forum/user registration data back as we could. Secondary (but not considered lightly) was photo hosting data, post attachment data, and the private message data, all of which, during the outage, we could not give as much consideration to, without increasing the total outage time significantly (there is some chance the photo hosting data can still be recovered, but there is much to do ahead of that, and I am in no position to make any promises about it one way or the other at this point; the post attachment data was lost; the private message data was restored with compromises).

What made what would seem like a straightforward primary goal a series of very difficult decisions was discovering that we had the data necessary to get the post/thread/forum/user data restored up to the very moment of the outage--but that restoring it to that point would require an importing process that would result in every single URL (about 4,000,000 URLs, if I'm not mistaken) being re-written. (We did try standard restore/repair procedures, but with no luck--a big thanks anyway to the excellent vBulletin Support Team for their guidance in our repair attempts, answering my many questions, even on Thanksgiving afternoon--Steve, I hope the turkey was good that night.) In other words, almost every single forum/post/thread URL (and many user profile URLs) that had been formed by the forum software over the last 6 1/2 years would end up changed at the end of the restoration. Some of the downsides to this are quite obvious, including the fact that virtually all Head-Fi.org page URLs bookmarked prior to the outage are almost certainly now pointing to completely different topics. Also, pages from other websites that linked in to thousands upon thousands of different places within Head-Fi.org, pre-outage, now mostly point to the wrong pages. One of the most obvious collateral-damage effects from this site-wide internal URL shifting is the effect it has had on search results on the major web search engines that point to Head-Fi.org. But, with Head-Fi.org typically receiving between 2700 and 3100 new posts per day (which is a lot for an audio-specific forum, and certainly for one that focuses on headphones and personal audio), the decision to sacrifice the old URL structure instead of trying to maintain that structure (but potentially losing weeks or months of content) was a painful but obvious choice, given the decision matrix we were faced with.

Prior to the outage, I hadn't given much consideration to SEO (search engine optimization), with respect to Head-Fi.org. Over the years, this site has grown into what is arguably the authority web resource for high-end personal audio and headphones, the clear heart of it all being our forums, and the content we create within those forums as a community. As its position as an authority site in its space continued to grow, Head-Fi.org continued to rank higher and higher for relevant search queries at the major web search engines (especially Google), with more than half of outside referrals to Head-Fi.org coming from search engines. This URL-change situation required that I start looking at Head-Fi-related SEO closely, which included the licensing and installation of specialized forum SEO software for the purpose. If you're a Head-Fi.org veteran, you probably noticed that the URLs are now formed with words from thread titles (and/or usernames, where applicable), and with the use of more search-engine-spider-friendly static URLs. Normally, with the installation of this SEO software, the existing URLs are simply 301-redirected to the newly-formed URLs, and all is well (that is, any old, bookmarked/cached URLs would still point to the right content). In our case, however, with all the base URLs changed during the recovery import process, 301-redirecting from the original (pre-outage) URLs to the post-outage URLs wasn't practical or possible. So, over time, Head-Fi.org will simply have to have its pages/URLs completely re-indexed by the major web search engines, and external links will have to be repaired to point to the post-outage URLs (and added to) over time. And now that I've been paying such close attention to all things SEO, I have made some interesting observations about the various web search engines, their crawling/indexing activity, and the resultant output on their respective SERPs (search engine results pages).
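For anyone curious why the usual 301 approach breaks down once the underlying IDs shift, here is a minimal Python sketch. It is not the SEO software's actual logic; the URL layout, the `slugify`, `static_url`, and `redirect` helpers, and the thread IDs and titles are all made up for illustration.

```python
import re

def slugify(title):
    """Lowercase the title and join its alphanumeric runs with hyphens."""
    return "-".join(re.findall(r"[a-z0-9]+", title.lower()))

def static_url(thread_id, title):
    """Build a keyword-bearing static URL (hypothetical layout)."""
    return f"/forums/{slugify(title)}-{thread_id}/"

def redirect(old_thread_id, threads):
    """Answer a request for an old dynamic URL, keyed only by its thread ID.

    Returns (301, new_url) if the ID exists in the current table,
    otherwise (404, None).
    """
    title = threads.get(old_thread_id)
    if title is None:
        return 404, None
    return 301, static_url(old_thread_id, title)

# Hypothetical thread tables: the same two threads before and after a
# re-import that renumbers them.
before = {101: "Best Closed Headphones?", 102: "Amp Recommendations"}
after = {101: "Amp Recommendations", 103: "Best Closed Headphones?"}

print(redirect(101, before))  # (301, '/forums/best-closed-headphones-101/')
print(redirect(101, after))   # (301, '/forums/amp-recommendations-101/')
print(redirect(102, after))   # (404, None)
```

With stable IDs the redirect lands on the right thread. After renumbering, the same old ID either redirects to a different thread or finds nothing at all, so a correct redirect would need a complete old-ID-to-new-ID map.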

(Grizzled SEO veterans may find this discussion of very limited I’m-learnin’-somethin’ value, but I think most folks would be surprised at the stark differences in search engine performance, from the spidering side, indexing side, and their end products--the quality and freshness of their search results.)

One of the many helpful things the forum SEO software application we installed does is track the spidering activity of the various web search engines, providing rather detailed reporting on which search engine spiders have touched the site, and which specific sections of the site they've crawled.
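To give a rough idea of what that kind of tracking involves, here is a small Python sketch that tallies crawler requests by user-agent string. This is only my guess at the general approach, not how the SEO software actually does it; the `count_spider_hits` function, the matching tokens, and the sample log lines are all assumptions for illustration.

```python
from collections import Counter

# User-agent substrings assumed to identify each engine's crawler.
SPIDERS = {"Googlebot": "Google", "Slurp": "Yahoo", "msnbot": "Microsoft Live"}

def count_spider_hits(log_lines):
    """Count requests per search engine by matching the user-agent text."""
    hits = Counter()
    for line in log_lines:
        lowered = line.lower()
        for token, engine in SPIDERS.items():
            if token.lower() in lowered:
                hits[engine] += 1
                break  # one engine per request
    return hits

# Made-up access-log lines, for illustration only.
sample = [
    '192.0.2.1 "GET /forums/f4/ HTTP/1.1" 200 "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '192.0.2.1 "GET /forums/f5/ HTTP/1.1" 200 "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '192.0.2.2 "GET /forums/f4/ HTTP/1.1" 200 "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)"',
    '192.0.2.3 "GET /forums/ HTTP/1.1" 200 "msnbot/1.0 (+http://search.msn.com/msnbot.htm)"',
    '192.0.2.4 "GET /forums/ HTTP/1.1" 200 "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)"',
]

hits = count_spider_hits(sample)
for engine, count in hits.most_common():
    print(engine, count)
```

Dividing one engine's daily count by another's gives the kind of "N times the crawled URLs" comparison I describe below.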

For the first two days after the site was put back up (again, with most all URLs changed), the two biggest web search engines--Google (#1) and Yahoo (#2)--were close to each other in crawling activity (based on the number of URLs reached by their spiders), with Google edging Yahoo out on Head-Fi.org’s first day back by just under 200 URLs; and Yahoo edging Google out on the second day by just under 150 URLs. Microsoft’s Live--the only other web search engine that crawled enough pages to register on the summary report page--crawled, over those first two days, only about 15.5% as many URLs as either Yahoo or Google.


On The Third Day….

But, on the third day back, things started to look dramatically different. Whereas Yahoo crawled approximately the same number of URLs as it did on days one and two, Googlebot cranked up its rate of crawling, ending up with 29 times the number of URLs crawled versus Yahoo on that day (and more than 120 times Microsoft’s tally for the day). For most of Head-Fi.org’s first 19 days back up, Google kept up the frenetic spidering pace, averaging (over those 19 days) 23 times the crawled URLs versus Yahoo, and 148 times the crawled URLs versus Microsoft! Look at the graph below to get a clear visual of the relative difference in crawling activity between the three top search engines in the first 19 days back up. (Click on the graph below to see Yahoo and Microsoft isolated, to compare the two without Google’s heavy effect on the linear y-axis.)

http://hfimage.head-fi.org/google/Yahoo-msnbot.gif
http://hfimage.head-fi.org/google/Go...hoo-msnbot.gif


Google Fresh

No webmaster would want his site being spider-pummeled all day long for nothing, and Google’s Googlebot spider(s) crawl and pull down a lot of data from Head-Fi.org on any given day--but it’s definitely not for naught, as the Google web search results clearly show. I am in awe at how quickly Google can crawl content, do its thorough analysis of content for countless factors relative to pages both within and outside a site (determining content relevancy; identifying duplicate content, spam, unethical SEO practices; etc., etc.), and then quickly get relevant content into its web index (in our case, sometimes within minutes of the content being posted to Head-Fi.org) for relevant search queries. Currently, no other web search engine (from what I can tell) comes remotely close to achieving the search results freshness that Google serves up. Look at the screenshots below to get an example of just how quickly content from Head-Fi.org can be picked up and included in Google’s search index, ready for the world to query and find:

(NOTE: I know a couple of the queries below are obviously long-tail queries. I just copied thread titles exactly as they were, and used them as queries to pull up specific results for the sake of example.)

In this screenshot (below), the search results show a thread URL that was picked up one hour prior to the query (at 2050 EST), meaning that Google picked up that URL less than two hours after the originating post was made at 1756 EST.

http://hfimage.head-fi.org/google/1-hour_3.gif
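The arithmetic behind that "less than two hours" figure can be checked with a few lines of Python. The times come from the screenshot description above; the `hhmm` helper is just something made up for the example, and the "1 hour" age is taken at face value even though Google presumably rounds it.

```python
from datetime import timedelta

def hhmm(text):
    """Turn a 24-hour clock string such as '1756' into a time-of-day offset."""
    return timedelta(hours=int(text[:2]), minutes=int(text[2:]))

posted = hhmm("1756")           # originating post, EST
queried = hhmm("2050")          # time of the search query, EST
age_shown = timedelta(hours=1)  # age displayed next to the result

picked_up = queried - age_shown  # 19:50 EST
lag = picked_up - posted

print(lag)  # 1:54:00
```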


In this screenshot (below), the search results show a thread URL that was picked up just 35 minutes prior to the query (at 2048 EST), and so picked up in just over two hours after the thread’s originating post was made.

http://hfimage.head-fi.org/google/35-minutes.gif


In this screenshot (below), the result from Head-Fi.org was put up just 19 minutes prior to the query, from a thread created just hours before.

http://hfimage.head-fi.org/google/19-minutes.gif


Remarkable.

Once again, it’s obvious that, as of right now, none of Google’s web search engine competitors comes close when it comes to delivering relevant, fresh web search results, which brings me to a little aside:

<begin aside>

Do I think it will always be this way? Not necessarily. Who do I think is most likely to close the gap fastest? I’d have to say Microsoft, even though they’re currently far behind #2 Yahoo, in terms of search market share. Microsoft, in its latest update of Live Search, made substantial strides, in terms of improving the relevance of its search results. However, based on the spidering numbers I’m seeing on Head-Fi.org (I know, n=1 is hardly a representative sample), Microsoft is still far behind #2 Yahoo (which itself is far behind #1 Google), in terms of spidering activity--and, on a site like Head-Fi.org (that is updated very frequently, due to it being a busy forum site), high spidering activity is necessary to achieving good search results freshness. But I’ve learned over the years not to count Microsoft out, even when they’re in a position of severe disadvantage. Microsoft is stacked with cash that history has shown us they’re willing to throw at whatever big problem or segment they target. Remember (in what feels to me like eons ago) when Netscape was at 95+% of the browser market, and how bad IE 2.x was? Microsoft re-tooled (and invested heavily) to focus on the Internet, and ended up dominating browser market share eventually (and still does today), with a combination of OS-bundling and serious improvements to their browser over the years. (And though I do like Firefox for all its available add-ons, I still also use IE, as it doesn't leak memory like Firefox-the-memory-hog, which is something the Mozilla team will have to fix if Firefox is ever going to be a serious player in the mobile browser market).
As a PDA reviewer years ago, I remember receiving loaner units of the first generation of Microsoft Windows CE-powered handheld PCs and Palm PCs (later re-named "Pocket PCs," after being sued by Palm), and marveling at how incomplete they felt, even compared to the simpler (but much more elegant) PalmPilot's Palm OS. Fast forward to 2007, and I dumped my Palm OS PDA years ago for Pocket PCs powered by Windows Mobile (which is a more current, complete, elegant version of Windows CE, that, for my purposes, now trounces Palm); and this very blog post is being drafted on an old NEC MobilePro handheld PC, running one of the first decent versions of Windows CE. Also, Windows Mobile is now the OS of choice for many of the world's latest generation of smart mobile phones. Again, I'll not count Microsoft out of any segment (including web search) until they give me strong reason to. I think we'll see some very serious gains in their web search prowess in the coming months and years, but catching up to Google may end up being one of their biggest challenges yet. Web search is OS-independent (so there's less in this space for Microsoft to leverage with their OS dominance), and Google is hardly sitting still when it comes to advancing that front, with Google currently getting better faster than all of its web search competitors (in my opinion), only further padding its lead as time goes by. Both Microsoft and Google continue to build immense data centers worldwide, amassing vast computing power to apply to web search and other applications. For now, though, it just seems to me that Google navigates with greater ease (as a company) through the web and what it is, and what it's becoming, than Microsoft currently does.

</end aside>


What Head-Fi.org SEO-Related Challenges Remain, Post-Outage?

There are still some challenges that remain with respect to SEO and a post-outage Head-Fi.org. First, the non-Google web search engines--because they do not crawl at nearly the rate that Google does, and because they don’t put up what they do crawl as fast as Google does--will likely not send a lot of correctly-pointed traffic Head-Fi’s way any time very soon, due to the fact that they’re simply not picking up and displaying the revised URLs in great abundance yet. Let me make clear that I’m not asserting that they’re in any way responsible for fixing what the outage caused, but only making the comparison between how Yahoo and Microsoft spider, index and present, versus Google, post-outage.

And even though Google is far surpassing its competitors with respect to crawling, indexing, and representing the post-outage Head-Fi.org, there is still a long way to go before the revised URLs are completely crawled and indexed, and the old (pre-November 10, 2007) URLs are completely replaced with them in the Google web index. But, again, nobody comes close to the progress Google has made in this regard, and I’m noticing that, increasingly, the revised URLs are starting to replace the old URLs in Google’s index. For example, a blog post of mine that was put back up post-outage (again, with a different URL) is now being shown in Google ahead of the old, cached, pre-outage version of it (I’m starting to see these older results dropping, too, as Google discovers that they’re unreachable and/or no longer representative, or less representative, of the query). No other search engine has picked up this change yet. In fact, Yahoo, for the same query, doesn’t even show any version, new or old, of that blog post. Part of this is probably how Yahoo and Google treat the query differently, with Google apparently giving much more weight to word proximity and order. (To get this query to pull up something relevant in Yahoo, you have to query it as a phrase--framing the query in quotation marks.) Microsoft’s Live, for the same query, shows a cached version of the post, which now points to the wrong place, after the post-outage URL re-writes (but at least it’s showing a more relevant result than Yahoo).

Long story short, this site and its unique situation with respect to its pre- and post-outage states, is one clear illustration of why Google still rules the game of searching the World Wide Web, and likely will for some time to come. I’m thankful to see Google making such strong progress on the re-indexing and representation of post-outage Head-Fi.org in its web search index. Of course, I wish that I could say the same for Yahoo and Microsoft's Live Search, but, based on what I've seen so far, it'll be a much longer time coming on those search engines.

And to all of you, thanks for coming back to Head-Fi.org after the outage. I missed y’all, and I’m glad as heck we're back.
_________________


Some sites and articles to read, relevant to this post:
  1. [size=medium]Google’s Matt Cutts’ blog post on Google’s “Minty Fresh Indexing.”[/size]
  2. [size=medium]An interesting article on the power of “cloud computing,” and the various players (and their datacenters), including the aforementioned companies.[/size]
  3. [size=medium]SearchEngineLand.com is currently my favorite site for web search engine industry news (with great feeds).[/size]
  4. [size=medium]SEOmoz.org is a great resource for information on SEO and SEM (search engine marketing), and where I am a Premium Member (and it’s worth every penny so far).[/size]
  5. [size=medium]WebMasterWorld is another great resource for webmaster information of all stripes, including, of course, SEO and SEM. I am also an annual subscriber to their premium content.[/size]
[/size]
 
Dec 21, 2007 at 6:32 PM Post #2 of 6
Very interesting read!
 
Dec 26, 2007 at 12:49 PM Post #3 of 6
x2. Didn't really know how all this worked until now.
 
Mar 1, 2008 at 10:40 PM Post #6 of 6
I have to say that head-fi's frequent appearance in Google (along with me joining the forums) has prompted me to install SEO on the forum I run, MacTalk Australia.

I sympathise entirely with the downtime issues, as we've had our own over the past few years. As a result, I've learned never to trust any single back-up, and to back up to multiple locations for safety.

Thanks for all the efforts you've made. I'm enjoying being a member immensely.
 
