Feb 17 2009

Canonical URLs get a boost

Published at 3:41 am

Almost missed this piece of awesome SEO news (thanks Twitter): Google, Yahoo, and Microsoft have teamed up on a simple new protocol for specifying canonical URLs.

What are canonical URLs?

In short, a canonical URL is the preferred, or main, URL for a given document. Most people have noticed that you can get to a given website, say www.example.com, by leaving off the www. part. www.example.com and example.com take you to the same place, so to speak. But only one of those is the canonical URL. For example, I only ever use http://burden.ca/blog/ in links to my blog. I’ve chosen that as the canonical URL, even though http://www.burden.ca/blog/ works, as does http://burden.ca/blog (no trailing slash).
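To make the idea concrete, here is a minimal Python sketch that maps the variants mentioned above onto one preferred form. The function name and both policies (drop the www. prefix, add a trailing slash to directory-style paths) are assumptions chosen to match the burden.ca example, not a universal rule; each site picks its own canonical form.

```python
from urllib.parse import urlsplit, urlunsplit


def canonicalize(url: str) -> str:
    """Map equivalent URL variants to a single preferred form."""
    parts = urlsplit(url)
    host = parts.netloc.lower()

    # Assumed policy: the non-www host is the canonical one.
    if host.startswith("www."):
        host = host[len("www."):]

    path = parts.path or "/"

    # Assumed policy: directory-style paths end in a slash. Paths whose
    # last segment contains a dot (e.g. page.html) are left alone.
    last_segment = path.rsplit("/", 1)[-1]
    if not path.endswith("/") and "." not in last_segment:
        path += "/"

    # The fragment is dropped because it never reaches the server.
    return urlunsplit((parts.scheme.lower(), host, path, parts.query, ""))


variants = [
    "http://burden.ca/blog/",
    "http://www.burden.ca/blog/",
    "http://burden.ca/blog",
]
canonical_forms = {canonicalize(v) for v in variants}
```

All three variants collapse to the same string, which is the point: many URLs that work, one that is canonical.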

Who cares?

If you’re a site owner or webmaster, you should care for two reasons. First, search engines have a heck of a time with duplicate content. If they see duplicate content all over the place, they could exclude those URLs from their search results.

Second, and related, is that your PageRank will be distributed unfavourably. Google – and Yahoo and Microsoft use similar algorithms – counts the number of incoming links to a given page to determine how important that page is, and how far up in the search results a given page appears. A webmaster’s (and particularly a news website’s webmaster’s) primary goal in life is to get his pages high up in the search results where people can see them, so this is important. If incoming links to a given page point to different URLs, the URLs will all have less PageRank and appear lower in search results. This is a Bad Thing ™.

Worse, the PageRank score Google reports is generally understood to sit on a roughly logarithmic scale, so each step up the scale takes far more incoming links than the step before it. If you have 5,000 incoming links pointing at the www version and 1,000 links pointing at the non-www version, neither URL gets credit for all 6,000, and the extra PageRank you could have had from a single consolidated URL is gone.
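The split can be made concrete with a toy tally in Python. It only counts links, which is a deliberate simplification: real PageRank also weights each link by the rank of the page it comes from. The example.com URLs and the strip_www helper are hypothetical, and the link counts are the ones used above.

```python
from collections import Counter

# Hypothetical inbound-link targets, using the numbers from the text.
inbound_links = (
    ["http://www.example.com/"] * 5000
    + ["http://example.com/"] * 1000
)


def strip_www(url: str) -> str:
    """Assumed canonical form for this sketch: the non-www host."""
    return url.replace("://www.", "://", 1)


# How a search engine sees the links when the two URLs are treated
# as separate pages.
split_counts = Counter(inbound_links)

# The same links once every variant is folded into one canonical URL.
merged_counts = Counter(strip_www(url) for url in inbound_links)
```

In the split tally the two URLs hold 5,000 and 1,000 links; in the merged tally a single URL holds all 6,000.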

How does the new canonical tag help?

Until now there have been two main ways to deal with canonicalization issues. The first is to make sure your content management system never generates anything but canonical URLs, and returns a 404 for any request that is not a canonical URL.

The second way is to use the web server software behind your site to issue 301 redirects to browsers telling them where the canonical URL is. Google et al can follow these redirects to the correct locations and even preserve PageRank through the redirect.
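In practice the redirect usually lives in the web server's configuration, but the decision it makes is simple enough to sketch in Python. The host name, the function name, and the (status, location) return shape below are assumptions for illustration, and the sketch assumes the server hosts only this one site.

```python
from typing import Optional, Tuple
from urllib.parse import urlsplit, urlunsplit

CANONICAL_HOST = "burden.ca"  # assumed preferred host for this sketch


def redirect_decision(url: str) -> Tuple[int, Optional[str]]:
    """Return (status, location) for a requested URL.

    200 with no location means "serve the page as-is"; 301 means
    "redirect permanently to the canonical URL in location".
    """
    parts = urlsplit(url)
    if parts.netloc.lower() == CANONICAL_HOST:
        return 200, None

    # Keep the scheme, path, and query; swap in the canonical host.
    target = urlunsplit(
        (parts.scheme, CANONICAL_HOST, parts.path, parts.query, "")
    )
    return 301, target
```

A request for http://www.burden.ca/blog/ yields a 301 pointing at http://burden.ca/blog/, while a request for the canonical URL itself is served normally.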

But not everyone has control of their CMS or of their web server. This new protocol, supported by the big three search engines, is a one-line snippet of HTML anybody can drop into the head section of their pages.
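The snippet in question is a link element with rel="canonical" whose href is the preferred URL. For a site that builds its pages from templates, a small Python helper can generate it. The helper name is hypothetical; the tag format is the one the search engines announced.

```python
from html import escape


def canonical_link_tag(canonical_url: str) -> str:
    """Build the canonical link element for a page's head section."""
    # Escape the URL so characters such as & and " cannot break
    # out of the href attribute.
    href = escape(canonical_url, quote=True)
    return f'<link rel="canonical" href="{href}" />'


tag = canonical_link_tag("http://burden.ca/blog/")
```

Every variant of the page (www, non-www, missing trailing slash) would carry the same tag, so the search engines know which URL to credit.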

In this interview, Google evangelist Matt Cutts says that as much as 36% of the content stored in its index is duplicated content. Now, everyone can do their part to reduce clutter and make the web a better place.

Here’s a longer explanation of canonical URLs.

One Response to “Canonical URLs get a boost”

  1. [...] If you don’t know why that’s bad, get as far away from control of your newspaper’s website as you can. But here’s a hint. [...]
