Wednesday, August 22, 2007

How to Defend your Website from the Google Duplicate Proxy Exploit

There is a current and active way to knock a website out of
Google's search engine results. It's simple and effective.
This information is already in the public domain, and the
more people who know about it, the more likely it is that
Google will do something about it. This article will tell
you how the exploit works and how it can knock a website out
of the search engine rankings, but most importantly, how to
defend your own website from having it happen to you.

To understand this exploit, you must first understand
Google's Duplicate Content filter. It's simply described
thus: Google doesn't want you to search for "blue widget"
and have the top 10 results all be copies of the same
article on how great blue widgets are. They want to give you
ONE copy of the Great Blue Widget article, and 9 other,
different results, just on the off chance that you've
already read that article and one of the other results is
actually what you wanted.

To handle this, every time Google spiders and indexes a
page, it checks to see if it already has a page that is
predominantly the same, a duplicate page if you will.
Exactly how Google works this out, nobody outside Google
knows, but it is likely to be a combination of some or all
of: page text length, page title, headings, keyword
densities, checking for exact-copy sentence fragments, etc.
As a result of this duplicate content filter, a whole
industry has grown up around trying to get around the
filter; just search for "spin article".

Getting back to the story: Google indexes a page, and let's
say it fails its duplicate content check. What does Google
do? These days, it dumps that duplicate page in Google's
Supplemental Index. What, you didn't know that Google has 2
indexes? Well they do: the main one and the supplemental
one. Two things are important here: Google will always
return results from its main index if it can, and it will
only go to the supplemental index if it doesn't get enough
joy from the main one. What this means is that if your page
is in the supplemental index, it's almost certain that you
will never show up in the Search Engine Results Pages
(SERPs), unless there is next to no competition for the
phrase that was searched for.

This all seems pretty reasonable to me, so what's the
problem? Well, there's another little step I haven't
mentioned yet. What happens if someone copies your page,
let's say the homepage of your business website, and when
Google indexes that copy, it correctly determines that it's
a duplicate? Google now knows about 2 pages that are
duplicates of each other, and it has to decide which to dump
in the supplemental index and which to keep in the main one.
That's pretty obvious, right? But how does Google know which
is the original and which is the copy? They don't. Sure,
they have some clever algorithms to work it out, but even if
those are 99% accurate, that leaves a lot of problems for
the 1% of the time they get it wrong!

And this is the heart of the exploit: if someone copies
your website's homepage, say, and manages to convince Google
that *their* page is the original, your homepage will get
tossed into the supplemental index, never to see the light
of day in the Search Engine Results Pages again. In case
I'm not being clear enough, that's bad! But wait, it gets
worse:

It's fair to say that in the case of a person physically
copying your page and hosting it, you can often get them to
take it down through the use of copyright lawyers and
cease-and-desist letters to ISPs and the like, followed by a
quick "Reinclusion Request" to Google. But recently there's
a new threat that's a whole lot harder to stop: the use of
publicly accessible proxy websites. (If you don't know what
a proxy is, it's basically a way of making the web run
faster by caching content closer to the people requesting
it. In principle, proxies are generally a good thing.)

There are many such web proxies out there, and I won't list
any here, but I will describe the process: they send out
spiders (much like Google's) that spider your page and take
your content, then they host a copy of your website on
their proxy site, nominally so that when their users
request your page, they can serve up their local copy
quickly rather than having to retrieve it from your server.
The big issue is that Google can sometimes decide that the
proxy's copy of your web page is the original, and yours is
not.

Worse again, there's some evidence that people are
deliberately and maliciously using proxy servers to cache
copies of web pages, then using normal (white and black
hat) Search Engine Optimization (SEO) techniques to make
those proxy pages rank in the search engine, increasing the
likelihood that your legitimate page will be the one dumped
by the search engines' duplicate content filters. Danger
Will Robinson!

Even worse still, some of the proxy spiders actively spoof
their origins so that you don't realise it's a spider from
a proxy: they pretend to be Googlebot, for example, or
Yahoo's spider. This is why the major search engines
actively publish guidelines on how to identify and validate
their own spiders.
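
To make that concrete, here is a minimal PHP sketch of the
double-DNS check Google describes for verifying Googlebot: a
reverse DNS lookup on the requesting IP, a check of the
resulting hostname, then a forward lookup to confirm it
resolves back to the same IP. The function name is my own,
and it assumes the visitor's real IP arrives in REMOTE_ADDR
(i.e. no load balancer sitting in front of your server):

<?php
// A minimal sketch of the double reverse-DNS check that Google
// publishes for verifying Googlebot. The function name is illustrative.

function is_real_googlebot($ip) {
    $host = gethostbyaddr($ip);        // reverse DNS lookup (PTR)
    if ($host === false || $host === $ip) {
        return false;                  // lookup failed or no PTR record
    }
    // Genuine Googlebot hosts end in googlebot.com or google.com
    if (!preg_match('/\.(googlebot|google)\.com$/i', $host)) {
        return false;
    }
    // Forward-confirm: the hostname must resolve back to the same IP,
    // otherwise the PTR record could simply have been faked.
    return gethostbyname($host) === $ip;
}

if (is_real_googlebot($_SERVER['REMOTE_ADDR'])) {
    // Genuine Googlebot: safe to treat as a search engine spider.
}
?>

A spoofing proxy can fake its User-Agent string, but it
cannot fake the reverse DNS records for its IP address,
which is why this check is so much stronger than trusting
what the spider claims to be.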

Now for the big question: how can you defend against this?
There are several possible solutions, depending on your web
hosting technology and technical competence:

Option 1 - If you are running Apache and PHP on your
server, you can set the webhost up to check for spiders
that purport to be from the main search engines, and, using
PHP and the .htaccess file, block proxies from other
sources. However, this only works for proxies that are
playing by the rules and identifying themselves correctly,
as in the sketch below.
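
As a rough sketch of that idea: well-behaved proxies
announce themselves through standard HTTP headers such as
Via and X-Forwarded-For, which a few lines of PHP can check.
The blanket 403 response below is a policy choice for
illustration only; be aware that some legitimate services
(corporate gateways, for example) also set these headers:

<?php
// A rough sketch of Option 1: turn away well-behaved proxies that
// announce themselves through standard HTTP headers.

$proxyHeaders = array('HTTP_VIA', 'HTTP_X_FORWARDED_FOR', 'HTTP_FORWARDED');

foreach ($proxyHeaders as $header) {
    if (!empty($_SERVER[$header])) {
        header('HTTP/1.1 403 Forbidden');
        exit('Proxy access is not permitted.');
    }
}
?>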

Option 2 - If you are using MS Windows and IIS on your
server, or if you are on a shared hosting solution that
doesn't give you the ability to do anything clever, it's an
awful lot harder and you should take the advice of a
professional on how to defend yourself from this kind of
attack.

Option 3 - This is currently the best solution available,
and applies if you are running a PHP- or ASP-based website:
you set ALL pages' robots meta tags to noindex and nofollow,
then you implement a PHP or ASP script on each page that
checks for valid spiders from the major search engines and,
if the check passes, resets the robots meta tags to index
and follow. The important distinction here is that it's
easier to validate a real spider, and to discount a spider
that's trying to spoof you, because the major search
engines publish processes and procedures for doing this,
including IP lookups and the like.
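
Here is a minimal PHP sketch of that approach. The
is_verified_spider() function is my own illustration, and
the crawler host suffixes reflect the engines' published
guidance at the time of writing; verify them against each
engine's current documentation before relying on them:

<?php
// A sketch of Option 3: every page defaults to noindex,nofollow, and
// only a spider that passes the reverse/forward DNS check gets
// index,follow.

function is_verified_spider($ip) {
    $host = gethostbyaddr($ip);
    if ($host === false || $host === $ip) {
        return false;
    }
    // Googlebot, Yahoo Slurp and msnbot crawler host suffixes
    if (!preg_match('/\.(googlebot\.com|google\.com|crawl\.yahoo\.net|search\.msn\.com)$/i', $host)) {
        return false;
    }
    // Forward-confirm the hostname back to the original IP
    return gethostbyname($host) === $ip;
}

$robots = is_verified_spider($_SERVER['REMOTE_ADDR'])
    ? 'index,follow'
    : 'noindex,nofollow';
?>
<html>
<head>
<title>Example page</title>
<meta name="robots" content="<?php echo $robots; ?>">
</head>
<body>
Page content here.
</body>
</html>

The nice property of this approach is that a proxy's
spider, however it spoofs its User-Agent, fails the DNS
check and so fetches the noindex,nofollow version of your
page. The copy it then hosts tells the search engines not
to index it, which defuses the duplicate content problem at
the source.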

So, stay aware, stay knowledgeable, and stay protected.
And if you see that you've suddenly been dumped from the
Search Engine Results Pages, now you might know why, how,
and what to do about it.


----------------------------------------------------
Sophie White is an Internet Marketing and Website Promotion
Consultant at http://www.intrinsic-marketing.co.uk, an SEO
and Pay Per Click firm dedicated to supplying better
website ROI.
