When we work on a new application for a client, we give them a place such as beta.project.com, where they can follow the project's progress. This beta site should remain hidden from crawlers, so it doesn't accidentally appear in Google before the application has launched.
The standard way to keep crawlers away is robots.txt. The problem with robots.txt is that you may forget to remove it when the site launches. Since the beta and production sites are often deployed from the same repository, this can easily happen during the excitement surrounding a product launch. Now you have screwed up big time: the production site will remain hidden from Google until someone notices the missing traffic, and it can then take weeks to get back into Google's index.
Here is a suggestion for how to fail less. Instead of adding a robots.txt to your document root, name the file robots.beta.txt. To turn away all crawlers, it should look like this:
User-Agent: *
Disallow: /
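To convince yourself that these two lines really do block everything, you can feed them to Python's standard-library robots.txt parser. This is just a sanity check, not part of the setup; the host name is the example domain from this article.

```python
from urllib import robotparser

# Parse the disallow-all rules from above and check that a crawler
# honoring them may not fetch any page on the beta host.
rp = robotparser.RobotFileParser()
rp.parse(["User-Agent: *", "Disallow: /"])

print(rp.can_fetch("Googlebot", "https://beta.project.com/"))       # False
print(rp.can_fetch("Googlebot", "https://beta.project.com/admin"))  # False
```

Any user agent and any path should come back as not fetchable.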
We can now tell our web server to rewrite requests for robots.txt to robots.beta.txt, but only those that refer to the beta site. The production site should rightfully return a 404 Not Found error when asked for robots.txt. To achieve this with Apache, add the following lines to the virtual host of the beta site:
RewriteEngine On
RewriteCond %{HTTP_HOST} ^beta\.project\.com$
RewriteRule ^/robots\.txt$ /robots.beta.txt
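The decision these rules encode is small enough to spell out: the rewrite fires only when the Host header matches the beta domain; on any other host, robots.txt stays missing and Apache serves its usual 404. A minimal sketch of that logic in Python (the function name and host names are illustrative, not part of the Apache configuration):

```python
def robots_target(host: str, path: str):
    """File to serve for a /robots.txt request, or None for a 404."""
    if path == "/robots.txt" and host == "beta.project.com":
        return "robots.beta.txt"  # beta site: serve the disallow-all rules
    return None                   # production site: no robots.txt, 404 is fine

print(robots_target("beta.project.com", "/robots.txt"))  # robots.beta.txt
print(robots_target("www.project.com", "/robots.txt"))   # None
```

After deploying, a quick check with curl against both host names should show the disallow rules on the beta site and a 404 on production.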
Now you won't have to remember to remove the robots.txt file when the product launches.