When we work on a new application for a client, we give them a place such as beta.project.com, where they can follow the project's progress. This beta site should remain hidden from crawlers, so it doesn't accidentally appear in Google before the application has launched.
The standard way to keep crawlers away is robots.txt. The problem with robots.txt is that you may forget to remove it when the site launches. Since the beta and production sites are often deployed from the same repository, this can easily happen during the excitement surrounding a product launch. Now you have screwed up big time: the production site will remain hidden from Google until someone notices the missing traffic, and it can then take weeks to get back into Google's index.
Here is a suggestion for how to fail less. Instead of adding a robots.txt to your document root, name the file robots.beta.txt. To turn away all crawlers, it should look like this:
User-Agent: *
Disallow: /
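To convince yourself that these two lines really do block everything, you can feed them to Python's standard-library robots.txt parser. This is just a sanity check, not part of the setup; the host name is the example domain from this article.

```python
from urllib import robotparser

# Parse the disallow-all rules from above and check that a crawler
# honoring them may not fetch any page on the beta host.
rp = robotparser.RobotFileParser()
rp.parse(["User-Agent: *", "Disallow: /"])

print(rp.can_fetch("Googlebot", "https://beta.project.com/"))       # False
print(rp.can_fetch("Googlebot", "https://beta.project.com/admin"))  # False
```

Any user agent and any path should come back as not fetchable.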
We can now tell our web server to rewrite requests for robots.txt to robots.beta.txt, but only those that refer to the beta site. The production site should rightfully return a 404 Not Found error when asked for robots.txt. To achieve this with Apache, add the following lines to the virtual host of the beta site:
RewriteEngine On
RewriteCond %{HTTP_HOST} ^beta\.project\.com$
RewriteRule ^/robots\.txt$ /robots.beta.txt
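The decision these rules encode is small enough to spell out: the rewrite fires only when the Host header matches the beta domain; on any other host, robots.txt stays missing and Apache serves its usual 404. A minimal sketch of that logic in Python (the function name and host names are illustrative, not part of the Apache configuration):

```python
def robots_target(host: str, path: str):
    """File to serve for a /robots.txt request, or None for a 404."""
    if path == "/robots.txt" and host == "beta.project.com":
        return "robots.beta.txt"  # beta site: serve the disallow-all rules
    return None                   # production site: no robots.txt, 404 is fine

print(robots_target("beta.project.com", "/robots.txt"))  # robots.beta.txt
print(robots_target("www.project.com", "/robots.txt"))   # None
```

After deploying, a quick check with curl against both host names should show the disallow rules on the beta site and a 404 on production.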
Now you won't have to remember to remove the robots.txt file when the product launches.