Tips to prevent the crawling and indexing of sub-domains in search engines
Today I would like to share some tips to prevent the crawling and indexing of sub-domains in search engines.
How many of you know that a main domain and a sub-domain are different?
Indeed, from a search engine's point of view the main domain (www.xxx.com) and a sub-domain (aaa.xxx.com) are entirely different, and search engines treat them as separate websites. Most of us create sub-domains mainly for testing, demo projects, client information, and so on. This material is often confidential as far as the business owner is concerned, and the owner expects that no visitors will see it. But search engine spiders and bots do not know how sensitive the content in a sub-domain and its pages is. They simply start crawling once the files and folders are reachable on the server and they discover a URL for them, for example through a link.
That's why webmasters use a robots.txt file to block the files and folders they consider confidential or valuable. With robots.txt we can tell search engine bots what may be crawled and what may not. Before crawling a site, a well-behaved bot first requests robots.txt from the root of that host; if the file exists, the bot follows its instructions while crawling the other files and folders on the server. The .htaccess file is different: it is a configuration file read by the web server itself (Apache), not by search engines, and it can be used to restrict access or require a password. Hence robots.txt has to be placed in the root directory of the site, because bots look for it only there.
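To make the effect of such instructions concrete, here is a minimal sketch in Python that uses the standard library's urllib.robotparser to check whether a URL may be crawled. The rules, the /private/ folder and the helper name is_allowed are assumptions made up for this illustration; real search engine bots use their own parsers, which may differ in details.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used only for this illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if the given robots.txt rules let user_agent crawl url."""
    parser = RobotFileParser()
    # parse() takes the file content as a list of lines, so no network
    # request is needed for this check.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # A URL inside the disallowed folder, then one outside of it.
    print(is_allowed(ROBOTS_TXT, "http://www.xxx.com/private/report.html"))
    print(is_allowed(ROBOTS_TXT, "http://www.xxx.com/index.html"))
```

Keep in mind that robots.txt is only a request to bots: a bot that ignores the file can still fetch the blocked URLs.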
OK, let's come to the point. As I said earlier, people use sub-domains for various purposes. If a sub-domain that is meant to be confidential gets crawled and indexed by search engines, its pages can show up in search results, where anyone can find them, including people looking for unprotected test or demo sites to attack. So, to keep such sub-domains out of search engines and away from hackers and malicious scripts, we need to do the following things:
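1. In the main domain's robots.txt file, disallow the entire sub-domain folder that you do not want search engines (SEs) to crawl and index. This covers the case where the sub-domain's files are also reachable through the main domain, as in http://www.xxx.com/aaa/.
Ex: The following code should be inserted in http://www.xxx.com/robots.txt
# block the entire sub folder /aaa/
User-agent: *
Disallow: /aaa/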
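2. Even if you disallow the sub-domain folder in the main domain's robots.txt, its pages can still get indexed by SEs without being crawled, for example when other sites link to them. Also, a robots.txt file applies only to the host it is served from, so the rules on www.xxx.com do not cover URLs on aaa.xxx.com.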
3. To avoid such issues, it is better to create a separate robots.txt file for the sub-domain as well. Place that file in the root of the sub-domain folder, so that it is served at http://aaa.xxx.com/robots.txt.
Ex: The following code should be inserted in http://aaa.xxx.com/robots.txt
User-agent: *
Disallow: /
4. Alternatively, if the sub-domain holds private information that must never be accessed by anybody, including robots, protect the folder with a password.
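The two robots.txt files from steps 1 and 3 can be checked together with a small Python sketch. It assumes that each host is judged only by its own robots.txt, which is how the standard treats the file; the ROBOTS_BY_HOST table and the can_crawl helper are hypothetical names for this illustration, and the file contents are kept in memory instead of being downloaded.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

# robots.txt content per host, copied from the two examples in this post.
ROBOTS_BY_HOST = {
    "www.xxx.com": "User-agent: *\nDisallow: /aaa/\n",
    "aaa.xxx.com": "User-agent: *\nDisallow: /\n",
}


def can_crawl(url: str, user_agent: str = "*") -> bool:
    """Check url against the robots.txt of the host it belongs to."""
    host = urlsplit(url).netloc
    robots_txt = ROBOTS_BY_HOST.get(host)
    if robots_txt is None:
        # Assumption: a host without robots.txt disallows nothing.
        return True
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in (
        "http://www.xxx.com/aaa/demo.html",  # blocked by the main domain file
        "http://aaa.xxx.com/demo.html",      # blocked by the sub-domain file
        "http://www.xxx.com/index.html",     # not blocked by any rule
    ):
        print(url, can_crawl(url))
```

If the entry for aaa.xxx.com is removed from the table, the sub-domain URL is reported as crawlable again, which is the gap that step 3 closes. Neither file hides the content from a visitor who knows the URL, so password protection (step 4) remains the option for truly private material.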