We have received several requests about protecting website search result pages from spambots.
At first glance the solution is quite simple: exclude the search results page in “robots.txt”, for example with “Disallow: /search”.
But further analysis showed that this is not a 100% solution. There are more problems that the “Disallow” directive alone cannot fix, and even big corporations ignore them.
Anyone who is familiar with the crawling budget knows that it affects SEO.
One of the Google Webmaster Guidelines says:
Use the “robots.txt” file on your web server to manage your crawling budget by preventing crawling of infinite spaces such as search result pages.
When your website's search engine creates a result page that is open to indexing, search bots spend their time on it instead of processing the pages you need. Indexing takes longer, and some of your good pages may be ignored. If you want to limit indexing, use the “Disallow” directive.
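To check what such a rule blocks, you can run it through Python's standard `urllib.robotparser`. This is a minimal sketch: the rules, the `/search` path, and the `is_crawl_allowed` helper are assumptions for illustration, not taken from a real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents; "/search" is an assumed search path.
ROBOTS_TXT = """\
User-agent: *
Disallow: /search
"""


def is_crawl_allowed(url: str, user_agent: str = "*") -> bool:
    """Return True if the rules above let `user_agent` crawl `url`."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in ("https://example.com/search?q=yacht", "https://example.com/about"):
        print(url, is_crawl_allowed(url, "Googlebot"))
```

This only shows how a well-behaved crawler reads the rule. Spambots are free to ignore “robots.txt” entirely.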
However, as is often the case in SEO, there are many details and situations in which this advice is not optimal.
A lot of websites, including those of big companies, ignore this advice and give crawler bots access to their search result pages. With the right approach this can make sense: if the search result pages that Google shows to your visitors match their queries and satisfy their needs, it could be useful for some types of websites.
Be careful. Your website could receive a Google penalty and get a low rank. CleanTalk doesn’t recommend doing this.
It is quite possible that the search result pages of your website are not the pages you would most like to see in search results.
Adding the “Disallow” directive alone is not enough to solve the problem of spam requests.
When a spambot or a visitor searches your website for a spam phrase with a spam link, the search result page will contain the phrase and the link even if no pages are found on your website.
The page will look like this:
Your search for “rent yacht here www.example.com” did not match any entries.
If your search result page is open to indexing, crawler bots will conclude that your website links to or mentions this topic. The spammer's goal of promoting something is then fulfilled: your website carries the desired phrase and link (in some cases the search result page could even contain an active link).
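One way to make sure the echoed query can never become an active link is to escape it before it goes into the page. The sketch below uses Python's standard `html.escape`. The `render_no_results` function and the 100-character limit are assumptions for illustration.

```python
from html import escape


def render_no_results(query: str, max_len: int = 100) -> str:
    """Build the 'no results' message with the query shown as inert text."""
    # Truncate first, then escape, so markup in the query is displayed
    # as plain characters instead of being interpreted by the browser.
    shown = escape(query[:max_len], quote=True)
    return f"<p>Your search for &ldquo;{shown}&rdquo; did not match any entries.</p>"


if __name__ == "__main__":
    print(render_no_results('<a href="http://www.example.com">rent yacht</a>'))
```

Escaping only makes the link inactive. The spam phrase itself is still on the page, so the indexing measures are still needed.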
To get rid of this problem you have already added “Disallow: /search” to your “robots.txt” file, but this directive doesn’t fully prevent crawler bots from indexing and visiting these pages. Google tells us this directly:
A robotted page can still be indexed if linked to from other sites
While Google won’t crawl or index the content blocked by “robots.txt”, we might still find and index a disallowed URL if it is linked from other places on the web. As a result, the URL address and, potentially, other publicly available information such as anchor text in links to the page can still appear in Google search results. To properly prevent your URL from appearing in Google Search results, you should password-protect the files on your server or use the noindex meta tag or response header (or remove the page entirely).
Thus you have to add the NoIndex meta tag to your search result page template.
To prevent most search engine web crawlers from indexing a page on your site, place the following meta tag into the <head> section of your page:
<meta name="robots" content="noindex">
To prevent only Google web crawlers from indexing a page:
<meta name="googlebot" content="noindex">
You should be aware that some search engine web crawlers might interpret the NoIndex directive differently. As a result, it is possible that your page might still appear in results from other search engines.
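The Google text quoted earlier also mentions a response header as an alternative to the meta tag. That header is `X-Robots-Tag`. Here is a minimal sketch of a WSGI middleware that adds it to search result responses only. The `/search` prefix and the names `noindex_search` and `demo_app` are assumptions for illustration.

```python
SEARCH_PREFIX = "/search"  # assumed location of the search result pages


def noindex_search(app):
    """WSGI middleware: add 'X-Robots-Tag: noindex' to search responses."""

    def wrapped(environ, start_response):
        is_search = environ.get("PATH_INFO", "").startswith(SEARCH_PREFIX)

        def patched_start_response(status, headers, exc_info=None):
            if is_search:
                # Copy the list so the wrapped app's headers are not mutated.
                headers = list(headers) + [("X-Robots-Tag", "noindex")]
            return start_response(status, headers, exc_info)

        return app(environ, patched_start_response)

    return wrapped


def demo_app(environ, start_response):
    """Stand-in application used only to demonstrate the middleware."""
    start_response("200 OK", [("Content-Type", "text/html; charset=utf-8")])
    return [b"<p>ok</p>"]
```

Wrapping an application as `noindex_search(demo_app)` leaves all other pages untouched, which helps avoid putting NoIndex on pages you do want indexed.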
Why does this have to be done?
In a way you can call it a vulnerability, and spammers use it for their own purposes. They search your website for the keywords they need, then grab the link to the search results and paste it on other web resources.
When Google bots visit pages that contain such a link, they follow it to the disallowed URL. “Disallow” blocks crawling, but it does not stop the URL from being indexed, so spam search result URLs can still end up in the index.
As a result, users who search Google for the same phrases might get pages with spam. This is dangerous because important data such as phone numbers and contact e-mails could be compromised.
Load on Your Website via Search Form
How it works: your website has a search engine, and visitors can enter a word or a phrase they want information about. The search engine generates result pages, and these pages are visited by crawler bots such as Google, Bing and the like. There could be dozens or even hundreds of search result pages, which can create a significant load because your website generates a new result page every time. Spambots can also use your search engine to perform a DDoS attack, forcing your web server to process a large number of requests.
So, how can you avoid these problems?
- Add a “Disallow” directive for the search result page to “robots.txt”.
- Add the NoIndex tag to the search result page template of your website. Be careful: make sure that other pages don’t have this tag, or else Google will stop indexing them. Keep in mind that a crawler can only see the tag on pages it is allowed to fetch.
- Set a limit on the number of requests a single IP address can make to your search form.
You can do all of this yourself, but we suggest using our anti-spam solution.
CleanTalk Anti-Spam has an option to protect your website search form from spambots.
- Spam FireWall blocks access to all website pages for the most active spambots. It lowers your web server load and traffic just by doing this.
- Anti-Spam protection for website search forms repels spambots.
- An additional option can add the NoIndex tag to forbid indexing.
- If your search form receives data too often, the CleanTalk plugin adds a pause and increases it with each new attempt to send data. This saves your web server's processor time.
- Spam protection lets you keep search result pages open to crawler bots if you really need that, while still protecting you from spambots.
- CleanTalk allows you to see what requests users made in the search form and what they were looking for. This will help you optimize your site and make information more accessible.
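The growing pause mentioned in the list can be illustrated with a simple doubling schedule. This is not the CleanTalk plugin's actual code. The `next_pause` function, the doubling rule, the 1-second base and the 60-second cap are all assumptions chosen for the sketch.

```python
def next_pause(attempt: int, base_seconds: float = 1.0, max_seconds: float = 60.0) -> float:
    """Pause in seconds before serving the Nth too-frequent request (N starts at 1)."""
    if attempt < 1:
        return 0.0
    # Double the pause with each attempt; cap the exponent so very large
    # attempt numbers cannot overflow, then cap the pause itself.
    exponent = min(attempt - 1, 30)
    return min(base_seconds * 2 ** exponent, max_seconds)


if __name__ == "__main__":
    for n in range(1, 8):
        print(n, next_pause(n))
```

The cap keeps a legitimate visitor from being locked out for long, while a bot that keeps hammering the form spends most of its time waiting.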
You can download the CleanTalk Anti-Spam Plugin from the WordPress Plugin Directory.
Note: The option to add tags to search result pages will be available in one of the next releases. We will inform you.
Spam protection for search forms is available for WordPress, Joomla 2.5, Drupal 8.