Part 1. What the statistics say
Non-visual spam protection relies on automatic analysis of the data a visitor sends. The more data is analyzed, the more completely and accurately the visitor can be characterized and the decision made whether he is a spammer or not.
Systems that analyze such data usually accumulate statistics on visitor data together with the verdicts reached. Here we offer an overview of the statistics we have collected at CleanTalk, a service that protects sites from spam.
I have deliberately left out the analysis of IP addresses against blacklists. Even without it, the contents of form fields and HTTP headers alone yield enough data.
I'll review the data for the message text, the nickname, the e-mail address, the HTTP headers, and the results of a JavaScript test.
Analyzing these indicators is algorithmically very simple and undemanding of resources, so it can be run before other, more resource-intensive checks.
The data reflect the real picture at the time of writing and are based on our analysis of current traffic (more than 2,000,000 requests per day). You are free to use them when analyzing visitors to your own sites. Note that a verdict based on any single criterion is unreliable; the best results come from analyzing all of them together.
-
Message text
The message text is certainly the main thing in spam. Consequently, the way spammers build their posts makes them clearly different from normal messages on several criteria.
The following table shows the statistics that are, in my view, the most informative.
Message text parameters (average values) | Not spam | Spam |
Number of links | 1.47 | 4.27 |
Number of contacts (phone, e-mail) | 1.72 | 6.38 |
Form filling time, sec | 177 | 8 |
Ratio of message length to filling time, characters/sec | 23.81 | 308.54 |
The number of links speaks for itself. The amount of contact information is also indicative of spam. The form filling time and, as a consequence, the typing rate differ the most.
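The four indicators in the table can be computed from the submitted text and a form timer. The following is a minimal sketch, not the CleanTalk implementation: the regular expressions and the function name `message_features` are assumptions made for illustration, and real link and phone detection needs more care.

```python
import re

# Hypothetical patterns, chosen for illustration only.
LINK_RE = re.compile(r"https?://\S+|www\.\S+", re.IGNORECASE)
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def message_features(text: str, fill_seconds: float) -> dict:
    """Compute the message indicators from the table for one submission."""
    links = len(LINK_RE.findall(text))
    contacts = len(EMAIL_RE.findall(text)) + len(PHONE_RE.findall(text))
    # A form submitted in zero seconds gets an infinite typing rate.
    rate = len(text) / fill_seconds if fill_seconds > 0 else float("inf")
    return {
        "links": links,
        "contacts": contacts,
        "fill_seconds": fill_seconds,
        "chars_per_sec": rate,
    }
```

The resulting values can then be compared with the averages in the table; where exactly to draw the line between the two columns is left open here.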
-
The nickname of the visitor
The nickname can also reveal a lot. The probable reason is the quality of the name-generation algorithms that spammers use.
Nickname parameters (average values) | Not spam | Spam |
Length, characters | 7.40 | 16.52 |
Number of delimiters | 1.89 | 3.80 |
Number of digits | 3.29 | 7.59 |
Length of a continuous run of consonants (Latin letters), characters | 3.61 | 5.90 |
One of the spammer's tasks is to avoid the error saying that a user with the same name already exists on the site. According to the statistics, nicknames are currently made unique by brute force: extra length, inserted delimiters, and digits. The result is many nicknames with long runs of adjacent vowels and consonants, the latter being more common.
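The nickname indicators are cheap to extract with basic string operations. The sketch below makes several assumptions: the delimiter set, the treatment of "y" as a vowel, and the choice to report the longest consonant run are all illustrative, since the article does not define them.

```python
import re

# Assumed delimiter set; the article does not list the characters it counts.
DELIMITERS = set("._- ")
# Latin consonants; "y" is treated as a vowel here (an assumption).
CONSONANT_RUN_RE = re.compile(r"[bcdfghjklmnpqrstvwxz]+", re.IGNORECASE)


def nickname_features(nick: str) -> dict:
    """Compute the nickname indicators from the table."""
    runs = CONSONANT_RUN_RE.findall(nick)
    return {
        "length": len(nick),
        "delimiters": sum(ch in DELIMITERS for ch in nick),
        "digits": sum(ch.isdigit() for ch in nick),
        # Longest run of consecutive consonants, 0 if there are none.
        "max_consonant_run": max((len(run) for run in runs), default=0),
    }
```

For instance, `nickname_features("anna")` reports a longest consonant run of 2, while a generated name such as `xkrtpq_84_bdfg.771` scores high on every indicator.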
-
Name in e-mail
Everything said about nicknames also holds for the name part of the e-mail address.
E-mail name parameters (average values) | Not spam | Spam |
Length, characters | 10.09 | 19.16 |
Number of delimiters | 1.62 | 4.12 |
Number of digits | 4.30 | 9.57 |
Note that the dot is often used as the delimiter: a character string is generated and dots are then inserted into it at random, which yields many e-mail names.
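For e-mail addresses the same counts apply to the part before the "@". A small sketch follows; the function name `email_name_features` and the delimiter set are assumptions, and dots are counted separately because of the pattern noted above.

```python
# Assumed delimiter set for the name part of an address.
EMAIL_DELIMITERS = set("._-+")


def email_name_features(address: str) -> dict:
    """Compute the table's indicators for the part of an address before '@'."""
    local, sep, _domain = address.rpartition("@")
    if not sep:
        raise ValueError("address has no '@'")
    return {
        "length": len(local),
        "delimiters": sum(ch in EMAIL_DELIMITERS for ch in local),
        "dots": local.count("."),
        "digits": sum(ch.isdigit() for ch in local),
    }
```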
-
HTTP headers
Spam bots forge their headers so that they do not differ much from those of a browser.
However, the statistics show that this often holds only at the time the bot is written. Afterwards the bot keeps working and sending clearly outdated headers, as the table below shows.
Percentage of requests by User-Agent header | Not spam | Spam |
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1) | 0.01% | 11.42% |
Opera/9.80 (Windows NT 6.2; Win64; x64) Presto/2.12.388 Version/12.17 | 0.01% | 10.84% |
Ready-made spam solutions may also leave their own headers, in particular when an HTTP proxy is used. This is reflected in our statistics as well.
Percentage of requests by Via header | Not spam | Spam |
Mikrotik HttpProxy | 0.86% | 33.07% |
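Checking a request against these header values takes only a few lines. The sketch below uses the exact strings from the two tables; the function name `header_flags`, the exact-match rule for User-Agent, and the substring match for Via are assumptions, and such lists would need regular updating as bots change.

```python
# Header values taken from the tables above.
STALE_USER_AGENTS = {
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)",
    "Opera/9.80 (Windows NT 6.2; Win64; x64) Presto/2.12.388 Version/12.17",
}
SUSPICIOUS_VIA_MARKERS = ("Mikrotik HttpProxy",)


def header_flags(headers: dict) -> dict:
    """Flag a request whose headers match values common in spam traffic."""
    # HTTP header names are case-insensitive, so normalize them first.
    normalized = {name.lower(): value for name, value in headers.items()}
    user_agent = normalized.get("user-agent", "").strip()
    via = normalized.get("via", "").lower()
    return {
        "stale_user_agent": user_agent in STALE_USER_AGENTS,
        "suspicious_via": any(m.lower() in via for m in SUSPICIOUS_VIA_MARKERS),
    }
```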
-
JavaScript test
A JavaScript test can be a simple but very effective additional check, for example having JS code set the expected cookies; there are many options.
The most advanced (and expensive) bots pass JS tests. However, as the statistics show, a large share of spam comes from very simple programs that cannot.
Percentage failing the JS test | Not spam | Spam |
Cookie change through JS | 0.41% | 68.53% |
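One possible shape for the cookie variant of this test: the server embeds a small script that sets a cookie, and on submission checks that the cookie came back. The sketch below shows only the server side and is an assumption throughout: the cookie name `js_check`, the per-session HMAC token, and the function names are hypothetical and not taken from CleanTalk.

```python
import hashlib
import hmac
import secrets

SECRET_KEY = secrets.token_bytes(32)  # hypothetical per-deployment secret
COOKIE_NAME = "js_check"              # hypothetical cookie name


def js_token(session_id: str) -> str:
    """Derive the cookie value expected for this session."""
    return hmac.new(SECRET_KEY, session_id.encode(), hashlib.sha256).hexdigest()


def js_snippet(session_id: str) -> str:
    """Script to embed in the form page; only a JS-capable client runs it."""
    token = js_token(session_id)
    return f'<script>document.cookie = "{COOKIE_NAME}={token}; path=/";</script>'


def passed_js_test(session_id: str, cookies: dict) -> bool:
    """Return True if the submitted cookies contain the expected token."""
    received = cookies.get(COOKIE_NAME, "")
    # Compare as bytes in constant time.
    return hmac.compare_digest(received.encode(), js_token(session_id).encode())
```

A client that never executes the script submits the form without the cookie and fails the check. A bot that parses the script text could still extract the token, which is consistent with the observation that advanced bots pass such tests.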
-
Conclusion
I have presented the statistics our system has collected so far. Again, for the most accurate spam/not-spam decision, you need to analyze these indicators together, and in combination with other spam-checking methods.
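As one illustration of combining indicators, the percentages from the header and JS-test tables can be treated as conditional probabilities and summed as log-likelihood ratios. This is a naive Bayes sketch of my own, not the method our system uses: it assumes the signals are independent, that each percentage is the share of requests of that class showing the signal, and that the prior is 50/50. The signal names are hypothetical.

```python
import math

# (P(signal | not spam), P(signal | spam)), read from the tables above.
SIGNAL_RATES = {
    # Both User-Agent rows summed: 0.01% + 0.01% and 11.42% + 10.84%.
    "stale_user_agent": (0.0002, 0.2226),
    "mikrotik_via": (0.0086, 0.3307),
    "failed_js_test": (0.0041, 0.6853),
}


def spam_log_odds(signals: dict, prior_spam: float = 0.5) -> float:
    """Sum log-likelihood ratios, assuming the signals are independent."""
    log_odds = math.log(prior_spam / (1 - prior_spam))
    for name, (p_ham, p_spam) in SIGNAL_RATES.items():
        if signals.get(name, False):
            log_odds += math.log(p_spam / p_ham)
        else:
            # An absent signal is weak evidence that the request is not spam.
            log_odds += math.log((1 - p_spam) / (1 - p_ham))
    return log_odds


def spam_probability(signals: dict, prior_spam: float = 0.5) -> float:
    """Convert the combined log-odds into a probability."""
    return 1 / (1 + math.exp(-spam_log_odds(signals, prior_spam)))
```

The independence assumption is unlikely to hold exactly (a bot with a stale User-Agent probably also fails the JS test), so the output is better read as a ranking score than as a calibrated probability.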