Category: CleanTalk

  • Developing chat bots for Telegram and Slack with PHP

    General information

    This article describes how to create simple chat bots for Telegram and Slack, using as an example a bot that checks an IP address or email for spam with the CleanTalk anti-spam service.

    Telegram

    The first step is to create your bot (in our case @CleanTalkBot). Telegram provides a bot for this purpose, @BotFather. Add it to your Telegram account and send the command /newbot. The bot will ask for the name of the new bot and then for its username. We used the same value for both. The username must end with bot or Bot, for example HabrArticleBot or CleanTalkBot. Once the username is accepted, the bot is created and you receive a token that is used later for identification.

    The second step is to set a webhook, that is, the address of the handler for requests that users send to the chat bot. When a user sends a command to your bot, Telegram calls the address specified as the webhook and passes along the user's message and service information. Your handler generates a response and sends it back to Telegram, and Telegram delivers the answer to the user. The webhook can be set with curl in the terminal:

    curl -d "url=https://example.com/telegramwaiter.php" https://api.telegram.org/botYOUR_TELEGRAM_TOKEN/setWebhook

    where YOUR_TELEGRAM_TOKEN is the token that @BotFather gave you earlier and https://example.com/telegramwaiter.php is the address that will handle requests from Telegram. In response, Telegram should return a JSON string like

    {"ok":true,"result":true,"description":"Webhook is set"}

    which means that the handler for your chat bot was set successfully.

    Note that Telegram webhooks work only over HTTPS. If your certificate is issued by a certificate authority (not self-signed), everything is fine. If you want to use a self-signed certificate, see the documentation at https://core.telegram.org/bots/self-signed.

    The third step is to write telegramwaiter.php, the handler for requests from Telegram. A sample script in PHP looks like this:

    <?php

    set_time_limit(0);

    // Bot token issued by @BotFather
    $botToken = "YOUR_TELEGRAM_TOKEN";
    $website = "https://api.telegram.org/bot" . $botToken;

    // Read the request received from Telegram
    $content = file_get_contents("php://input");
    $update = json_decode($content, true);

    // Ignore updates that do not carry a message
    if (!isset($update["message"])) {
        exit();
    }
    $message = $update["message"];

    // Get the internal Telegram chat ID and the command entered by the user in the chat
    $chatId = $message["chat"]["id"];
    $text = isset($message["text"]) ? $message["text"] : '';

    // Example of processing the command /start
    if ($text == '/start') {
        $welcomemessage = 'Welcome!!! Check IP/Email for spam giving "check IP/Email" command';

        // Send the generated message back to the Telegram user;
        // the text is URL-encoded because it travels in the query string
        file_get_contents($website . "/sendMessage?chat_id=" . $chatId . "&text=" . urlencode($welcomemessage));
    }

    ?>

    The procedure is: read the user's command from the chat into the variable $text, build the reply according to your logic, and send it back to the user with file_get_contents(). Because the reply travels in the query string, its text should be URL-encoded.
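
    The same flow can be sketched without any network access, which makes the parsing and URL-building steps easy to check. This is a minimal Python sketch, not part of the original handler: the function name build_reply_url and the reply wording are assumptions made for illustration, while the api.telegram.org prefix and the sendMessage method come from the script above.

```python
import json
from urllib.parse import urlencode

API_BASE = "https://api.telegram.org/bot"  # same prefix as in the PHP script


def build_reply_url(token, update_json):
    """Return the sendMessage URL that answers a /start update, or None."""
    update = json.loads(update_json)
    message = update.get("message") or {}
    chat_id = message.get("chat", {}).get("id")
    text = message.get("text", "")
    if chat_id is None or text != "/start":
        return None
    reply = 'Welcome! Send "check IP/Email" to check an address for spam.'
    # urlencode escapes spaces, quotes and slashes in the reply text.
    query = urlencode({"chat_id": chat_id, "text": reply})
    return f"{API_BASE}{token}/sendMessage?{query}"


# Hypothetical update body, reduced to the two fields the handler reads.
sample = json.dumps({"message": {"chat": {"id": 42}, "text": "/start"}})
print(build_reply_url("YOUR_TELEGRAM_TOKEN", sample))
```

    Keeping the URL construction in a pure function like this lets the reply logic be tested separately from the HTTP call that actually delivers the message.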

    To see how it works, add the @CleanTalkBot bot in Telegram and send the command check IP|Email. The bot reports whether the specified IP address or email is listed as spam.

    Example of a response

    Email st********@*****le.com is BLACKLISTED. Frequency 999. Updated Apr 24 2019. https://cleantalk.org/blacklists/st********@*****le.com.

    Slack

    Slack takes a slightly different approach to creating chat bots.

    Go to https://api.slack.com/apps/new and create a new Slack application.

    In the app list at https://api.slack.com/apps, choose your app, follow the Slash Commands link in the menu on the right, and click Create new command.

    Fill in the following fields in the form that appears:

    Command: the command itself, beginning with /, for example /ctcheck.

    Request URL: the URL of the command's request handler, similar to the Telegram webhook (e.g. https://cleantalk.org/slackwaiter.php).

    Short description: a brief description of what the created command does.

    Save the command. Note that your site must be served over HTTPS, and in this case Slack does NOT support self-signed certificates.

    The token used for identification is on the page with the list of commands: below the list there is a Verification token field. It appears below as YOUR_SLACK_TOKEN.

    Write the handler slackwaiter.php in PHP:

    <?php

    set_time_limit(0);

    // Check that the token sent by Slack matches the one issued in the Slack dashboard
    if (isset($_POST['token']) && $_POST['token'] == 'YOUR_SLACK_TOKEN') {
        // $param is the text that goes after the command,
        // for example, for the command /ctcheck 127.0.0.1
        // $param = 127.0.0.1
        $param = isset($_POST['text']) ? $_POST['text'] : '';

        // The answer is then formed according to the internal logic
        $slackresponse = 'Here is the response to the command';
    } else {
        $slackresponse = 'Error';
    }

    $response = array();
    $response['text'] = $slackresponse;

    header('Content-Type: application/json');
    echo json_encode($response);

    ?>
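
    The decision the handler makes (accept the request only when the token matches, then answer with a JSON object that has a text field) can be isolated in a small function. The Python sketch below is an illustration under stated assumptions: handle_slash_command is a hypothetical name, form stands for the POST fields Slack sends, and the constant-time comparison is a hardening choice that the original script does not make.

```python
import hmac
import json


def handle_slash_command(form, expected_token):
    """Build the JSON body returned to Slack for one slash-command request."""
    supplied = form.get("token", "")
    # compare_digest avoids leaking the token through timing differences.
    if hmac.compare_digest(supplied.encode(), expected_token.encode()):
        param = form.get("text", "").strip()  # text typed after the command
        reply = f"Here is the response to the command for {param}"
    else:
        reply = "Error"
    return json.dumps({"text": reply})


request_fields = {"token": "YOUR_SLACK_TOKEN", "text": "127.0.0.1"}
print(handle_slash_command(request_fields, "YOUR_SLACK_TOKEN"))
```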

    Then go to https://api.slack.com/docs/slack-button and, in the section Add the Slack button, tick incoming webhook and commands. Slack generates the HTML code of a button; by clicking it, other teams can add your application to their Slack account.

    Place this button on your site. Clicking it opens the Slack authorization page.

    To authorize, the user selects a channel where the application can be used.

    When the user clicks Authorize, Slack redirects the user to the Redirect URI(s) page, which you (the developer) define at https://api.slack.com/apps: select your application and follow the App Credentials link.

    Slack does not simply redirect the user to the given page: it also adds a GET variable code, whose value is later processed by your script, for example

    https://cleantalk.org/authscript.php?code=Slack_Code

    Next is an example of the script authscript.php. Take CLIENT_ID and CLIENT_SECRET from the corresponding fields on the App Credentials page.

    <?php

    if (isset($_GET['code'])) {
        $client_id = 'CLIENT_ID';
        $client_secret = 'CLIENT_SECRET';
        $code = $_GET['code'];

        // Exchange the temporary code for the team information
        $response = file_get_contents(
            "https://slack.com/api/oauth.access?client_id=" . urlencode($client_id)
            . "&client_secret=" . urlencode($client_secret)
            . "&code=" . urlencode($code)
        );
        $responsearr = json_decode($response, true);

        if (isset($responsearr['team_name'])) {
            // Send the user to the main page of the team
            header('Location: https://' . $responsearr['team_name'] . '.slack.com');
            exit();
        } else {
            echo 'Error.';
            exit();
        }
    } else {
        exit();
    }

    ?>

    The procedure is: take the GET variable code received from Slack and, together with two more parameters, client_id and client_secret, send a GET request to https://slack.com/api/oauth.access. In response, Slack sends a JSON string with many fields, something like this

    {"ok": true, "team_name": "your_team_name"}

    Then just read the name of the team and redirect the user to the main page of the team, https://your_team_name.slack.com. The application is now authorized and its commands can be used.
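
    The two steps of this exchange, building the request URL and turning the reply into a redirect target, can be sketched as pure functions. This Python sketch is illustrative only: build_oauth_request and team_redirect are hypothetical names, the endpoint is the one named above, and, like the PHP script, it assumes that team_name can be used directly as the team's subdomain.

```python
import json
from urllib.parse import urlencode

OAUTH_URL = "https://slack.com/api/oauth.access"  # endpoint named in the article


def build_oauth_request(client_id, client_secret, code):
    """Return the GET URL that exchanges the temporary code."""
    params = {"client_id": client_id, "client_secret": client_secret, "code": code}
    # urlencode joins the pairs with "&" and escapes each value.
    return f"{OAUTH_URL}?{urlencode(params)}"


def team_redirect(response_json):
    """Return the team page URL from Slack's reply, or None on failure."""
    data = json.loads(response_json)
    team = data.get("team_name")
    if not data.get("ok") or not team:
        return None
    return f"https://{team}.slack.com"


print(build_oauth_request("CLIENT_ID", "CLIENT_SECRET", "Slack_Code"))
print(team_redirect('{"ok": true, "team_name": "your_team_name"}'))
```

    Building the query string from a dictionary avoids hand-concatenation mistakes such as a stray space between & and a parameter name.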

    The CleanTalk team hopes that this information will be useful to anyone interested in developing chat bots.

  • CleanTalk apps for Slack and Telegram chats

    We have developed apps for Slack and Telegram that let you check IPs and emails against the blacklists directly in the chat.

    To do this, add the application to your chat and send it a command with the IP or email to check. The application queries our database and returns the result, “Spam” or “Not Spam”, in the chat.

    Instructions can be found at https://cleantalk.org/help/bots

    If you use Slack or Telegram frequently, the application saves you a visit to our website to check whether an IP or email is blacklisted.

  • Anti-Spam Filter for Subnets

    Dear users!

    We are pleased to announce the launch of an anti-spam filter for subnets.

    You can now add to your personal blacklist not only individual IP addresses but also entire subnets. Entries are added in the Black&White lists section of your CleanTalk Dashboard.

    Instructions for adding entries to your personal blacklist can be found here: https://cleantalk.org/help/sfw-blocks-networks.

  • Delegating access rights to the CleanTalk Dashboard

    Dear Customers,

    We are pleased to announce the launch of a new option in the CleanTalk dashboard.

    This option allows you to delegate access rights to other users of the CleanTalk dashboard.

    It is useful for web studios and webmasters who maintain customer sites: for each site you can grant either view-only access or full access to manage the settings.

    Read access: allows the user to view all sections of the dashboard.

    Full access: allows the user to change the service settings for a specific site, to edit the personal blacklists, and to enable extended options.

    Access rights are set per website, so you can give read access to one user and full access to others.

    Instructions for use can be found at https://cleantalk.org/help/delegation

    Please note that this option will be included in the advanced package from 15 July 2016.

  • SpamFireWall – prohibition of access to the site for spambots

    Every website owner or webmaster faces the scourge of spam in comments and contact forms and of spambots registering in the guise of users. The forms on the website process these messages, which consumes server resources. Some spambots load the page to bypass the anti-spam protection, which consumes even more. In small amounts this is imperceptible, but when a website receives thousands of such requests per day, it can significantly affect the server's CPU load.

    Now we will tell you about a new option in the CleanTalk anti-spam plugin that can effectively repel spambot attacks on your website. The option is called SpamFireWall (SFW). It blocks POST and GET requests from the most active spambots and keeps them from loading the server.

    How it works

    1. The user visits the website.
    2. The user's IP address is checked against a database containing more than two million IP addresses that belong to spambots.
    3. If the IP address is in the database, the site displays a special page. Ordinary users will not notice anything, because the protection works invisibly.
    4. All information about the process is stored in the database and is available in the dashboard.

    The special page shown when spam activity is suspected costs little time for visitors who see it by mistake. After 3 seconds the visitor is forwarded to the requested page automatically, or sooner by clicking the link.

    This blocks all HTTP/HTTPS traffic from spam-active IP addresses. As a result, besides spam, these IP addresses can no longer be used for other types of attacks on the website: brute force, DDoS, SQL injection, site scanning by spambots, referral spam, etc.

    SpamFireWall lets users maintain their own blacklists, which can contain both individual IP addresses and networks.
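
    A blacklist that mixes single addresses and networks can be checked with one membership test per entry. The following Python sketch only illustrates that idea and is not how SpamFireWall itself is implemented: is_blocked and the sample list are assumptions, and the addresses come from the ranges reserved for documentation.

```python
import ipaddress


def is_blocked(ip, blacklist):
    """Return True if ip equals a listed address or lies inside a listed network."""
    address = ipaddress.ip_address(ip)
    for entry in blacklist:
        # A bare address is parsed as a single-host network (/32 or /128);
        # strict=False also accepts entries such as "10.0.0.5/24".
        network = ipaddress.ip_network(entry, strict=False)
        if address.version == network.version and address in network:
            return True
    return False


# Hypothetical personal blacklist: one address and one subnet.
personal_blacklist = ["203.0.113.7", "198.51.100.0/24"]
print(is_blocked("198.51.100.42", personal_blacklist))  # True
```

    A linear scan like this is fine for a short personal list; a database of millions of addresses would need an indexed structure instead.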

    SpamFireWall is currently available for WordPress, Joomla, Drupal, Bitrix, SMF, MediaWiki, and IPS Community Suite. In addition, you can use an API method to get a list of spam-active networks (https://cleantalk.org/help/api-spam-check).

    SFW request logging

    All requests that triggered the SFW option are stored in a log and are then available in the dashboard.

    In the statistics you can see the number of blocked requests, as well as the number of requests that were blocked at first but whose visitors then proceeded to the site. The SFW database currently contains 3.22 million IP addresses. In 7 days, from 3 to 10 May, SFW blocked 3,858,562 requests.

    About the service CleanTalk

    CleanTalk is a cloud service to protect websites from spam bots. CleanTalk uses protection methods that are invisible to the visitors of the website. This allows you to abandon the methods of protection that require the user to prove that he is a human (captcha, question-answer etc.).

  • Which CMS gets spammed most often?

    The statistics are based on data from the CleanTalk anti-spam service for the period from April 2015 to March 2016. The analysis covers the following CMS: WordPress, Joomla, 1C Bitrix, Drupal, phpBB3.0, phpBB3.1, IP.Board, SimpleMachines, MediaWiki.

    The analysis includes all POST requests processed by the service: comments, registrations, contact forms, orders, feedback, and others.

    The distribution of the main forms on the websites:

    • Comments 65.5% of sites
    • Registration 53%
    • Contacts 68.5%
    • Other 45%
    • Contact and comments 49%
    • Contact and registration 21%
    • Comments and registration 21%

    Sites with:

    • 1 form 23%
    • 2 forms 34%
    • 3 forms and more 41%

    Distribution of spam attacks per day per site, by CMS

    Top CMS by spam attacks

    CMS               Number of spam attacks
    MediaWiki         657.92
    Joomla            172.45
    1C Bitrix         129.27
    Drupal            118.14
    IP.Board          98.70
    WordPress         49.75
    SimpleMachines    41.75
    phpBB 3.1         27.51
    phpBB 3.0         25.88
    Average           146.82

    As we can see, MediaWiki differs greatly from the other CMS. In our opinion, such a substantial gap is explained by the fact that this CMS lacks sufficiently effective means of protection and that tracking changes made to articles is very difficult for administrators. This makes it convenient for spammers to place links in articles.

    The low level of spam on phpBB is due to the rather low prevalence of this platform.

    The number of blocked spam attacks for the period

    Month             Anti-spam     SpamFireWall
    April 2015        34,956,588    0
    May 2015          39,269,843    0
    June 2015         48,258,175    0
    July 2015         51,081,673    0
    August 2015       44,131,678    0
    September 2015    50,954,715    0
    October 2015      49,895,055    9,026,116
    November 2015     46,807,047    17,129,574
    December 2015     62,355,098    11,971,351
    January 2016      54,720,390    17,540,442
    February 2016     63,326,170    14,036,018
    March 2016        67,676,972    13,710,624
    April 2016        68,038,697    13,413,217

    Note that in October 2015 we launched the SpamFireWall service. Since then some spam attacks have been blocked by it and are not counted by the Anti-Spam service. Note also that the SpamFireWall statistics count not only blocked POST requests but all blocked GET requests to the site as well.

    As the data show, the amount of spam on websites keeps growing and has some seasonality. In summer and autumn the growth of spam stops or falls slightly, but with the onset of winter it always resumes.

    The percentage of spam in POST requests, by CMS

    CMS               % of spam
    MediaWiki         99.76
    WordPress         98.21
    Drupal            96.08
    SimpleMachines    95.74
    IP.Board          91.72
    Joomla            91.04
    phpBB 3.1         90.35
    1C Bitrix         82.03
    phpBB 3.0         81.87
    Average           91.87

    The statistics show that the proportion of spam in comments, registrations, contact forms, etc. is greater than 90%. In our opinion, spammers find that promoting links this way still works effectively, if not for advancing in search results then for attracting an audience to their own resources.

  • Changing the title of the WordPress plugin

    We have changed the title of the WordPress plugin from “Anti-Spam by CleanTalk” to “Spam protection by CleanTalk”. Don’t worry: we want to test how people perceive long and short titles.

  • Non-visual methods to protect the site from spam. Part 3. Repeats

    Continuation of the article Non-visual methods to protect the site from spam

    Part 3: Repeats of substrings

    As mentioned earlier, non-visual methods of protecting a site from spam use text analysis. One of the most common spam signals is the presence of repeated strings. As always, the examples are taken from actual CleanTalk data.

    The search for such repeats must use as few resources as possible. It is best run after the tests from the first and second parts of the article, which eliminate obvious spam and bring the text into a form suitable for analysis. Here I give some statistics and sample code.

    1. Sample code

    We use a function that finds the longest repeated substrings with the naive algorithm described at http://algolist.manual.ru/search/lrs/naive.php

    Example output is shown below.

     s  a  l  e     f  o  r     s  a  l  e     f  o  r     s  a  l  e
    0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21

    s  0   +  .  .  .  .  .  .  .  .  +  .  .  .  .  .  .  .  .  +  .  .  .
    a  1   .  +  .  .  .  .  .  .  .  .  +  .  .  .  .  .  .  .  .  +  .  .
    l  2   .  .  +  .  .  .  .  .  .  .  .  +  .  .  .  .  .  .  .  .  +  .
    e  3   .  .  .  +  .  .  .  .  .  .  .  .  +  .  .  .  .  .  .  .  .  +
    4   .  .  .  .  +  .  .  .  +  .  .  .  .  +  .  .  .  +  .  .  .  .
    f  5   .  .  .  .  .  +  .  .  .  .  .  .  .  .  +  .  .  .  .  .  .  .
    o  6   .  .  .  .  .  .  +  .  .  .  .  .  .  .  .  +  .  .  .  .  .  .
    r  7   .  .  .  .  .  .  .  +  .  .  .  .  .  .  .  .  +  .  .  .  .  .
    8   .  .  .  .  .  .  .  .  +  .  .  .  .  +  .  .  .  +  .  .  .  .
    s  9   .  .  .  .  .  .  .  .  .  +  .  .  .  .  .  .  .  .  +  .  .  .
    a 10   .  .  .  .  .  .  .  .  .  .  +  .  .  .  .  .  .  .  .  +  .  .
    l 11   .  .  .  .  .  .  .  .  .  .  .  +  .  .  .  .  .  .  .  .  +  .
    e 12   .  .  .  .  .  .  .  .  .  .  .  .  +  .  .  .  .  .  .  .  .  +
    13   .  .  .  .  .  .  .  .  .  .  .  .  .  +  .  .  .  +  .  .  .  .
    f 14   .  .  .  .  .  .  .  .  .  .  .  .  .  .  +  .  .  .  .  .  .  .
    o 15   .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  +  .  .  .  .  .  .
    r 16   .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  +  .  .  .  .  .
    17   .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  +  .  .  .  .
    s 18   .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  +  .  .  .
    a 19   .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  +  .  .
    l 20   .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  +  .
    e 21   .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  +

    $VAR1 = {
    'sale' => 3,
    'for sale' => 2
    };
    

    And here is the function in Perl, with minimal changes. For convenience, this is the full program, including the code that prints the matrix above.

    #!/usr/bin/perl -w

    use strict;
    use utf8;
    use Data::Dumper;

    binmode(STDOUT, ':utf8');

    my $min_longest_repeat_length = 4;

    my $message = 'sale for sale for sale';
    my %longest_repeates = ();

    get_longest_repeates(\$message, \%longest_repeates);
    print Dumper(\%longest_repeates);

    sub get_longest_repeates {
        my $test_ref = shift;    # Reference to the text for analysis
        my $reps_ref = shift;    # Reference to the hash of results

        my @symbols = split //, $$test_ref;
        my $m_len = scalar @symbols;

        my @matrix = ();    # A square matrix of symbol matches

        # Filling the matrix to the right of the main diagonal
        for (my $i = 0; $i < $m_len; $i++) {    # Rows
            $matrix[$i] = [];
            for (my $j = $i; $j < $m_len; $j++) {    # Columns only to the right of the main diagonal
                $matrix[$i][$j] = 1 if $symbols[$i] eq $symbols[$j];
            }
        }

        # Analysis of the diagonals to the right of the main diagonal and filling of the results
        my %repeats_tmp = ();    # Hash of repeats
        my ($i, $j);

        # Search the diagonals from right to left, i.e. from short to long repeats
        for ($i = $m_len - 1; $i > 0; $i--) {
            my $repeat = '';
            my $repeat_pos = undef;
            my $repeat_temp;

            for ($j = $i; $j < $m_len; $j++) {
                if (defined($matrix[$j-$i][$j]) && $matrix[$j-$i][$j] == 1) {
                    $repeat_temp = $repeat;
                    $repeat_temp =~ s/^ //;
                    # If the received string of repeat is already in the hash of repeats
                    if (defined($repeats_tmp{$repeat_temp})) {
                        $repeat_pos = $j - length($repeat_temp);
                        $repeats_tmp{$repeat_temp}{$repeat_pos} = 1;
                        $repeat = $symbols[$j];
                    } else {
                        $repeat .= $symbols[$j];
                    }
                } else {
                    if ($repeat ne '') {
                        $repeat =~ s/^ //;
                        $repeat_pos = $j - length($repeat);
                        if (length($repeat) >= $min_longest_repeat_length) {
                            if (defined($repeats_tmp{$repeat})) {
                                $repeats_tmp{$repeat}{$repeat_pos} = 1;
                            } else {
                                $repeats_tmp{$repeat} = {$repeat_pos => 1};
                            }
                        }
                        $repeat = '';
                    }
                }
            }
            # A repeat that runs to the end of the diagonal
            if ($repeat ne '') {
                $repeat =~ s/^ //;
                $repeat_pos = $j - length($repeat);
                if (length($repeat) >= $min_longest_repeat_length) {
                    if (defined($repeats_tmp{$repeat})) {
                        $repeats_tmp{$repeat}{$repeat_pos} = 1;
                    } else {
                        $repeats_tmp{$repeat} = {$repeat_pos => 1};
                    }
                }
                $repeat = '';
            }
        }

        # The first occurrence is not stored as a position, hence the 1 +
        foreach (keys %repeats_tmp) {
            $$reps_ref{$_} = 1 + scalar keys %{$repeats_tmp{$_}};
        }

        # Output matrix for diagnostics
        print "\n";
        print '     ';
        for (my $i = 0; $i < $m_len; $i++) {
            print '  ' . $symbols[$i];
        }
        print "\n";
        print '     ';
        for (my $i = 0; $i < $m_len; $i++) {
            printf '%3d', $i;
        }
        print "\n";
        print "\n";
        for (my $i = 0; $i < $m_len; $i++) {
            print $symbols[$i];
            printf '%3d ', $i;
            for (my $j = 0; $j < $m_len; $j++) {
                my $value = '.';
                $value = '+' if (defined $matrix[$i][$j] && $matrix[$i][$j] == 1);
                printf('  %1s', $value);
            }
            print "\n";
        }
        print "\n";
    }
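
    The same diagonal scan can be sketched more compactly in Python. This is a port written for illustration, not the author's code: the name longest_repeats and the min_len parameter are assumptions. Instead of storing the matrix, it compares text[j - shift] with text[j] directly, which is what one diagonal of the matrix represents, and it counts occurrences the same way as the Perl version.

```python
def longest_repeats(text, min_len=4):
    """Count repeated substrings of at least min_len characters."""
    n = len(text)
    positions = {}  # repeat -> start positions of its later occurrences

    def strip_space(run):
        # Mirror the Perl s/^ //: drop a single leading space.
        return run[1:] if run.startswith(" ") else run

    def record(run, end):
        repeat = strip_space(run)
        if len(repeat) >= min_len:
            positions.setdefault(repeat, set()).add(end - len(repeat))

    # Shifts from large to small, i.e. from short to long diagonals.
    for shift in range(n - 1, 0, -1):
        run = ""
        for j in range(shift, n):
            if text[j - shift] == text[j]:
                known = strip_space(run)
                if known in positions:
                    # The run so far is an already known repeat: count it
                    # and start a new run from the current character.
                    positions[known].add(j - len(known))
                    run = text[j]
                else:
                    run += text[j]
            else:
                if run:
                    record(run, j)
                run = ""
        if run:
            record(run, n)

    # The first occurrence is never stored as a position, hence the 1 +.
    return {repeat: 1 + len(starts) for repeat, starts in positions.items()}


print(longest_repeats("sale for sale for sale"))
# {'sale': 3, 'for sale': 2}
```

    Like the original, this runs in time quadratic in the text length, which is acceptable for short form submissions.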

    2. Statistics of repeats

    We selected the threshold for the minimum repeat length that gave the maximum efficiency in tests (I deliberately do not give its value here). The results by number of repeats are as follows:

    Number of repeats    In spam, %    In not spam, %
    2                    78.58         90.28
    3                    11.93         4.86
    4                    4.45          2.08
    5                    2.30          1.39
    6                    1.93          0
    7                    0.22          0
    8                    0.37          0
    9                    0.07          0

    3. Conclusion

    I have shown an implementation of the naive algorithm for finding repeated substrings in a text. Both the number of repetitions and the repeated strings themselves (e.g., stop words) can be used in the analysis. I repeat that in the fight against spam, combined tests are more effective.

    Learn more about CleanTalk Anti-Spam.

     

  • Non-visual methods to protect the site from spam. Part 2. The true face of symbols

    Continuation of the article Non-visual methods to protect the site from spam

    Part 2: The true face of symbols

    Non-visual methods of protecting a website from spam use, in particular, analysis of the transmitted text. Spammers use many techniques to complicate this analysis. Here I show examples of one of them, namely the substitution of symbols. The examples are taken from actual CleanTalk data.

    Symbol substitution is very simple, but as a result stop-word filters may fail to trigger, and Bayesian filters and language-detection filters may work worse. Therefore, before applying these filters it makes sense to give the symbols back their true face.

    Note at once that replacing symbols directly, for example replacing a national character that looks like the Latin ‘a’ with a mark by the plain Latin ‘a’, is totally unacceptable without analyzing the language and context. Likewise, replacing letters that look like zero with zero is possible only when you know exactly what you are looking for in the text (for example, telephone numbers).

    However, replacing characters is permitted when the meaning of the written text is preserved after the change. Replacement is also needed to reduce certain sets of special symbols to a single one.

    Here I will show the two most interesting ways of substituting symbols that we have encountered.

    1. Replacing symbols with a normal typeface

    Spammers do everything to make their text conspicuous, even at a cursory glance. Fortunately for them, Unicode provides extended sets of Latin character typefaces. Fortunately for us, this is easily corrected.

    The most common case is Latin characters replaced with the same Latin letters taken from outside the basic Latin range.

    Replacing such characters with ordinary Latin ones comes down to a simple regular expression. After this change the subsequent filters work better and faster, because the range of input characters is greatly narrowed.
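
    One way to sketch this folding without writing out the character ranges by hand is Unicode compatibility normalization. Note that this swaps in a different technique from the regular expression mentioned above: NFKC is a standard normalization form, and the assumption here is that the substitutes are compatibility variants such as fullwidth, circled, or mathematical letters. The function name to_plain_latin is hypothetical.

```python
import unicodedata


def to_plain_latin(text):
    """Fold compatibility variants (fullwidth, circled, mathematical) to base letters."""
    return unicodedata.normalize("NFKC", text)


print(to_plain_latin("\uff53\uff41\uff4c\uff45"))  # fullwidth letters -> sale
```

    NFKC also rewrites other compatibility characters (ligatures, superscripts, and so on), so it suits texts where that is acceptable. Letters that merely look alike, such as Cyrillic ones, are left untouched, which matches the warning above about not replacing national characters blindly.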

    2. Replacing the dot

    The dot is used much more widely than as a punctuation mark: it serves as a field delimiter, as a separator of positions, and as a delimiter in numbers, spam phone numbers, etc.

    So we need to reduce the variety of spam dots to a single one.

    The most common dot substitutes we have encountered are shown below.

    Substitute, code    Substitute, view
    U+3002              。
    U+0701              ܁
    U+0702              ܂
    U+2024              ․
    U+FE12              ︒
    U+FE52              ﹒
    U+FF61              ｡

    The dots can be replaced with a simple transliteration expression

    tr/
    \N{U+3002}\N{U+0701}\N{U+0702}\N{U+2024}\N{U+FE12}\N{U+FE52}\N{U+FF61}
    /
    \N{U+002E}\N{U+002E}\N{U+002E}\N{U+002E}\N{U+002E}\N{U+002E}\N{U+002E}
    /

    We have observed that subsequent filters work noticeably more effectively after the dots are replaced.
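
    For comparison, the same mapping can be sketched in Python with a translation table. The code points are the seven listed above; the names DOT_LOOKALIKES, DOT_TABLE and normalize_dots are assumptions made for illustration.

```python
# The seven dot substitutes from the table above.
DOT_LOOKALIKES = "\u3002\u0701\u0702\u2024\ufe12\ufe52\uff61"

# Map every substitute to the ordinary full stop U+002E.
DOT_TABLE = str.maketrans(DOT_LOOKALIKES, "." * len(DOT_LOOKALIKES))


def normalize_dots(text):
    """Replace dot look-alikes with the ASCII dot."""
    return text.translate(DOT_TABLE)


print(normalize_dots("example\u3002com"))  # example.com
```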

    3. Conclusion

    I have presented two ways of substituting symbols. The inverse replacement is simple, undemanding of system resources, and greatly increases the accuracy of filters based on the analysis of words and expressions.


  • Non-visual methods to protect the site from spam. Part 1. Statistics

    Part 1. What the statistics say

    Non-visual methods of protecting a site from spam rely on automatic analysis of the data coming from the visitor. The more data is analyzed, the more completely and accurately the visitor can be characterized and the decision made on whether he is a spammer.

    Systems that analyze such data usually accumulate statistics on visitor data and on the decisions made. We offer an overview of the statistics collected by us (CleanTalk, a service that protects sites from spam).

    Here I deliberately leave out the analysis of IP addresses against blacklists. Even without it, enough data can be obtained by analyzing only the contents of form fields and HTTP headers.

    I will review the data on the message text, the nickname, the email address, the HTTP headers, and the results of the JavaScript test.

    Analysis of these indicators is algorithmically very simple and undemanding of resources, so it can be run before other, more resource-intensive checks.

    The data reflect the real picture at the time of writing and are based on our analysis of current traffic (more than 2 000 000 requests per day). The data can be freely used when analyzing visitors to your own sites. Note that a verdict based on any single criterion is unreliable; the best result is achieved with a comprehensive analysis.

    1. Message text

    The message text is certainly the main thing in spam. Consequently, spammers build their posts in such a way that, on several criteria, they clearly differ from normal messages.

    The following table shows the statistics that are, in my view, the most informative.

    Message text parameters (average values)            Not spam    Spam
    Number of links, pcs                                1.47        4.27
    Number of contacts (phone, e-mail), pcs             1.72        6.38
    Form filling time, sec                              177         8
    Message length / form filling time, symbols/sec     23.81       308.54

    The number of links speaks for itself. The amount of contact information can also indicate spam. The form filling time and, as a consequence, the typing speed differ most strongly.
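
    Indicators of this kind are cheap to compute. The Python sketch below is only an illustration: the regular expressions, the name message_features, and the decision to count e-mail addresses but not phone numbers are assumptions, not the rules used by the service.

```python
import re

# Deliberately simple patterns; real-world matching would be stricter.
LINK_RE = re.compile(r"https?://\S+", re.IGNORECASE)
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def message_features(text, fill_seconds):
    """Return the link count, e-mail count and typing speed of one submission."""
    speed = len(text) / fill_seconds if fill_seconds > 0 else float("inf")
    return {
        "links": len(LINK_RE.findall(text)),
        "emails": len(EMAIL_RE.findall(text)),
        "chars_per_second": speed,
    }


sample_text = "Buy at http://a.example and https://b.example or mail x@y.example"
print(message_features(sample_text, fill_seconds=8))
```

    A typing speed far above what a person can manage is a strong hint on its own, but, as noted above, no single indicator should decide the verdict.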

    2. The nickname of the visitor

    The nickname can also tell a lot. The probable cause is the quality of the name-generating algorithms that spammers use.

    Nickname parameters (average values)                                     Not spam    Spam
    Length, symbols                                                          7.40        16.52
    Number of delimiters, pcs                                                1.89        3.80
    Number of digits, pcs                                                    3.29        7.59
    Length of a continuous run of consonant letters (for Latin), symbols     3.61        5.90

    One of the spammer's tasks is to avoid the error that a user with the same name already exists on the site. So, according to the statistics, nicknames are currently made unique by brute force: greater length and inserted delimiters and digits. As a result, you get many nicknames with long runs of adjacent vowels or consonants, the latter being more common.
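
    The four nickname indicators from the table can be computed in a few lines. This Python sketch is an illustration under stated assumptions: the delimiter set, the treatment of y as a consonant, and the name nickname_features are choices made here, not definitions from the service.

```python
import re

# Runs of Latin consonants; y is counted as a consonant here (an assumption).
CONSONANT_RUN = re.compile(r"[b-df-hj-np-tv-z]+", re.IGNORECASE)
DELIMITERS = set("._-")  # assumed delimiter characters


def nickname_features(nick):
    """Return length, delimiter count, digit count and longest consonant run."""
    runs = CONSONANT_RUN.findall(nick)
    return {
        "length": len(nick),
        "delimiters": sum(ch in DELIMITERS for ch in nick),
        "digits": sum(ch.isdigit() for ch in nick),
        "max_consonant_run": max((len(run) for run in runs), default=0),
    }


print(nickname_features("xkcd_2024.bot"))
# {'length': 13, 'delimiters': 2, 'digits': 4, 'max_consonant_run': 4}
```

    The same function can be applied to the name part of an e-mail address, for which the article reports similar statistics.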

    3. Name in e-mail

    Everything said about nicknames is also true of the name part of the e-mail address.

    Parameters of the name in e-mail (average values)    Not spam    Spam
    Length, symbols                                      10.09       19.16
    Number of delimiters, pcs                            1.62        4.12
    Number of digits, pcs                                4.30        9.57

    Note that the dot is often used as the delimiter: a character string is generated, dots are then inserted into it at random positions, and in this way many e-mail names are obtained.

    4. HTTP headers

    Spam bots forge their headers so that they do not differ much from a browser's.

    However, the statistics show that this often holds only at the time the bot is written. The bot then keeps working and keeps sending clearly outdated headers, as can be seen in the table below.

    Share of requests with the User-Agent header                              Not spam    Spam
    Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)                   0.01%       11.42%
    Opera/9.80 (Windows NT 6.2; Win64; x64) Presto/2.12.388 Version/12.17     0.01%       10.84%

    Ready-made spam solutions may also leave their own headers, in particular when an HTTP proxy is used. This is also reflected in our statistics.

    Share of requests with the Via header    Not spam    Spam
    Mikrotik HttpProxy                       0.86%       33.07%

    5. JavaScript test

    A JavaScript test can be an additional, simple but very effective check, for example setting the desired cookies from JS code; there are many options.

    The most advanced (and expensive) bots pass JS tests. However, as the statistics show, a large percentage of spam comes from very simple programs that cannot do so.

    Share of requests failing the JS test    Not spam    Spam
    Cookie change through JS                 0.41%       68.53%

    6. Conclusion

    I have shown the statistical data collected by our system so far. Again, for the most accurate spam/not spam decision you need to analyze the indicators comprehensively and in combination with other spam checks.
