Author: Alexander

  • Developing chat bots for Telegram and Slack with PHP

    General information

    This article describes how to create simple chat bots for Telegram and Slack, using the example of checking an IP address or email for spam with the CleanTalk anti-spam service.

    Telegram

    The first step is to create your bot (in our case @CleanTalkBot). Telegram has a special bot for this purpose, @BotFather. Add it to your Telegram account and send the command /newbot. The bot will ask you to enter a name for the new bot. After that, enter the bot's username; we made the name and the username the same. The username must end with bot or Bot, for example HabrArticleBot or CleanTalkBot. Once the username is entered, the bot is created and you are given a token that will be used later for identification.

    The second step is to install a webhook, in other words, a handler for the requests that come to the chat bot from users. When a user sends a command to your chat bot, Telegram calls the address specified as the webhook and passes it the user's message along with service information. Your handler generates a response and sends it back to Telegram, after which Telegram delivers the answer to the user. The webhook can be installed with the curl command in a terminal:

    curl -d "url=https://example.com/telegramwaiter.php" https://api.telegram.org/botYOUR_TELEGRAM_TOKEN/setWebhook

    where YOUR_TELEGRAM_TOKEN is the token that @BotFather gave you earlier, and https://example.com/telegramwaiter.php is the address that will handle requests from Telegram. In response, Telegram should return a JSON string like

    {"ok":true,"result":true,"description":"Webhook is set"}

    which means that the handler for your chat bot was installed successfully.
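
    The reply can also be checked in code rather than by eye. The following is a minimal sketch, assuming the reply has already been received as a string; the function name webhookWasSet is hypothetical and is not part of the Telegram API.

    ```php
    <?php

    // Hypothetical helper: returns true only if the setWebhook reply
    // is valid JSON with both "ok" and "result" set to true.
    function webhookWasSet(string $json): bool
    {
        $reply = json_decode($json, true);

        return is_array($reply)
            && ($reply['ok'] ?? false) === true
            && ($reply['result'] ?? false) === true;
    }

    // Possible usage with the reply shown above:
    // webhookWasSet('{"ok":true,"result":true,"description":"Webhook is set"}');
    ```

    Checking both fields keeps the helper strict: an error reply or a malformed string is treated as a failed installation.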

    It should be added that Telegram webhooks work only over HTTPS. If you have a certificate issued by a certificate authority (not self-signed), everything is fine; if you want to use a self-signed certificate, see the documentation at https://core.telegram.org/bots/self-signed.

    The third step is to write the handler for requests from Telegram, telegramwaiter.php. A sample script in PHP looks like this:

    <?php

    set_time_limit(0);

    // Set the token
    $botToken = "YOUR_TELEGRAM_TOKEN";
    $website = "https://api.telegram.org/bot" . $botToken;

    // Receive the request from Telegram
    $content = file_get_contents("php://input");
    $update = json_decode($content, true);
    $message = $update["message"] ?? [];

    // Get Telegram's internal chat ID and the command the user entered in the chat
    $chatId = $message["chat"]["id"] ?? null;
    $text = $message["text"] ?? '';

    // Example of processing the /start command
    if ($chatId !== null && $text == '/start') {
        $welcomemessage = 'Welcome!!! Check IP/Email for spam giving "check IP/Email" command';

        // Send the generated message back to the Telegram user;
        // urlencode() keeps spaces and quotes from breaking the query string
        file_get_contents($website . "/sendMessage?chat_id=" . $chatId . "&text=" . urlencode($welcomemessage));
    }

    ?>

    The procedure is as follows: read the user's command from the chat into the variable $text, form a message according to the desired logic, and send it back to the user with the function file_get_contents().
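
    The same procedure can be split into two small functions, which makes the handler easier to test without calling Telegram. This is only a sketch under stated assumptions: the names parseUpdate and buildSendMessageUrl are hypothetical, and only the message fields used in the script above are read.

    ```php
    <?php

    // Hypothetical helper: extracts the chat ID and the command text from
    // the raw update, or returns null if it is not a text message.
    function parseUpdate(string $json): ?array
    {
        $update = json_decode($json, true);
        if (!isset($update['message']['chat']['id'], $update['message']['text'])) {
            return null;
        }

        return [
            'chat_id' => $update['message']['chat']['id'],
            'text'    => trim($update['message']['text']),
        ];
    }

    // Hypothetical helper: builds the sendMessage URL; http_build_query()
    // URL-encodes the text so that spaces and quotes survive the request.
    function buildSendMessageUrl(string $botToken, $chatId, string $text): string
    {
        return 'https://api.telegram.org/bot' . $botToken . '/sendMessage?'
            . http_build_query(['chat_id' => $chatId, 'text' => $text], '', '&');
    }

    // In the webhook handler this could be used as:
    // $msg = parseUpdate(file_get_contents('php://input'));
    // if ($msg !== null && $msg['text'] === '/start') {
    //     file_get_contents(buildSendMessageUrl('YOUR_TELEGRAM_TOKEN', $msg['chat_id'], 'Welcome!'));
    // }
    ```

    Keeping the network call out of the helpers means the parsing and URL building can be checked with plain strings.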

    You can see how it works by adding the @CleanTalkBot bot in Telegram: enter the command check IP|Email and you will get information on whether the specified IP or email is a spam source.

    Example of a response

    Email st********@ex*****.com is BLACKLISTED. Frequency 999. Updated Apr 24 2019. https://cleantalk.org/blacklists/st********@ex*****.com.

    Slack

    Slack takes a slightly different approach to creating chat bots.

    Go to https://api.slack.com/apps/new and create a new Slack application.

    In the app list at https://api.slack.com/apps, choose your app, follow the Slash Commands link in the menu on the right, and click Create new command.

    Fill in the following fields in the form that appears:

    Command – enter the command, beginning with /, for example /ctcheck.

    Request URL – the URL of the command's request handler, similar to the Telegram webhook (e.g. https://cleantalk.org/slackwaiter.php).

    Short description – a brief description of what the created command does.

    Save the command. Note that your site must run over HTTPS, and in this case self-signed certificates are NOT SUPPORTED by Slack.

    The token for identification can be found on the page with the list of commands: under the list there is a Verification token field. Below it is referred to as YOUR_SLACK_TOKEN.

    Write the handler slackwaiter.php in PHP:

    <?php

    set_time_limit(0);

    // Check that the token received from Slack matches the one issued in the Slack dashboard
    if (isset($_POST['token']) && $_POST['token'] == 'YOUR_SLACK_TOKEN') {
        // $param is the text that follows the command;
        // for example, for the command /ctcheck 127.0.0.1
        // $param = 127.0.0.1
        $param = isset($_POST['text']) ? $_POST['text'] : '';

        // The answer is then formed according to the internal logic
        $slackresponse = 'Here is the response to the command';
    } else {
        $slackresponse = 'Error';
    }

    $response = array();
    $response['text'] = $slackresponse;

    header('Content-Type: application/json');
    echo json_encode($response);

    ?>
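
    The "internal logic" step is left open in the handler above. As one possible illustration, the sketch below decides whether the command text looks like an IP address or an email before any lookup is made. All function names are hypothetical, and the reply texts are placeholders rather than the real CleanTalk responses.

    ```php
    <?php

    // Hypothetical helper: compares the received token with the expected one.
    // hash_equals() does the comparison in constant time.
    function isValidSlackToken(string $received, string $expected): bool
    {
        return hash_equals($expected, $received);
    }

    // Hypothetical helper: decides whether the command text is an IP address,
    // an email address, or neither.
    function classifyParam(string $text): ?string
    {
        $text = trim($text);
        if (filter_var($text, FILTER_VALIDATE_IP) !== false) {
            return 'ip';
        }
        if (filter_var($text, FILTER_VALIDATE_EMAIL) !== false) {
            return 'email';
        }

        return null;
    }

    // Hypothetical helper: builds the JSON body that is returned to Slack.
    function buildSlackReply(string $text): string
    {
        $kind = classifyParam($text);
        $message = $kind === null
            ? 'Please send an IP address or an email.'
            : 'Checking ' . $kind . ' ' . trim($text) . '...';

        return json_encode(['text' => $message]);
    }
    ```

    Validating the parameter first means that a malformed command gets a helpful reply instead of a failed lookup.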

    Then go to https://api.slack.com/docs/slack-button and, in the section Add the Slack button, tick incoming webhook and commands. Slack generates the HTML code of a button; by clicking it, other teams can integrate your application into their Slack account.

    Place this button on your site. Clicking it opens the Slack authorization page.

    To log in, the user needs to select a channel where the application can be used.

    After the user clicks the Authorize button, Slack redirects the user to the Redirect URI(s) page, which you (the developer) define at https://api.slack.com/apps: select your application and follow the App Credentials link.

    Slack does not simply redirect the user to the given page; it also adds a GET variable code, whose value is later processed by the script, for example

    https://cleantalk.org/authscript.php?code=Slack_Code

    Below is an example of the authscript.php script. Take CLIENT_ID and CLIENT_SECRET from the corresponding fields on the App Credentials page.

    <?php

    if (isset($_GET['code'])) {
        $client_id = 'CLIENT_ID';
        $client_secret = 'CLIENT_SECRET';
        $code = $_GET['code'];

        // Exchange the temporary code for the authorization result
        $response = file_get_contents("https://slack.com/api/oauth.access?client_id=" . urlencode($client_id) . "&client_secret=" . urlencode($client_secret) . "&code=" . urlencode($code));
        $responsearr = json_decode($response, true);

        if (isset($responsearr['team_name'])) {
            header('Location: https://' . $responsearr['team_name'] . '.slack.com');
            exit();
        } else {
            echo 'Error.';
            exit();
        }
    } else {
        exit();
    }

    ?>

    The procedure is as follows: take the GET variable code received from Slack and, together with two more parameters, client_id and client_secret, send a GET request to https://slack.com/api/oauth.access. In response, Slack sends a JSON string with many fields, something like this:

    {"ok": true, "team_name": "your_team_name"}

    Then simply take the team name and redirect the user to the team's main page, https://your_team_name.slack.com. The application is now authorized, and its commands can be used.
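
    The two halves of this exchange, building the request and reading the reply, can be sketched as separate functions. This is an illustration only: the names buildOauthAccessUrl and teamUrlFromReply are hypothetical, and the reply is assumed to contain just the ok and team_name fields shown above.

    ```php
    <?php

    // Hypothetical helper: builds the oauth.access request URL with
    // properly URL-encoded parameters.
    function buildOauthAccessUrl(string $clientId, string $clientSecret, string $code): string
    {
        return 'https://slack.com/api/oauth.access?' . http_build_query([
            'client_id'     => $clientId,
            'client_secret' => $clientSecret,
            'code'          => $code,
        ], '', '&');
    }

    // Hypothetical helper: returns the redirect URL built from Slack's reply,
    // or null if the reply reports an error or lacks team_name.
    function teamUrlFromReply(string $json): ?string
    {
        $reply = json_decode($json, true);
        if (!is_array($reply) || empty($reply['ok']) || !isset($reply['team_name'])) {
            return null;
        }

        return 'https://' . $reply['team_name'] . '.slack.com';
    }
    ```

    Returning null on any unexpected reply lets the calling script show an error page instead of redirecting to a broken address.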

    The CleanTalk team hopes that this information will be useful to anyone interested in developing chat bots.

  • How we increased conversion: the history of the cloud service CleanTalk Anti-Spam

    Representatives of the cloud anti-spam service CleanTalk Anti-Spam told vc.ru how they improved the conversion of the service's website using fairly simple solutions, and how they assessed the effectiveness of each.

    CleanTalk Anti-Spam is a cloud service that protects websites from spam bots. Since the service launched, the number of clients has grown steadily, but over time it became clear that the website's conversion could be improved.

    For ease of understanding and analysis, we divided conversion into “free” (registration on the website) and “paid” (the start of using the paid services). Of course, the two are interrelated: by increasing the number of users at the registration stage, we also get growth in paid users.

    So we decided that increasing the number of users was within our power, but that we needed to look for ways to optimize the existing acquisition channels. As a result, having spent relatively little time and resources on finding and testing such ways, we achieved results that surpassed our wildest expectations: conversion to registration increased by about 98%, and conversion from registered to paying users by 49.4%.

    Our users

    The users of CleanTalk Anti-Spam are companies with a corporate website, owners of personal sites, webmasters and web studios, and owners of online shops. In fact, any website that has a form to fill in may need protection from spam bots.

    Users' expectations of an anti-spam service are quite natural and predictable: having signed up once, they want everything taken care of, so that they no longer have to pay attention to the service or to the problem it solves, and do not waste their time. They want to be fully confident in the solution's effectiveness and to trust it completely. In our case, this means they get spam protection that is invisible to visitors and does not make them do extra work to prove that they are human.

    “Free” conversion

    The first thing we looked at as a potential source of growth in registrations was a change of site design. However, as it turned out in our case, design has little effect on conversion. We developed and tested several versions of the main page, but they gave no tangible result.

    The next thing we decided to try was placing the registration button directly in the part of the site that drives most of the traffic. The solution is quite obvious: interested visitors see the offer in the very place they came for. In our case, this is the blacklists of IP and email addresses from which spam is sent. As a result, a simple call to register in the right place gave a 6.6% increase in registrations over two months of testing. Of course, after that we left the button in its new place.

    As the next step, we decided to show a pop-up window with a banner calling for registration. The banner is displayed only after the user makes a second search on the site, so it is seen only by targeted visitors who are likely to need protection from spam bots.

    Despite the popular and quite well-founded belief that pop-up banners annoy visitors, we got a 38.9% increase in registrations from the blacklist search. Moreover, we received no complaints and saw no reduction in repeat visits to the site.

    Based on this experience, we can give some advice: look at what additional information you can use to attract users. Describe in one sentence what problem you solve, and make users an offer with the minimum number of steps required to reach the goal.

    Further study of our own site led to the idea that the registration procedure and its form could be simplified. We removed all the extra fields from the form, leaving only the most necessary ones. Users no longer need to come up with a password: it is generated automatically and sent to the specified email address immediately after registration, together with other necessary information. As a result, the form has only two fields: email and website address.

    In dropping manual password entry in favor of automatic generation, we were guided by the consideration that a new user arriving at an unfamiliar service is unlikely to want to use their regular password and will instead have to come up with a new one. This complicates the task, and the user may leave the website. By simplifying it, we achieved growth in registrations of about 12%.

    Good results were also achieved by embedding the registration form in the settings page of the plugins for the WordPress and Joomla CMS. Thanks to the simplicity and speed of registration, conversion grew by 19.5%.

    “Paid” conversion

    Because of the nature of the SaaS model, most cloud services, including CleanTalk Anti-Spam, give users a free trial period. However, the length of the period differs from service to service, so we decided to find the best option experimentally. Moreover, the effect of the trial length on conversion also depends on the source of registrations.

    We assumed that, depending on the source of registration, our users need trial periods of different lengths. From some sources users arrive prepared and “ripe”, while others need more time to explore the capabilities of the service. For CleanTalk, the sources of registrations can be divided into three categories:

    • Search traffic from the blacklist;
    • Traffic from the plugins directory;
    • Transitions from websites and social networks.

    We separated the sources and changed the length of the trial period for each of them. Note that the system does not display the length of the trial period in days; instead, users see the date until which they can use CleanTalk's functionality without payment.

    For search traffic from the blacklist, the best conversion was with a 7-day free period. Compared with the 14-day trial period that we initially applied to all users, conversion increased by 86%!

    For registrations from the plugin directories, the best figure was 14 days. A selective survey showed that this group of users needs more time to assess the functionality of the service and its applicability to their situation. Since we initially used the 14-day free period, every other length reduced conversion.

    But users who came from websites and social networks do not need a trial period at all. We assume this is because these people come from sites where they have already become familiar with the service and read a description, feedback, or a review. That is, they already have a recommendation from a source they trust, so there is no need to test. Compared with the 14-day period, the growth in paid users amounted to 115%.

    No less attention should be paid to the choice of payment systems and the order in which they are placed on the payment page. Be sure to choose proven, well-known payment systems rather than favoring those with a lower commission. Also, do not offer too many payment methods: this disperses users' attention and makes them doubt their initial choice.

    Some time ago, we experienced technical problems with receiving payments via PayPal. As a result, we moved PayPal down the list, which was a mistake. During the test, we left only two payment methods and moved PayPal back to the top. As a result, the number of payments increased by 23%.

    Placing security certificate icons on the payment page and showing the EV Green Bar in the browser's address bar should increase users' confidence and, consequently, conversion. However, in our case neither of these actions gave any result. This is most likely due to the high technical competence of the service's audience, who do not need to be convinced of the safety of popular payment systems.

    We continue to use the EV Green Bar, and the security certificate icon was replaced by a money-back guarantee icon. However, this had no effect either and only serves to inform users about this option.

    We were very pleased with the results. However, even the actions that did not have the desired effect in our case can bring significant benefit to a business in other situations. This is very individual for each service and depends on many parameters. We wanted to draw your attention to the things that helped us in our development. Perhaps our experience will be useful to you.

  • CleanTalk apps for Slack and Telegram chats

    We are pleased to announce that we have developed apps for Slack and Telegram that let you check IPs and emails against the blacklists directly in the chat.

    To do this, add the application to your chat and send the command with the IP or email to check. The application queries our database and returns the result in the chat: “Spam” or “Not Spam.”

    Instructions can be found here https://cleantalk.org/help/bots

    If you use Slack or Telegram frequently, you will find our application convenient: you won’t have to go to our website to check whether an IP or email is blacklisted.

  • Anti-Spam Filter for Subnets

    Dear users!

    We are pleased to announce the launch of an anti-spam filter for subnets.

    Now you can add to your personal blacklist not only individual IP addresses but also entire subnets. You can add entries to your personal blacklist in the Black&White lists section of your CleanTalk Dashboard.

    Instructions on how to add entries to your personal blacklist can be found here: https://cleantalk.org/help/sfw-blocks-networks.

  • Delegating access rights to the CleanTalk Dashboard

    Dear Customers,

    We are pleased to announce the launch of a new option in the CleanTalk dashboard.

    This option allows you to delegate access rights to other users of the CleanTalk dashboard.

    It is useful for web studios and webmasters who maintain customer sites: for each site, you can grant either view-only access or full access to manage its settings.

    Read access: allows the user to view all sections of the dashboard.

    Full access: allows the user to change the service settings for a specific site, edit the personal blacklists, and connect extended options.

    You can delegate different access rights for each website: for example, give one user read access and another full access.

    Instructions for use can be found here https://cleantalk.org/help/delegation

    Please note that this option will be included in the advanced package from 15 July 2016.

  • SpamFireWall – denying spambots access to the site

    Every website owner or webmaster faces the scourge of spam in comments and contact forms, and of spambots registering in the guise of users. The website's forms have to process these messages, which consumes server resources. Some spambots load the page first in order to bypass anti-spam protection, which consumes even more resources. In small amounts this is imperceptible, but when a website receives thousands of such requests per day, it can significantly affect the server's CPU load.

    Now we will tell you about a new option in the CleanTalk anti-spam plugin that can effectively repel spambot attacks on your website. The option is called SpamFireWall (SFW); it blocks POST and GET requests from the most active spambots and does not let them load the server.

    How it works

    1. The user visits the website.
    2. The user's IP address is checked against a database that contains records of more than two million IP addresses belonging to spambots.
    3. If the IP address is in the database, the site displays a special page. Ordinary users will not notice anything, as the protection works in an invisible mode.
    4. All information about the process is stored in the database and is available in the dashboard.
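
    The check in step 2 can be illustrated with a small sketch. It is not the SpamFireWall implementation: the blacklist here is a plain in-memory array rather than a database, only IPv4 is handled, a 64-bit PHP build is assumed, and the function names ipInNetwork and isBlacklisted are hypothetical.

    ```php
    <?php

    // Hypothetical helper: checks whether an IPv4 address falls inside a
    // network given in CIDR notation, e.g. "10.0.0.0/8". A bare address is
    // treated as a /32 network.
    function ipInNetwork(string $ip, string $cidr): bool
    {
        $parts = explode('/', $cidr);
        $bits = isset($parts[1]) ? (int) $parts[1] : 32;
        $ipLong = ip2long($ip);
        $netLong = ip2long($parts[0]);
        if ($ipLong === false || $netLong === false || $bits < 0 || $bits > 32) {
            return false;
        }
        // Build the network mask; a /0 mask matches every address
        $mask = $bits === 0 ? 0 : (~0 << (32 - $bits)) & 0xFFFFFFFF;

        return ($ipLong & $mask) === ($netLong & $mask);
    }

    // Hypothetical helper: true if the visitor's IP matches any blacklist entry,
    // whether that entry is a single address or a whole network.
    function isBlacklisted(string $ip, array $blacklist): bool
    {
        foreach ($blacklist as $entry) {
            if (ipInNetwork($ip, $entry)) {
                return true;
            }
        }

        return false;
    }
    ```

    A linear scan is fine for a short personal list; for millions of records an indexed lookup would be needed instead.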

    The special page, which is displayed when spam activity is suspected, does not cost much time for users who see it by mistake. After 3 seconds such a user is forwarded to the requested page automatically, or sooner by clicking the link.

    This blocks all HTTP/HTTPS traffic from spam-active IP addresses. Thus, in addition to spam attacks, other types of attacks on websites can no longer be carried out from these IP addresses: brute force, DDoS, SQL injection, site scanning by spambots, referral spam, etc.

    SpamFireWall lets users configure their own “black lists”, to which they can add both individual IP addresses and networks.

    Currently SpamFireWall is available for WordPress, Joomla, Drupal, Bitrix, SMF, MediaWiki, and IPS Community Suite. In addition, you can use an API method to get a list of spam-active networks (https://cleantalk.org/help/api-spam-check).

    Logging of SFW requests

    All requests that triggered the SFW option are stored in a log and are then available in the control dashboard.

    In the statistics you can see the number of blocked requests, as well as requests that were blocked but then passed through to the site. At the moment the SFW database contains 3.22 million IP addresses. Over 7 days, from 3 to 10 May, SFW blocked 3,858,562 requests.

    About the service CleanTalk

    CleanTalk is a cloud service to protect websites from spam bots. CleanTalk uses protection methods that are invisible to the visitors of the website. This allows you to abandon the methods of protection that require the user to prove that he is a human (captcha, question-answer etc.).

  • Which CMS gets spammed most often?

    The statistics are based on data from the CleanTalk anti-spam service for the period from April 2015 to March 2016. The analysis was conducted for the following CMS: WordPress, Joomla, 1C Bitrix, Drupal, phpBB3.0, phpBB3.1, IP.Board, SimpleMachines, MediaWiki.

    The analysis covered all POST requests processed by the service, such as comments, registrations, contact forms, orders, feedback, and others.

    The distribution of the main forms on the websites:

    • Comments 65.5% of sites
    • Registration 53%
    • Contacts 68.5%
    • Other 45%
    • Contact and comments 49%
    • Contact and registration 21%
    • Comments and registration 21%

    Sites with:

    • 1 form 23%
    • 2 forms 34%
    • 3 forms and more 41%

    Distribution of spam attacks per day per site, broken down by CMS

    Top CMS by spam attacks

    CMS               Number of spam attacks
    MediaWiki                         657.92
    Joomla                            172.45
    1C Bitrix                         129.27
    Drupal                            118.14
    IP.Board                           98.70
    WordPress                          49.75
    SimpleMachines                     41.75
    phpBB 3.1                          27.51
    phpBB 3.0                          25.88
    Average                           146.82

    As we can see, MediaWiki differs greatly from the other CMS. In our opinion, such a substantial gap is due to the fact that this CMS has no sufficiently effective means of protection, and it is very difficult for administrators to track changes made to articles. This makes it convenient for spammers to place links in articles.

    The low number of spam attacks on phpBB is due to the rather low prevalence of this platform.

    The number of blocked spam attacks for the period

    Month             Anti-spam     SpamFireWall
    April 2015        34,956,588               0
    May 2015          39,269,843               0
    June 2015         48,258,175               0
    July 2015         51,081,673               0
    August 2015       44,131,678               0
    September 2015    50,954,715               0
    October 2015      49,895,055       9,026,116
    November 2015     46,807,047      17,129,574
    December 2015     62,355,098      11,971,351
    January 2016      54,720,390      17,540,442
    February 2016     63,326,170      14,036,018
    March 2016        67,676,972      13,710,624
    April 2016        68,038,697      13,413,217

    Note that in October 2015 we launched the SpamFireWall service; since then, part of the spam attacks has been blocked by it and is not counted in the Anti-Spam figures. It is also worth noting that the SpamFireWall statistics include not only blocked POST requests but all blocked GET requests to the site as well.

    As the graph shows, the amount of spam on websites keeps growing and shows some seasonality. In summer and autumn the growth of spam stops or falls slightly, but with the onset of winter it always starts growing again.

    The percentage of spam in the POST requests by CMS

    CMS               % of spam
    MediaWiki             99.76
    WordPress             98.21
    Drupal                96.08
    SimpleMachines        95.74
    IP.Board              91.72
    Joomla                91.04
    phpBB 3.1             90.35
    1C Bitrix             82.03
    phpBB 3.0             81.87
    Average               91.87

    The statistics show that the share of spam in comments, registrations, contact forms, etc. is greater than 90%. In our opinion, spammers find that promoting links still works effectively: if not for advancing in search results, then for attracting an audience to their own sites.

  • Changing the title of the WordPress plugin

    We changed the old title of the plugin for WordPress, “Anti-Spam by CleanTalk”, to the new “Spam protection by CleanTalk”. Don’t worry: we want to test how people perceive long and short titles.

  • Non-visual methods to protect the site from spam. Part 3. Repeats

    Continuation of the article Non-visual methods to protect the site from spam

    Part 3: Repeats of substrings

    As mentioned above, non-visual methods of protecting a site against spam use text analysis. One of the most common spam signals is the presence of repeated strings. As always, the examples are taken from actual CleanTalk data.

    The search for such repeats must be minimally resource-intensive. It is best to call it after the tests from the first and second parts of the article, which eliminate obvious spam and bring the text into a form suitable for analysis. Here I give some statistics as well as sample code.

    1. Sample code

    We use a function that finds the longest repeated substrings with the naive algorithm described here: http://algolist.manual.ru/search/lrs/naive.php

    Example output is shown below.

     s  a  l  e     f  o  r     s  a  l  e     f  o  r     s  a  l  e
    0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21

    s  0   +  .  .  .  .  .  .  .  .  +  .  .  .  .  .  .  .  .  +  .  .  .
    a  1   .  +  .  .  .  .  .  .  .  .  +  .  .  .  .  .  .  .  .  +  .  .
    l  2   .  .  +  .  .  .  .  .  .  .  .  +  .  .  .  .  .  .  .  .  +  .
    e  3   .  .  .  +  .  .  .  .  .  .  .  .  +  .  .  .  .  .  .  .  .  +
    4   .  .  .  .  +  .  .  .  +  .  .  .  .  +  .  .  .  +  .  .  .  .
    f  5   .  .  .  .  .  +  .  .  .  .  .  .  .  .  +  .  .  .  .  .  .  .
    o  6   .  .  .  .  .  .  +  .  .  .  .  .  .  .  .  +  .  .  .  .  .  .
    r  7   .  .  .  .  .  .  .  +  .  .  .  .  .  .  .  .  +  .  .  .  .  .
    8   .  .  .  .  .  .  .  .  +  .  .  .  .  +  .  .  .  +  .  .  .  .
    s  9   .  .  .  .  .  .  .  .  .  +  .  .  .  .  .  .  .  .  +  .  .  .
    a 10   .  .  .  .  .  .  .  .  .  .  +  .  .  .  .  .  .  .  .  +  .  .
    l 11   .  .  .  .  .  .  .  .  .  .  .  +  .  .  .  .  .  .  .  .  +  .
    e 12   .  .  .  .  .  .  .  .  .  .  .  .  +  .  .  .  .  .  .  .  .  +
    13   .  .  .  .  .  .  .  .  .  .  .  .  .  +  .  .  .  +  .  .  .  .
    f 14   .  .  .  .  .  .  .  .  .  .  .  .  .  .  +  .  .  .  .  .  .  .
    o 15   .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  +  .  .  .  .  .  .
    r 16   .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  +  .  .  .  .  .
    17   .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  +  .  .  .  .
    s 18   .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  +  .  .  .
    a 19   .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  +  .  .
    l 20   .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  +  .
    e 21   .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  +

    $VAR1 = {
    'sale' => 3,
    'for sale' => 2
    };
    

    And here is the function in Perl with minimal changes. For convenience, the full text is given, including the code that prints the matrix above.

    #!/usr/bin/perl -w

    use strict;
    use utf8;
    use Data::Dumper;

    binmode(STDOUT, ':utf8');

    my $min_longest_repeat_length = 4;

    my $message = 'sale for sale for sale';
    my %longest_repeates = ();

    get_longest_repeates(\$message, \%longest_repeates);
    print Dumper(\%longest_repeates);

    sub get_longest_repeates {
        my $test_ref = shift;    # Reference to the text for analysis
        my $reps_ref = shift;    # Reference to the result hash

        my @symbols = split //, $$test_ref;
        my $m_len = scalar @symbols;

        my @matrix = ();    # A square matrix of symbol matches

        # Fill the matrix to the right of the main diagonal
        for (my $i = 0; $i < $m_len; $i++) {    # Rows
            $matrix[$i] = [];
            for (my $j = $i; $j < $m_len; $j++) {    # Columns, only to the right of the main diagonal
                $matrix[$i][$j] = 1 if $symbols[$i] eq $symbols[$j];
            }
        }

        # Analyze the diagonals to the right of the main diagonal and fill in the results
        my %repeats_tmp = ();    # Hash of repeats
        my ($i, $j);

        # Search the diagonals from right to left, i.e. from short to long repeats
        for ($i = $m_len - 1; $i > 0; $i--) {
            my $repeat = '';
            my $repeat_pos = undef;
            my $repeat_temp;

            for ($j = $i; $j < $m_len; $j++) {
                if (defined($matrix[$j-$i][$j]) && $matrix[$j-$i][$j] == 1) {
                    $repeat_temp = $repeat;
                    $repeat_temp =~ s/^ //;
                    # If the repeat string collected so far is already in the hash of repeats
                    if (defined($repeats_tmp{$repeat_temp})) {
                        $repeat_pos = $j - length($repeat_temp);
                        $repeats_tmp{$repeat_temp}{$repeat_pos} = 1;
                        $repeat = $symbols[$j];
                    } else {
                        $repeat .= $symbols[$j];
                    }
                } else {
                    if ($repeat ne '') {
                        $repeat =~ s/^ //;
                        $repeat_pos = $j - length($repeat);
                        if (length($repeat) >= $min_longest_repeat_length) {
                            if (defined($repeats_tmp{$repeat})) {
                                $repeats_tmp{$repeat}{$repeat_pos} = 1;
                            } else {
                                $repeats_tmp{$repeat} = {$repeat_pos => 1};
                            }
                        }
                        $repeat = '';
                    }
                }
            }
            if ($repeat ne '') {
                $repeat =~ s/^ //;
                $repeat_pos = $j - length($repeat);
                if (length($repeat) >= $min_longest_repeat_length) {
                    if (defined($repeats_tmp{$repeat})) {
                        $repeats_tmp{$repeat}{$repeat_pos} = 1;
                    } else {
                        $repeats_tmp{$repeat} = {$repeat_pos => 1};
                    }
                }
                $repeat = '';
            }
        }

        foreach (keys %repeats_tmp) {
            $$reps_ref{$_} = 1 + scalar keys %{$repeats_tmp{$_}};
        }

        # Output the matrix for diagnostics (three characters per column)
        print "\n";
        print '     ';
        for (my $i = 0; $i < $m_len; $i++) {
            print '  ' . $symbols[$i];
        }
        print "\n";
        print '     ';
        for (my $i = 0; $i < $m_len; $i++) {
            printf '%3d', $i;
        }
        print "\n";
        print "\n";
        for (my $i = 0; $i < $m_len; $i++) {
            print $symbols[$i];
            printf '%3d ', $i;
            for (my $j = 0; $j < $m_len; $j++) {
                my $value = '.';
                $value = '+' if (defined $matrix[$i][$j] && $matrix[$i][$j] == 1);
                printf('  %1s', $value);
            }
            print "\n";
        }
        print "\n";
    }

    2. Statistics of repeats

    We selected the threshold for the minimum repeat length that gave the maximum efficiency in our tests (I deliberately do not give its value). The distribution by number of repeats is as follows:

    Number of repeats    In spam, %    In not spam, %
    2                         78.58             90.28
    3                         11.93              4.86
    4                          4.45              2.08
    5                          2.30              1.39
    6                          1.93              0
    7                          0.22              0
    8                          0.37              0
    9                          0.07              0

    3. Conclusion

    I have shown an implementation of the naive algorithm for finding repeated substrings in a text. Both the number of repetitions and the repeated strings themselves (e.g., stop-words) can be used for analysis. I repeat that in the fight against spam, combined tests are more effective.

    Learn more about CleanTalk Anti-Spam.

     

  • Non-visual methods to protect the site from spam. Part 2. The true face of symbols

    Continuation of the article Non-visual methods to protect the site from spam

    Part 2: The true face of symbols

    Non-visual methods of protecting a website from spam use, in particular, analysis of the submitted text. Spammers use many techniques to complicate this analysis. Here I show examples of one of them, namely symbol substitution. The examples are taken from actual CleanTalk data.

    Symbol substitution is very simple, but as a result stop-word filters may fail to trigger, and Bayesian filters and language-detection filters may work worse. Therefore, before using these filters, it makes sense to return the symbols to their true face.

    Let me say at once that replacing symbols directly, for example replacing national characters that look like the Latin ‘a’ with the Latin ‘a’ itself, is totally unacceptable without analyzing the language and the context. Likewise, replacing letters that resemble zero with zero is possible only when you know exactly what you are looking for in the text (for example, telephone numbers).

    However, character replacement is permissible when the meaning of the written text is preserved after the change. Replacement is also needed to reduce certain sets of special symbols to a single one.

    Here I will show the two most interesting methods of symbol substitution we have encountered.

    1. Replacing symbols with a normal typeface

    Spammers do everything to make their text conspicuous even at a cursory glance. Fortunately for them, Unicode provides a set of extended typefaces of Latin characters. Fortunately for us, this is easily corrected.

    Below are the most common ways in which Latin characters are substituted with the same Latin letters taken from outside the basic Latin range.

    Replacing such characters with ordinary Latin ones comes down to a simple regular expression. After this change the subsequent filters work better and faster, because the input range is greatly narrowed.
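
    As an illustration of this kind of replacement, the sketch below folds one such range, the fullwidth Latin letters, back to basic Latin. It is an assumption-laden example rather than the filter described here: it uses a lookup table with strtr() instead of a regular expression, covers only the fullwidth block, needs the mbstring extension for mb_chr(), and the function name foldFullwidthLatin is hypothetical.

    ```php
    <?php

    // Hypothetical helper: maps fullwidth Latin letters (U+FF21..U+FF3A and
    // U+FF41..U+FF5A) back to the basic Latin range.
    function foldFullwidthLatin(string $text): string
    {
        $map = [];
        for ($i = 0; $i < 26; $i++) {
            $map[mb_chr(0xFF21 + $i, 'UTF-8')] = chr(0x41 + $i); // A-Z
            $map[mb_chr(0xFF41 + $i, 'UTF-8')] = chr(0x61 + $i); // a-z
        }

        // strtr() with an array replaces every key by its value in one pass
        return strtr($text, $map);
    }
    ```

    Other typeface ranges could be added to the same table in the same way, one loop per range.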

    2. Replacing the point

    The point is used far more widely than as a punctuation mark: it serves as a delimiter of fields and of digit positions in numbers, in spammers' phone numbers, and so on.

    So we faced the need to reduce the variety of spam points to a single one.

    The most common substitutes for the point that we encountered are shown below.

    Substitute, code    Substitute, view
    U+3002              。
    U+0701              ܁
    U+0702              ܂
    U+2024              ․
    U+FE12              ︒
    U+FE52              ﹒
    U+FF61              ｡

    The points can be replaced with a simple expression:

    tr/
    \N{U+3002}\N{U+0701}\N{U+0702}\N{U+2024}\N{U+FE12}\N{U+FE52}\N{U+FF61}
    /
    \N{U+002E}\N{U+002E}\N{U+002E}\N{U+002E}\N{U+002E}\N{U+002E}\N{U+002E}
    /
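
    For comparison, the same replacement can be sketched in PHP. This is an illustrative equivalent under the assumption that the input is UTF-8; the function name normalizePoints is hypothetical.

    ```php
    <?php

    // Hypothetical helper: replaces the point look-alikes listed above
    // with the ordinary full stop U+002E.
    function normalizePoints(string $text): string
    {
        $substitutes = [
            "\u{3002}", "\u{0701}", "\u{0702}", "\u{2024}",
            "\u{FE12}", "\u{FE52}", "\u{FF61}",
        ];

        return str_replace($substitutes, '.', $text);
    }
    ```

    Because UTF-8 sequences never overlap, a byte-level str_replace() is safe here and does not need multibyte functions.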

    We noticed that after the points are replaced, the subsequent filters work much more effectively.

    3. Conclusion

    I have presented two methods of symbol substitution. The inverse replacement is simple, has low system requirements, and greatly increases the accuracy of filters based on the analysis of words and expressions.
