Continuation of the article Non-visual methods to protect the site from spam
Part 3: Repeats of substrings
As mentioned above, non-visual methods for site protection against spam using text analysis. One of the most common spam signals – is the presence of repeated strings. As always, these examples are taken from actual company data CleanTalk.
The search of such repeats must be minimally resource-intensive. Better if it will be called after the test from the first and second parts of the article that will be eliminated obvious spam and bring the text into a form suitable for analysis. Here I will give some statistics, as well as sample code.
- The sample of the code
We use a function of determining the longest repeated substrings made by naive algorithm described here http://algolist.manual.ru/search/lrs/naive.php
Example output is shown below.
s a l e f o r s a l e f o r s a l e
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21
s 0 + . . . . . . . . + . . . . . . . . + . . .
a 1 . + . . . . . . . . + . . . . . . . . + . .
l 2 . . + . . . . . . . . + . . . . . . . . + .
e 3 . . . + . . . . . . . . + . . . . . . . . +
4 . . . . + . . . + . . . . + . . . + . . . .
f 5 . . . . . + . . . . . . . . + . . . . . . .
o 6 . . . . . . + . . . . . . . . + . . . . . .
r 7 . . . . . . . + . . . . . . . . + . . . . .
8 . . . . . . . . + . . . . + . . . + . . . .
s 9 . . . . . . . . . + . . . . . . . . + . . .
a 10 . . . . . . . . . . + . . . . . . . . + . .
l 11 . . . . . . . . . . . + . . . . . . . . + .
e 12 . . . . . . . . . . . . + . . . . . . . . +
13 . . . . . . . . . . . . . + . . . + . . . .
f 14 . . . . . . . . . . . . . . + . . . . . . .
o 15 . . . . . . . . . . . . . . . + . . . . . .
r 16 . . . . . . . . . . . . . . . . + . . . . .
17 . . . . . . . . . . . . . . . . . + . . . .
s 18 . . . . . . . . . . . . . . . . . . + . . .
a 19 . . . . . . . . . . . . . . . . . . . + . .
l 20 . . . . . . . . . . . . . . . . . . . . + .
e 21 . . . . . . . . . . . . . . . . . . . . . +
$VAR1 = {
'sale' => 3,
'for sale' => 2
};
And here is the function in Perl with minimal changes. For convenience, here is the full text that displays the matrix above.
#!/usr/bin/perl -w
use strict;
use utf8;
use Data::Dumper;
binmode(STDOUT, ':utf8');
my $min_longest_repeat_length = 4;
my $message = 'sale for sale for sale';
my %longest_repeates = ();
get_longest_repeates(\$message, \%longest_repeates);
print Dumper(\%longest_repeates);
sub get_longest_repeates {
my $test_ref = shift; # Link to text for analysis
my $reps_ref = shift; # Link to a hash of the result
my @symbols = split //, $$test_ref;
my $m_len = scalar @symbols;
my @matrix = (); # A square matrix of symbols matches
# Filling the matrix to the right of the main diagonal
for (my $i = 0; $i < $m_len; $i++) { # Strings
$matrix[$i] = [];
for (my $j = $i; $j < $m_len; $j++) { # Columns only to the right of the main diagonal $matrix[$i][$j] = 1 if $symbols[$i] eq $symbols[$j]; } } # Analysis of the diagonal of the matrix to the right of the main diagonal and filling results my %repeats_tmp = (); # Hash of repeats my ($i, $j); # Search diagonal from right to left, ie from short to long repeats for ($i = $m_len - 1; $i > 0; $i--) {
my $repeat = '';
my $repeat_pos = undef;
my $repeat_temp;
for ($j = $i; $j < $m_len; $j++) { if (defined($matrix[$j-$i][$j]) && $matrix[$j-$i][$j] == 1) { $repeat_temp = $repeat; $repeat_temp =~ s/^ //; # If the received string of repeat is already in the hash of repeats if (defined($repeats_tmp{$repeat_temp})) { $repeat_pos = $j - length($repeat_temp); $repeats_tmp{$repeat_temp}{$repeat_pos} = 1; $repeat = $symbols[$j]; } else { $repeat .= $symbols[$j]; } } else { if ($repeat ne '') { $repeat =~ s/^ //; $repeat_pos = $j - length($repeat); if (length($repeat) >= $min_longest_repeat_length) {
if (defined($repeats_tmp{$repeat})) {
$repeats_tmp{$repeat}{$repeat_pos} = 1;
} else {
$repeats_tmp{$repeat} = {$repeat_pos => 1};
}
}
$repeat = '';
}
}
}
if ($repeat ne '') {
$repeat =~ s/^ //;
$repeat_pos = $j - length($repeat);
if (length($repeat) >= $min_longest_repeat_length) {
if (defined($repeats_tmp{$repeat})) {
$repeats_tmp{$repeat}{$repeat_pos} = 1;
} else {
$repeats_tmp{$repeat} = {$repeat_pos => 1};
}
}
$repeat = '';
}
}
foreach (keys %repeats_tmp){
$$reps_ref{$_} = 1 + scalar keys %{$repeats_tmp{$_}};
}
# Output matrix for diagnostics
print "\n";
print ' ';
for (my $i = 0; $i < $m_len; $i++) {
print ' ' . $symbols[$i];
}
print "\n";
print ' ';
for (my $i = 0; $i < $m_len; $i++) {
printf '%3d', $i;
}
print "\n";
print "\n";
for (my $i = 0; $i < $m_len; $i++) {
print $symbols[$i];
printf '%3d ', $i;
for (my $j = 0; $j < $m_len; $j++) {
my $value = '.';
$value = '+' if (defined $matrix[$i][$j] && $matrix[$i][$j] == 1);
printf(' %1s', $value);
}
print "\n";
}
print "\n";
}
2. Statistics of repeats
We have selected the threshold of the minimum repeat length (it I do not give specifically), which gave the maximum efficiency in the tests. The results on the number of repeats as follows:
The number of repeats | In spam, % | In not spam, % |
---|---|---|
2 | 78,58 | 90,28 |
3 | 11,93 | 4,86 |
4 | 4,45 | 2,08 |
5 | 2,30 | 1,39 |
6 | 1,93 | 0 |
7 | 0,22 | 0 |
8 | 0,37 | 0 |
9 | 0,07 | 0 |
3. Conclusion
I showed an implementation of the naive algorithm of search of repeating substring in the text. For the analysis can be used as the number of repetitions, and repetitions (e.g., stop-word). I repeat that in the fight against spam integrated tests are more effective.
Learn more about CleanTalk Anti-Spam.