Continuation of the article Non-visual methods to protect the site from spam
Part 2: The true face of symbols
Non-visual methods of protecting a website from spam rely, among other things, on analyzing the submitted text. Spammers use many techniques to complicate that analysis. This part shows examples of one of them: symbol substitution. The examples are taken from real CleanTalk data.
Symbol substitution is very simple, but it can defeat stop-word filters and degrade both Bayesian filters and filters that rely on language detection. So before applying these filters, it makes sense to give the symbols back their true face.
To be clear from the start: replacing symbols directly, for example turning a national letter such as a Latin 'a' with a diacritic mark into a plain Latin 'a', is unacceptable without analyzing the language and context. Likewise, replacing letters that look like zero with zero is possible only when you know exactly what you are looking for in the text (for example, telephone numbers).
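The point about context can be sketched in Python. This is a hypothetical heuristic, not CleanTalk's filter: the function name `fix_zeros_in_phones`, the pattern, and the `min_digits` threshold are all assumptions made for illustration. The letters O and o become zero only inside a token that already looks like a phone number, so ordinary words are left alone.

```python
import re

# Hypothetical heuristic (an assumption, not from the article): a candidate is
# a run of digits, zero look-alike letters and common phone separators that is
# not glued to other Latin letters on either side.
_CANDIDATE = re.compile(r"(?<![A-Za-z])[0-9Oo][0-9Oo ()\-]{5,}[0-9Oo](?![A-Za-z])")


def fix_zeros_in_phones(text, min_digits=4):
    """Replace O/o with 0 only inside tokens that look like phone numbers."""

    def repl(match):
        token = match.group()
        # Require enough real digits before treating the token as a number.
        if sum(ch.isdigit() for ch in token) < min_digits:
            return token
        return token.replace("O", "0").replace("o", "0")

    return _CANDIDATE.sub(repl, text)


print(fix_zeros_in_phones("Call 8-8OO-555-35-35 now"))  # Call 8-800-555-35-35 now
```

Plain words such as "good food" never match, because every "o" in them sits next to another letter.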
However, character replacement is permitted when the meaning of the written text is preserved after the change. It is also needed to reduce certain sets of special symbols to a single one.
Here I will show the two most interesting kinds of symbol substitution we have encountered.
-
Replacing symbols with a normal typeface
Spammers do everything they can to make their text conspicuous, even at a cursory glance. Fortunately for them, Unicode provides sets of Latin characters in extended typefaces. Fortunately for us, this is easily corrected.
The most common method is to substitute Latin characters with the same Latin letters taken from outside the main range of the Latin alphabet.
Converting such characters back to ordinary Latin takes a simple regular expression. After this replacement the subsequent filters work better and faster, because the range of input characters is greatly narrowed.
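As an illustration of such a replacement, here is a minimal Python sketch. The two families it handles (fullwidth and mathematical bold Latin letters) are assumptions chosen for the example, not a list taken from CleanTalk data, and the name `restore_latin` is hypothetical.

```python
import re

# Assumed example ranges (not from the article). Each entry maps the first
# code point of a contiguous 26-letter block to its plain Latin base letter.
_RANGES = [
    (0xFF21, ord("A")),   # FULLWIDTH LATIN CAPITAL LETTER A..Z
    (0xFF41, ord("a")),   # FULLWIDTH LATIN SMALL LETTER A..Z
    (0x1D400, ord("A")),  # MATHEMATICAL BOLD CAPITAL A..Z
    (0x1D41A, ord("a")),  # MATHEMATICAL BOLD SMALL A..Z
]

# Code point of the styled letter -> code point of the plain letter.
_TABLE = {start + i: base + i for start, base in _RANGES for i in range(26)}

# One character class that matches any of the styled letters.
_PATTERN = re.compile("[" + "".join(chr(cp) for cp in _TABLE) + "]")


def restore_latin(text):
    """Replace styled Latin letters with their plain Latin counterparts."""
    return _PATTERN.sub(lambda m: chr(_TABLE[ord(m.group())]), text)


print(restore_latin("\uFF22\uFF55\uFF59 now"))  # Buy now
```

A real filter would need more ranges, and some Unicode letter blocks have gaps, so a fixed 26-letter offset does not fit all of them.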
-
Replacing the point
The point is used far more widely than as a punctuation mark: it separates fields and positions, and it serves as a delimiter in numbers, in spam phone numbers, and so on.
So we need to reduce the variety of spam points to a single one.
The most common point substitutes we have encountered are shown below.
Substitute, code | Substitute, view |
U+3002 | 。 |
U+0701 | ܁ |
U+0702 | ܂ |
U+2024 | ․ |
U+FE12 | ︒ |
U+FE52 | ﹒ |
U+FF61 | ｡ |
The points can be replaced with a simple expression:
tr/
\N{U+3002}\N{U+0
/
\N{U+002E}\N{U+0
/
We have observed that after the points are replaced, the subsequent filters work noticeably more effectively.
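The same replacement can be sketched in Python using only the seven code points listed in the table above. The names `DOT_SUBSTITUTES` and `normalize_dots` are hypothetical, and this is one possible approach rather than the expression used in production.

```python
# Code points taken from the table of point substitutes above.
DOT_SUBSTITUTES = "\u3002\u0701\u0702\u2024\uFE12\uFE52\uFF61"

# Map every substitute to the ordinary full stop, U+002E.
_DOT_TABLE = str.maketrans(DOT_SUBSTITUTES, "." * len(DOT_SUBSTITUTES))


def normalize_dots(text):
    """Reduce the listed point look-alikes to a plain '.'."""
    return text.translate(_DOT_TABLE)


print(normalize_dots("8\u2024800\uFE52555"))  # 8.800.555
```

A translation table does the whole job in one pass over the text, which fits the low system requirements that matter for a step run before every other filter.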
-
Conclusion
I have presented two kinds of symbol substitution. Reversing them is simple, has low system requirements, and greatly increases the accuracy of filters based on the analysis of words and expressions.
Learn more about CleanTalk Anti-Spam.