RU2758358C2

RU2758358C2 - Method for generating signature for spam detection

Info

Publication number: RU2758358C2
Application number: RU2020108167A
Authority: RU
Inventors: Дмитрий Сергеевич Голубев; Роман Андреевич Деденок; Андрей Алексеевич БУТ
Original assignee: Акционерное общество "Лаборатория Касперского"
Priority date: 2020-02-26
Filing date: 2020-02-26
Publication date: 2021-10-28
Also published as: RU2020108167A3; RU2020108167A

Abstract

FIELD: computer technology.

SUBSTANCE: a method for generating a signature for spam detection comprises steps in which: a set of electronic messages to be checked for spam is formed from a stream of electronic messages; at least one spam detection feature is computed from the attribute values of each electronic message in the set; a decision tree for spam detection is built using all computed features; using the decision tree and the set of electronic messages, a list of spam detection features that were triggered more than once is identified; and a signature for spam detection is generated from the identified list of features.

EFFECT: ensuring information security in conditions of mass mailing of electronic messages.

14 cl, 4 dwg

Description

Technical field

The invention relates to the field of information security, and more specifically to systems and methods for creating signatures to combat spam.

State of the art

Online advertising is one of the cheapest forms of advertising. Spam, as the main and most widespread form of advertising in the modern world, accounts for 70% or more of total mail traffic.

Spam is the mass mailing of advertisements or other information to persons who have not expressed a desire to receive it. Spam includes messages sent by email, instant messaging protocols, social networks, blogs, dating sites, and forums, as well as SMS and MMS messages.

The constant growth in the volume of spam gives rise to technical, economic, and criminal problems. The load on equipment and data transmission channels, the time users spend processing messages, and the shift of message content toward fraud and theft all show the urgent need for a continuous fight against spam.

There are many ways to counter spam. One effective way is to use trained machine learning models to identify electronic messages that contain spam.

For example, publication US 8180834 B2 describes a system that periodically performs additional training of the classifiers used to detect spam. Additional training can be performed when errors are detected or at the user's initiative.

That solution classifies electronic messages using machine learning tools, but it does not effectively solve the problem of creating a signature for detecting spam in sets of messages sent by email.

Disclosure of the invention

The invention relates to systems and methods for creating signatures to combat spam.

The technical result of the present invention is to ensure information security in conditions of mass mailing of electronic messages. This technical result is achieved by generating a signature for spam detection based on a decision tree and a set of electronic messages.

In one embodiment, a method for generating a signature for spam detection is provided, comprising the steps of: forming a set of electronic messages; computing at least one spam detection feature based on the formed set of electronic messages; building a decision tree for spam detection using all computed features; and generating a signature for spam detection based on the decision tree and the formed set of electronic messages.

In another embodiment of the method, a spam detection feature is understood as a feature that is computed from the values of an electronic message attribute and characterizes the presence of spam.

In yet another embodiment of the method, a signature for spam detection is understood as a list of spam detection features and their values that are characteristic of an electronic message containing spam.

In another embodiment of the method, the set of electronic messages consists of a set of electronic messages for forming the decision tree, which contains at least two electronic messages, and a test set of electronic messages, which contains at least two electronic messages.

In another embodiment of the method, the set of electronic messages for forming the decision tree does not contain electronic messages from the test set of electronic messages.

In yet another embodiment of the method, the attribute values of each electronic message in the set of electronic messages for forming the decision tree are identified.

In another embodiment of the method, spam detection features are computed from the identified attribute values.

In yet another embodiment of the method, the decision tree for spam detection is applied to analyze each electronic message in the test set of electronic messages.

In another embodiment of the method, the results of the analysis are used to identify a list of spam detection features that were triggered more than once.

In yet another embodiment of the method, a signature for spam detection is generated from the identified list of spam detection features.

In another embodiment of the method, if there is no list of features triggered more than once, the decision tree for spam detection is retrained.

Brief description of the drawings

FIG. 1 shows the structure of a decision tree for spam detection and of a signature for spam detection.

FIG. 2 illustrates a block diagram of the system for generating a signature for spam detection.

FIG. 3 illustrates the operating algorithm of the system for generating a signature for spam detection.

FIG. 4 shows an example of a general-purpose computer system.

Although the invention may have various modifications and alternative forms, the characteristic features shown by way of example in the drawings will be described in detail. It should be understood, however, that the purpose of the description is not to limit the invention to a specific embodiment. On the contrary, the purpose of the description is to cover all changes and modifications falling within the scope of this invention, as defined by the appended claims.

Description of embodiments of the invention

The objects and features of the present invention, and the methods for achieving them, will become apparent by reference to exemplary embodiments. However, the present invention is not limited to the exemplary embodiments disclosed below and may be embodied in various forms. The substance set out in the description is nothing more than the specific details needed to help a person skilled in the art understand the invention thoroughly, and the present invention is defined within the scope of the appended claims.

We introduce a number of definitions and concepts that will be used in describing the embodiments of the invention.

Electronic mail (e-mail): a set of computer network services for sending messages between its users. It is a means of fast delivery of letters, program texts, documents, and other similar correspondence. When a message is transmitted by email, the transmitting and receiving computers do not necessarily interact with each other directly (Dorot V.L., Novikov F.A. Explanatory Dictionary of Modern Computer Vocabulary. 3rd ed., revised and expanded. St. Petersburg: BHV-Petersburg, 2004. 608 pp.: ill.).

Electronic letter or message: according to RFC 5322, a sequence of characters. Messages conforming to that specification consist of characters with decimal codes 1 through 127, interpreted as US-ASCII. A message consists of header fields (collectively called the header section of the message), which may be followed by the body of the message. The header section is a sequence of character lines whose syntax is described in that specification. The body of the message is a sequence of characters that follows the header section and is separated from it by an empty line (a line containing only CRLF). Hereinafter, an electronic message is understood to mean an electronic message transmitted by email.
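The header/body structure described above can be illustrated with a minimal sketch that uses Python's standard `email` package. The sample message and the helper name `split_message` are assumptions made for illustration and do not come from the patent.

```python
from email import message_from_string

# Hypothetical sample message: the header section and the body are
# separated by an empty line (a line containing only CRLF).
RAW = (
    "From: sender@example.com\r\n"
    "To: user@example.org\r\n"
    "Subject: Hello\r\n"
    "Message-ID: <1234@example.com>\r\n"
    "\r\n"
    "Body text.\r\n"
)


def split_message(raw):
    """Return (list of (name, value) header fields, body string)."""
    msg = message_from_string(raw)
    return msg.items(), msg.get_payload()


headers, body = split_message(RAW)
```

Here `headers` holds the four header fields in their original order and `body` holds everything after the empty line.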

Electronic message attribute: a necessary, essential, inherent property of an electronic message.

Spam detection feature: a feature that is computed from the value of an electronic message attribute, characterizes the presence of spam, and is applied when machine learning technologies are used.

Message stream: a collection of electronic messages transmitted by email and received by one or more users. Set of electronic messages: a fixed number of messages selected from the message stream in order to check them for spam.

A classic virus signature is a contiguous sequence of bytes characteristic of a particular malicious application. A signature for spam detection is a list of spam detection features and their specific values that are characteristic of a particular electronic message containing spam.

Decision tree: a data analysis method for building classification and regression models; it is both a data extraction method and a data representation method. A decision tree is a way of representing rules in a hierarchical, sequential structure in which each object corresponds to a single node that gives a decision.

The creator of a mass mailing of electronic messages containing spam usually uses a template to generate the text and content of those messages. To make spam detection harder, the creator may also use a variety of tools, such as obfuscation and anonymization techniques. Decision trees can be built to reveal the templates and similar algorithms used to create messages containing spam. FIG. 1 illustrates the structure of a decision tree for spam detection and of a signature for spam detection. The decision tree 110 is used to classify groups of electronic messages. Computed features serve as the nodes of the tree. Specific values or ranges of feature values can serve as the transitions between nodes. A signature 120 can be formed from several nodes and transitions.

A signature for spam detection is generated by a system for generating a signature for spam detection. FIG. 2 shows a block diagram of that system, which includes a set of electronic messages 210, recognition means 220, calculation means 230, and generation means 240.

The recognition means 220 is designed to form a set of electronic messages and to pass the formed set to the calculation means 230. The set of electronic messages 210 consists of a set of electronic messages for forming the decision tree and a test set of electronic messages. The set for forming the decision tree contains at least two electronic messages and is used when the decision tree is built or modified. The test set contains at least two electronic messages and is used when signatures for spam detection are built. The set for forming the decision tree does not contain any electronic messages from the test set.
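The constraints on the two subsets (disjoint, at least two messages each) can be sketched as follows. The function name `split_message_set` and the positional split are illustrative assumptions; the patent does not say how messages are assigned to each subset.

```python
def split_message_set(messages, n_tree):
    """Split a message set into (tree_set, test_set).

    The first n_tree messages go to the set used to build the decision
    tree, the rest to the test set, so the subsets never overlap.
    """
    tree_set, test_set = messages[:n_tree], messages[n_tree:]
    if len(tree_set) < 2 or len(test_set) < 2:
        raise ValueError("each subset needs at least two messages")
    return tree_set, test_set


tree_set, test_set = split_message_set(["m1", "m2", "m3", "m4"], n_tree=2)
```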

In one embodiment, the set of electronic messages 210 is formed by adding messages received by different users over a given period of time. In another embodiment, the set 210 is formed by adding messages received by a single user over a certain period of time. In yet another embodiment, the set 210 is formed by adding a fixed number of messages. The optimal size of the set 210 depends on how frequently messages are received. The size is determined empirically by gradually increasing, for example, the time interval over which the set 210 is formed.
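Two of the selection strategies described above, a time window and a fixed count, might look like the sketch below. The stream representation as `(received_at, message)` pairs and both function names are assumptions for illustration.

```python
from datetime import datetime, timedelta


def select_by_window(stream, start, window):
    """Keep messages received in the half-open interval [start, start + window)."""
    end = start + window
    return [msg for received_at, msg in stream if start <= received_at < end]


def select_fixed_count(stream, count):
    """Keep the first `count` messages of the stream."""
    return [msg for _, msg in stream[:count]]


# Hypothetical stream: one message every 10 minutes.
t0 = datetime(2020, 2, 26, 12, 0)
stream = [(t0 + timedelta(minutes=10 * i), f"msg{i}") for i in range(6)]

window_set = select_by_window(stream, t0, timedelta(minutes=30))
fixed_set = select_fixed_count(stream, 4)
```

Widening `window` step by step and re-running the rest of the pipeline is one way to carry out the empirical size search that the text mentions.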

The calculation means 230 is designed to compute spam detection features from the formed set of electronic messages 210, to build a decision tree for spam detection using the computed features, and to pass data about the decision tree to the generation means 240.

In one embodiment, spam detection features are computed from the attributes of an electronic message. The attributes of an electronic message are determined by analyzing the process of sending and receiving the message. Examples of electronic message attributes are: the sender's IP address, the size of the message, the language of the message text, the number of characters in the message header, and so on.

Examples of computed spam detection features are: the presence of a dynamic PTR record for the IP address (a PTR, or pointer, record maps a host's IP address to its canonical name); a checksum of the HTML markup excluding variable attributes; msgid, the unique message identifier; msgid_type, determined heuristically from the appearance of the header; the msgid agent that sent the message; a checksum of the sequence of MIME headers; the content type of the message; and so on.
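One of the listed features, a checksum of the header sequence, could be sketched as below. The patent does not define the checksum or exactly which headers are included, so hashing the ordered, lower-cased names of all header fields with SHA-256 is purely an assumption, as is the function name.

```python
import hashlib
from email import message_from_string


def header_sequence_checksum(raw):
    """Hash the ordered header field names, ignoring their values."""
    msg = message_from_string(raw)
    names = [name.lower() for name in msg.keys()]
    return hashlib.sha256("\n".join(names).encode("utf-8")).hexdigest()


# Hypothetical messages: same header order, different values.
msg_a = "From: a@example.com\r\nSubject: x\r\nContent-Type: text/plain\r\n\r\nhi\r\n"
msg_b = "From: b@example.com\r\nSubject: y\r\nContent-Type: text/html\r\n\r\nyo\r\n"

same_template = header_sequence_checksum(msg_a) == header_sequence_checksum(msg_b)
```

Because only the order of header names is hashed, messages generated from one template tend to share the checksum even when addresses and subjects vary.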

Each computed feature is empirically assigned a weight w_i, calculated from predetermined statistical data. The weight value characterizes, to a greater or lesser degree, the presence of spam. After the features are computed, a decision tree for spam detection is built. For example, the root of the tree may be the feature mailer_name (the application used to send the electronic message), which can take three values. If it takes value 1, a transition is made to the feature max_url_length (the maximum URL length in the message); if it takes value 2, a transition is made to the feature msgid_type, and so on.
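The tree described above, with features as nodes and feature values as transitions, can be represented with a small data structure. In this sketch only `mailer_name`, `max_url_length`, `msgid_type`, and the transitions for values 1 and 2 come from the text; the bucket labels, the `msgid_type` values, and the group names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Union


@dataclass
class Node:
    feature: str  # feature tested at this node
    # Maps a feature value to the next node or to a leaf (group label).
    children: Dict[object, Union["Node", str]] = field(default_factory=dict)


def classify(node, features):
    """Follow transitions from the root; return the group label or None."""
    while isinstance(node, Node):
        node = node.children.get(features.get(node.feature))
    return node


tree = Node("mailer_name", {
    1: Node("max_url_length", {"short": "group A", "long": "group B"}),
    2: Node("msgid_type", {"known": "group C", "unknown": "group D"}),
    3: "group E",
})

group = classify(tree, {"mailer_name": 1, "max_url_length": "long"})
```

A message whose feature value has no matching transition yields `None`, which a caller can treat as unclassified.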

The generation means 240 is designed to generate a signature for spam detection based on the decision tree and the formed set of electronic messages.

The decision tree is used to analyze the electronic messages of the test set from the set of electronic messages 210. During the analysis, the attributes of each message are determined, spam detection features are computed, and the decision tree for spam detection is applied. After the tree is applied, the electronic message falls into a group that has a total weight computed from the weights of the computed features by the formula W_sum = Σ w_i. A threshold value for the total weight of a group is determined empirically. Groups of electronic messages whose total weight is above the threshold are considered to contain spam, and groups whose total weight is below it are considered not to contain spam. The signature for spam detection is generated by identifying the list of spam detection features that were triggered more than once in a group of electronic messages containing spam.
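The grouping, thresholding, and "triggered more than once" steps can be sketched for a single group as follows. The feature names, weights, threshold of 0.5, and function names are hypothetical, and the patent's signature also records feature values, which this sketch omits because every feature here is a simple yes/no flag.

```python
from collections import Counter

# Hypothetical feature weights w_i.
WEIGHTS = {"f1": 0.05, "f2": 0.1, "f3": 0.3, "f4": 0.1, "f5": 0.2}


def total_weight(path_features, weights):
    """W_sum: sum of the weights of the features on the path to the group."""
    return sum(weights[name] for name in path_features)


def signature_for_group(path_features, fired_per_message, weights, threshold):
    """Return features triggered more than once in a spam group, else None."""
    if total_weight(path_features, weights) <= threshold:
        return None  # group is treated as non-spam
    counts = Counter(name for fired in fired_per_message for name in set(fired))
    repeated = sorted(name for name, n in counts.items() if n > 1)
    return repeated or None


signature = signature_for_group(
    ["f1", "f2", "f4", "f3"],                      # path to the group
    [{"f1", "f2", "f3", "f4"}, {"f1", "f3", "f5"}],  # features fired per message
    WEIGHTS,
    threshold=0.5,
)
```

A `None` result for every group corresponds to the case in which no repeated features are found.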

If no list of spam detection features triggered more than once was identified in any of the groups, retraining of the decision tree for spam detection is started. During retraining, the decision tree is modified using machine learning technologies, for example gradient boosting, in which alternative variants of the tree with different depth and width are formed and their efficiency, reliability, and false-positive metrics are computed.

Using signatures makes it possible to process a large number of electronic messages much faster than analyzing them with the entire decision tree. The signature is included in a set of updates for the spam filter, which is subsequently used to check for spam in order to ensure the information security of transmitting electronic messages by email.
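Checking a message against a signature reduces to comparing a short list of feature values, which is why it can be cheaper than walking the whole tree. A minimal sketch follows, in which the signature contents and the function name `matches` are hypothetical.

```python
def matches(signature, features):
    """True if every (feature, value) pair of the signature is present."""
    return all(features.get(name) == value for name, value in signature.items())


# Hypothetical signature: features and the values characteristic of spam.
SIGNATURE = {"unknown_ip": "yes", "hidden_text": "yes"}

is_spam = matches(
    SIGNATURE,
    {"unknown_ip": "yes", "hidden_text": "yes", "small_size": "no"},
)
```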

For example, a user received four messages by email. The first and second messages are placed in the set of electronic messages for forming the decision tree. Identified attributes of the first message:

• message text: "BUY FLOWERS", with random white-colored characters used in place of spaces;

• IP address: 191.157.1.1;

• message size: 1 KB;

• attached file: 200 KB, 32.jpg.

Identified attributes of the second message:

• IP address: 181.147.2.2;

• message size: 1.5 KB;

• attached file: 300 KB, 32.bmp.

Features computed from the attributes of the first message:

- Feature 1 (unknown IP address) - yes, weight 0.05;

- Feature 2 (message size < 10 KB) - yes, weight 0.1;

- Feature 3 (hidden text) - yes, weight 0.3;

- Feature 4 (file in a graphic format) - yes, weight 0.1.

Features computed from the attributes of the second message:

One possible decision tree for spam detection, formed from the set of electronic messages for forming the decision tree, is the following. The root is feature 1 with branches "yes" and "no". If feature 1 has the value "yes", the tree proceeds to feature 2 with branches "yes" and "no". If feature 2 has the value "yes", the tree proceeds to feature 4 with branches "yes" and "no". If feature 4 has the value "yes", the tree proceeds to feature 3 with branches "yes" and "no". If feature 3 has the value "yes", the electronic message falls into group 1; if feature 3 has the value "no", the electronic message falls into group 2, and so on. The total weight of group 1 is 0.55 (0.05 + 0.1 + 0.1 + 0.3). The total weight of group 2 is 0.25 (0.05 + 0.1 + 0.1).
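A minimal sketch of this tree, assuming the group weight is the sum of the weights of the features with the value "yes" along the path, might look as follows. Only the branch spelled out in the text is modelled; the branches covered by "and so on" are not described, so the sketch returns `None` for them. The function names are hypothetical.

```python
# Weights from the worked example.
WEIGHTS = {1: 0.05, 2: 0.1, 3: 0.3, 4: 0.1}


def classify(features):
    """Walk the path feature 1 -> feature 2 -> feature 4 -> feature 3.

    Returns 1 or 2 for the two groups named in the text, or None for
    branches that the description leaves unspecified.
    """
    for feature_id in (1, 2, 4):
        if not features[feature_id]:
            return None
    return 1 if features[3] else 2


def group_weight(features):
    """Sum the weights of all features that are present (assumed rule)."""
    total = sum(WEIGHTS[f] for f, present in features.items() if present)
    return round(total, 2)


all_yes = {1: True, 2: True, 3: True, 4: True}
no_hidden_text = {1: True, 2: True, 3: False, 4: True}

print(classify(all_yes), group_weight(all_yes))
print(classify(no_hidden_text), group_weight(no_hidden_text))
```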

A test set of electronic messages is used to apply the generated decision tree. The third and fourth messages are placed in the test set of electronic messages. Identified attributes of the third message:

• IP address: 193.153.1.1;

• message size: 1.7 KB;

• attached file: 250 KB, 32.png.

Identified attributes of the fourth message:

• message text: "Happy Birthday!!!", with no white-colored characters;

• IP address: 192.161.7.2;

• message size: 0.5 KB;

• attached file: 250 KB, postcard.jpg.

Features computed from the attributes of the third message:

- Feature 1 (unknown IP address) - yes;

- Feature 2 (message size < 10 KB) - yes;

- Feature 3 (hidden text) - yes;

- Feature 4 (graphic format file) - yes.

Features computed from the attributes of the fourth message:

- Feature 3 (hidden text) - no;

As a result of applying the generated decision tree, the third message falls into group 1 and the fourth message falls into group 2. The limit value of the total weight is set to 0.5. Since the total weight of group 1 (0.55) exceeds this limit, the third message is an electronic message containing spam. A list of features that have been triggered more than once and have a weight greater than 0.08 is identified. A signature is formed in which the list of features consists of feature 2, linked to feature 4 through the value "yes"; feature 4, linked to feature 3 through the value "yes"; and feature 3, which has the value "yes".
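The selection described above can be sketched as follows. The weights, the limit of 0.5, and the cutoff of 0.08 come from the example; the strict comparisons and the helper names are assumptions. Only the first and third messages are used, because they are the only ones whose features are fully listed in the example.

```python
from collections import Counter

WEIGHTS = {1: 0.05, 2: 0.1, 3: 0.3, 4: 0.1}  # weights from the example
WEIGHT_LIMIT = 0.5                           # limit value of the total weight
MIN_FEATURE_WEIGHT = 0.08                    # cutoff for features kept in the signature
TREE_ORDER = (1, 2, 4, 3)                    # order of the features along the tree path

# Features of the first and third messages, as listed in the example.
first = {1: True, 2: True, 3: True, 4: True}
third = {1: True, 2: True, 3: True, 4: True}


def is_spam(features):
    """Compare the summed weight of present features with the limit (assumed strict)."""
    total = sum(WEIGHTS[f] for f, present in features.items() if present)
    return total > WEIGHT_LIMIT


def build_signature(feature_sets):
    """Keep features triggered more than once whose weight exceeds the cutoff."""
    counts = Counter(
        f for features in feature_sets for f, present in features.items() if present
    )
    return [
        (f, "yes")
        for f in TREE_ORDER
        if counts[f] > 1 and WEIGHTS[f] > MIN_FEATURE_WEIGHT
    ]


signature = build_signature([first, third])
print(is_spam(third), signature)
```

Feature 1 is triggered twice but is dropped because its weight of 0.05 is below the cutoff, which leaves features 2, 4 and 3 in tree order.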

FIG. 3 illustrates the algorithm for generating a signature for spam detection. At step 311, the recognition means 220 forms a set of electronic messages and passes the formed set to the calculation means 230. At step 312, the calculation means 230 computes features for spam detection based on the formed set of electronic messages. At step 313, the calculation means 230 forms a decision tree for spam detection using the computed features and passes data about the formed decision tree to the generation means 240. At step 314, the generation means 240 generates a signature for spam detection based on the formed decision tree and the formed set of electronic messages.

FIG. 4 shows an example of a general-purpose computer system, a personal computer or server 20, comprising a central processing unit 21, a system memory 22, and a system bus 23 that connects various system components, including the memory associated with the central processing unit 21. The system bus 23 is implemented as any bus structure known from the prior art, containing in turn a bus memory or bus memory controller, a peripheral bus, and a local bus capable of interfacing with any other bus architecture. The system memory contains read-only memory (ROM) 24 and random access memory (RAM) 25. The basic input/output system (BIOS) 26 contains the basic procedures that transfer information between the elements of the personal computer 20, for example, when the operating system is loaded using ROM 24.

The personal computer 20 in turn contains a hard disk 27 for reading and writing data, a magnetic disk drive 28 for reading from and writing to removable magnetic disks 29, and an optical drive 30 for reading from and writing to removable optical disks 31, such as CD-ROM, DVD-ROM, and other optical media. The hard disk 27, the magnetic disk drive 28, and the optical drive 30 are connected to the system bus 23 via the hard disk interface 32, the magnetic disk interface 33, and the optical drive interface 34, respectively. The drives and the corresponding computer storage media are non-volatile means of storing computer instructions, data structures, program modules, and other data of the personal computer 20.

The present description discloses an implementation of a system that uses a hard disk 27, a removable magnetic disk 29, and a removable optical disk 31, but it should be understood that other types of computer storage media 56 capable of storing data in a computer-readable form may be used (solid-state drives, flash memory cards, digital disks, random access memory (RAM), etc.), which are connected to the system bus 23 via the controller 55.

The computer 20 has a file system 36, which stores the recorded operating system 35 as well as additional software applications 37, other program modules 38, and program data 39. The user can enter commands and information into the personal computer 20 through input devices (keyboard 40, mouse 42). Other input devices (not shown) may be used: a microphone, joystick, game console, scanner, etc. Such input devices are conventionally connected to the computer system 20 through a serial port 46, which in turn is connected to the system bus, but they may be connected in another way, for example, through a parallel port, a game port, or a universal serial bus (USB). A monitor 47 or another type of display device is also connected to the system bus 23 via an interface such as a video adapter 48. In addition to the monitor 47, the personal computer may be equipped with other peripheral output devices (not shown), such as speakers, a printer, etc.

The personal computer 20 is capable of operating in a networked environment, using a network connection to one or more remote computers 49. The remote computer (or computers) 49 are similar personal computers or servers that have most or all of the elements mentioned above in the description of the personal computer 20 shown in FIG. 4. Other devices may also be present in the computer network, such as routers, network stations, peer-to-peer devices, or other network nodes.

Network connections can form a local area network (LAN) 50 and a wide area network (WAN). Such networks are used in corporate computer networks and internal company networks and, as a rule, have access to the Internet. In LAN or WAN networks, the personal computer 20 is connected to the local area network 50 via a network adapter or network interface 51. When networks are used, the personal computer 20 may use a modem 54 or other means of communicating with a wide area network, such as the Internet. The modem 54, which is an internal or external device, is connected to the system bus 23 via the serial port 46. It should be noted that the network connections are only exemplary and need not reflect the exact network configuration, i.e. in reality there are other ways of establishing a connection between one computer and another by technical means of communication.

In conclusion, it should be noted that the information given in the description consists of examples that do not limit the scope of the present invention as defined by the claims.

Claims

1. A method for generating a signature for spam detection, comprising the steps of:

a) forming, from a stream of electronic messages, a set of electronic messages to be checked for spam;

b) calculating at least one feature for detecting spam based on the attribute values of each electronic message from the formed set of electronic messages;

c) forming a decision tree for spam detection using all calculated features;

d) identifying, using the decision tree and the formed set of electronic messages, a list of calculated features for detecting spam that have been triggered more than once;

e) generating a signature for detecting spam based on the identified list of features.

2. The method according to claim 1, according to which a feature for detecting spam is understood as a feature calculated based on the values of an attribute of an electronic message that characterize the presence of spam.

3. The method according to claim 1, according to which a signature for spam detection is understood as a list of features for detecting spam and their values characteristic of an electronic message containing spam.

4. The method according to claim 1, wherein the set of electronic messages consists of a set of electronic messages for forming a decision tree, which contains at least two electronic messages, and a test set of electronic messages, which contains at least two electronic messages.

5. The method according to claim 4, wherein the set of electronic messages for forming the decision tree does not contain electronic messages from the test set of electronic messages.

6. The method according to claim 5, according to which the attribute values of each electronic message from the set of electronic messages for forming the decision tree are identified.

7. The method according to claim 6, in which features for detecting spam are calculated based on the detected attribute values.

8. The method according to claim 1, wherein the generated spam detection decision tree is used to analyze each electronic message from the test set of messages.

9. The method according to claim 8, according to which, based on the analysis results, a list of features for detecting spam that have been triggered more than once is identified.

10. The method according to claim 9, according to which a signature for detecting spam is generated based on the identified list of features for detecting spam.

11. The method according to claim 9, according to which, in the absence of a list of features that have been triggered more than once, the generated decision tree for spam detection is retrained.

12. The method according to claim 4, according to which the formation of a set of electronic messages is performed by adding messages received by different users for a certain period of time.

13. The method according to claim 4, according to which the formation of a set of electronic messages is performed by adding messages received by one user for a certain period of time.

14. The method according to claim 4, according to which the formation of the set of electronic messages is performed by adding a fixed number of messages.