RU2614561C1

RU2614561C1 - System and method for determining similar files

Info

Publication number: RU2614561C1
Application number: RU2015154379A
Authority: RU
Inventors: Алексей Евгеньевич Антонов; Алексей Михайлович Романенко
Original assignee: Закрытое акционерное общество "Лаборатория Касперского"
Priority date: 2015-12-18
Filing date: 2015-12-18
Publication date: 2017-03-28

Abstract

FIELD: information technology.

SUBSTANCE: a method for determining similar files, in which sets of mutable and immutable file characteristics are defined. A file characteristic is considered mutable if it takes different values across a plurality of similar files, and immutable if it takes the same value across a plurality of similar files. A plurality of characteristics is extracted from at least one file; the extracted characteristics are divided into at least two subsets: a subset of mutable characteristics and a subset of immutable characteristics; a convolution of each of these subsets is formed; the file convolution is formed as a combination of the convolutions of each of these subsets; the convolution of at least one file is compared with a set of previously created file convolutions; the file is recognized as similar to the files from a plurality of similar files having the same convolution if, upon comparison, the convolution of the file matches the convolution of a file from that plurality.

EFFECT: finding of similar files.

2 cl, 5 dwg

Description

Technical field

The invention relates to the field of data processing, and in particular to systems and methods for determining similar files.

State of the art

At present, along with the widespread adoption of computers and similar devices, such as mobile phones, the number of computer threats is growing. Computer threats are generally understood to be objects capable of causing harm to a computer or its user, for example network worms, keyloggers, and computer viruses.

Various antivirus technologies are used to protect users and their personal computers from possible computer threats. Antivirus software may include various modules for detecting computer threats; particular cases of such modules are signature-based and heuristic detection modules. However, there are situations in which these modules are not effective: modern malware is often created using techniques that counteract software emulation (for example, the use of undocumented functions, or analysis of the results of function execution, including checking whether certain processor flags are set after a function executes or analyzing returned error codes), as well as polymorphic packers. Packing an executable file here means compressing the file and adding to its body a sequence of instructions for unpacking it. When a polymorphic packer is used, a single malicious file serves as the basis for generating many similar files that are identical in functionality but differ in structure. Signature-based detection is then of limited effectiveness, since each file in the set of polymorphic copies would require its own unique signature.

There is an alternative method of detecting malware by detecting similar malicious files. In this method, the analysis of a file considers various file metadata, which may vary depending on the file type; for an executable file, for example, it may include the positions of the file's sections, the section sizes or other information from the executable file header, and string data extracted from the file. Derived data may also be treated as metadata: information entropy (a measure of the uncertainty of the information in the file) and frequency characteristics of machine instructions or bytes. Given a sufficiently large set of files and file features, a classification system is built which, once trained on that set of files, can map the set of features extracted from a file to a category of files, for example the category of malware files.

The art includes solutions that implement malware detection based on metadata extracted from software files, including methods in which a convolution, that is, a "flexible" hash of the file (a locality sensitive hash), is formed from that metadata. Such a hash, or more precisely the way it is computed, has an important property: for files whose metadata differ only slightly, the hash values coincide.
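The property described above can be illustrated with a toy sign-of-projection hash. This is only a sketch of the general idea of a locality sensitive hash, not the hashing scheme of the invention or of the cited applications; the feature layout (file size in KiB, section count, entropy) and the hand-picked hyperplanes in `PLANES` are assumptions made for the example.

```python
# Toy locality-sensitive hash: one bit per hyperplane, set by the sign of
# (w . x - b). All planes and feature vectors below are hypothetical.

PLANES = [
    ((1.0, 0.0, 0.0), 50.0),   # is the file larger than 50 KiB?
    ((0.0, 1.0, 0.0), 4.0),    # does it have at least 4 sections?
    ((1.0, -1.0, 0.0), 0.0),   # size (KiB) versus section count
    ((0.0, 0.0, 1.0), 6.5),    # is the entropy at least 6.5 bits?
]

def sign_hash(features, planes=PLANES):
    """Return a bit string; nearby feature vectors tend to share it."""
    bits = []
    for weights, bias in planes:
        score = sum(w * x for w, x in zip(weights, features)) - bias
        bits.append("1" if score >= 0 else "0")
    return "".join(bits)

# (file size in KiB, number of sections, entropy in bits per byte)
packed_copy_a = (72.0, 5, 7.1)
packed_copy_b = (74.5, 5, 7.3)   # slightly different metadata
unrelated = (12.0, 3, 4.2)

print(sign_hash(packed_copy_a))  # 1111
print(sign_hash(packed_copy_b))  # 1111
print(sign_hash(unrelated))      # 0010
```

Two vectors that fall on the same side of every hyperplane receive the same hash, which is why small metadata differences do not change the value; vectors close to a hyperplane can still be separated, so the choice of planes matters in practice.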

For example, the invention described in application US 2009/0300761 A1 proposes a method for detecting malware files based on analysis of file metadata, in particular information about the packer or unique strings. In that invention, sequences of pseudocode instructions generated from the file, and the frequency characteristics of those instructions, may also be used as metadata. The metadata extracted from the file is used to form an "intelligent hash": a convolution in the form of a string that stores the set of file features needed to detect malware files.

The invention described in application US 2010/0192222 A1 proposes additionally using the results of an emulator and a behavioral analyzer as metadata.

An important requirement placed on existing solutions that detect malware files on the basis of extracted metadata is detection quality, namely the minimization of false positives. To detect such files, existing solutions compute the degree of similarity between a file and a previously formed set of files using various distance metrics between the metadata set of the file and that of the group of files. When a large set of metadata for the analyzed file is transmitted over the computer network, this approach places a significant load on the network infrastructure that connects the user's computer to the database of file metadata from the previously formed file sets.

Although the approaches considered address certain tasks in the field of protection against computer threats, they share a drawback: the desired quality of malicious file detection is not achieved. The present invention makes it possible to solve the problem of detecting malicious files on the basis of their metadata more effectively.

Disclosure of the invention

The present invention is intended for determining the similarity of files.

The technical result of the present invention is the determination of similar files, which is achieved by recognizing a file as similar to the files from a plurality of similar files having the same convolution if the convolution of that file matches the convolution of a file from that plurality.

A method for determining the similarity of files, in which: sets of mutable and immutable file features are determined, the files being stored in a database of files for training; a file feature is considered mutable if it takes different values across a plurality of similar files; a file feature is considered immutable if it takes the same value across a plurality of similar files; a plurality of features is extracted from at least one file; the plurality of extracted file features is divided into at least two subsets: a subset of mutable features and a subset of immutable features; the subset of mutable file features includes at least: the file size, the file image size, the number of file sections, the RVAs (Relative Virtual Addresses) of the file sections, the RVA of the entry point, the frequency characteristics of characters, the set of strings in the file and their number, and the filtered set of strings in the file and their number; the subset of immutable file features includes at least: the type of compiler used to create the file, the subsystem type, and the file characteristics from the COFF (Common Object File Format) header; a convolution of each of the above subsets of file features is formed; the convolution of the subset of immutable features is not robust to changes in the features of that subset; the convolution of the subset of mutable features is robust to changes in the features of that subset; the file convolution is formed as a combination of the convolutions of each of the above subsets of file features; the convolution of at least one file is compared with a set of previously created file convolutions; and the file is recognized as similar to the files from a plurality of similar files having the same convolution if, upon comparison, the convolution of that file matches the convolution of a file from that plurality.
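The combination of a change-sensitive convolution and a change-tolerant convolution can be sketched as follows. This is a minimal illustration under stated assumptions, not the method as claimed: SHA-256 stands in for the non-robust convolution of the immutable features, a coarse power-of-two bucketing stands in for the robust convolution of the mutable features, and the feature names, the subset of features used, and the `:` separator are all hypothetical.

```python
import hashlib

# Hypothetical feature names; only a few of the listed features are used.
IMMUTABLE = ("compiler", "subsystem", "coff_characteristics")
MUTABLE = ("file_size", "image_size", "section_count", "entry_point_rva")

def immutable_convolution(features):
    """Exact hash: any change to an immutable feature changes the result."""
    payload = "|".join(f"{name}={features[name]}" for name in IMMUTABLE)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]

def mutable_convolution(features):
    """Coarse hash: values of the same order of magnitude (in bits) collide."""
    return "-".join(str(int(features[name]).bit_length()) for name in MUTABLE)

def file_convolution(features):
    """Combine both parts; files match only if both parts match."""
    return immutable_convolution(features) + ":" + mutable_convolution(features)

sample_a = {"compiler": "msvc", "subsystem": 2, "coff_characteristics": 0x0102,
            "file_size": 73728, "image_size": 90112,
            "section_count": 5, "entry_point_rva": 4768}
# Same immutable features, slightly different mutable ones.
sample_b = dict(sample_a, file_size=75264, image_size=94208, entry_point_rva=5120)
# Different compiler: the immutable part, and so the whole convolution, differs.
sample_c = dict(sample_a, compiler="delphi")

print(file_convolution(sample_a) == file_convolution(sample_b))  # True
print(file_convolution(sample_a) == file_convolution(sample_c))  # False
```

Fixed buckets split values that sit close to a bucket boundary, which is one reason a learned robust hash may be preferable to this simple stand-in.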

A system for determining the similarity of files, comprising: a feature extraction means, designed to extract a plurality of features from at least one file and to pass the extracted plurality of features to the input of a feature separation means; the feature separation means, designed to form from the plurality of extracted features at least two subsets, namely a subset of mutable features and a subset of immutable features, to pass the formed feature subsets to the input of a feature convolution forming means, and to determine the sets of mutable and immutable file features, the files being stored in a database of files for training; a file feature is considered mutable if it takes different values across a plurality of similar files; a file feature is considered immutable if it takes the same value across a plurality of similar files; the subset of mutable file features includes at least: the file size, the file image size, the number of file sections, the RVAs of the file sections, the RVA of the entry point, the frequency characteristics of characters, the set of strings in the file and their number, and the filtered set of strings in the file and their number; the subset of immutable file features includes at least: the type of compiler used to create the file, the subsystem type, and the file characteristics from the COFF header; the feature convolution forming means, designed to form a convolution of each of the above subsets of file features and to pass the formed convolutions of the feature subsets to the input of a file convolution forming means; the convolution of the subset of immutable features is not robust to changes in the features of that subset; the convolution of the subset of mutable features is robust to changes in the features of that subset; the file convolution forming means, designed to form the file convolution as a combination of the convolutions of each of the above subsets of file features and to pass the formed file convolution to the input of a file convolution comparison means; a convolution database, designed to store at least one file convolution and, corresponding to that convolution, information on the maliciousness of the file; the database of files for training, designed to store the files used to determine the sets of mutable and immutable features; and the file convolution comparison means, connected to the convolution database and designed to compare the convolution of at least one file with the convolutions present in the convolution database and to recognize the file as similar to the files from a plurality of similar files having the same convolution if, upon comparison, the convolution of that file matches the convolution of a file from that plurality.
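Because similarity is decided by equality of convolutions, the comparison against the convolution database reduces to an exact-match lookup. The sketch below assumes an in-memory dictionary as a stand-in for the database; the convolution strings and the shape of the stored record are hypothetical.

```python
# Hypothetical stand-in for the convolution database: each known
# convolution maps to the stored information on maliciousness.
CONVOLUTION_DB = {
    "3f9a1c0e5b7d2a46:17-17-3-13": {"malicious": True},
    "b81e4d7702c95fa3:15-16-3-12": {"malicious": False},
}

def find_similar(convolution, db=CONVOLUTION_DB):
    """Return the stored record if the convolution matches exactly, else None."""
    return db.get(convolution)

print(find_similar("3f9a1c0e5b7d2a46:17-17-3-13"))  # {'malicious': True}
print(find_similar("0000000000000000:1-1-1-1"))     # None
```

Only the short convolution string needs to be compared, rather than the full set of extracted features.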

Brief description of the drawings

Additional objects, features, and advantages of the present invention will become apparent from the following description of embodiments of the invention with reference to the accompanying drawings, in which:

FIG. 1 shows a block diagram of a malware detection system.

FIG. 2 illustrates an embodiment of the interaction between the feature extraction means and the feature separation means.

FIG. 3 illustrates a block diagram of the formation of a hash function at the training stage of the SSH algorithm.

FIG. 4 shows an exemplary flowchart of the operation of one embodiment of the malware detection system.

FIG. 5 shows an example of a general-purpose computer system.

Although the invention may have various modifications and alternative forms, the characteristic features shown by way of example in the drawings will be described in detail. It should be understood, however, that the purpose of the description is not to limit the invention to a specific embodiment. On the contrary, the purpose of the description is to cover all changes and modifications that fall within the scope of this invention as defined by the appended claims.

Description of embodiments

The objects and features of the present invention, and the methods for achieving them, will become apparent by reference to exemplary embodiments. However, the present invention is not limited to the exemplary embodiments disclosed below and may be embodied in various forms. The matter set out in the description is nothing more than the specific details needed to help a person skilled in the art gain a comprehensive understanding of the invention, and the present invention is defined by the scope of the appended claims.

FIG. 1 shows a block diagram of a malware detection system. In a particular embodiment, the source of the file to be analyzed is antivirus software on a computer that has determined that the file requires additional analysis. When the detection system analyzes a file, the feature extraction means 110 extracts a set of features from the file. This set of features varies depending on the file type. In a particular embodiment, the detection system analyzes executable files in the PE (Portable Executable) format. In a particular embodiment of the invention, the set of features for this file type includes the following: the file size, the file image size, the number of file sections, the RVAs (Relative Virtual Addresses) of the file sections, the RVA of the entry point, the subsystem type, the file characteristics from the COFF (Common Object File Format) header, the relative positions of the directory table objects, the distribution of the directory table objects across the file sections, the type of compiler used to create the file, the frequency characteristics of characters (including printable characters), and the set of strings in the file and their number. In a particular embodiment, the set of file features includes, for each section of the file, the following: the information entropy of the beginning and end of the section, the mean value of the nonzero bytes at the beginning and end of the section, the virtual size of the section, and the physical size of the section. In a particular embodiment of the invention, the feature extraction means extracts the set of strings for a PE file from a dump obtained by emulating the execution of that file.
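Two of the per-section features named above, information entropy and the mean value of nonzero bytes, can be computed as in the following sketch. The text does not say how much of a section counts as its "beginning" or "end", so the 256-byte `window` is an assumption, as are the function names and the returned dictionary keys.

```python
import math
from collections import Counter

def shannon_entropy(data: bytes) -> float:
    """Shannon entropy of a byte buffer, in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(data).values())

def mean_nonzero(data: bytes) -> float:
    """Mean value of the nonzero bytes, or 0.0 if there are none."""
    nonzero = [b for b in data if b]
    return sum(nonzero) / len(nonzero) if nonzero else 0.0

def section_features(section: bytes, window: int = 256) -> dict:
    """Features of the first and last `window` bytes (window is an assumption)."""
    head, tail = section[:window], section[-window:]
    return {
        "entropy_head": shannon_entropy(head),
        "entropy_tail": shannon_entropy(tail),
        "mean_nonzero_head": mean_nonzero(head),
        "mean_nonzero_tail": mean_nonzero(tail),
        "physical_size": len(section),
    }

# A section whose start is zero padding and whose end holds every byte value.
example = bytes(256) + bytes(range(256))
print(section_features(example))
```

High entropy near 8 bits per byte is typical of compressed or encrypted data, which is why such features are informative for packed files.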

The set of features extracted from the file is passed to the input of the feature separation means 120, which divides the set of features into at least two subsets. In a particular embodiment, the set of features is divided into the following subsets: a subset of nominal features, a subset of quantitative features, a subset of ordinal features, and a subset of binary features. In another particular embodiment, the set of file features is divided into a subset containing at least one mutable feature and a subset containing at least one immutable feature. A file feature is considered mutable if its value varies across a plurality of similar files, and immutable if its value remains the same across a plurality of similar files. Files, in turn, are considered similar if the degree of similarity between them exceeds a set threshold. In a particular embodiment, the degree of similarity between files is determined from the degree of similarity of the data stored in the files. In another particular embodiment, it is determined from the degree of similarity of the files' functionality. In a particular embodiment, the functionality of a file is taken to be the log of calls to operating system API functions recorded while emulating the file's execution. In a particular embodiment of the invention, the degree of similarity is determined according to the Dice measure; in another particular embodiment, it is determined according to one of the following metrics: Hamming, Levenshtein, or Jaccard.
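The definitions above can be sketched in a few lines. The example treats an API call log as a set of function names, which is a simplifying assumption (a real log is an ordered sequence), and the feature names and sample values are hypothetical.

```python
def dice(a: set, b: set) -> float:
    """Dice measure: 2|A & B| / (|A| + |B|)."""
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))

def jaccard(a: set, b: set) -> float:
    """Jaccard index: |A & B| / |A | B|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def split_features(similar_files):
    """Split feature names into (mutable, immutable) over a group of similar files.

    A feature is immutable if it has one value across the whole group,
    and mutable otherwise.
    """
    mutable, immutable = set(), set()
    for name in similar_files[0]:
        values = {f[name] for f in similar_files}
        (immutable if len(values) == 1 else mutable).add(name)
    return mutable, immutable

log_a = {"CreateFileW", "WriteFile", "RegSetValueExW", "CloseHandle"}
log_b = {"CreateFileW", "WriteFile", "CloseHandle", "Sleep"}
print(dice(log_a, log_b))     # 0.75
print(jaccard(log_a, log_b))  # 0.6

group = [
    {"file_size": 73728, "subsystem": 2, "compiler": "msvc"},
    {"file_size": 75264, "subsystem": 2, "compiler": "msvc"},
]
print(split_features(group))
```

With a similarity threshold of, say, 0.7 on the Dice measure (an arbitrary value chosen for illustration), the two logs above would be treated as belonging to similar files.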

After means 120 has divided the set of features extracted from the file into at least two subsets, of which, in a particular embodiment, one contains at least one mutable feature and the other at least one immutable feature, the feature subsets are passed to the input of the feature convolution forming means 130.

FIG. 2 shows an embodiment of the interaction between the feature extraction tool 110 and the feature separation tool 120. In a particular embodiment of the invention, the feature extraction tool 110 passes the set of extracted strings to the input of the string filtering tool 210. The string filtering tool 210 filters out strings considered to be randomly generated, for example strings that malware generates in order to build paths for copying its files. Filtering is performed by determining the probability that a string belongs to a natural language defined by a dictionary. Strings and substrings for which this probability is low (does not exceed a set threshold) are considered uninformative and are filtered out.

The operating principle of the string filtering tool 210 can be illustrated by the following example. Suppose the set of strings of a file contains the following strings: "klnsdfjnjrmuiwenwefnuinwfui", "GetDriveType", "C:\lkjsdfjkh\windows.txt". When this list is filtered, the first string does not enter the resulting filtered set, the second enters it unchanged, and the third enters it in a modified form: "*\*\windows.*". The filtered set of strings, together with the number of strings in the filtered set, is passed to the input of the feature separation tool 120 as elements of the file's feature set.
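The filtering example above can be reproduced with a minimal sketch. Everything in it is an assumption made for illustration: the toy dictionary, the threshold of 0.5, the crude "probability" (the share of characters covered by dictionary words), and the choice to split strings on backslashes and dots. The description does not specify how the probability is actually computed.

```python
import re

# Hypothetical toy dictionary and threshold (not from the description).
DICTIONARY = {"get", "drive", "type", "windows"}
THRESHOLD = 0.5


def word_probability(token: str) -> float:
    """Crude stand-in for the probability of belonging to the language:
    the share of characters covered by dictionary words, after splitting
    the token at capital letters (GetDriveType -> Get, Drive, Type)."""
    if not token:
        return 0.0
    words = re.findall(r"[A-Z]?[a-z]+", token)
    covered = sum(len(w) for w in words if w.lower() in DICTIONARY)
    return covered / len(token)


def filter_string(s: str):
    """Return the string with uninformative substrings masked by '*',
    or None if no substring is informative."""
    parts = re.split(r"([\\.])", s)  # the capturing group keeps separators
    out, informative = [], False
    for part in parts:
        if part in ("\\", "."):
            out.append(part)
        elif word_probability(part) > THRESHOLD:
            out.append(part)
            informative = True
        else:
            out.append("*")
    return "".join(out) if informative else None


strings = ["klnsdfjnjrmuiwenwefnuinwfui", "GetDriveType", r"C:\lkjsdfjkh\windows.txt"]
filtered = [f for f in map(filter_string, strings) if f is not None]
print(filtered)       # ['GetDriveType', '*\\*\\windows.*']
print(len(filtered))  # 2
```

Both the filtered strings and their count are produced, matching the two outputs that are passed on to the feature separation tool 120.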

In a particular embodiment of the invention, for a PE file the subset of features containing at least one mutable feature includes the following features: file size, file image size, number of file sections, RVAs of the file sections, RVA of the entry point, relative position of the directory table objects, frequency characteristics of characters (including printable ones), the set of strings of the file and their number, and the filtered set of strings and their number. In a particular embodiment, this subset includes the following features for each section of the file: information entropy of the beginning and end of the section, mean value of the nonzero bytes of the beginning and end of the section, virtual size of the section, and physical size of the section. In a particular embodiment of the invention, the set of strings of the file is placed in a separate subset of features containing at least one mutable feature. In another particular embodiment of the invention, the filtered set of strings of the file is placed in a separate subset of features containing at least one mutable feature.

In a particular embodiment of the invention, to obtain the features of a PE file that belong to the subset containing at least one immutable feature, the following are extracted from the file's feature set: features of the compiler used to create the file, the subsystem type (NATIVE, WINDOWS_GUI, WINDOWS_CUI, OS2_CUI, POSIX_CUI, NATIVE_WINDOWS, WINDOWS_CE_GUI), the file characteristics from the COFF header, and the placement of the directory table objects across the file sections.

The convolution of the subset of features containing at least one immutable feature, hereinafter referred to as the "subset of immutable features", is formed by the feature convolution generation tool 130 in such a way that it is not resistant to changes in the features of the subset.

In a particular embodiment of the invention, the convolution of the subset of immutable features is a byte string formed by the feature convolution generation tool 130 by concatenating the string representations of the value of each feature, separated by a special character, which in a particular embodiment is the "*" character. For example, for a console application compiled with the Microsoft Visual Studio 2008 compiler, the convolution of the subset of immutable features looks as follows: "WINDOWS_CUI*MSVS2008".
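As a minimal sketch of this concatenation, assuming the feature values are already available as ASCII strings (the function name and the choice of ASCII encoding are illustrative):

```python
def immutable_convolution(values, separator="*"):
    """Join the string form of each immutable feature with a separator
    and return the result as a byte string."""
    return separator.join(values).encode("ascii")


conv = immutable_convolution(["WINDOWS_CUI", "MSVS2008"])
print(conv)  # b'WINDOWS_CUI*MSVS2008'

# Any change in a feature value changes the convolution, so it is
# deliberately not resistant to changes in the subset.
print(conv == immutable_convolution(["WINDOWS_GUI", "MSVS2008"]))  # False
```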

The convolution of the subset of features containing at least one mutable feature, hereinafter referred to as the "subset of mutable features", is formed by the feature convolution generation tool 130 in such a way that the convolution is resistant to changes in the features of the subset.

In a particular embodiment, resistance of the convolutions of mutable feature subsets is achieved by excluding from these subsets the features whose values vary within a set of similar files. The sets of mutable features of two similar files are then identical, and, regardless of how the convolutions are formed, the feature convolutions are identical as well.

In another particular embodiment of the invention, resistance of the convolutions of mutable feature subsets is achieved by establishing, for each feature characterized by the value of one or more bytes, a numeric window defining the minimum and maximum value that each byte characterizing the feature may take. In a particular embodiment of the invention, the convolution is obtained by concatenating the byte representations of the features, taking into account the limits established for each byte characterizing a file feature.
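One possible reading of the numeric window is plain quantization: every byte is replaced by the lower bound of the fixed-width window it falls into, so that values inside the same window yield the same output byte. The sketch below follows that reading only; the window width of 16 and the function names are assumptions, and the description does not state how the windows are chosen.

```python
def windowed_bytes(feature: bytes, width: int = 16) -> bytes:
    """Replace each byte by the lower bound of its window
    [k * width, (k + 1) * width - 1]."""
    return bytes((b // width) * width for b in feature)


def mutable_convolution(features, width: int = 16) -> bytes:
    """Concatenate the windowed byte representations of all features."""
    return b"".join(windowed_bytes(f, width) for f in features)


# Two hypothetical feature values whose bytes fall into the same windows.
a = (0x1234).to_bytes(2, "big")
b = (0x1F3A).to_bytes(2, "big")
print(windowed_bytes(a))  # b'\x100'  (bytes 0x10, 0x30)
print(windowed_bytes(a) == windowed_bytes(b))  # True
```

A known weakness of fixed windows is that two close values on opposite sides of a window boundary (for example 0x1F and 0x20) still produce different output bytes.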

In another particular embodiment of the invention, resistance of the convolutions of mutable feature subsets is achieved by using convolutions that have the locality-sensitive hashing (LSH) property.

In a particular embodiment of the invention, when the set of strings (including the filtered set of strings) is moved out of the subset of mutable features into a separate subset, the convolution of that string subset must itself be made resistant to changes. In a particular embodiment of the invention, cosine hashing, whose result is a sequence of bytes, is used to make the convolution of the subset of file strings resistant. The convolution obtained in this way is the convolution of the feature subset consisting of the file's strings.
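Cosine hashing is commonly realized with random hyperplanes (a SimHash-style scheme), and the sketch below uses that approach as an assumption, since the description does not give the construction. Each output bit is the sign of the projection of the file's bag of strings onto a pseudo-random direction of +1/-1 weights; the weights are derived from SHA-256 so that the result is reproducible. The function names and the 64-bit length are illustrative.

```python
import hashlib


def _weight(bit: int, token: str) -> int:
    """Pseudo-random +1/-1 weight of a string for a given hyperplane."""
    digest = hashlib.sha256(f"{bit}:{token}".encode("utf-8")).digest()
    return 1 if digest[0] & 1 else -1


def cosine_hash(strings, n_bits: int = 64) -> bytes:
    """Random-hyperplane hash of a bag of strings, returned as bytes.
    n_bits is assumed to be a multiple of 8."""
    bits = 0
    for i in range(n_bits):
        projection = sum(_weight(i, s) for s in strings)
        if projection >= 0:
            bits |= 1 << i
    return bits.to_bytes(n_bits // 8, "big")


def hamming(a: bytes, b: bytes) -> int:
    """Number of differing bits between two equal-length hashes."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


h1 = cosine_hash(["GetDriveType", "*\\*\\windows.*", "CreateFileW"])
h2 = cosine_hash(["CreateFileW", "GetDriveType", "*\\*\\windows.*"])
print(h1 == h2)         # True: the hash ignores string order
print(hamming(h1, h2))  # 0
```

With such a hash, string sets that overlap heavily tend to differ in few bits, but they are not guaranteed to produce byte-identical output, which matters when convolutions are compared for exact equality.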

In one embodiment of the invention, to make the convolution of the subset of mutable features resistant, the feature convolution generation tool 130 implements a method of creating the convolution of the subset of mutable features based on the Semi-Supervised Hashing (SSH) algorithm. The advantage of using this algorithm for the LSH task is that an optimal hash function for the convolution of the subset of mutable features is formed while the algorithm is being trained.

FIG. 3 shows a block diagram of the formation of the hash function during the training stage of the SSH algorithm. For every file in the training file database 310, the feature extraction tool 110 extracts a set of features. The sets of extracted features for each file in database 310 are passed to the input of the feature separation tool 120. From each input feature set, the feature separation tool 120 extracts the subset of mutable features. These feature subsets, corresponding to the files in the training file database 310, are passed to the hash function generation tool 320 as vectors of the byte values of the features. The hash function generation tool 320 also receives, from the training file relationship database 330, the set of similarity relationships between the files stored in the training file database 310. In a particular embodiment, a similarity relationship has the form of a tuple storing the identifiers of the files from database 310 between which the relationship holds and the identifier of the similarity relationship between those files.

In a particular embodiment of the training file relationship database 330, a similarity relationship takes one of the following values: "similar", "different", "relationship unknown". The value "similar" applies to a pair of similar files. For pairs of files that are not similar, the similarity relationship is set to "different". If no similarity relationship is explicitly specified for a pair of files from the training file database 310, the similarity relationship is set to "relationship unknown".

In a particular embodiment of the training file relationship database 330, the similarity relationship "different" is established between two files of which one is malicious and the other is not. In a particular embodiment, the similarity relationship "similar" is established between pairs of similar malicious files. In another particular embodiment, the similarity relationship "similar" is established between pairs of similar files that are not malicious.

Having received the vectors of mutable file features and the similarity relationships between the files, the hash function generation tool 320 tunes the hash function in accordance with the training stage of the SSH algorithm. The set of vectors of mutable file features is treated as a set of points in an n-dimensional hyperspace, where n is the number of features in the set of mutable features. Based on the training stage of the SSH algorithm, the hash function generation tool 320 partitions this hyperspace with planes in such a way that files for which the similarity relationship "similar" is specified lie on the same side of a plane, and files for which the similarity relationship "different" is specified lie on different sides. The result of the training is the computation of the coefficients that define the planes partitioning the hyperspace. The coefficients are represented as a matrix. The hash function generation tool 320 passes this coefficient matrix to the input of the feature convolution generation tool 130. In turn, the feature convolution generation tool 130 uses the coefficient matrix to form, for each input set of features from the subset of mutable features grouped as a vector, the convolution of the subset of mutable features, which is a sequence of bytes.
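The way a coefficient matrix turns a feature vector into a byte sequence can be sketched as follows. The sketch covers only the application of already computed planes, not the SSH training itself. The planes here are hand-picked and hypothetical, each plane is assumed to be given as a weight vector plus a bias, and each output bit records the side of the plane on which the vector lies.

```python
def hyperplane_convolution(vector, planes) -> bytes:
    """One bit per plane: 1 if w . x + b >= 0, else 0; bits are packed
    into bytes, most significant bit first, padded with zeros."""
    bits = []
    for weights, bias in planes:
        side = sum(w * x for w, x in zip(weights, vector)) + bias
        bits.append(1 if side >= 0 else 0)
    while len(bits) % 8:
        bits.append(0)
    out = bytearray()
    for i in range(0, len(bits), 8):
        byte = 0
        for bit in bits[i:i + 8]:
            byte = (byte << 1) | bit
        out.append(byte)
    return bytes(out)


# Hypothetical planes in a two-feature space: x0 = 100 and x1 = 50.
planes = [([1.0, 0.0], -100.0), ([0.0, 1.0], -50.0)]

file_a = [120, 10]  # hypothetical mutable feature vectors
file_b = [130, 20]  # differs from file_a, but lies on the same sides
file_c = [40, 90]   # lies on the opposite side of both planes

print(hyperplane_convolution(file_a, planes))  # b'\x80'
print(hyperplane_convolution(file_b, planes))  # b'\x80'
print(hyperplane_convolution(file_c, planes))  # b'@'  (byte 0x40)
```

Files a and b receive the same convolution despite different feature values, which is the resistance property sought for the mutable subset.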

A distinctive property of the described convolution formation mechanism is that an identical convolution is generated for the feature sets of two files if the similarity relationship between the files from which those features were extracted was set to "similar" at the training stage of the SSH algorithm. At the same time, the convolutions of the feature sets of two files whose relationship was set to "different" differ. The method of forming the hash function used to obtain the convolution of the subset of mutable features ensures that the convolution is resistant to changes in the features of that subset. This means that the convolutions of the mutable feature subsets of two similar files are identical even if some features in those subsets have different values.

The file convolution generation tool 140 receives from the feature convolution generation tool 130 the convolutions of all the extracted feature subsets. The convolution of each feature subset is a sequence of bytes. The file convolution is formed by combining the convolutions of the file's feature subsets. Because it is built from the resistant convolutions of the mutable feature subsets and the convolutions of the immutable feature subsets, the file convolution is also resistant: it takes the same form, byte for byte, for a set of similar files.

In a particular embodiment, the file convolution generation tool 140 forms the file convolution as a concatenation of the byte representations of the convolutions of the feature subsets.

In another particular embodiment, the file convolution generation tool 140 forms the file convolution using cosine hashing applied to the convolutions of the file's feature subsets.

In yet another particular embodiment of the invention, after forming the file convolution, the file convolution generation tool 140 additionally computes a hash function of the byte representation of the file convolution, after which the computed hash value is treated as the file convolution. In a particular embodiment, the file convolution generation tool 140 uses MD5 as the hash function applied to the byte representation of the file convolution.
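Concatenation followed by the optional MD5 step can be sketched with the standard `hashlib` module. The function name and the sample sub-convolutions are illustrative only.

```python
import hashlib


def file_convolution(sub_convolutions, use_md5: bool = False) -> bytes:
    """Concatenate the byte convolutions of the feature subsets and,
    optionally, replace the result with its MD5 digest."""
    combined = b"".join(sub_convolutions)
    return hashlib.md5(combined).digest() if use_md5 else combined


# Hypothetical sub-convolutions: immutable subset, then mutable subset.
parts = [b"WINDOWS_CUI*MSVS2008", b"\x80"]

print(file_convolution(parts))                     # b'WINDOWS_CUI*MSVS2008\x80'
print(len(file_convolution(parts, use_md5=True)))  # 16
```

The MD5 step gives every file convolution a fixed 16-byte size, which is convenient for storage and exact-match lookup. It is acceptable only because the inputs are already identical for similar files, since a cryptographic hash has no resistance to input changes of its own.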

After computing the file convolution, the file convolution generation tool 140 passes the generated convolution of the analyzed file to the file convolution comparison tool 150. The file convolution comparison tool 150 is also connected to the convolution database 170.

The file convolution comparison tool 150 compares the convolution received from the file convolution generation tool 140 with a set of file convolutions. This set of file convolutions is stored in the convolution database 170.

In addition to the file convolutions, the convolution database 170 stores the identifiers of the files corresponding to the convolutions, as well as information on whether the files whose convolutions are stored in the database are malicious. The file convolutions stored in the convolution database 170 are formed in advance in accordance with the method of forming a file convolution described above. Clusters of similar files are used to form the set of convolutions for the convolution database 170. In a particular embodiment, at least one file cluster used to form a convolution to be stored in the convolution database 170 consists of a plurality of malicious files.

If, when comparing the convolution of the analyzed file, the file convolution comparison tool 150 finds an identical convolution among those stored in the convolution database 170, it recognizes the analyzed file as similar to the files of the cluster of similar files from which that convolution in the convolution database 170 was formed.

In a particular embodiment, the file convolution comparison tool 150 recognizes a file as malicious if it has been recognized as similar to the files of a cluster of similar files that are malicious. The information on whether the files used to form a convolution entering the convolution database 170 are malicious is set when the corresponding record is created in the database.
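Because the comparison is an exact byte-for-byte match, the lookup can be sketched with an ordinary dictionary keyed by the convolution. The record layout, the cluster identifiers, and the verdict strings below are hypothetical; the description only states that file identifiers and maliciousness information are stored alongside the convolutions.

```python
from typing import Dict, NamedTuple, Optional


class ClusterRecord(NamedTuple):
    cluster_id: str   # hypothetical identifier of the cluster of similar files
    malicious: bool   # set when the record is created


def find_cluster(conv: bytes, db: Dict[bytes, ClusterRecord]) -> Optional[ClusterRecord]:
    """Exact-match search of a file convolution in the database."""
    return db.get(conv)


def verdict(conv: bytes, db: Dict[bytes, ClusterRecord]) -> str:
    record = find_cluster(conv, db)
    if record is None:
        return "no similar files found"
    return "malicious" if record.malicious else "not malicious"


# Hypothetical database contents.
database = {
    b"WINDOWS_CUI*MSVS2008\x80": ClusterRecord("cluster-1", True),
    b"WINDOWS_GUI*MSVS2008\x40": ClusterRecord("cluster-2", False),
}

print(verdict(b"WINDOWS_CUI*MSVS2008\x80", database))  # malicious
print(verdict(b"WINDOWS_GUI*MSVS2008\x40", database))  # not malicious
print(verdict(b"NATIVE*UNKNOWN\x00", database))        # no similar files found
```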

In a particular embodiment, the files analyzed by antivirus software on the user's computer serve as the input data of the feature extraction tool 110. The result of analyzing a file with the tools described in the invention is used together with the other protection tools provided by the antivirus software in order to increase the user's level of protection.

In a particular embodiment of the invention, the convolution database 170 and the file convolution comparison tool 150 are part of a server architecture connected to the user's personal computer, on which the tools described in the invention that are needed to form the file convolution are located. In another particular embodiment, the convolution database 170 and the file convolution comparison tool 150 are located directly on the user's personal computer.

FIG. 4 shows an exemplary flowchart of the operation of one embodiment of the malware detection system described above. At step 410, the feature extraction tool 110 extracts a set of features from the file. The extracted feature set is passed to the feature separation tool 120, which at step 420 divides it into at least one subset containing at least one mutable feature and at least one subset containing at least one immutable feature. The feature subsets are passed to the feature convolution generation tool 130, which at step 430 forms the convolution of each input subset of file features. At step 440, the file convolution generation tool 140 receives the convolution of each feature subset from the feature convolution generation tool 130 and forms the file convolution as a concatenation of the feature subset convolutions. The file convolution obtained at step 440 is passed to the input of the file convolution comparison tool 150, which at step 450 searches for it among the pre-formed convolutions in the convolution database 170. If at step 460 the file convolution comparison tool 150 finds that the convolution of the analyzed file matches a convolution from the convolution database 170, the analyzed file is recognized as similar to the files of the cluster of similar files that was used to form the matching convolution. If at step 490 it is established that a cluster of similar malicious files was used to form the matching convolution, the analyzed file is recognized as malicious at step 500. If at step 490 it is established that a cluster of similar files that are not malicious was used to form the matching convolution, the analyzed file is recognized as not malicious at step 510.

FIG. 5 shows an example of a general-purpose computer system, a personal computer or server 20, comprising a central processing unit 21, a system memory 22, and a system bus 23 that connects the various system components, including the memory associated with the central processing unit 21. The system bus 23 is implemented as any bus structure known in the art, comprising in turn a bus memory or bus memory controller, a peripheral bus, and a local bus that is capable of interacting with any other bus architecture. The system memory comprises read-only memory (ROM) 24 and random-access memory (RAM) 25. The basic input/output system (BIOS) 26 contains the basic procedures that transfer information between the elements of the personal computer 20, for example when the operating system is loaded using the ROM 24.

The personal computer 20 in turn comprises a hard disk 27 for reading and writing data, a magnetic disk drive 28 for reading from and writing to removable magnetic disks 29, and an optical drive 30 for reading from and writing to removable optical disks 31, such as CD-ROM, DVD-ROM, and other optical storage media. The hard disk 27, the magnetic disk drive 28, and the optical drive 30 are connected to the system bus 23 through the hard disk interface 32, the magnetic disk interface 33, and the optical drive interface 34, respectively. The drives and the corresponding computer storage media are non-volatile means of storing computer instructions, data structures, program modules, and other data of the personal computer 20.

The present description discloses an implementation of a system that uses a hard disk 27, a removable magnetic disk 29, and a removable optical disk 31, but it should be understood that other types of computer storage media 56 capable of storing data in a computer-readable form (solid-state drives, flash memory cards, digital disks, random-access memory (RAM), and the like) may be used, connected to the system bus 23 through the controller 55.

The computer 20 has a file system 36, which stores the operating system 35 as well as additional software applications 37, other program modules 38, and program data 39. The user can enter commands and information into the personal computer 20 through input devices (keyboard 40, mouse 42). Other input devices (not shown) may be used: a microphone, joystick, game console, scanner, and the like. Such input devices are usually connected to the computer system 20 through a serial port 46, which in turn is connected to the system bus, but they may be connected in other ways, for example through a parallel port, a game port, or a universal serial bus (USB). A monitor 47 or other type of display device is also connected to the system bus 23 through an interface such as a video adapter 48. In addition to the monitor 47, the personal computer may be equipped with other peripheral output devices (not shown), such as speakers, a printer, and the like.

The personal computer 20 can operate in a networked environment, using a network connection to one or more remote computers 49. The remote computer (or computers) 49 are similar personal computers or servers that have most or all of the elements mentioned above in the description of the personal computer 20 shown in FIG. 5. Other devices, such as routers, network stations, peer devices, or other network nodes, may also be present in the computer network.

Network connections may form a local area network (LAN) 50 and a wide area network (WAN). Such networks are used in corporate computer networks and internal company networks, and as a rule have access to the Internet. In LAN or WAN networks, the personal computer 20 is connected to the local area network 50 through a network adapter or network interface 51. When networks are used, the personal computer 20 may use a modem 54 or other means of communicating with a wide area network such as the Internet. The modem 54, which is an internal or external device, is connected to the system bus 23 through the serial port 46. It should be noted that the network connections are only exemplary and need not reflect the exact network configuration; in practice there are other ways of establishing a connection between one computer and another by technical means of communication.

In conclusion, it should be noted that the information given in the description is exemplary and does not limit the scope of the present invention as defined by the claims.

Claims

1. A method for determining similar files, in which:

a) a set of mutable and immutable file features is determined, the files being stored in a database of training files;

wherein a file feature is considered mutable if it takes different values across a plurality of similar files;

wherein a file feature is considered immutable if it takes the same value across a plurality of similar files;

b) a plurality of features is extracted from at least one file;

c) the plurality of extracted file features is divided into at least two subsets: a subset of mutable features and a subset of immutable features;

wherein the subset of mutable file features includes at least: the file size, the file image size, the number of file sections, the RVAs (Relative Virtual Addresses) of the file sections, the RVA of the entry point, the frequency characteristics of characters, the set of strings of the file and their number, and the filtered set of strings of the file and their number;

wherein the subset of immutable file features includes at least: the type of compiler used to create the file, the subsystem type, and the file characteristics from the COFF (Common Object File Format) header;

d) a convolution of each of the above subsets of file features is formed;

wherein the convolution of the subset of immutable features is not resistant to changes in the features of the subset;

wherein the convolution of the subset of mutable features is resistant to changes in the features of the subset;

e) a file convolution is formed as a combination of the convolutions of each of the above subsets of file features;

f) the convolution of at least one file is compared with a set of previously created file convolutions;

g) the file is recognized as similar to the files of a plurality of similar files having the same convolution if, on comparison, the convolution of said file matches the convolution of a file from said plurality.
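Step a) of claim 1 can be illustrated with a short Python sketch. It assumes, hypothetically, that each training file has already been reduced to a dictionary of feature values and that a cluster is a list of such dictionaries for files known to be similar; the claims prescribe neither representation, and the feature names are invented for the example.

```python
def classify_features(cluster):
    """Label a feature immutable if it has a single value across the whole
    cluster of similar files, and mutable otherwise."""
    mutable, immutable = set(), set()
    for name in cluster[0]:
        values = {features[name] for features in cluster}
        (immutable if len(values) == 1 else mutable).add(name)
    return mutable, immutable


# Hypothetical cluster of three similar training files.
cluster = [
    {"compiler": "msvc", "subsystem": "gui", "file_size": 70_000},
    {"compiler": "msvc", "subsystem": "gui", "file_size": 71_500},
    {"compiler": "msvc", "subsystem": "gui", "file_size": 69_200},
]
mutable, immutable = classify_features(cluster)
print(sorted(mutable))    # ['file_size']
print(sorted(immutable))  # ['compiler', 'subsystem']
```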

2. A system for determining similar files, which comprises:

a) a feature extraction means for extracting a plurality of features from at least one file and passing the extracted plurality of features to the input of the feature separation means;

b) a feature separation means for forming, from the plurality of extracted features, at least two subsets: a subset of mutable features and a subset of immutable features, and for passing the formed subsets of features to the input of the feature convolution forming means, and also for determining the set of mutable and immutable file features, wherein said files are stored in a database of training files;

wherein the subset of mutable file features includes at least: the file size, the file image size, the number of file sections, the RVAs of the file sections, the RVA of the entry point, the frequency characteristics of characters, the set of strings of the file and their number, and the filtered set of strings of the file and their number;

wherein the subset of immutable file features includes at least: the type of compiler used to create the file, the subsystem type, and the file characteristics from the COFF header;

c) a feature convolution forming means for forming a convolution of each of the above subsets of file features and passing the formed subset convolutions to the input of the file convolution forming means;

d) a file convolution forming means for forming a file convolution as a combination of the convolutions of each of the above subsets of file features and passing the formed file convolution to the input of the file convolution comparison means;

e) a convolution database for storing at least one file convolution and the information on maliciousness corresponding to that file convolution;

f) a database of training files for storing the files that are used to determine the set of mutable and immutable features;

g) a file convolution comparison means, connected to the convolution database, for comparing the convolution of at least one file with the convolutions present in the convolution database and for recognizing the file as similar to the files of a plurality of similar files having the same convolution if, on comparison, the convolution of said file matches the convolution of a file from said plurality.