RU2427890C2 - System and method to compare files based on functionality templates

Info

Publication number: RU2427890C2
Application number: RU2009136238/08A
Authority: RU
Inventors: Роман Сергеевич Василенко (RU)
Original assignee: ЗАО "Лаборатория Касперского"
Priority date: 2009-10-01
Filing date: 2009-10-01
Publication date: 2011-08-27
Also published as: RU2009136238A

Abstract

FIELD: information technologies.

SUBSTANCE: a method for determining whether files belong to collections of known files, based on comparing files by means of functionality templates, comprises the following stages. Functionality templates are generated from information about the executable file. Extracted noise information is then removed from the functionality templates of the executable file. The blocks of the functionality templates are then brought to a normalised form. These blocks are then compared with the blocks of the functionality templates of known files, and the comparison results are used to decide whether a block belongs to one of the functionality templates of known files. Once functionality templates have been created for known malicious software, newly arrived files can be compared with them and records can be added automatically when a similarity condition is met; characteristic logical blocks can be extracted from collections of malicious programs and heuristic rules created from them; and descriptions can be generated automatically. It also becomes possible to cluster objects, which speeds up their further processing.

EFFECT: increased reliability and accuracy of malicious software detection, achieved by comparing executable files by means of functionality templates.

14 cl, 16 dwg

Description

Technical field

The invention relates to systems and methods for describing and comparing the functionality of executable files and, more specifically, to determining whether they belong to known collections.

State of the art

The number of malicious programs is currently growing significantly. Several tens of thousands of new malware variants appear every day, which makes it increasingly difficult for antivirus companies to provide users with an adequate level of protection. One reason is the emergence of polymorphic and metamorphic malware. Polymorphism consists in generating unique program code directly at run time, and the procedure that generates the code is itself unique for each new infection. Metamorphism consists in encrypting or distorting some parts of the program, making it “unrecognizable”. A “new variant” is not necessarily a malicious program completely different from all others; for the most part, such variants are new only to signature-based detection, one of the main detection technologies in use today.

The growing threat from polymorphic and metamorphic viruses has forced antivirus developers to supplement signature analysis with other methods, including reduced-mask technology, cryptographic and statistical analysis, and code emulation.

Reduced-mask technology applies a transformation to the encrypted body of a virus that outputs a unique combination identifying the virus regardless of the encryption key chosen.

Code emulation was also used for more effective heuristic detection, since it made dynamic code analysis possible. However, the basic signature detection method itself saw almost no improvement.

Behavioral analysis became an alternative response to the sharp increase in the number of threats. Whereas traditional antivirus scanners stored malware signatures in databases and, during scanning, checked code against those signatures, a behavioral blocker determined whether a program was malicious from its behavior in the system. If a program performed actions not permitted by predefined rules, its execution was blocked.

The main advantage of a behavioral blocker is its ability to distinguish “good” programs from “bad” ones without the help of a professional virus analyst. Since there is no need to analyze each new threat, regular updates of the signature databases become unnecessary, while users remain reliably protected against new threats.

The problem is that there is a gray area between clearly malicious and permissible actions. Moreover, the same actions can be malicious in a program designed to cause damage and useful in legitimate software. For example, low-level data writing, which a malicious program may use to erase information from a hard disk, is used quite legitimately by the operating system. In addition, a behavioral blocker installed on a file server has difficulty determining whether a document is being modified or deleted legitimately by a user or as a result of the actions of a malicious program.

For this reason, most antivirus products are still oriented toward detecting known malicious programs by the signature method. Nevertheless, behavioral analysis has developed further: some modern solutions for protection against information threats use it in combination with other methods of finding, neutralizing, and removing malicious code.

The high percentage of false positives and the low malware detection rate of the new technologies make it impossible to abandon the signature detection method and raise the question of how to improve it.

There is another method of describing the functionality of executable files, described, for example, in patent US 5752039, which discloses the idea of comparing executable files and updating one executable file to the functionality of another, and which also describes a system for structuring executable files by creating logical blocks, similar to the one described in the present invention.

However, that patent does not describe the logical groupings of executable file elements that are present in the present invention.

There are other methods of identifying, in the code of an executable file, information that is useless for comparison. They are described, for example, in application US 20050028149 A1, which discloses the idea of creating a scheme that recognizes cyclic code, and in patent US 7349931, which discloses the idea of scanning a computer system for files with added code that work with potentially malicious processes.

However, the implementations of these methods say nothing about removing noise code introduced by the compiler.

There are also methods of comparing unknown executable files with a collection of known files, and of compiling collections. They are described, for example, in application US 20080195587 A1, which describes a method and system for determining the similarity of set models; in patent US 7103913, which discloses a method of comparing the records being checked with records already stored in the system; and in patent US 7447703, which discloses the idea of creating collections of information that improve employee productivity.

However, the present invention uses various ways of compiling collections of malicious files, a wider list than in the patents cited above. A file comparison method similar to that of the present invention is used in the cited patent, but its implementation differs.

Analysis of the prior art, and of the possibilities that arise when these approaches are combined in a single system, makes it possible to obtain a new result. Accordingly, the technical result of the claimed invention is to provide a system, a method, and a machine-readable medium that increase the reliability and accuracy of malware detection by comparing executable files by means of functionality templates.

Disclosure of the invention

The present invention is intended for describing and comparing the functionality of executable files in order to determine whether they belong to collections of known files.

The technical result of the present invention is an increase in the reliability and accuracy of malware detection, achieved by comparing executable files by means of functionality templates.

A method for determining whether files belong to collections of known files, based on comparing files by means of functionality templates, comprises the steps of: (a) generating functionality templates from information about the executable file; (b) removing the extracted noise information from the functionality templates of the executable file; (c) bringing the blocks of the functionality templates of the executable file to a normalized form for subsequent comparison; (d) comparing the normalized blocks of the functionality templates of the executable file with the blocks of the functionality templates of known files and, based on the comparison results, deciding whether a normalized block of functionality templates belongs to one of the functionality templates of known files.
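
Steps (b) to (d) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the patent: the `Block` type, the contents of the `NOISE` set, lower-casing as the normal form, the Jaccard overlap used as a stand-in similarity measure, and the 0.5 threshold. Step (a) is assumed to have already produced the input blocks.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    """Hypothetical logical block: an ordered tuple of element names."""
    elements: tuple


# Illustrative noise names only; the source gives no concrete list.
NOISE = frozenset({"__security_init_cookie", "_initterm"})


def remove_noise(blocks, noise=NOISE):
    """Step (b): drop noise elements and discard blocks left empty."""
    cleaned = []
    for block in blocks:
        kept = tuple(e for e in block.elements if e not in noise)
        if kept:
            cleaned.append(Block(kept))
    return cleaned


def normalize(block):
    """Step (c): an assumed normal form (trimmed, lower-cased names)."""
    return Block(tuple(e.strip().lower() for e in block.elements))


def similarity(a, b):
    """Jaccard overlap of element sets; a stand-in, not the patent's measure."""
    sa, sb = set(a.elements), set(b.elements)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def classify(blocks, known_templates, threshold=0.5):
    """Step (d): map each normalized block to the best-matching known
    template name, or to None when no score reaches the threshold."""
    result = {}
    for block in map(normalize, remove_noise(blocks)):
        best_name, best_score = None, 0.0
        for name, known_blocks in known_templates.items():
            for known in known_blocks:
                score = similarity(block, known)
                if score > best_score:
                    best_name, best_score = name, score
        result[block] = best_name if best_score >= threshold else None
    return result
```

For example, calling `classify` on a block containing `CreateFileA`, `_initterm` and `WriteFile` against a known template with `createfilea`, `writefile` and `closehandle` first drops `_initterm`, then matches the remaining two elements with a score of 2/3.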

In a particular embodiment, information about the executable file can be collected by information collection tools such as a module that emulates program execution; a program virtualization module; an application control module; and a module that disassembles the program and makes it easier to subsequently establish links between particular functions.

In another particular embodiment, the functionality templates of the executable file are generated in two stages: dynamic and static. The dynamic stage involves collecting information about the executable file with the information collection tools; its output is a memory dump and an event log. The static stage involves tracing the memory dump directly and generating the functionality templates of the executable file.

In another particular embodiment, there are two types of functionality template for an executable file: a template based on a sequence of blocks and a template based on a sequence of edges. Blocks are sets of elements such as API functions, references to lines of code, and references to other logical blocks. Edges are the sections between calls from one block to another.
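
The two template types can be pictured with a small Python sketch. The field names and the modelling of an edge as a (caller, callee) pair of block identifiers are assumptions made for illustration; the text above does not specify a concrete layout.

```python
from dataclasses import dataclass, field


@dataclass
class LogicBlock:
    """Hypothetical logical block with the three element kinds named above."""
    block_id: int
    api_calls: list = field(default_factory=list)   # API function names
    code_refs: list = field(default_factory=list)   # references to lines of code
    block_refs: list = field(default_factory=list)  # ids of blocks this one calls


def block_sequence(blocks):
    """Block-based template: block ids in order of appearance."""
    return [b.block_id for b in blocks]


def edge_sequence(blocks):
    """Edge-based template: (caller, callee) pairs in order of appearance.
    Treating an edge as such a pair is an assumption of this sketch."""
    return [(b.block_id, callee) for b in blocks for callee in b.block_refs]
```

With three blocks where block 0 calls blocks 1 and 2 and block 1 calls block 2, the block-based view is `[0, 1, 2]` and the edge-based view is `[(0, 1), (0, 2), (1, 2)]`.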

In another particular embodiment, the extracted noise information includes the compiler's base libraries and statically linked functions of standard libraries.

In another particular embodiment, the functionality templates of the executable file are compared with templates from the collections of known files using mathematical algorithms that differ for each type of functionality template, such as the Hamming distance, the Levenshtein distance, a weight calculation algorithm, and an algorithm for extracting characteristic information.
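
Two of the named measures have standard textbook definitions, sketched below in Python for arbitrary sequences (for instance, sequences of element names). Which measure the invention applies to which template type, and how distances are turned into a decision, is not shown here; the function names are this sketch's own.

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length sequences")
    return sum(x != y for x, y in zip(a, b))


def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions needed
    to turn sequence a into sequence b (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]
```

The Hamming distance only applies to sequences of equal length, whereas the Levenshtein distance also tolerates inserted or removed elements, for example `levenshtein("kitten", "sitting")` is 3.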

A system for determining whether files belong to collections of known files, based on comparing files by means of functionality templates, comprises: an emulation means that collects information about the executable file; a means for creating functionality templates of the executable file which, from the information collected by the emulation means, forms blocks and composes functionality templates of the executable file from them; a means for processing functionality templates of executable files, which removes the noise information extracted from the functionality templates and brings the blocks of the functionality templates to a normalized form; a means for storing collections of functionality templates of known files; and a means for comparing functionality templates of executable files, which compares the normalized blocks of functionality templates received from the processing means with the functionality templates of known files from said collection storage means, in accordance with the classification of functionality templates of executable files.

In a particular embodiment, the emulation means includes information collection tools such as a module that emulates program execution; a program virtualization module; an application control module; and a module that disassembles the program and makes it easier to subsequently establish links between particular functions.

A machine-readable medium for determining whether files belong to collections of known files, based on comparing files by means of functionality templates, stores a computer program which, when executed on a computer, performs the following steps: (a) generating functionality templates from information about the executable file; (b) removing the extracted noise information from the functionality templates of the executable file; (c) bringing the blocks of the functionality templates of the executable file to a normalized form for subsequent comparison; (d) comparing the normalized blocks of the functionality templates of the executable file with the blocks of the functionality templates of known files and, based on the comparison results, deciding whether a normalized block of functionality templates belongs to one of the functionality templates of known files.

Additional features and advantages of the invention will be set forth in the description that follows, and in part will be apparent from the description or may be learned by practice of the invention. The advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims, as well as in the accompanying drawings.

It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory and are intended to provide further explanation of the claimed invention.

Brief description of the drawings

Additional objects, features, and advantages of the present invention will become apparent from the following description of embodiments of the invention with reference to the accompanying drawings, in which:

Fig. 1A shows a diagram of the interaction of the means of the present invention.

Fig. 1B shows the method of operation of the present invention.

Fig. 2 shows an example of a logical block formed as a result of the operation of the present invention.

Figs. 3A and 3B show examples of signatures of various compilers.

Fig. 4 shows a schematic representation of the structure of a functional template.

Fig. 5 shows the data structure of a functional template.

Fig. 6 shows a schematic representation of the file sets used by the algorithm for extracting standard logical blocks by “informativeness value”.

Fig. 7 shows a scheme for comparing the logical blocks of two functional templates.

Fig. 8 shows a one-to-one correspondence, with similarity values, between the logical blocks of two functional templates.

Fig. 9 shows a table comparing the logical paths of two information templates.

Fig. 10 shows a binary representation of the logical paths of a functional template as a sequence of hashes of the names of events/block elements.

Fig. 11 shows an example of comparing blocks using the Levenshtein distance algorithm.

Figs. 12A and 12B show the extraction of information from functional templates for a detection algorithm based on the extraction of characteristic information.

Fig. 13 shows an example of a functional template for a detection algorithm based on the extraction of characteristic information.

Description of embodiments of the invention

The objects and features of the present invention, and the methods of achieving them, will become apparent by reference to exemplary embodiments. However, the present invention is not limited to the exemplary embodiments disclosed below and may be embodied in various forms. The matter set out in the description is nothing more than specific details provided to help a person skilled in the art gain a comprehensive understanding of the invention, and the present invention is defined only by the scope of the appended claims.

The present invention makes it possible to describe and compare the functionality of executable files by collecting that functionality and structuring it into functionality templates. Information is collected by both static and dynamic systems, which makes it possible to gather the maximum amount of information about the program's actions.

Unlike heuristic detection, the invention makes it possible to build complete structured descriptions of executable files rather than isolating specific “malicious” functionality. The method is based on collecting and comparing the functionality of the executable file itself, so a change in the file's binary representation has no effect on the description of the program and, consequently, none on the comparison results.

Powerful tools for collecting information about an executable file make it possible to gather a large amount of information about a file submitted for checking. These tools include the Emulator, a tool that models the state of the simulated system in order to execute the original machine code; the Sandbox, a virtualization tool that controls the interaction between the operating system and the program; HIPS, an application control tool; and the Tracer, a module that disassembles the program and makes it easier to subsequently establish logical links between particular functions. Taken by itself, raw information collected from different sources does not lend itself to analysis and comparison. For a high-quality comparison it must be normalized and structured. One way of representing the collected information is a functionality template.

A functionality template is a set of collected information grouped into logical blocks, together with the links between those blocks. A logical block (Logic Block) is a collection of elements such as API functions, references to strings in the code, references to other logical blocks, and so on. An example of a logical block is shown in Fig. 2. Combining the collected information into functionality templates offers several advantages:

functionality templates are created according to clearly defined logical rules, described below, so within a single malware family the functionality templates describing its members vary only slightly;

locality: the elements responsible for a particular action, whether obtaining privileges or sending requests to a server, are usually located either in one logical block or in a set of interconnected blocks;

because a functionality template has a clear structure, it can be compared using the various mathematical algorithms described below;

the structure of a functionality template does not depend on the information collection tools, so the tools can be combined to suit different tasks and constraints;

the number of unique functionality templates is orders of magnitude smaller than the number of unique malicious files, because most files can be similar in functionality while differing at the binary level. A unique functionality template can be built for each program (executable file) and for each family (collection); it characterizes the functionality of the program or of the family as a whole.

By creating functionality templates for already known malware, it is possible to:

compare newly arrived files with them and automatically add records to the databases when the files are sufficiently similar;

extract characteristic logical blocks from malware collections and create heuristic rules based on these blocks;

generate automatic descriptions.

It also becomes possible to cluster unknown objects, and objects in general, which speeds up their further processing.

The interaction of the components of the invention is shown in Fig. 1A. Creating functionality templates for executable files takes two stages. In the first stage, the emulation means 110 dynamically collects information about the executable file while it runs. The emulation means 110 may include various information collection tools, such as a module that emulates program execution; a program virtualization module (a "sandbox"); an application control module; or a module that disassembles the program and performs actions that later allow logical links to be established between particular functions. In the second stage, the means 120 for creating functionality templates of executable files forms functionality templates from the accumulated information according to logical rules. Next, the means 130 for processing functionality templates of executable files divides the templates into types based on the logical relationships between sequences of logical elements, extracts noise information from the templates, and brings their internal structural elements to a normalized form. The functionality template thus prepared for further analysis is passed to the means 140 for comparing functionality templates of executable files, which compares the internal structural elements of the template with the functionality templates of known files held in the storage 150 of collections of functionality templates of known files, using mathematical algorithms chosen according to the classification of the template. As a result of the comparison, the executable file to which the functionality template belongs is assigned to the corresponding collection of files.

Fig. 1B shows the method of operation of the invention. In step 170, information collection tools accumulate data about the executable file 160. This data consists mainly of a memory dump 171 and an event log 172. In step 180, functionality templates of the executable file are formed from the data obtained; they are of two types: templates based on a sequence of blocks 181 and templates based on a sequence of edges 182. The templates then undergo additional processing in step 190, during which noise information is extracted from them. In step 191, the blocks of the functionality templates are brought to a normalized form for subsequent comparison. In step 192, the normalized blocks are compared with the blocks of the functionality templates of known files, and in step 193 a decision is made, based on the comparison results, as to whether a normalized block belongs to one of the functionality templates of known files.

All stages of the operation of the invention are described in detail below.

The functionality template of an executable file is created by the functionality template creation means 120 in two main stages.

Dynamic stage:

the executable file is preprocessed by a dynamic information collection system operating within the emulation means 110. An emulator may be used, for example, although, as noted earlier, the functionality template approach does not preclude other tools such as SandBox, HIPS, and so on, or combinations of these tools. This stage must output two main components for further analysis:

a memory image (dump);

an event log.

The main information needed from the event log is the following set of pairs: {VA_1 - HASH_1, VA_2 - HASH_2, …, VA_n - HASH_n}, where VA_i is the virtual address of the given event element and HASH_i (210 in Fig. 2) is a hash function (for example, CRC32) computed over the name of the event (220). An event may be an API function call, a read of system DLL memory, and so on.
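The set of pairs can be illustrated with a minimal sketch. It assumes CRC32 over the bare event name, which the text offers only as an example; the log entries, addresses, and the helper names `event_hash` and `build_pairs` are hypothetical and not taken from the source.

```python
import zlib

def event_hash(name: str) -> int:
    """CRC32 of the event name (assumption: the hash covers the name only)."""
    return zlib.crc32(name.encode("ascii")) & 0xFFFFFFFF

def build_pairs(events):
    """Map each virtual address VA_i to HASH_i of the event recorded there."""
    return {va: event_hash(name) for va, name in events}

# Hypothetical event log entries: (virtual address, event name)
log = [
    (0x401000, "GetModuleHandleA"),
    (0x401010, "GetProcAddress"),
    (0x401020, "GetProcAddress"),
]
pairs = build_pairs(log)
for va, h in sorted(pairs.items()):
    print(f"{va:08X} - {h:08X}")
```

Because the hash depends only on the event name, two calls to the same function at different addresses yield the same HASH value, which is what later allows blocks from different files to be compared.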

Static stage:

in this stage the memory dump is traced and the functionality template of the executable file is formed from all the collected information. Compiler signatures are used to find the primary entry points in the memory dump.

Fig. 3A shows an example of a signature for the Microsoft Visual C++ 6.x (MSVCRT) compiler:

558BEC6AFF68::::::::68::::::::64A100000000506489250000000083::::5356578965E833DB895DFC6A::FF15::::::::59830D::::::::FF830D::::::::FFFF15::::::::8B0D, where :: (two colons) denotes any byte. The signature hash is highlighted in the Hashes column 310. Columns 320 and 330 show the corresponding instructions and values.

An example of a signature for the Borland C++ compiler is shown in Fig. 3B:

EB1066623A432B2B484F4F4B90E9. The signature hash is highlighted in the Hashes column 340. Columns 350 and 360 show the corresponding instructions and values.
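Matching such a signature against a memory dump can be sketched as follows. This is only an illustration: the helper names `parse_signature` and `find_signature` and the synthetic memory image are assumptions, and a plain linear scan is used rather than whatever search the invention actually employs.

```python
def parse_signature(sig: str):
    """Convert a hex signature to byte values, with None for each '::' wildcard."""
    sig = sig.replace(" ", "")
    if len(sig) % 2:
        raise ValueError("signature must have an even number of characters")
    return [None if sig[i:i + 2] == "::" else int(sig[i:i + 2], 16)
            for i in range(0, len(sig), 2)]

def find_signature(image: bytes, pattern) -> int:
    """Return the first offset at which the pattern matches, or -1."""
    n = len(pattern)
    for off in range(len(image) - n + 1):
        if all(p is None or image[off + k] == p for k, p in enumerate(pattern)):
            return off
    return -1

# Prefix of the Visual C++ signature quoted above
msvc_prefix = parse_signature("558BEC6AFF68::::::::68::::::::64A100000000")
borland = parse_signature("EB1066623A432B2B484F4F4B90E9")

# Synthetic memory image, made up for this example
image = bytes.fromhex("9090" "558BEC6AFF68" "11223344" "68" "55667788"
                      "64A100000000" "50")
print(find_signature(image, msvc_prefix))  # offset of the candidate entry point
print(find_signature(image, borland))      # -1: the Borland signature is absent
```

The offset returned for a matching signature would serve as a primary entry point for the trace.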

Logical blocks are formed according to the following rules:

a logical block begins at a trace "entry point";

a call instruction that does not call an API function and leads into the allocated memory region is a call to the next or an existing logical block;

tracing continues until a ret (ret N), int N, or jmp N instruction is encountered, or until any invalid opcode exception occurs. It is then checked whether there is a jump to a location beyond this instruction. If there is none, the logical block ends. If there is, an attempt is made to continue tracing after this instruction;

any code reference encountered is added to the list of "entry points", that is, places in the code from which the program can begin execution;

when an API function call, a string reference, or a match between the traced address and the dynamic system's log is found, a corresponding new element is added to the logical block.
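The rules above can be sketched on a toy instruction listing. Everything about the model is an assumption made for illustration: one instruction per address, the mnemonics `call`, `str`, `jcc`, `ret`, `int` and `jmp`, a missing address standing in for an invalid opcode, and only call targets being treated as code references. A real tracer works on disassembled machine code.

```python
TERMINATORS = {"ret", "int", "jmp"}

def trace_blocks(code, entry, api_names, event_log=None):
    """Group a toy listing {address: (mnemonic, operand)} into logical blocks.

    Returns {block start address: list of (kind, value) elements}.
    """
    event_log = event_log or {}
    entry_points = [entry]               # worklist of places where execution may start
    blocks = {}
    while entry_points:
        start = entry_points.pop(0)
        if start in blocks or start not in code:
            continue
        elements, jump_targets, addr = [], set(), start  # block starts at an entry point
        while True:
            ins = code.get(addr)
            if ins is None:              # stands in for an invalid opcode
                break
            op, arg = ins
            if op == "call" and arg in api_names:
                elements.append(("api", arg))        # API call becomes an element
            elif op == "call" and arg in code:
                elements.append(("block", arg))      # call of another logical block
                entry_points.append(arg)             # code reference: new entry point
            elif op == "str":
                elements.append(("str", arg))        # string reference
            elif op == "jcc":
                jump_targets.add(arg)                # remembered for the check below
            elif addr in event_log:
                elements.append(("event", event_log[addr]))  # match with the dynamic log
            if op in TERMINATORS and not any(t > addr for t in jump_targets):
                break                    # no jump leads past this instruction: block ends
            addr += 1
        blocks[start] = elements
    return blocks

# Hypothetical program: block 0 calls block 10 and jumps over its first ret
code = {
    0: ("call", "GetModuleHandleA"),
    1: ("jcc", 4),
    2: ("call", 10),
    3: ("ret", None),
    4: ("str", "ADVAPIXX.DLL"),
    5: ("ret", None),
    10: ("call", "GetProcAddress"),
    11: ("ret", None),
}
blocks = trace_blocks(code, 0, {"GetModuleHandleA", "GetProcAddress"})
for start, elements in blocks.items():
    print(start, elements)
```

In this example the first `ret` does not end block 0, because the conditional jump at address 1 leads past it, so tracing continues and picks up the string reference.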

The structure of a functionality template is shown schematically in Fig. 4. HASH_i and HASH_i+1 in blocks 440 and 450 are hashes of functions (including their arguments), for example of the function "WriteFile", whose hash is CCE95612 in block 430. Two types of sequences of interrelated elements are distinguished:

the logical blocks themselves, and

the sections between the calls from one logical block to another, called edges (Edge), for example Edge #1.1 and Edge #1.2 of block 410.

As stated earlier, a logical block is a collection of elements such as API functions, references to strings, references to other logical blocks, and so on. In the drawing, the logical blocks are Logic Block #1 410, Logic Block #2 420, Logic Block #3 430, Logic Block #4 440, and Logic Block #5 450.

A logical block that consists only of API function calls and contains no jumps is the simplest logical block and is, in its entirety, a single edge. When a logical block contains a call to another logical block, the first block is divided into two edges: from its beginning to the call of the second block, and from the call of the second block to the end of the first block. Accordingly, a logical block that references n other logical blocks consists of n+1 edges. For example, Logic Block #1 410 contains the edges Edge #1.1, Edge #1.2, and so on, where CALL Logic Block #2 and CALL Logic Block #3 are calls to Logic Block #2 420 and Logic Block #3 430, respectively.
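The n+1 rule can be checked with a short sketch. The element representation as `(kind, value)` tuples and the function name `split_into_edges` are hypothetical. The sketch also assumes that the call itself acts as a boundary and is not stored in either neighbouring edge, a detail the text leaves open.

```python
def split_into_edges(elements):
    """Split a logical block's elements into edges at each call to another block."""
    edges, current = [], []
    for kind, value in elements:
        if kind == "block":      # a call to another logical block closes the current edge
            edges.append(current)
            current = []
        else:
            current.append((kind, value))
    edges.append(current)        # the tail after the last call is the final edge
    return edges

# Hypothetical Logic Block #1 with calls to blocks #2 and #3
block1 = [
    ("api", "GetModuleHandleA"),
    ("block", 2),
    ("api", "GetProcAddress"),
    ("block", 3),
    ("api", "CloseHandle"),
]
edges = split_into_edges(block1)
print(len(edges))  # two calls give three edges
```

A block without any calls to other blocks comes back as a single edge, matching the "simplest logical block" case.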

Based on the logical relationships between sequences of elements, two types of functionality templates for executable files were defined:

an ordered set of logical blocks, hereinafter called a simple template: a functionality template in which the dependencies between elements inside logical blocks are the determining factor;

a set of interconnected edges, hereinafter called an edge template: a functionality template in which the dependencies between edges are the determining factor.

The division of templates into types is performed by the functionality template processing means 130 of the invention, shown in Fig. 1.

For the example in Fig. 4, the data structures of the functionality templates are as shown in Fig. 5: the ordered set of logical blocks 510 is a list of logical blocks (it may be sorted by size to optimize processing).

To build the set of interconnected edges 520, the edges are linked in the order in which they are called in the program code. Each edge appears in the set 520 only once, to avoid duplicating information and introducing cycles.

A logical path (Logic Path), for example logical path 530 or 540, is a path from an "entry point" into the program, that is, from a place in the code from which the program can begin execution. Accordingly, there are as many paths as there are "entry points" in the program. As already mentioned, any code references encountered during the static stage are added to the list of "entry points".

The information that describes the file and is collected in the functionality template may include "noise" information, such as the compiler's base libraries, statically linked standard library functions, and so on. If the noise information is not removed, a clean program and a malicious one may well appear more than 90% similar, which would lead to the wrong conclusion that they are alike. Without identifying, evaluating, and preprocessing such information, comparing functionality templates is risky, because under certain conditions the probability of a false positive is high. Logical blocks containing such "noise information" are hereinafter called standard logical blocks. In the invention, logical blocks containing noise information are extracted from functionality templates by the functionality template processing means 130 shown in Fig. 1.

One algorithm for identifying standard logical blocks selects them by their "informativeness value", a quantity that characterizes how uninformative a block is from the standpoint of the method (explained below). Take two sets of files, shown in Fig. 6: X, malicious programs, and Y, clean programs. The sets share the following characteristics:

the same compiler for all files in the sets;

similar system libraries used (± a few libraries);

a similar total number of logical blocks in each pattern (±15-20%);

a similar total number of hashes in each pattern (±15-20%).

These characteristics were chosen because most noise is the base code of the compiler libraries or the compiler's standard libraries. To find the noise blocks for the Visual C++ compiler automatically, one takes a set of files compiled by that compiler.

The total number of logical blocks and the number of hashes in a functionality template are the most general similarity parameters. For instance, if one functionality template consists of a single hash in a single logical block and another consists of a hundred hashes in a hundred logical blocks, the probability that such files share similar (noisy) sections is small. As already mentioned, a hash is a hash function computed over the name of an event.

Extract all unique logical blocks, that is, unique sets of hashes, from such a set. For each unique block, determine the number of files in each set that contain it. This gives four values for each block:

X, the total number of malicious patterns 610;

Y, the total number of clean patterns 620;

x, the number of malicious patterns 630 that contain the logical block;

y, the number of clean patterns 640 that contain the logical block.

Thus, for each block one can compute the informativeness value iGain(X,Y,x,y), which shows how significant the block is for dividing the set {X+Y} into the two subsets X and Y, that is, how strongly the block determines whether a functionality template belongs to one collection or the other. Note that a genuinely malicious block will not occur in the set of clean functionality templates 620, whereas a compiler block (with the same MD5 sum) will occur in both clean and malicious files, so its informativeness will be very low. A block with low informativeness may be noise information. Such an uninformative block is subsequently deleted from the database. When an element is deleted, all logical links must be retained so that the paths are preserved. The blocks of interest are those that are present in more than half of the files in the set {X+Y} and at the same time have sufficiently low informativeness:

x+y ≥ (X+Y)/2;

iGain ≤ ε, where ε is a given, sufficiently small informativeness threshold.

Once the standard logical blocks have been identified, we turn to the algorithms for comparing information patterns.

For example, consider a logical block consisting of the following functions (the Visual C++ compiler runtime):

_except_handler3

_set_app_type

_p_fmode

_p_commode

_adjust_fdiv_XREFF

_setusermatherr

_initterm

_getmainargs

_initterm

_acmdln_XREFF

GetStartupInfoA

GetModuleHandleA

exit

_XcptFilter

The collection contains:

malicious patterns: X=67;

clean patterns: Y=84.

This block occurred y=71 times in the clean patterns and x=62 times in the malicious ones, giving iGain(X,Y,x,y)=1134·10^-6.

Consider another logical block, consisting of the following functions (obtaining API functions for elevating privileges):

"ADVAPIXX.DLL"

GetModuleHandleA

"ADJUSTTOKENPRIVILEGES"

GetProcAddress

"LOOKUPPRIVILEGEVALUEA"

GetProcAddress

"OPENPROCESSTOKEN"

GetProcAddress

GetCurrentProcess

CloseHandle

For the same subsets of clean and malicious patterns, this block occurred y=2 times among the clean patterns and x=65 times among the malicious ones, giving iGain(X,Y,x,y)=81461·10^-6.

The iGain values for these two logical blocks differ markedly.

Первый логический блок встречается достаточно большое количество раз, как в чистых файлах, так и во вредоносных, поэтому он имеет низкую информативность для разделения множества М=Х+Y на 2 подмножества Х и Y. Т.е. он не информативен и не будет использоваться при создании шаблона функциональности.The first logical block occurs quite a large number of times, both in clean files and in malicious ones, therefore it has low information content for dividing the set M = X + Y into 2 subsets of X and Y. That is, it is not informative and will not be used when creating a functionality template.

Второй логический блок часто встречается во вредоносных файлах, но редко в чистых, поэтому имеет высокое значение информативности для разделения множества на подмножества, поэтому он может быть использован при сравнении.The second logical block is often found in malicious files, but rarely in clean ones, therefore it has a high information content value for dividing the set into subsets, therefore it can be used in comparison.
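The contrast between the two blocks can be illustrated with a minimal sketch. It uses the textbook information gain of the split "block present / block absent", which is an assumption: the exact iGain formula of this description is defined earlier and may differ, so the sketch is not expected to reproduce the values 1134·10^-6 and 81461·10^-6. The number of malicious patterns X is not stated in this passage, so `X_ASSUMED = 100` is a hypothetical value chosen only for illustration; the function names are likewise hypothetical.

```python
from math import log2


def entropy(pos, neg):
    """Binary entropy (in bits) of a set with `pos` malicious and `neg` clean items."""
    total = pos + neg
    h = 0.0
    for count in (pos, neg):
        if count:  # the term 0 * log(0) is taken as 0
            p = count / total
            h -= p * log2(p)
    return h


def info_gain(X, Y, x, y):
    """Textbook information gain of splitting X + Y patterns by block presence.

    X, Y: total malicious and clean patterns.
    x, y: how many malicious and clean patterns contain the block.
    """
    total = X + Y
    present = x + y
    absent = total - present
    after = (present / total) * entropy(x, y) + (absent / total) * entropy(X - x, Y - y)
    return entropy(X, Y) - after


X_ASSUMED = 100  # hypothetical: the real value of X is not given in this passage
Y = 84

gain_common_block = info_gain(X_ASSUMED, Y, x=62, y=71)     # first block
gain_privilege_block = info_gain(X_ASSUMED, Y, x=65, y=2)   # second block
```

Under these assumptions the privilege-escalation block receives a much higher gain than the block that is common to both sets, which matches the qualitative conclusion drawn above.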

The functionality template comparison means 140, shown in Fig. 1, compares functionality templates with one another by comparing their logical blocks (or edges, depending on the type of functionality template). Since there is no direct correspondence between the blocks of different functionality templates, the task is to heuristically select only those blocks that could potentially be similar. The selection heuristic differs for different types of patterns. As a result of comparing a functionality template with the functionality templates from the collections of known files 150, shown in Fig. 1, the executable file to which the template belongs is placed in the corresponding file collection.

Comparison of simple templates:

The comparison scheme is shown in Fig. 7. The logical blocks in simple templates are sorted by the number of elements in the block, and only blocks with approximately equal numbers of elements are compared with each other (the permitted deviation is set by a parameter). Another limiting factor is the number of calls to other blocks inside a logical block. The result is an M×N matrix 710, where M and N are the numbers of logical blocks in the two simple templates, and each cell holds the similarity value between the corresponding logical blocks.

After the logical blocks have been compared with each other, the best combination of similar blocks is sought. The best variant is the one in which matching the "similar" logical blocks of the two simple templates yields the highest overall percentage of similarity between the patterns. To speed up the comparison, the search for the best variant is performed only over those logical blocks that could potentially be similar. If the best variant were sought without this acceleration, over all cells of the table, the complexity of the algorithm would be K!, where K = Max(M, N), because each logical block is compared with all blocks of the other simple template.

Fig. 8 shows a similar example of comparing simple templates, where 810 are the logical blocks of the first functionality template and 820 are the logical blocks of the second functionality template. Having placed the blocks of both files in one-to-one correspondence, each pair with its similarity value, it is necessary to find the optimal combination of similarities. For example, for Fig. 8 the optimal correspondence is 1.1-2.3, 1.3-2.4, 1.4-2.2.
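The size pre-filter and the search for the best one-to-one combination can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names are hypothetical, the similarity values in the matrix are invented (the actual numbers of Fig. 8 are not reproduced in the text), and the exhaustive search deliberately shows the factorial-cost baseline that the pre-filter is meant to avoid. A cell of 0.0 stands for a pair rejected by the pre-filter.

```python
from itertools import permutations


def size_compatible(block_a, block_b, max_deviation=2):
    """Pre-filter: compare only blocks whose element counts are close."""
    return abs(len(block_a) - len(block_b)) <= max_deviation


def best_matching(sim):
    """Exhaustively find the one-to-one block pairing with the highest total similarity.

    sim[i][j] is the similarity of block i (template 1) and block j (template 2).
    Returns (total_score, pairs); pairs with zero similarity are dropped.
    """
    rows, cols = len(sim), len(sim[0])
    transposed = rows > cols
    if transposed:  # make the shorter side the rows
        sim = [list(col) for col in zip(*sim)]
        rows, cols = cols, rows
    best_score, best_pairs = -1.0, []
    for perm in permutations(range(cols), rows):
        score = sum(sim[i][j] for i, j in enumerate(perm))
        if score > best_score:
            best_score = score
            best_pairs = [(i, j) for i, j in enumerate(perm) if sim[i][j] > 0]
    if transposed:
        best_pairs = [(j, i) for i, j in best_pairs]
    return best_score, best_pairs


# Hypothetical similarity matrix for two templates with four blocks each.
example_sim = [
    [0.1, 0.0, 0.9, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.2, 0.0, 0.8],
    [0.0, 0.7, 0.0, 0.3],
]
score, pairs = best_matching(example_sim)
```

With zero-based indices, the pairs found for this invented matrix correspond to 1.1-2.3, 1.3-2.4 and 1.4-2.2, the same shape of answer as in the Fig. 8 example. For larger templates a polynomial-time assignment algorithm would be a more practical choice than enumerating permutations.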

Comparison of edge templates:

Fig. 9 shows the comparison table 910 of logical paths (LP_x). The size of the task grows, because Fig. 8 compared logical blocks, whereas Fig. 9 compares logical paths. Since the data structure of edge templates contains not one type of element but two (edges and logical paths), the comparison is correspondingly somewhat more complex. It proceeds in two stages. First, the logical paths are sorted by the number of edges and elements they contain, after which only potentially similar paths are selected, and only they take part in the comparison and in the search for the best variant. The second stage is the direct comparison of two paths with each other, which is performed in the same way as the comparison of ordered sets of logical blocks, except that edges are compared instead of logical blocks. In this situation a logical path becomes equivalent to an ordered set of logical blocks.

The "lowest" unit of comparison (the most elementary one) is a logical block or an edge, depending on the type of functionality template. Logical blocks and edges are sequences of elements (events) and can be compared by various algorithms. Before blocks can be compared with each other, however, they must be brought to a definite normalized form that is ready for comparison. Normalization is performed by the functionality template processing means 130, shown in Fig. 1.

One variant of the normalized form is the representation of logical blocks as binary strings. The alphabet of these strings consists of the hash values of the event/element names. In this form, working with the elements is most convenient and logically clear.

For clarity, calls to logical blocks are denoted as 00000000 + the number of the logical block, as shown below in the example of ordered sets of logical blocks, in the form 00000002 or 00000005.
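The normalization described above can be sketched as follows. The description does not name the hash function; the examples only show eight hexadecimal digits per element, so a 32-bit hash is plausible. CRC-32 from Python's `zlib` module is used here purely as an assumed stand-in, and the function names and the convention of passing block calls as integers are hypothetical.

```python
import zlib


def hash_name(name):
    """Map an event/element name to an 8-digit hexadecimal token (CRC-32 is an assumption)."""
    return format(zlib.crc32(name.encode("utf-8")), "08X")


def normalize_block(elements):
    """Turn a logical block into a list of fixed-width tokens.

    Strings are API function names or string references and are hashed.
    Integers denote calls to other logical blocks and are written as
    00000000 + block number, e.g. 2 -> "00000002".
    """
    tokens = []
    for element in elements:
        if isinstance(element, int):
            tokens.append(format(element, "08X"))
        else:
            tokens.append(hash_name(element))
    return tokens


# A block that resolves an API function and then calls logical block 2.
normalized = normalize_block(["GetModuleHandleA", "GetProcAddress", 2])
```

Fixed-width tokens make each element behave like a single letter of a string, which is what the distance algorithms discussed later rely on.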

Applying these transformations to the example logical blocks/edges given above yields the following.

The ordered set of logical blocks shown in Fig. 5:

Logic Block #1:

C24FA5F4 447D086B 6CC098F5 00000002 5C856C47 00000003 00000005, where 00000002, 00000003 and 00000005 are transitions between hashes. One possible use of these values is the reconstruction of logical paths. These values may be stored inside the ordered set of logical blocks.

Logic Block #2:

FFF372BE 919B6BCB 553B5C78

Logic Block #3:

C79DC4E3 1DDA9F5D 553B5C78 CCE95612 00000004 B09315F4

Logic Block #4:

HASH_1, HASH_2, HASH_3, HASH_i, …, HASH_N

Sets of interconnected edges:

Logic Path #1 is represented as the sequence shown in Fig. 10.

In this form, functionality templates can easily be subjected to various mathematical transformations and comparisons.

The algorithms for comparing logical blocks/edges described below have already been used for comparison and have produced positive results.

The Hamming distance is a measure of the difference between code combinations (binary vectors) in the vector space of code sequences. The Hamming distance between two binary sequences (vectors) X and Y of length n is the number of positions in which they differ. In our case the vectors X and Y are determined by the unique elements of the given logical blocks.

The Levenshtein distance is a measure of the difference between two sequences of symbols (strings), defined as the minimum number of insertion, deletion and substitution operations needed to transform one string into the other. In our case the strings are logical blocks/edges, and each element (HASH) inside a logical block/edge is one letter. The elements carry no weight, i.e. they are all of equal value. Fig. 11 shows the calculation of the Levenshtein distance when comparing two blocks 1110 and 1120. It follows from the definition of the algorithm that in this case the Levenshtein distance is 5.
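Both distances can be sketched over hash sequences as follows. The function names are hypothetical. For the Hamming distance the sketch assumes one particular reading of "vectors determined by the unique elements": each block becomes a 0/1 presence vector over the union of unique hashes of the two blocks. The Levenshtein function is the standard dynamic-programming algorithm in which every hash counts as one letter with unit cost. The sample blocks use the hash values of Logic Block #2 and Logic Block #3 as reconstructed from the text.

```python
def hamming_distance(block_a, block_b):
    """Hamming distance between presence vectors over the union of unique hashes."""
    set_a, set_b = set(block_a), set(block_b)
    alphabet = sorted(set_a | set_b)
    vector_a = [int(h in set_a) for h in alphabet]
    vector_b = [int(h in set_b) for h in alphabet]
    return sum(bit_a != bit_b for bit_a, bit_b in zip(vector_a, vector_b))


def levenshtein_distance(seq_a, seq_b):
    """Minimum number of insertions, deletions and substitutions turning seq_a into seq_b."""
    previous = list(range(len(seq_b) + 1))
    for i, item_a in enumerate(seq_a, start=1):
        current = [i]
        for j, item_b in enumerate(seq_b, start=1):
            substitution = previous[j - 1] + (0 if item_a == item_b else 1)
            current.append(min(previous[j] + 1,      # deletion
                               current[j - 1] + 1,   # insertion
                               substitution))
        previous = current
    return previous[-1]


block_2 = ["FFF372BE", "919B6BCB", "553B5C78"]
block_3 = ["C79DC4E3", "1DDA9F5D", "553B5C78", "CCE95612", "00000004", "B09315F4"]

hamming = hamming_distance(block_2, block_3)
levenshtein = levenshtein_distance(block_2, block_3)
```

The Hamming variant ignores order and multiplicity, while the Levenshtein variant is order-sensitive, so the two measures can rank the same pair of blocks differently.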

Algorithm for calculating weights (iGain):

Each element is assigned a generated weight. The weights are generated over the set {X + Y}, where X is the total number of malicious functionality templates and Y is the total number of clean functionality templates.

X and Y are selected either by a third-party algorithm or from indirect indicators.

As a result, each ID receives a weight w_id, which shows how strongly it characterizes a functionality template on the subset X. The logical blocks/edges are then compared with each other and the best variant is found. A functionality template/path has the total weight W = W_1 + W_2 + … + W_n, where W_i is the weight of the i-th logical block/edge.

W_i = w_id1 + w_id2 + … + w_idm is the sum of the weights of all elements of the logical block/edge.

When two blocks are compared, their elements are divided into two groups:

elements present in both logical blocks (taking their number of occurrences into account). Their weight is W_+;

elements not present in both logical blocks. Their weight is W_-.

W_cpi is the comparison weight: W_cpi = W_+ - W_-. If W_cpi ≤ 0, then W_cpi is set to 0 and the logical blocks are considered completely dissimilar. If W_cpi > 0, then W_cpi / W_i · 100 is the percentage of similarity of the logical blocks, and W_cpi / W · 100 is the percentage of similarity within the functionality template/logical path.

The final value of interest is:

U_max = (W_cp1 + W_cp2 + … + W_cpn) · 100 / W.
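The weighted comparison can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, the per-hash weights are passed in as a plain dictionary (in the description they come from iGain), W_i is taken to be the weight of the first block of each pair, and "elements not present in both blocks" is read as elements that occur in only one of the two blocks, counted with multiplicity.

```python
from collections import Counter


def block_weight(block, weights):
    """W_i: sum of the weights of all elements of a block."""
    return sum(weights.get(h, 0.0) for h in block)


def comparison_weight(block_a, block_b, weights):
    """W_cp = W_plus - W_minus, clamped at 0 for completely dissimilar blocks."""
    count_a, count_b = Counter(block_a), Counter(block_b)
    common = count_a & count_b                                   # in both, with multiplicity
    unmatched = (count_a - count_b) + (count_b - count_a)        # in only one block
    w_plus = sum(weights.get(h, 0.0) * n for h, n in common.items())
    w_minus = sum(weights.get(h, 0.0) * n for h, n in unmatched.items())
    return max(w_plus - w_minus, 0.0)


def block_similarity_percent(block_a, block_b, weights):
    """W_cp / W_i * 100, with W_i assumed to be the weight of block_a."""
    w_i = block_weight(block_a, weights)
    return comparison_weight(block_a, block_b, weights) / w_i * 100 if w_i else 0.0


def template_similarity_percent(matched_pairs, template_a, weights):
    """U_max = (W_cp1 + ... + W_cpn) * 100 / W over the matched block pairs."""
    total = sum(block_weight(block, weights) for block in template_a)
    w_cp_sum = sum(comparison_weight(a, b, weights) for a, b in matched_pairs)
    return w_cp_sum * 100 / total if total else 0.0


example_weights = {"A": 3.0, "B": 2.0, "C": 1.0, "D": 1.0}  # invented weights
percent = block_similarity_percent(["A", "B", "C"], ["A", "B", "D"], example_weights)
```

In the example, the shared elements A and B give W_+ = 5, the unshared elements C and D give W_- = 2, so W_cp = 3 against W_i = 6, that is, 50 percent similarity.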

For example, the Trojan Downloader family has functions that it uses frequently (CreateProcess, UrlDownload, etc.). The family also has a set of functions that are never used or are used only incidentally. A set of similar Trojan Downloader programs of various versions is taken as X. For these programs the logical blocks, identifiers and libraries used are extracted. The Y set consists of clean programs with similar primary characteristics, or of programs that do not belong to this family. Then, for each hash, the values x and y are calculated (how many times it occurred among the malicious functionality templates and among the clean functionality templates, respectively), and the iGain weight is calculated for each hash. Because the set of family samples was accurate to begin with, iGain will be high for the functions that define the family. For each logical block a weight is calculated as the sum of the weights of the hashes inside the block. Then, for the hashes being checked, the sum of their occurrences in the logical blocks of programs from the sets X and Y is computed. If the weight is positive, this value is expressed as a percentage of the total weight.

Algorithm for extracting characteristic information:

The detection algorithm is based on extracting from the functionality template only the behavior-defining information of a family, in the form of a logical block or a logical path. That is, a family of similar functionality templates is selected. It is then analyzed which information is defining (characteristic): which logical blocks (logical paths) occur consistently and are typical of the family, and which ones vary or are uninformative (for example, blocks that perform auxiliary, non-malicious actions that also occur in clean programs). The analysis can be performed automatically (using frequency-of-occurrence statistics), by an analyst, or jointly, in which case the analyst is provided with the statistical information and generates a record from it.

The extraction of information is shown schematically in Figs. 12A and 12B, where A 1210 is all the information from a functionality template and B 1220 is the characteristic information that defines the family or a set of malicious actions.

For two functionality templates from the same family, it is possible to ignore the uninteresting information (that which does not fall into the union) and to search only for the characteristic information 1250 in Fig. 12. The collections of known files 150, shown in Fig. 1, can be compiled on the basis of this characteristic information 1320 in Fig. 13.
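The automatic, frequency-based variant of this extraction can be sketched as follows. This is only an illustration under assumptions that are not in the description: each template is modeled as a set of normalized logical blocks (tuples of hashes), a block counts as characteristic if it occurs in at least a given share of the family's templates and never in the clean templates, and the name `characteristic_blocks` and the threshold `min_share` are hypothetical.

```python
from collections import Counter


def characteristic_blocks(family_templates, clean_templates, min_share=0.8):
    """Return the blocks that are typical of a family and absent from clean programs.

    family_templates, clean_templates: iterables of sets of hashable blocks.
    min_share: minimum fraction of family templates that must contain a block.
    """
    family_templates = list(family_templates)
    family_counts = Counter()
    for template in family_templates:
        family_counts.update(set(template))  # count each block once per template

    seen_in_clean = set()
    for template in clean_templates:
        seen_in_clean.update(template)

    needed = min_share * len(family_templates)
    return {
        block
        for block, count in family_counts.items()
        if count >= needed and block not in seen_in_clean
    }


# Invented example: three family members and one clean program.
family = [
    {("A", "B"), ("C",)},
    {("A", "B"), ("D",)},
    {("A", "B"), ("C",), ("E",)},
]
clean = [{("C",)}]
signature = characteristic_blocks(family, clean, min_share=0.6)
```

Here the block ("A", "B") occurs in every family member and in no clean program, so it is kept, while ("C",) is frequent in the family but also appears in a clean program and is discarded as uninformative.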

In conclusion, it should be noted that the information given in this description consists only of examples, which do not limit the scope of the present invention as defined by the claims. A person skilled in the art will understand that other embodiments of the present invention may exist that are consistent with its spirit and scope.

Claims

1. A method for determining whether files belong to collections of known files, based on comparing files by means of functionality templates, comprising the steps of:
(a) forming functionality templates based on information about an executable file;
(b) removing the extracted noise information from the functionality templates of the executable file;
(c) bringing the blocks of the functionality templates of the executable file to a normalized form for subsequent comparison;
(d) comparing the normalized blocks of the functionality templates of the executable file with the blocks of the functionality templates of known files and, based on the results of the comparison, deciding whether a normalized block belongs to one of the functionality templates of the known files.

2. The method according to claim 1, in which the information about the executable file can be collected by information collection tools such as a module emulating program execution; a software virtualization module; an application control module; and a module that disassembles the program and facilitates subsequent alignment between certain communication functions.

3. The method according to claim 1, in which the functionality templates of the executable file are formed in two stages: dynamic and static.

4. The method according to claim 3, in which the dynamic stage involves collecting information about the executable file using the information collection tools, and the result of the dynamic stage when forming the functionality templates for the executable file is a memory dump and an event log.

5. The method according to claim 3, in which the static stage involves the direct tracing of the memory dump and the formation of the functionality templates of the executable file.

6. The method according to claim 1, in which the functionality templates of the executable file are of two types:
a functionality template that is based on a sequence of blocks and
a functionality template that is based on a sequence of edges.

7. The method according to claim 6, in which a block is a set of elements, such as API functions, links to lines of code and links to other logical blocks.

8. The method according to claim 6, in which an edge is a section between calls from one block to another.

9. The method according to claim 1, in which the selected noise information includes basic compiler libraries and statically linked functions of standard libraries.

10. The method according to claim 1, in which the functionality templates of the executable file are compared with the templates from the collections of known files using different mathematical algorithms for each type of functionality template, such as the Hamming distance; the Levenshtein distance; the algorithm for calculating weights; and the algorithm for extracting characteristic information.

11. A system for determining whether files belong to collections of known files, based on comparing files by means of functionality templates, comprising:
an emulation tool that collects information about an executable file;
means for creating functionality templates of the executable file, which, based on the information collected by the emulation tool, forms blocks and composes the functionality templates of the executable file from them;
means for processing the functionality templates of executable files, which removes the extracted noise information from the functionality templates and brings the blocks of the functionality templates to a normalized form;
means for storing collections of functionality templates of known files;
means for comparing the functionality templates of executable files, which compares the normalized blocks of the functionality templates received from the means for processing functionality templates with the functionality templates of known files from said means for storing collections, in accordance with the classification of the functionality templates of executable files.

12. The system according to claim 11, in which the emulation tool includes information collection tools such as a module emulating program execution; a software virtualization module; an application control module; and a module that disassembles the program and facilitates subsequent alignment between certain communication functions.

13. A machine-readable medium for determining whether files belong to collections of known files, based on comparing files by means of functionality templates, on which a computer program is stored, the execution of which on a computer carries out the following steps:
(a) forming functionality templates based on information about an executable file;
(b) removing the extracted noise information from the functionality templates of the executable file;
(c) bringing the blocks of the functionality templates of the executable file to a normalized form for subsequent comparison;
(d) comparing the normalized blocks of the functionality templates of the executable file with the blocks of the functionality templates of known files and, based on the results of the comparison, deciding whether a normalized block belongs to one of the functionality templates of the known files.