RU2778979C1

RU2778979C1 - Method and system for clustering executable files

Info

Publication number: RU2778979C1
Application number: RU2021108261A
Authority: RU
Inventors: Илья Сергеевич Померанцев
Original assignee: Общество с ограниченной ответственностью "Группа АйБи ТДС"
Filing date: 2021-03-29
Publication date: 2022-08-29

Abstract

FIELD: computing technology.

SUBSTANCE: a method for clustering executable files, implemented on a computing device and comprising the steps of: obtaining a set of executable files; determining the format of each executable file; and, separately for each file format: finding repeating sequences of a set length in the files; determining the most frequent sequences; attributing files containing at least one most frequent sequence to one family; excluding all files attributed to this family from further processing; repeating the search for the most frequent sequences, the attribution of files containing at least one most frequent sequence to the next family and the exclusion of said files from further processing until all files are attributed to some family or until only files that contain no repeating sequences remain; and, in response to the remaining files not containing repeating sequences, attributing each of said files to a separate family.

EFFECT: ensured automatic clustering of executable files.

17 cl, 7 dwg

Description

FIELD OF TECHNOLOGY

This technical solution relates to the field of computing, namely to methods and systems for clustering executable files, which may be needed when determining whether software belongs to a known software family or to a particular group of software authors.

BACKGROUND OF THE INVENTION

This technical solution relates to the field of computing, namely to methods and systems for clustering executable files, which may be needed when attributing software. Attribution here can mean either that the software belongs to a known software family or that it belongs to a particular group of software authors.

It is known that professional cybercriminals develop their attack strategy carefully and rarely change it; moreover, they usually use the same malware for a long time, modifying it only slightly.

On the other hand, malware developers who create tools for cybercriminals may reuse the same software solution for a long time, for example a function that implements an encryption algorithm, in various malware samples, including samples created for different cybercriminal groups and belonging to different malware families.

Therefore, in the field of cybersecurity, knowing that a given malware sample belongs to a certain malware family, as well as knowing that it was created by a certain author or group of authors, can be very important.

Signature analysis is a well-known malware detection method. It is based on searching files for a unique byte sequence, i.e. a signature, that is characteristic of a particular piece of malware. For each new malware sample, anti-virus laboratory specialists analyze its code and derive a signature for the new malware from that analysis. The resulting signature is placed in the virus signature database used by the anti-virus program, which subsequently helps the anti-virus detect this malware.
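As a minimal sketch of the signature analysis described above, the following Python fragment checks whether the bytes of a file contain any entry of a signature database. The function name, the family names and the byte patterns are hypothetical and serve only as an illustration; real anti-virus engines use considerably more elaborate matching.

```python
def match_signatures(data, signatures):
    """Return the names of all signatures whose byte sequence occurs in `data`.

    `data` is the raw content of a file (bytes); `signatures` maps a
    hypothetical malware name to the byte sequence treated as its signature.
    """
    return sorted(name for name, pattern in signatures.items() if pattern in data)


# Hypothetical signature database and file content, for illustration only.
SIGNATURE_DB = {
    "FamilyA": b"\xde\xad\xbe\xef",
    "FamilyB": b"\xca\xfe\xba\xbe",
}
file_bytes = b"\x4d\x5a\x90\x00" + b"\xde\xad\xbe\xef" + b"\x00"
detected = match_signatures(file_bytes, SIGNATURE_DB)
```

Changing even one byte inside the matched region makes the lookup fail, which is why the modification techniques discussed next defeat this kind of detection.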

This method is well known to cybercriminals as well. Therefore, almost all modern malicious programs are constantly modified in one way or another, and the purpose of such modification is not to change the main functionality of the malware. As a result of the modification, the files of the next version of the malware should acquire new properties from the point of view of anti-virus signature analyzers. They will then no longer be "recognized" as malicious using existing signature databases, which makes the cybercriminals' actions stealthier.

In addition to modification, a technique called obfuscation is also widely used. It consists of transforming the source text or executable code of a program into a form that preserves its functionality but makes its algorithms difficult to analyze and understand. These changes to malware can be made either by a human or automatically, for example by a so-called polymorphic generator that is part of the malicious program.

At the same time, it is important that the main functions of malware, as a rule, do not undergo significant changes. After modification, the malicious program will "look" different to signature analyzers, and its code may be obfuscated and unsuitable for human analysis, but the set of functions that the program performed before modification will most likely remain the same after it.

Patent RU2728497A1 is known from the prior art (Slipenchuk P.V., Pomerantsev I.S., METHOD AND SYSTEM FOR DETERMINING SOFTWARE OWNERSHIP BY ITS MACHINE CODE, publ. 29.07.2020, class G06F 8/74). It describes a method for determining whether software belongs to a particular family of programs by its machine code, which comprises: obtaining a file containing the machine code of the software; determining the format of the obtained file; extracting and saving the code of the functions present in the obtained file; removing library functions from the saved code; identifying the commands in each function; identifying an "action, argument" pair in each command; converting each "action, argument" pair to a number; saving, separately for each identified function, the resulting ordered sequences of numbers; accumulating a predetermined number of machine code analysis results and detecting repeating sequences of numbers (patterns) in them; calculating, for each detected pattern, a parameter characterizing its frequency; training a classifier, based on the calculated set of parameters, to attribute software by its sequence of "action, argument" pairs; and applying the trained classifier to subsequently determine whether software belongs to a particular family of programs.

As can easily be seen, the described method requires a substantial corpus of files containing machine code, and this corpus must include representatives of all the software families that the classifier is to be trained to recognize. Accordingly, the classifier will not be able to identify software samples belonging to families that were not represented in the corpus by the time it came into use. To teach it to do so, several representatives of the new family must be added to the corpus and the described procedures repeated from scratch. At the same time, manually assembling such a corpus, containing a range of software sufficient for training the classifier, is an extremely laborious task.

In essence, it is precisely the problem of quickly and automatically assembling such corpora of files, clustered by membership in particular families, that currently holds back the active development of solutions like the one described in RU2728497A1.

At the same time, the solution US20150178306A1 is known from the prior art (YI YANG et al., Method and apparatus for clustering portable executable files, publ. 06.25.2015, class G06F 7/30), which describes a method for clustering Portable Executable (PE) files. The method involves extracting PE file characteristics from a PE file; generating a PE file identifier for the PE file based on those characteristics; and clustering a database of PE files by PE file identifiers.

First of all, it should be noted that this method is inherently oriented toward the analysis of executable files in the PE format, which is used mainly in the Microsoft Windows operating system. Accordingly, executable files of other formats, which do not have the characteristics of a PE file, cannot be analyzed by the described method. Malware, however, is highly diverse; it exists for many operating systems, such as *nix systems, Android, Apple operating systems, etc.

Also known from the prior art is the solution EP 2743854 B1 (TAO YU, Clustering processing method and device for virus files, publ. 25.03.2015, class G06F 21/56), which discloses a method for clustering portable executable (PE) files, including virus files, comprising: statically analyzing the binary data of the virus files to be classified in order to obtain the portable executable structure data of the virus files; comparing the portable executable structure data of the virus files to be classified and classifying the virus files whose PE structure data satisfy a predetermined similarity condition into the same category; and performing secondary clustering of the virus files within each of the categories obtained in the previous step.

It is easy to see that this solution has the same drawback as the previous one: it is not designed to work with executable files that are not in the PE format.

Also known from the prior art is the solution EP 2946331 B1 (SOURABH SATISH et al., Classifying samples using clustering, publ. 17.08.2016, class G06F 21/56). It discloses a method for classifying a file, comprising: creating a set of samples associated with the file, the set containing labeled and unlabeled samples; collecting feature values from the labeled and unlabeled samples; selecting a subset of the features; clustering together labeled and unlabeled samples that have at least a threshold measure of similarity among the collected values of the selected subset of features, so as to create a set of clusters, each cluster having a subset of the samples from the set of samples; recursively repeating the selecting and clustering steps on the subset of samples in each cluster of the set of clusters until at least one stopping condition is reached, wherein there are multiple iterations of the selecting and clustering steps, the first clustering iteration uses a first threshold measure of similarity and subsequent clustering iterations use progressively stricter similarity thresholds, the iterations create a cluster having at least one labeled sample and at least one unlabeled sample, and different subsets of features are selected for different iterations by determining the amount of variation in the feature values of the samples in a cluster of the set of clusters created by a clustering iteration and selecting the subset of features for the subsequent clustering iteration in response to the determined amount of variation; and, after at least one stopping condition is reached, propagating the label from the at least one labeled sample in the cluster to the at least one unlabeled sample in the cluster in order to classify the file associated with the at least one unlabeled sample.

With regard to this solution, it should first be noted that it requires labeled files, i.e. files that have been marked as malicious in advance; accordingly, it is not suitable for working with a sample set of arbitrary composition. Furthermore, the solution relies on examining data extracted from the internal structure of a file and is therefore unsuitable for clustering strongly encrypted or obfuscated executable files.

The present solution was created to solve the problems identified in the analysis of the prior art and to provide an improved, comprehensive method for clustering executable files.

SUMMARY OF THE INVENTION

In this description, unless expressly stated otherwise at the point of use, the special terms listed below have the following meanings:

Library functions are functions that implement the most common, typical actions. They are widely used by a wide variety of programs, so their presence in the code or in the running processes of a program is not specific to any one program or to any particular developer.

Sequence length is the number of commands that make up a given sequence. It is important that this value is not related to the way the commands are written (machine code and program source code use different notations) and does not depend on how many bytes are needed to write the commands that make up the sequence. In other words, the expression "a sequence of length 40" means that the sequence consists of 40 commands. It says nothing about how many bits or bytes are needed to write it.

File clustering is the division of an unordered set of files into clusters (subsets), where each cluster or subset corresponds to a specific software family. The task of identifying the family, that is, determining the functional purpose of the programs it contains, is not addressed here. The main task of clustering is to assign each file in the initial unordered set to one of the families, where the number of families needed to cluster the set is not known before clustering begins.

Obfuscation, or code obfuscation, is the transformation of program code into a form that preserves the functionality of the program but makes static analysis of the code and understanding of its algorithms difficult.

A software family is a set of programs united by a single functional purpose (for example, encryptors (ransomware), loaders, etc.) and/or a common team of developers, as well as a basic execution algorithm. The programs that make up one family differ from each other by various modifications, as a result of which their commonly known characteristics, such as checksum, file size, file name, etc., differ. However, all programs in one family are intended for the same purpose, and all of them perform the same basic algorithmic steps to achieve it. In this description it is assumed that programs with identical functional purpose but different formats belong to different families; for example, "PE-format encryptors" and "Mach-O-format encryptors" are not one family but two.

A sequence is a "chain" of commands that follow one another in the code of the file under study. Each command may itself have a fairly complex structure, consisting, say, of an action and one or more arguments. Commands can have different lengths, usually expressed as the number of bytes needed to write them. As already mentioned, the length of a sequence does not depend on the length of its constituent commands and is expressed not in bits or bytes but as a plain count of commands. A sequence of length 50, for example, is a sequence of 50 consecutive commands.

A sample is a non-functional part of some file, used only for research, for example a part of a file containing program code. A sample is obtained from an original program file that is ready to be launched and executed by removing from it various code fragments that are of no interest for the study. For example, a sample can be obtained from an original file by removing library functions and stable constructs from it.

Stable constructs are code fragments that can be found in a wide variety of programs. Such constructs occur not only in software of a particular purpose or by a particular author, but almost everywhere. Function prologues are one example of stable constructs.

A function is a fragment of code that can be called from another place in the program. In most cases an identifier (name) is associated with a function. The function name is inseparably linked to the address of the first instruction (statement) of the function, to which control is transferred when the function is called. After the function has executed, control returns to the return address, that is, to the point in the program where the function was called.
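The terms "sequence" and "sequence length" defined above can be illustrated with a short Python sketch. It assumes that each command has already been extracted as an (action, argument) pair; the function name and the sample commands are hypothetical. The point is that the length counts commands, regardless of how many bytes each command occupies.

```python
def sequences_of_length(commands, length):
    """Return every run of `length` consecutive commands as a tuple.

    `commands` is the ordered list of commands of one file; each command may
    be any hashable object, here an (action, argument) pair.
    """
    return [
        tuple(commands[i:i + length])
        for i in range(len(commands) - length + 1)
    ]


# Hypothetical commands of different byte sizes; only their count matters.
commands = [
    ("push", "ebp"),
    ("mov", "ebp, esp"),
    ("sub", "esp, 0x40"),
    ("call", "0x401000"),
]
runs = sequences_of_length(commands, 3)  # two overlapping sequences of length 3
```

A file that has fewer commands than the requested length simply yields no sequences.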

The technical problem addressed by the claimed technical solution is the creation of a computer-implemented method and system for clustering executable files, which are characterized in the independent claims. Additional embodiments of the present solution are presented in the dependent claims.

The technical result consists in providing automatic clustering of executable files.

In a preferred embodiment, a computer-implemented method for clustering executable files is claimed, comprising steps in which, by means of a computing device:

a plurality of executable files is obtained,

the format of each executable file is determined,

separately for each file format:

repeating sequences of a given length are found in the files,

the most frequent sequences are determined,

files that contain at least one most frequent sequence are assigned to one family,

all files assigned to this family are excluded from further processing,

the search for the most frequent sequences, the assignment of files containing at least one most frequent sequence to the next family, and the exclusion of these files from further processing are repeated until all files have been assigned to some family or until only files containing no repeating sequences remain,

in response to files containing no repeating sequences remaining, each of these files is assigned to a separate family.
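The iterative assignment described in the steps above can be sketched as follows, for files of a single format. This is a minimal illustration, not the claimed implementation: the function name `cluster_by_sequences`, the input layout (a dictionary from file name to the set of sequences found in that file), and the reading of "most frequent sequences" as all sequences tied for the highest file count are assumptions.

```python
from collections import Counter

def cluster_by_sequences(files):
    """files: dict mapping file name -> set of sequences found in that file.
    Returns a list of families, each a sorted list of file names."""
    remaining = dict(files)
    families = []
    while remaining:
        # Count in how many of the remaining files each sequence occurs.
        counts = Counter(seq for seqs in remaining.values() for seq in seqs)
        repeated = {s: c for s, c in counts.items() if c >= 2}
        if not repeated:
            # No repeating sequences are left: one family per file.
            families.extend([name] for name in sorted(remaining))
            break
        top = max(repeated.values())
        # Assumption: "most frequent" means tied for the highest file count.
        most_frequent = {s for s, c in repeated.items() if c == top}
        family = sorted(n for n, seqs in remaining.items() if seqs & most_frequent)
        families.append(family)
        for name in family:
            del remaining[name]  # exclude from further processing
    return families
```

Each pass removes at least two files from consideration, so the loop always terminates.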

In a particular embodiment of the method, the sequence length is selected depending on the number of files in which the given sequence was found.

In another particular embodiment of the method, the sequence length is a fixed, predetermined value.

In yet another particular embodiment of the method, only sequences whose entropy exceeds a predetermined threshold are considered.
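The text does not say how entropy is computed. One plausible reading, shown here purely as an assumption, is the Shannon entropy of the distribution of commands within a sequence, so that a sequence made of one command repeated many times is filtered out. The names `shannon_entropy` and `filter_by_entropy` are hypothetical.

```python
import math
from collections import Counter

def shannon_entropy(sequence):
    """Shannon entropy, in bits, of the commands making up one sequence."""
    counts = Counter(sequence)
    total = len(sequence)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def filter_by_entropy(sequences, threshold):
    """Keep only the sequences whose entropy exceeds the threshold."""
    return [s for s in sequences if shannon_entropy(s) > threshold]
```

For instance, a sequence of four distinct commands has an entropy of 2 bits, while a sequence consisting of a single repeated command has an entropy of 0.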

In a particular embodiment of the method, the most frequent sequences are found by a hash-table lookup.

In another particular embodiment of the method, the decision on whether each file belongs to the given family is made by computing, for that file, the value of a weighting function equal to the sum of the coefficients of at least two of the most frequent sequences found in that file.

In yet another particular embodiment of the method, each coefficient of the weighting function is calculated as the ratio of the number of files in which the given sequence occurs at least once to the initial number of files.
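The weighting function and its coefficients can be illustrated with a short sketch. The function names and the input layout (file name mapped to a set of sequences) are assumptions, and the text does not state the value against which the weight is compared, so no decision threshold is shown.

```python
def coefficients(files, top_sequences):
    """Coefficient of a sequence: the share of the initial files containing it."""
    total = len(files)
    return {
        s: sum(1 for seqs in files.values() if s in seqs) / total
        for s in top_sequences
    }

def file_weight(file_sequences, coeffs):
    """Weight of a file: the sum of coefficients of the top sequences it contains."""
    return sum(c for s, c in coeffs.items() if s in file_sequences)
```

With four files of which two contain `s1` and two contain `s2`, both coefficients are 0.5, and a file containing both sequences gets a weight of 1.0.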

In another particular embodiment of the method, in response to files containing no repeating sequences remaining, these files are not assigned to any family.

In another preferred embodiment, a computer-implemented method for clustering executable files is claimed, comprising steps in which, by means of a computing device:

each executable file is run in a virtual environment corresponding to its format,

a sample is created from each executable file, in which library functions and stable constructs are excluded from the code, and the resulting sample is saved,

after samples corresponding to all the obtained files have been produced:

repeating sequences of a given length are found in the samples,

files corresponding to samples that contain at least one most frequent sequence are assigned to one family,

the files and samples corresponding to this family are excluded from further processing,

the search for the most frequent sequences, the assignment of files corresponding to samples containing at least one most frequent sequence to the next family, and the exclusion of these files and samples from further processing are repeated until all files have been assigned to some family or until only samples containing no repeating sequences remain,

in response to samples containing no repeating sequences remaining, each of the files corresponding to these samples is assigned to a separate family.

In a particular embodiment of the method, the sequence length is selected depending on the number of samples in which the given sequence was found.

In another possible particular embodiment of the method, only sequences whose entropy exceeds a predetermined threshold are considered.

In yet another possible particular embodiment of the method, the most frequent sequences are found by a hash-table lookup.

A particular embodiment of the method is also possible in which the decision on whether each file belongs to the given family is made by computing, for that file, the value of a weighting function equal to the sum of the coefficients of at least two of the most frequent sequences found in the sample corresponding to that file.

In another particular embodiment of the method, each coefficient of the weighting function is calculated as the ratio of the number of samples in which the given sequence occurs at least once to the initial number of samples.

In yet another particular embodiment of the method, in response to samples containing no repeating sequences remaining, the files corresponding to these samples are not assigned to any family.

The claimed solution is also implemented by a system for clustering executable files, comprising:

- long-term memory configured to store the files and data used;

- a computing device configured to perform the described method.

DESCRIPTION OF THE DRAWINGS

The implementation of the invention is described below with reference to the accompanying drawings, which are provided to explain the essence of the invention and in no way limit its scope. The following drawings are attached to the application:

Figs. 1A and 1B illustrate a preferred embodiment of the computer-implemented method for clustering executable files.

Figs. 2A-2C illustrate another preferred embodiment of the computer-implemented method for clustering executable files.

Fig. 3 illustrates the steps of creating an auxiliary program used in one of the preferred embodiments of the method.

Fig. 4 illustrates an example of the general layout of a computing device.

DETAILED DESCRIPTION OF THE INVENTION

The detailed description below sets out numerous implementation details intended to provide a clear understanding of the invention. However, it will be apparent to a person skilled in the art how the invention may be used both with and without these details. In other instances, well-known methods, procedures, and components are not described in detail so as not to obscure the features of the invention.

Furthermore, it will be clear from the foregoing that the invention is not limited to the implementation described. Numerous possible modifications, changes, variations, and substitutions that preserve the essence and form of the invention will be apparent to those skilled in the art.

The present invention is directed to providing a computer-implemented method and system for clustering executable files.

As shown in Fig. 1A, the claimed computer-implemented method (100) for clustering executable files may be implemented as follows:

The method begins at step (110), at which a plurality of executable files to be analyzed is obtained. These may be already compiled files, such as EXE or DLL files containing a program or a dynamic library; electronic document files with an embedded scripting language (for example, VBA), such as DOC or XLS files; or files containing program source code in any interpreted programming language, such as JavaScript, Python, and so on. In all or some of these files, fragments of code or data may additionally be encrypted, and all or some of the files may also be obfuscated, including with algorithms that are resistant to reversal.

The executable files may be of any known format and are not limited to the PE format; for example, they may also be, without limitation, executable files in the Mach-O, ELF, COM, COFF, and other formats.

Next, at step (120), the format of each executable file is determined and a code analysis tool corresponding to that format is selected. The file format is determined by any known method, for example, by a script that compares the signature of the file being checked against the known signatures of various file formats in turn and, when signatures match, reports the actual format of the file (which in general may differ from the format that the operating system in use associates with the file's extension). For files containing source code in an interpreted programming language, the programming language in which the code is written is determined at the same step, and a code analysis tool corresponding to that language is selected. The programming language may likewise be determined by any known method, for example, using the Linguist program (https://github.com/github/linguist) or the Ohcount program (https://github.com/blackducksoftware/ohcount).
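A signature-comparison script of the kind mentioned above might look like the following sketch. The magic numbers are the widely documented leading bytes of the respective formats; the table `SIGNATURES`, the function name `detect_format`, and the returned labels are hypothetical, and the table is deliberately incomplete.

```python
# Leading bytes ("magic numbers") of a few formats named in the description.
SIGNATURES = [
    (b"\x7fELF", "ELF"),
    (b"MZ", "MZ (DOS/PE)"),                             # PE files start with a DOS header
    (b"\xcf\xfa\xed\xfe", "Mach-O"),                    # 64-bit, little-endian
    (b"\xce\xfa\xed\xfe", "Mach-O"),                    # 32-bit, little-endian
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "OLE2 (DOC/XLS)"),
]

def detect_format(data):
    """Compare the start of the file contents with each known signature in turn."""
    for magic, name in SIGNATURES:
        if data.startswith(magic):
            return name
    return "unknown"
```

Because the check looks only at file contents, a DLL renamed to `.txt` is still reported as `MZ (DOS/PE)`, which matches the remark that the actual format may differ from the one implied by the extension.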

The method then proceeds to step (130), at which repeating sequences of a given length are found in the files. This step is described in detail below with reference to Fig. 1B.

The step begins with selection stage (131), at which the subsequent course of action is chosen depending on whether or not the given file contains source code.

If the file contains source code, the method proceeds to stage (133).

If the file does not contain source code (that is, it contains machine or operation code), the method proceeds to stage (132), at which the code contained in the file is disassembled. Disassembly is performed by a dedicated disassembler program, which converts machine or operation code into a set of instructions in assembly language or IL. Any known disassembler may be used, for example, IDA Pro, Sourcer, or another. As a result, the machine code is marked up: the boundaries of functions are explicitly identified in it.

The method then proceeds to stage (133), at which functions are extracted and saved as a list. This is done by the code analysis tool selected for the given file format at step (120). The tool is a parser program whose algorithm, in the case of source code, is based on the syntax of the programming language in which the code is written. In the case of machine code, a parser is used whose algorithm is based on the specification of the computing architecture in use.

All code located within the function boundaries identified by the disassembler is saved, for example, as a separate text file. All remaining code (outside these boundaries) is treated in the described method as program data and is not used.

Source code is analyzed in a similar way. The parser locates boundaries in the source code such that the code fragments within them are functions. This can rely, for example, on the well-known fact that in most high-level programming languages the body of a function is enclosed in a block statement. Accordingly, one possible parser algorithm is to find in the code the pairs of words or symbols that denote a block statement in the syntax of the given language, to verify additionally that the block statement found contains a function body, and to move the code making up that body into the file where the processing results are saved. Verifying that a block statement contains a function body may be done, for example, by searching the line preceding the opening of the block statement for a character expression that corresponds to a function name in the syntax of the given programming language.

Alternatively, if the source code is written in a C-like programming language, such as JavaScript, each function begins with a function header of a known format. For example, in the following JavaScript example, the header of each of the two functions can be identified by the presence of the keyword function. In such code the header is followed by the function body, likewise enclosed in a block statement, here denoted by a pair of curly braces, { and }:

Source code remaining outside block statements (that is, outside function boundaries) is handled depending on whether the syntax of the given programming language permits code to execute outside functions. For example, for source code written in C or C#, such code is ignored. In the example under consideration, for code written in JavaScript, such code is treated as belonging to one more function, separate from all those already identified. The code of this function is likewise added to the list of functions and subsequently processed in the same way as all other code in the list. Thus, at stage (133), the code of all functions present in the analyzed file is obtained and saved as a separate file.
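For JavaScript-like source, the extraction of functions at stage (133) could be approximated as below. This is a rough sketch under stated assumptions rather than a real parser: it recognizes only the `function` keyword, matches braces by counting, and ignores braces inside strings and comments. The name `extract_functions` is hypothetical. Code left outside all functions is returned as one extra entry, following the JavaScript handling described above.

```python
import re

def extract_functions(source):
    """Return top-level function texts, plus leftover code as one extra entry."""
    functions, outside, pos = [], [], 0
    for match in re.finditer(r"\bfunction\b[^{]*\{", source):
        if match.start() < pos:
            continue  # nested function, already inside a captured body
        outside.append(source[pos:match.start()])
        depth, i = 1, match.end()
        while i < len(source) and depth:
            # Track nesting of block statements to find the closing brace.
            depth += {"{": 1, "}": -1}.get(source[i], 0)
            i += 1
        functions.append(source[match.start():i])
        pos = i
    outside.append(source[pos:])
    leftover = "".join(outside).strip()
    if leftover:
        functions.append(leftover)  # treated as one more, separate function
    return functions
```

For a language such as C, where code outside functions is ignored, the final `leftover` entry would simply be dropped.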

The method then proceeds to stage (134), at which the file containing the found functions is analyzed and commands are identified within the functions. The results of the disassembly performed earlier are used to identify commands. For example, when the following fragment of machine code is disassembled

IDA Pro identifies the following command:

which has the following representation in machine code

This analysis may be performed by any known method that yields this result, for example, by a script written specifically for the purpose. The algorithm of such a script must parse the analyzed fragment of machine code in accordance with the specification of the architecture in use; in this example, the x86 architecture.

The fragments of machine code identified as individual commands are saved, for example, one per line in a text file. In one possible implementation of the described method, arguments such as 10000h are considered part of the command and are saved. In another possible implementation, arguments are ignored and only the commands themselves are saved, such as the command mov dword_423F3C.

What exactly counts as a separate command when source code is analyzed depends on the syntax of the programming language in which the code is written. Thus, in one possible embodiment, the beginning of a command may be taken to be the beginning of a line, and its end the nearest subsequent character that denotes the end of a command in the given language, such as the semicolon (;) in C.

In an alternative embodiment of the method, a command may be taken to be the sequence of characters on a single line. Under this approach, for example, six commands, one per line, are identified in the disableSecurity function from the JavaScript source code fragment (1) given above:

As in the analysis of machine code, the fragments corresponding to individual commands are saved, for example, one command per line in a text file. The method then proceeds to stage (135), at which sequences of a given length are formed from the identified commands with a given step.

In this description, a sequence is a "chain" of commands that follow one another in the code of the file under analysis. Depending on whether source code or machine code is analyzed, commands may have a fairly complex structure and differing lengths, and may contain various arguments, modifiers, and so on. None of this is taken into account when sequences are formed.

Thus, in one possible embodiment of the described method, the commands found may first be converted into numbers for convenience. This may be done, for example, by applying a hash function to the full text of each command, or by any other well-known way of converting a string into a number. Identical commands are thereby converted into identical numbers. In addition, a conversion algorithm may be chosen under which commands whose text differs only slightly are converted into numbers close in absolute value, for example, the SSDeep fuzzy hashing algorithm.
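Converting command text to numbers with an ordinary hash function can be sketched as follows. The choice of SHA-256, the truncation to 64 bits, and the whitespace normalization are assumptions made for the example, and `command_to_number` is a hypothetical name. Unlike the fuzzy hashing mentioned above, a cryptographic hash gives unrelated numbers for commands that differ even slightly.

```python
import hashlib

def command_to_number(command):
    """Map the full text of a command to a 64-bit integer."""
    # Assumption: runs of whitespace are not significant, so collapse them.
    normalized = " ".join(command.split())
    digest = hashlib.sha256(normalized.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")
```

Identical commands always map to the same number, which is the only property the sequence comparison relies on in this variant.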

Для простоты примера, поясняющего формирование секвенций, условно обозначим отдельные команды буквами алфавита, притом будем считать, что каждая команда обозначена одной буквой, а одинаковые команды обозначены одинаковыми буквами. Конечно, в действительности количество команд, которые могут быть извлечены из одного исполняемого файла, значительно превышает количество букв любого алфавита. Итак, в данном примере в результате выполнения вышеописанного этапа (134) для какого-то исполняемого файла получили следующий набор команд:For the sake of simplicity of the example explaining the formation of sequences, we will conditionally denote individual commands by letters of the alphabet, moreover, we will assume that each command is denoted by one letter, and the same commands are denoted by the same letters. Of course, in reality, the number of commands that can be extracted from a single executable file greatly exceeds the number of letters in any alphabet. So, in this example, as a result of the execution of the above stage (134), for some executable file, we received the following set of commands:

Допустим, в соответствии с предварительно заданными условиями обработки, из этого набора команд должны быть с шагом 1 сформированы секвенции длиной 5.Suppose, in accordance with the predefined processing conditions, sequences of length 5 should be formed from this set of commands with a step of 1.

It is important to emphasize that the length of a sequence does not depend on the length or other parameters of its constituent commands; sequence length is expressed not in bits or bytes but as a count of commands. A sequence of length 5, for example, is a sequence of five consecutive commands.

The parameter called the step determines by how many commands the beginning of each subsequent sequence is offset from the beginning of the previous one. In example (6), with a step of 1 the first sequence starts with command A and the second with command B. With a step of 2 the second sequence starts with command C, with a step of 3 it starts with command D, and so on.
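
The roles of length and step can be sketched as a sliding window. The command list below is hypothetical and stands in for the set of commands of the example; the function name `make_sequences` is likewise an assumption.

```python
def make_sequences(commands, length=5, step=1):
    """Return every window of `length` consecutive commands, advancing by `step`."""
    return [
        tuple(commands[start:start + length])
        for start in range(0, len(commands) - length + 1, step)
    ]

# Hypothetical command set, one letter per command.
commands = list("ABCDEFGH")
seqs = make_sequences(commands, length=5, step=1)
```

With a step of 2 or 3 the same function skips one or two starting positions between neighboring sequences, so the second sequence starts with C or D respectively.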

Under the given conditions (length 5, step 1), the following sequences will be formed from the set of commands (6) and stored, for example, in a database:

The conditions of sequence formation, length and step, may be fixed values set in advance, when the system implementing the method is developed. In an alternative embodiment of the described method, the sequence length and the step may be selected automatically, depending on whether predetermined conditions are met. For example, suppose the sequence length was initially set to 100 commands, and analysis of the obtained sample of files showed that it contains no repeating sequence, that is, no sequence present in at least two different files. The sequence length is then reduced to 90 commands, the sequences are formed again, and it is checked whether at least one sequence repeated in two or more files is found.

Alternatively, if the sequence length was initially set to 100 commands and analysis of the obtained sample of executable files showed that one of the sequences is present in all files, the sequence length may be automatically increased to 110 commands, after which the sequences are formed again and it is checked that no sequence is present in all files of the sample.

In yet another possible embodiment, the step may be automatically increased or decreased when the described requirements on sequence frequency are not met. A variant in which the step and the sequence length are increased simultaneously is also possible, as is a variant in which the step is increased and the sequence length is reduced.
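
The automatic choice of sequence length can be sketched as two separate loops, one for each alternative described above. The function names, the fixed decrement `delta`, and the representation of each file as a list of commands are assumptions made for illustration.

```python
def make_sequences(commands, length, step=1):
    """Set of distinct windows of `length` commands found in one file."""
    return {
        tuple(commands[i:i + length])
        for i in range(0, len(commands) - length + 1, step)
    }

def count_files_per_sequence(files, length, step=1):
    """Map each sequence to the number of distinct files that contain it."""
    counts = {}
    for commands in files.values():
        for seq in make_sequences(commands, length, step):
            counts[seq] = counts.get(seq, 0) + 1
    return counts

def shrink_until_repeat(files, length=100, delta=10):
    """Shorten sequences until at least one occurs in two or more files."""
    while length > 0:
        counts = count_files_per_sequence(files, length)
        if max(counts.values(), default=0) >= 2:
            return length
        length -= delta
    return None  # nothing repeats at any tried length

def grow_until_selective(files, length=100, delta=10):
    """Lengthen sequences until no sequence occurs in every file."""
    if not files:
        return length
    while True:
        counts = count_files_per_sequence(files, length)
        if max(counts.values(), default=0) < len(files):
            return length
        length += delta
```

The two loops are kept apart on purpose: combining them in one loop could alternate between a length at which every file shares a sequence and a length at which none does.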

In one possible embodiment of the described method, the entropy of each formed sequence is additionally calculated at stage (135), and the sequence is stored in the database only if its entropy exceeds a predetermined threshold. The purpose of this check is to exclude from processing those sequences that contain too many repeated commands. The entropy may be calculated in any well-known way, for example, as the average entropy of a message, by the Shannon formula:
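
In its standard form, assumed here because it matches the symbols defined next, this entropy is:

$$H = -\sum_{i=1}^{n} p_i \log_2 p_i$$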

where n is the number of commands in the sequence (its length), and p_i is the probability that a particular command is present at the i-th position within the sequence.

The example of sequences (7) makes it easy to see that the entropy of sequences without repeated commands, for example the sequence A B C D E, is higher than the entropy of sequences in which repeated commands are present, as in the sequence D E F D E.

In another possible embodiment of the described method, only those sequences are stored at stage (135) whose entropy is greater than a first predetermined threshold but less than a second predetermined threshold. In this way, sequences are selected and stored that contain some repeated commands, but no more than a given number of them.
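
The entropy filter can be sketched as follows. The sketch assumes that the probability of a command is estimated from its relative count inside the sequence, which is one common reading of the Shannon formula and not necessarily the exact estimate intended in the description; the threshold values in the usage lines are hypothetical.

```python
import math
from collections import Counter

def shannon_entropy(sequence):
    """Entropy in bits, with probabilities estimated from counts in the sequence."""
    total = len(sequence)
    counts = Counter(sequence)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def keep_sequence(sequence, low, high=None):
    """Keep a sequence whose entropy exceeds `low` and, if given, stays below `high`."""
    h = shannon_entropy(sequence)
    return h > low and (high is None or h < high)

no_repeats = shannon_entropy("ABCDE")    # five distinct commands
with_repeats = shannon_entropy("DEFDE")  # D and E each occur twice
```

Under this estimate a sequence of five distinct commands reaches the maximum of log2(5) bits, and every repetition lowers the value.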

It should be noted that at stage (135) a reference to the file from which each sequence was obtained is also stored in the database, for example, the file name or the full path to the file, including its name and extension.

Stage (135) results in a database in which, separately for each file of the initial set obtained at step (110), all sequences found in that file and having the specified entropy value are stored. An embodiment of the described method is also possible in which the entropy of sequences is not calculated, and all sequences found in each file of the initial set are stored in the database, separately for each file.

This database may be implemented in any suitable way; for example, it may be a hash table or any other known data structure that implements the associative array interface and allows "sequence - value" pairs to be stored, where the "value" field holds a numerical value.
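
One way to picture such a database is a Python dictionary keyed by sequence. The record layout below, a numeric value plus the set of source files, is an assumption made for illustration; the names `SequenceRecord` and `store` are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class SequenceRecord:
    value: float = 0.0                       # occurrence count, later the frequency
    files: set = field(default_factory=set)  # files the sequence was obtained from

database = {}  # sequence (tuple of commands) -> SequenceRecord

def store(database, sequence, file_name):
    """Record one occurrence of `sequence` found in `file_name`."""
    record = database.setdefault(tuple(sequence), SequenceRecord())
    record.value += 1
    record.files.add(file_name)

# Hypothetical file names used only for illustration.
store(database, "ABCDE", "a.exe")
store(database, "ABCDE", "b.exe")
```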

This completes step (130), described with reference to FIG. 1B, and the method proceeds, with reference to FIG. 1A, to step (140).

At step (140), the most frequent sequences are determined, that is, the sequences that occur most often in the set of executable files obtained at step (110). This may be done in any well-known way. For example, for each sequence in the overall set of sequences, the number of times it occurs in the set of executable files under study is first counted, and the count is stored in the "value" field corresponding to that sequence. Then, once all sequences have been counted, the frequency itself is calculated, that is, the ratio of the number of occurrences L of the given sequence to the total number of sequences K. This parameter may be less than one, if the sequence does not occur in every executable file, or greater than one, if the sequence occurs on average several times within each file:
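
With the symbols just defined, and writing $\nu$ for the frequency (the symbol is an assumption), the ratio reads:

$$\nu = \frac{L}{K}$$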

The frequency value calculated for each sequence is stored in the database, in the "value" field corresponding to that sequence. This completes step (140), and the method proceeds to step (150).
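
The counting and the ratio of step (140) can be sketched as below. The sketch takes K literally as the total number of stored sequence occurrences; the input layout, a mapping from file name to the list of its sequences, and the function name are assumptions.

```python
from collections import Counter

def sequence_frequencies(sequences_by_file):
    """Return {sequence: L / K}: L is the number of occurrences of the sequence,
    K the total number of stored sequence occurrences."""
    occurrences = Counter(
        seq for seqs in sequences_by_file.values() for seq in seqs
    )
    total = sum(occurrences.values())
    return {seq: n / total for seq, n in occurrences.items()}

# Hypothetical sequences, written as strings of one-letter commands.
data = {"a": ["ABCDE", "BCDEF"], "b": ["ABCDE", "XYZWV"]}
freq = sequence_frequencies(data)
most_frequent = max(freq, key=freq.get)
```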

At step (150), all files that contain at least one most frequent sequence are assigned to one family of programs. This may be done in any suitable way, for example, by hash table lookup, by machine learning methods, and so on.

At this step, one or more sequences with the numerically highest frequency value are selected. Since the database also stores, for each sequence, a reference to the file from which it was obtained, it is easy to determine in which files of the set obtained at step (110) this sequence or these sequences occur. These files are considered to constitute one family of programs.

In an alternative embodiment, the decision on whether each file belongs to the given family is made by calculating, for that file, the value of a weighting function of the form F(A, B, C, D, E, k1, k2, k3, k4, k5), where A…E are the five most frequent sequences and k1…k5 are coefficients representing the frequency of each of these sequences. If the value of F is above a threshold chosen in advance, the file is considered to belong to the family. Embodiments of the described method that take into account a different number of most frequent sequences, for example, without limitation, two, three, and so on, are also possible.

The weighting function F itself may be calculated, for example, as

where bin(A) is a binary value equal to 1 if sequence A is found in the given file and 0 if it is not, bin(B) is the analogous value for sequence B, and so on.

In one embodiment of the method, the coefficients k1…k5 may be calculated as the frequency of the given sequence (9). In another possible embodiment, they may be calculated as the ratio of the number of files in which the given sequence is present at least once to the total number of files. Any other ways of calculating the function F that take into account the contribution of each coefficient and the presence of specific sequences in the given file are also possible.
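
A weighting function of this kind can be sketched as a weighted sum of presence indicators. The weighted-sum form is an assumption, chosen only because it combines the coefficients and the bin(...) values in the simplest way; the function names, sequence labels, weights, and threshold below are all hypothetical.

```python
def family_score(file_sequences, top_sequences, weights):
    """Assumed form of F: sum of k_i * bin(S_i) over the top sequences."""
    return sum(
        k * (1 if seq in file_sequences else 0)
        for seq, k in zip(top_sequences, weights)
    )

def belongs_to_family(file_sequences, top_sequences, weights, threshold):
    """A file joins the family when its score is above the threshold."""
    return family_score(file_sequences, top_sequences, weights) > threshold

top = ["A", "B", "C", "D", "E"]      # hypothetical labels of the five sequences
k = [0.5, 0.2, 0.1, 0.1, 0.1]        # hypothetical coefficients k1..k5
score = family_score({"A", "C"}, top, k)
```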

Assigning files to one family may in practice mean, for example, creating a separate folder and copying the corresponding files into it. In an alternative embodiment, in a text file that stores the general list of names of the files obtained at step (110), a mark such as "Family 1" is placed next to the names of the files assigned to one family. Any alternative implementations of this step are also possible, provided they indicate that specific files from the initial set of executable files are assigned to one family.

This completes step (150), and the method proceeds to step (160), at which all files assigned to the given family are excluded from further processing. At step (150), the analysis is carried out on the database that stores the sequences created at step (140) and the corresponding numerical parameters representing frequency. Excluding the files assigned to one family from further processing may therefore technically mean, for example, removing from the database those sequences for which the database stores references to these files.

In an alternative embodiment, the numerical values representing frequency may be set to zero in the database for those sequences that belong to the files to be excluded. As a result, owing to the algorithm used at step (150), these sequences and the corresponding files are excluded from further analysis.

This completes step (160), and the method proceeds to step (170). At this step, the search for the most frequent sequences, the assignment of files containing at least one most frequent sequence to the next family, and the exclusion of these files from further processing are repeated until all files have been assigned to some family, or until only files containing no repeating sequences remain.

In other words, step (170) is the combination of steps (150) and (160) described above, performed one after the other in a loop until all files of the set obtained at step (110) have been assigned to some family.

If, at the end of step (170), no files without repeating sequences remain in the initial set of executable files, the described method (100) ends.

It is possible that, at the end of step (170), some files remain in the initial set of executable files and contain no repeating sequence. In this case the method proceeds to step (180).

At step (180), each of the remaining files, which contain no repeating sequence, is assigned to a separate family. In other words, at step (180) as many new families of programs are created as there are remaining files without repeating sequences, and one of the remaining files is assigned to each new family. This completes the described method (100).
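
Steps (140) to (180) taken together can be sketched as a greedy loop. The sketch makes two simplifying assumptions: frequency is measured as the number of remaining files that contain a sequence, and only the single most frequent sequence defines each family. The function name and the input layout are also assumptions.

```python
from collections import Counter

def cluster(sequences_by_file):
    """Greedy clustering: repeatedly peel off the files sharing the sequence
    that is present in the largest number of remaining files."""
    remaining = {name: set(seqs) for name, seqs in sequences_by_file.items()}
    families = []
    while remaining:
        counts = Counter(seq for seqs in remaining.values() for seq in seqs)
        if not counts:
            break
        best, hits = counts.most_common(1)[0]
        if hits < 2:  # no sequence repeats across the remaining files
            break
        family = sorted(n for n, seqs in remaining.items() if best in seqs)
        families.append(family)
        for name in family:  # step (160): exclude from further processing
            del remaining[name]
    # Step (180): every leftover file becomes a family of its own.
    families.extend([name] for name in sorted(remaining))
    return families

# Hypothetical input: file name -> sequences found in that file.
data = {"a": ["S1", "S2"], "b": ["S1", "S3"], "c": ["S4", "S5"], "d": ["S4"], "e": ["S6"]}
result = cluster(data)
```

When two sequences are tied for the highest count, the order in which the families are produced may vary between runs, although the families themselves do not.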

As an alternative to method (100) described above with reference to FIG. 1A-B, the claimed computer-implemented method for clustering executable files may also be implemented as described below with reference to FIG. 2A-B.

As shown in FIG. 2A, the claimed computer-implemented method for clustering executable files (200) may be implemented as follows:

The method begins at step (202), at which a set of executable files to be analyzed is obtained. Exactly as in step (110) described above, these may be already compiled files such as EXE or DLL files containing a program or a dynamic library; electronic document files with an embedded scripting language (for example, VBA), such as DOC or XLS files; and files containing the source code of programs in any interpreted programming language, such as JavaScript, Python, and so on. Fragments of code or data in all or some of these files may additionally be encrypted, and all or some of these files may also be obfuscated, including with algorithms that resist recovery. The executable files may be of any known format, not only the PE format; for example, they may also be, without limitation, executable files of the Mach-O, ELF, COM, COFF, and other formats.

The method (200) then proceeds to step (204), at which the format of each executable file is determined and a code analysis tool corresponding to the format is selected. This is done in exactly the same way as in step (120) described above.

After step (204) is completed, the method (200) proceeds to step (206), at which each executable file is run in an isolated environment corresponding to its format. The isolated environment may be any environment suitable for running the given file, and it may be selected automatically on the basis of the file format information obtained at the previous step (204).

The method then proceeds to step (208), at which a sample is created from the executable file. A sample is a non-functional part of an executable file that is used only for research. Step (208) is described in detail below with reference to FIG. 2B.

As shown in FIG. 2B, step (208) begins at stage (210), at which dumps are taken and machine code is extracted from them. Any executable file run in an isolated environment spawns a number of processes in the working memory of that environment. At this stage, dumps are obtained by accessing the working memory of the environment, that is, "snapshots" of the working memory of the processes spawned by the analyzed file. If necessary, such dumps may be taken very frequently, literally on every clock cycle of the main processor of the computer system, which makes it possible to study in detail what happens in memory. Dumps may be taken in any known way, for example, with the ProcDump utility.

Since each memory dump is the text of the machine code located in memory at the moment the dump was taken, stage (210) yields, in the manner described above, the machine code of the program contained in the file under study. Because the data source is the memory of running processes, "clean" machine code, neither encrypted nor obfuscated, is obtained even if the file under study was strongly encrypted, obfuscated, and so on. Whatever modification the file under study underwent earlier, at the moment when the processes it spawned are in memory, the machine code of these processes characterizes exactly those actions that the file was created to perform.

The machine code extracted from the dumps is saved, for example, as a text file. This completes stage (210), and the method proceeds to stage (220), at which the code of all functions present in the obtained machine code is extracted and saved. This is done by a disassembler program together with a parser program whose algorithm is based on the specification of the computing architecture corresponding to the given file, for example, the x86 architecture.

At stage (220), functions are extracted from the machine code and saved as a list. This is done in exactly the same way as in stage (133) described above. At the end of stage (220), the code of all functions present in the analyzed file is obtained and saved as a separate file, after which the method proceeds to stage (230).

At stage (230), those functions that are library functions are found and removed from the obtained function code. To do this, the file obtained at stage (220), which contains the functions found, is analyzed. Functions that are library functions are excluded from the list by means of signature analysis. Library functions are a standard toolkit. They are widely used by a great variety of programs, so their presence in machine code or in running processes is not specific to any one particular program. Excluding library functions considerably simplifies the analysis and at the same time gives a better result, because only those commands that best characterize the particular program remain in the code.

Signature analysis and removal of library functions are performed by an auxiliary script. The algorithm of this script may consist, for example, in comparing each function in turn with a set of signatures (regular expressions) prepared in advance. Each signature corresponds to a particular library function described beforehand in signature form; when a function matching any signature is found, all the code that makes up the header and body of that function is removed from the file.
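
Such a script can be sketched with regular expressions. The two signatures below are hypothetical and match only a function label; real signatures would describe known library functions in far more detail. Each function is assumed to be available as one string holding its header and body.

```python
import re

# Hypothetical signatures; each would describe one known library function.
LIBRARY_SIGNATURES = [
    re.compile(r"^memcpy\b"),
    re.compile(r"^_?strlen\b"),
]

def drop_library_functions(functions):
    """Keep only the functions whose text matches no library signature."""
    return [
        text for text in functions
        if not any(sig.search(text) for sig in LIBRARY_SIGNATURES)
    ]

# Hypothetical disassembled functions: header line followed by the body.
functions = [
    "memcpy:\n  rep movsb\n  ret",
    "sub_401000:\n  xor eax, eax\n  ret",
    "_strlen:\n  xor eax, eax\n  ret",
]
remaining = drop_library_functions(functions)
```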

When the script has finished, the file containing the remaining functions is saved, yielding the code text of those functions that are present in the analyzed file and are not library functions. This completes step (230), and the method proceeds to step (240), at which code sections corresponding to stable constructs are removed from the remaining code; these are code fragments that occur equally often in any program, for example function prologues. This is done by means of a program prepared in advance.

This program (300) may be prepared as shown in FIG. 3, as follows.

At step (310), a sufficient number of programs belonging to different families is selected, for example 1000 instances. This auxiliary collection may include both known malicious programs, such as data stealers, downloaders, and worms, and known legitimate ones, such as text editors, image viewers, archivers, and so on. The collection may also include electronic document files with an embedded scripting language (for example, VBA), such as DOC or XLS files, as well as files containing program source code in any interpreted programming language, such as JavaScript, Python, etc.

At the next step (320), the auxiliary collection is processed by an algorithm that detects repeating character sequences and counts the number of occurrences of each such sequence in the overall body of code. The minimum sequence length is set, for example, to 20 characters, and the maximum sequence length is not limited. In an alternative embodiment, both the minimum and the maximum sequence length are set, for example to 15 and 250 characters respectively.

The algorithm returns a list of the repeating character sequences found, in which each sequence is associated with the number of its occurrences in the auxiliary collection of files.
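A brute-force sketch of such a counting algorithm, for the embodiment in which both a minimum and a maximum length are set, is given below. The function name and the choice to enumerate every substring are assumptions made for illustration; the description does not specify how the algorithm is implemented, and a practical implementation over 1000 programs would need a more economical data structure.

```python
from collections import Counter

def count_repeated_sequences(texts, min_len=15, max_len=250):
    """Count occurrences of every character sequence whose length lies
    between min_len and max_len, over all texts of the collection,
    and keep only the sequences that occur more than once."""
    counts = Counter()
    for text in texts:
        upper = min(max_len, len(text))
        for n in range(min_len, upper + 1):
            for i in range(len(text) - n + 1):
                counts[text[i:i + n]] += 1
    # A sequence is "repeating" if it was seen at least twice in total.
    return {seq: c for seq, c in counts.items() if c > 1}
```

The returned dictionary corresponds to the list described above: each repeating sequence is mapped to its number of occurrences in the collection.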

Then, at step (330), the most frequent sequences are selected from the list of stable sequences returned by the algorithm. For example, sequences are selected that occur on average at least once per program, i.e., in this example, those that occurred 1000 times or more.

At the next step (340), a program or script is created that removes from any code exactly those sequences that were selected at the previous step.

The result is a program capable of processing arbitrary code fragments and removing from them those character sequences that are on the "removal list" as occurring frequently in different programs.
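Steps (330) and (340) might be sketched as two small functions: one selects the sequences whose count is at least the number of programs in the collection, the other removes them from arbitrary code. The function names and the longest-first removal order are assumptions of this sketch, not details given in the description.

```python
def build_removal_list(counts, n_programs):
    """Select sequences that occur, on average, at least once per program."""
    return [seq for seq, c in counts.items() if c >= n_programs]

def strip_stable_constructs(code, removal_list):
    """Remove every listed sequence from the code text.

    Longer sequences are removed first (an assumption of this sketch),
    so that a short sequence does not break up a longer one containing it.
    """
    for seq in sorted(removal_list, key=len, reverse=True):
        code = code.replace(seq, "")
    return code
```

With a collection of 1000 programs, `build_removal_list(counts, 1000)` reproduces the threshold used in the example above.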

Returning to FIG. 2B, after processing by the program (300) at step (240), a file is obtained in which stable constructs, that is, code fragments that occur fairly often in any arbitrarily chosen programs, have been removed from all functions identified in the analyzed file.

This completes step (240), and the method proceeds to step (245). At this step, the file obtained at the previous step is analyzed, and the commands inside each function are extracted. Step (245) is performed in exactly the same way as step (134) described above; as a result, a sample is obtained for each file of the original set, in which both library functions and stable constructs have been removed from the code.

After step (245) is completed, the method proceeds, with reference to FIG. 2A, to step (248), at which the resulting samples are saved. Note that at step (248) a reference to the file from which each sample was obtained is also stored in the database, for example by specifying the file name or the full path to the file, including its name and extension, or in any other known way. In other words, for each saved sample, information identifying the file from which it was obtained is also stored.

This completes step (248), and the method proceeds to step (250), at which the samples are analyzed and the corresponding files are assigned to particular families. Step (250) is described in detail below with reference to FIG. 2C.

As shown in FIG. 2C, step (250) begins at step (255), at which repeating sequences of a given length are found in the samples obtained at the previous step. This is done in the same way as step (135) described earlier with reference to FIG. 1B, the only difference being that the sequences are formed from commands following one another in the code of the sample under examination rather than in the code of the file under examination. In other words, step (255) performs the same actions as described for step (135), but applied to samples rather than files.

Exactly as described above, step (255) yields a database that stores, separately for each sample, all sequences found in that sample that have a given entropy value. An embodiment is also possible in which the entropy of the sequences is not calculated, and the database stores, separately for each sample, all sequences found in it. Clearly, from this information the inverse mapping is easily built by any known method, that is, one can find the samples in which a given sequence is present. This completes step (255), and the method proceeds to step (260).
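For the embodiment without entropy filtering, the per-sample storage and the inverse mapping might be sketched as follows. In-memory dictionaries stand in for the database, and the function names are hypothetical; each sample is assumed to be already reduced to a list of commands.

```python
from collections import defaultdict

def command_sequences(commands, length):
    """All sequences of `length` consecutive commands in one sample."""
    return {tuple(commands[i:i + length])
            for i in range(len(commands) - length + 1)}

def build_index(samples, length):
    """Return (per_sample, inverse).

    per_sample maps a sample name to the set of sequences found in it;
    inverse maps a sequence to the set of samples that contain it.
    """
    per_sample = {name: command_sequences(cmds, length)
                  for name, cmds in samples.items()}
    inverse = defaultdict(set)
    for name, seqs in per_sample.items():
        for seq in seqs:
            inverse[seq].add(name)
    return per_sample, dict(inverse)
```

The size of each set in `inverse` gives the number of samples in which a sequence occurs, which is one possible frequency measure for the following step.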

At step (260), the most frequent sequences are determined, that is, the sequences that occur most often in the resulting set of samples. This may be done in the same way as step (140) described above. The frequency value calculated for each sequence is stored in the database, in the "value" field corresponding to that sequence. This completes step (260), and the method proceeds to step (265).

At step (265), the files corresponding to those samples that contain at least one most frequent sequence are assigned to one family. This is done in the same way as step (150) described above, the only difference being that the at least one most frequent sequence is found in the samples, while it is the files corresponding to these samples that are assigned to one program family.

As described above, at this step one or more sequences with the numerically highest frequency value are selected. Since the database also stores, for each sequence, a reference to the sample or samples in which it occurs, and since a reference to the corresponding file was stored for each sample at step (248), it is straightforward to determine which files of the set of executable files obtained at step (110) correspond to this sequence or these sequences. These files are considered to constitute one program family.

This completes step (265), and the method proceeds to step (270), at which the files and samples corresponding to this family are excluded from further processing. Since the analysis is carried out on a database that stores the generated sequences and the corresponding numeric frequency parameters, excluding the samples assigned to one family from further processing may technically mean, for example, removing from the database those sequences for which the database stores references to these samples.

In an alternative embodiment, the numeric frequency values in the database may be set to zero for those sequences that belong to the samples to be excluded. As a result, owing to the nature of the algorithm used, these sequences and the corresponding samples are excluded from further analysis. Clearly, after this operation the most frequent sequences remaining in the database are those whose frequency values rank next after the values of the sequences just excluded.

This completes step (270), and the method proceeds to step (275). At this step, the search for the most frequent sequences, the assignment of the files corresponding to samples containing at least one most frequent sequence to the next family, and the exclusion of these files and samples from further processing are repeated until all files have been assigned to some family or until only samples that contain no repeating sequences remain.

In other words, step (275) is the combination of steps (260) to (270) described above, performed one after another in a loop until all files from the set obtained at step (110) have been assigned to some family.

If, at the end of step (275), no samples lacking repeating sequences remain in the analyzed set of samples, the described method (200) ends at this point.

A situation is possible in which, at the end of step (275), some samples remain in the analyzed set and none of them contains a single repeating sequence. In one possible embodiment of the method, the files corresponding to these samples are not assigned to any family, and the method (200) likewise ends. In another possible embodiment, the method in this case proceeds to step (280).

At step (280), each of the files corresponding to samples that contain no repeating sequence is assigned to a separate family. In other words, at step (280) as many new program families are created as there are remaining samples without repeating sequences, and one of the remaining files is assigned to each of these new families.
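The loop of steps (260) to (280) might be sketched as follows. The sketch makes several simplifying assumptions that are not fixed by the description: frequency is taken to be the number of remaining samples that contain a sequence, exactly one most frequent sequence is selected per iteration, frequencies are recomputed on each pass instead of being updated in a database, and sample names stand in for the corresponding files.

```python
from collections import defaultdict

def cluster(sample_sequences):
    """Greedy clustering: sample_sequences maps a sample name to the
    set of sequences found in it; returns a list of families."""
    remaining = {name: set(seqs) for name, seqs in sample_sequences.items()}
    families = []
    while remaining:
        # Step (260): for each sequence, find the remaining samples containing it.
        owners = defaultdict(set)
        for name, seqs in remaining.items():
            for seq in seqs:
                owners[seq].add(name)
        repeated = {s: o for s, o in owners.items() if len(o) > 1}
        if not repeated:
            break  # only samples without repeating sequences are left
        # Step (265): samples sharing the most frequent sequence form a family.
        best = max(repeated, key=lambda s: len(repeated[s]))
        family = sorted(repeated[best])
        families.append(family)
        # Step (270): exclude the family from further processing.
        for name in family:
            del remaining[name]
    # Step (280): each leftover sample becomes a family of its own.
    families.extend([name] for name in sorted(remaining))
    return families
```

In the embodiment where leftover files are not assigned to any family, the final `extend` call would simply be omitted.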

This completes the described method (200).

Additional embodiments of the present invention are given below.

In yet another particular embodiment of the method, the sequence length is selected depending on the number of samples in which the sequence was found.

- long-term memory configured to store said files, as well as a database;

FIG. 4 presents a general diagram of a computing device (400) that provides the data processing required to implement the claimed solution.

In general, the device (400) comprises components such as: one or more processors (401), at least one memory (402), a data storage means (403), input/output interfaces (404), I/O means (405), and networking means (406).

The processor (401) of the device performs the main computing operations required for the operation of the device (400) or of one or more of its components. The processor (401) executes the necessary machine-readable instructions held in the main memory (402).

The memory (402) is typically implemented as RAM and holds the program logic that provides the required functionality.

The data storage means (403) may be implemented as HDD or SSD drives, a RAID array, network storage, flash memory, optical storage media (CD, DVD, MD, Blu-ray discs), etc. The means (403) provides long-term storage of various kinds of information, for example the above-mentioned files, intermediate data, lists, databases, etc.

The interfaces (404) are standard means of connection and operation, for example USB, RS232, RJ45, LPT, COM, HDMI, PS/2, Lightning, FireWire, etc.

The choice of interfaces (404) depends on the particular implementation of the device (400), which may be a personal computer, mainframe, server, server cluster, thin client, smartphone, laptop, etc.

A keyboard may be used as the data I/O means (405). In addition to a keyboard, the data I/O means may also include: a joystick, display (touchscreen display), projector, touchpad, mouse, trackball, light pen, speakers, microphone, etc.

The networking means (406) are selected from devices that provide network reception and transmission of data, for example an Ethernet card, WLAN/Wi-Fi module, Bluetooth module, BLE module, NFC module, IrDA, RFID module, GSM modem, etc. The means (406) enable data exchange over a wired or wireless data channel, for example WAN, PAN, LAN, Intranet, Internet, WLAN, WMAN, or GSM.

The components of the device (400) are connected via a common data bus (410).

These application materials present two preferred embodiments of the claimed technical solution. This disclosure should not be construed as limiting other, particular embodiments that do not go beyond the scope of legal protection sought and are obvious to persons skilled in the relevant art.

Claims

1. A method for clustering executable files, performed on a computing device, the method comprising the steps of:

obtaining a set of executable files,

determining the format of each executable file,

separately for each file format:

finding repeating sequences of a given length in the files by converting each file into source code, analyzing the source code to determine the boundaries of the code fragments that are functions, and then extracting from within the functions the commands from which sequences of the given length are formed,

determining the most frequent sequences,

assigning files that contain at least one most frequent sequence to one family,

excluding all files assigned to this family from further processing,

repeating the search for the most frequent sequences, the assignment of files containing at least one most frequent sequence to the next family, and the exclusion of these files from further processing until all files have been assigned to some family or until only files that contain no repeating sequences remain,

in response to files remaining that contain no repeating sequences, assigning each of these files to a separate family.

2. The method according to claim 1, characterized in that the length of the sequence is selected depending on the number of files in which this sequence was found.

3. The method according to claim 1, characterized in that the length of the sequence is a fixed, predetermined value.

4. The method according to claim 1, characterized in that only sequences are considered whose entropy exceeds a predetermined threshold.

5. The method according to claim 1, characterized in that the most frequent sequences are found by searching the hash table.

6. The method according to claim 1, characterized in that the decision on whether each file belongs to this one family is made based on the results of calculating for this file the value of the weighting function, which is the sum of the coefficients of at least two of the most frequent sequences found in this file.

7. The method according to claim 6, characterized in that each coefficient of the weighting function is calculated as the ratio of the number of files in which the given sequence is present at least once to the initial number of files.

8. The method according to claim 1, characterized in that, in response to files remaining that contain no repeating sequences, these files are not assigned to any family.

9. A method for clustering executable files, performed on a computing device, the method comprising the steps of:

obtaining a set of executable files,

determining the format of each executable file,

running each executable file in a virtual environment corresponding to its format,

creating from each executable file a sample in which library functions and stable constructs are excluded from the code, and saving the resulting sample,

after samples corresponding to all obtained files have been obtained:

finding repeating sequences of a given length in the samples by analyzing the machine code of each sample to determine the boundaries of the code fragments that are functions, and then extracting from within the functions the commands from which sequences of the given length are formed,

determining the most frequent sequences,

assigning files corresponding to samples that contain at least one most frequent sequence to one family,

excluding the files and samples corresponding to this family from further processing,

repeating the search for the most frequent sequences, the assignment of files corresponding to samples containing at least one most frequent sequence to the next family, and the exclusion of these files and samples from further processing until all files have been assigned to some family or until only samples that contain no repeating sequences remain,

in response to samples remaining that contain no repeating sequences, assigning each of the files corresponding to these samples to a separate family.

10. The method according to claim 9, characterized in that the length of the sequence is selected depending on the number of samples in which this sequence was found.

11. The method according to claim 9, characterized in that the length of the sequence is a fixed, predetermined value.

12. The method according to claim 9, characterized in that only sequences are considered whose entropy exceeds a predetermined threshold.

13. The method according to claim 9, characterized in that the most frequent sequences are found by searching the hash table.

14. The method according to claim 9, characterized in that the decision on whether each file belongs to a given family is made based on the results of calculating for this file the value of the weighting function, which is the sum of the coefficients of at least two of the most frequent sequences found in the sample corresponding to this file.

15. The method according to claim 14, characterized in that each coefficient of the weighting function is calculated as the ratio of the number of samples in which the given sequence is present at least once to the initial number of samples.

16. The method according to claim 9, characterized in that, in response to samples remaining that contain no repeating sequences, the files corresponding to these samples are not assigned to any family.

17. A system for clustering executable files, comprising:

- long-term memory configured to store the files and data used;

- a computing device configured to perform the method according to any one of claims 1-16.