RU2663474C1

RU2663474C1 - Method of searching for similar files placed on data storage devices

Info

Publication number: RU2663474C1
Application number: RU2018103840A
Authority: RU
Inventors: Дмитрий Сергеевич Смирнов; Владимир Алексеевич Иванов; Михаил Юрьевич Конышев; Сергей Владимирович Радаев; Алексей Аркадьевич Двилянский; Иван Владимирович Иванов
Priority date: 2018-01-31
Filing date: 2018-01-31
Publication date: 2018-08-06

Abstract

FIELD: computer equipment. SUBSTANCE: the invention relates to computer technology for information retrieval. The technical result is achieved by comparing the loaded file with a subgroup of previously processed files rather than with all of them. For this purpose the loaded file is represented as a random Markov process; the probabilities of occurrence of bit sequences of a size less than or equal to a given connectivity are calculated by dividing the number of occurrences of each bit sequence by the file size in bits; and the comparison is performed only with those files for which the absolute value of the difference between their size and the size of the file being checked is less than the calculated limit of the maximum possible file size change. If the probabilistic distance between the loaded file and any previously processed file from the obtained subgroup is less than the maximum possible change in the probabilistic distance, these files are recognized as similar. EFFECT: increased efficiency of searching for similar files. 1 cl, 3 dwg

Description

The invention relates to methods for searching for information stored on local and remote data storage devices. In particular, the invention relates to methods for searching local and remote data storage devices for files that are structurally similar to a selected file.

A known method of searching for similar electronic documents located on data storage devices compares semantic networks (patent RU 2571539, cl. G06F 017/30). The known method includes the following sequence of actions. Two electronic documents are loaded from data storage devices; the search parameters are determined by setting the rules for forming a set of unique words; a set of weighted unique words and weighted links between them is formed; a semantic network is constructed; and documents similar in meaning are found by comparing the semantic networks. In addition, rules are set for forming stylistic images of the documents by determining the size of the transition frequency matrices and selecting their elements. Finally, the transition frequency matrices of the documents are compared for similarity by calculating a similarity coefficient.

The closest analog (prototype) to the claimed invention in technical essence and functions performed is a method of searching for similar files using a flexible convolution (patent RU No. 2580036, IPC G06 F 21/14). It includes the following sequence of actions. A set of features is extracted from the files. The set of extracted file features is divided into at least two subsets, one of which contains at least one mutable feature and the other at least one immutable feature. A convolution of each of these feature subsets is obtained. A convolution of the file is created as a combination of the convolutions of each of these feature subsets. The convolution of at least one file is compared with a set of previously created file convolutions. A file is recognized as similar to the files from a set of similar files having the same convolution if, on comparison, the convolution of the file coincides with the convolution of a file from that set.

A technical problem exists in this field: similar files are searched for by comparison with all previously processed files, which significantly reduces search speed.

The technical problem is solved by developing a method of searching for similar files located on data storage devices that increases the speed of searching for similar files of various formats by comparing the loaded file not with all previously processed files but with a subgroup of them. To do this, the loaded file is represented as a random Markov process, for which the value of the maximum possible change in the probabilistic distance is set, together with the connectivity of the Markov chain used, which indicates the maximum size of the bit sequence for which correlation properties are taken into account. Next, the probabilities of occurrence of bit sequences of a size less than or equal to the given connectivity are calculated by dividing the number of occurrences of each bit sequence by the file size in bits, and the maximum possible change in file size is determined from the obtained series of probabilities and the original file size. The comparison is made only with those files for which the absolute value of the difference between their size and the size of the file being checked is less than the calculated limit of the maximum possible file size change. If the probabilistic distance between the loaded file and any previously processed file from the obtained subgroup is less than the maximum possible change in the probabilistic distance, these files are recognized as similar.

The listed new set of essential features makes it possible to increase the speed of searching for similar files.

The analysis of the prior art established that there are no analogs characterized by a set of features identical to all the features of the claimed technical solution, which indicates that the claimed method meets the "novelty" condition of patentability.

A search for known solutions in this and related fields of technology, carried out to identify features matching those that distinguish the claimed object from the prototype, showed that they do not follow explicitly from the prior art. The prior art also does not disclose the distinctive essential features that produce the same technical result as the claimed method. Therefore, the claimed invention meets the "inventive step" condition of patentability.

The claimed method is illustrated by drawings, which show:

FIG. 1 - tree structure of the relationships between the probabilities of binary vectors of various lengths;

FIG. 2 - flowchart of an implementation of the method of searching for similar files located on data storage devices;

FIG. 3 - comparison of simulation results for the prototype method and the claimed method.

The implementation of the claimed method of searching for similar files located on data storage devices is illustrated in FIG. 2:

Block No. 1 - a file is loaded from a data storage device.

Block No. 2 - the connectivity of the Markov chain used is set; it indicates the maximum size of the bit sequence for which correlation properties are taken into account. In other words, the number of levels in the tree structure of probability relationships shown in FIG. 1 is set. The more levels are used, the more accurate the estimate of the statistical properties, but the higher the resource requirements. The parameter M_cr, the maximum possible change in the probabilistic distance, is also set.

Blocks No. 3, 4, 5, 6 - the probability values are determined for each level of the tree structure of probability relationships shown in FIG. 1 that is less than or equal to the given connectivity of the Markov chain used. For example, for a Markov chain connectivity of two, the probability at the first level, p(0), and the probabilities at the second level, p(00), p(01), p(10), p(11), are determined. Each probability is determined as follows: the number of occurrences of each bit-sequence combination in the file is divided by the length of the file in bits.
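The probability estimation of Blocks No. 3-6 can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the function names are hypothetical, bits are read most-significant first, occurrences are counted with overlapping windows, and p(1) is stored alongside p(0) for convenience; the text fixes none of these choices.

```python
from collections import Counter


def bit_string(data: bytes) -> str:
    """Render the file contents as a string of '0'/'1' characters (MSB first, assumed)."""
    return "".join(f"{byte:08b}" for byte in data)


def sequence_probabilities(data: bytes, connectivity: int) -> dict[str, float]:
    """Estimate p(s) for every bit sequence s of length 1..connectivity.

    As in the description, each count is divided by the file length in bits.
    Overlapping windows are an assumption of this sketch.
    """
    bits = bit_string(data)
    total_bits = len(bits)
    if total_bits == 0:
        raise ValueError("empty file")
    probabilities: dict[str, float] = {}
    for k in range(1, connectivity + 1):
        # Count every window of k consecutive bits.
        counts = Counter(bits[i:i + k] for i in range(total_bits - k + 1))
        # Enumerate all 2**k patterns so that absent ones get probability 0.
        for n in range(2 ** k):
            pattern = format(n, f"0{k}b")
            probabilities[pattern] = counts[pattern] / total_bits
    return probabilities


if __name__ == "__main__":
    # One byte 0x0F is the bit string 00001111.
    for pattern, p in sequence_probabilities(b"\x0f", 2).items():
        print(pattern, p)
```

For a connectivity of two, the result holds the first-level and second-level probabilities named in the example above.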

Block No. 7 - the maximum possible change in file size is determined using the obtained series of probabilities and the original file size L:

Block No. 8 - a subgroup of previously processed files whose size L' satisfies the following condition is selected:
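The size-based pre-selection of Block No. 8 can be sketched as below, using the condition stated in the claim: the absolute difference between a candidate's size L' and the checked file's size L must be less than the calculated limit. The limit is passed in as a plain parameter, because the formula of Block No. 7 is not reproduced in this text; the function and variable names are hypothetical.

```python
def select_candidates(file_size: int,
                      processed_sizes: dict[str, int],
                      max_size_change: float) -> list[str]:
    """Return the names of previously processed files whose size differs
    from file_size by less than max_size_change (strict inequality)."""
    return [
        name
        for name, size in processed_sizes.items()
        if abs(size - file_size) < max_size_change
    ]


if __name__ == "__main__":
    sizes = {"a.txt": 1000, "b.txt": 1100, "c.txt": 1500}
    # Only files within 100 units of the checked file's size remain.
    print(select_candidates(1040, sizes, 100))
```

Only the files returned here go on to the more expensive distance calculation, which is where the claimed speed-up comes from.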

Blocks No. 9, 10, 11, 12, 13, 14, 15 - the probabilistic distance between the loaded file and each file of the selected subgroup is calculated. The calculation uses a modified normalized Euclidean metric:

where K is the connectivity of the Markov chain used.

The probabilistic distance between the loaded file and each file of the selected subgroup is compared with the maximum possible change in the probabilistic distance M_cr; if it is less than M_cr, the compared files are deemed similar.
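The threshold decision can be sketched as follows. The exact modified normalized Euclidean metric is not reproduced in this text, so the sketch swaps in a plain stand-in: the root of the mean squared difference between corresponding sequence probabilities. This is an assumption made purely for illustration, as are the function names and the dictionary representation of the probabilities.

```python
import math


def probabilistic_distance(p: dict[str, float], q: dict[str, float]) -> float:
    """Stand-in distance (NOT the patent's formula): root of the mean
    squared difference over all bit patterns present in either profile."""
    patterns = sorted(set(p) | set(q))
    if not patterns:
        raise ValueError("no probabilities to compare")
    squared = sum((p.get(s, 0.0) - q.get(s, 0.0)) ** 2 for s in patterns)
    return math.sqrt(squared / len(patterns))


def is_similar(p: dict[str, float], q: dict[str, float], m_cr: float) -> bool:
    """Files are deemed similar when the distance is strictly below M_cr."""
    return probabilistic_distance(p, q) < m_cr


if __name__ == "__main__":
    loaded = {"0": 0.5, "1": 0.5}
    stored = {"0": 0.25, "1": 0.75}
    print(probabilistic_distance(loaded, stored))
    print(is_similar(loaded, stored, 0.3))
```

Whatever metric is actually used, the decision rule is the same: a candidate from the selected subgroup is reported as similar only if its distance to the loaded file is below M_cr.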

The industrial applicability of the invention follows from the fact that a device implementing the proposed method can be built from modern electronic components and achieve the purpose stated in the invention.

The validity of the theoretical premises was checked by computer simulation of the prototype method and the claimed method of searching for similar files located on data storage devices.

The measure of effectiveness of the method of searching for similar files located on data storage devices is the search speed.

To assess the performance of the developed method, experiments were conducted on searching for similar files of various types. For this purpose, files with the extensions txt, pcm and dat were examined and then searched. The test set comprised more than 1,000,000 files. The results presented in FIG. 3 confirm a significant increase in search speed when the developed method is used.

Claims

A method of searching for similar files located on data storage devices, which consists in loading a file from a data storage device, creating a convolution thereof and comparing the obtained convolution with the convolutions of previously processed files to determine the similarity of the files, characterized in that the comparison is made not with all previously processed files but with a subgroup of previously processed files, for which purpose the loaded file is represented as a random Markov process, for which the value of the maximum possible change in the probabilistic distance is set and the connectivity of the Markov chain used is set, which indicates the maximum size of the bit sequence for which correlation properties are taken into account, then the probabilities of occurrence of bit sequences of a size less than or equal to the given connectivity are calculated by dividing the number of occurrences of the bit sequences by the file size in bits, the maximum possible change in file size is determined using the obtained series of probabilities and the original file size, and the comparison is made only with those files for which the absolute value of the difference between their size and the size of the file being checked is less than the calculated limit of the maximum possible change in file size, and if the probabilistic distance between the loaded file and any previously processed file from the obtained subgroup is less than the maximum possible change in the probabilistic distance, these files are recognized as similar.