RU2601191C1 - Method of identifying arrays of binary data

Info

Publication number: RU2601191C1
Application number: RU2015127124/08A
Authority: RU
Inventors: Владимир Владимирович Рябоконь; Евгений Викторович Лебеденко
Priority date: 2015-07-06
Filing date: 2015-07-06
Publication date: 2016-10-27

Abstract

FIELD: data processing.

SUBSTANCE: invention relates to data processing. The method of identifying arrays of binary data involves automated comparison of arrays of binary data by obtaining sets of identification data, consisting of sequences of minimum values for each hash function, as the identification data of the analyzed documents, and by obtaining the same kind of sets of identification data for reference documents in order to detect matches between these sets.

EFFECT: technical result is higher accuracy of evaluation of similarity of binary data arrays.

1 cl, 2 dwg

Description

Field of the invention

The invention relates to the field of digital computing and data processing, namely to the automated comparison of arrays of binary data, and can be used in developing new systems, and improving existing ones, for the analysis and control of information objects.

State of the art

A method for determining the degree of similarity of documents is known from patent US 5909677 A of June 18, 1996. It represents a document written in natural language as a set of sequences obtained by splitting the text into words or groups of words, applies hash functions to these sequences, and compares the results with one another to obtain a similarity estimate. A significant disadvantage of this method is its low accuracy in estimating the similarity of binary data arrays.

A method for detecting duplicate files is also known from patent US 8015162 B2 of August 4, 2006. It compares the results of hashing sequences obtained by splitting the text into words or groups of words with the results of hashing documents available on the network, in order to find full or partial duplicates. The disadvantage of this method is its low accuracy in estimating the similarity of binary data arrays.

The closest analogue (prototype) to the claimed method, in technical essence and functions performed, is the "Method for automated analysis of text documents" (RF patent No. 2474870, G06F 17/00, November 18, 2011). In that method, all electronic files of reference documents are first converted into a predetermined format, with meaningful fragments, called clauses, extracted from each of them, and the converted electronic files of the reference documents are stored in a database; each electronic file of an analyzed document is converted into the predetermined format; matches are detected between the clauses extracted from the electronic file of the analyzed document and the clauses extracted from the electronic files of the reference documents; the relative number of clauses in the electronic file of the analyzed document that match the corresponding clauses of each of the electronic files of the reference documents is counted; and the relative numbers of matches found are compared with a predetermined threshold value to detect whether the electronic file of the analyzed document contains passages of text from any of the reference documents.

A characteristic feature of the prototype method is that it can process and compare documents in the form of arrays of binary data. It uses pairs consisting of the computed hash value of a sequence and the position of that sequence in the array, where the position is a pointer to the beginning of the sequence, counted from the end of the array. The disadvantage of the prototype method is its low accuracy in estimating the similarity of binary data arrays, which results from using only one hash function and from the random selection of its values in each document.

Disclosure of the invention

The objective of the invention is to develop a method of identifying arrays of binary data that improves the accuracy of estimating the similarity of binary data arrays. This objective is achieved in that the method of identifying arrays of binary data, which consists in converting all electronic files of reference documents into a predetermined format; storing the converted electronic files of reference documents in a database; converting each electronic file of an analyzed document into the predetermined format; detecting matches of identification data; and comparing the relative numbers of matches found with a predetermined threshold value, is, according to the invention, supplemented with the following operations: the sequence of binary data is divided into subblocks of a given length overlapping by one data symbol; a set of independent hash functions is formed using a list of non-repeating initial values of the hash function parameters; identification data are selected by computing the values of the independent hash functions over the subblocks of the given length and selecting the subblocks with the minimum hash value; sets of identification data H_L, consisting of the sequentially arranged minimum values for each hash function, are obtained as the identification data of the analyzed documents, and the same kind of sets of identification data H_Li are obtained for the reference documents, after which matches between these sets are detected.

Here, a hash function is understood to be a function that takes a variable-length string as input and converts it into a string of fixed, usually shorter, length (Bruce Schneier, "Applied Cryptography: Protocols, Algorithms, and Source Code in C", Moscow: Triumf, 2002, section 2.4). Unlike the use of a single hash function, the use of a set of independent hash functions ensures that subblocks of binary data are selected with equal probability for the analyzed and reference documents (electronic document "Several simple hash functions and their properties", http://habrahabr.ru/post/219139/). The subblocks of binary data are made to overlap so that the similarity estimate does not depend on the position and size of the subblocks in the sequence of binary data, which directly affects the accuracy of the similarity estimate (Andrew Tridgell, "Efficient Algorithms for Sorting and Synchronization", February 1999, section 3).
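The effect of overlapping subblocks can be illustrated with a minimal sketch. It assumes a step of one symbol between consecutive subblocks (as in the implementation section) and contrasts it with back-to-back, non-overlapping blocks; the function names and the sample data are hypothetical and chosen only for illustration.

```python
def overlapping_subblocks(data: bytes, w: int) -> set[bytes]:
    """All subblocks of length w, each starting one symbol after the previous."""
    return {data[i:i + w] for i in range(len(data) - w + 1)}


def disjoint_subblocks(data: bytes, w: int) -> set[bytes]:
    """Back-to-back subblocks of length w with no overlap (for contrast only)."""
    return {data[i:i + w] for i in range(0, len(data) - w + 1, w)}


original = b"abcdefghijklmnop"
shifted = b"X" + original  # the same content with one symbol inserted at the start

# Subblocks that both versions of the document have in common.
shared_overlapping = overlapping_subblocks(original, 4) & overlapping_subblocks(shifted, 4)
shared_disjoint = disjoint_subblocks(original, 4) & disjoint_subblocks(shifted, 4)
```

With overlapping subblocks, every subblock of the original still appears in the shifted copy, whereas the disjoint blocks no longer line up after the one-symbol shift.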

Taken together, the essential features listed above improve the accuracy of estimating the similarity of binary data arrays.

The analysis of the prior art established that there are no analogues characterized by a set of features identical to all the features of the claimed method of identifying arrays of binary data. Therefore, the claimed invention meets the patentability condition of "novelty".

A search for known solutions in this and related fields of technology, aimed at identifying features that coincide with the features distinguishing the claimed method from the prototype, showed that these features do not follow explicitly from the prior art. The prior art likewise does not disclose the effect that the transformations provided by the essential features of the claimed invention have on solving the problem addressed by the invention. Therefore, the claimed invention meets the patentability condition of "inventive step".

The "industrial applicability" of the method follows from the availability of a component base from which devices implementing this method can be built so as to achieve the purpose stated in the invention.

Brief description of the drawings

The claimed method is illustrated by the drawings, which show:

Fig. 1 is a functional diagram of the method of identifying arrays of binary data;

Fig. 2 is a graph of the accuracy of estimating the similarity of binary data arrays.

Implementation of the method

The claimed method is illustrated by the functional diagram (Fig. 1). First, the reference and analyzed documents are read from long-term memory cells into RAM cells in subblocks of a given size W; after each subblock is read, the read position for the next subblock is advanced by 1 from the beginning of the document. In this way, the sequence of binary data of the compared documents is divided in RAM into subblocks of a given length overlapping by one data symbol.
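The reading step can be sketched as follows. This is a minimal illustration that assumes the whole document is already available as a `bytes` object and that one data symbol is one byte; the function name `read_subblocks` is hypothetical.

```python
from typing import Iterator


def read_subblocks(data: bytes, w: int) -> Iterator[tuple[int, bytes]]:
    """Yield (position, subblock) pairs; the read position advances by 1 each time."""
    for pos in range(len(data) - w + 1):
        yield pos, data[pos:pos + w]


# A 5-symbol document read in subblocks of size W = 3 gives 5 - 3 + 1 = 3 subblocks.
blocks = list(read_subblocks(b"\x01\x02\x03\x04\x05", w=3))
```

A document of N symbols yields N - W + 1 subblocks, and a document shorter than W yields none.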

A set of L independent hash functions is formed using the polynomial h_i = (a_i·x + b) mod P, where i ranges from 1 to L; the coefficients a_i are different for each generated hash function and are read from long-term memory cells that store a list of non-repeating initial values of the hash function parameters; and P is the prime number closest to the maximum four-byte value of a memory cell, which is likewise read from long-term memory cells. This yields L distinct independent transformers of the input data held in RAM, each with a fixed output size.
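A minimal sketch of such a hash family is given below. It assumes that the polynomial is evaluated as (a_i·x + b) mod P, that x is an integer representation of a subblock, and that P = 4294967291 (2^32 - 5, the prime nearest to the maximum four-byte value 2^32 - 1). The function name `make_hash_family` and the sample coefficients are hypothetical.

```python
from typing import Callable

P = 4_294_967_291  # 2**32 - 5, assumed to be the prime P referred to in the text


def make_hash_family(a_values: list[int], b: int, p: int = P) -> list[Callable[[int], int]]:
    """Build one function h_i(x) = (a_i * x + b) mod p per coefficient a_i."""
    if len(set(a_values)) != len(a_values):
        raise ValueError("the coefficients a_i must not repeat")

    def make(a: int) -> Callable[[int], int]:
        # Bind a separately for each function so every h_i keeps its own a_i.
        return lambda x: (a * x + b) % p

    return [make(a) for a in a_values]


hashes = make_hash_family([3, 5, 7], b=11)  # L = 3, illustrative coefficients
values = [h(10) for h in hashes]
```

Every output lies in the range 0 to P - 1, so it fits in four bytes regardless of the size of x.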

Identification data are selected as follows:

- the binary data in the RAM cells corresponding to the subblocks of the given size W are transformed L times, separately for each subblock, using the set of independent hash functions that has been formed;

- at each transformation of the data, the resulting hash values are stored in long-term memory cells;

- the resulting hash values are measured and compared with one another to select the minimum value of each hash function h_i;

- the resulting set of identification data H_L, consisting of the sequentially arranged minimum values for each hash function from the generated set, is stored in long-term memory cells. The long-term memory cells that store the identification data of the reference documents are organized as a database.
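The selection steps above can be sketched as a single function that returns the set H_L. This is an illustrative sketch only: it assumes that each subblock is converted to an integer with `int.from_bytes` in big-endian order, that the hash functions have the form (a_i·x + b) mod P with P = 2^32 - 5, and it keeps the values in memory rather than in long-term storage. The name `signature` and the sample parameters are hypothetical.

```python
P = 4_294_967_291  # 2**32 - 5, assumed value of the prime P


def signature(data: bytes, w: int, a_values: list[int], b: int = 0, p: int = P) -> list[int]:
    """Return H_L: for each hash function h_i, its minimum value over all subblocks."""
    # Integer form of every overlapping subblock of length w (assumed encoding).
    xs = [int.from_bytes(data[i:i + w], "big") for i in range(len(data) - w + 1)]
    if not xs:
        raise ValueError("the document is shorter than the subblock size")
    # One minimum per hash function, kept in the order of the coefficients a_i.
    return [min((a * x + b) % p for x in xs) for a in a_values]


h_l = signature(b"\x01\x02\x03", w=2, a_values=[3, 5], b=1)
```

The length of the result equals L, the number of hash functions, independently of the document length.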

To detect matches of identification data, the set of identification data H_L obtained for the analyzed document is compared in turn with the sets of identification data of the reference documents, for which purpose:

- the computed value H_L for the analyzed document is read from long-term memory cells;

- each set of identification data H_Li of the reference documents is read in turn from the long-term memory cells of the database, where i ranges from 1 to K and K is the number of reference documents;

- for each set of identification data read, the equality of H_L and H_Li is checked in turn.

The relative number of identification data items in the electronic file of the analyzed document that match the corresponding identification data of each of the electronic files of the reference documents is counted as follows:

- for each set of identification data read, d_i is measured in turn, where d_i is the number of identical hash values between the sets of identification data H_L and H_Li;

- the normalized similarity score of the binary data arrays, R_i = (L - d_i)/L, is computed in turn.
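The scoring step can be sketched as below. One assumption needs to be stated plainly: the text defines d_i as the number of identical hash values, but the formula R_i = (L - d_i)/L only grows with similarity if d_i counts the positions at which the two sets differ. The sketch adopts that second reading so that identical sets score 1 and disjoint sets score 0; this is an interpretation, not something the text confirms. The function name `similarity` is hypothetical.

```python
def similarity(h_query: list[int], h_ref: list[int]) -> float:
    """Normalized score R_i = (L - d_i) / L for two sets of identification data."""
    if not h_query or len(h_query) != len(h_ref):
        raise ValueError("both sets must be non-empty and have the same length L")
    length = len(h_query)
    # ASSUMPTION: d_i counts positions where the minimum hash values differ.
    d = sum(1 for q, r in zip(h_query, h_ref) if q != r)
    return (length - d) / length


r_same = similarity([1, 2, 3, 4], [1, 2, 3, 4])
r_half = similarity([1, 2, 3, 4], [1, 2, 9, 9])
r_none = similarity([1, 2], [3, 4])
```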

To compare the relative number of matches found with the threshold value, the index i of the most similar set of identification data is computed, namely the one for which the similarity score is R = max(R_i). The identification result is the index i of the most similar reference document together with the similarity score R of the identification data set. If the similarity score is low, as determined by the specified boundary conditions, the identification is considered unsuccessful because the analyzed document does not correspond to any of the reference documents.
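The final decision can be sketched as follows. The sketch assumes that the score of each reference document is the fraction of positions at which its identification data agree with those of the analyzed document, and that the "boundary conditions" reduce to a single threshold; both are assumptions made for illustration. Indices are 0-based here, whereas the text numbers reference documents from 1 to K. The function name `identify` and the sample values are hypothetical.

```python
from typing import Optional


def identify(h_query: list[int], references: list[list[int]],
             threshold: float) -> Optional[tuple[int, float]]:
    """Return (index, score) of the most similar reference, or None if below threshold."""
    if not references:
        return None
    # ASSUMPTION: score = share of positions with equal minimum hash values.
    scores = [
        sum(1 for q, r in zip(h_query, h_ref) if q == r) / len(h_query)
        for h_ref in references
    ]
    best = max(range(len(scores)), key=scores.__getitem__)  # R = max(R_i)
    if scores[best] < threshold:
        return None  # identification is considered unsuccessful
    return best, scores[best]


refs = [[9, 9, 9, 9], [1, 2, 3, 9], [1, 9, 9, 9]]
found = identify([1, 2, 3, 4], refs, threshold=0.5)
rejected = identify([1, 2, 3, 4], refs, threshold=0.9)
```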

The achievement of the objective of the invention, namely improving the accuracy of estimating the similarity of binary data arrays, can be shown as a graph (Fig. 2) comparing the experimental results for the claimed method and the prototype method. The abscissa shows the normalized number of changes between the reference document and the analyzed document, and the ordinate shows the normalized similarity score. The graph shows that, over 10,000 experiments in the MatLAB environment with different documents and different degrees of change S_n between them, the sample mean of the resulting similarity score R_n is the same for the claimed method and the prototype method; however, the variance of the score for the claimed method at L = 100 is substantially smaller, and the similarity score has an accuracy of 3σ, where

[formula for σ not reproduced in the source]

Claims

A method of identifying arrays of binary data, which consists in converting all electronic files of reference documents into a predetermined format; storing the converted electronic files of reference documents in a database; converting each electronic file of an analyzed document into the predetermined format; detecting matches of identification data; and comparing the relative numbers of matches found with a threshold value, characterized in that, when converting into the predetermined format, the sequence of binary data is divided into subblocks of a given length overlapping by one data symbol; a set of independent hash functions is formed using a list of non-repeating initial values of the hash function parameters; identification data are selected by computing the values of the independent hash functions over the subblocks of the given length and selecting the subblocks with the minimum hash value; sets of identification data H_L, consisting of the sequentially arranged minimum values for each hash function, are obtained as the identification data of the analyzed documents, and the same kind of sets of identification data H_Li are obtained for the reference documents, after which matches between these sets are detected.