RU2409850C1 - Address method of detecting identification features in information streams

Info

Publication number: RU2409850C1
Application number: RU2009120255/08A
Authority: RU
Inventors: Дмитрий Викторович Комолов (RU); Александр Алексеевич Архипенко (RU)
Priority date: 2009-05-27
Filing date: 2009-05-27
Publication date: 2011-01-20
Also published as: RU2009120255A

Abstract

FIELD: information technology.

SUBSTANCE: to create a base of reference information features, a set of N reference information features is selected; the distinct symbols they contain are extracted and used to form an alphabet of symbols; the number of symbols in the alphabet and the volume of the base of reference information features are calculated; for the i-th reference feature, the number m_i of symbols that form it and its morphological coefficient d_i are calculated; the address of the reference information feature, A_i = f(d_i), is calculated using a hash function; the morphological coefficient is reduced and a secondary hash function is calculated from it; and the resulting compact binary representation of the information feature is written to the base of reference information features at address A_i. To extract information features from each fragment of the received information stream, a group of binary characters lying between two spaces is selected and decoded to the form of an information feature, and its morphological coefficient and address are calculated. The resulting morphological coefficient is compared with the morphological coefficient stored at the same address in the base of reference information features, and the presence or absence of the identification features to be detected is determined for each fragment of the information stream.

EFFECT: reduction of the amount of memory for storing reference information features.

5 dwg

Description

The invention relates to computer science and computing and can be used to process information streams and detect specified reference information features in them. The method can be used in information stream control devices for monitoring information traffic.

A method is known that is implemented in an information processing device for information retrieval (RF Patent No. 2096825, IPC⁶ G06F 17/00, G06F 17/30, published November 20, 1997). In this method, a base of reference information values to be detected in the information stream is formed in advance and stored, together with the number of characters in the text fragment (TF) being processed, the number of characters in words (phrases), the number of digits and special characters in the TF, and preselected character combinations corresponding to structural features of the TF; rules for extracting a TF from the information stream are also set. The information stream is then received, and the next TF is stored according to the preset rules. Words and phrases are extracted from the TF using the previously stored structural features. The TF is stored by writing its words and phrases to memory sequentially, in the same order as their positions in the extracted TF. The stored words and phrases are compared with the extracted TF: a word (phrase) is selected from memory by exhaustive search, the number and type of characters in the selected word are determined to check whether it consists only of digits and/or special characters, the number of characters is compared with the reference value, and the comparison data are stored. Data on the number of repetitions of the word in the TF (the number of identical words) and on the number of matches of the symbol structure are also stored. The extracted feature is compared with the reference feature contained in the base of reference information features. If they match, the sought feature is considered detected.

The disadvantages of this method are:

1) relatively low information processing speed owing to the use of sequential search algorithms;

2) significant memory costs for storing the reference information features.

The second drawback arises because, as traffic intensity grows, the processing time for each text unit (word, phrase, etc.) increases, and with it the total processing time for the whole array of information features. Larger memory volumes and the need for more computing resources lead to unjustified economic costs.

The first drawback is largely eliminated by the method of processing information to detect identification features in information streams (RF Patent No. 2282889, IPC⁶ G06F 17/30, published August 27, 2006, Bulletin No. 24). This method is the closest in technical essence and has been chosen as the prototype.

In the prototype method, a base of reference information features (BEIP) to be detected in the information stream is formed in advance; the information stream is received; fragments of the received stream are sequentially extracted and stored; information features are extracted from them according to established rules and compared with the reference information features from the BEIP; and, from the comparison results, the presence or absence of the identification features to be detected is recorded for each fragment of the information stream. To form the BEIP, a set of N reference information features is selected, and the distinct symbols they contain are extracted. An alphabet of symbols (AS) is then formed from the extracted symbols, the number S of symbols in it is calculated, the j-th symbol, where j = 1, 2, …, S, is assigned the number n_j of its position in the alphabet, and, for a given value of the BEIP fill factor K, the BEIP volume N_k = N/K is calculated. Then, for the i-th reference information feature, where i = 1, 2, …, N, the number m_i of symbols that form it and its morphological coefficient d_i are calculated, and the address of the reference information feature, A_i = f(d_i), is calculated using a hash function f(d_i) of a given form. The i-th reference information feature is then stored in the BEIP at the position corresponding to its address A_i. To extract information features from each fragment of the received information stream, a group of binary characters lying between two adjacent spaces is selected and decoded to the form of an information feature, and its morphological coefficient and address are calculated. The extracted and decoded information feature is then compared with the reference information features stored at that address in the BEIP.
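The extraction step (a group of binary characters between two spaces, decoded to the form of an information feature) can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `extract_features` and the UTF-8 encoding are assumptions, since the text does not name the character encoding of the stream.

```python
def extract_features(fragment: bytes, encoding: str = "utf-8") -> list[str]:
    """Split a received fragment on space bytes and decode each group.

    The encoding is an assumption; the source does not specify one.
    """
    groups = fragment.split(b" ")                     # byte groups between spaces
    return [g.decode(encoding) for g in groups if g]  # skip empty groups


fragment = "новая маска и рама".encode("utf-8")
print(extract_features(fragment))
```

Each decoded group would then be passed on to the morphological coefficient and address calculations.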

For the i-th reference information feature, where i = 1, 2, …, N, the morphological coefficient d_i is calculated by formula (1):

where n_j is the position number of the j-th symbol in the alphabet of symbols;

m_i is the number of symbols forming the i-th feature;

S is the number of symbols in the AS;

j = 1, 2, …, m_i is the position of the symbol in the i-th feature.

The hash function used to calculate the feature address A_i = f(d_i) is a function of the form (formula 2):

The disadvantage of this method is the significant amount of memory needed to store the reference information features.

The technical result of implementing the proposed method is a reduction in the amount of memory needed to store the reference information features.

This technical result is achieved by supplementing the functional actions of the prototype method, when the BEIP is formed, with a reduction of the dimension of the morphological coefficient d_i, represented in binary form: d_i is divided by a binary constant R equal to the order of magnitude of the product (N_k × 10), which gives a new morphological coefficient of reduced dimension. From this reduced coefficient a secondary hash function is calculated with a modulus equal to N_k. The resulting compact binary value, which has the same size for dictionary features of any symbol length, is written to the BEIP at address A_i = f(d_i). Subsequently, when information traffic is monitored, fragments of the information stream are decoded to the form of the compact binary value of a feature and compared with the binary-form feature stored at address A_i = f(d_i); from the comparison results, the presence or absence of the identification features to be detected is recorded for each fragment of the information stream.
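The two added steps, reduction by R and the secondary hash modulo N_k, can be sketched as below. This is an illustration under stated assumptions: R is read as the power of ten with the same order of magnitude as N_k × 10 (which gives R = 1000 for N_k = 500, as in the description's example), and the secondary hash is taken to be a plain remainder. The function names are hypothetical.

```python
def reduction_constant(n_k: int) -> int:
    """Power of ten with the same order of magnitude as n_k * 10.

    Assumed reading of the constant R; e.g. n_k = 500 -> 5000 -> 1000.
    """
    return 10 ** (len(str(n_k * 10)) - 1)


def compact_value(d: int, n_k: int) -> int:
    """Reduce d by integer division by R, then hash it modulo n_k."""
    r = reduction_constant(n_k)
    reduced = d // r        # fractional part of d / R is discarded
    return reduced % n_k    # assumed secondary hash: always in 0 .. n_k - 1
```

Whatever the length of the original feature, the returned value fits in the number of bits needed to represent N_k - 1.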

Thanks to the new set of essential features of the claimed method, the amount of memory required to store features in the BEIP is reduced.

An analysis of the prior art in information processing has established that no analogues characterized by a set of features identical to all the features of the claimed technical solution are found in the available sources of information, which indicates that the claimed method satisfies the patentability condition of "novelty".

The distinctive features introduced in combination are not found in the analogues: reduction of the morphological coefficient, represented in binary form, by binary division by a binary constant R equal to the order of magnitude of the product (N_k × 10), and binary calculation of a secondary hash function, with a modulus equal to N_k, from the binary representation of the reduced morphological coefficient. Consequently, the claimed method meets the "inventive step" criterion.

The claimed method is illustrated by drawings, which show:

Fig. 1 - a block diagram explaining the address method of detecting identification features in information streams;

Fig. 2 - a table of the "alphabet of symbols" and their code values;

Fig. 3 - a summary table of the dictionary selection features, the number of symbols in each feature, and the addresses and morphological coefficients of the dictionary features;

Fig. 4 - a summary table of dictionary features with their morphological coefficients, calculated by the prototype method and by the proposed method, and the memory required to store the binary sequences of these morphological coefficients;

Fig. 5 - a summary table of dictionary features, their decimal addresses for addressing in the BEIP, the morphological coefficients reduced by the proposed method, the secondary hash functions of the reduced coefficients, and their binary representation for the BEIP.

The claimed technical solution is achieved by introducing additional functional blocks, and connections between functional blocks, into the prototype method. Figure 1 shows a block diagram explaining the address method of detecting identification features in information streams. The functional blocks of this diagram are:

1 - functional block for segmenting the information stream into words;

2 - functional block for calculating the morphological coefficient of the extracted word;

3 - functional block for calculating the BEIP address;

4 - functional block for reducing the morphological coefficient;

5 - functional block for calculating the compact record of a reference information feature;

6 - base of reference information features (BEIP);

7 - functional block for comparing information features.

Functional blocks 1, 2, 3, 6 and 7 fully implement the actions described in the prototype method; in the prototype method, the information outputs of functional block 2 are the information inputs of functional blocks 6 and 7. To implement the new technical solution, functional blocks 4 and 5 are added to the block diagram of the prototype method. In the proposed method, the input of functional block 1 receives the selected dictionary features at the BEIP filling stage and the information stream when information traffic is monitored. Functional block 1 segments the information stream into words. The information output of functional block 1 is the information input of functional block 2, which calculates the morphological coefficient of the word extracted in block 1. The information outputs of functional block 2 are the information inputs of functional blocks 3 and 4. Functional block 3 calculates the BEIP address for the morphological coefficient supplied by block 2. The information output of functional block 3 is the information input of block 6. Functional block 4 reduces the morphological coefficient supplied by functional block 2. The information output of functional block 4 is the information input of functional block 5, which converts the reduced morphological coefficient to a compact form: a binary sequence of limited size. The information output of functional block 5 is the information input of block 6, in which the binary sequence of limited size from functional block 5 is written at the address calculated in functional block 3. Once the BEIP has been filled, the system for address detection of identification features in information streams is ready for use.

During monitoring, the information stream passes through the sequence of functional transformations in functional blocks 1, 2, 4 and 5 and arrives from functional block 5 at the input of functional block 7 as a binary sequence of limited size. At the same time, the binary sequence of limited size stored in functional block 6 is read, at the address calculated in functional block 2, to the input of functional block 7. Functional block 7 compares the two binary sequences and outputs a value indicating the presence or absence of the selected dictionary features in the current fragment of the information stream.
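The data flow between blocks 1 to 7 can be sketched as a small class. This is a hedged illustration, not the patent's implementation. `stand_in_coefficient` is a hypothetical placeholder for block 2 and does not reproduce formula (1); the modulo address hash, the remainder used as the secondary hash, and the set kept at each address to tolerate collisions are all assumptions.

```python
from typing import Callable


def stand_in_coefficient(word: str) -> int:
    """Hypothetical placeholder for block 2; NOT the patent's formula (1)."""
    return int.from_bytes(word.encode("utf-8"), "big")


class Detector:
    """Sketch of blocks 1-7; one set of compact values per address is assumed."""

    def __init__(self, n_k: int, r: int,
                 coeff: Callable[[str], int] = stand_in_coefficient) -> None:
        self.n_k, self.r, self.coeff = n_k, r, coeff
        self.table: dict[int, set[int]] = {}      # block 6: address -> compact values

    def _address_and_compact(self, word: str) -> tuple[int, int]:
        d = self.coeff(word)                      # block 2: coefficient
        address = d % self.n_k                    # block 3: assumed modulo hash
        compact = (d // self.r) % self.n_k        # blocks 4 and 5: reduce, re-hash
        return address, compact

    def add(self, word: str) -> None:
        """BEIP filling stage: store the compact value at the address."""
        address, compact = self._address_and_compact(word)
        self.table.setdefault(address, set()).add(compact)

    def scan(self, fragment: str) -> list[str]:
        """Monitoring stage: return the words whose compact value matches."""
        hits = []
        for word in fragment.split():             # block 1: segment into words
            address, compact = self._address_and_compact(word)
            if compact in self.table.get(address, ()):   # block 7: compare
                hits.append(word)
        return hits


detector = Detector(n_k=500, r=1000)
for feature in ("маска", "рама"):
    detector.add(feature)
print(detector.scan("новая маска и рама"))
```

Because only the address and the compact value are compared, not the full feature, an unrelated word can in principle produce a match; the sketch makes no attempt to rule that out.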

The claimed method is best considered using the example of the actions implemented by the prototype method, supplemented with the actions needed to obtain the claimed technical solution.

Let N = 100 dictionary features be chosen for selecting information messages. The dictionary features "банк" (bank), "железо" (iron), "маска" (mask), "машина" (machine), "рама" (frame), "самолет" (airplane), "человек" (person) and "1985-подъем" (1985-rise) are taken from the prototype method. From the N chosen features, the distinct symbols they contain are extracted and an "alphabet of symbols" (AS) is formed, with each symbol assigned a sequence number in the AS. Assume that all N features together contain the symbols listed in the table shown in Figure 2. Each symbol of the AS corresponds to the code value of its position number n_j.

The AS contains a set of distinct symbols sufficient to compose any of the N preselected features.
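Forming the alphabet of symbols can be sketched for the eight features named above. This is an illustration only: the ordering of symbols (here, by code point) and the 1-based numbering are assumptions, because the actual table of Figure 2 covers all N = 100 features and its ordering is not given in the text. The name `build_alphabet` is hypothetical.

```python
def build_alphabet(features: list[str]) -> dict[str, int]:
    """Map each distinct symbol to a 1-based position number n_j.

    Sorting by code point is an assumption, not the ordering of Figure 2.
    """
    symbols = sorted({ch for feature in features for ch in feature})
    return {ch: pos for pos, ch in enumerate(symbols, start=1)}


features = ["банк", "железо", "маска", "машина",
            "рама", "самолет", "человек", "1985-подъем"]
alphabet = build_alphabet(features)
S = len(alphabet)   # number of symbols in the alphabet
print(S, alphabet)
```

By construction, every feature in the list can be composed from the symbols of the resulting alphabet.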

Then, for a given value of the BEIP fill factor K, the BEIP volume N_k = N/K, i.e. the number of rows in the BEIP being formed, is calculated. As in the prototype method, K = 0.2, so the number of rows in the base of reference information features is:

N_k = N / K = 100 / 0.2 = 500.

Next, for each i-th feature, its morphological coefficient d_i is calculated by formula (1).

Then, using the calculated value of the morphological coefficient, the address A_i of each i-th feature is determined with the given hash function (formula 2), i.e. the position of the reference feature in the BEIP is determined. Formula 2 can be simplified to the expression (formula 3):

In this case, the BEIP addressing differs from the addressing calculated by formula 2 by a constant equal to 1 and occupies the address range from 0 to 499. The address obtained by formula 3 is referred to below as the simplified address. Since the morphological coefficient d_i already contains information about the value of the simplified address, storing this redundant information in the BEIP is inefficient. It is proposed to reduce the morphological coefficient d_i by dividing it by a constant R equal to the order of magnitude of the product (N_k × 10). For N_k = 500, the constant R is a third-order value: 1000 in decimal representation, or 1111101000 in binary representation. When the morphological coefficient d_i is represented in decimal form, the reduced morphological coefficient is obtained by dividing d_i by the constant R, discarding the fractional part of the quotient and keeping only the integer part. For example, for the dictionary feature "маска", which occupies the 2nd position in the table in Figure 3, the morphological coefficient calculated by formula 1 is 6961520 in decimal form. For this value of the morphological coefficient, the simplified address calculated by formula 3 is:

Since the address range from 0 to 499 is available in the BEIP at N_k = 500, the trailing digits of d_i, of the same order as N_k, carry no additional information: they are already contained in the value of the simplified address.
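For the feature "маска" the reduction and the address can be worked through numerically. The first expression follows directly from the text (division by R = 1000 with the fractional part discarded). The second assumes that formula 3 is the plain remainder of d_i modulo N_k, which is an assumption consistent with the stated address range 0 to 499 rather than a form confirmed by the text.

```latex
\left\lfloor \frac{d_i}{R} \right\rfloor
  = \left\lfloor \frac{6961520}{1000} \right\rfloor = 6961,
\qquad
d_i \bmod N_k = 6961520 \bmod 500 = 20
```

Under that assumption the feature would be addressed at position 20, and only the value 6961 would go on to the secondary hash.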

For 30 dictionary features, the calculated addresses, the morphological coefficients obtained by the prototype method and the reduced morphological coefficients are summarized in the table in Figure 3.

The calculated number of bits needed to store the morphological coefficients of the corresponding dictionary features in the BEIP is summarized in the table in Figure 4. For the dictionary feature "маска", the morphological coefficient in binary form is 11010100011100101110000, which requires 23 bits of memory. For the reduced morphological coefficient, this shrinks to 13 bits, with the value 1101100110001, which follows from the calculation:
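The bit counts quoted for "маска" can be checked directly. The short sketch below uses only the values stated in the text (d_i = 6961520, R = 1000); the variable names are illustrative.

```python
d = 6961520           # morphological coefficient of "маска" from the text
reduced = d // 1000   # reduced coefficient: integer part of d / R

# Binary form and the number of bits each value needs.
print(format(d, "b"), d.bit_length())              # 11010100011100101110000 23
print(format(reduced, "b"), reduced.bit_length())  # 1101100110001 13
```

The difference of 10 bits for this feature matches the average saving stated in the description.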

On average, reducing the morphological coefficients saves 10 bits of storage per reference information feature, regardless of the length of the dictionary feature.

Because morphological coefficients differ in length and therefore require different amounts of memory for their binary values, the reduced morphological coefficients must be brought to a compact representation. This compact representation must have a single fixed size for dictionary features of any character length. To this end, a secondary hash function of the reduced morphological coefficient is calculated by the formula:

$$\left\lfloor \frac{d_i}{R} \right\rfloor \bmod N_k$$

The choice of the modulus N_k for the secondary hash function relies on the hash-function calculation procedure already implemented by functional block 2 in the prototype method. The value of N_k also determines V_m, the number of bits required to store the compact binary value of a feature, by the formula:

$$V_m = \log_2 N_k$$

In this case V_m = 8.965784, which is rounded up to a whole number of 9 bits.
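
The secondary hash and the field width can be sketched as follows. This is an illustration under stated assumptions: the reduced coefficient 6961 for "mask" is taken from the values quoted earlier, the function name `secondary_hash` is hypothetical, and the resulting compact value is computed here rather than quoted from the table of figure 5.

```python
import math

N_K = 500             # volume of the BEIP used in the example
REDUCED_MASK = 6961   # reduced morphological coefficient of "mask"


def secondary_hash(reduced: int, n_k: int) -> int:
    """Secondary hash: remainder of the reduced coefficient modulo n_k."""
    return reduced % n_k


v_m = math.log2(N_K)    # bits needed to distinguish n_k values
bits = math.ceil(v_m)   # whole number of bits actually stored

compact = secondary_hash(REDUCED_MASK, N_K)

print(round(v_m, 6), bits)                     # 8.965784 9
print(compact, format(compact, f"0{bits}b"))   # 461 111001101
```

Every remainder modulo 500 lies in the range 0 to 499, so it always fits in 9 bits regardless of the length of the original dictionary feature.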

The binary representations of the reference information features written to the BEIP, at the addresses corresponding to the dictionary features used for information selection, are shown in the table of figure 5.

The positive effect of the proposed method is justified as follows. The measure of effectiveness for detecting identification features in information streams is the required amount of BEIP memory; hence the more effective method is the one that requires less memory to store the reference information features in the BEIP. For the N dictionary features listed in the table of figure 4, the conditional memory efficiency η(M_i) for storing reference information features under method M_i is determined by the formula:

$$\eta(M_i) = \frac{N \cdot V_{cp}}{S_{cp}},$$

where N is the number of dictionary features;

V_cp is the average number of bits required to store one feature;

S_cp is the average number of characters in a dictionary feature.

The average length of the dictionary features listed in the table of figure 3 is 6.4 characters per word, with 30 dictionary features in total.

For the prototype method, the average number of bits required to store one dictionary feature, based on the data in the table of figure 4, is 29.7 bits per dictionary feature. The conditional memory efficiency for storing reference information features under the prototype method is therefore:

$$\eta = \frac{30 \times 29.7}{6.4} = 139.21875 \ \text{bits/symbol}$$

When reduced morphological coefficients are used as the reference information features, the conditional memory efficiency for storing reference information features is 92.34375 bits/symbol.

With the proposed method, the conditional memory efficiency for storing reference information features is:

$$\eta = \frac{30 \times 9}{6.4} = 42.1875 \ \text{bits/symbol}$$

Thus, memory use for storing information features under the proposed method is 3.3 times more efficient than under the prototype method.
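
The comparison can be reproduced with a few lines of arithmetic. This sketch assumes that the conditional efficiency has the form N × (average bits per feature) / S_cp, which is the form consistent with the quoted figure of 92.34375 bits/symbol; the function name and the 19.7-bit average for reduced coefficients (29.7 bits minus the 10-bit average saving) are assumptions introduced for illustration.

```python
def conditional_efficiency(n_features: int, avg_bits: float, avg_chars: float) -> float:
    """ASSUMED form of the efficiency measure: N * average bits / S_cp."""
    return n_features * avg_bits / avg_chars


N = 30       # number of dictionary features
S_CP = 6.4   # average characters per dictionary feature

prototype = conditional_efficiency(N, 29.7, S_CP)   # full coefficients
reduced = conditional_efficiency(N, 19.7, S_CP)     # reduced coefficients (assumed 19.7 bits)
proposed = conditional_efficiency(N, 9, S_CP)       # 9-bit compact values

print(f"{prototype:.5f} {reduced:.5f} {proposed:.5f}")
print(f"gain: {prototype / proposed:.1f}x")
```

Because N and S_cp are the same for every method, the gain reduces to the ratio of average bits per feature, 29.7 / 9 = 3.3.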

The practical implementation of the proposed method does not incur large additional computational costs, since computing a value modulo is a binary operation just as addition and subtraction are binary operations (Concrete Mathematics: A Foundation for Computer Science / R. Graham, D. Knuth, O. Patashnik. Translated from English, 2nd ed., rev. Moscow: Mir; BINOM. Laboratoriya Znanii, 2006, p. 104).

Two additional actions are introduced into the prototype method:

1) binary division of the morphological coefficient, calculated by the prototype method and presented in binary notation, by the binary constant R equal to the dimension (order of magnitude) of the product (N_k × 10);

2) binary calculation of the secondary hash function, with the modulus N_k in binary representation, from the reduced morphological coefficient in binary notation.

Both actions can be implemented on the currently available component base, for example on any commercially produced field-programmable gate arrays (FPGAs).

Thus, it follows from the essence of the claimed method as described that it reduces the amount of memory required to store the reference information features. This confirms the positive effect of implementing the proposed method.

Claims

An address method for detecting identification features in information streams, in which, to form a base of reference information features to be detected in the information stream, a set of N_i reference information features is selected; the mutually distinct symbols they contain are selected and formed into an alphabet of symbols; the number S of symbols in the alphabet is calculated; the j-th symbol, where j = 1, 2, ..., S, is assigned the number n_j of its position in the alphabet; for a given value of the fill factor K of the base of reference information features, its volume N_k = N/K is calculated; then, for the i-th reference feature, where i = 1, 2, ..., N, the number m_i of symbols forming it and its morphological coefficient d_i are calculated, and the address of the reference information feature A_i = f(d_i) is calculated using a hash function of the specified form f(d_i); the i-th reference information feature is then stored in the base of reference information features at the position corresponding to its address A_i; and, to extract information features from each fragment of the received information stream, a group of binary characters located between two adjacent spaces is selected and decoded to the form of an information feature, its morphological coefficient and address are calculated, and the morphological coefficient obtained during decoding is compared with the morphological coefficient stored at the same address in the base of reference information features; characterized in that, to form the information features, the dimension of the morphological coefficient d_i, presented in binary notation, is additionally reduced by dividing it by a binary constant R equal to the dimension of the product (N_k × 10); a secondary hash function with modulus N_k is calculated from the reduced-dimension morphological coefficient; the resulting compact binary value is written to the base of reference information features at the address A_i = f(d_i); and, when monitoring information traffic, fragments of the information stream are brought to the form of a compact binary value of the feature and compared with the feature stored in binary form at the address A_i = f(d_i) in the base of reference information features, the presence or absence of the identification features to be detected in each fragment of the information stream being recorded from the results of the comparison.
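
The claimed sequence of steps can be illustrated end to end with a small sketch. It is not the patented implementation: formula 1 for the morphological coefficient and the form of the hash function f(d_i) are not given in this section, so `morph_coeff` and `address` below are hypothetical stand-ins, and R = 1000 with N_k = 500 are taken from the worked example. Address collisions are handled naively by keeping a set of compact values per address.

```python
def morph_coeff(word: str, alphabet: str) -> int:
    """HYPOTHETICAL stand-in for formula 1: read the word as a number in
    base S+1, using the 1-based alphabet positions n_j as digits."""
    base = len(alphabet) + 1
    d = 0
    for ch in word:
        d = d * base + alphabet.index(ch) + 1
    return d


def address(d: int, n_k: int) -> int:
    """ASSUMED form of the address hash A_i = f(d_i)."""
    return d % n_k


def compact_value(d: int, r: int, n_k: int) -> int:
    """Reduce d by R, then apply the secondary hash modulo N_k."""
    return (d // r) % n_k


def build_base(words, alphabet, r, n_k):
    """Store the compact value of every reference feature at its address."""
    base = {}
    for w in words:
        d = morph_coeff(w, alphabet)
        base.setdefault(address(d, n_k), set()).add(compact_value(d, r, n_k))
    return base


def detect(fragment, base, alphabet, r, n_k):
    """Return the space-delimited tokens whose compact value matches the base."""
    hits = []
    for token in fragment.split(" "):
        if not token or any(ch not in alphabet for ch in token):
            continue  # token cannot be encoded with the alphabet
        d = morph_coeff(token, alphabet)
        if compact_value(d, r, n_k) in base.get(address(d, n_k), ()):
            hits.append(token)
    return hits


R, N_K = 1000, 500
words = ["mask", "hash", "key"]
alphabet = "".join(sorted(set("".join(words))))
base = build_base(words, alphabet, R, N_K)
print(detect("the mask and key are here", base, alphabet, R, N_K))  # ['mask', 'key']
```

Because only the address and a 9-bit compact value are kept, two different tokens that share both would be indistinguishable in this sketch; how the method itself treats such cases is not described in this section.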