RU2473964C1

RU2473964C1 - Method of detecting identification features for different letter-symbol writing systems

Info

Publication number: RU2473964C1
Application number: RU2011151684/08A
Authority: RU
Inventors: Дмитрий Викторович Комолов
Priority date: 2011-12-16
Filing date: 2011-12-16
Publication date: 2013-01-27

Abstract

FIELD: information technology.

SUBSTANCE: a unique three-byte address in the reference information feature base (RIFB) is calculated, and a logic-one value is recorded at that address. Representing a reference information feature as a three-byte address makes it possible to augment the RIFB when necessary without recalculating the addresses or morphological coefficients of the information features. The three-byte addresses of the information features do not depend on the character length of the vocabulary features.

EFFECT: broader functional capabilities when using the method to detect identification features presented in different letter-symbol writing systems, high efficiency of using memory to store reference information features.

2 dwg, 7 tbl

Description

The invention relates to computer science and computer engineering and can be used to process information streams and to detect in them specified reference information features presented in different letter-symbol writing systems. The method can be used in information stream control devices for monitoring information traffic.

A known method is implemented in an information processing device for information retrieval (RF patent No. 2096825, IPC⁶ G06F 17/00, G06F 17/30, published 20.11.1997). In this method, a base of reference information values to be detected in the information stream is formed in advance and stored. Also stored are the number of characters in the processed text fragment (TF), the number of characters in words (phrases), the number of digits and special characters in the TF, and preselected character combinations that correspond to structural features of the TF. Rules for extracting a TF from the information stream are set. The information stream is then received, and the next TF is stored according to the predefined rules. Words and phrases are extracted from the TF using the previously stored structural features. The TF is stored by writing its words and phrases to memory sequentially, in the same order as their positions in the extracted TF. The stored words and phrases are compared with the extracted TF: each word (phrase) is fetched from memory by exhaustive search, the number and type of its characters are checked for the presence of only digits and/or special characters, the number of characters is compared with the reference value, and the comparison result is stored. The number of repetitions of the word in the TF (the number of identical words) and the number of matches of the character structure are stored. The extracted feature is compared with the reference feature contained in the base of reference information features. If they match, the sought feature is considered detected.

The disadvantages of this method are:

1) relatively low information processing speed, caused by the use of sequential search algorithms;

2) significant memory consumption for storing the reference information features.

The second drawback arises because, as traffic intensity grows, the time needed to process each text unit (word, phrase, etc.) increases, and so does the total time needed to process the whole array of information features. The larger memory and the additional computing resources that this requires lead to unjustified economic costs.

The first drawback is largely eliminated by a method of processing information to detect identification features in information streams (RF patent No. 2282889, IPC⁶ G06F 17/30, published 27.08.2006, Bulletin No. 24). In this method, a base of reference information features (BEIP) to be detected in the information stream is formed in advance. The information stream is received, and its fragments are extracted and stored one after another. Information features are extracted from the fragments according to established rules and compared with the reference information features in the BEIP, and the comparison result records the presence or absence, in each fragment of the information stream, of the identification features to be detected. To form the BEIP, a set of N_i reference information features is selected, and the distinct characters they contain are extracted. A character alphabet (AS) is formed from the extracted characters, the number S of characters in it is calculated, the j-th character, where j = 1, 2, ..., S, is assigned the number n_j of its position in the character alphabet, and for a given BEIP fill coefficient K the BEIP size is calculated as N_k = N/K. Then, for the i-th reference information feature, where i = 1, 2, ..., N, the number m_i of characters forming the feature and its morphological coefficient d_i are calculated, and the address of the reference information feature, A_i = f(d_i), is calculated using a hash function f(d_i) of a given form. The i-th reference information feature is then stored in the BEIP at the position corresponding to its address A_i. To extract information features from each fragment of the received information stream, a group of binary symbols located between two adjacent spaces is selected, decoded into the form of an information feature, and its morphological coefficient and address are calculated. The extracted and decoded information feature is then compared with the reference information feature stored at that address in the BEIP.

For the i-th reference information feature, the morphological coefficient d_i is calculated by the formula:

where n_j is the position number of the j-th character in the character alphabet;

m_i is the number of characters forming the i-th feature;

S is the number of characters in the character alphabet (AS);

j = 1, 2, ..., m_i is the position of the character in the i-th feature.

The hash function used to calculate the feature address A_i = f(d_i) has the form:

This method is the closest in technical essence and is chosen as the prototype.

The disadvantage of the prototype method is its limited applicability: it can detect only information features composed of characters in a single case (uppercase only or lowercase only) and in a single letter-symbol writing system. In addition, when new information features containing characters that are not in the existing character alphabet (AS) must be added to the BEIP, all values of the morphological coefficients d_i have to be fully recalculated, and the addresses A_i corresponding to each information feature in the BEIP have to be recomputed with the hash function f(d_i).

The technical result of the proposed method is broader functionality when the method is used to detect identification features expressed in different letter-symbol writing systems, together with more efficient use of memory for storing reference information features.

This result is achieved by removing from the prototype method the procedure for determining the morphological coefficient d_i. Instead, a memory address A_i is determined, and a logic-one level is written at that address for each information feature to be detected in the information stream. Logic-zero levels are written at all other addresses of the allocated memory. The memory address of the sought identification feature is a three-byte sequence of the form:

where

- 1st byte of the memory address,

- 2nd byte of the memory address,

- 3rd byte of the memory address.

The value of the 1st byte of the address is set to the binary numeric value of the code of the 1st character of the word. The value of the 2nd byte of the address is calculated by the expression:

where bin is the function that gives the binary representation of a number written in decimal notation;

m_i is the number of characters in the i-th identification feature;

n_j is the code of the j-th character in the character alphabet;

3 is the smallest odd prime number;

mod is the modulo function;

j = 1, 2, ..., m_i is the position of the character in the i-th identification feature.

The value of the 3rd byte of the address is calculated by the expression:

The three-byte binary sequence A_i defines an address in the allocated memory. For the reference information features required for selection, a logic-one level is written to memory at their calculated addresses; all other addresses, which correspond to features not involved in the address calculation, hold a logic-zero level. When the information stream is monitored, a three-byte address is determined for each word extracted from the stream, and if the memory holds a logic-one level at that address, the reference information feature is considered present in the information stream. When a new reference information feature is added, its three-byte address is determined, a logic-one level is written at that memory address, and detection of identification features in the information streams continues.
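The storage scheme described above can be sketched as a bit table with one bit per possible three-byte address. This is a minimal illustration under stated assumptions, not the patented implementation: the class `FeatureBitTable`, the helper `placeholder_address`, and the `cp1251` encoding are hypothetical. Only the first address byte follows the text (the code of the first character); the second and third bytes use simple stand-in sums, because the patent's own expressions for them are not reproduced here.

```python
class FeatureBitTable:
    """One bit per three-byte address: 2**24 bits, i.e. 2 MiB of memory."""

    ADDRESS_SPACE = 1 << 24

    def __init__(self) -> None:
        # Every address starts at logic zero.
        self._bits = bytearray(self.ADDRESS_SPACE // 8)

    def set(self, address: int) -> None:
        """Write a logic one at the given address."""
        self._bits[address >> 3] |= 1 << (address & 7)

    def test(self, address: int) -> bool:
        """Return True if a logic one is stored at the given address."""
        return bool(self._bits[address >> 3] & (1 << (address & 7)))


def placeholder_address(word: str, encoding: str = "cp1251") -> int:
    """Pack three bytes into one 24-bit address.

    Byte 1 is the code of the first character, as in the text.
    Bytes 2 and 3 are STAND-INS, not the patent's expressions.
    """
    codes = word.encode(encoding)
    if not codes:
        raise ValueError("empty word has no address")
    b1 = codes[0]
    b2 = sum(codes) % 256                                         # stand-in
    b3 = sum(j * n for j, n in enumerate(codes, start=1)) % 256   # stand-in
    return (b1 << 16) | (b2 << 8) | b3


table = FeatureBitTable()
table.set(placeholder_address("информация"))
# Adding a feature from another writing system is one more bit write;
# nothing already stored has to be recalculated.
table.set(placeholder_address("information"))
print(table.test(placeholder_address("информация")))
```

Because the table always covers the full 24-bit address space, its size stays fixed no matter how many features are stored or how long they are.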

Owing to this new set of essential features, the claimed method uses memory for the BEIP more efficiently and removes the need to recalculate the contents of the BEIP when a new information feature presented in another letter-symbol writing system appears.

An analysis of the prior art in information processing established that no analogues characterized by a set of features identical to all features of the claimed technical solution exist in the available information sources, which indicates that the claimed method satisfies the patentability condition of "novelty".

The distinguishing features introduced in combination are not found in the analogues: the representation of information features of different letter-symbol writing systems as unique three-byte BEIP addresses, and the placement at these addresses of values of minimal length (1 bit of information: 1 or 0) that indicate the presence or absence of a reference feature at the address. The claimed method therefore meets the "inventive step" criterion.

The claimed method is illustrated by the drawings, which show:

Fig. 1: a block diagram explaining the method of detecting identification features for different letter-symbol writing systems;

Fig. 2: a graph of the growth in the number of addresses as a function of the number of reference information features in the prototype method;

Fig. 3: a summary table of Russian vocabulary selection features and their corresponding three-byte addresses in decimal notation;

Fig. 4: a summary table of English vocabulary selection features and their corresponding three-byte addresses in decimal notation;

Fig. 5: a summary table of German vocabulary selection features and their corresponding three-byte addresses in decimal notation;

Fig. 6: a summary table of Italian vocabulary selection features and their corresponding three-byte addresses in decimal notation;

Fig. 7: a summary table of Spanish vocabulary selection features and their corresponding three-byte addresses in decimal notation;

Fig. 8: a summary table of Lithuanian vocabulary selection features and their corresponding three-byte addresses in decimal notation;

Fig. 9: a summary table of vocabulary selection features beginning with the character "i" in languages of the Latin letter-symbol writing system, and their corresponding three-byte addresses in decimal notation.

The claimed technical solution is achieved by removing from the prototype method the functional block that calculates the morphological coefficient of the extracted word, together with the now unnecessary connections between block 2 and the other functional blocks (1, 3, 4, 5). Fig. 1 shows a block diagram explaining the method of detecting identification features for different letter-symbol writing systems. The block diagram of the prototype and of the proposed method consists of the following functional blocks:

1 - functional block for segmenting the information stream into words;

2 - functional block for calculating the morphological coefficient of the extracted word;

3 - functional block for calculating the address in the base of reference information features (BEIP);

4 - base of reference information features;

5 - functional comparison block.

Functional block 2 is removed from the method. Instead of the morphological coefficient d_i calculated by expression 1, a logic-one level is written to the BEIP memory at the address corresponding to a particular information feature; this value marks the presence of the reference feature in the information stream. The information connections between block 2 and blocks 1, 3, 4 and 5, shown in the figure by dashed lines, also disappear from the proposed method.

Functional block 1 fully implements the actions described in the prototype method: it segments the information stream into words. Its input receives the selected vocabulary features while the BEIP is being filled, and the information stream while information traffic is being monitored. In the proposed method, only blocks 1, 3 and 4 operate while the BEIP is being formed, and blocks 1, 3, 4 and 5 operate while information traffic is being monitored. The information output of functional block 1 is the information input of functional block 3. To implement the new technical solution, block 3 uses a BEIP address calculation function that differs from the prototype method. Instead of an address sufficient to index the N_k rows of the BEIP, as in the prototype method, block 3 determines a three-byte BEIP address for the i-th reference information feature in accordance with expressions 3, 4 and 5, for words of any length presented in different letter-symbol writing systems. Three information outputs, one per address byte, are the information inputs of block 4. The BEIP in functional block 4 is also formed differently from the prototype method: instead of writing the morphological coefficient at the calculated address A_i, a logic-one level is written. The information output of functional block 4 is the information input of block 5. Functional block 5 performs the comparison only while information traffic is being monitored. It compares the level stored at the calculated BEIP address with the logic-one level; if the values match, the output reports that the required vocabulary feature is present in the current fragment of the information stream.
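The two operating modes of the block diagram can be sketched as follows. This is only an illustration of the data flow between blocks 1, 3, 4 and 5 under stated assumptions: the function names are hypothetical, a Python set of integers stands in for the BEIP memory of block 4, whitespace splitting stands in for block 1, and `demo_address` is a stand-in for block 3 that keeps the first character code as the first byte but replaces the patent's expressions for the other two bytes with a CRC.

```python
import zlib
from typing import Callable, Iterable, Iterator, Set


def segment(stream: str) -> Iterator[str]:
    """Block 1: split the stream into words (whitespace-delimited here)."""
    yield from stream.split()


def demo_address(word: str) -> int:
    """Block 3 STAND-IN: first character code, then 16 bits of CRC-32."""
    data = word.encode("cp1251", errors="replace")
    return (data[0] << 16) | (zlib.crc32(data) & 0xFFFF)


def fill(features: Iterable[str],
         address_of: Callable[[str], int]) -> Set[int]:
    """BEIP formation: only blocks 1, 3 and 4 take part."""
    table: Set[int] = set()  # block 4 stand-in: addresses holding logic one
    for feature in features:
        for word in segment(feature):
            table.add(address_of(word))
    return table


def monitor(stream: str,
            address_of: Callable[[str], int],
            table: Set[int]) -> Iterator[str]:
    """Traffic monitoring: blocks 1, 3, 4 and 5 take part."""
    for word in segment(stream):
        # Block 5: does the calculated address hold a logic one?
        if address_of(word) in table:
            yield word


beip = fill(["information", "security"], demo_address)
hits = list(monitor("the report mentions information security",
                    demo_address, beip))
print(hits)  # ['information', 'security']
```

Passing the address function as a parameter mirrors the description: blocks 1, 4 and 5 stay the same, and only the calculation in block 3 changes relative to the prototype.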

It is convenient to examine the claimed method by starting from the actions performed in the prototype method and supplementing them with the actions needed to obtain the claimed technical solution.

Let 39 identification features in each of 6 natural languages be chosen as vocabulary features for selecting information messages, i.e. N = 234 vocabulary features, covering 2 different letter-symbol writing systems (Latin and Cyrillic). The extended ASCII code table is used as the character alphabet (AS): the space character corresponds to the decimal value 32, or the binary value 00100000; the lowercase letter "a" of the Latin writing system (English, German, Spanish, Italian and Lithuanian) corresponds to the decimal value 97, or the binary value 01100001; the lowercase letter "я" of the Cyrillic writing system corresponds to the decimal value 255, or the binary value 11111111. The AS contains a set of distinct characters sufficient to compose any of the N information features.
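The three character codes quoted above can be checked with a short sketch. It assumes that the "extended ASCII" table in question is Windows-1251 (`cp1251` in Python), which is consistent with "я" having the code 255; the text itself does not name the code page, and `code_of` is a hypothetical helper.

```python
def code_of(ch: str, encoding: str = "cp1251") -> int:
    """Return the single-byte code of one character in the given code page."""
    encoded = ch.encode(encoding)
    if len(encoded) != 1:
        raise ValueError(f"{ch!r} is not a single-byte character in {encoding}")
    return encoded[0]


for ch in (" ", "a", "я"):
    code = code_of(ch)
    # Decimal value and 8-bit binary value, as listed in the text.
    print(repr(ch), code, format(code, "08b"))
```

Under this assumption the loop prints 32 / 00100000 for the space, 97 / 01100001 for "a", and 255 / 11111111 for "я".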

In the prototype method, the number of rows in the base of reference features is defined as N_k = N/K, where K = 0.2, so the minimum number of rows in the base of reference information features is N_k = 234/0.2 = 1170.

The amount of memory V_RIFB required to store the reference information features in the RIFB is calculated in the prototype method as:

where N_k is the number of rows in the RIFB and d_max is the minimum number of bits needed to store the morphological coefficient of maximum bit length. For example, for the dictionary feature "информация" ("information") in the ASCII encoding, the decimal codes of the characters of the lowercase word are 232 237 244 238 240 236 224 246 232 255. The morphological coefficient calculated by the prototype method for this word is 963701989565045000000000 in decimal, which corresponds to the binary value 110001110111000100111010111000000101110000100100101001000000000 and occupies 63 bits of memory. Because morphological coefficients in the prototype method differ in length and therefore need different amounts of memory for their binary values, the information feature with the largest morphological coefficient has to be determined. Storing the morphological coefficient of a Russian word of at most 12 characters that ends in "я" requires, in the limit, 64 bits (8 bytes) of memory. In addition, addressing the 1170 rows of the RIFB requires at least 12 bits of address per row. The effective memory utilization factor η(M₁) of the prototype method is therefore 9.5 bytes per feature (64 + 12 = 76 bits). Any change in the number of information features, whether an increase or a decrease, invalidates the previously computed RIFB, because every RIFB address depends directly on the changing value N_k, which in turn depends on the number of information features N. As N grows, the number of rows grows five times as fast, since N_k = N/K with K = 0.2 (per the prototype method). The dependence of the number of rows N_k on the number of features N is shown in Figure 2. As the number of information features increases from 1 to 10000, the address field must grow correspondingly from 3 bits to 2 bytes per address.
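The growth of the prototype's address field can be illustrated with a short sketch. It assumes that an address only has to distinguish the N_k rows from one another, so its width is the bit length of the largest row index; `prototype_rows` and `address_bits` are hypothetical helper names. Under this counting the end points quoted in the text (3 bits for N = 1, 16 bits for N = 10000) are reproduced, while 1170 rows come out at 11 bits, one fewer than the 12 bits the description budgets.

```python
ROWS_PER_FEATURE = 5  # N_k = N / K with K = 0.2, i.e. five rows per feature


def prototype_rows(n_features: int) -> int:
    """Number of RIFB rows N_k in the prototype method."""
    return n_features * ROWS_PER_FEATURE


def address_bits(n_rows: int) -> int:
    """Bits needed to give each of n_rows rows a distinct index 0..n_rows-1."""
    return max(1, (n_rows - 1).bit_length())


for n in (1, 234, 10_000):
    rows = prototype_rows(n)
    print(f"N={n}: N_k={rows}, address width={address_bits(rows)} bits")
```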

In the proposed method the address is fixed at 3 bytes and does not depend on the current number of information features. For 39 identification features in each of 6 natural languages, i.e. N = 234 dictionary features, in two letter-symbol writing systems (Latin and Cyrillic), the three-byte addresses and the corresponding information features in Russian, English, German, Spanish, Italian and Lithuanian are given in the tables of Figures 3, 4, 5, 6, 7 and 8, respectively.

For example, for the dictionary feature "информация", the address obtained by taking the code of the first character of the word and evaluating expressions 4 and 5 is, in decimal, 232 13 159, which corresponds to the binary value 11101000 00001101 10011111. For the Latin writing system, the table of Figure 9 collects the information features of English, German, Spanish, Italian and Lithuanian that begin with the character "i", which is common to these languages and has the decimal value 105 in the ASCII code table. The exception is the German noun "Information", which begins with the capital "I", with the value 73 in the ASCII code table. The data in Figure 9 show that the RIFB addresses are unique for words that are spelled differently, even though they have the same meaning in Russian. The exception is the word "importante", which is spelled identically in Spanish and Italian; this does not affect the technical solution of the claimed method.
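The first address byte and the binary form of the example address can be checked with a small sketch. It assumes the code table is Windows-1251 (`cp1251`), which matches the code 232 given for "и"; `first_byte` and `as_binary` are hypothetical helpers. The second and third bytes are taken from the text as given, since expressions 4 and 5 are not reproduced here.

```python
# Assumption: characters are encoded with cp1251 (not named in the text).

def first_byte(word: str) -> int:
    """First address byte: the code of the word's first character."""
    return word.encode("cp1251")[0]


def as_binary(address) -> str:
    """Render an address (sequence of byte values) as 8-bit groups."""
    return " ".join(format(b, "08b") for b in address)


# Address of "информация" as stated in the description.
address = (first_byte("информация"), 13, 159)
print(address, as_binary(address))

# Lowercase and uppercase initials give different first bytes.
print(first_byte("information"), first_byte("Information"))
```

Because the first byte is the raw character code, "information" and "Information" fall into different address ranges, which is consistent with the German noun being listed as an exception.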

When a new feature has to be detected in the information stream, the existing addresses need no adjustment or recalculation: only the address of the new information feature is calculated, and 1 bit of information, the logical one level, is written to memory at that address. The effective memory utilization factor η(M₂) of the proposed method is 3.125 bytes per feature for an information feature in any letter-symbol writing system and of any length (number of characters), because 3 bytes form the address of the memory cell and the content at that address is a single bit, logical 1 or 0.
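One way to picture this storage scheme is a directly addressed bit array with one bit per possible three-byte address. The sketch below is an illustration under assumptions, not the patented implementation: the class name `FeatureBitmap`, the big-endian packing of the three bytes, and the second sample address are all hypothetical; only the address 232 13 159 comes from the description.

```python
from typing import Tuple

Address = Tuple[int, int, int]


class FeatureBitmap:
    """One flag bit per possible three-byte address (2**24 bits = 2 MiB)."""

    def __init__(self) -> None:
        self._bits = bytearray(2**24 // 8)  # all logical zeros initially

    @staticmethod
    def _index(address: Address) -> int:
        # Assumption: first byte is the most significant one.
        b1, b2, b3 = address
        return (b1 << 16) | (b2 << 8) | b3

    def add(self, address: Address) -> None:
        """Write a logical one at the given address."""
        i = self._index(address)
        self._bits[i >> 3] |= 1 << (i & 7)

    def contains(self, address: Address) -> bool:
        """Return True if a logical one is stored at the address."""
        i = self._index(address)
        return bool(self._bits[i >> 3] & (1 << (i & 7)))


rifb = FeatureBitmap()
rifb.add((232, 13, 159))          # address of "информация" from the text
rifb.add((105, 1, 2))             # hypothetical address of a later feature
print(rifb.contains((232, 13, 159)), rifb.contains((232, 13, 160)))
```

Adding the second address touches only its own bit, so the earlier entry stays valid without any recalculation.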

Comparing the memory utilization factors of the prototype method, η(M₁), and of the proposed method, η(M₂), shows a 3.04-fold improvement in memory efficiency.
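The arithmetic behind this comparison can be reproduced from the figures stated in the description (64 bits of coefficient plus 12 bits of address for the prototype, 3 bytes of address plus 1 flag bit for the proposed method). The variable names are illustrative only.

```python
from fractions import Fraction

prototype_bits = 64 + 12    # morphological coefficient + row address
proposed_bits = 3 * 8 + 1   # three-byte address + one flag bit

eta_m1 = Fraction(prototype_bits, 8)  # bytes per feature, prototype method
eta_m2 = Fraction(proposed_bits, 8)   # bytes per feature, proposed method
gain = eta_m1 / eta_m2                # improvement factor

print(float(eta_m1), float(eta_m2), float(gain))
```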

Removing the morphological coefficient calculation of the prototype method, while changing the functions that determine the address of a reference information feature and populate the RIFB memory, can be implemented on the currently available component base, for example on any commercially produced field-programmable gate array (FPGA).

Thus, it follows from the essence of the claimed method as described that it detects, in information streams, specified reference information features presented in different letter-symbol writing systems. This confirms the positive effect of the technical solution of the proposed method.

Claims

A method for detecting identification features for different letter-symbol writing systems, in which, to form a base of reference information features to be detected in an information stream, a set of N_i reference information features is selected; an alphabet of symbols is formed; the j-th symbol n_j of the alphabet, where j = 1, 2, ..., S, is assigned the value of its position in the alphabet; for each i-th reference feature, where i = 1, 2, ..., N, the corresponding numerical value is computed and the address of the reference information feature is calculated using a hash function of a given form; the i-th reference information feature is then stored in the base of reference information features at the position corresponding to its address; from each fragment of the received information stream, a group of binary symbols of an information feature located between two adjacent spaces is extracted and decoded into the form of an information feature; the numerical value and the address of the feature are computed; and the resulting numerical value of the information feature is compared with the numerical value of the reference information feature stored at the same address in the base of reference information features; characterized in that the alphabet of symbols includes all possible characters of the different letter-symbol writing systems and corresponds to one of the standard character encodings; the i-th identification feature is represented, in a single encoding for the different letter-symbol writing systems, as a word of consecutive numerical values corresponding to the encoding values of the characters n_j of the word; and words are converted, from the first character to the last, into a sequence of three bytes

where

- the first byte, which corresponds to the binary value of the code of the first character of the word

,

- the second byte, determined by the expression

, where bin is the function giving the binary representation of a decimal number, m_i is the number of characters in the i-th identification feature, 3 is the smallest odd prime number, and mod is the modulo (remainder) function,

- the third byte, determined by the expression

, the three-byte binary sequence A_i determines an address in the allocated memory; for the reference information features required for selection, logical one values are written to memory at their calculated addresses, while for the remaining information features, which take no part in the address calculation, logical zero values are stored; when monitoring the information stream, a three-byte address is determined for each word extracted from the stream and, if a logical one is present at that memory address, a decision is made that a reference information feature is present in the information stream; when a new reference information feature is added, its three-byte address is determined, a logical one is written at that memory address, and the detection of identification features in information streams then continues.
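The monitoring step of the claim can be sketched as a loop over space-separated words. This is only an illustration under stated assumptions: the expressions for the second and third address bytes are not reproduced in this text, so the address function is passed in as a parameter; the dictionary-based `demo_address_of` is a hypothetical stand-in that knows only the one address given in the description, and a Python set stands in for the flag memory.

```python
from typing import Callable, Iterator, Set, Tuple

Address = Tuple[int, int, int]


def detect(stream: str,
           address_of: Callable[[str], Address],
           marked: Set[Address]) -> Iterator[str]:
    """Yield each word whose three-byte address is flagged with a logical one."""
    for word in stream.split(" "):   # words lie between adjacent spaces
        if word and address_of(word) in marked:
            yield word


# Hypothetical stand-in for the real address calculation: it returns the
# address stated in the description for one word and (0, 0, 0) otherwise.
KNOWN = {"информация": (232, 13, 159)}


def demo_address_of(word: str) -> Address:
    return KNOWN.get(word, (0, 0, 0))


marked_addresses = {(232, 13, 159)}  # flags set for the reference features
hits = list(detect("передана информация по каналу",
                   demo_address_of, marked_addresses))
print(hits)
```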