EA037156B1 - Method for template match searching in a text

Info

Publication number: EA037156B1
Application number: EA201800581A
Authority: EA
Inventors: Дмитрий Андреевич Сурков; Кирилл Андреевич Сурков; Юрий Михайлович Четырько; Иван Владимирович Шимко; Владислав Александрович Савёнок
Original assignee: Общество С Ограниченной Ответственностью "Незабудка Софтвер"
Priority date: 2018-09-24
Filing date: 2018-09-24
Publication date: 2021-02-12
Also published as: EA201800581A1

Abstract

The invention relates to a method for searching a text for template matches and enables rapid detection of a set of concepts, entities and relations in natural-language texts. The technical result is higher search speed, completeness and accuracy, together with safe and language-independent searching. The text being searched is tokenized. Using a template description language, a set of templates is created as sequences, variations and repetitions of text tokens and occurrences of other templates. The set of templates is translated into search expressions with search indexes. All matches of all templates are found in a single sequential scan of the text tokens. For each token, the search expressions beginning with that token are looked up in the search indexes, and match candidates are created for the corresponding templates and added to the set of candidates. As the tokens are scanned, matches and mismatches are removed from the set of candidates.

Description

Limited Liability Company "Незабудка Софтвер" (BY)

(72) Inventors: Surkov Dmitry Andreevich, Surkov Kirill Andreevich, Chetyrko Yuri Mikhailovich, Shimko Ivan Vladimirovich, Savenok Vladislav Alexandrovich (BY)

(74) Representative: Surkov D.A. (BY)

(56) US-A1-20130080886 WO-A3-2008103398 US-B2-7054855 US-A1-20110295595 US-B2-8155946 RU-C2-2605077

037156 B1

(57) The invention relates to a method for searching a text for template matches and enables rapid identification of a set of concepts, entities and relations in natural-language texts. The technical result is an increase in the speed, completeness and accuracy of the search, together with search safety and language independence. The text being searched is split into tokens. In a template description language, a set of templates is created in the form of sequences, variations and repetitions of text tokens and occurrences of other templates. The set of templates is translated into search expressions with search indexes. The search for all matches of all templates is performed in one sequential scan of the text tokens. For each token, the search expressions starting with that token are looked up in the search indexes, and match candidates for the corresponding templates are created and entered into a set of candidates. As the tokens are scanned, matches and mismatches are removed from the set of candidates.
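
The single-scan idea summarized above can be sketched as follows. This is a minimal illustration under simplifying assumptions, not the patented implementation: templates are reduced to fixed token sequences (no variations, repetitions or nested templates), and the names `tokenize`, `build_index` and `find_matches` are hypothetical, introduced only for this example.

```python
import re
from collections import defaultdict


def tokenize(text):
    # Simplified tokenizer (an assumption): lowercase words and single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", text.lower())


def build_index(templates):
    # Search index: first token -> names of the templates that start with it.
    index = defaultdict(list)
    for name, tokens in templates.items():
        index[tokens[0]].append(name)
    return index


def find_matches(text, templates):
    tokens = tokenize(text)
    index = build_index(templates)
    candidates = []  # (template name, start position, number of tokens matched)
    matches = []     # (template name, start position, end position)
    for pos, token in enumerate(tokens):
        # Create candidates for every template that can start with this token.
        candidates.extend((name, pos, 0) for name in index.get(token, ()))
        survivors = []
        for name, start, matched in candidates:
            expected = templates[name]
            if expected[matched] != token:
                continue  # mismatch: the candidate is removed
            matched += 1
            if matched == len(expected):
                matches.append((name, start, pos + 1))  # match: reported and removed
            else:
                survivors.append((name, start, matched))
        candidates = survivors
    return matches


templates = {
    "greeting": ["good", "morning"],
    "city": ["new", "york"],
    "paper": ["new", "york", "times"],
}
print(find_matches("Good morning, New York Times readers", templates))
```

Each token is visited once, and the work per token depends on the number of live candidates rather than on the total number of templates, which is the property the abstract emphasizes.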

Technical field of the invention

The invention relates to a method for searching a text for template matches and enables rapid identification of a set of concepts, entities and their relations in natural-language texts.

Prior art

Many application software systems that automate various areas of human activity now include tools for analyzing natural-language text. The sources of text data in such systems are databases, document repositories, web publications, blog and forum posts, e-mail messages, social network messages, call-center transcripts, conversations in text messaging programs, and other systems. The task of text analysis in such systems is to identify concepts and entities in the text and to determine the relations between them. Text analysis is often accompanied by a stage in which the text data is enriched with meta-information about the concepts, entities and relations found, collectively called tags. The process of enriching text data with tags is known as text tagging.

The task of analyzing text data is currently solved in various ways. The best known and most widely used are full-text search, full-text indexing, machine learning methods, latent semantic analysis, analysis based on regular expressions, analysis based on formal grammars with or without regard to word morphology, and various combinations of these methods.

An early approach to text analysis is full-text search: an automated document search for given combinations of keywords and expressions in the full text, or in a substantial part of the text, of a document. Full-text search usually involves the following preparatory stages: splitting the text into words, removing noise words, extracting word stems, reducing words to their dictionary form, and performing statistical analysis. The morphology of the language of the analyzed text is often taken into account. Approaches of varying complexity are used to split text into words; among them is Unicode Standard Annex #29, which defines the most accurate rules for splitting text into sentences and words.

To speed up full-text search substantially when it is applied to large volumes of text data, full-text indexing is performed, i.e. a search index is created. A search index is a data structure, called a dictionary, that stores all the words of the text together with information about where in which documents they occur. Full-text indexing is not a mandatory stage of full-text search, which can instead be performed by finding a substring in a string using well-known algorithms. However, indexing significantly narrows the area to be searched and therefore makes the search much faster than a direct search in the text of the source documents. When a search index is built, so-called noise words, i.e. words that occur in practically all documents, are usually ignored. Such words often include prepositions, participles, interjections, particles, and so on. Noise words are, as a rule, also excluded from search queries.
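
A search index of this kind can be sketched in a few lines. The example below is only an illustration: the noise-word list, the whitespace-based word splitting and the names `build_search_index` and `lookup` are assumptions made for the sketch, not part of any particular search engine.

```python
from collections import defaultdict

# Illustrative noise-word list (an assumption); real systems use larger, language-specific lists.
NOISE_WORDS = {"the", "a", "of", "in", "and"}


def build_search_index(documents):
    # Dictionary: word -> list of (document id, word position) occurrences.
    index = defaultdict(list)
    for doc_id, text in documents.items():
        for position, word in enumerate(text.lower().split()):
            if word not in NOISE_WORDS:
                index[word].append((doc_id, position))
    return index


def lookup(index, word):
    # Return the occurrences without scanning the document texts again.
    return index.get(word.lower(), [])


documents = {"d1": "the cat sat in the hat", "d2": "a cat and a dog"}
index = build_search_index(documents)
print(lookup(index, "cat"))
```

A query for a word is then a dictionary lookup rather than a substring search over every document.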

To reduce the variability of word forms in natural language, stemming is used: the process of finding the stem of a given source word. Stemming removes the need to enumerate word forms in queries and search rules. However, the ambiguity of words and the imperfection of stemming algorithms increase the number of false matches.
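
The trade-off can be seen even in a deliberately naive suffix stripper. The function `naive_stem` and its suffix list below are hypothetical and far simpler than real stemming algorithms; the sketch only shows how one rule set both unifies word forms and produces a false match.

```python
# Hypothetical suffix list, checked in order; real stemmers use much richer rules.
SUFFIXES = ("ing", "ed", "es", "s")


def naive_stem(word):
    word = word.lower()
    for suffix in SUFFIXES:
        # Strip the first suffix that leaves at least three characters.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Word forms of one word collapse to a single stem, as intended.
print([naive_stem(w) for w in ("searches", "searching", "searched")])
# Unrelated words can collapse too, which yields a false match.
print(naive_stem("news"), naive_stem("new"))
```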

Statistical analysis is often used to extract keywords during full-text indexing. It involves counting how often words are used in the text and selecting the most frequently used ones. A key component of this approach is the mathematical model used to calculate word frequencies.
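
As a rough illustration, the simplest possible model is a raw occurrence count. The function `top_keywords` and the sample noise-word set are assumptions made for this sketch; practical systems typically use weighted models rather than plain counts.

```python
import re
from collections import Counter


def top_keywords(text, noise_words, n):
    # Count every word that is not a noise word and return the n most frequent ones.
    words = re.findall(r"\w+", text.lower())
    counts = Counter(w for w in words if w not in noise_words)
    return counts.most_common(n)


text = "the index stores words and the index maps words to documents; the index is fast"
print(top_keywords(text, {"the", "and", "to", "is"}, 2))
```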

Full-text search offers high performance and scales well in a distributed computing environment. However, because it is oriented toward keyword search, this approach is ineffective for complex text analysis tasks: its search accuracy is low and it cannot search for concepts, entities and their relations. This drawback prevents full-text search from being used for text tagging.

Machine learning methods are now widely used to analyze natural-language text. These methods are based on classifying objects represented by descriptions in a certain feature space. There are two groups of machine learning methods: supervised learning and unsupervised learning. In the first group, the goal of training is to obtain rules that can be used to classify new objects on the basis of those used for training. In the second group there is no training sample, and the task is clustering: combining objects into groups according to a given measure of their similarity or difference. Machine learning is actively used to extract keywords, to identify entities and relations, and to recognize the emotions, topic and language of a text.

Some machine learning methods are aimed at building a domain ontology containing information about objects and their relations. The objects are nouns, and the relations are the verbs with which the nouns are associated. Machine learning methods are known that reveal such semantic dependencies using mathematical models of neural networks.

Among the methods for analyzing natural-language text is latent semantic analysis, which addresses the problem of finding similar texts through the similarity of word meanings in the context of their use. The meanings of words are considered similar if the words are used in similar contexts.

Applying machine learning methods requires the following preparatory stages of text processing: splitting the text into tokens, normalizing words, and morphological analysis. This is followed by a grammatical analysis phase, in which machine learning methods are actively used. Grammatical analysis is divided into shallow and full. Shallow analysis can reveal some semantic constituents of a text: noun, verb and prepositional phrases. Full grammatical analysis represents the structure of the whole sentence as a parse tree. Full parsing often gives rise to ambiguities, and resolving them is a rather laborious task. Moreover, from a practical point of view, full parsing is not needed to obtain an acceptable analysis result.

The main drawback of machine learning methods, which follows from the very nature of the approach, is that the results obtained cannot be explained. This is due to the nature of the mathematical models and to the classifier training process, which proceeds through complex modifications of numerical coefficients based on the training sample. Another drawback is the difficulty of training the system to recognize new entities and concepts. First, this requires a fairly large amount of new training data. Second, training the system to recognize new concepts may reduce the recognition accuracy for concepts added earlier. Finally, a significant drawback of machine learning methods is low performance. Both initial training and classification take a long time, which makes it difficult to use this approach for fast text tagging, in particular for repeated tagging of a large number of documents in a database.

Regular expressions are now widely used for text analysis. They are a means of searching for patterns in text in both machine-oriented and natural languages. This approach is characterized by good explainability of results and relative freedom in specifying search patterns. The basic concepts of regular expressions include character literals and their sequences, alternations and sets. Most modern regular expression implementations additionally allow quantifiers to be specified, support lookahead and lookbehind, and provide matching by character classes.

The main drawback of regular expressions is the lack of optimization for finding many patterns in a single pass over the text. As the number of patterns grows, performance degrades linearly, because the search for each pattern requires a separate scan of the text. Another drawback of this approach is that it works with text data at the level of characters rather than words or tokens, and that patterns cannot refer to other patterns, including recursively defined ones, which reduces the speed, accuracy and completeness of natural-language text analysis. A further drawback of regular expressions is that, for certain patterns, there are input data sets on which enumerating all possible matches leads to frequent backtracking through the text and a severe slowdown of the search, which the user perceives as an infinite loop. A significant drawback of regular expressions is also the low readability of pattern descriptions: when writing an expression there is a high probability of making a mistake, and previously written patterns are hard to understand.
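
The linear cost mentioned above follows directly from how a set of independent patterns is usually applied. The sketch below uses Python's standard `re` module; the function `find_all`, the pattern names and the sample text are assumptions made for illustration.

```python
import re


def find_all(patterns, text):
    matches = []
    passes = 0
    for name, pattern in patterns.items():
        passes += 1  # every pattern triggers its own scan of the whole text
        for m in re.finditer(pattern, text):
            matches.append((name, m.group()))
    return matches, passes


patterns = {
    "date": r"\d{4}-\d{2}-\d{2}",
    "email": r"[\w.]+@[\w.]+",
}
matches, passes = find_all(patterns, "Filed 2018-09-24 by info@example.com")
print(matches, passes)
```

With k patterns the text is scanned k times, so doubling the number of patterns roughly doubles the search time.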

More recent approaches to analyzing natural-language text are based on sets of formal grammars. The rules for defining formal grammars place one nonterminal symbol on the left-hand side of an expression, and terminal and nonterminal symbols connected by operators on the right-hand side. Terminal symbols (terminals) are objects, words and symbols of the language that have a specific symbolic value. Nonterminal symbols (nonterminals) are objects that denote some entity of the language and have no specific symbolic value. Operators link terminals and nonterminals into sequences (chains), sets of alternatives (variations), repetitions, and other constructs.

A well-known example of formal grammars is Backus-Naur form. In this formal system, the right-hand side of a rule may specify sequences and alternatives of terminals and nonterminals. A number of modifications of Backus-Naur form are also known that make it possible to describe repetitions and optional elements. They allow a part of an expression to be skipped or repeated an arbitrary number of times, or repeated an exact number of times given as a range of repetition counts.

Although the syntax of grammars in Backus-Naur form is easier to read than the syntax of regular expressions, grammars of this kind are oriented toward parsing machine-oriented text into a root nonterminal and are poorly suited to analyzing and tagging natural-language text.

A well-known example of the use of formal grammars for analyzing natural-language text is the JAPE pattern language and the system based on it. JAPE is a rule language for marking up text in the form of annotations attached to contiguous fragments of the processed text. Text analysis involves the following stages:

splitting the text into tokens: numbers, punctuation marks, symbols, whitespace. The token type is stored in the corresponding annotation;

extracting named entities according to lists of them in a text file, using a so-called gazetteer. Each list contains a separate set of entities: organizations, cities, names, and so on;

splitting the text into sentences. When sentences are extracted, a list of abbreviations from the gazetteer is used to distinguish the end of a sentence from other uses of punctuation marks;

splitting the text into sentences using regular expressions;

identifying parts of speech. A built-in set of rules is used, and the identified parts of speech are stored in annotations;

semantic annotation, which uses rules written in the JAPE language and the annotations obtained in the previous stages.
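
The way such stages build on each other's annotations can be sketched as a small pipeline. The code below is not JAPE and does not use the API of the system based on it; the annotation layout and the names `token_stage`, `make_gazetteer_stage` and `run_pipeline` are assumptions made for illustration, and only the first two stages are shown.

```python
import re


def token_stage(text, annotations):
    # Stage 1: split the text into tokens and store each token's kind in an annotation.
    for m in re.finditer(r"\d+|\w+|[^\w\s]", text):
        s = m.group()
        if s.isdigit():
            kind = "number"
        elif re.fullmatch(r"\w+", s):
            kind = "word"
        else:
            kind = "punctuation"
        annotations.append({"type": "Token", "start": m.start(), "end": m.end(), "kind": kind})


def make_gazetteer_stage(lists):
    # Stage 2: mark tokens that appear in one of the entity lists.
    def stage(text, annotations):
        for ann in [a for a in annotations if a["type"] == "Token"]:
            word = text[ann["start"]:ann["end"]].lower()
            for list_name, entries in lists.items():
                if word in entries:
                    annotations.append(
                        {"type": "Lookup", "start": ann["start"], "end": ann["end"], "list": list_name}
                    )
    return stage


def run_pipeline(text, stages):
    # Stages run one after another; each sees the annotations added by the earlier ones.
    annotations = []
    for stage in stages:
        stage(text, annotations)
    return annotations


stages = [token_stage, make_gazetteer_stage({"city": {"minsk"}})]
annotations = run_pipeline("Office in Minsk, 2018.", stages)
print([a for a in annotations if a["type"] == "Lookup"])
```

Because every stage is a separate pass over the text or over the accumulated annotations, the total cost grows with the number of stages and rules.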

A JAPE grammar consists of a sequence of phases, each of which is a set of rules describing the actions to be performed for a certain text pattern. The rule sets form a cascade of annotation transducers that run sequentially. The left-hand side of a rule describes the pattern, and the right-hand side lists the annotation transformation operators. Annotations matched by the pattern on the left-hand side can be used on the right-hand side by referring to the labels associated with the pattern elements. The left-hand side of a rule is a pattern made up of annotations, in which it is possible to use properties of other annotations, special properties holding the annotated text string itself and its length, as well as comparison operators and regular expressions for strings. Patterns can be combined into sequences and alternatives, and repetitions of pattern elements can be specified. Several combinations of a pattern and an action on an annotation can be specified in one rule. To do this, the patterns are given unique labels, which are then used on the right-hand side of the rules. Labeled patterns can be nested in one another to create hierarchical annotations. The left-hand side of a rule may use so-called macros, i.e. substitution elements that can be reused when writing rules. The right-hand side of a rule may contain code in the Java programming language, which makes it possible to implement complex logic for operations on annotations. This code is included in the Java class generated from the grammar.

The JAPE-based approach to text analysis is fairly universal and extensible, but the price is the cumbersomeness of the resulting search patterns and the high complexity of the whole system. The most significant drawback is its orientation toward multi-pass sequential rule checking, as evidenced by the use of regular expressions and Java program code. As a consequence, performance degrades linearly as the number of patterns grows, and the approach is unsafe in the sense that excessively deep recursion and (conditionally) infinite looping of the search can occur.

An approach to natural language text analysis that takes morphology into account is also known. An example of its implementation is the LSPL language of lexico-syntactic templates and a system based on it. LSPL is a declarative language for specifying the lexical and grammatical properties of constructions extracted from Russian-language texts. A template in this language is defined by its name followed by an equals sign and, in the general case, a description of several alternative variants of the language construction being formalized. The | character separates alternatives. Different templates must have different names, and a name must be an arbitrary sequence of letters beginning with an uppercase letter. The order of elements in a template strictly corresponds to their order in the described construction. A template element can be a word element, a string element, a template instance, a repetition of elements, or a set of alternatives. The first three kinds are simple elements, and the others are complex elements, since they contain other elements. Optional elements, repetitions, and alternatives are analogous to the operators in JAPE and in augmented Backus-Naur form. A string element makes it possible to write a specific character string in a template, for example a particular word form, a punctuation mark, or a conventional symbol. Regular expressions can be used in such strings. Morphological features of words cannot be specified in string elements. A word element corresponds to a single word, for which the following features are specified:

part of speech, written with symbolic designations: N for noun, V for verb, A for adjective, Pr for preposition, Pn for pronoun, and so on;

lexeme name: the initial form of the word, which defines the set of all word forms of that word;

morphological characteristics of the word: case, gender, number (singular or plural), and others.

Each part of speech has its own set of morphological characteristics, some of which are fixed, for example the gender of a noun, while others are variable, for example the case of a noun. LSPL allows arbitrary words, denoted by the letter W, to be used in templates. Specific values of variable characteristics can be written in angle brackets after the lexeme name. For example, the following template specifies the adjective красный ("red") in the nominative case and feminine gender:

A<(красный),c=nom,g=fem>

Here A denotes an adjective, followed in angle brackets by its characteristics: красный is the initial form of the word, c=nom denotes the nominative case (c for case, nom for nominative), and g=fem denotes the feminine gender (g for gender, fem for feminine).

A template may specify repetition of elements: elements that can occur in the text several times in a row are enclosed in curly braces. The minimum and maximum number of repetitions can be given in angle brackets after the repeated element. Thus {A}<1,3> denotes an adjective that may be repeated from one to three times. If the second value is omitted, only the minimum number of repetitions is fixed. This handling of the default value is not entirely obvious and has to be explained to the user. Optional elements can be specified by enclosing them in square brackets. This notation is a shorthand for repetition from zero to one time: for example, the element [не] indicates that the string не ("not") may or may not match a fragment of the text. Optional elements and repetitions can consist of a set of alternatives, each of which can in turn be a sequence.
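The repetition semantics described above can be sketched in Python. This is only an illustration under stated assumptions, not the LSPL implementation: the function name match_repetition, the predicate is_adj, and the small ADJECTIVES set (which stands in for real part-of-speech tagging) are all hypothetical.

```python
from typing import Callable, List, Optional

def match_repetition(tokens: List[str], start: int,
                     is_match: Callable[[str], bool],
                     min_count: int,
                     max_count: Optional[int] = None) -> List[int]:
    """Return every end position reachable by repeating one element.

    The element is repeated at least min_count times and at most
    max_count times; max_count=None means only the minimum is fixed.
    """
    ends = []
    if min_count == 0:
        ends.append(start)  # zero repetitions is an acceptable match
    pos, count = start, 0
    while (pos < len(tokens) and is_match(tokens[pos])
           and (max_count is None or count < max_count)):
        pos += 1
        count += 1
        if count >= min_count:
            ends.append(pos)
    return ends

# Hypothetical stand-in for a part-of-speech check.
ADJECTIVES = {"big", "old", "red", "shiny"}

def is_adj(token: str) -> bool:
    return token in ADJECTIVES

tokens = ["big", "old", "red", "shiny", "car"]
bounded = match_repetition(tokens, 0, is_adj, 1, 3)     # like {A}<1,3>
open_ended = match_repetition(tokens, 0, is_adj, 1)     # like {A}<1>
optional = match_repetition(["не", "был"], 0,
                            lambda t: t == "не", 0, 1)  # like [не]
```

Returning all admissible end positions, rather than only the longest one, reflects the fact that a repetition can match the same stretch of text in several ways.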

Agreement conditions define the grammatical agreement of individual template elements and are written in angle brackets after all the elements being agreed. An agreement condition is expressed as equality of some or all of the common morphological characteristics of the words. For example, the following template describes a pair of words, a pronoun and a verb, that agree in number and gender:

PV = Pn V <Pn.n=V.n, Pn.g=V.g>

Here Pn is a pronoun and V is a verb, followed in angle brackets by the agreement conditions, where Pn.n=V.n (n for number) specifies agreement in number between the pronoun and the verb, and Pn.g=V.g specifies agreement in gender.

This notation uses compound names formed from the name of a template element and the name of the feature being agreed, separated by a period. When all common morphological characteristics of two template elements must agree, the condition can be written using only the names of the elements, without listing the characteristics.
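An agreement condition of this kind amounts to comparing feature values of two words. The sketch below is a minimal Python illustration. The dictionary representation of morphological characteristics and the function name agree are assumptions, as is the choice to treat a feature that is missing from either word as a failed agreement when it is listed explicitly.

```python
from typing import Dict, Iterable, Optional

Morph = Dict[str, str]  # feature name -> value, e.g. {"n": "sing"}

def agree(left: Morph, right: Morph,
          features: Optional[Iterable[str]] = None) -> bool:
    """Check agreement on the listed features.

    With features=None, all features common to both words are compared,
    which mirrors the short form that names only the elements.
    """
    if features is None:
        features = left.keys() & right.keys()
    return all(f in left and f in right and left[f] == right[f]
               for f in features)

pronoun = {"n": "sing", "g": "fem", "c": "nom"}
verb_ok = {"n": "sing", "g": "fem", "t": "past"}
verb_bad = {"n": "sing", "g": "masc", "t": "past"}

explicit_ok = agree(pronoun, verb_ok, ("n", "g"))    # <Pn.n=V.n, Pn.g=V.g>
explicit_bad = agree(pronoun, verb_bad, ("n", "g"))
all_common = agree(pronoun, verb_ok)                 # short form, like <Pn=V>
```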

When a template contains several elements and several agreement conditions, they can be written one after another. The only restriction is that an agreement condition must not precede the elements it refers to.

When describing a template, some morphological characteristics of words can be moved into template parameters, which are written after all of its elements and conditions. Only morphological characteristics of the template's word elements that have not been fixed in the template itself can be used as parameters. The following example uses template parameters to define the case and number of a noun:

ANp = [A] N1 N2<c=nom> <A=N1> (N1.c, N1.n)

Here A is an optional adjective, and N1 and N2 are nouns. For the noun N2 the restriction c=nom (nominative case) is given in angle brackets. The condition A=N1 in angle brackets then specifies grammatical agreement of all common morphological characteristics of the adjective A and the noun N1. Finally, the morphological characteristics of the first noun that serve as parameters of the template ANp are listed in parentheses: c for case and n for number. The case of the noun N2 cannot be used as a template parameter, since it is already fixed in the template itself.

A template with parameters declared in this way can later be instantiated, with the required arguments fixed, for use in other templates. For example, to create an instance of the ANp template above with the nominative case and singular number for the first noun, the following construction is used:

ANp<c=nom, n=sing>

Here ANp is the name of the template being instantiated, followed in angle brackets by c=nom, which fixes the first template parameter to the nominative case, and n=sing, which fixes the second template parameter to the singular number.

To check whether words of the text are present in some dictionary, templates may contain dictionary conditions, which take the form of a call to a Boolean function that checks membership in the dictionary.


The function name in such expressions can be regarded as the name of the dictionary being searched. Like agreement conditions, dictionary conditions are written in angle brackets. For example, <Dict(A1)> means that the word element A1 must be present in the dictionary Dict.
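A dictionary condition can be pictured as a Boolean function named after the dictionary. The following Python sketch is purely illustrative: the helper make_dictionary_condition, the word list, and the case-insensitive comparison are assumptions and are not taken from LSPL.

```python
from typing import Callable, Iterable

def make_dictionary_condition(words: Iterable[str]) -> Callable[[str], bool]:
    """Build a Boolean function that tests membership in a dictionary."""
    entries = frozenset(w.lower() for w in words)

    def condition(lexeme: str) -> bool:
        # Assumption: lookup ignores letter case.
        return lexeme.lower() in entries

    return condition

# The function name plays the role of the dictionary name, as in <Dict(A1)>.
Dict = make_dictionary_condition(["красный", "синий", "зелёный"])

in_dictionary = Dict("Красный")
not_in_dictionary = Dict("стол")
```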

A set of interrelated templates defines a formal grammar extended with additional conditions. To recognize templates, the procedure of matching a template against the text is applied sequentially, and its result is a set of match variants. A match variant is a contiguous text fragment that corresponds to the template, together with the specific values of the morphological characteristics of the words in that fragment. The values of the template elements are the corresponding text fragments. In some cases the same template can match the same text fragment in different ways, and matches can overlap.

The LSPL language of lexico-syntactic templates provides rich means for taking into account the morphological characteristics of words and their grammatical agreement. Compared with JAPE, the LSPL syntax is less cumbersome and relatively easier to understand, but the pronounced linguistic orientation of the approach harms ease of use and analysis performance. The orientation toward Russian is not so much a drawback as a necessity, since handling morphology requires close ties to a specific language. Templates are detected sequentially, which leads to a linear decrease in performance as the number of templates grows. Although regular expressions help increase the accuracy and completeness of the analysis, they reduce the safety of the approach, in the sense that excessively long enumeration of match variants can no longer be ruled out.

Deep analysis of the morphological features of a language is an important area of research, but it contributes little to solving the text tagging problem: grammatical agreement of words affects only a small percentage of results, since real natural language texts are not written with the aim of hindering search algorithms. Discarding adjacent words whose stems fit the template but which do not agree grammatically does not noticeably improve search quality. Thus, to simplify writing templates for languages with rich word formation, it is sufficient to use stemming, and in some cases simply to list the required word forms.

Another known method of text analysis is the approach based on Tomita-parser grammars. It is similar to the LSPL-based approach and uses keyword dictionaries and formal grammars that take morphological features of words into account to extract facts from natural language text. Tomita-parser grammar rules operate on chains. One chain corresponds to one sentence of the text. Subchains are extracted from a chain and are in turn interpreted as facts divided into fields. For example, the rule

S -> Noun;

describes a noun. By default, words are reduced to their initial form. In this rule, Noun is a terminal. In Tomita-parser, terminals include names of parts of speech, punctuation and arithmetic symbols, words in their initial form (lemmas), and some other special terminals, for example the terminal Word, which denotes any word. Terminals can appear only on the right side of rules. Nonterminals are written on the left side of rules. In this example, S is a nonterminal. If a nonterminal occurs only on the left side of rules, it is considered the root of the grammar, i.e. the start nonterminal. The start nonterminal can also be specified explicitly. To use a sequence of terminals and nonterminals on the right side of a rule, they are separated by spaces. For example, a rule that extracts an adjective followed by a noun is defined as follows:

S -> Adj Noun;

To impose restrictions on a terminal or nonterminal, Tomita-parser uses special attributes given in angle brackets after the corresponding terminal or nonterminal. For example, the attribute h-reg1 indicates that a word must begin with an uppercase letter. In addition, there are attributes indicating that the first letter and at least one more letter of the word must be uppercase, or that the word must be written entirely in uppercase letters. The attributes h-reg2 and l-reg are used for these purposes, respectively.

The repetition operator in the Tomita-parser grammar description language is denoted by the + sign and means that a terminal or nonterminal can occur in the chain one or more times. To mark an element as optional, it is enclosed in parentheses. For example, an adjective followed by several capitalized words, which may be preceded by a preposition, is described by the rule

S -> Adj (Prep) Word<h-reg1>+;

Here Adj is an adjective, Prep is a preposition, and Word is any word. The attribute h-reg1 in angle brackets specifies that Word must begin with an uppercase letter, and the + symbol specifies that Word can be repeated one or more times.
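To make the meaning of the rule S -> Adj (Prep) Word<h-reg1>+ concrete, the following Python sketch hand-codes a matcher for it over already tagged tokens. It is not how Tomita-parser works internally. The (text, part of speech) token format, the tag names, and the function match_rule are assumptions made for illustration.

```python
from typing import List, Optional, Tuple

Token = Tuple[str, str]  # (text, part-of-speech tag); the format is assumed

def match_rule(tokens: List[Token], start: int) -> Optional[int]:
    """Match Adj (Prep) Word<h-reg1>+ at position start.

    Returns the end position of the longest match, or None.
    """
    if start >= len(tokens) or tokens[start][1] != "Adj":
        return None
    after_adj = start + 1
    # The optional Prep gives two alternatives: skip it or consume it.
    candidates = [after_adj]
    if after_adj < len(tokens) and tokens[after_adj][1] == "Prep":
        candidates.append(after_adj + 1)
    best = None
    for first_word in candidates:
        end = first_word
        # Word<h-reg1>+ : one or more tokens starting with an uppercase letter.
        while end < len(tokens) and tokens[end][0][:1].isupper():
            end += 1
        if end > first_word and (best is None or end > best):
            best = end
    return best

with_prep = [("famous", "Adj"), ("in", "Prep"), ("New", "Word"),
             ("York", "Word"), ("today", "Adv")]
without_prep = [("old", "Adj"), ("Town", "Word"), ("hall", "Word")]
no_capital = [("old", "Adj"), ("town", "Word")]

end_with_prep = match_rule(with_prep, 0)
end_without_prep = match_rule(without_prep, 0)
end_no_capital = match_rule(no_capital, 0)
```

Both alternatives for the optional preposition are tried so that a capitalized preposition can still be counted as the first Word when that is the only way to match.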

The Tomita-parser grammar description language also provides an alternative operator, denoted by the | symbol. For example, forms of address to a person are described by the rule

FormOfAddress -> 'товарищ' | 'мистер' | 'господин';

Like the LSPL language of lexico-syntactic templates, Tomita-parser has means for analyzing morphology and specifying grammatical agreement of words. Grammatical agreement is specified by special attributes; for example, the attribute gnc-agr (gender, number, case agreement) is used for agreement in gender, number, and case.

Each chain defined by a Tomita-parser rule has a main word, whose grammatical features are inherited by the whole chain. By default the first word of the chain is the main word, but it can be set manually with the attribute rt. This may be needed because words are normalized relative to the specified main word. When an adjective and its associated noun are normalized, if the noun is not marked as the main word, the adjective will be reduced to the masculine gender, which may not coincide with the gender of the noun.

Tomita-parser makes it possible not only to impose restrictions on grammatical agreement, but also to specify the required morphological characteristics of words separately. Morphological characteristics are specified with attributes; for example, to search for a feminine word, the attribute gram with the value жен (feminine) should be set for it. Other morphological characteristics, such as number, case, verb aspect, adjective form, and so on, can be listed in this attribute, separated by commas.

To work at the character level with regular expressions, Tomita-parser provides the attributes wfm, wff, and wfl. The regular expression syntax used in Tomita-parser is based on that of the Perl language. For example, the following rule can be used to detect a date:

Date -> AnyWord<wff=/[1-2]?[0-9]{1,3}\.?/>
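The behaviour of the regular expression in this rule can be checked with Python's re module, whose syntax for this particular pattern coincides with Perl's. The sketch assumes that the expression must match the whole token, and the function name looks_like_date_token is hypothetical.

```python
import re

# The same pattern as in the rule above.
DATE_TOKEN = re.compile(r"[1-2]?[0-9]{1,3}\.?")

def looks_like_date_token(token: str) -> bool:
    """True if the entire token matches the pattern (an assumption)."""
    return DATE_TOKEN.fullmatch(token) is not None

year = looks_like_date_token("2018")        # optional 1 or 2, then 3 digits
day_with_dot = looks_like_date_token("24.")
too_large = looks_like_date_token("3018")   # 4 digits need a leading 1 or 2
too_long = looks_like_date_token("12345")
not_a_number = looks_like_date_token("год")
```

The pattern accepts any number of one to three digits, or a four-digit number starting with 1 or 2, optionally followed by a period, so it selects number-like tokens rather than validating calendar dates.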

Some grammars require not only morphological but also semantic information about the words of the text. Such information can be supplied with dictionaries. A dictionary consists of entries, each of which has a name and a set of words belonging to the entry. The words can be listed either in the dictionary description file or in a separate text file whose name is given in the dictionary description file. Once described, a dictionary must be imported into the root dictionary before it can be used. After that, the attribute kwtype can be used among the attributes of terminals; its value specifies the name of the dictionary that must contain the corresponding word.

Tomita-parser extracts chains of text on the basis of grammars in order to then interpret the chains as facts and obtain structured data. Fact types are described in a special format and contain the name of the fact and a description of the names and types of its fields. For each field it is specified whether it must be filled in, using the keywords optional and required. The description of how an extracted chain should be interpreted is contained in the grammar. The keyword interp is used for this, followed in parentheses by the name of the fact and the name of the fact field into which the corresponding chain should be placed.

Chains of text that fit the grammars are detected in Tomita-parser by a GLR parser. A GLR parser is a modified version of the LR parser algorithm intended for parsing with nondeterministic and ambiguous grammars. Like an LR parser, a GLR parser relies on a special data structure, the parse table, which encodes the syntax of the language being analyzed. The parse table of an LR parser allows only one state transition for a given parser state and input terminal. To support nondeterministic and ambiguous grammars, a GLR parser table may contain several transitions. A conflict is resolved by forking the parser stack into two or more parallel stacks, whose top elements correspond to each of the possible transitions. Subsequent input symbols are then used to determine the further transitions of all resulting branches. If some branch has no transition for some terminal, it is discarded as erroneous. In the best case, for a fully deterministic grammar, a GLR parser runs in linear time; in the worst case, in cubic time. This description holds for a single grammar, since checking a match amounts to reducing the input string to the start nonterminal of the grammar. Of all possible grammar matches, those covering the larger fragment of text are selected. After building the syntax tree, Tomita-parser creates facts in accordance with the given descriptions.

The Tomita parser offers functionality similar to that of the LSPL language of lexico-syntactic patterns: it allows rules of grammatical agreement between words and required morphological characteristics to be specified. As with LSPL, the emphasis on morphology complicates the template syntax. The Tomita parser copes well with extracting facts from text; in text tagging, however, the search is usually carried out by specific words of interest rather than by syntactic constructions such as verb or noun phrases. Using the Tomita parser for tagging is hampered by the bulkiness and complexity of its template syntax, as well as by its low performance, which results from its rich morphology-handling capabilities.

To summarize, the known approaches meet many, but not all, of the requirements placed on tools for identifying a set of concepts, entities and their relations in natural language texts. What is needed is a balanced template-based solution that achieves high speed by searching for the entire set of templates in a single pass over the text tokens; does not slow down linearly as the number of templates grows; provides a powerful template syntax in the form of formal grammars, with means for defining sequences (chains), variations (alternatives) and repetitions of text tokens and occurrences of other templates, including recursive templates; ensures acceptable accuracy and completeness of the search; is independent of the language; and is safe in the sense of ruling out excessively long enumeration of match variants.

Summary of the invention

One technical result of the invention is a higher speed of searching a text for template matches, achieved by first splitting the text into tokens and then checking all matches of all templates in a single sequential scan of the text tokens.

Another technical result of the invention is higher completeness and accuracy of template-based text search, achieved by using as templates formal grammars that allow sequences, variations and repetitions of text tokens and occurrences of other templates, including grammars of unlimited nesting depth, recursive grammars and parameterized grammars.

Another technical result of the invention is independence of the search from the language of the text, achieved by dispensing with morphological analysis of the text and by using variations in templates to account for word forms.

Another technical result of the invention is safety of the search, meaning that the search is kept from dragging on for a time that the user perceives as a hang or an endless search. This is achieved by limiting the number of match candidates created during the search or by limiting the amount of computer resources the candidates consume.

Another technical result of the invention is higher speed, completeness and accuracy in identifying a set of concepts, entities and their relations in natural language texts, in bounded time and regardless of the language.

In accordance with one aspect of the present invention, the listed technical results are achieved by performing the following steps:

the text is first split into tokens, which include at least words and word separators;

a set of templates is created in the template description language, in which each template is a formal grammar consisting of at least sequences and/or variations and/or repetitions of text tokens and/or occurrences of other templates, and admitting parameters intended for generalizing templates and/or refining search results;

the set of templates is translated into trees of search expressions with search indexes, which make it possible, for a given token or a given template name, to quickly find all search expressions that directly or indirectly begin with the given token or template occurrence;

a variable set of candidates is created, each of which stores information about the match of a text fragment with the corresponding search expression and serves to decide whether the text fragment continues to match the template or not;

the text tokens are then scanned sequentially, and for each token at least the following actions are performed: all templates beginning with the current token are looked up in the search indexes, candidates are created for checking matches of the text with these templates, and the candidates are added to the set of candidates; for each candidate in the set, a decision is made on a full match, a mismatch or a partial match of the text with the corresponding template, depending on the value of the current token; for each template match, a decision is made on a full match, a mismatch or a partial match of the candidates that refer to this template;

to account for different variants of the text matching the templates being checked, logical copies are created and added to the set of candidates for those candidates that admit different variants of matching the text up to and including the current token; the logical copies inherit the shared state that corresponds to the tokens already scanned and differ in the part of the state that corresponds to the current token;

if the number of candidates created during the search and/or the amount of computer resources consumed by the candidates exceeds a specified limit, then, to keep the search from dragging on, the set of candidates is reduced by removing some or all of the candidates, which is equivalent to cutting off the text and starting a new search from the current token.
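The scanning step described above can be illustrated with a deliberately simplified sketch. It assumes that every template is a flat, non-empty sequence of tokens (no variations, repetitions or nested templates), so no logical copies of candidates are needed. The names `build_index`, `find_matches` and `max_candidates` are hypothetical and do not come from the described embodiment.

```python
def build_index(templates):
    """Map each first token to the names of the templates that start with it."""
    index = {}
    for name, pattern in templates.items():
        index.setdefault(pattern[0], []).append(name)
    return index


def find_matches(tokens, templates, max_candidates=1000):
    """Return (template name, start, end) for every match, in one pass over tokens.

    Assumption: templates are flat token sequences; real templates are grammars.
    """
    index = build_index(templates)
    candidates = []  # each candidate: (template name, start position, tokens matched so far)
    matches = []
    for pos, token in enumerate(tokens):
        # Safety limit: dropping all candidates is equivalent to cutting the
        # text here and starting a new search from the current token.
        if len(candidates) > max_candidates:
            candidates = []
        # Create candidates for all templates that begin with the current token.
        candidates.extend((name, pos, 0) for name in index.get(token, ()))
        survivors = []
        for name, start, matched in candidates:
            pattern = templates[name]
            if pattern[matched] != token:
                continue  # mismatch: the candidate is removed
            if matched + 1 == len(pattern):
                matches.append((name, start, pos))  # full match
            else:
                survivors.append((name, start, matched + 1))  # partial match
        candidates = survivors
    return matches
```

With `templates = {"city": ["New", " ", "York"], "new": ["New"]}` and the tokens `["In", " ", "New", " ", "York"]`, both templates are found in the same pass. The work per token depends on the number of live candidates, not on the total number of templates.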

List of drawings

Fig. 1 shows the architecture of the device;

Fig. 2 shows an example of a template and the corresponding expression tree;

Fig. 3 is a flowchart of the general algorithm of the search executor;

Fig. 4 is a flowchart of the general decision-making algorithm over the candidate tree, depending on the value of the current token;

Fig. 5 shows the step-by-step change in the state of the search executor when searching for a template with alternatives and exceptions of a variation in an input sequence of tokens.

Information confirming the feasibility of the invention

The information confirming the feasibility of the invention is set out below in the sections Device architecture, Template description language, Text processing rules, Splitting the input text into tokens, Representation of templates at the level of the search executor, Translation of the textual description of templates, Data structures of candidates and match results, Algorithm of the search executor, General decision-making algorithm, Decision rules, Testing recommendations, Performance tests, Conclusion.

Device architecture.

Part of the present invention is the variant of the device architecture shown in Fig. 1, which the authors built to implement the invention. According to Fig. 1, a typical device includes a translator from the template description language 12, a lexical text analyzer 19 and a search executor 21.

Templates 11, written in the template description language, are loaded from a text file and passed to the translator from the template description language 12, which parses them with the syntax analyzer 13 and then creates a template package 15 with the template package generator 14. In the course of its work, the translator 12 calls the lexical text analyzer 19 to convert text literals consisting of more than one token into sequences of tokens.

The template package 15 is the translated template descriptions in the form of in-memory data structures; it includes search expressions with search indexes 16 and a root search index 17.

Before the search, the lexical text analyzer 19 splits the input text 18 into tokens. The resulting token sequence 20 is passed, together with the template package 15, to the search executor 21.

The search executor 21 finds all template matches in the text in a single sequential scan of the text tokens 20. To do this, the search executor 21 creates and updates a data structure containing the set of match candidates 22. For each token, the search executor 21 updates the set of candidates 22 using the decision rules 23 on a full match, a mismatch or a partial match of the corresponding templates. Candidates of templates that fully match the text become match results 24. Candidates of templates that do not match the text are removed. Candidates of templates that partially match the text remain in the set of candidates 22.

The embodiment built by the authors uses object-oriented programming (OOP), chosen to better structure the data and algorithms and to simplify and speed up the implementation. However, neither the OOP methodology nor a programming language with OOP support is required to implement the invention. The same result can be achieved with structured programming or another methodology, and with any imperative programming language, including machine language (assembler).

Template description language.

Part of the present invention is one variant of the template description language, which allows the constructions sought in the text to be described declaratively by means of formal grammars. The grammars are listed in the text one after another. Each grammar begins with a name (identifier), followed by an equals sign and then an expression consisting of text literals, token types, references to other grammars, and operators. A grammar ends with a semicolon. In general, an expression may contain the following elements:

text literals, compared with or without regard to character case (i.e. uppercase and lowercase letters);

token types: an alphabetic, numeric or alphanumeric sequence of characters, a punctuation mark, a non-letter character, a whitespace character or a line break;

sequences with strict ordering of elements;

alternatives (variations), which enumerate the elements allowed at a given point of the template, as well as elements whose match is not allowed;

repetitions of any template element between a minimum and a maximum allowed number of times;

template scope, which requires one template to match inside another;

elements following one another across words, with the minimum and maximum allowed number of words between the elements;

references to other templates, i.e. occurrences of other templates.


The elements listed above may be combined arbitrarily. Their syntax and examples in the template description language are discussed below.

Text literals.

A text literal is a sequence of Unicode characters enclosed in single or double quotes:

"Minsk"

By default, text is compared without regard to character case. To specify text that is compared case-sensitively, an exclamation mark is written after the closing quote. For example:

"Minsk"!

When a template is translated, each text literal is split into tokens by the same algorithm as the text being searched. As a result, the original text literal is converted into a sequence of tokens.
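The case rule for text literals can be sketched as follows. This is an illustration under stated assumptions, not the translator of the described embodiment: the functions `parse_literal` and `literal_equals` are hypothetical, quotes inside a literal are not handled, and the literal is assumed to consist of a single token.

```python
def parse_literal(source):
    """Parse a quoted literal such as "Minsk" or 'Minsk'! into (text, case_sensitive)."""
    case_sensitive = source.endswith("!")  # '!' after the closing quote
    body = source[:-1] if case_sensitive else source
    if len(body) < 2 or body[0] not in "'\"" or body[-1] != body[0]:
        raise ValueError(f"not a quoted text literal: {source!r}")
    return body[1:-1], case_sensitive


def literal_equals(literal_source, token):
    """Compare a single-token literal with a token, honouring the '!' suffix."""
    text, case_sensitive = parse_literal(literal_source)
    if case_sensitive:
        return text == token
    # Assumption: case-insensitive comparison is approximated with casefold().
    return text.casefold() == token.casefold()
```

For instance, `literal_equals('"Minsk"', "MINSK")` holds, while `literal_equals('"Minsk"!', "MINSK")` does not.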

Token types.

Token types are used to search for tokens without specifying their exact string representation. The following token types are available:

Alpha - alphabetic token, a continuous sequence of letters;

Num - numeric token, a continuous sequence of digits;

AlphaNum - alphanumeric token, a continuous sequence of letters and digits that begins with a letter;

NumAlpha - alphanumeric token, a continuous sequence of letters and digits that begins with a digit;

Punct - punctuation mark: any mark that separates or highlights semantic parts of the text, sentences, phrases, words or parts of words, indicates relations between words, indicates the type of a sentence or its emotional coloring, or performs certain other functions;

Symbol - non-letter character, for example: &, %, $, #, _, {, } and others;

Space - space, tab or other separator without a glyph;

NewLine - line break;

Start - marker of the beginning of the text;

End - marker of the end of the text.

A token type is referred to by its identifier (name). All identifiers in the template description language are case-sensitive.
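A tokenizer that assigns the token types listed above might look like the following sketch. Several details are assumptions made for illustration only: the contents of `PUNCT` (the source does not enumerate the punctuation marks), the regular expression used for splitting, and the representation of Start and End as empty tokens framing the sequence.

```python
import re

# Assumed punctuation set; anything else that is not a letter, digit or
# whitespace is classified as Symbol.
PUNCT = set(".,;:!?-()\"'")

# Runs of letters/digits, line breaks, runs of other whitespace, or any single character.
TOKEN_RE = re.compile(r"[^\W_]+|\r\n|\n|[^\S\r\n]+|.", re.DOTALL)


def token_type(value):
    """Classify one chunk of text as one of the token types."""
    if value in ("\n", "\r\n"):
        return "NewLine"
    if value.isspace():
        return "Space"
    if value.isalpha():
        return "Alpha"
    if value.isdigit():
        return "Num"
    if value.isalnum():
        # Mixed letters and digits: the first character decides the type.
        return "AlphaNum" if value[0].isalpha() else "NumAlpha"
    return "Punct" if value in PUNCT else "Symbol"


def tokenize(text):
    """Split text into (type, value) pairs framed by Start and End markers."""
    tokens = [("Start", "")]
    tokens += [(token_type(value), value) for value in TOKEN_RE.findall(text)]
    tokens.append(("End", ""))
    return tokens
```

Calling `tokenize("A4 costs $5.\n")` yields the types Start, AlphaNum, Space, Alpha, Space, Symbol, Num, Punct, NewLine, End.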

Sequences.

Strict ordering of elements is specified by the sequence operator, where each element is a template. The + sign is used to join elements into a sequence. For example, the domain name ".com" can be specified by the template

"." + "com"

This notation is equivalent to the text literal ".com". However, a sequence obtained by splitting a text literal into parts without following the tokenization rules is not equivalent to the original text literal. For example, if the string constant "Minsk", which consists of a single token, is replaced with the sequence of letters

"M" + "i" + "n" + "s" + "k"

no match with the text Minsk will be found.
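The reason a letter-by-letter sequence fails can be seen in a small sketch that compares sequence elements with whole tokens. The tokenizer here is a simplified stand-in (runs of letters and digits form one token, every other character is its own token), and `find_sequence` is a hypothetical helper, not part of the described embodiment.

```python
import re


def tokenize(text):
    """Simplified tokenizer: runs of letters/digits, otherwise single characters."""
    return re.findall(r"[^\W_]+|.", text, re.DOTALL)


def find_sequence(elements, tokens):
    """Return the start indexes where consecutive tokens equal the elements (case-insensitive)."""
    wanted = [element.casefold() for element in elements]
    n = len(wanted)
    return [
        i
        for i in range(len(tokens) - n + 1)
        if [token.casefold() for token in tokens[i:i + n]] == wanted
    ]
```

The text `visit example.com` is tokenized into `visit`, a space, `example`, `.` and `com`, so the sequence of `.` and `com` is found at token index 3. The text `Minsk` is a single token, so a sequence of five one-letter elements has nothing to align with and returns no matches.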

Variations.

The variation operator lists several elements whose match is allowed at a given point of the template. As with a sequence, each element of a variation is a template. The variation operator can also specify templates whose match is not allowed. Such elements of a variation are called exceptions. Exceptions can cancel matches of alternatives that started matching at the same position in the text, and an exception may cover a larger span of text than the alternative itself. A variation is written as a comma-separated list of all alternatives and exceptions in curly braces. Exceptions are marked with the prefix sign ~. For example, a simplified template that finds a period used as a sentence separator can be written as follows:

{".", ~".net", ~".com"}

This template finds the period in the text Sentence. but does not find it in the text Sentence.net.
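The effect of exceptions can be sketched for the simple case in which every alternative and exception is a flat list of tokens. The function names are hypothetical, and choosing the longest matching alternative is an assumption of this sketch rather than something stated for variations.

```python
def match_at(pattern, tokens, pos):
    """True if the tokens starting at pos equal the pattern."""
    return tokens[pos:pos + len(pattern)] == pattern


def match_variation(alternatives, exceptions, tokens, pos):
    """Return the number of tokens matched at pos, or None.

    An exception that matches from the same position cancels every
    alternative, even if the exception covers more text than the alternative.
    """
    if any(match_at(exception, tokens, pos) for exception in exceptions):
        return None
    lengths = [len(alt) for alt in alternatives if match_at(alt, tokens, pos)]
    return max(lengths, default=None)
```

With the alternative `["."]` and the exceptions `[".", "net"]` and `[".", "com"]`, the period in the tokens `["Sentence", "."]` matches at position 1, while the period in `["Sentence", ".", "net"]` does not.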

Повторения.Repetitions.

Язык описания шаблонов позволяет задавать количество повторений любого самостоятельного элемента шаблона с помощью оператора повторений. Под самостоятельным элементом понимается та- 9 037156 кой элемент, который сам по себе является шаблоном, т.е., например, элемент исключения в вариации (знак ~ и следующий за ним шаблон) не является самостоятельным элементом, но сам шаблон, следующий за знаком ~, является таковым. Повторение задаётся с помощью квадратных скобок, записанных перед повторяемым элементом. Внутри квадратных скобок указывается минимальное и максимальное допустимое число повторений, разделённых знаком -. Например, повторение элемента X от трёх до пяти раз задаётся шаблономThe template description language allows you to specify the number of repetitions of any independent template element using the repetition operator. An independent element is understood as an element that is itself a template, i.e., for example, an exclusion element in a variation (the ~ sign and the following template) is not an independent element, but the template itself following the sign ~ is that. Repetition is specified using square brackets in front of the repeated element. Inside the square brackets, the minimum and maximum allowed number of repetitions are indicated, separated by the - sign. For example, repetition of element X from three to five times is specified by the pattern

[3-5] X[3-5] X

Если минимальное и максимальное допустимое число повторений совпадают, то допустимо указывать одно число. Например, повторение элемента X ровно три раза задаётся шаблономIf the minimum and maximum permissible number of repetitions are the same, then it is permissible to specify the same number. For example, repetition of element X exactly three times is specified by the pattern

[3] X[3] X

If the minimum and maximum allowed number of repetitions are not given explicitly, both are assumed to be one. The maximum may be omitted, which permits an arbitrarily large number of repetitions. In this case a + sign is placed after the minimum. For example, repeating element X one or more times is specified by the template

[1+] X

When the maximum is omitted, the search executor treats it as a sufficiently large number, which must nevertheless not be so large that it significantly slows down template checking. When several overlapping matches exist, the longest sequence whose match began earlier in the text is selected.

The minimum may be zero, in which case matching the element is not required; such elements are called optional. When an element either does not occur at all or occurs once, a question mark can be used as shorthand for repetition from zero to one time, [0-1].

Optional elements also affect the template that contains them. For example, a sequence of optional elements is itself considered optional, and a variation is considered optional if at least one of its alternatives is optional.
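The repetition notation described above can be illustrated with a small parser. This is a minimal sketch, not the translator of the invention: the function names `parse_repetition` and `is_optional` are hypothetical, and an omitted maximum is represented as `None` rather than as the "sufficiently large number" chosen by the search executor.

```python
import re
from typing import Optional, Tuple

# Hypothetical helper: turns the text of a repetition prefix such as
# "[3-5]", "[3]", "[1+]" or "?" into a (minimum, maximum) pair.
# maximum is None when the upper bound is omitted (the "+" form).
_REPEAT = re.compile(r"^\[\s*(\d+)\s*(?:-\s*(\d+)\s*|(\+)\s*)?\]$")


def parse_repetition(spec: str) -> Tuple[int, Optional[int]]:
    spec = spec.strip()
    if spec == "":
        return (1, 1)      # nothing written: exactly one occurrence
    if spec == "?":
        return (0, 1)      # shorthand for repetition from zero to one
    m = _REPEAT.match(spec)
    if m is None:
        raise ValueError(f"malformed repetition: {spec!r}")
    lo = int(m.group(1))
    if m.group(3):         # "[N+]": no upper bound
        return (lo, None)
    hi = int(m.group(2)) if m.group(2) else lo   # "[N]" means exactly N
    if hi < lo:
        raise ValueError(f"maximum below minimum: {spec!r}")
    return (lo, hi)


def is_optional(spec: str) -> bool:
    # An element is optional when its minimum repetition count is zero.
    return parse_repetition(spec)[0] == 0
```

For example, `parse_repetition("[3-5]")` yields `(3, 5)` and `parse_repetition("[1+]")` yields `(1, None)`.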

Standard templates.

The template description language provides predefined, or standard, templates that are automatically available in any user-defined template. Standard templates give a short notation for frequently used constructs. The following standard templates exist:

Any - any token except the start-of-text and end-of-text markers; equivalent to the variation {Alpha, Num, AlphaNum, NumAlpha, Space, Punct, Symbol, NewLine};

Word - a word; equivalent to the variation {Alpha, Num, AlphaNum, NumAlpha};

Blanks - one or more spaces or line terminators; equivalent to the template [1+] {Space, NewLine};

WordBreaks - one or more word separators, such as spaces, punctuation marks, non-alphabetic symbols and line terminators; equivalent to the template

[1+] {Space, Punct, Symbol, NewLine}.

The template description language does not specify how standard templates are implemented in the search executor. Although all of them can be expressed through the operators already described, a separate implementation is also permitted for performance reasons.
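One way to picture a "separate implementation" of the standard templates is a lookup table of allowed token types and repetition bounds. The sketch below only restates the four equivalences listed above; the table name `STANDARD` and the function `matches_standard` are assumptions made for illustration.

```python
from typing import Dict, FrozenSet, Optional, Sequence, Tuple

WORD_TYPES: FrozenSet[str] = frozenset({"Alpha", "Num", "AlphaNum", "NumAlpha"})
BREAK_TYPES: FrozenSet[str] = frozenset({"Space", "Punct", "Symbol", "NewLine"})

# name -> (allowed token types, minimum count, maximum count or None)
STANDARD: Dict[str, Tuple[FrozenSet[str], int, Optional[int]]] = {
    "Any": (WORD_TYPES | BREAK_TYPES, 1, 1),
    "Word": (WORD_TYPES, 1, 1),
    "Blanks": (frozenset({"Space", "NewLine"}), 1, None),
    "WordBreaks": (BREAK_TYPES, 1, None),
}


def matches_standard(name: str, token_types: Sequence[str]) -> bool:
    """Check whether a run of token types is a full match of a standard template."""
    allowed, lo, hi = STANDARD[name]
    n = len(token_types)
    if n < lo or (hi is not None and n > hi):
        return False
    return all(t in allowed for t in token_types)
```

Because Start and End are absent from every set, `matches_standard("Any", ["Start"])` is false, in line with the definition of Any.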

Scope.

The scope operator is used to find a match of one template inside another. It makes it possible to search for constructs within particular parts of the text, for example sentences or paragraphs. The scope operator is denoted by the @ symbol and has a left and a right side. The left side contains the template to be found, and the right side contains the template inside which the match of the left-side template must lie. One template is considered matched inside another if the left boundary of the first template's match is no earlier in the text than the left boundary of the second template's match, and its right boundary is no later in the text than the right boundary of the second template's match. For example, to find a match of template X inside template Y, use the template

X @ Y

Here Y may be either the name of another declared template or a template declared in place. The scope operator restricts repetition: the scope template itself may only be marked optional. The templates on the left and right sides of the operator have no such restriction.
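The boundary condition for the scope operator reduces to a comparison of two spans. The following sketch assumes matches are recorded as half-open token ranges; the `Span` type and the function `within_scope` are hypothetical names.

```python
from typing import NamedTuple


class Span(NamedTuple):
    start: int  # index of the first token of the match
    end: int    # index one past the last token of the match


def within_scope(inner: Span, outer: Span) -> bool:
    """True if the match `inner` lies inside the match `outer`.

    The left boundary of inner must be no earlier than that of outer,
    and the right boundary of inner no later than that of outer.
    """
    return outer.start <= inner.start and inner.end <= outer.end
```

Under this definition a match that coincides exactly with its scope still counts as being inside it.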

Following through words and separators.

Searching for particular words and constructs located at some distance from each other in the text is a common task. The template description language provides a dedicated operator for this, the follow-through-words operator. In this operator, a word means the standard template Word. The distance between the left and right sides is given in words and is written like a repetition, but without square brackets. When several overlapping matches exist, the shorter match is taken as the result. The reason is that, when searching for semantically related words, the smaller the distance between the words, the more likely they are related and are the target of the search. The follow-through-words operator also allows a template to be specified whose match is not permitted between the left and right sides. If several constructs must be prohibited, a variation should be used. In general form, the follow-through-words operator is written as follows:

X .. M-N ~Z .. Y

This defines Y following X at a distance of M to N words, with no match of template Z allowed between X and Y. An equivalent construct can also be expressed through the operators and standard templates already described:

X + [M-N](WordBreaks + {Word, ~X, ~Y, ~Z}) + ?{WordBreaks, ~Y} + Y

An important part of this example is the inclusion of templates X and Y as exclusions in the variation. This satisfies the requirement to choose the shortest of the overlapping matches: neither X nor Y can occur between X and Y. A special case of following through words is following through separators, i.e. following through zero words. The shorthand for this construct is the template

X ... Y

which can also be expressed by the equivalent sequence

X + ?{WordBreaks, ~Y} + Y

The template description language does not specify how this operator is implemented in the search executor: both expressing it through other operators and a separate implementation are permitted.
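The "expression through other operators" option can be pictured as a textual rewrite. The sketch below builds the equivalent sequences given above from the operands of the operator; `expand_follow` is a hypothetical helper, and the handling of a missing exclusion template (simply dropping ~Z) is an assumption, since the description only shows the form with Z.

```python
from typing import Optional


def expand_follow(x: str, y: str, lo: int, hi: Optional[int],
                  z: Optional[str] = None) -> str:
    """Rewrite "x .. lo-hi ~z .. y" using sequence, repetition and variation."""
    # Optional trailing separators that must not already be a match of y.
    tail = f"?{{WordBreaks, ~{y}}} + {y}"
    if lo == 0 and hi == 0:
        # Following through separators only, i.e. through zero words.
        return f"{x} + {tail}"
    if hi is None:
        rep = f"[{lo}+]"
    elif lo == hi:
        rep = f"[{lo}]"
    else:
        rep = f"[{lo}-{hi}]"
    # x and y are excluded so that the shortest overlapping match wins.
    exclusions = [f"~{x}", f"~{y}"] + ([f"~{z}"] if z else [])
    body = ", ".join(["Word"] + exclusions)
    return f"{x} + {rep}(WordBreaks + {{{body}}}) + {tail}"
```

Calling `expand_follow("X", "Y", 2, 4, "Z")` reproduces the general form shown above with M = 2 and N = 4.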

Sequential mention.

In some cases it is necessary to find certain words in the text, much as in full-text search. For this purpose the template description language provides the sequential mention operator, denoted by the & symbol. Templates are considered sequentially mentioned in the text if they occur in it at any distance and in any order. The sequential mention operator can be expressed through other operators; for example, the template

X & Y

can be replaced with the equivalent template

{X .. 0+ .. Y, Y .. 0+ .. X}

The template description language does not specify how this operator is implemented in the search executor: both expressing the operator through other operators and a separate implementation are permitted.

Operator precedence.

The following table lists the operators of the template description language in descending order of precedence.

Priority  Name                                     Designation
1         Exclusion of a variation element         ~
2         Repetition                               []
3         Variation                                {}
4         Sequential mention                       &
5         Sequence                                 +
6         Following through words and separators   .. and ...
7         Scope                                    @

If necessary, precedence can be set explicitly with parentheses. For example, in the template

[1-5] (X + Y)

the repetition applies to the entire sequence. Without the parentheses, the repetition would apply only to template X.

Named templates and tags.

If a template needs to be referenced from another template or returned in the search results, it must be given a unique name. A template name is an identifier, i.e. it starts with a letter or an underscore, followed by a sequence of letters, digits and underscores. Such a template is called a named template. The name is followed by an equals sign and then the template itself. The definition of a named template ends with a semicolon. For example, in the template

Z = X;

the named template Z is defined through template X. A match of X means a match of Z. The identifier Z can be used to refer to the template from other templates. For a template to be returned in the search results, the # sign must be placed before it. A template marked in this way is called a target template, or tag. For example, tag Z is defined through the sequence of X and Y by the template

#Z = X + Y;

Named templates that are not tags are used to build other templates, but they do not appear in the search results.
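The definition syntax above (optional # sign, identifier, equals sign, template body, semicolon) can be split into its parts with a regular expression. This is a deliberately naive sketch: `parse_definitions` is a hypothetical helper, it does not parse the template body, and it assumes that no semicolon occurs inside a text literal.

```python
import re
from typing import List, Tuple

# Identifier: a letter or underscore, then letters, digits and underscores.
# [^\W\d] matches any word character that is not a digit.
_DEFINITION = re.compile(r"\s*(#?)\s*([^\W\d]\w*)\s*=\s*(.*?)\s*;", re.S)


def parse_definitions(source: str) -> List[Tuple[str, bool, str]]:
    """Return (name, is_tag, body) for each "name = body;" definition."""
    return [(name, bool(tag), body)
            for tag, name, body in _DEFINITION.findall(source)]


def tag_names(source: str) -> List[str]:
    # Only templates marked with # are returned in search results.
    return [name for name, is_tag, _ in parse_definitions(source) if is_tag]
```

Applied to the two examples above, the first definition yields a named template that is not a tag and the second yields the tag Z.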

Text processing rules.

Rules for parsing the input text into tokens.

Parsing of the input text into tokens is implemented according to the following rules:

a) the beginning and end of the text must be emitted as zero-length tokens of types Start and End, respectively;

b) a sequence consisting of a carriage return character and a line feed character must be emitted as a token of type NewLine;

c) a sequence of alphabetic characters must be emitted as a token of type Alpha;

d) a sequence of digit characters must be emitted as a token of type Num;

e) a sequence of digit and alphabetic characters that starts with a letter must be emitted as a token of type AlphaNum;

f) a sequence of digit and alphabetic characters that starts with a digit must be emitted as a token of type NumAlpha;

g) a sequence of whitespace characters must be emitted as a token of type Space;

h) each punctuation mark must be emitted as a separate token of type Punct;

i) each non-alphabetic character that is not a punctuation mark must be emitted as a separate token of type Symbol.
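Rules a) to i) can be sketched as a single left-to-right scan over the characters. The code below is an illustrative simplification, not the lexical analyzer of the invention: `tokenize` is a hypothetical name, character classes come from Python's `str` methods and `unicodedata.category`, and a lone carriage return or line feed is also treated as NewLine, which is an assumption beyond rule b).

```python
import unicodedata
from typing import List, Tuple

Token = Tuple[str, int, int]  # (type, start offset, end offset exclusive)


def tokenize(text: str) -> List[Token]:
    tokens: List[Token] = [("Start", 0, 0)]          # rule a)
    i, n = 0, len(text)
    while i < n:
        ch, start = text[i], i
        if ch in "\r\n":                              # rule b)
            i += 2 if text.startswith("\r\n", i) else 1
            kind = "NewLine"
        elif ch.isalpha() or ch.isdigit():            # rules c) to f)
            has_alpha = has_digit = False
            while i < n and (text[i].isalpha() or text[i].isdigit()):
                has_alpha |= text[i].isalpha()
                has_digit |= text[i].isdigit()
                i += 1
            if has_alpha and has_digit:
                kind = "AlphaNum" if ch.isalpha() else "NumAlpha"
            else:
                kind = "Alpha" if has_alpha else "Num"
        elif ch.isspace():                            # rule g)
            while i < n and text[i].isspace() and text[i] not in "\r\n":
                i += 1
            kind = "Space"
        else:                                         # rules h) and i)
            i += 1
            is_punct = unicodedata.category(ch).startswith("P")
            kind = "Punct" if is_punct else "Symbol"
        tokens.append((kind, start, i))
    tokens.append(("End", n, n))                      # rule a)
    return tokens
```

Each token records only its type and offsets, so the token text can be recovered later by slicing the original string.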

Rules for translating search templates.

Translation of search templates from the template description language must be implemented in accordance with the following rules:

a) the textual description of the templates must be translated into an expression tree used by the search executor;

b) an attempt to translate an incorrect textual description of templates must raise a program exception carrying information about the error;

c) the text of the template description to be translated must be loaded from a file or taken from a text string;

d) text literals consisting of more than one token must be converted into sequences of tokens using the same tokenization rules that are applied to the input text;

e) during translation of the textual description of templates, some expressions in the expression tree may be replaced with other, equivalent expressions in order to optimize memory use and search speed.

General search rules.

The search algorithm must be implemented in accordance with the following rules:

a) the search for a set of templates in one document must be performed in a single scan of the token sequence corresponding to the document;

b) if several overlapping matches of the same template are found, the result must be the match that is longest in tokens and that began earlier in the text;

c) the maximum amount of memory used during the search must be bounded by the maximum number of match candidates considered simultaneously;

d) the search results must be representable in hierarchical form, where for each template element the matched portion of text is stored together with similar results for its nested elements;

e) the search results must be representable as a list that contains only tag matches, without details of nested elements;

f) only tag matches, i.e. matches of templates marked with the # sign, must be returned as search results.
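Rule b) can be read as a tie-breaking policy over the matches of one template. The sketch below implements one possible reading, in which an earlier start takes priority and, among matches with the same start, the longest wins; both this reading and the name `resolve_overlaps` are assumptions.

```python
from typing import List, Tuple

Match = Tuple[int, int]  # (start token index, end token index exclusive)


def resolve_overlaps(matches: List[Match]) -> List[Match]:
    """Keep non-overlapping matches of a single template."""
    # Earlier start first; for equal starts, the longer match first.
    ordered = sorted(matches, key=lambda m: (m[0], -(m[1] - m[0])))
    kept: List[Match] = []
    for m in ordered:
        if kept and m[0] < kept[-1][1]:
            continue  # overlaps a match that was already kept
        kept.append(m)
    return kept
```

For the candidates (0, 3), (0, 5), (4, 9) and (6, 8), this keeps (0, 5) and (6, 8): (4, 9) is dropped because it overlaps a match that started earlier.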

Rules for searching for text literals and token types.

The search for text literals and token types must be implemented in accordance with the following rules:

a) the search for text literals must respect the specified case sensitivity;

b) the search for a token type must rely on the token type assigned when the input text was parsed into tokens, without additional analysis of the token text.

Rules for searching for sequences.

The search for sequences must be implemented in accordance with the following rules:

a) any template may be given as an element of a sequence;

b) a sequence must be considered matched when all of its mandatory elements have matched;

c) when optional elements of a sequence are searched for, both variants of the sequence match must always be considered, with the optional element and without it, even when a match of the optional element has been found;

d) a sequence all of whose elements are optional must be considered optional.

Rules for searching for variations.

The search for variations must be implemented in accordance with the following rules:

a) any template may be given as an alternative;

b) any template may be given as an exclusion;

c) when a match of a variation is searched for, the matches of each of its alternatives must be considered independently;

d) a match of an exclusion template must cancel the match of only those alternative templates that matched, or began to match, from the same starting position as the exclusion template;

e) a variation in which at least one alternative is optional must be considered optional.

Rules for searching for repetitions.

The search for repetitions must be implemented in accordance with the following rules:

a) a repetition may be specified without restriction for any template except a scope template;

b) for a scope template, apart from the default repetition value, only repetition from zero to one may be specified; other values must be prohibited;

c) a template must be considered repeated if it matches several times in succession, with each next match beginning at the token that follows the last token of the previous repetition;

d) for an element whose number of repetitions is given as a range, all match variants must be considered;

e) elements whose minimum number of repetitions is zero must be considered optional;

f) for elements with no maximum number of repetitions specified, this number must be chosen so that the time needed to find matches does not exceed the response timeout of the calling system.
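The optionality rules stated for sequences, variations and repetitions can be combined into one recursive check over an expression tree. The `Expr` node type and the `is_optional` function below are hypothetical. Treating a repetition of an optional child as optional is an assumption added for consistency; the rules above state only the zero-minimum case.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Expr:
    kind: str                                  # "token", "sequence", "variation" or "repeat"
    children: List["Expr"] = field(default_factory=list)
    min_count: int = 1                         # used only by "repeat"
    max_count: Optional[int] = 1               # None means no upper bound


def is_optional(e: Expr) -> bool:
    if e.kind == "repeat":
        # Zero minimum makes the element optional (repetition rule e).
        # Assumption: repeating an optional child is optional as well.
        return e.min_count == 0 or is_optional(e.children[0])
    if e.kind == "sequence":
        # Optional only if every element is optional (sequence rule d).
        return all(is_optional(c) for c in e.children)
    if e.kind == "variation":
        # Optional if at least one alternative is optional (variation rule e).
        return any(is_optional(c) for c in e.children)
    return False                               # a single token is always mandatory
```

A translator could cache this flag on each node so that it is computed once per template rather than once per match candidate.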

Rules for searching for templates with a scope.

The search for templates with a scope must be implemented in accordance with the following rules:

a) any template may be given as the left operand;

b) either a template name or a template declared in place may be given as the right operand;

c) a template must be considered matched within the scope of another if the left boundary of the first template's match is no earlier in the text than the left boundary of the second template's match, and its right boundary is no later in the text than the right boundary of the second template's match.

Rules for searching for templates that follow through words and separators.

The search for templates that follow through words and separators must be implemented in accordance with the following rules:

a) any templates may be given as operands;

b) when words are counted, a word means a numeric, alphabetic or alphanumeric token; separators must be ignored in the count;

c) when several overlapping matches exist, the result must be the match with the smaller distance in words between the templates;

d) a match of the exclusion template must cancel the match of the follow-through-words template;

e) in the operator of following through words and separators, a separator means a non-empty sequence of whitespace characters, punctuation marks, line breaks and other non-alphabetic characters.
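Rules b) and c) can be illustrated on a list of token types and precomputed operand matches. The sketch is simplified under stated assumptions: it ignores the exclusion template of rule d), it returns a single best pair instead of resolving every group of overlapping matches, and the names `words_between` and `nearest_follow` are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple

WORD_TYPES = {"Alpha", "Num", "AlphaNum", "NumAlpha"}
Match = Tuple[int, int]  # (start token index, end token index exclusive)


def words_between(types: Sequence[str], left_end: int, right_start: int) -> int:
    # Separators are skipped; only word tokens are counted (rule b).
    return sum(1 for t in types[left_end:right_start] if t in WORD_TYPES)


def nearest_follow(types: Sequence[str], x_matches: List[Match],
                   y_matches: List[Match], lo: int,
                   hi: Optional[int]) -> Optional[Match]:
    """Return the span of the X..Y pair with the fewest words in between."""
    best = None
    for xs, xe in x_matches:
        for ys, ye in y_matches:
            if ys < xe:
                continue                      # Y must come after X
            d = words_between(types, xe, ys)
            if d >= lo and (hi is None or d <= hi):
                candidate = (d, xs, ye)       # smaller distance wins (rule c)
                if best is None or candidate < best:
                    best = candidate
    return None if best is None else (best[1], best[2])
```

The nested loops are quadratic in the number of operand matches, which is acceptable for an illustration but not for the single-scan search the description requires.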

Rules for searching for sequentially mentioned templates.

The search for sequentially mentioned templates must be implemented in accordance with the following rules:

b) sequentially mentioned templates are templates that occur in the text at any distance and in any order.

Rules for searching for template references.

The search for template references must be implemented in accordance with the following rules:

a) when a portion of text matches a template that is referenced by several other templates, the search for the referenced template must be performed only once;

b) the search must support templates whose reference structure forms left or right recursion;

c) when a standard template is referenced, the expression corresponding to the standard template must be substituted directly at the point of reference.

Разбор входного текста на лексемы.Parsing the input text into tokens.

Part of this invention is one variant of the lexical text analyzer 19, which parses the input text into tokens. The input to the lexical analyzer is a text string. No other data is required, because the tokenization rules are fixed. Here a text string means a string object in a programming language, which is logically an array of characters in RAM. The decision to work with data only in RAM stems from the need for high performance at this stage of the search. Loading the text string into memory increases the likelihood that both parts of the string and the code that accesses a character (which with this approach is a simple array access by index) will be cached. Thus the trade-off between time and memory is resolved here in favor of time. The same decision is made in the search algorithm: detecting many templates in a single scan of the text requires additional memory to store the state of all matches. It should be kept in mind, however, that in languages with automatic memory management, higher memory consumption caused by a larger number of created objects indirectly affects performance, because garbage collection overhead grows. For this reason the output of the lexical analyzer is a special index containing the original text string and an array of structures that encode the type and boundaries of each token in the original text. A substring for a token is created from the source text only when needed. This structure sharply reduces the number of objects created in dynamic memory (the heap) compared with simply creating a new substring for each token.
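The token index described above can be illustrated with a minimal sketch. The names TokenKind, TokenSpan, TokenIndex and token_text are hypothetical and do not come from the description; the sketch only assumes that the index keeps the source string together with an array of (type, boundaries) records and cuts substrings on demand.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class TokenKind(Enum):  # hypothetical set of token types
    WORD = 1
    NUMBER = 2
    SPACE = 3
    PUNCTUATION = 4


@dataclass(frozen=True)
class TokenSpan:
    kind: TokenKind
    start: int  # inclusive offset into the source string
    end: int    # exclusive offset into the source string


class TokenIndex:
    """Source text plus token boundaries; no per-token substrings are stored."""

    def __init__(self, text: str, spans: List[TokenSpan]) -> None:
        self.text = text
        self.spans = spans

    def token_text(self, i: int) -> str:
        # The substring is created only when a caller asks for it.
        span = self.spans[i]
        return self.text[span.start:span.end]


index = TokenIndex("Hello, world", [
    TokenSpan(TokenKind.WORD, 0, 5),
    TokenSpan(TokenKind.PUNCTUATION, 5, 6),
    TokenSpan(TokenKind.SPACE, 6, 7),
    TokenSpan(TokenKind.WORD, 7, 12),
])
```

In this sketch a search that only compares token types never allocates a substring at all; token_text is called only for the tokens whose text is actually needed.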

The lexical analyzer uses the word boundary rules of Unicode Standard Annex #29 with the following modifications:

a period and an underscore split an alphabetic or alphanumeric sequence into several tokens;

a decimal separator splits a digit sequence into several tokens;

a sequence of whitespace characters is not split;

all punctuation marks and other non-alphabetic characters are each emitted as a separate token.
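The effect of these modifications can be shown with a deliberately simplified sketch. It is not an implementation of Unicode Standard Annex #29: it is an assumption-laden approximation that uses one regular expression to keep whitespace runs together, to keep letter and digit runs together, and to emit every other character (including the period, the underscore and the decimal separator) as its own token. The function name split_tokens is hypothetical.

```python
import re

# Three alternatives, tried in order:
#   \s+       a run of whitespace stays one token
#   [^\W_]+   a run of letters and digits (underscore excluded)
#   .         any other single character is a token of its own
_TOKEN_RE = re.compile(r"\s+|[^\W_]+|.", re.DOTALL)


def split_tokens(text: str) -> list:
    """Approximate the modified splitting rules; not full UAX #29."""
    return [match.group(0) for match in _TOKEN_RE.finditer(text)]


tokens = split_tokens("file_name.txt  costs 3.14!")
```

For the sample string the sketch yields file, _, name, ., txt, a two-space token, costs, a one-space token, 3, ., 14 and !, which matches the four rules listed above.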

Word boundary properties are used to apply the word boundary rules. These properties are defined for the entire range of Unicode characters, including those whose size exceeds two bytes. This solution uses the UTF-16 format, assuming two bytes per character. Characters that do not fit in two bytes belong to ancient or rarely used languages and are therefore ignored.

During text processing, the word boundary property must be determined for each character by a lookup in a data structure that stores this property for every two-byte Unicode character. To do this efficiently, a static two-dimensional array of word boundary properties is used. It is addressed by the high and low bytes of the character whose property is needed. The sequence of character properties has the following characteristic: the most frequent are continuous runs of the values "any" and "alphabetic letter". To reduce the size of the array, for such runs the second addressing level holds not arrays with the same property value for every character but references to empty marker arrays. Thus a lookup first extracts the high byte of the character and uses it for the first level of addressing. The reference to the resulting array is first compared with the two marker arrays. If it matches one of them, the property of the character is determined. If no match is found, the low byte of the character is extracted and used to address the resulting array.
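The two-level lookup with marker arrays can be sketched as follows. The property codes ANY, LETTER and DIGIT, the function names, and the toy classification based on str.isalpha and str.isdigit are all assumptions made for illustration; the real table holds the word boundary properties of Unicode Standard Annex #29.

```python
ANY, LETTER, DIGIT = 0, 1, 2  # hypothetical property codes

# Empty marker rows. They are recognized by identity and never indexed.
ALL_ANY = []
ALL_LETTER = []


def toy_property(code: int) -> int:
    """Stand-in for the real word boundary property of a code unit."""
    ch = chr(code)
    if ch.isalpha():
        return LETTER
    if ch.isdigit():
        return DIGIT
    return ANY


def build_table() -> list:
    table = []
    for high in range(256):
        row = [toy_property((high << 8) | low) for low in range(256)]
        if all(p == ANY for p in row):
            table.append(ALL_ANY)      # whole row collapses to a marker
        elif all(p == LETTER for p in row):
            table.append(ALL_LETTER)   # whole row collapses to a marker
        else:
            table.append(row)
    return table


def lookup(table: list, ch: str) -> int:
    code = ord(ch)
    if code > 0xFFFF:
        return ANY  # characters beyond two bytes are ignored in this sketch
    row = table[code >> 8]          # first level: high byte
    if row is ALL_ANY:
        return ANY
    if row is ALL_LETTER:
        return LETTER
    return row[code & 0xFF]         # second level: low byte


TABLE = build_table()
```

Rows that are uniform cost one shared reference instead of 256 entries, and a lookup in such a row finishes after two identity comparisons without touching the second level.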

In addition to detecting boundaries, the lexical analyzer also determines token types. To do so, for each new character it tracks the class of the character and the type of the token given the characters seen so far. Characters are divided into letters, digits, punctuation marks, and the carriage return and line feed characters. All other characters are treated as non-alphabetic symbols. This logical model is a finite state machine in which the state is the token type given the characters seen so far, and transitions between states are driven by the input character. The machine is reset to its initial state, in which the token type is undefined, at token boundaries.

The main loop of the lexical analyzer, which scans the string, coordinates the token splitter and the token classifier. The full set of rules for determining a token boundary requires a buffer of character properties holding four elements: besides the current character, the rules look ahead by two characters and back by one. Advancing to the next character means determining its word boundary property and adding that value to the buffer. At the start of processing, the buffer is initialized by advancing two characters. The main loop then moves forward along the string, advancing the buffer and feeding characters to the token classifier. After each character is processed, a check is made for a token boundary at the current position. If a token has ended on this character, its boundaries and type are added to the result. Artificial zero-length tokens marking the start and end of the text are also added at the corresponding positions of the result.
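The token classifier can be sketched as a small finite state machine. The description does not list the token types or the transition table, so the Kind values, the punctuation set and the merging rules below are assumptions chosen only to show the mechanism: the state is the token type so far, each character drives one transition, and reset is called at a token boundary.

```python
from enum import Enum


class Kind(Enum):  # hypothetical token types
    UNDEFINED = 0
    ALPHABETIC = 1
    NUMERIC = 2
    ALPHANUMERIC = 3
    PUNCTUATION = 4
    LINE_BREAK = 5
    SYMBOL = 6


_WORD_KINDS = {Kind.ALPHABETIC, Kind.NUMERIC, Kind.ALPHANUMERIC}


def char_kind(ch: str) -> Kind:
    if ch.isalpha():
        return Kind.ALPHABETIC
    if ch.isdigit():
        return Kind.NUMERIC
    if ch in "\r\n":
        return Kind.LINE_BREAK
    if ch in ".,;:!?-()\"'":  # assumed punctuation set
        return Kind.PUNCTUATION
    return Kind.SYMBOL


class TokenClassifier:
    def __init__(self) -> None:
        self.state = Kind.UNDEFINED

    def reset(self) -> None:
        """Called at a token boundary."""
        self.state = Kind.UNDEFINED

    def feed(self, ch: str) -> Kind:
        kind = char_kind(ch)
        if self.state is Kind.UNDEFINED or self.state is kind:
            self.state = kind
        elif {self.state, kind} <= _WORD_KINDS:
            self.state = Kind.ALPHANUMERIC  # letters mixed with digits
        else:
            self.state = Kind.SYMBOL
        return self.state


def classify(token_text: str) -> Kind:
    classifier = TokenClassifier()
    for ch in token_text:
        classifier.feed(ch)
    return classifier.state
```

In the main loop described above, the splitter would decide where a token ends and the loop would read classifier.state at that point, store it with the boundaries, and call reset before the next token.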

Representation of templates at the search executor level.

At the search executor level, each template is represented by a data structure that encodes the tree of all grammar elements making up the template. This data structure is called the search expression tree. It differs from the syntax tree produced by translating the template in that it is adapted for an efficient search implementation. The search expression tree contains only the basic operators of the template description language, such as sequence and variation, and does not contain derived operators that can be expressed through the basic ones, such as following through words and separators, and sequential mention. Separating the search expression tree from the template syntax tree provides the flexibility needed to upgrade and improve the solution, allows the internal data structures and algorithms of the search executor (which works at the search expression level) to be optimized, and preserves compatibility with existing clients that work at the syntax tree level.

The search expression tree consists of elements (expressions) of the following types:

token (class TokenExpression);

template reference (class ReferenceExpression);

sequence (class SequenceExpression);

variation (class VariationExpression);

variation item (class VariationItemExpression);

scope (class EnclosureExpression);

template (class PatternExpression).

The repetition operator of the textual template description language is represented in the search executor not as a separate expression type but as an attribute of other expressions. This attribute is defined for the following expressions: token, template reference, sequence, variation, and scope. This approach is not mandatory; it was chosen to reduce the number of dynamically created objects and the depth of the expression tree. Representing the repetition operator as a separate type of search expression does not change the essence of the invention.

An expression of type template is the root of a search expression tree. Its subordinate element can be an expression of any of the types listed above except template and variation item. Expressions of type token and template reference are leaf (terminal) nodes of the tree, i.e. they cannot have subordinate elements; all other expressions are compound. In OOP terms, the terminal expression classes (TokenExpression and ReferenceExpression) derive from the class TerminalExpression. The compound expression classes (SequenceExpression, VariationExpression, VariationItemExpression, EnclosureExpression and PatternExpression) can contain subordinate expressions and derive from the class CompoundExpression. The base class of TerminalExpression and CompoundExpression, and indirectly of all expression classes, is Expression. All search expressions of type template are collected in a template package.
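The class hierarchy can be sketched as follows. The class names are those given in the description; the constructors, the fields text, template_name, children and name, and the helper root_of are assumptions added for illustration.

```python
from typing import Iterable, Optional


class Expression:
    def __init__(self) -> None:
        # Upward link, set when the node is attached to a compound parent.
        self.parent_expression: Optional["CompoundExpression"] = None


class TerminalExpression(Expression):
    """Leaf node: cannot have subordinate expressions."""


class CompoundExpression(Expression):
    def __init__(self, children: Iterable[Expression]) -> None:
        super().__init__()
        self.children = list(children)
        for child in self.children:
            child.parent_expression = self


class TokenExpression(TerminalExpression):
    def __init__(self, text: str) -> None:
        super().__init__()
        self.text = text


class ReferenceExpression(TerminalExpression):
    def __init__(self, template_name: str) -> None:
        super().__init__()
        self.template_name = template_name


class SequenceExpression(CompoundExpression):
    pass


class VariationItemExpression(CompoundExpression):
    pass


class VariationExpression(CompoundExpression):
    pass


class EnclosureExpression(CompoundExpression):
    pass


class PatternExpression(CompoundExpression):
    def __init__(self, name: str, body: Expression) -> None:
        # The root may hold any expression except a template or a variation item.
        if isinstance(body, (PatternExpression, VariationItemExpression)):
            raise TypeError("invalid subordinate element for a template")
        super().__init__([body])
        self.name = name


def root_of(expression: Expression) -> Expression:
    """Climb the upward links from any node to the root of its tree."""
    while expression.parent_expression is not None:
        expression = expression.parent_expression
    return expression


hello = TokenExpression("hello")
greeting = PatternExpression(
    "Greeting", SequenceExpression([hello, ReferenceExpression("Name")]))
```

Storing the upward link in every node is what lets a caller start at a matched terminal such as hello and reach the template that contains it.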

An example of a template and its corresponding expression tree is shown in Fig. 2. As is customary in computer science, the tree is drawn upside down: the root is at the top and the terminal elements are at the bottom.

The search expression tree of each template is fully connected: it can be traversed not only down from the root to the terminal elements but also up from the terminal elements to the parent elements. This matters for the operation of the search executor.

Each search expression contains a search index. A search index is a data structure (object) that, for a given input token or a given matched template, quickly returns the terminal elements (expressions of type token and template reference) of all subordinate expressions with which a match or mismatch of the parent expression can begin. The search index of a parent expression is the result of merging the search indexes of its subordinate expressions, and the way they are merged depends on the type of the parent expression. If the parent expression is a variation, its search index is the set union of the search indexes of all variation items. If the parent expression is a sequence, its search index is the set union of the search indexes of all leading optional elements and the search index of the first required element of the sequence. An element is optional if its minimum allowed number of repetitions is zero. For a terminal expression (token or template reference), the search index consists essentially of a single element and returns either that one element or the empty set. Clearly, including the search indexes of subordinate elements, fully or partially, in the search indexes of parent elements consumes a noticeable amount of RAM, so it is advisable to optimize in order to reduce duplication of the same data, text strings in particular, in memory.
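The merging rules for variations and sequences can be sketched as follows. This is a simplified model, not the index of the description: the Node class, the string-valued kind field and the representation of an index as a dictionary from token text to a set of terminal nodes are all assumptions, and template references, token types and exceptions in variations are left out.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Set


@dataclass(eq=False)  # identity-based hashing so nodes can live in sets
class Node:
    kind: str                       # "token", "sequence" or "variation"
    text: str = ""
    children: List["Node"] = field(default_factory=list)
    min_repeats: int = 1            # 0 marks an optional element


Index = Dict[str, Set[Node]]


def merge(indexes: Iterable[Index]) -> Index:
    """Set union of several indexes, key by key."""
    merged: Index = {}
    for index in indexes:
        for key, terminals in index.items():
            merged.setdefault(key, set()).update(terminals)
    return merged


def build_index(node: Node) -> Index:
    if node.kind == "token":
        return {node.text: {node}}
    if node.kind == "variation":
        return merge(build_index(child) for child in node.children)
    if node.kind == "sequence":
        leading = []
        for child in node.children:
            leading.append(build_index(child))
            if child.min_repeats > 0:   # first required element ends the prefix
                break
        return merge(leading)
    raise ValueError(f"unknown node kind: {node.kind}")


very = Node("token", "very", min_repeats=0)
good, nice, day = Node("token", "good"), Node("token", "nice"), Node("token", "day")
phrase = Node("sequence", children=[very, Node("variation", children=[good, nice]), day])
phrase_index = build_index(phrase)
```

For the sample sequence the index has the keys very, good and nice: the optional first element and the first required element contribute, while day does not, because a match of the sequence can never begin with it.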

The search index must support several kinds of fast expression lookup: tokens only, template references only, or any terminal expressions. The lookup criteria can be token types and text literals, with or without case sensitivity, template identifiers, and flags for handling exceptions in variations, i.e. terminal expressions that lead to a match or to a mismatch of the search expression.

The search indexes of all root search expressions of type template are combined, as a set union, into the root search index. The root search index is part of the template package.

In OOP terms, search expressions are objects of the corresponding classes and have the following main properties and procedures:

ParentExpression - a reference to the parent expression;

RepeatRange - a structure that stores the allowed number of repetitions;

IsRepeatable - determines whether the repetition operator is allowed for this expression (overridden in derived classes);

IsOptional - determines whether the expression is optional;

GetOrCreateExpressionIndex - a procedure that returns the search index of the expression; on first access the search index is created beforehand;

CreateExpressionIndex - a procedure, overridden in derived classes, that actually builds the index for the corresponding expression type.
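The relationship between these members can be sketched as follows. The member names follow the list above in Python naming style; the RepeatRange fields, the frozenset used as a stand-in index and the index_builds counter are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RepeatRange:
    min: int = 1
    max: int = 1


class Expression:
    def __init__(self, repeat_range: RepeatRange = RepeatRange()) -> None:
        self.parent_expression = None
        self.repeat_range = repeat_range
        self._index = None
        self.index_builds = 0  # illustration only: counts real builds

    @property
    def is_repeatable(self) -> bool:
        return True  # overridden where repetition is not allowed

    @property
    def is_optional(self) -> bool:
        return self.repeat_range.min == 0

    def get_or_create_expression_index(self):
        if self._index is None:  # first access: build and cache
            self._index = self.create_expression_index()
            self.index_builds += 1
        return self._index

    def create_expression_index(self):
        raise NotImplementedError  # each expression type builds its own index


class TokenExpression(Expression):
    def __init__(self, text: str, repeat_range: RepeatRange = RepeatRange()) -> None:
        super().__init__(repeat_range)
        self.text = text

    def create_expression_index(self):
        return frozenset({self.text})  # stand-in for a real search index


class PatternExpression(Expression):
    def __init__(self, body: Expression) -> None:
        super().__init__()
        self.body = body
        body.parent_expression = self

    @property
    def is_repeatable(self) -> bool:
        return False  # repetition is not defined for a template

    def create_expression_index(self):
        return self.body.get_or_create_expression_index()


maybe = TokenExpression("maybe", RepeatRange(0, 1))
template = PatternExpression(maybe)
first = template.get_or_create_expression_index()
second = template.get_or_create_expression_index()
```

The second call returns the cached object without rebuilding, which fits the statement below that the structures are read-only once the package has been built.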

Once the template package with its search expressions and search indexes has been fully built, these data structures no longer change; while the search executor runs, they are accessed read-only.

Translation of the textual template description.

The template description language proposed in this embodiment has a simple, clear syntax that is also convenient for automatic parsing. Template descriptions in this language are translated into search expressions by well-known recursive algorithms that use the machine stack or a software stack.

In this embodiment, translation of the textual template description consists of two stages: building and optimizing the syntax tree from the textual description, and generating search expressions with search indexes for the search executor.

The syntax tree is built in a single scan of the tokens of the template language by a recursive algorithm that uses the machine stack. After the next token is extracted, its text and type are stored in the parser state. The token types here differ from those of the lexical text analyzer, because what is parsed is a formal language rather than a natural one. The tokens recognized are identifiers, string and numeric literals, round, curly and square brackets, arithmetic signs, and other special characters that have a particular meaning in the template description language.

Formal grammars are parsed by the bottom-up recursion method. The final result of parsing text in the template description language is a template package object represented as a syntax tree. Parsing starts with a call to the package parsing procedure, which calls the named-template parsing procedure in a loop. After determining the template name, the named-template parsing procedure runs the procedure that parses the right-hand side of the template body. That procedure calls the parsing procedure for the lowest-priority operator, the scope operator. The parsing procedure of the scope operator in turn calls the parsing procedure for the operator next in priority, the operator of following through words and separators, and so on. Wherever an arbitrary expression may appear, the main expression parsing procedure is called; it restarts the chain of calls from the beginning and also handles explicit operator priority given by parentheses. This approach yields the required operator priority during parsing. The syntax tree produced by parsing contains nodes that correspond exactly to the operators of the template description language.
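The chain of parsing procedures, one per priority level, can be sketched on a toy grammar. The grammar is hypothetical and is not the syntax of the template description language: "@" stands in for a lowest-priority operator, "+" for a higher-priority one, names are plain letters, and the tree is built from tuples.

```python
import re


def tokenize(source: str) -> list:
    return re.findall(r"[A-Za-z]+|[@+()]", source)


class Parser:
    def __init__(self, tokens: list) -> None:
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def take(self, expected=None) -> str:
        token = self.peek()
        if token is None or (expected is not None and token != expected):
            raise SyntaxError(f"expected {expected!r}, got {token!r}")
        self.pos += 1
        return token

    def parse_expression(self):
        # Entry point for "any expression": restart from the lowest priority.
        return self.parse_scope()

    def parse_scope(self):
        left = self.parse_sequence()   # delegate to the next priority level
        while self.peek() == "@":
            self.take()
            left = ("scope", left, self.parse_sequence())
        return left

    def parse_sequence(self):
        left = self.parse_primary()
        while self.peek() == "+":
            self.take()
            left = ("seq", left, self.parse_primary())
        return left

    def parse_primary(self):
        if self.peek() == "(":
            self.take("(")
            inner = self.parse_expression()  # parentheses restart the chain
            self.take(")")
            return inner
        token = self.take()
        if not token.isalpha():
            raise SyntaxError(f"unexpected {token!r}")
        return ("name", token)


def parse(source: str):
    parser = Parser(tokenize(source))
    tree = parser.parse_expression()
    if parser.peek() is not None:
        raise SyntaxError(f"unexpected {parser.peek()!r}")
    return tree
```

Because parse_scope calls parse_sequence before looking for its own operator, "a + b @ c" groups as (a + b) @ c, while "a + (b @ c)" uses parentheses to force the other grouping.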

The next step is to transform and optimize the syntax tree and to link templates. Linking templates means storing references to the syntax trees of named templates at the places where they are referenced. Doing this step last means that the order in which templates are written does not matter, forward references to templates declared later are allowed, and recursive templates can be created. The following transformations and optimizations are performed at this step:

operators of following through words and separators and operators of sequential mention are replaced with equivalent constructions built from basic operators;

text literals are lexically analyzed and replaced with an equivalent sequence of tokens if the literal consists of more than one token;

sequences and variations consisting of a single element are replaced with that element;

if a positive variation item is itself a variation, its items are added directly to the parent variation.
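The last two simplifications in the list above can be sketched on a toy tree. The tuple representation ("seq" or "var" with a list of children, "tok" with a text) and the function name simplify are assumptions; negative variation items (exceptions) are not modelled, so every item here is treated as positive.

```python
def simplify(node):
    """Collapse single-element groups and flatten nested variations."""
    kind, payload = node
    if kind == "tok":
        return node
    children = [simplify(child) for child in payload]
    if kind == "var":
        flat = []
        for child in children:
            if child[0] == "var":
                flat.extend(child[1])   # lift the nested items into the parent
            else:
                flat.append(child)
        children = flat
    if len(children) == 1:
        return children[0]              # a group of one is just its element
    return (kind, children)


tree = ("seq", [("var", [("tok", "a"), ("var", [("tok", "b"), ("tok", "c")])])])
simplified = simplify(tree)
```

Here the outer sequence has a single element and disappears, and the nested variation is merged into its parent, leaving one variation over a, b and c.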

The syntax tree is traversed and updated using the Visitor design pattern, also known as double dispatch. During traversal with a visitor object, when a node is updated, all of its parent nodes are rebuilt recursively. To satisfy the rules of syntactic analysis, text literals are divided into two kinds: tokens and compound text literals. While parsing the text, the parser treats every text literal as compound; during tree traversal each such literal is replaced either with a single token or with a sequence of tokens.
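The traversal can be sketched with immutable nodes and a rewriting visitor. The node classes Token, TextLiteral and Sequence and the visitor LiteralSplitter are hypothetical, and splitting a literal on whitespace is only a stand-in for running the lexical analyzer on it.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Token:
    text: str

    def accept(self, visitor):
        return visitor.visit_token(self)


@dataclass(frozen=True)
class TextLiteral:
    text: str

    def accept(self, visitor):
        return visitor.visit_text_literal(self)


@dataclass(frozen=True)
class Sequence:
    items: Tuple[object, ...]

    def accept(self, visitor):
        return visitor.visit_sequence(self)


class LiteralSplitter:
    """Replaces each compound literal with one token or a token sequence."""

    def visit_token(self, node):
        return node

    def visit_text_literal(self, node):
        tokens = tuple(Token(word) for word in node.text.split())
        return tokens[0] if len(tokens) == 1 else Sequence(tokens)

    def visit_sequence(self, node):
        new_items = tuple(item.accept(self) for item in node.items)
        if all(new is old for new, old in zip(new_items, node.items)):
            return node              # no child changed: keep the node
        return Sequence(new_items)   # a child changed: rebuild the parent


splitter = LiteralSplitter()
before = Sequence((Token("in"), TextLiteral("New York")))
after = before.accept(splitter)
untouched = Sequence((Token("in"),))
```

The call item.accept(self) selects the method by node type and the visitor supplies the behaviour, which is the double dispatch mentioned above; a parent is rebuilt only when one of its children was replaced.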

The search expression tree is likewise generated from the template syntax tree by traversing the tree with the visitor design pattern. The traversal uses a stack of expressions under construction. It descends recursively to the terminal elements and creates expressions for them; then, on the way back up from the recursion, expressions are created for all parent elements. Expressions for template references and for the scope operator are created in a special way. In both cases, the references must additionally be linked after all expressions have been created. For the scope operator there is a further peculiarity, a search optimization: if the right-hand side of the operator contains a template declared in place, an anonymous template must be created to perform the search, and the in-place expression is put into its right-hand side. A reference to the resulting anonymous template is used when constructing the expression for the scope operator. If, however, the right-hand side of the operator is already a template reference, that reference must be used directly and no anonymous template is created.
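The stack-based generation and the deferred linking of references can be sketched as follows. This is a simplified illustration, not the generator of the invention: all class names are hypothetical, only tokens, sequences and template references are modeled, and the scope operator with its anonymous template is omitted.

```python
from dataclasses import dataclass
from typing import Optional


# Hypothetical syntax tree nodes.
@dataclass
class TokenNode:
    text: str

    def accept(self, visitor):
        visitor.visit_token(self)


@dataclass
class RefNode:
    name: str

    def accept(self, visitor):
        visitor.visit_ref(self)


@dataclass
class SeqNode:
    items: list

    def accept(self, visitor):
        visitor.visit_seq(self)


# Hypothetical search expressions.
@dataclass
class TokenExpr:
    text: str


@dataclass
class RefExpr:
    name: str
    target: Optional[object] = None  # filled in during linking


@dataclass
class SeqExpr:
    items: list


class ExpressionGenerator:
    def __init__(self):
        self.stack = []         # expressions under construction
        self.pending_refs = []  # references to link after generation

    def visit_token(self, node):
        self.stack.append(TokenExpr(node.text))

    def visit_ref(self, node):
        expr = RefExpr(node.name)
        self.pending_refs.append(expr)
        self.stack.append(expr)

    def visit_seq(self, node):
        for item in node.items:          # descend to the children first
            item.accept(self)
        first = len(self.stack) - len(node.items)
        children = self.stack[first:]    # pop the children on the way up
        del self.stack[first:]
        self.stack.append(SeqExpr(children))

    def generate(self, templates):
        expressions = {}
        for name, root in templates.items():
            root.accept(self)
            expressions[name] = self.stack.pop()
        for ref in self.pending_refs:    # link once everything exists
            if ref.name not in expressions:
                raise ValueError(f"unknown template: {ref.name}")
            ref.target = expressions[ref.name]
        return expressions
```

Linking in a separate pass lets a template refer to another template that is declared later in the package.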

At every stage of translation, any error raises a program exception carrying information about the error. This approach is used to detect and handle syntax errors, input data errors, and internal errors, in particular invalid states of the parser or of the expression generator.

The end result of template translation is a template package containing search expressions with their own search indexes, together with a root search index.

Data structures of match candidates and match results.

All templates in a template package are checked for matches in a single sequential scan of the text tokens. To make this possible, the search executor stores, and uses at every step of the scan, data on the checking state of the search expressions that make up the templates. The data structure that stores information about a complete or partial match of the text with a search expression, starting at some position in the text, and that serves to decide whether the text fragment continues to match the search expression or not, is called a match candidate (hereinafter simply a candidate). Like search expressions, match candidates form a candidate tree: the root of the tree is the match candidate for the whole template, and the other elements are match candidates for sequences, variation items, tokens, and other types of search expressions. The candidate tree of each template is fully linked, so that it is possible not only to descend from the root to the terminal elements but also to ascend from the terminal elements to their parent elements, which is important for the operation of the search executor. The set of all candidates of all templates being checked at a given step of the token scan is called the candidate set and constitutes the state of the search executor.

As the text tokens are scanned, match candidates are created in the candidate set, are then used to decide whether the template matches the tokens being read, and are eventually removed from the candidate set, either because they fail to match the next token or because they turn into match results. Thus, the result of matching a text fragment against a template is a template match candidate for which a final decision on the match has been made and which contains exhaustive information explaining where and why the match occurred. Template match results are used to decide whether the candidates of references to these templates match. The match results of target templates (tags) are returned to the client of the search executor, in either full or abbreviated form as the client requests. In the abbreviated form, the result may consist only of the positions of the first and last text tokens covered by the match. The candidate tree consists of elements (candidates) of the following types:

token (class TokenCandidate);

template reference (class ReferenceCandidate);

sequence (class SequenceCandidate);

variation item (class VariationItemCandidate), subdivided into the alternative candidate (class InclusiveVariationItemCandidate) and the exception candidate (class ExclusiveVariationItemCandidate);

scope (class EnclosureCandidate);

template occurrence (class PatternCandidate).

No additional candidate type is needed to encode repetitions. Each repetition is represented by a separate candidate that records its ordinal repetition number. A candidate stores its own repetition number, and the parent candidate stores references to the candidates of all repetitions of its child elements.

A candidate of the template occurrence type is the root of the candidate tree. Its child elements may be candidates of any type other than template occurrence. Candidates of the token and template reference types are leaf (terminal) elements of the tree, i.e. they cannot have child elements; all other candidates are compound. In OOP terms, the terminal candidate classes (TokenCandidate and ReferenceCandidate) are derived from the class TerminalCandidate. The compound candidate classes (SequenceCandidate, VariationItemCandidate, EnclosureCandidate, and PatternCandidate) may contain child candidates and are derived from the class CompoundCandidate. The class Candidate is the base class of TerminalCandidate and CompoundCandidate and, indirectly, of all candidate classes.

In OOP terms, candidates are objects of the corresponding classes and have the following main properties:

ParentCandidate - a reference to the parent candidate;

RepeatNumber - the repetition number;

StartTokenNumber - the number of the token at which the candidate's match started;

Expression - a reference to the expression corresponding to the candidate;

ChildRepeatsPerPosition - a two-dimensional list of child candidates, where the first index is the position of the child candidate within its parent and the second index is the repetition number; this property is defined only for compound candidates.
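The class hierarchy and the properties listed above can be summarized in a short Python sketch. The class names come from the description; the snake_case property names, the helper method add_child, and the zero-based repetition numbering are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class Candidate:
    expression: object                 # Expression
    start_token_number: int            # StartTokenNumber
    repeat_number: int = 0             # RepeatNumber (zero-based here: assumption)
    parent_candidate: Optional["CompoundCandidate"] = field(default=None, repr=False)


class TerminalCandidate(Candidate):
    """Leaf of the candidate tree; has no child candidates."""


class TokenCandidate(TerminalCandidate):
    pass


class ReferenceCandidate(TerminalCandidate):
    pass


@dataclass(eq=False)
class CompoundCandidate(Candidate):
    # ChildRepeatsPerPosition: first index is the position inside the
    # parent, second index is the repetition number.
    child_repeats_per_position: List[List[Candidate]] = field(default_factory=list)

    def add_child(self, position, child):
        """Hypothetical helper: attach a child as the next repetition."""
        while len(self.child_repeats_per_position) <= position:
            self.child_repeats_per_position.append([])
        repeats = self.child_repeats_per_position[position]
        child.parent_candidate = self
        child.repeat_number = len(repeats)
        repeats.append(child)


class SequenceCandidate(CompoundCandidate):
    pass


class VariationItemCandidate(CompoundCandidate):
    pass


class InclusiveVariationItemCandidate(VariationItemCandidate):
    pass


class ExclusiveVariationItemCandidate(VariationItemCandidate):
    pass


class EnclosureCandidate(CompoundCandidate):
    pass


class PatternCandidate(CompoundCandidate):
    pass
```

Storing both the parent reference and the two-dimensional child list gives the fully linked tree that the search executor needs for moving from a terminal candidate up to the root and back down.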

The candidate set of the search executor is organized in a special way: it stores only the candidates of the most recently matched terminal elements (tokens and template references). Moreover, each match variant of a template that starts at a given position in the text is represented by a completely separate candidate tree. The candidate tree for each variant is hooked to the candidate set by its last matched terminal element. If this hook disappears because the terminal element is removed from the candidate set, the entire candidate tree drops out of the candidate set; if this did not happen as a result of a complete match, the candidate tree becomes garbage to be reclaimed by the garbage collector. On modern programming platforms garbage collection is automatic; on other platforms, candidates that have become garbage must be disposed of manually.
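The hooking of a whole tree to the candidate set through a single terminal element can be demonstrated with a small Python experiment. The Node class and the function names are hypothetical stand-ins for candidates; the sketch only shows that the tree stays reachable while its terminal is in the set and becomes collectable once the terminal is removed.

```python
import gc
import weakref


class Node:
    """Hypothetical stand-in for a candidate with parent and child links."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)


def root_of(terminal):
    # Ascend from a terminal element to the root of its tree.
    node = terminal
    while node.parent is not None:
        node = node.parent
    return node


def demo():
    root = Node("template occurrence")
    sequence = Node("sequence", root)
    token = Node("token", sequence)
    candidate_set = {token}          # only the terminal is stored
    probe = weakref.ref(root)        # observes the root without owning it
    del root, sequence, token        # the set is now the only hook
    reachable_root = root_of(next(iter(candidate_set))).name
    candidate_set.clear()            # remove the terminal element
    gc.collect()                     # the parent/child cycle is reclaimed
    return reachable_root, probe() is None
```

The explicit gc.collect() call is needed in this demonstration because parent and child links form reference cycles, which reference counting alone does not reclaim.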

Search executor algorithm.

A flowchart explaining the general algorithm of the search executor 21 is shown in FIG. 3. According to the flowchart, the sequence of text tokens 20 is passed to the main loop 31-40 of the search executor, in which operations 32 through 39 are performed for each token T. The purpose of these operations is, depending on the specific value of token T, to decide for each candidate being checked in the candidate set 22 whether it has matched, has failed to match, or requires further checking, and also to create candidates for the templates whose match may start at token T. To this end, in loop 32-34, step 33 is performed for each candidate C in the set of token candidates; in this step a decision on the candidate tree corresponding to candidate C is made based on token T. The actions of step 33 are described in more detail in FIG. 4. After the candidate processing loop 32-34 completes, step 35 creates new candidates and adds them to the candidate set; they are obtained by searching the root search index for templates that start with token T. New candidates are created in two stages: token candidates first, then template reference candidates. To create token candidates, the search index is queried and returns the list of terminal token search expressions that match token T. A candidate tree is built for each returned expression by ascending the expression tree. If the candidate limit has not been exceeded after these candidates are created, template reference candidates are created as well. Template reference candidates are also terminal candidates, but they are stored in the candidate set separately from token candidates, because decisions in such candidates require not a text token but a matched template, namely the one the candidate refers to. The candidate tree for a reference is created in the same way as for a token: by ascending the expression tree.

Step 36 checks whether token T is the end-of-text token; if it is, step 37 removes all candidates from the candidate set. If token T is any other token, step 37 is skipped. Step 38 checks whether the candidate limit has been exceeded; if it has, step 39 artificially inserts an end-of-text token into the token sequence, and that token is processed on the next iteration of loop 31-40. After the main loop 31-40 exits, the accumulated match results 24 remain in the search executor's data. This data must be cleared before searching another text.
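The structure of the main loop can be illustrated with a deliberately reduced Python sketch. It assumes that every template is a flat, non-empty sequence of tokens, so template references, variations, repetitions and candidate trees are left out; the function name search, the END marker, and the tuple layout of candidates and results are hypothetical. Comments map the code to the step numbers of FIG. 3.

```python
END = object()  # hypothetical end-of-text token


def search(tokens, templates, candidate_limit=1000):
    """Single-pass search for flat token-sequence templates (simplified)."""
    # Root search index: first token -> names of templates starting with it.
    root_index = {}
    for name, sequence in templates.items():
        root_index.setdefault(sequence[0], []).append(name)

    candidates = []  # (template name, start position, tokens matched so far)
    results = []     # (template name, start position, end position)
    queue = list(tokens) + [END]
    pos = 0
    while pos < len(queue):                        # main loop 31-40
        token = queue[pos]
        survivors = []
        for name, start, matched in candidates:    # loop 32-34, step 33
            sequence = templates[name]
            if sequence[matched] != token:
                continue                           # mismatch: drop candidate
            if matched + 1 == len(sequence):
                results.append((name, start, pos))  # complete match
            else:
                survivors.append((name, start, matched + 1))
        candidates = survivors
        for name in root_index.get(token, ()):     # step 35: new candidates
            if len(templates[name]) == 1:
                results.append((name, pos, pos))
            else:
                candidates.append((name, pos, 1))
        if token is END:                           # steps 36-37
            candidates.clear()
            break
        if len(candidates) > candidate_limit:      # steps 38-39
            queue.insert(pos + 1, END)
        pos += 1
    return results
```

For example, searching the tokens A A B C with the templates AB, A and ABC reports both occurrences of A, then AB at positions 1-2 and ABC at positions 1-3, all within one pass. With a very small candidate limit the inserted end-of-text token stops the scan early, which mirrors the protective role of steps 38 and 39.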

General decision-making algorithm.

A flowchart explaining the general algorithm for making a decision on the candidate tree that corresponds to terminal candidate C, depending on the value of the current token T, is shown in FIG. 4. The input data 41 of the algorithm are a candidate C from the candidate set and a token T at some position in the text. Candidate C is a terminal candidate and may be either a token candidate or a template reference candidate.

The decision is made with respect to the candidate tree corresponding to candidate C. First, step 42 checks whether candidate C is marked as deleted; if it is, all decision steps from 43 through 51 are skipped and the algorithm exits at step 52. If candidate C is not marked as deleted, step 43 checks whether the candidate tree has matched completely.

In accordance with the chosen OOP methodology, the decision procedures are methods of the objects that represent candidates; therefore, to make the decision at step 43, token T is passed to the decision procedure of candidate C. If the candidate's own procedure cannot make the decision, the process moves up the candidate tree by calling the decision procedure of the parent candidate. Thus, the token candidate is consulted first and, if necessary, delegates the decision to its parent element, and so on up to the root element of the tree, the template candidate.

If the candidate tree has matched completely at step 43, step 44 checks whether the corresponding template is a target template, i.e. a tag; if it is, step 45 adds the root candidate of this tree (the template candidate) to the result set. If the corresponding template is not a tag, step 45 is skipped. After step 44 and regardless of step 45, i.e. whenever the candidate tree has matched completely and whether or not the corresponding template is a tag, step 48 makes a decision for the candidates of references to this template. The decision is made for the reference candidates that are present in the candidate set and start at the same text position as the candidate of the matched template. In effect, step 48 is a recursive invocation of the decision algorithm of FIG. 4.

If the candidate tree has not matched completely at step 43, step 46 checks whether the candidate tree continues to match. If it does not, step 47 marks the candidate tree for deletion, and the next step, 48, recursively invokes the decision algorithm for the candidates of references to the corresponding unmatched template.

If step 46 finds that the candidate tree continues to match, step 49 checks whether there is more than one match variant. If there are several, step 50 creates a logical copy of the candidate tree for each match variant; the logical copies coincide in the part of the expression that precedes token T and differ in the part that corresponds to the different ways token T can match tokens in the template's search expressions. If there is only one match variant, step 50 is skipped: no copies are made and the existing candidate tree is used. At the next step, 51, a candidate for token T is added to the candidate tree of each match variant and is in turn added to the candidate set.
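Steps 46 through 51 can be illustrated with a reduced Python sketch. It assumes a flat template given as a list of (token, optional) pairs, where an optional element is one source of several match variants; the names Matched, next_options and advance are hypothetical. Logical copies are modeled by sharing the already matched prefix through a link to the previous element instead of duplicating it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Matched:
    """Hypothetical terminal candidate: one matched template element."""
    position: int                   # index of the element in the template
    token_number: int               # index of the token in the text
    previous: Optional["Matched"]   # shared prefix of the match


def next_options(template, last_position, token):
    """Positions at which `token` can be matched next (step 49 counts them)."""
    options = []
    for i in range(last_position + 1, len(template)):
        expected, optional = template[i]
        if expected == token:
            options.append(i)
        if not optional:
            break  # a mandatory element cannot be skipped
    return options


def advance(template, candidate, token, token_number):
    """Steps 46-51 for one candidate.

    Returns the new terminal candidates, one per match variant. An empty
    list means the candidate no longer matches (step 47).
    """
    return [Matched(position, token_number, candidate)
            for position in next_options(template, candidate.position, token)]
```

For the template A + ?B + B + C, after A has matched, the token B yields two variants (the optional B or the mandatory B), both pointing to the same matched A. A following C then extends only the variant that consumed the mandatory B, while the other variant is dropped.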

Decision rules.

The rules for deciding on a complete match, a mismatch, or a partial match of candidates depend on the candidate type.

A part of the decision rules common to all candidates is the analysis of the current repetition number. If the repetition count has not yet reached the minimum allowed bound, it is necessary to check whether a new repetition of the template can start matching at the next text token and, if it can, to create a candidate for the new repetition. If a new repetition cannot match, the parent candidate is informed that its child element has failed to match, and this information is passed up the tree to its root. When the repetition count has reached the maximum allowed bound, the current candidate decides that it has matched and immediately reports this to its parent candidate. The most complex case is a match of a candidate whose repetition number lies between the minimum and maximum bounds. When an element with a variable number of repetitions matches, two variants must be considered: that this match was the last one, and that one more match will be sought. Depending on the outcome of the check, these variants may produce different candidate trees, which may ultimately lead to different matches of the corresponding template. To account for all variants, the candidate tree must be copied all the way up to its root element, the template candidate. After copying, one instance of the candidate follows the decision rules for the case where the maximum number of repetitions has been reached, and the other follows the rules for the case where the repetition count has not yet reached the minimum allowed bound.
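The three repetition cases can be condensed into a small Python function. The names Branch and repetition_branches are hypothetical, and the sketch assumes that a count equal to the minimum but below the maximum already counts as lying between the bounds; the description does not state whether the lower bound is inclusive.

```python
from enum import Enum


class Branch(Enum):
    CONTINUE = "look for one more repetition"
    COMPLETE = "report the match to the parent candidate"


def repetition_branches(repeat_count, min_repeats, max_repeats):
    """Branches to follow after `repeat_count` repetitions have matched."""
    if repeat_count < min_repeats:
        return [Branch.CONTINUE]      # minimum not reached yet
    if repeat_count >= max_repeats:
        return [Branch.COMPLETE]      # maximum reached
    # Between the bounds: the candidate tree is copied, one copy per branch.
    return [Branch.COMPLETE, Branch.CONTINUE]
```

A result with two branches corresponds to copying the candidate tree up to the template candidate, with each copy following one of the two rule sets.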

For a sequence, the decision logic based on the repetition number is supplemented by an analysis of the position number of the matched child element. A sequence is considered matched when its mandatory part has matched. The mandatory part of a sequence is its subsequence from the first to the last mandatory element. For example, in the sequence ?A + B + ?C + D + ?E the mandatory part is the subsequence from B to D, with the optional element C in the middle.
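The mandatory part of a sequence can be computed as in the following Python sketch. The function names and the tiny parser for the "?X + Y" notation are assumptions made for illustration; they cover only the notation used in the example above.

```python
def parse_sequence(text):
    """Parse '?A + B' into [(name, optional), ...] (illustrative only)."""
    elements = []
    for part in text.split("+"):
        part = part.strip()
        elements.append((part.lstrip("?"), part.startswith("?")))
    return elements


def mandatory_part(elements):
    """Return (first, last) indices of the mandatory part, or None."""
    required = [i for i, (_, optional) in enumerate(elements) if not optional]
    if not required:
        return None  # every element is optional
    return required[0], required[-1]
```

For the sequence ?A + B + ?C + D + ?E the function returns the index range from B to D, so the sequence may be reported as matched once D has matched, even though the optional E could still follow.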

If optional elements follow the mandatory part, it is also necessary to check whether they can match. To this end, at the moment the mandatory part of the sequence matches, the candidate tree is copied, just as in the case of a variable number of repetitions. One instance attempts to continue by matching the optional part of the sequence, and the other instance is considered matched.

The decision rules for a variation also have their own peculiarities. They stem from the logic of exceptions: a match of an exception must cancel the match of every alternative whose match started at the same position in the text.

Such a cancellation can happen either before the alternative's candidate matches or after it, several text tokens later. The main use case, lookahead, corresponds to the second situation. A more complex scenario is cancellation in the presence of repeated elements. For example, if an exception matches from the position of some repetition of an element other than the first, only those element matches are cancelled whose position in the text is greater than the position of the match to which the exception relates. The candidate trees of exceptions and alternatives are connected through a variation match state (hereinafter, the variation state). The variation state stores references to all candidates in a dedicated structure that supports lookup by the position of the first matched token. It is associated with the variation's search expression object and with the position of the token at which the first candidate of a variation element was created. All candidates of elements of one variation that were created while processing the same token share a single variation state. When a candidate makes its decision, it consults the variation state for additional information and updates that state.
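The bookkeeping performed by a variation state can be illustrated with a minimal Python sketch. The class `VariationState`, its method names and the status strings are hypothetical and are not taken from the described embodiment. The sketch ignores repetitions and simply groups alternatives and exceptions by the start position of their match.

```python
from collections import defaultdict


class VariationState:
    """Hypothetical shared state for the candidates of one variation."""

    def __init__(self):
        self.matched_alternatives = defaultdict(list)  # start -> end positions
        self.pending_exceptions = defaultdict(int)     # start -> checks still running
        self.cancelled_starts = set()                  # starts where an exception matched

    def exception_created(self, start):
        self.pending_exceptions[start] += 1

    def alternative_matched(self, start, end):
        self.matched_alternatives[start].append(end)

    def exception_decided(self, start, matched):
        self.pending_exceptions[start] -= 1
        if matched:
            # A matched exception cancels every alternative with the same start.
            self.cancelled_starts.add(start)

    def status(self, start):
        """Return 'cancelled', 'undecided', 'tentative' or 'final' for a start position."""
        if start in self.cancelled_starts:
            return "cancelled"
        if not self.matched_alternatives.get(start):
            return "undecided"
        if self.pending_exceptions.get(start, 0) > 0:
            return "tentative"
        return "final"


state = VariationState()
state.exception_created(0)
state.alternative_matched(0, 0)
print(state.status(0))  # tentative: the exception is still being checked
state.exception_decided(0, matched=False)
print(state.status(0))  # final: the only exception did not match
```

Keying everything by start position reflects the rule that an exception affects only the alternatives whose match began at the same position.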

Because all templates are detected in a single pass over the text, exceptions in variations have to be checked at the same time as the elements that follow the variation are being matched. Matched candidates of alternatives whose exceptions are still being checked are considered tentatively matched. Tentative matching propagates to head candidates as well. For example, if all elements of a sequence have matched but at least one of them, at any position, has matched only tentatively, then the whole sequence is considered tentatively matched.
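The propagation rule for sequences can be written down as a small Python function. The `Status` enumeration and the function name are hypothetical; the sketch only shows how the status of a head candidate could be derived from the statuses of its elements once all of them have been processed.

```python
from enum import Enum


class Status(Enum):
    MISMATCH = 0
    TENTATIVE = 1  # matched, but some exception check is still pending
    FINAL = 2      # matched, all exception checks finished


def sequence_status(element_statuses):
    """Combine the statuses of the elements of a fully processed sequence."""
    statuses = list(element_statuses)
    if Status.MISMATCH in statuses:
        return Status.MISMATCH
    if Status.TENTATIVE in statuses:
        # One tentative element makes the whole sequence tentative.
        return Status.TENTATIVE
    return Status.FINAL


print(sequence_status([Status.FINAL, Status.TENTATIVE, Status.FINAL]))
```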

An example of this mechanism is the search for the template #P = {A, ~(A + B)} in the input token sequence AAB. Here and in the examples below, Latin letters denote the tokens of search expressions and the corresponding tokens of the input text. The template above detects a token A that is not followed by a token B.

The step-by-step change in the state of the search executor is shown in FIG. 5. For simplicity, FIG. 5 lists the tokens without the start-of-text token, since it does not affect the search for this template. Blocks 110, 120, 130 and 140 are snapshots of the search executor's state and show how it changes from 110 to 140 as the text tokens are processed: token A at position 0 (A₀), token A at position 1 (A₁), token B at position 2 (B₂) and the end-of-text token END. Each snapshot shows the state of the candidate set 22 and of the match results 24. Processing each token consists of two steps: making decisions for the candidates created while processing the previous token (hereinafter, decision making), and creating new candidates from the root search index (hereinafter, candidate creation).

When the first token A₀ is processed, no candidates exist yet, so the decision-making step is skipped and candidates are created from the root index. For token A, two tokens are retrieved from the root index, belonging to two elements of the variation: the alternative and the exception. Candidates 113 and 117 are created for them respectively, together with a candidate tree rooted at template candidate 111. Variation elements 112 and 115 become linked to the same variation state 114. The resulting state after this step is shown in snapshot 110.

When the second token A₁ is processed, decisions are made for the candidate set, and a match is found for the candidate tree rooted at candidate 111 of template P at position 0. This happens because its subordinate element, variation alternative 112, matches. Alternative 112 matches on its own, because its subordinate element 113 matches, but its match becomes final only once all exceptions of the variation created at the same text position have been checked with a negative result. The order in which candidates are processed does not matter here. If processing starts with candidate 117, it is first established that variation exception 115 does not match; the decision for alternative 112 is deferred, because its subordinate candidate 113 has not been processed yet. Once candidate 113 has been processed, a check confirms that no variation exceptions associated with the alternative remain, and the match can be declared final. If instead processing starts with candidate 113, the candidate of alternative 112 is processed and a tentative match is declared, because the variation's exceptions have not all been checked. The tentative match is reported to head candidate 111, the candidate of template P, but nothing is saved to the result yet. Matched template candidate 111 is saved to the result only after its match becomes final, that is, after both candidate 113 and candidate 117 have been processed and decided. In this scenario that happens once the variation exception is found not to match.

The variation state accumulates information about matches and mismatches of alternatives and exceptions that started at the same position, and allows the final decision on the associated alternatives to be made once final decisions have been made for all associated exceptions. Here, variation state 114 records the match of alternative 112 and allows the variation's match starting at position 0 to be declared final after the associated exception 115, which starts at the same position 0, is finally found not to match.

In the candidate-creation step for the second token A₁, a structure is created like the one created for the first token A₀; only the start positions stored in the candidates differ. For token A, two tokens are retrieved from the root index, belonging to the two elements of the variation: the alternative and the exception. Candidates 123 and 127 are created for them respectively, together with a candidate tree rooted at template candidate 121. Variation elements 122 and 125 become linked to the same variation state 124. The resulting state after processing token A₁ is shown in snapshot 120.

Processing the next token, B₂, creates token candidate 138, the second element of sequence 126 in exception 125. At the same time, alternative 122 and the whole template candidate 121, which started at position 1, are declared tentatively matched. Candidates 123 and 127 are removed from the candidate set, and candidate 138 is added to it. The candidate tree rooted at candidate 121 is not garbage at this step, because it is reachable through candidate 138, which is in the candidate set. The resulting state after processing token B₂ is shown in snapshot 130.

When the end-of-text token END is processed, sequence 126 and variation exception 125, which starts at position 1, are declared finally matched. Consequently, the final decision on variation state 124 cancels the match of the whole template candidate 121. The resulting state after processing END is shown in snapshot 140.

As a result of the steps shown in FIG. 5, a single match of template P at position 0 is detected, as expected. A reference to candidate 111 of this match is saved to the result. Thanks to the two-way links between subordinate and head candidates, a template candidate gives access to detailed information about the matches of all its subordinate elements, down to the matches of terminals with specific text tokens. For instance, template candidate 111 reveals that it matched because variation alternative 112 matched, which in turn matched because candidate 113 of token A matched the text token at position 0.
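The walk-through for #P = {A, ~(A + B)} can be reproduced with a short Python simulation. This is a sketch hard-wired to this one template, not the general search executor: the function `scan`, the `END` marker and the event names are hypothetical, and candidate trees are reduced to the start positions of their matches. It keeps the two steps per token described above (decision making, then candidate creation).

```python
END = "<END>"  # hypothetical end-of-text token


def scan(tokens):
    """Single pass over the tokens for the template 'A not followed by B'."""
    results, log = [], []
    fresh = []      # starts of candidates created while processing the previous token
    tentative = []  # starts matched tentatively: the exception has consumed A and B
    for pos, tok in enumerate(list(tokens) + [END]):
        # Step 1a: an exception that consumed B on the previous token is now
        # finally matched, so the tentative template match is cancelled.
        for start in tentative:
            log.append((pos, start, "cancelled"))
        tentative = []
        # Step 1b: decisions for candidates created on the previous token.
        for start in fresh:
            if tok == "B":
                tentative.append(start)   # exception still running
                log.append((pos, start, "tentative"))
            else:
                results.append(start)     # exception failed, match is final
                log.append((pos, start, "final"))
        # Step 2: create new candidates; only token A starts the template.
        fresh = [pos] if tok == "A" else []
    return results, log


matches, events = scan("AAB")
print(matches)  # [0]
print(events)   # [(1, 0, 'final'), (2, 1, 'tentative'), (3, 1, 'cancelled')]
```

The three logged events correspond to snapshots 120, 130 and 140: the match at position 0 becomes final on A₁, the match at position 1 is tentative on B₂, and it is cancelled on END.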

In the rules above, decisions are made directly on the basis of text tokens. The search for template references and for templates within a scope, by contrast, relies not on tokens directly but on matches of other templates.

Template references, like tokens, are terminal elements, but their candidates are created by more complex rules. They can be created by other candidates during decision making or selected from the root search index, and the two cases are handled differently. During decision making, candidates are created for all references that may occur at a given position, because it cannot be known in advance whether a reference can match there. After creation, each reference candidate is registered in the reference checking state, where it is associated with the template it refers to and with the position at which that template is expected to match.

After decision making, new candidates are created for references whose match has already been detected. This uses the list of tag candidates that matched during processing of the current token. As an optimization, only tags that are referenced are stored in this list; whether a tag is referenced is recorded in the tag expression when the templates are translated. Matched references are selected from the search index and candidate trees are created for them. Unlike constants, a reference candidate makes its decision based on the next token immediately upon creation. In the process, a reference match may immediately cause a whole template to match, and that template may itself be referenced. A simple example of this situation is the recursive definition #P = ?P + ?A.

This template matches any number of tokens A. To find it, the list of template candidates matched during processing of a token has to be extended while decisions are being made for references created from that same list. To prevent infinite loops, a check is made for each new candidate whether a candidate of the same template whose match started at the same position has already been processed.
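The loop guard can be sketched in Python as a worklist with a set of already processed (template, start position) pairs. The function `propagate` and the `referrers` mapping are hypothetical simplifications: the mapping lists, for each template, the templates that match immediately (from the same start position) once it has matched, as happens for the left-recursive #P = ?P + ?A.

```python
def propagate(matched, referrers):
    """Process matched templates, growing the worklist while it is being consumed."""
    seen = set()               # (template, start) pairs already processed
    worklist = list(matched)
    processed = []
    while worklist:
        key = worklist.pop()
        if key in seen:
            continue           # same template, same start: skip to avoid a loop
        seen.add(key)
        processed.append(key)
        template, start = key
        # A match may immediately complete templates that reference it,
        # so new entries are appended to the list currently being processed.
        for parent in referrers.get(template, ()):
            worklist.append((parent, start))
    return processed


# #P references itself, so a match of P at position 0 re-triggers P at position 0.
print(propagate([("P", 0)], {"P": ["P"]}))  # [('P', 0)]
```

Without the `seen` check the self-reference would keep re-adding the same entry and the loop would never terminate.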

Matches of reference candidates that were not created from already matched templates are detected through the reference checking state. When a template that may be referenced matches, a search is made for reference candidates waiting for its match at the given position. If such candidates are found, they are linked to the match and a decision is then made based on the next token.
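One possible shape for the reference checking state is a dictionary keyed by the referenced template and the expected start position. The class and method names in the following Python sketch are hypothetical, and reference candidates are represented by plain strings.

```python
from collections import defaultdict


class ReferenceCheckState:
    """Hypothetical registry of reference candidates waiting for a template match."""

    def __init__(self):
        self._waiting = defaultdict(list)  # (template, position) -> candidates

    def register(self, template, position, candidate):
        # Called right after a reference candidate has been created.
        self._waiting[(template, position)].append(candidate)

    def on_template_matched(self, template, start):
        # Return and forget the candidates that waited for exactly this match.
        return self._waiting.pop((template, start), [])


state = ReferenceCheckState()
state.register("P2", 1, "reference to P2 inside P1")
print(state.on_template_matched("P2", 0))  # []: nobody waits at position 0
print(state.on_template_matched("P2", 1))  # ['reference to P2 inside P1']
```

The candidates returned by `on_template_matched` are the ones that would then be linked to the match and decided on the next token.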

The logic of searching for templates within a scope is similar to that of searching for references. With references, however, the start position of the template match is fixed, whereas a scope requires one template to match within the bounds of another.

Unlike a reference candidate, a candidate of a template within a scope is not registered anywhere when it is created. It enters the state only after its child element has matched, because the end position of the match becomes known only at that moment.

For each tag used as a scope, a list of candidates waiting for its match is kept, sorted by start and end token number. This makes it possible, when a template matches, to quickly find the candidates that matched inside it. After the template matches, binary search in the sorted list locates the first candidate that the match can affect. From that position on, the match bounds of each candidate are checked, and the head candidate is notified whenever a match within the scope is found. Because the list is sorted, the first candidate that lies beyond the scope of the matched template stops the scan. Matched candidates are removed from the list.
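The sorted waiting list and its binary search can be sketched with Python's standard `bisect` module. The class `ScopeWaitList` and its methods are hypothetical; candidates are stored as `(start, end, name)` tuples with inclusive token positions, and notifying the head candidate is reduced to returning the names found inside the scope.

```python
import bisect


class ScopeWaitList:
    """Hypothetical per-tag list of candidates waiting for a scope match."""

    def __init__(self):
        self._items = []  # kept sorted by (start, end, name)

    def add(self, start, end, name):
        bisect.insort(self._items, (start, end, name))

    def on_scope_matched(self, scope_start, scope_end):
        """Remove and return candidates lying entirely inside the matched scope."""
        # Binary search for the first candidate starting at or after scope_start.
        first = bisect.bisect_left(self._items, (scope_start,))
        inside, kept = [], []
        j = first
        while j < len(self._items):
            start, end, name = self._items[j]
            if start > scope_end:
                break  # sorted by start: nothing further can be inside
            if end <= scope_end:
                inside.append(name)
            else:
                kept.append(self._items[j])  # starts inside, ends beyond the scope
            j += 1
        self._items[first:j] = kept
        return inside


# Text ABCB with #P1 = B @ P2 and #P2 = A + B + C (P2 matches tokens 0..2).
waiting = ScopeWaitList()
waiting.add(1, 1, "B at 1")
waiting.add(3, 3, "B at 3")
print(waiting.on_scope_matched(0, 2))  # ['B at 1']
```

Only the B at position 1 lies inside the matched ABC; the B at position 3 stays in the list in case a later scope match covers it.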

Testing recommendations.

A distinctive feature of the invention is the high complexity of its underlying algorithm. It is therefore recommended to implement the device using test-driven development: first a test covering the desired functionality is written, and then the functionality needed to pass that test is implemented. Tests produced this way are usually unit tests that exercise the functions and procedures of objects. Unit tests make it much easier to change existing code, since they pinpoint errors introduced by new changes fairly precisely. For this invention, however, writing unit tests is quite laborious because of the large number of states that can arise during decision making. In addition to unit tests, it is recommended to create integration tests that check, in black-box fashion, that the actual search results match the expected ones. To verify the embodiment described here, more than three hundred automated tests were implemented, divided into the following groups:

tests of the lexical analyzer of the text;

tests of the template parser;

tests of the syntax tree translator;

tests of the expressions that represent templates at the search executor level;

candidate tests;

tests of the reference checking state;

tests of the state for checking templates within a scope;

tests of the search index.

The unit tests were developed test-first while the device was being implemented, and they test functionality in isolation. They make heavy use of mock objects that provide the program environment required by the test case. Modeling the external environment is most laborious in the tests of candidate decision making, but these tests localize implementation errors most precisely.

The integration tests are template search tests and fall into two groups: tests in which templates and tokens are created as program objects, and tests in which templates and text are given as strings. The first group makes it possible to test the search executor without depending on the translator from the template description language; creating search expressions by hand also simplifies debugging. For simplicity, no lexical analyzer is used when test tokens are created programmatically: tokens are created by hand and are letters of the Latin alphabet. The integration tests of the second group check the same decision-making situations, but additionally verify that the lexical analyzer, the translator from the template description language and the search executor work together on real data.
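A black-box integration test of the first group can be organized as a table of cases, with templates and tokens built directly as program objects. In the Python sketch below, `naive_search` is a deliberately simple stand-in and not the search executor of the invention: it supports only sequences of tokens, each of which may be optional, and its greedy left-to-right matching is an assumption. The four cases mirror test cases 4, 8, 9 and 10 of the test-case table.

```python
def naive_search(pattern, tokens):
    """Stand-in matcher: pattern is a list of (token, optional) pairs."""

    def match_at(i, k):
        # End index (exclusive) of a greedy match of pattern[k:] at tokens[i:], or None.
        if k == len(pattern):
            return i
        tok, optional = pattern[k]
        if i < len(tokens) and tokens[i] == tok:
            end = match_at(i + 1, k + 1)
            if end is not None:
                return end
        return match_at(i, k + 1) if optional else None

    results, i = [], 0
    while i < len(tokens):
        end = match_at(i, 0)
        if end is not None and end > i:
            results.append("".join(tokens[i:end]))
            i = end
        else:
            i += 1
    return results


CASES = [
    # (name, pattern, tokens, expected matches)
    ("sequence", [("A", False), ("B", False)], list("AB"), ["AB"]),
    ("optional first element", [("A", True), ("B", False)], list("B"), ["B"]),
    ("optional last element", [("A", False), ("B", True)], list("A"), ["A"]),
    ("optional middle element",
     [("A", False), ("A", True), ("B", False)], list("AB"), ["AB"]),
]


def run_cases(search):
    """Return the names of the cases whose actual result differs from the expected one."""
    return [name for name, pattern, tokens, expected in CASES
            if search(pattern, tokens) != expected]


print(run_cases(naive_search))  # an empty list means every case passed
```

Because `run_cases` takes the search function as a parameter, the same case table can be run against any implementation that accepts the same pattern and token objects.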

The main test cases are described in the following table.

| Name | Template | Tokens | Matches |
|---|---|---|---|
| 1. Token | A; | A | A |
| 2. Token with repetition | [2] A; | AA | AA |
| 3. Token with variable repetition | [1-2] A; | AA | AA |
| 4. Sequence | A + B; | AB | AB |
| 5. Sequence with repetition | [2] (A + B); | ABAB | ABAB |
| 6. Sequence with variable repetition | [1-2] (A + B); | ABAB | ABAB |
| 7. Sequence with a repeated nested sequence | [2-3] (A + B) + C; | ABABC | ABABC |
| 8. Sequence starting with an optional element | ?A + B | B | B |
| 9. Sequence ending with an optional element | A + ?B | A | A |
| 10. Sequence with an optional element in the middle | A + ?A + B | AB | AB |
| 11. Sequence with an optional element in the middle that contains an exception | A + ?{B, ~B} + B | AB | AB |
| 12. Nested optional sequence | C + (?A + ?B) + D | CABD | CABD |
| 13. Variation | {A, B} | AB | A, B |
| 14. Variation with repetition | [2] {A, B} | AB | AB |
| 15. Variation with variable repetition | [1-2] {A, B} | AB | AB |
| 16. Variation with an exception | {A, ~(A + B)} | AAB | A |
| 17. Repeated variation with an exception | [1-3] {C, D, ~(C + D + B)} | CCDB | C, D |
| 18. Nested exceptions | {A, ~{A + D, ~{A + D + C}}} | ADC | A |
| 19. Even number of identical nested exceptions | {A, ~{A, ~A}} | A | A |
| 20. Odd number of identical nested exceptions | {A, ~{A, ~{A, ~A}}} | A | none |
| 21. Exception on the second repetition | [1-2] {C, B, ~{B + D}} | CBD | C |
| 22. Variation of optional elements in a sequence | C + {?A, ?B} + D | CD | CD |
| 23. Cancellation of several alternatives | [1-3] {A, B, B + D, ~(A + B + D)} | AABD | A, BD |
| 24. Exception in the middle of a sequence | {A, ~(A + B + C + D)} + B | ABCE | AB |
| 25. Several exceptions of different lengths | {A, ~(A + D), ~(A + B + C)} | ABCEAD | none |
| 26. Exception shorter than the alternative | {A + B + C, ~(A + D)} | ABE | none |
| 27. Overlapping matches of different lengths | [3] {A, B, ~(B + A + B)} | ABABAABB | A₂BA, ABB |
| 28. Reference in the middle of a sequence | #P1 = A + P2 + C; #P2 = B; | ABC | ABC (#P1), B (#P2) |
| 29. Optional reference in the middle of a sequence | #P1 = A + ?P2 + C; #P2 = B; | AC | AC (#P1) |
| 30. Repeated reference | #P1 = A + [2]P2 + C; #P2 = B; | ABBC | ABBC (#P1), B₁ (#P2), B₂ (#P2) |
| 31. Chain of references | #P1 = A + P2; #P2 = B + P3; #P3 = C; | ABC | ABC (#P1), BC (#P2), C (#P3) |
| 32. Several references to one template | #P1 = A + P3 + C; #P2 = A + P3; #P3 = B + B; | ABBC | ABBC (#P1), ABB (#P2), BB (#P3) |
| 33. Variation of references | #P1 = A + {P2, P3} + C; #P2 = B; #P3 = B + C; | ABCC | ABCC (#P1), B (#P2), BC (#P3) |
| 34. Reference in an exception | #P1 = {A, ~P2}; #P2 = A + B; | AAB | A (#P1), AB (#P2) |
| 35. Reference in a variation with an exception | #P1 = {P2, ~(A + B + C)}; #P2 = A + B; | ABABC | A₀B (#P1), A₀B (#P2), A₂B (#P2) |
| 36. Right recursion | #P = A + ?P | AAA | AAA |
| 37. Left recursion | #P = ?P + ?A | AAA | AAA |
| 38. Template inside a scope | #P1 = B @ P2; #P2 = A + B + C; | ABCB | B (#P1), ABC (#P2) |
| 39. Template on the boundary of a scope | #P1 = A @ P2; #P2 = A + B + C; | ABCB | A (#P1), ABC (#P2) |
| 40. Template covering the entire scope | #P1 = (A + B) @ P2; #P2 = A + B; | AB | AB (#P1), AB (#P2) |
| 41. Optional template in a scope | #P1 = A + ?B @ P2; #P2 = A + B + C; | ABCAB | AB (#P1), ABC (#P2) |
| 42. Nested scopes | #P1 = A @ P2 @ P3 @ P4; #P2 = B + A + B; #P3 = C + B + A + B + C; #P4 = D + C + B + A + B + C + D; | DCBABCD | A (#P1), BAB (#P2), CBABC (#P3), DCBABCD (#P4) |
| 43. Reference in the middle of a sequence; one tag | #P1 = A + P2 + C; P2 = B; | ABC | ABC (#P1) |
| 44. Repeated reference; one tag | #P1 = A + [2]P2 + C; P2 = B; | ABBC | ABBC (#P1) |
| 45. Chain of references; one tag | #P1 = A + P2; P2 = B + P3; P3 = C; | ABC | ABC (#P1) |
| 46. Several references to one template; one tag | #P1 = A + P3 + C; P2 = A + P3; P3 = B + B; | ABBC | ABBC (#P1) |
| 47. Variation of references; one tag |
Variation of links; one tag #P1 = A + {Р2, РЗ} + С; Р2 = В; РЗ = В + С; # P1 = A + {P2, P3} + C; P2 = B; RZ = B + C; АВСС ABCC АВСС (#Р1) ABCC (# P1) 48. Ссылка в исключении; один тег 48. Link in exclusion; one tag #Р1 = {А, ~Р2}; Р2 = А + В; # P1 = {A, ~ P2}; P2 = A + B; ААВ AAB А(#Р1) A (# P1) 49. Ссылка в вариации с 49. Link in variation with #P1 = {Р2, ~(А + В + С)}; # P1 = {P2, ~ (A + B + C)}; АВ АВС AB ABC АоВ (#Р1) AoB (# P1) исключением; один тег exception; one tag Р2 = А + В; P2 = A + B; 50. Шаблон внутри области действия; один 50. Template within the scope; one #Р1=В@Р2; Р2 = А + В + С; # P1 = B @ P2; P2 = A + B + C; АВСВ ABCV В (#Р1) B (# P1) тег tag 51. Шаблон на границе области действия; один 51. Template on the border of the scope; one #Р1=А@Р2; Р2 = А + В + С; # Р1 = А @ Р2; P2 = A + B + C; АВСВ ABCV А(#Р1) A (# P1) тег tag 52. Шаблон, покрывающий всю 52. A template that covers the entire #Р1 = (А + В) @ Р2; # P1 = (A + B) @ P2; АВ (#Р1) AB (# P1) область действия; один scope; one Р2 = А + В; P2 = A + B; АЬ Ab тег tag 53. Необязательный шаблон в области 53. Optional template in scope #P1 = А + ?В @ Р2; # P1 = A +? B @ P2; АВСАВ ABCAB АВ (#Р1) AB (# P1) действия; один тег actions; one tag Р2 = А + В + С; P2 = A + B + C; #Р1 = А@Р2@РЗ @Р4; Р2 = В + А + В· # Р1 = А @ Р2 @ РЗ @ Р4; P2 = B + A + B 54. Вложенные области 54. Nested areas РЗ = С + В + А + В + С; RZ = C + B + A + B + C; DCBABCD DCBABCD А (#Р1) A (# P1) действия; один тег actions; one tag P4=D+C+B+A+B+C+ D; P4 = D + C + B + A + B + C + D;

Performance tests.

Performance tests of the implemented embodiment of the invention were carried out. The tests compared the search times of templates and of their equivalent regular expressions on the same set of texts. The following test bench environment was used:

OS: Windows 10 x64;

software platform: .NET Core 2.1.104;

runtime environment: .NET Shared Framework Host 2.0.5;

processor: Intel Core i7-8700 3.20 GHz;

RAM: 64 GB.

The text set consisted of 19 text files with a total size of 207,316 characters. The files were created from articles on news portals covering business and finance. The tests used the standard regular expression implementation that ships with the .NET Core platform in use.

Only the search execution time itself was measured. Both the text data and the templates were loaded into memory beforehand. This increased the accuracy of the comparison and excluded lengthy disk accesses, whose duration can differ significantly from test to test. The creation of the regular expression objects and the translation of the templates also took place during the preparation stage. To reduce the influence of the dynamic code compilation used in .NET, several search iterations were also run during the preparation stage.

The comparison was made for three classes of templates:

token variations;

words at a distance;

complex templates: phone number, URL, email address, hashtag.

For the first two classes, the templates were generated automatically from company names and their stock tickers. For each of these tests, 3383 templates were generated. In the examples below, A denotes the full name of a company and B denotes its stock ticker.

Token variation templates detect mentions of companies. In the template description language they have the form {A, B}

The equivalent regular expressions have the form (?i)\bA\b|\bB\b, where the directive (?i) enables case-insensitive search and \b denotes a word boundary.
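As a rough illustration of the regular expression side of this comparison, the sketch below builds the expression (?i)\bA\b|\bB\b for a single company. It uses Python's `re` module rather than the .NET implementation used in the tests, and the company name "Acme Corp", the ticker "ACME" and the helper name `variation_regex` are hypothetical.

```python
import re

def variation_regex(name: str, ticker: str) -> re.Pattern:
    """Build (?i)\\bA\\b|\\bB\\b for one company; literal text is escaped."""
    return re.compile(rf"(?i)\b{re.escape(name)}\b|\b{re.escape(ticker)}\b")

# Hypothetical company name (A) and ticker (B), for illustration only.
pattern = variation_regex("Acme Corp", "ACME")
text = "Shares of acme corp rose; ACME closed higher, but ACMEX did not."

# \b keeps "ACMEX" from matching; (?i) lets "acme corp" match.
mentions = [m.group(0) for m in pattern.finditer(text)]
print(mentions)
```

With 3383 companies, this approach needs either 3383 separate expressions or one very large alternation, which is the situation the first test measures.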

Words-at-a-distance templates detect mentions of a company in the context of its stock ticker. A mention is counted when the company name and its stock ticker occur in the text no more than five words apart. In the template description language, the expression for finding such constructions has the form {A .. 0-5 .. B, B .. 0-5 .. A};

A regular expression with a similar meaning is too bulky to give in full. The construction "followed at a distance of no more than five words" is expressed by the regular expression (\bA\b)(?:[\s,.!?()]+\w+){0,5}?[\s,.:;!?()]+(\bB\b), where the part between A and B describes alphabetic or alphanumeric sequences separated by special or whitespace characters, repeated no more than five times.
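The fragment above can be tried out with the following sketch, which covers only the "A followed by B" direction. It again uses Python's `re` module instead of .NET, and the name "Acme Corp", the ticker "ACME" and the helper `distance_regex` are assumptions made for the example.

```python
import re

def distance_regex(a: str, b: str) -> re.Pattern:
    """A, then at most five intervening words, then B (case-insensitive)."""
    a, b = re.escape(a), re.escape(b)
    return re.compile(
        rf"(?i)(\b{a}\b)(?:[\s,.!?()]+\w+){{0,5}}?[\s,.:;!?()]+(\b{b}\b)"
    )

pattern = distance_regex("Acme Corp", "ACME")

# Three words between the name and the ticker: counted as a mention.
near = pattern.search("Acme Corp shares rose today (ACME)")
# Six words between them: outside the five-word window, so no match.
far = pattern.search("Acme Corp one two three four five six ACME")

print(near.group(1), near.group(2))
print(far)
```

The reverse order (B followed by A) would need a second alternative of the same shape, which is part of why the full expression becomes bulky.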

Complex templates are hand-written templates for detecting a phone number, a URL, an email address and a hashtag. The following definitions are used in the template description language:

#PhoneNumber = ?+ + {Num, ( + Num + )} + [2+]Space} + Num);

#Email = Word + [0+]{Word, +, -} + + Domain;

#Url = {http, https} + :// + Domain + ?Path + ?Query;

Domain = Word + [1+] + Word + [0+]{Word, , -});

Path = / + [0+]{Word, /, %};

Query = ? + [0-1](QueryParam + [0+](& + QueryParam));

QueryParam = Identifier + = + Identifier;

Identifier = {AlphaNum, Alpha, _} + [0+]{Word, _};

#HashTag = # + Identifier;

In the language of regular expressions, the phone number, email address, URL and hashtag are described, respectively, by the constructions

(?<=\s)\+?(\d+|\(\d+\))([-\s]\d+){2,}(?=\s)

[a-zA-Z0-9_.+-]+@([\w-]+(?:\.[\w-]+)*)

(https?:\/\/)([\w-]+(?:\.[\w-]+)*)(\/[\w\/%+]+)?(?:\?((\w+=\w+)(?:&(\w+=\w+))*))?

\B(\#[a-zA-Z]+\b)
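To make the four expressions easier to check, the sketch below compiles them with Python's `re` module and runs each against a short sample string. The sample texts and the helper `first_match` are invented for illustration, and Python's regex dialect is assumed to behave like .NET's for these particular constructs.

```python
import re

PHONE = re.compile(r"(?<=\s)\+?(\d+|\(\d+\))([-\s]\d+){2,}(?=\s)")
EMAIL = re.compile(r"[a-zA-Z0-9_.+-]+@([\w-]+(?:\.[\w-]+)*)")
URL = re.compile(
    r"(https?:\/\/)([\w-]+(?:\.[\w-]+)*)(\/[\w\/%+]+)?"
    r"(?:\?((\w+=\w+)(?:&(\w+=\w+))*))?"
)
HASHTAG = re.compile(r"\B(\#[a-zA-Z]+\b)")

def first_match(pattern: re.Pattern, text: str):
    """Return the text of the first match, or None if there is none."""
    m = pattern.search(text)
    return m.group(0) if m else None

# Invented sample strings, one per expression.
phone = first_match(PHONE, "call +1 555 123 4567 now")
email = first_match(EMAIL, "write to john.doe@example.com today")
url = first_match(URL, "see https://example.com/docs/page?x=1&y=2 for details")
hashtag = first_match(HASHTAG, "great day #finance!")

print(phone, email, url, hashtag, sep="\n")
```

These expressions operate character by character, which is the level at which the third test compares the two approaches.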

While running the tests and determining the search times for all three sets of templates, both the number of measured runs used to compute the mean and the number of warm-up iterations were adjusted. The running time of the template search tests and of the regular expression test for complex constructions does not exceed one second, so ten warm-up runs and twenty measured iterations were used to increase accuracy. Regular expression search in the tests with a large number of templates takes more than ten seconds, so no warm-up iterations were run and three iterations were used for measurement.
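The measurement procedure described here (unmeasured warm-up iterations followed by measured iterations whose times are averaged) could be sketched roughly as follows. This is not the test harness used by the authors, which ran on .NET; the function name `measure` and its parameters are assumptions, and the workload is a stand-in.

```python
import time
from statistics import mean
from typing import Callable

def measure(search: Callable[[], object], warmup: int, runs: int) -> float:
    """Return the mean wall-clock time of `search` over `runs` measured calls.

    `warmup` calls are executed first and are not timed.
    """
    for _ in range(warmup):
        search()
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        search()
        timings.append(time.perf_counter() - start)
    return mean(timings)

# Stand-in workload; a real test would run the search over preloaded texts.
def dummy_search() -> int:
    return sum(range(10_000))

# Iteration counts from the text: 10 warm-up + 20 measured for fast tests,
# 0 warm-up + 3 measured for the slow regular expression tests.
fast_mean = measure(dummy_search, warmup=10, runs=20)
slow_mean = measure(dummy_search, warmup=0, runs=3)
print(fast_mean, slow_mean)
```

Keeping data loading and object construction outside `search` mirrors the point made earlier that only the search time itself is measured.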

The measurement results and the rounded value n, which shows the ratio of the regular expression search time to the template search time, are given in the following table.


Test type | Regular expression search time, s | Template search time, s | n
Variations of constants | 14 | 0.047 | 298
Following through words | 15 | 0.2 | 75
Complex constructions | 0.051 | 0.155 | 0.33
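The ratio n in the table can be recomputed from the two time columns. The short sketch below does this; the rounding rule (whole numbers above 1, two decimals below 1) is an assumption inferred from the published values.

```python
# (regular expression time, template time) in seconds, taken from the table.
results = {
    "Variations of constants": (14.0, 0.047),
    "Following through words": (15.0, 0.2),
    "Complex constructions": (0.051, 0.155),
}

def ratio(regex_seconds: float, template_seconds: float) -> float:
    """n = regular expression search time / template search time."""
    return regex_seconds / template_seconds

def rounded(n: float):
    """Assumed rounding: integers for n >= 1, two decimals otherwise."""
    return round(n) if n >= 1 else round(n, 2)

n_values = [rounded(ratio(r, t)) for r, t in results.values()]
print(n_values)
```

A value of n below 1, as in the third row, means that the regular expression search was the faster of the two.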

The test results were as expected: in the first two tests, which search for 3383 constructions, template search was significantly faster. This is due both to the ability to detect matches with many templates in a single pass over the text and to working at the token level. In the test that searches for four complex constructions, where the work is done at a level close to the character level, template search was only about three times slower.

Conclusion.

Thus, it should be obvious to a person of ordinary skill in the art that a device built in accordance with the proposed invention provides fast searching of text for template matches by pre-parsing the text into tokens and checking all matches of all templates in a single sequential scan of the text tokens. This achieves high search speed, completeness and accuracy, independence of the search from the language of the text, and safety, meaning that the search is never drawn out for so long that the user perceives it as a hang or an endless search. A device built in accordance with the proposed invention makes it possible to quickly detect sets of concepts, entities and their relations in natural language texts.
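The single-pass idea summarized above can be sketched in a heavily simplified form. The code below is not the claimed implementation: it assumes that each template is a flat sequence whose elements are sets of alternative tokens, it drops whitespace tokens, and it omits repetitions, exceptions, references, scopes and logical copies of candidates. All names (`tokenize`, `build_index`, `find_all`, `Candidate`) and the sample templates are hypothetical.

```python
import re
from collections import defaultdict
from dataclasses import dataclass

def tokenize(text: str) -> list[str]:
    """Split text into lower-cased words and non-space separator tokens."""
    return [t.lower() for t in re.findall(r"\w+|[^\w\s]", text)]

@dataclass
class Candidate:
    name: str   # template identifier
    start: int  # index of the first matched token
    pos: int    # number of template elements matched so far

def build_index(templates: dict) -> dict:
    """Search index: first token -> names of templates that can start with it."""
    index = defaultdict(list)
    for name, elements in templates.items():
        for token in elements[0]:
            index[token].append(name)
    return index

def find_all(templates: dict, text: str) -> list[tuple[str, int, int]]:
    """Return (template, first token index, last token index) for every match."""
    tokens = tokenize(text)
    index = build_index(templates)
    candidates, results = [], []
    for i, token in enumerate(tokens):      # one sequential scan of the tokens
        survivors = []
        for c in candidates:                # advance or drop existing candidates
            if token in templates[c.name][c.pos]:
                c.pos += 1
                if c.pos == len(templates[c.name]):
                    results.append((c.name, c.start, i))   # fully matched
                else:
                    survivors.append(c)                    # partially matched
        for name in index.get(token, ()):   # start candidates from the index
            if len(templates[name]) == 1:
                results.append((name, i, i))
            else:
                survivors.append(Candidate(name, i, 1))
        candidates = survivors
    return results

# Hypothetical templates: a variation {acme, acm} and a three-element sequence.
templates = {
    "Mention": [{"acme", "acm"}],
    "Rating": [{"acme"}, {"corp"}, {"rose", "fell"}],
}
matches = find_all(templates, "Acme Corp rose, then ACM fell.")
print(matches)
```

Because every candidate is either advanced or discarded at each token, the amount of work per token depends on the number of live candidates rather than on the total number of templates, which is the property the first two performance tests rely on.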

The invention has been disclosed in several embodiments. It should be obvious that the disclosed device can be implemented with modifications without departing from the invention. The appended claims cover the entire set of variations and modifications that fall within the spirit of the present invention.

Claims

1. A method of searching for matches with patterns in the text to identify a given set of concepts, entities and relationships in a text in a natural or machine-readable language, characterized by the fact that

A) the text is pre-parsed into lexemes, which include at least words and word separators;

B) in the template description language, create a set of templates corresponding to the given concepts, entities and relations, in which each template is a formal grammar consisting at least of sequences and (or) variations and (or) repetitions of text tokens, and (or) occurrences of other patterns;

C) translate a set of templates into search expression trees with search indexes, which allow for a given token or a given template identifier to quickly find all search expressions that begin with a given token or a given template;

D) for the given set of templates, a set of candidates is created, each of which stores information about the matching of text tokens with elements of the search expression tree, where the matching order corresponds to the order of traversing the tree from the leaves to the root; matching a sequence requires matching all of its elements in the given order; matching a variation requires matching at least one of its elements; matching a repetition requires matching its element the specified number of times; and matching a text token is performed at least with or without regard to upper and lower case;

E) then scan the lexemes of the text once sequentially and for each token perform at least the following actions:

i) all patterns starting with the current token are searched in the search indices, candidates are created to check for text matches with these patterns, and they are added to the set of candidates;

ii) for each candidate from the set of candidates, the current token of the text is compared with the next element or elements of the candidate search expression tree for a match;

iii) if the next element of the search expression tree coincides with the current token of the text and is the last in the traversal order, for example, the root, then the candidate is considered to be completely matched and is transferred from the set of candidates to the set of results; if the next element of the search expression tree coincides with the current token of the text and is not the last in the traversal order, then the candidate is considered partially matched and is left in the set of candidates; if the next element of the search expression tree does not match the current token of the text, then the candidate is considered non-matching and is removed from the set of candidates;

iv) to take into account different variants of the text matching the checked templates, logical copies of those candidates for which different matching variants are possible are created and added to the set of candidates, where the logical copies of a candidate contain the same information about the matches of elements of the search expression tree with the already scanned text tokens and differing information about the match of an element or elements of the search expression tree with the current text token.

2. The method of searching for matches with patterns in the text according to claim 1, in which

A) the templates being checked allow recursive definitions through the same templates and / or through other templates;

B) the number of candidates created in the search process and (or) the amount of resources consumed by the candidates and (or) the depth of recursion when enumerating matching variants is limited.

3. The method for searching the text for matches with patterns according to claim 1, wherein the patterns support parameters for generalizing patterns and / or for refining search results.

4. The method for searching the text for matches with patterns according to claim 2, wherein the patterns support parameters for generalizing patterns and / or for refining search results.