RU2630427C2 - Method and system for semantic processing of text documents

Info

Publication number: RU2630427C2
Application number: RU2016133365A
Authority: RU
Inventors: Дмитрий Владимирович Мительков; Андрей Юрьевич Новиков; Борис Борисович Сатин
Original assignee: Дмитрий Владимирович Мительков
Priority date: 2016-08-12
Filing date: 2016-08-12
Publication date: 2017-09-07
Also published as: RU2016133365A

Abstract

FIELD: physics.

SUBSTANCE: the method of semantic processing of text documents supplements the meta-information of each natural-language text document stored in a database with a semantic feature - a discourse graph. Discourse graphs are obtained for the user's natural-language request and for the text document. Each text document is evaluated relative to the user request, taking the semantic features into account, and the user is provided with a ranked array of text documents.

EFFECT: increasing the completeness and accuracy of processing text documents.

16 cl, 4 dwg, 3 tbl

Description

FIELD OF THE INVENTION

The present invention relates to automated information processing systems, and in particular to systems that assess the relevance of text documents to a user request using semantic features of the text and rank an array of text documents by value.

BACKGROUND

In the current age of information technology, the amount of information doubles every 1.5-2 years. For this reason, searching for relevant information in a specific subject area is a rather difficult and pressing task.

The task of searching for and ranking relevant text documents on the Internet is successfully solved by search engines such as YANDEX, GOOGLE and YAHOO!. In these systems the main emphasis is on the speed with which search results are presented, while search accuracy is achieved by using keywords and the frequency of occurrence of these words in the array of text documents as the criterion (the tf-idf model). The disadvantage of this approach is that it ignores the semantic content of a text document when evaluating it, which leads to low accuracy of text document processing. Search engines partially mitigate this drawback, for example by forming a personalized ranking model on an electronic device associated with the user [patent RU 2580516]. This method is not suitable for working with an array of electronic documents stored in an isolated database, because the personalized ranking model is built from characteristic properties of web resources associated with the user. Another method [patent RU 2383922] makes it possible to obtain greater thematic diversity among the highest-ranked documents, provided that these documents are thematically rich. This method reduces the speed of text document processing while only slightly increasing its accuracy, because thematically rich documents do not allow the user to immediately locate the information of interest within the document.

The closest analogue of the claimed group of inventions is the method of automated semantic indexing of natural-language text disclosed in Russian Federation patent No. 2518946 (published 10 June 2014). In this method, the text in digital form is segmented into elementary units of the first level, which include at least words; the text in digital form is segmented into sentences according to graphematic rules; on the basis of morphological analysis, an elementary unit of the second level, comprising a normalized word form (hereinafter, a lemma), is formed for each first-level elementary unit that is a word; the frequency of occurrence of each first-level elementary unit is counted for two or more adjacent first-level units in the text, and sequences of words that follow one another in the text are combined into elementary units of the third level, which are stable word combinations, if, for every two or more consecutive words in the text, the differences between the counted frequencies of occurrence of these words at the first appearance of the word sequence and at several subsequent appearances remain unchanged for each pair of words in the sequence; a semantically significant object and its attribute, which are units of the fourth level, are identified in each of the formed sentences by multi-stage semantic-syntactic analysis that consults linguistic and heuristic rules stored beforehand in a database for a predefined linguistic environment; each semantically significant object and attribute is stored in memory; by the same multi-stage semantic-syntactic analysis, semantically significant relations are identified in each of the formed sentences between the identified fourth-level units, that is, between semantically significant objects and between semantically significant objects and attributes; each semantically significant relation is assigned the corresponding type from a subject ontology, stored in the database, for the subject area to which the indexed text belongs; each semantically significant relation is stored in memory together with its assigned type; the frequencies of occurrence of the fourth-level elementary units across the whole text are determined; within the text, for each identified semantically significant relation that links either the corresponding semantically significant objects or a semantically significant object and its attribute, a set of triads is formed, which are elementary units of the fifth level; on the set of formed triads, all semantically significant objects linked by semantically significant relations are indexed separately with their frequencies of occurrence, as are all attributes with their frequencies of occurrence and all formed triads; the formed elementary units of the second, third, fourth and fifth levels are saved in the database with their frequencies of occurrence, together with the resulting indices and references to the specific sentences of the text; the formed elementary units of the second and third levels are ranked by semantic weight by comparing their semantic weight with a predetermined threshold value; and triads in which the second- and third-level elementary units have a semantic weight below the threshold are removed.

The disadvantages of this method are:

1) the formed triads cannot express the semantic content of the document because they do not contain actions, which makes it impossible to search accurately for facts relevant to the user's information needs. As is well known, the semantics (meaning) of a particular concept is revealed only when the concept is used for a specific purpose; for example, the concept "newspaper" can be used as a source of information and as a means of swatting flies;

2) semantically significant relations are identified only between semantically significant objects and between semantically significant objects and attributes, disregarding the semantic relations between individual meaning-bearing text elements, the smallest of which is the clause. Without taking these semantic relations into account, it is impossible to capture the meaning of the document as a whole;

3) text documents are ranked by counting the frequency of occurrence of semantically significant objects and attributes, whose significance is determined from the semantic structure of the sentence containing them, while the evaluation of the text document relative to the user request is not taken into account at all.

SUMMARY OF THE INVENTION

The problem addressed by the present invention is to make the user more efficient when searching for relevant information by increasing the completeness and accuracy of text document processing. This is achieved by using semantic features of the text, in addition to keywords, as criteria for selecting text documents, while processing speed is increased by integrating the proposed system into a keyword-based text document processing system.

This problem is solved by a method and a system for semantic processing of text documents.

The method of semantic processing of text documents is as follows:

text documents in natural language, when added to the database, are passed one at a time to the module for forming the discourse graph of text documents, in which the text is segmented into elementary meaning-bearing discourse units - clauses; within each clause, a functional role structure (FRS) is formed on the basis of morphological and syntactic analysis, consisting of an action, a subject of the action and an object of the action - the players of the clause FRS - and their attributes;

referential coherence between the players of the clauses is identified, and as a result the coreferent with the greatest degree of generality is replaced by a coreferent with a more specific meaning; rhetorical relations between clauses and larger text elements are determined, yielding the discourse graph of the text document, which is stored in the metadata of the text document in the database and by which the text document is indexed;

a discourse graph is obtained in the same way for the user's natural-language request;

each text document is evaluated relative to the user request;

the text documents are ranked by value;

documents below an empirically determined threshold are cut off.
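The steps above can be sketched in Python as a minimal illustration. The class and function names (`FRS`, `DiscourseGraph`, `resolve_coreferents`, `rank_and_cut`), the relation label and the threshold value 0.5 are assumptions made for this sketch and do not come from the patent. Clause segmentation, parsing and the actual scoring are left out, and the coreference mapping is taken as given.

```python
from dataclasses import dataclass, field, replace
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class FRS:
    """Functional role structure of one clause (field names are illustrative)."""
    action: str
    subject: Optional[str] = None
    obj: Optional[str] = None
    attributes: Tuple[str, ...] = ()


@dataclass
class DiscourseGraph:
    """Clause FRSs as nodes, rhetorical relations as labelled edges."""
    nodes: List[FRS] = field(default_factory=list)
    # Each edge is (source node index, target node index, relation label).
    relations: List[Tuple[int, int, str]] = field(default_factory=list)


def resolve_coreferents(graph: DiscourseGraph, specific: Dict[str, str]) -> DiscourseGraph:
    """Replace general players (e.g. pronouns) with more specific coreferents.

    `specific` maps a general mention to its specific coreferent; how that
    mapping is found is outside this sketch.
    """
    def swap(player: Optional[str]) -> Optional[str]:
        return specific.get(player, player) if player is not None else None

    nodes = [replace(n, subject=swap(n.subject), obj=swap(n.obj)) for n in graph.nodes]
    return DiscourseGraph(nodes=nodes, relations=list(graph.relations))


def rank_and_cut(scores: Dict[str, float], threshold: float) -> List[Tuple[str, float]]:
    """Rank documents by score, highest first, and drop those below the threshold."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [(doc_id, score) for doc_id, score in ranked if score >= threshold]


# Hypothetical two-clause document: the pronoun "it" refers back to "exit".
graph = DiscourseGraph(
    nodes=[
        FRS(action="decide", subject="UK", obj="exit"),
        FRS(action="cause", subject="it", obj="consequences"),
    ],
    relations=[(0, 1, "cause-effect")],
)
resolved = resolve_coreferents(graph, {"it": "exit"})

# Hypothetical per-document scores; 0.5 is an arbitrary placeholder threshold.
shortlist = rank_and_cut({"doc_a": 0.9, "doc_b": 0.3, "doc_c": 0.6}, threshold=0.5)
```

Keeping `FRS` immutable means coreference resolution returns a new graph, so the graph stored in the document metadata is left untouched.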

The method of the present invention has the following features:

when identifying referential coherence between the players of the clause FRSs, both anaphoric and cataphoric references may be taken into account, and the antecedent may be expressed either by a pronoun or by a semantically identical concept; when the user request is formed, the FRSs that represent facts known to the user and the FRSs that constitute the essence of the request are additionally marked;

when evaluating the correspondence between the FRSs of the text document and of the request, the corresponding objects, subjects, actions and attributes may be compared either by exact match or taking into account the semantic identity of concepts;

when evaluating the correspondence of rhetorical relations between FRSs in the text document and in the request, rhetorical relations between FRSs that represent facts known to the user are evaluated by exact match, whereas rhetorical relations between FRSs that represent facts known to the user and FRSs that constitute the essence of the request may be evaluated by exact match or using heuristic rules stored in the domain ontology;

text documents are evaluated relative to the user request on the basis of a multiplicative convolution of the following scores: the score for the correspondence between the FRSs of the text document and of the request, which takes into account the correspondence of FRS attributes and vertices; the score for the correspondence of rhetorical relations between FRSs in the text document and in the request; and the score for the distance from the intersection of the corresponding graphs to the core of the text document.
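The multiplicative convolution named in the last feature can be illustrated with a short Python sketch. It assumes that each of the three partial scores has already been normalized to the range [0, 1] and that a higher distance score means the graph intersection lies closer to the core of the document. The text states neither assumption, and the function and parameter names are hypothetical.

```python
from math import prod


def document_score(frs_match: float, rhetorical_match: float, core_proximity: float) -> float:
    """Combine three partial scores by multiplication.

    All three inputs are assumed (not stated in the source) to lie in [0, 1].
    """
    parts = (frs_match, rhetorical_match, core_proximity)
    for value in parts:
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"partial score out of range: {value}")
    return prod(parts)


# Hypothetical partial scores for two documents.
strong = document_score(0.8, 0.5, 1.0)
no_rhetorical_match = document_score(0.9, 0.0, 0.7)
```

With a product, a zero in any one factor zeroes the whole score, so a document that fails one criterion completely cannot be rescued by the other two. An additive combination would not have this property.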

The system for semantic processing of text documents includes a set of interconnected modules of a keyword-based text document processing system: a module for forming the user request in natural language, a module for forming the user request by keywords, a module for adding text documents, a module for indexing text documents, a database storing text documents and meta-information about them, a module for evaluating text documents relative to the user request by keywords, a module for ranking text documents by value based on keywords, a store of keyword search results, and a module for presenting the results of text document processing.

To implement the method described above, the system for semantic processing of text documents also includes a module for forming the discourse graph of a text document, a module for forming the discourse graph of the user request, a domain ontology module, a module for evaluating text documents relative to the user request based on semantic features, a module for ranking text documents by value based on semantic features, and a store of search results based on semantic features.

A distinguishing feature of the system for semantic processing of text documents is that it integrates a subsystem for processing text documents by keywords with a subsystem for processing text documents using semantic features, which makes it possible to achieve higher completeness and accuracy of text document processing while keeping processing speed at its previous level.

DESCRIPTION OF DRAWINGS

FIG. 1 shows a diagram of the interaction of the working modules of the system for semantic processing of text documents.

FIG. 2 shows a diagram of the formation of the discourse graph of a text document.

FIG. 3 shows a fragment of the network of concepts of the linguistic ontology and the relations between them.

FIG. 4 shows the algorithm for evaluating a text document relative to a user request.

DETAILED DESCRIPTION OF THE INVENTION

The system for semantic processing of text documents is integrated into a keyword-based text document processing system and consists of interconnected modules (FIG. 1). This integration is intended to offset the loss of speed that the information processing system incurs when the semantic processing method is applied. The modules that implement the semantic processing method are slower than the modules that implement the keyword mechanism, because the semantic method uses more complex features when evaluating text documents relative to the user request; these features are what make greater accuracy of text document processing achievable.

The modules of the system for semantic processing of text documents operate in the order described below.

The module for forming the user request in natural language 101 and the module for presenting the results of text document processing 118 are a single piece of application software and/or system software (or a combination thereof) installed on a personal computer (desktop, laptop, netbook, etc.) or a wireless electronic device (mobile phone, smartphone, tablet, etc.). Purely for illustrative purposes, it should be assumed that the module for forming the user request in natural language 101 and the module for presenting the results of text document processing 118 are implemented as client software installed on a laptop, such as a LENOVO™ THINKPAD™ X220 running the WINDOWS™ operating system. The task of the module for forming the user request in natural language 101 is to receive a natural-language request from the user and transmit it over communication channels to the module for forming the user request by keywords 102 and to the module for forming the discourse graph of the user request 112. The task of the module for presenting the results of text document processing 118 is to present the user with a ranked selection of relevant text documents.

The implementation of the communication channels is not limited and will depend on the form (stationary or mobile) of the module for forming the user request in natural language 101.

The request to search for text documents relevant to the user's tasks, handled by the module for forming the user request in natural language 101, is made in natural language (any language of the Indo-European family may be used, provided that a morphological and syntactic parser is available for it). The length of the request is not limited in any way. To obtain more accurate search results, the user may include in the request facts already known to the user and the actual question to be answered. For example: "After 43 years of EU membership, the UK decided to leave the European Union. What consequences might Brexit lead to?"

Modules 100, 102-116, 120 may be implemented as application software and/or system software (or a combination thereof). Modules 100, 102-116, 120 may be installed on a single server or distributed across and executed by several servers. The server is a standard computer server. In an example embodiment of the present technology, the server is a Dell™ PowerEdge™ server running the Microsoft™ Windows Server™ operating system.

After the natural-language user query generation module 101 passes on the user request, the request is processed in parallel in two ways: (i) by searching for relevant text documents using keywords and the frequency of occurrence of those words in the array of text documents as the search criterion (the tf-idf model), implemented by modules 102, 104, 105, 106; and (ii) by the method of semantic processing of text documents, implemented by modules 111, 112, 114, 115, 116.

The functionality of method (i) is largely known, but briefly it is implemented as follows. The keyword query generation module 102 receives the search query from the natural-language user query generation module 101, processes it (normalization of the search query, keyword extraction, etc.), and expands it with additional words (phrases). To search for relevant text documents, module 102 passes the normalized query to the module 104 for evaluating text documents against the user's keyword query; module 104 accesses the internal index (not illustrated) of the database 103, which stores the text documents and their meta-information, and evaluates each text document against the query keywords using one of the tf-idf family of search functions, for example "Okapi BM25". The module 105 for ranking text documents by value with respect to keywords ranks the text documents in descending order of relevance and discards those whose value falls below a threshold set by an expert, for example 0.55. The results are passed to the module 118 for presenting the results of text document processing, for display to the user, and to the module 114 for evaluating text documents against the user request using semantic features, for a more accurate evaluation by the method of semantic processing of text documents (ii).
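
The keyword branch described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names `bm25_scores` and `rank_with_cutoff`, the parameters `k1` and `b`, the smoothed form of the idf term, and the normalization of scores by the best score (so that a threshold such as 0.55 is meaningful) are all assumptions introduced here.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score each tokenized document against the query with Okapi BM25.

    k1 and b are conventional defaults, not values taken from the text.
    """
    if not docs:
        return []
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    terms = set(query_terms)
    # Document frequency of each query term.
    df = {t: sum(1 for d in docs if t in d) for t in terms}
    scores = []
    for doc in docs:
        tf = Counter(doc)
        score = 0.0
        for t in terms:
            if tf[t] == 0:
                continue
            # Smoothed idf variant that never goes negative (an assumption).
            idf = math.log((n_docs - df[t] + 0.5) / (df[t] + 0.5) + 1.0)
            norm = tf[t] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[t] * (k1 + 1) / norm
        scores.append(score)
    return scores


def rank_with_cutoff(query_terms, docs, threshold=0.55):
    """Return (doc_index, normalized_score) pairs in descending order,
    dropping documents below the threshold.

    Scores are divided by the best score so they lie in [0, 1];
    this normalization is an assumption of the sketch.
    """
    scores = bm25_scores(query_terms, docs)
    top = max(scores, default=0.0)
    if top == 0.0:
        return []
    ranked = sorted(((s / top, i) for i, s in enumerate(scores)), reverse=True)
    return [(i, round(v, 3)) for v, i in ranked if v >= threshold]


docs = [
    ["brexit", "eu", "uk"],
    ["football", "match"],
    ["eu", "summit", "budget", "talks"],
]
print(rank_with_cutoff(["brexit", "eu"], docs))
```

Documents that survive the cutoff would then be handed both to the presentation module and to the semantic evaluation branch.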

The keyword query generation module 102 expands the query with additional words (phrases), which ultimately increases the completeness of text document processing. Query expansion is performed as follows:

1. By using a linguistic ontology [Lukashevich N.V. Thesauri in information retrieval tasks. - Moscow: Moscow University Press, 2011. - 396 p.]: the query can be supplemented with the text entries of the concepts that occur in the query. For example, the text entry (a word or phrase of the query) ЕВРОСОЮЗ corresponds to the concept ЕВРОПЕЙСКИЙ СОЮЗ (EUROPEAN UNION), which is also referenced by the following text entries: ЕВРОПЕЙСКИЙ СОЮЗ (European Union), ЕВРОПЕЙСКОЕ СООБЩЕСТВО (European Community), ЕВРОПЕЙСКОЕ ЭКОНОМИЧЕСКОЕ СООБЩЕСТВО (European Economic Community), ЕДИНАЯ ЕВРОПА (single Europe), ЕС (EU), ЕЭС (EEC), ОБЪЕДИНЕННАЯ ЕВРОПА (united Europe). These entries can be used to expand the user request.

2. By using a distributed representation of words, for example the Word2Vec tool [Mikolov T., Sutskever I., Chen K., Corrado G., and Dean J. 2013b. Distributed representations of words and phrases and their compositionality. In NIPS, pp. 3111-3119]: the query can be supplemented with the words closest to each query word in a distributed vector representation of words (the minimum proximity coefficient is set empirically by the user, for example, greater than 0.58) produced by training the Skip-Gram algorithm on a large text corpus (more than 10⁹ words). For example, for the words ВОЕННОСЛУЖАЩИЙ (SERVICEMAN) and ОФИЦЕР (OFFICER), the list of query-expanding words is given in Table 2 (only words with a proximity coefficient greater than 0.58).

The linguistic ontology and the distributed word representation vectors, obtained for example with the Word2Vec tool, are stored in the domain ontology module 120.
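
The two expansion sources can be combined roughly as in the sketch below. The function `expand_query`, the dictionary layouts, and the neighbour scores are hypothetical; in particular the similarity values are invented toy numbers and are not taken from Table 2.

```python
def expand_query(terms, thesaurus, neighbours, min_similarity=0.58):
    """Expand query terms with thesaurus entries and embedding neighbours.

    thesaurus:  term -> other text entries of the same concept
    neighbours: term -> list of (word, similarity) pairs, e.g. from a
                Skip-Gram model (toy values here)
    Only neighbours with similarity above min_similarity are kept.
    """
    expanded = list(terms)
    seen = set(terms)
    for term in terms:
        candidates = list(thesaurus.get(term, []))
        candidates += [
            word for word, sim in neighbours.get(term, []) if sim > min_similarity
        ]
        for candidate in candidates:
            if candidate not in seen:
                seen.add(candidate)
                expanded.append(candidate)
    return expanded


# Toy data: thesaurus entries follow the example in the text,
# similarity scores are made up for illustration.
thesaurus = {"ЕВРОСОЮЗ": ["ЕВРОПЕЙСКИЙ СОЮЗ", "ЕС", "ЕЭС"]}
neighbours = {"ВОЕННОСЛУЖАЩИЙ": [("ОФИЦЕР", 0.60), ("ГРАЖДАНИН", 0.40)]}

print(expand_query(["ЕВРОСОЮЗ", "ВОЕННОСЛУЖАЩИЙ"], thesaurus, neighbours))
```

Keeping the original terms first and appending expansions in a stable order makes it easy to weight original and added terms differently later, should that be desired.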

The method of semantic processing of text documents (ii) comprises three stages:

1. Forming the discourse graph of the text document (user request) (Fig. 2).

2. Evaluating the text documents against the user request using semantic features (Fig. 4).

3. Ranking the evaluation results.

The first stage, forming the discourse graph of the text document (user request), is intended to extract semantic features from the text document (user request) so that the relevance of the text document to the user request can then be evaluated on those features. The discourse graph is formed by the same scheme for a text document and for a user request (Fig. 2), so only the scheme for a text document is considered below. It comprises the following steps:

1. Input the text document.

It should be clear to those skilled in the art that the operations of this and subsequent steps are carried out with intermediate results being stored, for example, in random access memory (RAM).

2. Analyze the content of the text document morphologically and syntactically 202.

This step is performed with syntactic parsers capable of parsing a text document in the relevant language, for example ABBYY Compreno, Stanford CoreNLP toolkit, AOT, or Link Parser. The result of parsing is a so-called "syntactic structure": a set of tokens with links between them, where a token is a normalized word (lemma) together with its morphological characteristics.

3. Segment the text into clauses 204 (minimal meaning-bearing text elements) on the basis of the syntactic analysis.

A clause is also an elementary discourse unit (EDU). The accuracy of segmenting text into clauses is currently about 95%; segmentation can be performed, for example, with the method described in [Vanessa Wei Feng and Graeme Hirst. 2012. Text-level discourse parsing with rich linguistic features. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers-Volume 1, pages 60-68. Association for Computational Linguistics].

4. Apply heuristic rules to form the functional role structures (FRS) of the clauses 206.

A functional role structure is a simplified, unified representation of some possible event in a particular subject area [Novikov A.Yu. Conceptual foundations of automating discourse analysis of text on the basis of semantics // Scientific and technical journal "Naukoemkie tekhnologii", No. 9. - Moscow: Radiotekhnika, 2010. - pp. 77-83]. Such a structure represents all the main participants (object, subject), the process itself (action), and the place of action, which are the vertices of the FRS, as well as their attributive properties. When the FRS vertices are formed, a transition is made from words (nominal structures) to concepts by means of a linguistic ontology (for example, [Lukashevich N.V., Dobrov B.V. Designing linguistic ontologies for information systems in broad subject areas // Ontology of Designing, Vol. 5, No. 1. - 2015. - pp. 47-69]), in which each text entry corresponds to a concept with a specific identifier (id). For example, parsing the user request from the example above yields two clauses and the following FRS (the FRS structure is described in XML):

The values of the semantic relations are given in Table 1.

The semantic role of each word is assigned by heuristic rules obtained by statistically matching syntactic chains (produced, for example, by Link Parser) to semantic relations [Stolyarov M.G., Novikov A.Yu. Conceptual foundations of a unified approach to automatic processing of multilingual texts using the semantics of the subject area // Scientific and technical journal "Naukoemkie tekhnologii", No. 12. - Moscow: Radiotekhnika, 2009. - pp. 50-56]. This matching is performed by an expert. For example, the link sequence "Dmc-Sp-PP" can be interpreted as the relation "Action - Subject of action".

The FRS in the user request are additionally divided into FRS representing facts known to the user and FRS constituting the essence of the request, and are marked accordingly. This is done with a simple heuristic: for example, interrogative clauses constitute the essence of the request, and declarative clauses are facts known to the user. The type of each interrogative FRS constituting the essence of the request is also determined: li-question (clarifying questions, aimed at establishing the truth of the judgments they express; all such questions contain the Russian particle "ли" ("whether"), as in the phrases "верно ли" ("is it true that"), "действительно ли" ("is it really so that"), "надо ли" ("is it necessary to"), etc.); k-question (complementing questions, intended to reveal new properties of the FRS participant under study and to obtain new information; the grammatical marker is an interrogative word such as "Who?", "What?", "When?", "Where?", etc.); n-question (problem questions: FRS containing a question about some circumstance that is unclear to the user; such sentences are formed with the adverbs "зачем?" ("what for?"), "отчего?" ("for what reason?"), "почему?" ("why?"), etc.).

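The marking heuristic for request clauses might look like the sketch below. It is only an illustration under stated assumptions: the function `classify_clause`, the label strings, and the word lists are hypothetical, and the forms of "какой" ("which") are added here so that the Brexit example is covered; the text itself lists only a few question words followed by "etc."

```python
import re

# Illustrative, incomplete word lists (assumptions of this sketch).
LI_PARTICLE = "ли"
N_WORDS = {"зачем", "отчего", "почему"}
K_WORDS = {"кто", "что", "когда", "где", "какой", "каким", "какие"}


def classify_clause(clause):
    """Label a request clause as a known fact or as one of the question types."""
    text = clause.strip()
    if not text.endswith("?"):
        return "fact"
    tokens = set(re.findall(r"\w+", text.lower()))
    # The particle is checked first because "что" may also act as a conjunction.
    if LI_PARTICLE in tokens:
        return "li-question"
    if N_WORDS & tokens:
        return "n-question"
    if K_WORDS & tokens:
        return "k-question"
    return "unclassified question"


request = [
    "После 43 лет членства в ЕС Великобритания решила выйти из Евросоюза.",
    "К каким последствиям может привести брекзит?",
]
for clause in request:
    print(classify_clause(clause), "|", clause)
```

A production system would work on the parsed FRS rather than on raw strings; the string-based check is used here only to keep the example self-contained.
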
5. Identify pronominal coreference 208. Pronominal coreference reduces the completeness of text document processing, because pronominal substitutes are used in the text alongside the names of concepts. Pronominal coreference is identified at the syntactic level using only morphological and syntactic analysis. Consider, for example, a paraphrase of the well-known sentence of Academician L.V. Shcherba, which is built from nonsense words with regular Russian grammar: "[Глокая куздра] штеко будланула бокра, а после [она] закурдячила бокренка." This example shows that even when the referents of most words in a sentence are not evident, coreference between the words of the sentence can still be determined from morphology and syntax alone, without any knowledge of the referents. The task of identifying pronominal coreference can be solved using, for example, the Stanford CoreNLP toolkit.

6. Identify coreference between semantically identical concepts 210. This step is needed because the author of a text document, instead of words used earlier (antecedents), uses substitutes for them (anaphors, cataphors): synonyms, meronyms, holonyms, identical words, etc. A distinctive feature of the method of semantic processing of text documents is that it can identify referential connectedness between semantically identical concepts by applying algorithms that use a linguistic ontology (LO), for example the LO of [Lukashevich N.V., Dobrov B.V. Designing linguistic ontologies for information systems in broad subject areas // Ontology of Designing, Vol. 5, No. 1. - 2015. - pp. 47-69], and a distributed representation of words, for example the Word2Vec tool [Mikolov T., Sutskever I., Chen K., Corrado G., and Dean J. 2013b. Distributed representations of words and phrases and their compositionality. In NIPS, pp. 3111-3119].

The system of relations of the LO is a set of relations: higher-lower (for example, motorized rifle troops - tank troops) and part-whole (for example, tank crewman - tank troops). Each relation has its own set of inference axioms. The properties of transitivity and inheritance are used as axioms in the LO.

The semantic proximity between two concepts c_i and c_j is evaluated by considering the path of relations that exists between them in the LO.

Paths of different configurations may exist between concepts in the LO; the LO is connected, so there is always a path of relations from any one concept of the LO to any other. However, [Lukashevich N.V., Dobrov B.V. Designing linguistic ontologies for information systems in broad subject areas // Ontology of Designing, Vol. 5, No. 1. - 2015. - pp. 47-69] justifies restricting the configurations of paths between concepts c₁ and c₂ that are considered when evaluating semantic proximity: either the path must consist of hierarchical relations all directed the same way (paths P_up and P_down), for example a sequence of relations from species to genus, or the path must contain exactly one inflection, i.e. one change of direction. Two kinds of inflection are considered: inflection from above, for example first several relations from species concepts to generic concepts and then several relations from generic concepts to species concepts, and inflection from below (paths P_updown and P_downup). The paths examined are restricted to these types because any hierarchical path P_up or P_down between concepts can be reduced, by the rules of transitivity and inheritance, to a path one relation long, and the paths with an inflection, P_updown and P_downup, to a path two relations long. This establishes the potential proximity of concepts connected by paths P_up, P_down, P_updown and P_downup.

In [Lukashevich N.V., Dobrov B.V. Designing linguistic ontologies for information systems in broad subject areas // Ontology of Designing, Vol. 5, No. 1. - 2015. - pp. 47-69] these path types are used in an algorithm for resolving the ambiguity of concepts, i.e. for text entries t_i and t_j such that t_i = t_j, the corresponding concepts c_i and c_j are determined. That task is the inverse of the one solved at the present step, which is to identify coreference between noun phrases or text entries. For text entries t_i and t_j such that t_i ≠ t_j, it is necessary to correctly determine the concept c_k that is the common referent of t_i and t_j, i.e. t_i and t_j must be text entries of the same concept or text entries of concepts belonging to the same semantic field. Whether concepts belong to the same semantic field is determined as follows: one of the concepts, c_i, is taken as the starting point of the semantic field, and an attempt is made, by successively traversing the LO along paths P_up, P_down, P_updown and P_downup, to reach the end point, which is the concept c_j. If the end point is reached, the concepts c_i and c_j are considered to belong to the same semantic field and, consequently, the text entries t_i and t_j are coreferent. Since in the LO there is always a path from any one concept to any other, the decision on the coreference of two concepts is made depending on the semantic proximity metric m_ij, calculated as the length of the path between c_i and c_j. The threshold value of m_ij at which a decision on the coreference of concepts c_i and c_j is made is set empirically by the user (for example, m_ij < 5). For example, in the passage "Almost the entire three-act "Don Quixote" was performed to the applause of the audience. The ballet, first staged back in 1860 at the Bolshoi Theatre by the great Marius Petipa and having gone through many versions since then, has still not lost its youth. The theatrical production was brought to London in its newest version…", the coreferents are DON QUIXOTE, BALLET and THEATRICAL PRODUCTION. The path between the concepts BALLET and THEATRICAL PRODUCTION has one inflection from above (Fig. 3), and the semantic proximity metric m_ij equals two; hence there is a referential link between these concepts.
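
The restricted path search can be sketched as follows, assuming the ontology is given as a mapping from each concept to its directly higher concepts. The names `shortest_allowed_path` and `are_coreferent`, the toy ontology, and the intermediate concept STAGE WORK are hypothetical (Fig. 3 is not reproduced here); the sketch considers only the four path types named in the text and treats the path length as the metric m_ij.

```python
from collections import deque


def _distances(start, edges):
    """Breadth-first distances from start along one direction of the hierarchy."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in edges.get(node, ()):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


def shortest_allowed_path(c_i, c_j, parents):
    """Return (length, path_type) of the shortest P_up, P_down, P_updown or
    P_downup path between two concepts, or None if no such path exists."""
    children = {}
    for child, higher in parents.items():
        for parent in higher:
            children.setdefault(parent, []).append(child)
    up_i, up_j = _distances(c_i, parents), _distances(c_j, parents)
    down_i, down_j = _distances(c_i, children), _distances(c_j, children)
    candidates = []
    if c_j in up_i:
        candidates.append((up_i[c_j], "P_up"))
    if c_j in down_i:
        candidates.append((down_i[c_j], "P_down"))
    # One inflection from above: climb to a shared higher concept, then descend.
    for top in (set(up_i) & set(up_j)) - {c_i, c_j}:
        candidates.append((up_i[top] + up_j[top], "P_updown"))
    # One inflection from below: descend to a shared lower concept, then climb.
    for bottom in (set(down_i) & set(down_j)) - {c_i, c_j}:
        candidates.append((down_i[bottom] + down_j[bottom], "P_downup"))
    return min(candidates) if candidates else None


def are_coreferent(c_i, c_j, parents, max_length=5):
    """Decide coreference by comparing the path length m_ij with a threshold."""
    path = shortest_allowed_path(c_i, c_j, parents)
    return path is not None and path[0] < max_length


# Toy ontology; STAGE WORK and ART are invented intermediate concepts.
parents = {
    "BALLET": ["STAGE WORK"],
    "THEATRICAL PRODUCTION": ["STAGE WORK"],
    "STAGE WORK": ["ART"],
}
print(shortest_allowed_path("BALLET", "THEATRICAL PRODUCTION", parents))
```

With this toy hierarchy the two concepts meet at a shared higher concept after one step each, which reproduces the path length of two mentioned for the ballet example.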

A drawback of using the LO is that populating it is labor-intensive. In addition, the vocabulary of a language is constantly growing through scientific discoveries of real-world objects, borrowings from foreign languages, slang expressions, etc., and the LO cannot respond to these changes flexibly and promptly. Therefore, when one or both concepts are absent from the LO, coreference between semantically identical concepts 210 is identified with an algorithm that uses a distributed representation of words, for example the Skip-Gram algorithm of the Word2Vec tool [Mikolov T., Sutskever I., Chen K., Corrado G., and Dean J. 2013b. Distributed representations of words and phrases and their compositionality. In NIPS, pp. 3111-3119]. The distributed word representation model is based on the hypothesis that the semantic proximity of words is reflected in their co-occurrence in similar contexts. Vector representations of words are formed by automatically extracting co-occurrence statistics from large text corpora (for example, Wikipedia articles). This information is recorded in so-called semantic or context vectors, whose similarity reflects the degree of semantic proximity of the words.

To compute the semantic proximity metric $m$ of the coreferent candidates $c_i$ and $c_j$, the Skip-Gram algorithm of the Word2vec tool is used. For example, in the text "A formation was called at night, but in the dark we could not make out which serviceman was lining us up. In the morning we learned that it was an officer of our future company", the text entries "serviceman" and "officer" have the same referent. The Word2vec results for these words are shown in Table 2. The semantic proximity metric $m$ in this case is

$m = (m_{ij} + m_{ji})/2 = (0.58981 + 0.66234)/2 = 0.626075.$

The semantic proximity metric $m_{ij}$ of two coreferent candidates, where $m_{min} < m_{ij} \le 1$, is the numerical criterion for deciding whether the candidates are coreferent. The decision threshold $m_{min}$ is chosen empirically.
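The averaging and threshold test described above can be sketched in a few lines of Python. The two directed scores are the values quoted in the text; the function names and the threshold value 0.5 are assumptions made only for illustration, since the text says the threshold is chosen empirically.

```python
def semantic_proximity(m_ij: float, m_ji: float) -> float:
    """Symmetric proximity: the mean of the two directed similarity scores."""
    return (m_ij + m_ji) / 2


def is_coreferent(m: float, m_min: float) -> bool:
    """Accept coreference when m_min < m <= 1."""
    return m_min < m <= 1.0


# Directed scores quoted in the text for "serviceman" and "officer".
m = semantic_proximity(0.58981, 0.66234)

# 0.5 is a hypothetical threshold, not a value taken from the text.
decision = is_coreferent(m, m_min=0.5)
print(round(m, 6), decision)
```

With a stricter hypothetical threshold such as 0.7 the same pair would be rejected, which is why the threshold has to be tuned on real data.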

7. Replace the coreferent with the greatest degree of generality by the coreferent with the more specific meaning 212. This step is needed to determine more accurately the rhetorical relations between clauses and larger text elements 214. After this step, the text given in step 6 takes the following form: "A formation was called at night, but in the dark we could not make out which officer was lining us up. In the morning we learned that it was an officer of our future company." The word "officer" has the more specific meaning, both in the taxonomy of concepts and in view of its further qualification: "officer of our future company".
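A minimal sketch of this replacement step is given below. It assumes that each coreferent has a known taxonomy depth, where a larger number means a more specific concept; the `depth` mapping, the function name and the shortened English paraphrase of the example sentence are all hypothetical, and inflection is ignored.

```python
import re


def replace_general_coreferent(text: str, coreferents: list[str],
                               depth: dict[str, int]) -> str:
    """Replace every coreferent in the group by the most specific one.

    `depth` is a hypothetical taxonomy depth: larger means more specific.
    """
    specific = max(coreferents, key=lambda word: depth[word])
    for word in coreferents:
        if word != specific:
            # Whole-word replacement only; inflected forms are not handled.
            text = re.sub(rf"\b{re.escape(word)}\b", specific, text)
    return text


# Shortened paraphrase of the example sentence from the text.
text = ("In the dark we could not make out which serviceman it was. "
        "In the morning we learned that it was an officer of our future company.")
result = replace_general_coreferent(
    text, ["serviceman", "officer"], {"serviceman": 1, "officer": 2})
print(result)
```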

8. Determine the rhetorical relations between clauses and larger text elements 214. A list of rhetorical relations between text fragments was proposed within Rhetorical Structure Theory [Mann, W.C., Thompson, S.A. Rhetorical structure theory and text analysis. - Amsterdam: Benjamins, 1992. - 66 p.]; the relations used are listed in Table 3. The rhetorical relations are determined using, for example, the method proposed in [Li Jiwei, Li Rumeng, Hovy Eduard. Recursive Deep Models for Discourse Parsing // Proceedings of the 2014 Conference on Empirical Methods in EMNLP, 2014, pages 2061-2069], which trains a recursive neural network on a text corpus annotated with rhetorical relations.

9. Link the functional-role structures (FRS) with rhetorical relations to obtain the discourse graph 216. The FRS are linked by the rhetorical relations determined in step 8. The result is the discourse graph of the text document, whose vertices are the FRS of clauses and their bundles and whose edges are defined by the rhetorical relations between them. An example description of a discourse graph is given above as an XML structure; the rhetorical relations between FRS and their types are contained under the <RhetorRel> tag. Formally, the discourse graph of a text document is represented by the following expressions:

- the set of n FRS bundles in the text document;

$M^{rhe}$ - the matrix of rhetorical relations between the FRS bundles of the text document,

where

$rhe_r$ is the r-th rhetorical relation between the FRS bundles;

where
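One way to hold such a graph in memory is sketched below: the FRS bundles are stored as vertices and the rhetorical relations as a sparse matrix keyed by vertex pairs. The class names, field names and the role and relation labels used in the usage lines are assumptions for illustration and are not taken from the XML structure mentioned in the text.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class FRS:
    """Functional-role structure of one clause."""
    frs_id: int
    players: dict[str, int]  # role name -> concept id (hypothetical layout)


@dataclass
class DiscourseGraph:
    """FRS as vertices, rhetorical relations as a sparse relation matrix."""
    nodes: dict[int, FRS] = field(default_factory=dict)
    relations: dict[tuple[int, int], str] = field(default_factory=dict)

    def add_node(self, frs: FRS) -> None:
        self.nodes[frs.frs_id] = frs

    def relate(self, nucleus: int, satellite: int, relation: str) -> None:
        if nucleus not in self.nodes or satellite not in self.nodes:
            raise KeyError("both FRS must be added before relating them")
        self.relations[(nucleus, satellite)] = relation

    def matrix(self) -> list[list[Optional[str]]]:
        """Dense n x n view of the relation matrix, ordered by FRS id."""
        ids = sorted(self.nodes)
        return [[self.relations.get((i, j)) for j in ids] for i in ids]


graph = DiscourseGraph()
graph.add_node(FRS(1, {"Agent": 10}))
graph.add_node(FRS(2, {"Agent": 10, "Place of action": 20}))
graph.relate(1, 2, "Evaluation")
print(graph.matrix())
```

A sparse dictionary is used here because most pairs of FRS in a document are not directly related; the dense matrix is built only on request.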

10. Append the generated discourse graph of the text document to the meta-information of the text document 218. At this step the generated discourse graph is copied from RAM to the database 103, which resides in permanent storage (for example, on a hard disk drive), and is added to the meta-information of the text document being processed.

The discourse graph of a text document is generated by the text document discourse graph generation module 111 when a new text document is added to the database 103. The document is entered through the text document replenishment module 100, whose task is to bring the document to a single unified form (stripping tags, converting to a single encoding, and so on). The text document indexing module 113 indexes the text documents received from the discourse graph generation module 111 by FRS vertices, taking the detected coreference into account and using the tf.idf model.
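The indexing step can be illustrated with a small tf.idf computation over concept ids instead of raw words. This is only a sketch: it assumes that each document has already been reduced to the list of concept ids of its FRS vertices, with coreferent mentions mapped to the same id, and it uses the plain tf.idf weighting (relative term frequency times the logarithm of inverse document frequency), which may differ from the variant the authors used.

```python
import math
from collections import Counter


def tf_idf_index(docs: dict[str, list[int]]) -> dict[str, dict[int, float]]:
    """Build a tf.idf index keyed by document id and concept id.

    `docs` maps a document id to the concept ids of its FRS vertices.
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each concept id occurs.
    df = Counter(cid for ids in docs.values() for cid in set(ids))
    index: dict[str, dict[int, float]] = {}
    for doc_id, ids in docs.items():
        tf = Counter(ids)
        index[doc_id] = {
            cid: (count / len(ids)) * math.log(n_docs / df[cid])
            for cid, count in tf.items()
        }
    return index


# Hypothetical concept ids: 1 occurs twice in d1, 2 occurs in both documents.
index = tf_idf_index({"d1": [1, 1, 2], "d2": [2, 3]})
print(index["d1"])
```

Because coreferent mentions share one id, a document that refers to the same entity as "serviceman" and "officer" contributes to a single index entry rather than two.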

The discourse graph of the user request is passed to the module for evaluating text documents against the user request using semantic features 114. This module implements the second stage of the method of semantic processing of text documents, namely the evaluation of text documents against the user request using semantic features, which comprises the following steps (Fig. 4):

1. Input the discourse graphs of the text document and of the user request 402.

2. For each pair of FRS of the request and of the text document that intersect in the vertices of the clause FRS (the vertex ids coincide), the following scores are calculated:

1) Attribute match score 404:

where

$a^d$, $a^q$ are the attribute names of the FRS vertices of the text document and of the request, respectively;

$m_{ref}$ is the referential proximity metric of the attribute names, reflecting the probability that $a^d$ and $a^q$ belong to a common referent $a^r$;

$k_{min}$ is the decision threshold for the coreference of two text elements; it is chosen according to the required value of the F-measure (the combined value of recall and precision), which characterizes the effectiveness of text document selection.
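The attribute score can be sketched as follows. The exact formula is not reproduced in this text, so the rule below is an assumption built from the definitions: an exact name match scores 1, a referential proximity above the threshold scores the proximity itself, and anything else scores 0. The function name and the numeric values in the usage lines are hypothetical.

```python
def attribute_score(a_d: str, a_q: str, m_ref: float, k_min: float) -> float:
    """Match score of a document attribute name against a request attribute name.

    Assumed rule (not the original formula): exact match -> 1.0,
    m_ref above k_min -> m_ref, otherwise 0.0.
    """
    if a_d == a_q:
        return 1.0
    return m_ref if m_ref > k_min else 0.0


# Hypothetical values: proximity 0.63 against a threshold of 0.5.
print(attribute_score("officer", "officer", m_ref=0.0, k_min=0.5))
print(attribute_score("serviceman", "officer", m_ref=0.63, k_min=0.5))
print(attribute_score("serviceman", "building", m_ref=0.12, k_min=0.5))
```

Raising `k_min` makes the selection more precise at the cost of recall, which matches the statement that the threshold is tuned against a target F-measure.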

2) Vertex match score 406.

For FRS representing facts known to the user, the vertex match function has the form:

id is the identification number of the concept corresponding to the FRS vertex;

m is the number of FRS attributes.

For FRS that constitute the essence of the request and are expressed by a k-question, the knowledge specified by the question word must be supplied; the question word determines the missing FRS player. For example, the question word "Where?" requires that the document FRS contain a vertex of the type "Place of action". The correspondence between a question word and FRS vertices is determined by heuristic rules stored in the domain ontology 120. In this case expression (4) takes the following form: $Gm^k$

- the factor set of the vertex represented by the question word, corresponding to the set $Gm^k$ of players.

For example, the factor set of a vertex of the type "Place of action" is the set of all possible places for a particular FRS.
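The question-word rule can be sketched as a simple lookup. Only the pair "Where?" and "Place of action" comes from the text; the table name, the function name and the way the rules are stored are assumptions, since the text says only that the rules live in the domain ontology.

```python
# Only the "where" -> "Place of action" pair is taken from the text.
QUESTION_RULES = {"where": "Place of action"}


def question_vertex_matches(question_word: str, doc_vertex_types: set[str]) -> bool:
    """True when the document FRS has the vertex type the question word requires."""
    required = QUESTION_RULES.get(question_word.strip("?").lower())
    return required is not None and required in doc_vertex_types


print(question_vertex_matches("Where?", {"Agent", "Place of action"}))
print(question_vertex_matches("Where?", {"Agent"}))
```

A question word with no rule in the table is treated as unmatched in this sketch; a real system would need a rule for every supported question word.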

3) Total FRS match score 408:

k, q are the FRS of the text document and of the request, respectively.

The total FRS match score is the normalized match function of all its vertices.

3. Calculate the match score of the rhetorical relations between FRS 410. When the rhetorical relations between FRS in the text document and in the request are compared, the relations between FRS representing facts known to the user are evaluated by exact match, whereas the relations between FRS representing facts known to the user and FRS constituting the essence of the request may be evaluated both by exact match and with heuristic rules stored in the domain ontology 120, which supplement the rhetorical relations between these two kinds of FRS. The heuristic rules are determined by the question word of a li-question or an n-question. For example, among li-questions (yes/no questions), "Is it true?" is characterized by the rhetorical relation "Enablement" and "Can it?" by the rhetorical relation "Evaluation"; among n-questions, "Why?" is characterized by the rhetorical relation "(Non-)volitional cause". The new knowledge corresponding to the essence of the request must be a satellite of the FRS constituting the essence of the request.

The match function of the rhetorical relations between FRS has the form:

where

- the rhetorical relations between FRS;

$q^{NKN}$ is the new knowledge to be obtained;

$t_1$, $q_i$ are the FRS of the text document and of the request, respectively;

then the match function over all rhetorical relations of the FRS of the text document and of the request has the form:

n is the number of rhetorical relations in the discourse graph of the text document.
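The matching rule for a single rhetorical relation can be sketched as below. The three question-word rules are the examples given in the text; the function and table names are hypothetical, and returning 1.0 or 0.0 is an assumption because the original formula is not reproduced here.

```python
from typing import Optional

# Heuristic rules quoted in the text as examples.
HEURISTIC_RELATIONS = {
    "is it true": "Enablement",
    "can it": "Evaluation",
    "why": "(Non-)volitional cause",
}


def relation_score(doc_rel: Optional[str], query_rel: Optional[str],
                   question_word: Optional[str] = None) -> float:
    """Score one rhetorical relation of the document against the request.

    Known-fact pairs pass question_word=None and are matched exactly.
    Pairs involving the essence of the request may also match through
    the heuristic rule of the question word.
    """
    if doc_rel == query_rel:
        return 1.0
    if question_word is not None:
        expected = HEURISTIC_RELATIONS.get(question_word.strip("?").lower())
        if expected is not None and expected == doc_rel:
            return 1.0
    return 0.0


print(relation_score("Evaluation", "Evaluation"))
print(relation_score("(Non-)volitional cause", None, "Why?"))
print(relation_score("Evaluation", "Enablement"))
```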

4. Estimate the distance from the intersection of the graphs to the core of the discourse graph of the text document 412. This shows how close the found fragment is to the core (rheme) of the text document:

where

- the intersection of the discourse graphs of the text document and of the request;

- the core of the discourse graph of the text document (the root FRS bundle);

n is the distance from the intersection to the core.

The closer the found fragment is to the core of the text document, the more closely the rheme of the whole document corresponds to the rheme of the request.
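The distance n can be computed with a breadth-first search that starts from every vertex of the intersection at once. This is a sketch under stated assumptions: the graph is treated as undirected, the distance is counted in edges, and the function name and vertex numbering are hypothetical. How n is turned into a score is not reproduced in this text, so only the distance itself is computed.

```python
from collections import deque
from typing import Optional


def distance_to_core(edges: list[tuple[int, int]], intersection: set[int],
                     core: int) -> Optional[int]:
    """Fewest edges between any intersection vertex and the core vertex.

    Returns None when the core cannot be reached.
    """
    adjacency: dict[int, set[int]] = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    # Multi-source BFS: all intersection vertices start at distance 0.
    queue = deque((node, 0) for node in intersection)
    seen = set(intersection)
    while queue:
        node, dist = queue.popleft()
        if node == core:
            return dist
        for nxt in adjacency.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


# Vertex 0 is the core; the matched fragment is vertex 3.
edges = [(0, 1), (1, 2), (1, 3)]
print(distance_to_core(edges, {3}, core=0))
```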

5. Output the value of the match function of the text document to the user request 414 (the value of the text document), calculated as a multiplicative convolution of the partial scores:

where n is the number of FRS in the text document.

The module for evaluating text documents against the user request using semantic features 114 evaluates the text documents that the user has not yet viewed and that are held in the keyword search results store 106. Based on this evaluation, the documents are ranked by the module for ranking text documents by value using semantic features 115 (the third stage of the method of semantic processing of text documents). During ranking, module 115 may cut off text documents whose value is below a threshold set by the user. The ranking results are saved in the semantic search results store 116 and presented to the user through the module for presenting the results of processing text documents 118.
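The last two stages, combining the partial scores and ranking with a cut-off, can be sketched together. The value is taken here as the plain product of the partial scores, which is one reading of "multiplicative convolution"; any normalization in the original formula is not reproduced in this text, so that choice, the function names and the sample scores are assumptions.

```python
import math


def document_value(partial_scores: list[float]) -> float:
    """Product of the partial scores (assumed form of the convolution)."""
    return math.prod(partial_scores)


def rank_documents(scores: dict[str, list[float]],
                   threshold: float = 0.0) -> list[tuple[str, float]]:
    """Rank documents by value, dropping those below the user threshold."""
    valued = {doc: document_value(parts) for doc, parts in scores.items()}
    kept = [(doc, value) for doc, value in valued.items() if value >= threshold]
    return sorted(kept, key=lambda item: item[1], reverse=True)


# Hypothetical partial scores for three documents.
ranked = rank_documents(
    {"a": [0.9, 0.8], "b": [0.5, 0.5], "c": [1.0, 0.1]}, threshold=0.2)
print([doc for doc, _ in ranked])
```

A product penalizes a document heavily when any single partial score is low, so a document must do reasonably well on every criterion to rank high.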

Claims

1. A method of semantic processing of text documents, in which:

the contents of a text document and of a user request are analyzed morphologically and syntactically;

the text is segmented into clauses;

the functional-role structures of the clauses are formed;

referential connectivity between the players of the functional-role structures of the clauses is detected;

the coreferent with the greatest degree of generality is replaced by the coreferent with the more specific meaning;

the rhetorical relations between clauses and larger text elements are determined;

the functional-role structures of the clauses are linked by rhetorical relations to obtain the discourse graphs of the text document and of the user request;

the generated discourse graph of the text document is appended to the meta-information of the text document;

the text document is indexed by the players of the functional-role structures of the clauses;

for each functional-role structure of a clause of the user request, its match with the functional-role structures of the clauses of the text document is evaluated;

the match of the rhetorical relations between the functional-role structures of the clauses of the text document and of the request is evaluated;

the distance from the intersection of the discourse graphs of the user request and the text document to the core of the discourse graph of the text document is evaluated;

the value of the match function of the text document to the user request is determined;

the text documents are ranked by value using semantic features;

text documents whose value is below the threshold set by the user are cut off;

the results of processing the text documents are provided to the user.

2. The method according to claim 1, in which the functional-role structures of the clauses included in the user request are further divided into functional-role structures representing facts known to the user and functional-role structures constituting the essence of the request, and are marked accordingly.

3. The method according to claim 2, in which the question type of the functional-role structures of the clauses constituting the essence of the request is further determined.

4. The method according to claim 1, in which the referential connectivity between the players of the functional-role structures of the clauses is detected in such a way that both anaphoric and cataphoric references are taken into account, and the antecedent can be expressed as a pronoun or as a semantically identical concept.

5. The method according to claim 4, in which the referential connectivity between the players of the functional-role structures of the clauses is detected using, at least in part, a linguistic ontology.

6. The method according to claim 4, in which the referential connectivity between the players of the functional-role structures of the clauses is detected using, at least in part, a distributed representation of words.

7. The method according to claim 1, in which, when the match of each functional-role structure of a clause of the user request with the functional-role structures of the clauses of the text document is evaluated, the match of the attributes and of the vertices of the functional-role structures of the clauses is additionally evaluated.

8. The method according to claim 7, in which the match of the attributes of the functional-role structures of the clauses is evaluated both by exact coincidence of the attribute names and by taking into account the referential proximity metric of the attribute names.

9. The method according to claim 7, in which the match of the vertices of the functional-role structures of the clauses is evaluated both by exact coincidence of the identification numbers of the concepts defining the vertex and by taking into account the factor set of the vertex represented by the question word of the k-question, which is determined on the basis of heuristic rules stored in the domain ontology.

10. The method according to claim 1, in which, when the match of the rhetorical relations between the functional-role structures of the clauses in the text document and in the request is evaluated, the rhetorical relations between functional-role structures of clauses representing facts known to the user are evaluated by exact match, and the rhetorical relations between functional-role structures of clauses representing facts known to the user and functional-role structures of clauses constituting the essence of the request can be evaluated both by exact match and with heuristic rules stored in the domain ontology, which supplement the rhetorical relations between the functional-role structures of clauses representing facts known to the user and the functional-role structures of clauses constituting the essence of the request.

11. A system of semantic processing of text documents that implements the method according to claim 1 and includes:

a module for generating a user request in natural language;

a module for generating a user request by keywords;

a text document indexing module;

a database that stores text documents and meta-information about them;

a module for evaluating text documents against a user request by keywords;

a module for ranking text documents by value based on keywords;

a repository of keyword search results;

a module for presenting the results of processing text documents;

a module for generating a discourse graph of text documents;

a module for generating a discourse graph of a user request;

a domain ontology module;

a module for evaluating text documents against a user request using semantic features;

a module for ranking text documents by value using semantic features;

a repository of search results based on semantic features,

wherein the module for generating a user request in natural language is configured to pass the user request to the module for generating a user request by keywords and to the module for generating a discourse graph of a user request, and the module for ranking text documents by value based on keywords ranks the text documents by relevance and passes the results, through the repository of keyword search results, to the module for evaluating text documents against a user request using semantic features.

12. The system of claim 11, wherein the module for generating a user request by keywords expands the request with additional words (phrases) using, at least in part, a linguistic ontology.

13. The system of claim 11, wherein the module for generating a user request by keywords expands the request with additional words (phrases) using, at least in part, a distributed representation of words.

14. The system of claim 11, wherein the module for presenting the results of processing text documents presents to the user the results of keyword-based processing of text documents until the processing of text documents using semantic features is completed.

15. The system of claim 11, wherein the text documents that have not yet been viewed by the user and are stored in the repository of keyword search results are evaluated by the module for evaluating text documents against the user request.

16. The system of claim 11, wherein, during ranking, the module for ranking text documents by value using semantic features can cut off text documents whose value is below a threshold set by the user.