RU2618375C2

RU2618375C2 - Expanding of information search possibility

Info

Publication number: RU2618375C2
Application number: RU2015126477A
Authority: RU
Inventors: Татьяна Владимировна Даниэлян; Евгений Михайлович Инденбом
Original assignee: Общество с ограниченной ответственностью "Аби ИнфоПоиск"
Priority date: 2015-07-02
Filing date: 2015-07-02
Publication date: 2017-05-03
Also published as: RU2015126477A

Abstract

FIELD: information technology.

SUBSTANCE: in the method for organizing search in a corpus of electronic texts, semantic-syntactic analysis of the search query is performed, including construction of a ranked list of possible lexical meanings for the query words, where each lexical meaning is associated with a corresponding semantic class. A list of synonyms is compiled for the lexical meanings in the ranked list. The synonyms are ranked for each lexical meaning, and query variants are formed from the ranked synonyms. A score is calculated for how well each query variant matches the original search query. The corpus of electronic texts is searched for text fragments that satisfy the query variants. The search includes semantic-syntactic analysis of the found text fragments. A score is calculated for how well the lexical meanings of the words in the found fragments match the lexical meanings of the words in the original query variant. The found text fragments are ranked according to the calculated score.

EFFECT: increased effectiveness of information search, achieved by obtaining results of higher relevance at high speed.

20 cl, 14 dwg

Description

FIELD OF THE INVENTION

[0001] The present invention relates to information retrieval technologies. In particular, implementations of the invention relate to searching electronic content, for example on the Internet and in other electronic resources such as text corpora, dictionaries, glossaries and encyclopedias, and to methods of presenting search results.

BACKGROUND

[0002] Search technologies that perform a search based on keywords entered by the user as part of a search query are widely known.

[0003] However, because of the homonymy and homography present in natural languages, the results of a keyword-based search may include a significant amount of irrelevant or marginally relevant information. For example, a user searching for texts containing the word "page" in the sense of a court attendant (Russian "паж") will receive a great deal of irrelevant information in which "page" refers to web pages, pages of newspapers and magazines, pages of memory devices, and so on. This happens because those meanings are much more frequent than "page" in the lexical meaning "court attendant". Similarly, in Russian the keyword "стекло" ("glass") can return all texts containing the verb "течь" ("to flow") in any of its word forms.

[0004] Existing systems allow the use of simple query languages to find documents that contain, or do not contain, a word or words specified by the user. However, the user cannot specify whether these words must occur in the same sentence. Nor can the user formulate a single query for a whole set of words that belong to a certain class or share certain properties or characteristics. As a rule, these systems do not allow a query to be formulated as an ordinary natural-language question.

[0005] To pin down the intended meaning, the user often has to add extra words to the query. Moreover, sometimes the user cannot determine which meaning of a word is actually of interest, for example when looking for usage examples of an unfamiliar word in a foreign language. A large and unsystematized set of results allows the user to see all the variants of the meaning of the sought word or phrase.

[0006] Another problem is that the same information may be expressed, both in different documents and within a single document, using different words and expressions, including synonyms and paraphrases.

[0007] This invention develops the solutions set forth earlier in US Patent Applications No. 13/173,649 and No. 13/173,369, filed June 30, 2011, and No. 12/983,220, filed December 31, 2010, as well as US Patent Application No. 14/142,701, filed December 27, 2013. The invention also makes partial use of analysis technology patented in the United States (Patent No. 8,078,450).

SUMMARY OF THE INVENTION

[0008] The present invention is a method and a system for organizing information retrieval in corpora of electronic texts for a computer system and for displaying the search results in a user interface. The method consists in performing the following sequence of actions at least once: receiving a search query that includes one or more groups of words; and resolving homonymy, that is, for each query word either a single lexical meaning is selected unambiguously or a list of lexical meanings with corresponding weights is formed. A lexical meaning is the realization, in a particular language, of certain semantic meanings. To obtain the most complete information for a given query, each lexical meaning in the query can be "expanded" by adding a list of its synonyms. Synonyms may not be fully equivalent, however, so each synonym receives a score (weight), and each list is sorted in descending order of score. The search is then performed in such a way that not only the words or lexical meanings present in the query are requested, but also the synonyms from the resulting list. Each result found also receives a score that depends directly on the score (weight) of the synonym that produced it. The search results are ranked according to these scores.
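The expansion step in paragraph [0008] can be sketched roughly as follows. This is only an illustration under stated assumptions: the synonym table, its weights, and the choice to score a query variant as the product of its word weights are all hypothetical, since the text says only that a result's score depends on the synonym's weight.

```python
from itertools import product

# Hypothetical synonym table: lexical meaning -> [(synonym, weight)], weight in (0, 1].
SYNONYMS = {
    "car": [("automobile", 0.9), ("vehicle", 0.6)],
    "buy": [("purchase", 0.95)],
}

def expand(meaning):
    """Return the meaning itself (weight 1.0) plus its synonyms, best first."""
    options = [(meaning, 1.0)] + SYNONYMS.get(meaning, [])
    return sorted(options, key=lambda pair: pair[1], reverse=True)

def query_variants(meanings):
    """Build every query variant; score it as the product of its weights (an assumption)."""
    variants = []
    for combo in product(*(expand(m) for m in meanings)):
        words = tuple(word for word, _ in combo)
        score = 1.0
        for _, weight in combo:
            score *= weight
        variants.append((words, score))
    # Variants closest to the original query come first.
    return sorted(variants, key=lambda v: v[1], reverse=True)

variants = query_variants(["buy", "car"])
for words, score in variants:
    print(" ".join(words), round(score, 3))
```

The original query always scores highest, so results that match it verbatim would outrank results that match only through weaker synonyms.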

[0009] In addition, the method can be applied not only to individual words but also to groups of words. Such equivalent or partially equivalent expressions are referred to here as paraphrases. The method also includes searching the corpora of electronic texts for fragments that satisfy the query and displaying the search results to the user. In some implementations, the list of lexical meanings for the groups of words forming the query can be generated by querying the semantic hierarchy and filtered on the basis of semantic-syntactic analysis of the query, in order to exclude lexical meanings whose combinations are impossible.

[0010] In one implementation, a full-text search is performed, that is, a search over arbitrary indexed corpora, followed by analysis of the found fragments and filtering of the search results by the possible lexical meanings of the search query.

[0011] In other implementations, a semantic search for specific lexical meanings may be performed on text corpora that have been preprocessed by deep semantic-syntactic analysis and indexed.

[0012] Implementation of the present invention allows the user to search for and find the most complete and relevant information and to receive the search results ranked by relevance. If the query is formulated as a natural-language question, an analyzer is used to analyze the query, recognize its syntactic structure and build its semantic structure, so that the system "understands" the meaning of the query. As a result, the user can receive only relevant results. The technical result that the claimed invention is intended to achieve is an increase in the effectiveness of information retrieval by obtaining search results of higher relevance at high speed.

BRIEF DESCRIPTION OF THE DRAWINGS

[0013] FIG. 1 illustrates a general scheme of one implementation of the present invention.

[0014] FIG. 1A illustrates a general scheme of the method of deep analysis of a text corpus and construction of indexes according to one implementation of the present invention.

[0015] FIG. 2 illustrates the sequence of structures built while analyzing a sentence according to one or more implementations of the invention.

[0016] FIG. 3 illustrates an example of a syntactic tree obtained by precise syntactic analysis of a sentence.

[0017] FIG. 4 illustrates a diagram of the semantic structure obtained by analyzing a sentence.

[0018] FIG. 5 illustrates a fragment of the semantic hierarchy according to one or more implementations of the present invention.

[0019] FIG. 6 is a diagram illustrating linguistic descriptions 610 according to one possible implementation of the invention.

[0020] FIG. 7 is a diagram illustrating morphological descriptions according to one possible implementation of the invention.

[0021] FIG. 8 illustrates syntactic descriptions according to one possible implementation of the invention.

[0022] FIG. 9 illustrates semantic descriptions according to one possible implementation of the invention.

[0023] FIG. 10 is a diagram illustrating lexical descriptions according to one or more implementations of the present invention.

[0024] FIGS. 11A-B show examples of queries and the search results obtained.

[0025] FIG. 12 illustrates an example hardware diagram.

DESCRIPTION OF PREFERRED EMBODIMENTS

[0026] An implementation of the present invention discloses a method of extended information retrieval in natural-language texts and methods of displaying the search results.

[0027] Information retrieval methods are divided into full-text search and semantic search. Full-text search can be performed on arbitrary corpora that have an ordinary full-text (forward or inverted) index. Such a search requires no lengthy preprocessing, the index is fairly compact, and the resources available for it are practically unlimited. Well-known search engines such as Google, Yahoo and Yandex work this way. The drawback is the large amount of irrelevant information returned in some cases. Semantic search presupposes preprocessing of the text corpus being searched, usually with annotation, for example by parts of speech, entities, classes and so on. This makes the index harder to build, considerably increases its size and, as a consequence, reduces search speed. Its advantage, however, is high search precision and relevance of the results.
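The weakness of plain full-text search described in paragraph [0027] can be seen in a minimal inverted index. The sketch below is illustrative only: the tokenizer (lowercasing and whitespace splitting) and the sample documents are assumptions, not part of the described method.

```python
from collections import defaultdict

def build_inverted_index(documents):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index

# Hypothetical sample corpus with two senses of "page".
docs = {
    1: "the page served the king",
    2: "reload the web page",
    3: "the king left",
}
index = build_inverted_index(docs)

# A keyword lookup cannot tell the court attendant from the web page:
hits = sorted(index["page"])
print(hits)
```

Both documents 1 and 2 are returned for "page", even though only document 1 uses the word in the court-attendant sense.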

[0028] US Patent 8,078,450 describes a method that includes deep syntactic and semantic analysis of natural-language texts based on exhaustive linguistic models. The method uses a wide range of linguistic descriptions, both universal semantic mechanisms and language-specific ones, which makes it possible to reflect all the real complexities of a language without simplification or artificial restrictions and without risking an uncontrolled growth in complexity. This method is used both to resolve homonymy in the search query and to build the semantic index, and the linguistic descriptions created for it are used both to obtain a set of alternative formulations of the query and to assess the relevance of the results found.

[0029] With some modifications, the method is applicable to both full-text and semantic search, so the general scheme is described below, noting what must be done additionally for each type of search.

[0030] FIG. 1 illustrates a general scheme of method 100 for organizing information retrieval in text corpora according to one implementation of the present invention. The texts to be searched must first be indexed (not shown in FIG. 1), meaning that one or more indexes are built for each corpus or text. For full-text search this can be an ordinary forward or inverted index. For semantic search, the corpus undergoes deep semantic-syntactic analysis according to the method of US Patent 8,078,450, and the text parameters that matter for semantic search are indexed.

[0031] The sequence of operations performed at the preliminary stage, which includes semantic-syntactic analysis, to prepare texts for subsequent use in semantic search is illustrated in FIG. 1A. Deep semantic-syntactic analysis 190 includes lexical-morphological, syntactic and semantic analysis of each sentence of the text corpus, resulting in language-independent semantic structures in which each word of the text is mapped to a corresponding semantic class. This also amounts to disambiguation (resolving homonymy): for each word in each sentence it is now fixed in which lexical meaning the word is used in the given context.

[0032] Deep semantic-syntactic analysis 106 is performed on each sentence of each text corpus 105 using linguistic descriptions, both those of the source language and universal semantic descriptions. This makes it possible to analyze not only the surface syntactic structure but also the deep semantic structure that expresses the meaning of the statement contained in each sentence, as well as the relations between sentences or text fragments. The linguistic descriptions may include lexical descriptions 101, morphological descriptions 102, syntactic descriptions 103 and semantic descriptions 104. Analysis 106 includes syntactic analysis implemented as a two-stage algorithm (rough syntactic analysis followed by precise syntactic analysis) that uses linguistic models and information of various levels to compute probabilities and generate the most probable ("best") syntactic structure. FIG. 2 illustrates the sequence of structures built while analyzing a sentence according to one or more implementations of the invention. A language-independent semantic structure 252, which represents the meaning of the source sentence, is then built.

[0033] The source sentence, its syntactic structure, the language-independent semantic structure and other parameters extracted during analysis are then indexed 108. The result is a semantic index, which is a set of index collections 109. In the simplest implementation, an index can be represented as a table in which each value of a text feature in a document (for example, a word, expression or phrase; a relation between sentence elements; a morphological, lexical, syntactic or semantic property; or a syntactic or semantic structure) is mapped to a list of the addresses of its occurrences in that document. Morphological, syntactic, lexical and semantic features, as well as structures and fragments of structures, can be indexed in the same way that words in a document are indexed.
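The table-like index of paragraph [0033] can be sketched as a mapping from a feature value to a list of occurrence addresses. Everything concrete here is an assumption made for illustration: the address format (document, sentence, token position), the feature kinds, and the semantic class names are hypothetical, and the analyzed tokens are supplied by hand rather than produced by a real analyzer.

```python
from collections import defaultdict

def build_feature_index(analyzed_sentences):
    """Map (feature kind, value) to the addresses where that value occurs.

    analyzed_sentences: iterable of (doc_id, sentence_no, tokens), where each
    token is a dict of features already produced by some analyzer.
    """
    index = defaultdict(list)
    for doc_id, sentence_no, tokens in analyzed_sentences:
        for position, features in enumerate(tokens):
            address = (doc_id, sentence_no, position)
            for kind, value in features.items():
                index[(kind, value)].append(address)
    return index

# Hand-written analysis results; class names are hypothetical.
analyzed = [
    ("doc1", 0, [
        {"word": "page", "semantic_class": "COURT_ATTENDANT", "pos": "Noun"},
        {"word": "bowed", "semantic_class": "TO_BOW", "pos": "Verb"},
    ]),
    ("doc2", 0, [
        {"word": "page", "semantic_class": "WEB_PAGE", "pos": "Noun"},
    ]),
]
feature_index = build_feature_index(analyzed)

by_word = feature_index[("word", "page")]
by_class = feature_index[("semantic_class", "COURT_ATTENDANT")]
print(by_word, by_class)
```

Looking up the word "page" returns both occurrences, while looking up the semantic class returns only the occurrence with the intended meaning, which is what allows words and semantic features to be indexed in the same way.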

[0034] In one implementation of the present invention, the indexes may include all, or at least one, of the values of the morphological, syntactic, lexical and semantic features (parameters). These values or parameters are generated during the two-stage semantic analysis described in more detail below. The indexes can be used in many natural language processing tasks, in particular for organizing semantic search. According to one implementation of the present invention, the morphological, syntactic, lexical and semantic descriptions are structured and stored in a database. This set of descriptions may include at least a morphological model of the language, models of the syntactic constructions of the language, and lexical-semantic models. According to one implementation of the present invention, an integrated model describing both syntax and semantics is used to analyze complex language structures, recognize the meaning of a sentence and correctly convey the information it contains.

[0035] FIG. 2 illustrates a detailed scheme of the sentence analysis method according to one or more implementations of the invention. Referring to FIG. 1A and FIG. 2, the lexical-morphological structure 222 is determined at the stage of analysis 106 of the source sentence 212. Syntactic analysis is then performed, implemented as a two-stage algorithm (rough syntactic analysis followed by precise syntactic analysis) that uses linguistic models and information of various levels to compute probabilities and generate the most probable ("best") syntactic structure.

[0036] Rough syntactic analysis is applied to the source sentence and includes, in particular, generating all potentially possible lexical meanings of the words that form the sentence or phrase, all potentially possible relations between them, and all potentially possible constituents. All possible surface syntactic models are applied to each element of the lexical-morphological structure; then all possible constituents are built and generalized so that all possible variants of the syntactic parse of the sentence are represented. The result is a graph of generalized constituents 232 for the subsequent precise syntactic analysis. The graph of generalized constituents 232 includes all potentially possible links in the sentence. Rough syntactic analysis is followed by precise syntactic analysis on the graph of generalized constituents, which "extracts" from it one or more syntax trees 242 representing the structure of the source sentence. Building a syntax tree 242 includes making a lexical choice for the vertices of the graph and choosing relations between the vertices of the graph. A set of a priori and statistical ratings may be used when choosing lexical variants and when choosing relations from the graph. A priori and statistical ratings may also be used to evaluate both parts of the graph and the whole tree. In one implementation, one or more syntax trees are built or ordered by descending rating, so that the best syntax tree may be built first. Non-tree links are also checked and built at this point. If the first syntax tree proves unsuitable, for example because the necessary non-tree links cannot be established, the second syntax tree is taken as the best one, and so on. Lexical choice essentially amounts to resolving homonymy (FIG. 1, 120).
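The fallback described above, where candidate trees are tried in descending order of rating until one admits the required non-tree links, can be sketched as follows. This is a minimal illustration and not the patented implementation: `CandidateTree`, `select_best_tree`, the scalar `rating`, and the `can_link` predicate are all hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass(frozen=True)
class CandidateTree:
    """Hypothetical stand-in for a syntax tree extracted from the graph."""
    name: str
    rating: float  # assumed: one integral a priori/statistical rating per tree


def select_best_tree(
    candidates: Iterable[CandidateTree],
    can_link: Callable[[CandidateTree], bool],
) -> Optional[CandidateTree]:
    """Return the highest-rated tree whose non-tree links can be established."""
    for tree in sorted(candidates, key=lambda t: t.rating, reverse=True):
        if can_link(tree):  # assumed check for the required non-tree links
            return tree
    return None  # no candidate parse completed successfully


trees = [
    CandidateTree("t1", 0.91),
    CandidateTree("t2", 0.87),
    CandidateTree("t3", 0.40),
]
# Pretend that the top-rated tree fails the non-tree link check.
best = select_best_tree(trees, can_link=lambda t: t.name != "t1")
```

Here the second-rated tree is selected because the first one is rejected by the link check.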

[0037] Because the lexical choice for the graph vertices and the choice of relations between them are made on the basis of a priori and statistical ratings, in one implementation of the method all variants are not only considered and rated but also stored and indexed at step 108 together with their integral ratings. That is, the index 109 may contain not only highly probable parses of a sentence but also improbable ones, each with a corresponding weight, provided that the parse completed successfully. The weights of the parse variants are later used to compute the relevance rating of a search result. A language-independent semantic structure 252 is then built for the best syntactic structure 246.
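One way to picture an index that keeps improbable parses alongside probable ones is a posting list that carries a weight per entry. The sketch below is only an assumption about how such weights might be stored and read back; the function names, the `word:CLASS` key format, and the sample semantic class names are hypothetical.

```python
from collections import defaultdict


def index_parse_variants(index, sentence_id, variants):
    """Store every successful parse variant together with its weight.

    `variants` is an iterable of (lexical_meanings, weight) pairs.
    """
    for lexical_meanings, weight in variants:
        for meaning in lexical_meanings:
            index[meaning].append((sentence_id, weight))


def lookup(index, meaning):
    """Return {sentence_id: best weight} for one lexical meaning."""
    best = {}
    for sentence_id, weight in index.get(meaning, []):
        best[sentence_id] = max(weight, best.get(sentence_id, 0.0))
    return best


index = defaultdict(list)
# Two parses of the same sentence: a likely one and an unlikely one.
index_parse_variants(index, 7, [
    (["bank:FINANCIAL_INSTITUTION", "close:TO_CLOSE"], 0.9),
    (["bank:RIVER_BANK", "close:TO_CLOSE"], 0.1),
])
```

A query for the unlikely reading still finds sentence 7, but with the low weight 0.1, which a relevance computation could then take into account.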

[0038] A wide range of lexical, grammatical, syntactic, pragmatic, and semantic characteristics is extracted during analysis 106 and construction 107 of semantic structures. For example, the system may extract and store lexical information and information about which semantic classes lexical units belong to; information about grammatical forms and linear order; information about syntactic relations and surface positions; and information about the use of particular forms, aspects, sentiments (such as positive and negative sentiment), deep positions, non-tree links, semantemes, and so on.

[0039] FIG. 3 illustrates an example 300 of a syntactic structure obtained by precise syntactic analysis of the sentence "This boy is smart, he will succeed in life." This tree contains all syntactic information about the sentence, such as lexical meanings, parts of speech, grammatical values, syntactic relations (positions), syntactic models, types of non-tree links, and so on. For example, $Demonstrative, $Modifier_Attributive, $Subject, $Verb, $Complement_Attributive, $Preposition, and $Adjunct_Locative are identifiers of surface positions, while BE, BOY, LIVE, PREPOSITION, TO_SUCCEED, and QUICK_WITTED are identifiers of semantic classes. For example, "smart" fills the surface position $Modifier_Attributive 360 of the governing word "boy" (320) of the lexical class "boy", which belongs to the semantic class BOY; this is expressed by the notation "boy:boy:BOY" (320).

[0040] As shown in FIG. 2, this two-stage syntactic analysis approach yields the syntactic structure of the source sentence, selected from one or more syntactic structures and called the "best syntactic structure" 246. Thus, FIG. 3 illustrates an example of the best syntactic structure obtained by syntactic analysis of the sentence "This boy is smart, he will succeed in life." The two-stage approach follows the principle of integral and purpose-driven recognition: hypotheses about the structure of a part of the sentence are verified, using the available linguistic descriptions, within the structure of the whole sentence. This approach avoids the need to analyze numerous dead-end parse variants. In most cases it substantially reduces the computational resources needed to analyze a sentence.

[0041] FIG. 4 illustrates an example 400 of a semantic structure obtained for the sentence "This boy is smart, he will succeed in life." As shown in FIG. 4, this structure contains all syntactic and semantic information, such as semantic classes, semantemes (not shown in the figure), semantic relations (deep positions), non-tree links, and so on.

[0042] As shown in FIG. 4, the non-tree link 440 connects two parts, namely the two constituents 410 and 420 of the complex sentence "This boy is smart, he will succeed in life." In addition, the referential non-tree link 430 reflects the anaphoric relation between the words "boy" and "he", which identifies the subjects of the two parts of the complex sentence. This link is also shown in the syntax tree (FIG. 3) after the analysis has been performed and the non-tree links have been established.

[0043] The language-independent semantic structure of a sentence is represented as an acyclic graph (a tree supplemented with non-tree links) in which each word of a particular language is replaced by universal (language-independent) semantic entities, referred to here as semantic classes. The semantic class is one of the most important semantic characteristics that can be extracted and used for semantic search, classification, clustering, and filtering of documents written in one or several languages. In addition to semantic classes, semantemes can also accumulate, within language-independent structures, not only semantic but also syntactic, grammatical, and other language-dependent information.

[0044] The semantic classes in the linguistic descriptions used are organized into a semantic hierarchy in which a "child" semantic class and its "descendants" inherit a significant part of the properties of the "parent" and of all preceding semantic classes ("ancestors"). For example, the semantic class SUBSTANCE is a child of the rather broad class ENTITY and is at the same time the "parent" of, among others, the semantic classes GAS, LIQUID, METAL, WOOD_MATERIAL (wood as a material), and so on. Each semantic class in the semantic hierarchy is supplied with a deep (semantic) model. FIG. 5 illustrates a fragment of the described semantic hierarchy.
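The parent and ancestor relations in this example can be sketched with a simple parent map. Only the class names come from the text; the dictionary layout and the helper functions `ancestors` and `is_descendant` are assumptions made for illustration.

```python
# Parent links for the fragment of the hierarchy named in the text.
PARENT = {
    "SUBSTANCE": "ENTITY",
    "GAS": "SUBSTANCE",
    "LIQUID": "SUBSTANCE",
    "METAL": "SUBSTANCE",
    "WOOD_MATERIAL": "SUBSTANCE",
}


def ancestors(semantic_class):
    """List the ancestors of a class, nearest parent first."""
    chain = []
    while semantic_class in PARENT:
        semantic_class = PARENT[semantic_class]
        chain.append(semantic_class)
    return chain


def is_descendant(semantic_class, ancestor):
    """True if `ancestor` lies on the path from the class to the root."""
    return ancestor in ancestors(semantic_class)
```

For instance, `ancestors("GAS")` walks up through SUBSTANCE to ENTITY, which is the chain along which a class would inherit properties.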

[0045] Thus, lexical meanings that are close in sense are, as a rule, concentrated in one "bush" of the semantic hierarchy: in a single semantic class or in "related" semantic classes, that is, classes located close to one another.

[0046] As another example, synonymous lexical meanings (synonyms) in the semantic hierarchy, for example the Russian words "еда", "пища", and "продукты" (all denoting food), are usually located in the same semantic class and have identical or similar semantic characteristics (semantemes). Thus, if the user enables the "Search for synonyms" option and wants to find "пища", the lexical meaning and semantic class of the word are determined first; as a result, documents containing "еда" or "продукты", and possibly other highly representative members of the semantic class FOOD, may also be found. In such cases the search results may be more or less relevant, that is, more or less close to the desired result. A relevance measure may be introduced, for example one based on rating the "proximity" of the lexical meaning in the query to the synonym found; taking into account context, word order, and other factors, this measure may be extended to a sentence, a fragment, and so on.
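A rough sketch of synonym expansion with a proximity rating follows. The text does not define the measure, so Jaccard overlap of semanteme sets is used here purely as a stand-in; the lexicon entries, English glosses, semanteme names, and the TOOL class are hypothetical.

```python
# Hypothetical lexicon: word -> (semantic class, set of semantemes).
LEXICON = {
    "food": ("FOOD", frozenset({"Edible", "Generic"})),
    "nourishment": ("FOOD", frozenset({"Edible", "Generic"})),
    "groceries": ("FOOD", frozenset({"Edible", "Purchased"})),
    "hammer": ("TOOL", frozenset({"Instrument"})),
}


def proximity(a, b):
    """Jaccard overlap of two semanteme sets (an assumed measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def expand_with_synonyms(word):
    """Return same-class words, ranked by proximity to the query word."""
    semantic_class, semantemes = LEXICON[word]
    ranked = [
        (other, proximity(semantemes, other_semantemes))
        for other, (other_class, other_semantemes) in LEXICON.items()
        if other != word and other_class == semantic_class
    ]
    # Highest proximity first; ties broken alphabetically for stability.
    return sorted(ranked, key=lambda pair: (-pair[1], pair[0]))


synonyms = expand_with_synonyms("food")
```

Words from other semantic classes, such as "hammer" here, never enter the candidate list, while members of the same class are ordered by how closely their semantemes match the query word.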

[0047] The deep model is a set of deep positions (types of semantic relations in sentences). Deep positions reflect the semantic roles of child constituents (structural units of a sentence) in various sentences in which objects of the given semantic class serve as the core of the parent constituent, together with the possible semantic classes that can fill those positions. These deep positions express semantic relations between constituents, for example "agent", "addressee", "instrument", "quantity", and so on. A child class inherits and adjusts the deep model of its parent class.
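Inheritance with adjustment of a deep model can be pictured as copying the parent's positions and overriding or adding some of them. The sketch below assumes that a deep model maps each deep position to the set of semantic classes allowed to fill it; the classes ACTION, HUMAN, and WRITING_TOOL and the function `inherit_deep_model` are hypothetical.

```python
def inherit_deep_model(parent_model, adjustments):
    """Copy the parent's deep positions, then apply the child's adjustments.

    Both arguments map a deep position name to the semantic classes that
    may fill it. The parent model is left unmodified.
    """
    model = {pos: set(fillers) for pos, fillers in parent_model.items()}
    for pos, fillers in adjustments.items():
        model[pos] = set(fillers)  # narrow an inherited position or add one
    return model


# Hypothetical deep model of a broad parent class ACTION.
ACTION_MODEL = {"agent": {"ENTITY"}, "instrument": {"ENTITY"}}

# A hypothetical child narrows "instrument" and adds "addressee".
TO_WRITE_MODEL = inherit_deep_model(
    ACTION_MODEL,
    {"instrument": {"WRITING_TOOL"}, "addressee": {"HUMAN"}},
)
```

The child keeps the inherited "agent" position unchanged, restricts "instrument", and gains "addressee", while the parent's model stays as it was.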

[0048] FIG. 6 is a diagram illustrating language descriptions 610 according to one possible implementation of the invention. The language descriptions 610 include morphological descriptions 101, syntactic descriptions 102, lexical descriptions 103, and semantic descriptions 104. FIG. 7 is a diagram illustrating morphological descriptions according to one possible implementation of the invention. FIG. 8 illustrates syntactic descriptions according to one possible implementation of the invention. FIG. 9 illustrates semantic descriptions according to one possible implementation of the invention.

[0049] Refer to FIG. 6 and FIG. 9. As part of the semantic descriptions 104, the semantic hierarchy 910 is the core of the language descriptions 610; it links the language-independent semantic descriptions 104 with the language-dependent lexical descriptions 103, morphological descriptions 101, and syntactic descriptions 102. The links between these descriptions are shown by the double arrows 621, 622, 623, and 624. The semantic hierarchy may be created once and then populated for each particular language. A semantic class in a particular language includes lexical meanings with their corresponding models. The semantic descriptions 104 are language-independent. The semantic descriptions 104 may contain descriptions of deep constituents and may include the semantic hierarchy, descriptions of deep positions, a system of semantemes, and pragmatic descriptions.

[0050] A lexical meaning may have several surface (syntactic) models accompanied by semantemes and pragmatic characteristics. The syntactic descriptions 102 and the semantic descriptions 104 are also linked. For example, a diathesis of the syntactic descriptions 102 can be regarded as an "interface" between the language-dependent surface models and the language-independent deep models of the semantic descriptions 104.

[0051] FIG. 7 illustrates an example of the morphological descriptions 101. As shown in FIG. 7, the components of the morphological descriptions 101 include, but are not limited to, word-inflection descriptions 710, a grammatical system (grammemes) 720, and word-formation descriptions 730. In one possible implementation of the invention, the grammatical system 720 includes a set of grammatical categories, such as "Part of speech", "Case", "Gender", "Number", "Person", "Reflexivity", "Tense", and "Aspect", together with their values, referred to here and below as grammemes. For example, part-of-speech grammemes may include adjective, noun, verb, and so on; grammemes may differ between languages, so that, for example, case grammemes for Russian may include "Nominative", "Genitive", "Dative", and so on; gender grammemes may include "Masculine", "Feminine", "Neuter", and so on. Referring to FIG. 7, the word-inflection descriptions 710 describe how the base form of a word may change depending on case, gender, number, tense, and so on, and in a broad sense include all possible forms of the word. The word-formation descriptions 730 describe which new words can be built from a given word. Grammemes are the units of the grammatical system 720 and, as shown by links 722 and 724, can be used to build the word-inflection descriptions 710 and the word-formation descriptions 730.

[0052] FIG. 8 illustrates the syntactic descriptions 102. In one implementation, the components of the syntactic descriptions 102 may include surface models 810, surface position descriptions 820, analysis rules 860, non-tree syntax descriptions 850, as well as referential and structural control descriptions, government and agreement descriptions, and others. The syntactic descriptions 102 are used to build possible syntactic structures of a sentence in a given source language, taking into account word order, non-tree syntactic phenomena (for example, agreement, ellipsis, and so on), referential control (government), and other phenomena.

[0053] FIG. 9 illustrates the semantic descriptions 104 according to one possible implementation of the invention. While the surface positions 820 reflect syntactic relations and the ways they are realized in a particular language, the deep positions 914 reflect the semantic roles of child (dependent) constituents in the deep models 912. Therefore, descriptions of surface positions, and more broadly surface models, can be specific to each particular language.

The deep model descriptions 920 contain grammatical and semantic restrictions on the fillers of these positions. The properties and restrictions of the deep positions 914 and their fillers in the deep models 912 are very similar, and often identical, across different languages.

[0054] The system of semantemes 930 represents a set of semantic categories. Semantemes may reflect lexical and grammatical properties and attributes, as well as differential properties and stylistic, pragmatic, and communicative characteristics. For example, the semantic category "DegreeOfComparison" can be used to describe the degrees of comparison expressed by different forms of adjectives, such as "easy", "easier", and "easiest". Accordingly, the semantic category "DegreeOfComparison" may include semantemes such as "Positive", "ComparativeHigherDegree", and "SuperlativeHighestDegree". As another example, the semantic category "RelationToReferencePoint" can be used to describe the linear order, that is, whether a reference to an object or event appears in the sentence before or after that object or event; its semantemes are "Previous" and "Subsequent". As a further example, the semantic category "EvaluationObjective" can record the presence of an objective evaluation, such as "Bad", "Good", and so on. Lexical semantemes may describe specific properties of objects, for example "being flat" or "being liquid", and are used in restrictions on the fillers of deep positions. Classifying differential semantemes are used to express differential properties within a single semantic class. For example, in English a hairdresser for men is called a "barber", and in the semantic class HAIRDRESSER this word is assigned the semanteme "RelatedToMen", whereas the same semantic class also contains "hairdresser", "hairstylist", and others.
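The relationship between a semantic category and its semantemes resembles an enumeration and its members. The following sketch models "DegreeOfComparison" that way; the category and semanteme names come from the text, while the use of a Python `Enum` and the word-form table are assumptions made for illustration.

```python
from enum import Enum


class DegreeOfComparison(Enum):
    """Semantic category whose members are the semantemes from the text."""
    POSITIVE = "Positive"
    COMPARATIVE_HIGHER_DEGREE = "ComparativeHigherDegree"
    SUPERLATIVE_HIGHEST_DEGREE = "SuperlativeHighestDegree"


# Hypothetical table assigning a semanteme to each adjective form.
ADJECTIVE_FORMS = {
    "easy": DegreeOfComparison.POSITIVE,
    "easier": DegreeOfComparison.COMPARATIVE_HIGHER_DEGREE,
    "easiest": DegreeOfComparison.SUPERLATIVE_HIGHEST_DEGREE,
}


def forms_with(semanteme):
    """Return the word forms tagged with the given semanteme, sorted."""
    return sorted(w for w, s in ADJECTIVE_FORMS.items() if s is semanteme)
```

All three forms share one lexical meaning and differ only in the semanteme, so a search could match any of them or just one degree.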

[0055] The pragmatic descriptions 940 serve to record the topic, style, or genre of a text during text analysis, and also make it possible to assign the corresponding characteristics to objects of the semantic hierarchy. Examples include "Economic Policy", "Foreign Policy", "Justice", "Legislation", "Trade", "Finance", and so on.

[0056] FIG. 10 is a diagram illustrating the lexical descriptions 103 according to one or more implementations of the invention. The lexical descriptions 103 include a lexical-semantic dictionary 1004, which contains a set of lexical meanings 1012 that, together with their semantic classes, form the semantic hierarchy; each lexical meaning may be accompanied by, among other things, its deep model 912, surface model 810, grammatical value 1008, and semantic value 1010. A lexical meaning is the realization in a particular language of some semantic meaning (sense) and may unite various derivatives (for example, words, expressions, phrases) that express that sense by means of different parts of speech, different word forms, cognate words, and so on. In turn, a semantic class unites the lexical meanings of words and expressions with similar meanings in different languages.

[0057] Any parameter of the language descriptions 610, including lexical meanings, semantic classes, grammemes, semantemes, and many others, is extracted during exhaustive text analysis, and any such parameter can be indexed (an index of that characteristic is created). Indexing of semantic classes is needed in many tasks related to the analysis of natural-language texts, such as semantic search, classification, clustering, text filtering, and many others. Indexing lexical meanings (as opposed to indexing plain words) makes it possible to search not merely for words or word forms but for lexical meanings, that is, words in a particular sense. Syntactic structures and semantic structures can also be indexed and stored for use in semantic search, classification, clustering, and filtering of documents.

[0058] Once a universal semantic structure has been built for each sentence of each text in the corpus, the syntactic and semantic structures are indexed. Lexical meanings are indexed as the result of the lexical choice made at each node of the semantic structure, and each parameter of the morphological, syntactic, lexical, and semantic descriptions can be indexed in the same way as ordinary words. A word index for a document usually includes at least one table in which each word (lexeme or word form) occurring in the document is accompanied by a list of the numbers or addresses of its positions in that document. According to an implementation of this invention, an index can be built for all lexical and semantic meanings, for all semantic classes, and for any values of morphological, syntactic, lexical, and semantic parameters. These parameter values are generated during the two-stage syntactic-semantic analysis, and the resulting indexes can be used to achieve higher precision and relevance of semantic search in corpora of natural-language texts. For example, a user can formulate a query that searches for sentences with nouns having the property "being flat" or "being liquid", or for sentences containing words (nouns and/or verbs) that denote some process, for example production, destruction, or movement.
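
The positional index described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `build_feature_index`, the feature names, and the toy analyzed tokens are all assumptions made for the example.

```python
from collections import defaultdict


def build_feature_index(analyzed_tokens):
    """Map every (feature name, value) pair to the positions where it occurs.

    analyzed_tokens: one dict of features per word position, as a
    (hypothetical) analyzer might emit them.
    """
    index = defaultdict(list)
    for position, features in enumerate(analyzed_tokens):
        for name, value in features.items():
            index[(name, value)].append(position)
    return dict(index)


# Toy analysis result; the feature names and values are illustrative only.
ANALYZED = [
    {"word": "boy", "lexical_meaning": "boy:BOY", "semantic_class": "BOY"},
    {"word": "succeed", "lexical_meaning": "succeed:TO_SUCCEED",
     "semantic_class": "TO_SUCCEED"},
    {"word": "life", "lexical_meaning": "life:LIVE", "semantic_class": "LIVE"},
    {"word": "boy", "lexical_meaning": "boy:BOY", "semantic_class": "BOY"},
]

FEATURE_INDEX = build_feature_index(ANALYZED)
```

Because words, lexical meanings, and semantic classes share one table keyed by feature name, a lookup such as `FEATURE_INDEX[("semantic_class", "LIVE")]` works the same way as a plain word lookup.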

[0059] In one possible implementation of the method, a combination of two, three, or in general N numbers can be used to index various syntactic, semantic, or other parameters. For example, surface or deep slots can be indexed by pairs of numbers: the numbers of the words that are linked in the text by the relation corresponding to that slot. For the semantic structure of the sentence "This boy is smart, he'll succeed in life", shown in FIG. 4, the deep slot 'Sphere' (450) relates the lexical meaning "succeed:TO_SUCCEED" (460) to the lexical meaning "life:LIVE" (470). More specifically, the lexical meaning "life:LIVE" fills the deep slot 'Sphere' of the verb "succeed:TO_SUCCEED". When the index of lexical meanings is built in accordance with the method of this invention, the occurrences of these lexical meanings are assigned numbers according to their positions in the text, for example N1 and N2. When the index of deep slots is built, each deep slot is associated with a list of its occurrences in the document. For example, the index entry for the deep slot 'Sphere' will include, among others, the pair (N1, N2).
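
A deep-slot index of this kind can be sketched as a mapping from slot name to pairs of word numbers. The sketch below assumes 0-based word numbering, whitespace tokenization of the expanded sentence "This boy is smart, he will succeed in life", and a hypothetical second slot named 'Agent' that is not taken from the source.

```python
from collections import defaultdict


def build_slot_index(relations):
    """Map each deep slot name to the (head, filler) position pairs it links.

    relations: iterable of (slot_name, head_position, filler_position).
    """
    slot_index = defaultdict(list)
    for slot, head_pos, filler_pos in relations:
        slot_index[slot].append((head_pos, filler_pos))
    return dict(slot_index)


# Word numbers: This=0 boy=1 is=2 smart=3 he=4 will=5 succeed=6 in=7 life=8
N1, N2 = 6, 8  # "succeed:TO_SUCCEED" and "life:LIVE"

RELATIONS = [
    ("Sphere", N1, N2),  # "life" fills the 'Sphere' slot of "succeed"
    ("Agent", N1, 4),    # hypothetical slot, added only for illustration
]

SLOT_INDEX = build_slot_index(RELATIONS)
```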

[0060] Because the index covers not only words but also their lexical meanings, semantic classes, syntactic and semantic relations, and any other elements of syntactic and semantic structures, it becomes possible to search for context not only by keywords but also by specific lexical or semantic meanings, by meanings belonging to specific semantic classes, and by elements with specific syntactic, semantic, and/or morphological properties or sets (combinations) of such properties. Sentences with non-tree syntactic phenomena, for example ellipsis and coordination, can also be found. Because semantic classes can be searched, it becomes possible to search for semantically related words and concepts.

[0061] We now return to the method itself, shown in FIG. 1. A user query 110 may in general be a group of words, including a sentence, a phrase, and so on. In a particular case it is a set of keywords that must occur in the fragment being sought. The query undergoes semantic-syntactic analysis, as described with reference to FIG. 2, which yields a semantic structure and disambiguation 120. That is, in the best case, the analysis determines for each word of the query exactly which lexical meaning should be searched for. This is possible when all other parse variants, i.e. all lexical variants except the first, have a substantially lower score (below some threshold). In the worst case, when the scores of the variants do not differ substantially, a set of lexical variants (lexical meanings) with corresponding weights is determined for each word, i.e. a ranked list of lexical meanings.
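
The best-case and worst-case outcomes of disambiguation can be illustrated with a small sketch. The source does not say how the threshold is defined, so the relative cutoff used here (`rel_threshold`, a fraction of the top score) is an assumption, as are the function name and the sample meanings; scores are assumed to be non-negative.

```python
def select_meanings(scored_variants, rel_threshold=0.5):
    """Return lexical meanings ranked by score, dropping weak competitors.

    scored_variants: list of (lexical_meaning, score) pairs.
    A variant is kept if its score is at least rel_threshold times the
    best score (an assumed definition of "substantially lower").
    """
    ranked = sorted(scored_variants, key=lambda v: v[1], reverse=True)
    if not ranked:
        return []
    top_score = ranked[0][1]
    return [v for v in ranked if v[1] >= rel_threshold * top_score]


# Best case: one clear winner remains.
CLEAR = select_meanings([("bank:RIVER_BANK", 0.2),
                         ("bank:FINANCIAL_INSTITUTION", 0.9)])

# Worst case: scores are close, so a ranked list is returned.
AMBIGUOUS = select_meanings([("bank:RIVER_BANK", 0.8),
                             ("bank:FINANCIAL_INSTITUTION", 0.9)])
```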

[0062] The weight of each lexical variant in the resulting semantic structure is computed from many factors: the coherence (word compatibility) of the original query, the integral score of the semantic structure obtained from parsing, the a priori rating of the lexical variant, statistical estimates of co-occurrence, and so on.

[0063] Next, at step 130, one or more synonyms may be selected for one or more elements (words) of the query. In one possible implementation, ready-made synonym lists may be used, for example WordNet synsets. In an implementation of this invention, synonym lists are formed at least on the basis of the relative positions of lexical meanings in the semantic hierarchy and the presence of particular differentiating and classifying semantemes in a lexical meaning. For example, the semantic hierarchy contains the semantic class PRINTED_MATTER, which includes the Russian lexical classes "пресса" ("the press") and "печать" ("print", also used for the press). These lexical classes may be considered to be located "sufficiently close" to each other, and therefore, depending on whether their other semantic characteristics match (for example, the presence or absence of certain differentiating semantemes), they can replace each other with a weight of 1 or, for example, 0.9. Thus, given the query "В прессе появились сообщения о приближающейся к Земле комете" ("Reports have appeared in the press about a comet approaching the Earth"), one can also search for "В печати появились сообщения о приближающейся к Земле комете", in which "пресса" is replaced by "печать".

[0064] However, the semantic class PRINTED_MATTER also includes other semantic classes, for example EDITION_AS_TEXT, PERIODICAL, NEWSPAPER, and others. These in turn contain lexical classes, for example "периодика" ("periodicals"), "газета" ("newspaper"), and others. By comparing them with the original lexical meaning "пресса" ("press"), a weight can be computed for the synonym "газета" relative to "пресса". Roughly speaking, this weight depends on the "distance" between them in the semantic hierarchy and on the presence or absence of differentiating semantemes. The "distance" can be computed using a metric.

[0065] Depending on the requirements for accuracy and/or computational complexity, the metric may take various factors into account, including: the presence of a parent-child relationship between the parent semantic classes in the semantic hierarchy, such that the parent and the descendant are separated by no more than a certain number of hierarchy levels; and the presence of a common ancestor of the given semantic classes, together with the distance between the nodes representing those classes. If the lexical classes (meanings) are found to be "close", the metric may also take into account the presence or absence of certain differentiating semantemes and/or other factors, for example the similarity or difference of surface models, including the presence of identical surface slots and their possible fillers.
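
One way to turn hierarchy distance and differing semantemes into a synonym weight is sketched below. Everything specific here is an assumption for illustration: the exact nesting of the classes (the source only says PRINTED_MATTER includes the others), the path-length metric through the nearest common ancestor, and the `decay` and `penalty` factors.

```python
# Assumed nesting of semantic classes; child -> parent.
PARENT = {
    "PRINTED_MATTER": None,
    "EDITION_AS_TEXT": "PRINTED_MATTER",
    "PERIODICAL": "PRINTED_MATTER",
    "NEWSPAPER": "PERIODICAL",
}


def ancestors(cls):
    """Return the chain from cls up to the root, starting with cls itself."""
    chain = []
    while cls is not None:
        chain.append(cls)
        cls = PARENT[cls]
    return chain


def distance(a, b):
    """Number of edges between a and b via their nearest common ancestor."""
    up_a, up_b = ancestors(a), ancestors(b)
    for steps_a, node in enumerate(up_a):
        if node in up_b:
            return steps_a + up_b.index(node)
    return None  # no common ancestor


def synonym_weight(a, b, differing_semantemes=0, decay=0.9, penalty=0.9):
    """Weight in (0, 1]: shrinks with distance and with each differing semanteme."""
    d = distance(a, b)
    if d is None:
        return 0.0
    return (decay ** d) * (penalty ** differing_semantemes)
```

Under these assumptions, two lexical classes in the same semantic class with identical semantemes get weight 1, and one differing semanteme lowers it to 0.9, which matches the example weights mentioned for "пресса" and "печать".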

[0066] Thus, at step 130, one or more synonyms may be selected for one or more elements (words) of the query, each with its own coefficient (weight, rating) relative to the word originally present in the query. For example, the weight may take values in the interval (0, 1]. The highest weight (1) is, as a rule, assigned to the original word present in the query.

[0067] At step 140 the synonyms are ranked, i.e. arranged in order of decreasing rank. Additional query strings are then formulated from the synonyms obtained. These additional strings are formed as all possible combinations (the Cartesian product) of the synonyms, preserving the linear order of the query and taking into account the weight of each synonym included in the query. Here again, the original query string has the highest weight (1).
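
The formation of query variants as a Cartesian product can be sketched as follows. The source says only that the weight of each synonym is taken into account; multiplying the weights is an assumption of this sketch, as are the function name and the sample weights.

```python
from itertools import product
from math import prod


def query_variants(slots):
    """Build all query variants from per-word candidate lists.

    slots: list, in query word order, of lists of (word, weight) pairs;
    the original word of each slot carries weight 1.0.
    Returns (words, weight) pairs sorted by decreasing weight.
    """
    variants = []
    for combo in product(*slots):  # preserves the linear order of the query
        words = tuple(word for word, _ in combo)
        weight = prod(w for _, w in combo)  # assumed combination rule
        variants.append((words, weight))
    variants.sort(key=lambda v: v[1], reverse=True)
    return variants


# Illustrative weights for English glosses of the example in the text.
SLOTS = [
    [("press", 1.0), ("print", 0.9), ("newspaper", 0.7)],
    [("comet", 1.0)],
]
VARIANTS = query_variants(SLOTS)
```

With a product rule, the original query always receives weight 1 and therefore sorts first, consistent with the statement that the original string has the highest weight.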

[0068] At step 150 the search query itself is executed. More precisely, more than one query string may be used for the search, so in practice several queries may be executed simultaneously or sequentially. A computing system with more than one processor or computer may be used. Each query string is represented by the lexical meanings being sought. Each query string has its own weight, computed at step 140 from the presence of synonyms in the query and the weight of each synonym it contains.

[0069] Either full-text or semantic search may be applied at step 150. For full-text search, each query string is converted into words, and the search is performed on an index that is, in general, a word index. An N-gram index may also be used. Full-text search may require an additional step of filtering the results, which includes semantic-syntactic parsing of each found fragment to make sure that the words in the fragment are used in exactly the lexical meaning that was present in the query string.

[0070] In the case of semantic search, step 150 searches the semantic index, i.e. it searches for specific lexical meanings. Another option for semantic search is to search by semantic classes and then refine the results by lexical meanings. In yet another implementation, semantic search may look for a semantic structure corresponding to the query and then compute a score for the degree of match. The index of semantic structures, which is part of the semantic index, is also built at the preliminary stage.

[0071] In both cases, each result (fragment) found receives a weight that depends on the weight of the query string used. Additional penalties that reduce the weight may be applied, for example, when the query words are separated by a non-zero distance in the found fragment or when their linear order has changed.
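
The weighting of a found fragment with penalties for gaps and reordering can be sketched like this. The multiplicative form and the values of `gap_penalty` and `reorder_penalty` are assumptions; the source states only that such penalties may reduce the weight.

```python
def fragment_weight(query_weight, positions, gap_penalty=0.95,
                    reorder_penalty=0.8):
    """Weight of a found fragment, starting from the query string's weight.

    positions: positions in the fragment of the matched query words,
    listed in the order the words appear in the query.
    """
    weight = query_weight
    ordered = sorted(positions)
    # Count the words that sit between consecutive matched words.
    gaps = sum(b - a - 1 for a, b in zip(ordered, ordered[1:]))
    weight *= gap_penalty ** gaps
    # Penalize a fragment whose matched words are not in query order.
    if list(positions) != ordered:
        weight *= reorder_penalty
    return weight
```

For example, `fragment_weight(0.9, [3, 4, 5])` leaves the query weight unchanged because the words are adjacent and in order, while `fragment_weight(1.0, [5, 3, 4])` applies only the reordering penalty.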

[0072] At step 160 the found results are ranked together. The ranking may be based on the weights obtained, and a transformation function may also be applied. Results whose weight is below some threshold may be discarded. In addition, the search results 170 may be shown on the display of the computing system in a user interface, in accordance with the requirements of the search system.
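
A minimal sketch of the overall ranking step follows. The identity default for `transform`, the threshold value, and the choice to apply the threshold after the transformation are assumptions of the example.

```python
def rank_results(results, threshold=0.5, transform=lambda w: w):
    """Rank (fragment, weight) pairs and drop those below the threshold.

    transform: optional function applied to each weight before ranking.
    """
    scored = [(fragment, transform(weight)) for fragment, weight in results]
    kept = [item for item in scored if item[1] >= threshold]
    return sorted(kept, key=lambda item: item[1], reverse=True)


FOUND = [("fragment A", 0.4), ("fragment B", 0.9), ("fragment C", 0.6)]
RANKED = rank_results(FOUND)
```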

[0073] Just as additional query strings are built using synonyms, paraphrases can be used to obtain alternative query strings that express the same meaning. Paraphrases are a set of pairs of strings, where either string may contain one or more words. Such pairs can be obtained, for example, by collecting statistics from the processing of a large number of texts. For example, "in the process of solving the problem" and "when searching for a solution to the problem" may form such a pair. In full-text search, paraphrases can be used in the same way as synonyms. Each pair of paraphrases may also have an a priori weight that depends on how nearly equivalent the two strings are considered to be. For example, the weight of a paraphrase may be computed from how often the two strings occur in identical or similar contexts.
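
Using stored paraphrase pairs to produce alternative query strings for full-text search might look like the sketch below. The plain substring substitution, the pair weight 0.8, and all names are assumptions; the source does not specify how the substitution is performed.

```python
# Each entry: (string A, string B, a priori equivalence weight).
PARAPHRASES = [
    ("in the process of solving the problem",
     "when searching for a solution to the problem",
     0.8),  # illustrative weight
]


def paraphrase_variants(query, pairs):
    """Return the original query (weight 1.0) plus paraphrased variants."""
    variants = [(query, 1.0)]
    for first, second, weight in pairs:
        # A pair is symmetric: either side may be replaced by the other.
        for source, target in ((first, second), (second, first)):
            if source in query:
                variants.append((query.replace(source, target), weight))
    return variants


QUERY = "errors in the process of solving the problem"
PARAPHRASED = paraphrase_variants(QUERY, PARAPHRASES)
```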

[0074] Paraphrases can also be used in semantic search. In one implementation, a paraphrase may replace a fragment of the query string before syntactic analysis when this is worthwhile, for example because another, equivalent phrase is more frequent. Paraphrases may also be generated dynamically, as follows. Once the query string has undergone semantic-syntactic analysis at step 120 for disambiguation, a semantic structure exists for the original string. In accordance with the analysis technology described here, which is part of the general machine translation technology described in a number of patents (US Patent 8,195,447, US Patent 8,214,199, and others), an equivalent sentence can be synthesized from this semantic structure in any language, including the source language. The technology synthesizes not one but many variants of the surface syntactic structures of such sentences, then scores each variant and selects the variants with the highest scores. The surface syntactic structures may also include different lexical variants. For the task of finding paraphrases, several of the best surface structure variants may be selected, namely those whose score exceeds some threshold.

[0075] Various rules may apply when scoring the surface structure variants. For example, for the original sentence "John bought a house by the river", different surface structures of paraphrases can be synthesized, such as "The house by the river was bought by John" and even "The house by the river was sold to John". These variants have computed scores that depend on a number of factors, including the degree of similarity of the synthesized structure to the structure of the original sentence, the presence of the corresponding semantic classes, deep and surface slots, and semantemes, the "degree of kinship" of the lexical classes, the grammatical forms chosen, and so on. A threshold of "deviation" from the original sentence is set, and variants whose score exceeds this value can then be selected as paraphrases to be used.

[0076] FIG. 11A shows an example of a query that uses synonyms, together with the search results obtained. FIG. 11B shows an example of a query and the search results obtained using paraphrases.

[0077] FIG. 12 shows a possible example of a computing device 1200 that can be used to implement the present invention as described above. The computing device 1200 includes at least one processor 1202 connected to a memory 1204. The processor 1202 may be one or more processors and may contain one, two, or more computing cores. The memory 1204 may be random access memory (RAM) and may also include any other types of memory, in particular non-volatile memory devices (for example, flash drives) and permanent storage devices such as hard drives. In addition, the memory 1204 may be considered to include information storage hardware physically located elsewhere in the computing device 1200, for example cache memory in the processor 1202, or memory used as virtual memory and stored on an external or internal permanent storage device 1210.

[0078] The computing device 1200 also typically has a number of inputs and outputs for sending information out and receiving information from outside. For interaction with a user, the computing device 1200 may contain one or more input devices (for example, a keyboard, mouse, or scanner) and a display device 1208 (for example, a liquid crystal display). The computing device 1200 may also have one or more permanent storage devices 1210, for example an optical disc drive (CD, DVD, or other), a hard disk, or a tape drive. In addition, the computing device 1200 may have an interface to one or more networks 1212 that provide connections to other networks and computing devices. In particular, this may be a local area network (LAN) or a wireless Wi-Fi network, whether or not it is connected to the Internet. It is understood that the computing device 1200 includes suitable analog and/or digital interfaces between the processor 1202 and each of the components 1204, 1206, 1208, 1210, and 1212.

[0079] The computing device 1200 runs under the control of an operating system 1214 and executes various applications, components, programs, objects, modules, and so on, denoted collectively by the numeral 1216.

[0080] In general, the programs executed to implement the methods of this invention may be part of the operating system or may be a separate application, component, program, dynamic library, module, script, or a combination thereof.

[0081] This description sets out the authors' main inventive concept, which is not limited to the hardware devices mentioned above. It should be noted that hardware devices are primarily intended to solve a narrow task. Over time and with technical progress, such a task becomes more complex or evolves, and new means appear that are able to meet the new requirements. In this sense, the hardware devices should be considered in terms of the class of technical tasks they solve rather than as a purely technical implementation on a particular component base.

Claims

1. A method of organizing a search in corpora of electronic texts for a computer system, in which the following sequence of actions is performed at least once:

- performing semantic-syntactic analysis of the search query, including constructing a ranked list of possible lexical meanings for at least one word of the search query, where each of the lexical meanings is associated with a corresponding semantic class;

- compiling a list of synonyms for at least one lexical meaning from the ranked list of possible lexical meanings for at least one word of the search query;

- ranking the synonyms from the list of synonyms for the at least one lexical meaning;

- forming query variants taking into account the ranked synonyms of the lexical meanings;

- calculating an assessment of the correspondence of the query variants to the original search query;

- searching the corpus of electronic texts for text fragments that satisfy at least one of the query variants, where the search includes semantic-syntactic analysis of the found text fragments;

- calculating an assessment of the correspondence of the lexical meanings of words in the found fragments to the lexical meanings of words of the variant of the original query;

- ranking the found text fragments in accordance with the assessment of the correspondence of the query variant to the original search query.
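The steps of claim 1 can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the toy `LEXICON`, its weights, and the function names are hypothetical, and plain word matching stands in for the semantic-syntactic analysis, which in practice requires a full linguistic parser.

```python
from itertools import product

# Hypothetical toy lexicon (illustrative data only): each word maps to its
# possible lexical meanings, each with a semantic class, a weight used for
# ranking, and synonyms with a closeness score.
LEXICON = {
    "car": [
        {"semantic_class": "VEHICLE", "weight": 0.9,
         "synonyms": [("automobile", 0.9), ("auto", 0.7)]},
        {"semantic_class": "RAIL_CAR", "weight": 0.1,
         "synonyms": [("wagon", 0.8)]},
    ],
    "repair": [
        {"semantic_class": "TO_FIX", "weight": 1.0,
         "synonyms": [("fix", 0.8)]},
    ],
}


def ranked_candidates(word):
    """Return (term, score) pairs: the word itself plus its ranked synonyms."""
    candidates = [(word, 1.0)]
    for meaning in LEXICON.get(word, []):
        for synonym, closeness in meaning["synonyms"]:
            # Assumption: a synonym's score is meaning weight times closeness.
            candidates.append((synonym, meaning["weight"] * closeness))
    return sorted(candidates, key=lambda c: c[1], reverse=True)


def query_variants(query):
    """Form query variants and score each against the original query."""
    per_word = [ranked_candidates(w) for w in query.lower().split()]
    variants = []
    for combo in product(*per_word):
        terms = tuple(term for term, _ in combo)
        score = 1.0
        for _, term_score in combo:
            score *= term_score
        variants.append((terms, score))
    return sorted(variants, key=lambda v: v[1], reverse=True)


def search(query, fragments):
    """Return (fragment, score) pairs ranked by the best matching variant."""
    results = {}
    for terms, score in query_variants(query):
        for fragment in fragments:
            words = set(fragment.lower().split())
            # Stand-in for semantic-syntactic analysis of the fragment.
            if all(term in words for term in terms):
                results[fragment] = max(results.get(fragment, 0.0), score)
    return sorted(results.items(), key=lambda r: r[1], reverse=True)
```

With the fragments `"car repair manual"`, `"automobile fix shop"`, and `"wagon repair depot"`, the query `"car repair"` ranks the literal match first, the fragment matched through the dominant meaning second, and the fragment matched through the unlikely rail-car meaning last.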

2. The method according to claim 1, further comprising pre-constructing at least one index of words constituting the texts of the corpus of texts and storing the index in memory.
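A word index of the kind mentioned in claim 2 is commonly built as an inverted index. The sketch below is one possible form, not the patented implementation; the function name `build_word_index` and the whitespace tokenization are assumptions.

```python
from collections import defaultdict


def build_word_index(corpus):
    """Map each lowercased word to the sorted ids of the texts containing it.

    `corpus` is a dict of {text_id: text}. Tokenization is a naive
    whitespace split with punctuation stripped (an assumption; a real
    system would use proper lexical analysis).
    """
    index = defaultdict(set)
    for text_id, text in corpus.items():
        for token in text.lower().split():
            word = token.strip(".,;:!?")
            if word:
                index[word].add(text_id)
    return {word: sorted(ids) for word, ids in index.items()}
```

The returned dictionary can be kept in memory or serialized, so that a later search looks up candidate texts by word instead of scanning the whole corpus.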

3. The method according to claim 1 or 2, in which:

- said semantic-syntactic analysis of the text fragments further includes determining the most likely lexical meanings of the words of a sentence.

4. The method according to claim 3, further comprising:

- calculating an integral assessment of the correspondence of the found fragment to the variant of the original query;

- arranging the found fragments in accordance with the assessment of the search query variant and the value of the integral assessment of the correspondence of the found fragment to the variant of the original query.
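The arrangement by two assessments can be sketched as follows. The claim does not state how the two values are combined, so the product used here is purely an assumption, and the `Hit` type and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    fragment: str
    variant_score: float   # assessment of the query variant vs. the original query
    match_score: float     # integral assessment of the fragment vs. the variant


def order_hits(hits):
    """Order hits by a combined score, highest first.

    Assumption: the combined score is the product of the two assessments;
    other combinations (weighted sum, lexicographic order) are equally
    compatible with the claim wording.
    """
    return sorted(hits, key=lambda h: h.variant_score * h.match_score, reverse=True)
```

Under this choice a fragment that matches a slightly weaker variant very well can outrank a fragment that matches the original wording only loosely.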

5. The method according to claim 1, further comprising:

- preliminary semantic-syntactic analysis of the corpus of texts, including determining the lexical meanings of the words of sentences;

- constructing semantic structures of the sentences that make up the texts of the corpus of texts;

- saving the results of the semantic-syntactic analysis in memory; and

- indexing the corpus of texts, including constructing indexes of lexical meanings and semantic structures and storing the indexes.

6. The method according to claim 5, further comprising:

- ranking the found fragments in accordance with the assessment of the search query variant and the value of the integral assessment of the correspondence of the found fragment to the variant of the original query.

7. The method according to p. 1, where the semantic-syntactic analysis of the search query includes the construction of the semantic structure of the search query.

8. The method according to p. 7, further comprising building options for the search query, taking into account paraphrases of at least parts of the search query.

9. The method according to claim 8, where the paraphrases of at least parts of the search query are obtained as a synthesis of at least one fragment in a natural language based on at least one fragment of the semantic structure obtained as a result of semantic-syntactic analysis of the search query.

10. The method according to claim 9, where the resulting paraphrases are evaluated and ranked in accordance with the degree of semantic proximity to the original search query.

11. A system for organizing a search in corpora of electronic texts, including:

- one or more processors;

- one or more memory devices;

- software instructions for a computing device, recorded in the one or more memory devices, which, when executed on the one or more processors, control the system for:

- performing semantic-syntactic analysis of the search query, including constructing a ranked list of possible lexical meanings for at least one word of the search query, where each of the lexical meanings is associated with a corresponding semantic class;

- forming a list of synonyms for at least one lexical meaning from the ranked list of possible lexical meanings for at least one word of the search query;

- ranking the synonyms from the list of synonyms for the at least one lexical meaning;

- forming query variants taking into account the ranked synonyms of the lexical meanings;

- calculating an assessment of the correspondence of the query variants to the original search query;

- searching the corpus of electronic texts for text fragments that satisfy at least one of the query variants, where the search includes semantic-syntactic analysis of the found text fragments;

- calculating an assessment of the correspondence of the lexical meanings of words in the found fragments to the lexical meanings of words of the variant of the original query;

- ranking the found text fragments in accordance with the assessment of the correspondence of the query variant to the original search query.

12. The system of claim 11, further comprising pre-constructing at least one index of words constituting the texts of the corpus of texts and storing the index in memory.

13. The system of claim 11 or 12, wherein said semantic-syntactic analysis of the fragments found further includes determining the most likely lexical meanings of sentence words.

14. The system of claim 13, further comprising:

15. The system of claim 11, further comprising a preliminary semantic-syntactic analysis of the corpus of texts, including determining the lexical meanings of words in sentences;

- saving the results of semantic-syntactic analysis in memory; and

16. The system of claim 15, further comprising:

17. The system of claim 11, wherein the semantic-syntactic analysis of the search query includes the construction of the semantic structure of the search query.

18. The system of claim 17, further comprising constructing search query options considering paraphrases of at least parts of the search query.

19. The system of claim 18, wherein the rephrasing of at least parts of the search query is performed as synthesis of at least one fragment in a natural language based on at least one fragment of the semantic structure obtained as a result of semantic-syntactic analysis of the search query.

20. The system of claim 19, wherein the resulting paraphrases are evaluated and ranked in accordance with the degree of semantic proximity to the original search query.