RU2348072C1

RU2348072C1 - Context-based method of assessing manifestation degree of notion in text for search systems

Info

Publication number: RU2348072C1
Application number: RU2007116774/09A
Authority: RU
Inventors: Алексей Сергеевич Злыгостев (RU)
Original assignee: Алексей Сергеевич Злыгостев
Priority date: 2007-05-03
Filing date: 2007-05-03
Publication date: 2009-02-27
Also published as: RU2007116774A

Abstract

FIELD: physics, computer technology.

SUBSTANCE: the invention relates to information retrieval and intelligent systems, in particular to methods of searching for information in large document databases. A matrix of scores for the relationships between words is computed for each document and summed into a subject-area context matrix; noise elements are removed from the context matrix; an image of each document entered in the search index is generated as a list for each unique term used in the document; an index of term occurrence in documents is built; the accuracy and completeness with which a concept is covered are scored, and these scores are added to the occurrence index. A search query is submitted; the numbers of the documents containing at least one query term are stored in computer memory; the degree to which the queried concept is covered in the found documents is calculated as a function of the accuracy and completeness scores; a score W(i) is calculated as a function of that degree, the proximity and order of the query words, and the match with the query word forms; the found documents are sorted by W(i) and presented to the user.

EFFECT: the size of a document's search image is reduced from a quadratic to a linear dependence on the number of unique terms in the document; the subject-area context is reduced in size; computation becomes more efficient and search more accurate.

Description

The proposed method relates to computer technology, namely to information retrieval and intelligent systems, and in particular to methods of searching for information in large document databases.

A known "Method of searching for information in polythematic arrays of unstructured texts" (patent RU 2266560, G06F, dated 28 April 2004) consists of assigning ordinal numbers to the terms of the query vector; performing a search in which the numbers of the documents containing at least one query-vector term are stored in computer memory; storing in computer memory the number of terms that matched the query terms and the ordinal numbers of the matched terms; and sorting the documents in computer memory into classes with equal numbers of matched terms. The technical result is achieved by introducing a new criterion for returning documents, which lets the user obtain relevant documents rich in new terms needed for further recursions. The effectiveness of the method does not depend on the natural language in which the texts in the database are written.

Essential features of this analogue shared with the claimed method: the terms of the query vector are assigned ordinal numbers; the search stores in computer memory the numbers of the documents that contain at least one query-vector term; the number of terms that matched the query terms and the ordinal numbers of the matched terms are stored in computer memory; and the documents are sorted in computer memory into classes with equal numbers of matched terms.

The disadvantage of this method is that it is multi-iterative: over time recall improves, but search precision decreases.

Also known is a "Method for synthesizing a self-learning system for extracting knowledge from text documents for search systems" (patent RU 2273879, G06F, G09B, dated 28 May 2002), which searches for information by reformulating the user's query and extracting from texts phrases that are similar to the query but are syntactically answers to the question the user posed. This method consists of:

- providing a self-learning mechanism in the form of a stochastically indexed artificial intelligence system based on unique combinations of binary signals of stochastic information indices;

- automatically teaching the system the rules of grammatical and semantic analysis by applying equivalent transformations of stochastically indexed text fragments, logical inference, the formation of linked semantic structures from them, and stochastic indexing for representation in the format of production rules;

- performing morphological analysis and stochastic indexing of linguistic texts in electronic form while automatically teaching the system the rules of morphological analysis;

- performing morphological and syntactic analysis and stochastic indexing of electronic text documents on a given topic in a given language while automatically teaching the system the rules of syntactic analysis;

- performing semantic analysis of stochastically indexed electronic text documents on a given topic while automatically teaching the system the rules of semantic analysis;

- forming the user's query in the given natural language and representing it in electronic form, after stochastic indexing, as an interrogative sentence;

- transforming the stochastically indexed user query into a set of new queries equivalent to the original query;

- preselecting, in accordance with the user's query, stochastically indexed fragments of electronic text documents that together contain all the word combinations of the transformed query;

- forming a stochastically indexed semantic structure from these fragments of text documents;

- forming a short system answer from this structure by logical inference, which links stochastically indexed elements of different texts, and by equivalent transformation of the text;

- checking the relevance of the short answer to the query by forming an interrogative sentence from it and comparing that sentence with the query;

- deciding that the short answer is relevant to the query, and presenting it in the given language, when the resulting interrogative sentence and the query are identical.

Essential features shared with the claimed method: morphological analysis is performed; semantic analysis is performed; a stochastically indexed semantic structure is formed from the indicated fragments of text documents (in the claimed method, semantic indexing of the collection's documents is performed).

Disadvantage of the method: the response time of the system it describes will be long, because extracting information from the found documents is laborious.

Of all known methods, the closest to the claimed one is the "System and method for context-based document retrieval" (US patent No. 6633868). Executed on computing devices, this method includes the following essential features:

- for each document i in the document collection, a matrix D(i) of relationships between words is computed, containing statistics for pairs of words from the document; a frequency vector is computed for each document;

- a context database is generated that includes a matrix C of dimension N×N, where N is the total number of unique words in the document collection; matrix C is computed from the word-relationship matrices D(i) of all documents;

- a search matrix S is computed from the search query and matrix C;

- for each document i, a weight W(i) is computed from the search matrix S and the document's relationship matrix D(i);

- the documents are retrieved and returned, sorted by weight W(i).

In addition, in this method the word relationships in matrix D(i) are built from the number of co-occurrences of word pairs; computing matrix C includes adding together the matrices D(i) of all documents in the collection; computing the search matrix S includes selecting the column vectors of matrix C that correspond to the keywords of the search query and forming S from those column vectors; and the weight W(i) is computed by element-wise multiplication of D(i) and S followed by summation of all the resulting elements.
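The prototype's scoring steps can be illustrated with a small sketch. It is not taken from US 6633868: the function names, the toy matrices and, in particular, the way S is given the same N×N shape as D(i) (query columns of C kept, all other columns zeroed) are assumptions made only so that the element-wise product is defined.

```python
def context_matrix(doc_matrices):
    """C: element-wise sum of the matrices D(i) of all documents."""
    n = len(doc_matrices[0])
    return [[sum(D[r][c] for D in doc_matrices) for c in range(n)]
            for r in range(n)]


def search_matrix(C, query_columns):
    """S: columns of C for the query keywords; other columns zero (assumption)."""
    keep = set(query_columns)
    n = len(C)
    return [[C[r][c] if c in keep else 0 for c in range(n)] for r in range(n)]


def weight(D_i, S):
    """W(i): element-wise product of D(i) and S, summed over all elements."""
    return sum(d * s
               for d_row, s_row in zip(D_i, S)
               for d, s in zip(d_row, s_row))


# Hypothetical 3-word vocabulary and two documents.
D1 = [[0, 2, 0], [2, 0, 1], [0, 1, 0]]
D2 = [[0, 0, 3], [0, 0, 0], [3, 0, 0]]

C = context_matrix([D1, D2])
S = search_matrix(C, query_columns=[0])   # the query consists of word 0
weights = {1: weight(D1, S), 2: weight(D2, S)}
ranking = sorted(weights, key=weights.get, reverse=True)
print(weights, ranking)
```

Note that both C and every D(i) are dense N×N structures here, which is the cost the disadvantages listed next refer to.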

The following essential features of the prototype are shared with the claimed method: to form the subject-area context, a thematic collection of documents is fed to the input of the system; for each document i in the collection a matrix of scores D(i) of the relationships between words is computed; the matrices D(i) are then summed into the context matrix; and semantic analysis is performed to serve a search query.

The disadvantages of the prototype are as follows. It is suitable only for short articles, because the size of a document's search image depends quadratically on the number of words in the document. It also offers no way of reducing the context, whose size without reduction is quadratic in the total number of unique words in the document collection. This is extremely large: even a thematic collection of 3,000 documents may contain more than 50,000 unique words, which amounts to 2.5 GB if each term co-occurrence score is stored in 1 byte. The method is computationally expensive, because every document in the collection must be matched against the query image. Finally, matching document images against the query's search image blurs the boundaries of the concepts in the user's query, so the results may include documents that discuss a related topic but contain neither the query words nor their synonyms.
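The 2.5 GB figure follows directly from the quadratic size of a dense context matrix. A short check, where the function name is illustrative and decimal gigabytes (10^9 bytes) are assumed:

```python
def dense_context_bytes(unique_words, bytes_per_entry=1):
    """Storage for a dense N x N co-occurrence matrix."""
    return unique_words * unique_words * bytes_per_entry


size = dense_context_bytes(50_000)   # 50,000 unique words, 1 byte per score
print(size / 10**9, "GB")            # 2.5 GB
```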

The objectives of the proposed method are: to reduce the size of document search images from a quadratic to a linear dependence on the number of unique words in the document; to reduce the size of the subject-area context by filtering out noise elements and using a compressed storage format; to reduce the computational cost of searching by screening out documents that contain no words from the user's query; and to increase search accuracy by introducing two scores, for the accuracy and for the completeness with which a concept is covered in the text of a document.

The technical result is achieved as follows. To form the subject-area context, a dictionary is introduced that contains the word forms of the language known to the system and allows a word form to be reduced to its normalized form; the dictionary is supplemented with new words from the documents. The matrix of scores D(i) of the relationships between words is computed from the immediate adjacency of terms and from the logical division of the text into sentences and paragraphs, and the matrices D(i) are summed into the subject-area context matrix Cntxt. The matrix Cntxt is filtered of noise elements by cutting off elements with a small, non-unique co-occurrence score; its rows are normalized; and the context Cntxt is compressed so that each row takes the form <term number, score>. Next, to form the search index, the documents entered in it are assigned unique numbers; images of these documents are formed as a list, for each unique term used in the document, of the positions at which its word forms occur; an index of term occurrence in documents is built as a list, for each unique term in the collection, of the documents in which its word form was encountered; scores for the presence of terms in headings and in the text of links to the given document are computed and added to the term occurrence index; and scores for the accuracy and completeness with which a concept is covered are computed and added to the same index. The search query is then represented as a vector in which each term is assigned an ordinal number; the query terms are analysed morphologically using the dictionary to obtain term numbers; the search is performed, storing in computer memory the numbers of the documents that contain at least one term of the query vector, the number of terms that matched the query terms, and the ordinal numbers of the matched terms; and documents that do not contain at least one non-function word from the user's query are discarded. Semantic analysis is performed by computing, for the nouns and set phrases (containing a noun) in the query, the degree R_β to which the concept is covered in the found documents, as a function of the accuracy and completeness scores. The match between the search query and the found documents is evaluated by computing the score W(i) as a function of the coverage R_β of the query concepts, the proximity and order of the query words, the match with the query word forms, the presence of the words in the heading, and the presence of links to the given document that contain the query text. The documents are sorted by W(i) and returned to the user.
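The filtering, row normalization and compression of Cntxt can be sketched as below. This passage does not give the exact noise criterion or the normalization formula, so the fixed threshold `min_score` and the division of each row by its sum are assumptions, as are the function name and the toy data.

```python
def compress_context(cntxt, min_score):
    """Filter, normalize and compress a context matrix.

    cntxt maps a term number to {other term number: co-occurrence score}.
    Returns, per row, a sorted list of (term number, normalized score).
    """
    compressed = {}
    for term, row in cntxt.items():
        # Noise filter (assumed rule): drop scores below a fixed threshold.
        kept = {other: s for other, s in row.items() if s >= min_score}
        total = sum(kept.values())
        if total == 0:
            continue  # nothing informative left in this row
        # Row normalization (assumed form): scores in a row sum to 1.
        compressed[term] = sorted((other, s / total) for other, s in kept.items())
    return compressed


# Hypothetical scores for a 4-term vocabulary.
cntxt = {0: {1: 6, 2: 1, 3: 2}, 1: {0: 6}}
result = compress_context(cntxt, min_score=2)
print(result)
```

Storing only the surviving <term number, score> pairs means a row costs as much as its number of significant neighbours rather than the full vocabulary size.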

To achieve the technical result, the context-based method of evaluating the degree to which a concept is covered in a text, for search systems, starts from the formation of the subject-area context: a thematic collection of documents is fed to the input of the system, a matrix of scores D(i) of the relationships between words is computed for each document i in the collection, and the matrices D(i) are passed to the subject-area context and summed in the context matrix. In the claimed method, to form the subject-area context, a dictionary is introduced that contains the word forms of the language known to the system and allows a word form to be reduced to its normalized form; the dictionary is supplemented with new words from the documents. The matrix of scores D(i) of the relationships between words is computed from the immediate adjacency of terms and from the logical division of the text into sentences and paragraphs, and the matrices D(i) are summed into the subject-area context matrix Cntxt. The matrix Cntxt is filtered of noise elements by cutting off elements with a small, non-unique co-occurrence score; its rows are normalized; and the context Cntxt is compressed so that each row takes the form <term number, score>. Next, to form the search index, the documents entered in it are assigned unique numbers; images of these documents are formed as a list, for each unique term used in the document, of the positions at which its word forms occur; an index of term occurrence in documents is built as a list, for each unique term in the collection, of the documents in which its word form was encountered; scores for the presence of terms in headings and in the text of links to the given document are computed and added to the term occurrence index; and scores for the accuracy and completeness with which a concept is covered are computed and added to the same index. The search query is then represented as a vector in which each term is assigned an ordinal number; the query terms are analysed morphologically using the dictionary to obtain term numbers; the search is performed, storing in computer memory the numbers of the documents that contain at least one term of the query vector, the number of terms that matched the query terms, and the ordinal numbers of the matched terms; and documents that do not contain at least one non-function word from the user's query are discarded. Semantic analysis is performed by computing, for the nouns and set phrases (containing a noun) in the query, the degree R_β to which the concept is covered in the found documents, as a function of the accuracy and completeness scores. The match between the search query and the found documents is evaluated by computing the score W(i) as a function of the coverage R_β of the query concepts, the proximity and order of the query words, the match with the query word forms, the presence of the words in the heading, and the presence of links to the given document that contain the query text. The documents are sorted by W(i) and returned to the user.
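The two index structures and the first search step described above can be sketched as follows. All names and the toy documents are illustrative assumptions; terms are taken to be already normalized, and the heading, link, accuracy and completeness scores are left out.

```python
from collections import defaultdict


def document_image(terms):
    """Document image: for each unique term, the positions where it occurs."""
    image = defaultdict(list)
    for position, term in enumerate(terms):
        image[term].append(position)
    return dict(image)


def build_index(docs):
    """Occurrence index: for each unique term, the documents that contain it."""
    index = defaultdict(set)
    for doc_id, terms in docs.items():
        for term in terms:
            index[term].add(doc_id)
    return index


def candidates(index, query_terms):
    """Documents with at least one query term, with the ordinal numbers matched."""
    matches = defaultdict(list)
    for number, term in enumerate(query_terms, start=1):
        for doc_id in index.get(term, ()):
            matches[doc_id].append(number)
    return dict(matches)


# Hypothetical normalized documents, keyed by their unique numbers.
docs = {1: ["heracles", "son", "zeus"], 2: ["zeus", "hera"], 3: ["mythology"]}
index = build_index(docs)
found = candidates(index, ["zeus", "hera"])
print(document_image(["heracles", "son", "zeus", "heracles"]))
print(found)
```

A document image built this way has one entry per unique term plus one number per occurrence, so it grows linearly with the document, and document 3 is never touched by the query because it shares no term with it.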

Let us consider how the proposed method can be carried out using a concrete example. Suppose a document of two paragraphs is given.

Heracles, a hero in Greek mythology. Heracles was the son of Zeus and the mortal woman Alcmene.

The name "Heracles" most likely means "glorified Hero" or "thanks to Hera".

A dictionary of the Russian word forms known to the system is entered into the system; it allows word forms to be reduced to their normalized form.

This dictionary is supplemented with new words from the document under study (when needed, that is, when the dictionary does not contain them).

Все слова приводят к нормализованной форме, остаются только существительные, прилагательные и отглагольные:All words lead to a normalized form, only nouns, adjectives and verbs remain:

Геракл, … греческий, мифология, герой. Геракл … сын Зевс … смертный, женщина, Алкмена.Hercules, ... Greek, mythology, hero. Hercules ... son of Zeus ... mortal, woman, Alkmena.

Name Hercules … mean glorified Hera … Hera.

First, pairs of terms that stand next to each other and are not separated by stop words or punctuation marks are identified. In the context of document D(i) (where i = 1 in this example), each occurrence of such a pair is scored 1 point. These pairs are: Greek and mythology, mythology and hero, son and Zeus, and so on. After this step, the matrix D(1) looks as follows:

Terms that occur together at least once in one sentence receive 1 point. For the first sentence, one point is added for the pairs:

Hercules and Greek, Hercules and mythology, Hercules and hero, Greek and mythology, Greek and hero, mythology and hero. The matrix D(1) after adding the sentence co-occurrence scores:

Terms that occur together at least once in one paragraph receive 1 point. Thus the pair hero and Alkmena, whose words are both in the first paragraph but in different sentences, receives 1 point. The matrix D(1) after adding the paragraph co-occurrence scores:

The resulting matrix D(1) is then summed into the context matrix Cntxt of the subject area (this matrix has larger dimensions than D(i), because more terms are used across the whole collection of subject-area documents than in any single document). Only nouns and stable phrases (with nouns) are kept as rows of the Cntxt matrix.

In the Cntxt matrix, the scores for corresponding pairs of terms are summed element by element. Since only one document is analyzed in this example, Cntxt is formed from the matrix D(1) described above.
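The three scoring passes and the summation into Cntxt can be sketched in Python as follows. This is a minimal illustration, not the patent's implementation: the input format (paragraphs of sentences of "runs" of adjacent normalized terms) and the names `score_document` and `add_to_context` are assumptions, and the restriction of Cntxt rows to nouns is not modeled. Only the first paragraph of the example is encoded. The sketch also assumes that the paragraph pass gives a point to pairs that already share a sentence; under that reading the Alkmena row receives the scores 3 (woman) and 2 (Hercules, son, Zeus, mortal).

```python
from collections import defaultdict
from itertools import combinations


def score_document(paragraphs):
    """Build a symmetric score matrix D(i) for one document.

    Hypothetical input format: a list of paragraphs; a paragraph is a list
    of sentences; a sentence is a list of runs; a run is a list of
    normalized terms with no stop word or punctuation mark between them.
    """
    d = defaultdict(lambda: defaultdict(int))

    def add(a, b):
        if a != b:
            d[a][b] += 1
            d[b][a] += 1

    for paragraph in paragraphs:
        for sentence in paragraph:
            # Pass 1: one point per occurrence of an adjacent pair.
            for run in sentence:
                for a, b in zip(run, run[1:]):
                    add(a, b)
            # Pass 2: one point per pair of distinct terms in the sentence.
            terms = sorted({t for run in sentence for t in run})
            for a, b in combinations(terms, 2):
                add(a, b)
        # Pass 3: one point per pair of distinct terms in the paragraph.
        terms = sorted({t for s in paragraph for run in s for t in run})
        for a, b in combinations(terms, 2):
            add(a, b)
    return d


def add_to_context(cntxt, d):
    """Element-wise sum of one document matrix into the context matrix."""
    for a, row in d.items():
        for b, score in row.items():
            cntxt[a][b] += score


first_paragraph = [
    [["Hercules"], ["Greek", "mythology", "hero"]],
    [["Hercules"], ["son", "Zeus"], ["mortal", "woman", "Alkmena"]],
]
d1 = score_document([first_paragraph])
cntxt = defaultdict(lambda: defaultdict(int))
add_to_context(cntxt, d1)
print(dict(cntxt["Alkmena"]))
```

Storing the matrix as a dictionary of dictionaries keeps only the non-zero cells, which suits the sparse matrices the method expects.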

After the contexts of all documents have been added to the Cntxt matrix, it is filtered and normalized.

Noise elements are removed from Cntxt. In this example, because the sample is small, it makes sense to remove only the pairs with the minimum co-occurrence score (a score of 1). For large contexts, all elements with the lowest scores are removed up to the point where the integer continuity of the scores is broken (for example, suppose a concept has 15 terms with a score of 1, 4 terms with a score of 2, 2 terms with a score of 1, 1 term with a score of 3, 1 term with a score of 5, 1 term with a score of 7, and so on; the series of natural-number scores has a gap at 4, so all terms with scores below 4 are discarded). This filtering method makes the Cntxt matrix sparser as its dimensions grow. After filtering, the Cntxt matrix (assuming the collection consists of the single document above) looks as follows:
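The gap rule for large contexts can be sketched as follows. The function names `gap_threshold` and `filter_row` are hypothetical, and the threshold is read here as the smallest positive integer missing from the scores of a row; the small example instead uses a fixed threshold of 2, as the text describes.

```python
def gap_threshold(scores):
    """Return the smallest positive integer missing from the scores."""
    present = set(scores)
    gap = 1
    while gap in present:
        gap += 1
    return gap


def filter_row(row, threshold):
    """Keep only the terms whose score is at least `threshold`."""
    return {term: score for term, score in row.items() if score >= threshold}


# Score counts from the example in the text: the first gap is at 4.
scores = [1] * 15 + [2] * 4 + [1] * 2 + [3] + [5] + [7]
threshold = gap_threshold(scores)
kept = [s for s in scores if s >= threshold]
print(threshold, kept)

# Small example: only pairs with a score of 1 are removed.
row = {"woman": 3, "Hercules": 2, "hero": 1}
print(filter_row(row, 2))
```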

Next, the context vectors of the concepts are normalized by dividing each score in a row by the sum of the scores in that row. After normalization, Cntxt takes the form:

Because the matrix is highly sparse, it is convenient to store it in computer memory as a sorted list of terms with scores for each concept. For example, a vector of 5 elements is created for the concept "Alkmena":

<woman, 0.273>, <Hercules, 0.182>, <Zeus, 0.182>, <death, 0.182>, <son, 0.182>

This vector describes the context in which the concept "Alkmena" occurs most often in the thematic collection of documents. The concept "Alkmena" is most strongly associated with the term "woman".
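Row normalization and the compressed list form can be sketched as follows. The raw scores 3 and 2 are an assumption inferred from the 0.273 : 0.182 ratio of the vector above, the function name `compress_row` is hypothetical, and term strings stand in for the term numbers used in the claimed <term number, estimate> form.

```python
def compress_row(row):
    """Normalize a filtered row and return it as a sorted list of pairs.

    Pairs are ordered by descending weight, then by term.
    """
    total = sum(row.values())
    pairs = [(term, score / total) for term, score in row.items()]
    return sorted(pairs, key=lambda pair: (-pair[1], pair[0]))


# Assumed raw scores for the concept "Alkmena" after noise filtering.
filtered = {"Hercules": 2, "Zeus": 2, "death": 2, "son": 2, "woman": 3}
vector = compress_row(filtered)
for term, weight in vector:
    print(f"<{term}, {weight:.3f}>")
```

Because every row sums to 1 after normalization, the full context of any concept equals 1, which the accuracy and completeness estimates rely on later.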

Consider an example of forming the search index from an entry in a concise mythological dictionary:

Hercules (Russian "Геракл"; among the Romans, "Геркулес"): in ancient Greek mythology, the greatest hero, son of Zeus, who performed many feats; after death he was raised to Olympus and received into the host of immortal gods.

In the search index, the entry above is assigned document number 1. The image of this document is formed as a list, for each unique term used in the document, of the positions at which its word forms occur. All words of the Russian language (including stop words) are counted when determining positions. For the example entry, one position is recorded for the term "Hercules": 1.

Next, the index of term occurrence in documents is built as a list, for each unique term of the collection, of the documents in which its word forms were found. For example, the term "Hercules" has a single record, for its occurrence in document 1.
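A document image and the occurrence index can be sketched as follows, using the opening of the Russian entry so that word positions match the text. The tiny `LEMMAS` dictionary and the function names are hypothetical stand-ins for the system dictionary; terms missing from it are simply skipped here.

```python
import re
from collections import defaultdict

# Hypothetical mini-dictionary: word form -> normalized form.
LEMMAS = {
    "геракл": "Геракл",
    "мифологии": "мифология",
    "герой": "герой",
    "сын": "сын",
    "зевса": "Зевс",
}


def document_image(text):
    """Map each known term to the 1-based positions of its word forms.

    Every word counts toward the position, including stop words.
    """
    image = defaultdict(list)
    words = re.findall(r"\w+", text.lower())
    for position, word in enumerate(words, start=1):
        lemma = LEMMAS.get(word)
        if lemma is not None:
            image[lemma].append(position)
    return dict(image)


def occurrence_index(images):
    """Map each term to the set of document numbers that contain it."""
    index = defaultdict(set)
    for doc_id, image in images.items():
        for term in image:
            index[term].add(doc_id)
    return dict(index)


article = ("Геракл (у римлян Геркулес) - в древнегреческой мифологии "
           "величайший герой, сын Зевса, совершивший множество подвигов")
images = {1: document_image(article)}
index = occurrence_index(images)
print(images[1]["Геракл"], images[1]["Зевс"], index["Геракл"])
```

The image holds one position list per unique term, so its size grows linearly with the number of unique terms in the document.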

Then the estimate for the presence of terms in the texts of links to the document is calculated. Since in this example no other documents link to the document under consideration, this estimate is 0 for all terms.

Then the accuracy and completeness estimates are calculated. Consider how the accuracy and completeness of disclosure of the context of the term "Hercules" are estimated for the entry above. All the resulting estimates are added to the search index. For the term "Hercules" the record looks as follows:

Hercules: occurs in document 1; estimate for presence in a heading: 0; estimate for presence in link text: 0; accuracy of disclosure of the concept: 0.2036; completeness of disclosure of the concept: 0.32.

Of the context, the following elements are present in the dictionary entry: <hero, 0.08>, <Zeus, 0.08>, <mythology, 0.08>, <son, 0.08>.

The context of the term t disclosed in the text fragment is ψ'_t = 4 · 0.08 = 0.32.

The full context of the term t is ψ_t = 1 (since the normalized context vector is used).

Let φ_t denote the number of terms in the region of text in which the context of the term t is disclosed; here φ_t = 11 (the first word, "Hercules", is at position 1, and the last context word, "Zeus", is at position 11). Let θ denote a normalizing value: the average size of an encyclopedia entry in which the meaning of a concept can be disclosed. This parameter is chosen empirically, depending on the subject and size of the collection; in this example it is taken to be 15. φ'_t is the normalized text size in terms; it equals 1 when φ_t ≤ θ and φ_t/θ when φ_t > θ.
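The quantities defined so far can be computed as in the sketch below. The variable and function names are hypothetical, and the span φ_t is taken as last position minus first position plus one, which matches the value 11 given in the text. The accuracy and completeness formulas themselves are not reproduced here.

```python
def normalized_size(phi, theta):
    """Return phi'_t: 1 when phi <= theta, otherwise phi / theta."""
    return 1.0 if phi <= theta else phi / theta


# Context elements of "Hercules" that are present in the dictionary entry.
present_context = {"hero": 0.08, "Zeus": 0.08, "mythology": 0.08, "son": 0.08}

psi_disclosed = sum(present_context.values())  # psi'_t
psi_full = 1.0                                 # psi_t, normalized vector

first_position, last_position = 1, 11          # "Hercules" and "Zeus"
phi = last_position - first_position + 1       # phi_t
theta = 15                                     # empirical parameter

print(round(psi_disclosed, 2), phi, normalized_size(phi, theta))
```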

Thus, the accuracy of disclosure of the concept in the dictionary entry is:

The context of the term t disclosed within the whole document is ψ''_t = 0.32 (assuming the page contains nothing other than the dictionary entry).

The completeness of disclosure of the concept is:

Consider the processing of a user's search query. Suppose the user has entered "О Геракле" ("About Hercules") in the query field.

The search query is represented as a vector in which each term is assigned a serial number: "О" ("about") receives number 1 and "Геракле" receives number 2.

Next, morphological analysis of the query terms is performed using the dictionary, giving the dictionary numbers of the normalized forms of the two query words. Part-of-speech information is retrieved from the dictionary by these numbers. "О" is a preposition and therefore a service (function) word. "Геракле" is a noun and is reduced to the normalized form "Геракл" (Hercules).

All documents containing the term "Hercules" are found using the index of term occurrence in documents. The result is a list of one document, number 1, in which the term "Hercules" occurs. The found document contains all non-service words from the user's query.
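The query steps above can be sketched as follows. The `DICTIONARY` contents, the part-of-speech tags, and the function names are hypothetical. The sketch keeps only documents that contain every non-service word of the query, which is one reading of the discard rule and agrees with the outcome of the example.

```python
# Hypothetical dictionary: word form -> (normalized form, part of speech).
DICTIONARY = {
    "о": ("о", "PREP"),
    "геракл": ("Геракл", "NOUN"),
    "геракле": ("Геракл", "NOUN"),
}
SERVICE_POS = {"PREP", "CONJ", "PART"}

# Occurrence index built at indexing time: term -> document numbers.
OCCURRENCE_INDEX = {"Геракл": {1}, "Зевс": {1}}


def parse_query(query):
    """Return (serial number, word form, normalized form, is_service)."""
    vector = []
    for number, form in enumerate(query.lower().split(), start=1):
        lemma, pos = DICTIONARY.get(form, (form, "UNKNOWN"))
        vector.append((number, form, lemma, pos in SERVICE_POS))
    return vector


def find_documents(vector, index):
    """Return sorted numbers of documents with all non-service terms."""
    content = [lemma for _, _, lemma, service in vector if not service]
    if not content:
        return []
    found = set.intersection(*(set(index.get(t, ())) for t in content))
    return sorted(found)


query_vector = parse_query("О Геракле")
print(query_vector)
print(find_documents(query_vector, OCCURRENCE_INDEX))
```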

The degree of disclosure of the concept in the found documents, R_β, is calculated as a function of the accuracy and completeness estimates. Let us give priority to accuracy and take β = 0.9. Then:

In general, the estimate W(i) of how well a found document matches the search query depends on R_β; on the estimate ω (a value from 0 to 1 that depends on the proximity and order of the query words and on the match of word forms with the query); on the estimate λ (a value from 0 to 1 that depends on the number of uses of the term in headings and on the importance of the headings); and on the estimate µ (a value from 0 to 1 that depends on the number of links to the document with the term in the link text). One possible form of the function W(i):

where k is the number of non-service terms in the user's query.

Since the text of the entry contains neither the preposition "о" nor the word form "Геракле", credit is given only for the presence of the term "Hercules", without a word-form match: ω = 0.333. One possible general form of the formula for calculating ω:

where m is the total number of words in the user's query,

l is the number of non-service words,

swl is the number of service words,

presence(j) is a Boolean function that returns 1 if word j is present in the text of the document,

sw_presence(j) is a Boolean function that returns 1 if service word j is present in the text of the document,

order(j) is a Boolean function that returns 1 if word j stands before word j+1 in the text of the document, that is, in the same sequence relative to its neighbor as in the search query, and returns 0 if the position does not match,

distance(j) is a function that returns the distance between words j and j+1 of the user's query; when the two words are adjacent the distance is 1, and if a word is absent the function returns an infinitely large value,

form(j) is a Boolean function that returns 1 if non-service word j appears in the text of the document in the same form as in the user's query.
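Three of these helper functions can be sketched over a positional document image. The signatures are an assumption: the source indexes query words by j, while the sketch passes terms directly. How repeated occurrences are handled (closest pair for `distance`, any earlier occurrence for `order`) is also an assumption, since the text does not say. `sw_presence` and `form` would need the surface word forms and are omitted.

```python
import math


def presence(positions, word):
    """Return 1 if the word occurs in the document, otherwise 0."""
    return 1 if positions.get(word) else 0


def distance(positions, first, second):
    """Return the smallest gap between occurrences of two words.

    Adjacent words give 1; a missing word gives infinity.
    """
    a, b = positions.get(first), positions.get(second)
    if not a or not b:
        return math.inf
    return min(abs(q - p) for p in a for q in b)


def order(positions, first, second):
    """Return 1 if some occurrence of `first` precedes one of `second`."""
    a, b = positions.get(first), positions.get(second)
    if not a or not b:
        return 0
    return 1 if min(a) < max(b) else 0


# Positions taken from the image of the example entry.
positions = {"Геракл": [1], "сын": [10], "Зевс": [11]}
print(presence(positions, "Геракл"))
print(distance(positions, "сын", "Зевс"), order(positions, "сын", "Зевс"))
print(distance(positions, "Геракл", "Олимп"))
```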

From the index of term occurrence in documents it is known that, for this document, the estimate for the presence of query words in headings is λ = 0 (the entry has no heading containing the term "Hercules") and the estimate for links to the document with the query text is µ = 0 (there are no links with the term "Hercules").

W(i) is calculated for document 1 (k = 1, since the query contains only one non-service word):

The resulting documents are sorted by the estimate W(i) and presented to the user in sorted order.

A new information retrieval method is proposed in which the size of document images in the search index depends linearly, rather than quadratically, on the number of unique terms in the document; the size of the subject-area context is reduced by noise filtering and a compressed storage form; the computational cost of searching is reduced by discarding documents that do not contain the query words and by moving the main computations to the indexing stage; search accuracy is increased by using estimates of the accuracy and completeness of disclosure of a concept in the document text; and found documents can be grouped according to the ratio of their accuracy and completeness estimates.

Use of this method makes it possible to build knowledge about a subject area automatically, by extracting it from a collection of text documents, and to process user queries intelligently so as to return the documents in which the meaning of the sought concept is disclosed most fully. This result is achieved by providing a self-learning mechanism in the form of automatic formation of contexts: morphological and syntactic analysis of a collection of text documents on a given topic is performed to form the context of the subject area; a search index is formed that contains, for each document, context-based estimates of the accuracy and completeness of disclosure of the concept; and the relevance of documents is estimated as a relation between the accuracy and completeness parameters, depending on the user's preferences.

Claims

A context-based method for assessing the degree of disclosure of a concept in a text for search engines, which consists in forming the context of a subject area, for which a thematic collection of documents is fed to the input of the system, a matrix D(i) of estimates of the relationships between words is calculated for each document i of the collection, and the matrix D(i) is added to the context of the subject area by summing D(i) into a context matrix, characterized in that, to form the context of the subject area, a dictionary is entered that contains the word forms of the language known to the system and allows a word form to be reduced to a normalized form; the dictionary is supplemented with new words from the documents; the matrix D(i) of estimates of the relationships between words is calculated on the basis of the immediate proximity of terms and the logical division of the text into sentences and paragraphs; the matrices D(i) are summed into the context matrix Cntxt of the subject area; the Cntxt matrix is filtered of noise elements by cutting off elements with a small, non-unique co-occurrence estimate; the rows of the Cntxt matrix are normalized; the context of the subject area Cntxt is compressed so that each row has the form <term number, estimate>; then, to form the search index, the documents entered into the search index are assigned unique numbers; images of these documents are formed as a list, for each unique term used in the document, of the positions at which its word forms occur; an index of term occurrence in documents is built as a list, for each unique term of the collection, of the documents in which its word forms were found; estimates for the presence of terms in headings and in the texts of links to the document are calculated and added to the index of term occurrence in documents; estimates of the accuracy and completeness of disclosure of the concept are calculated and added to the index of term occurrence in documents; then the search query is represented as a vector in which each term is assigned a serial number; morphological analysis of the query terms is performed using the dictionary to obtain term numbers; the search is carried out by storing in computer memory the numbers of the documents in which at least one term of the query vector is present, together with the number of terms matching the query terms and the serial numbers of the matching terms; documents that do not contain at least one non-service word from the user's query are discarded; semantic analysis is performed by calculating, for the nouns and stable phrases (with a noun) in the query, the degree of disclosure of the concept in the found documents, R_β, as a function of the accuracy and completeness estimates; the correspondence of the found documents to the search query is evaluated by calculating the estimate W(i) as a function of the degree of disclosure R_β of the query concepts, the proximity and order of the query words, the match of word forms with the query, the presence of query words in headings, and the presence of links to the document whose text contains the query; and the documents are sorted by the estimate W(i) and presented to the user.