RU2476927C2

RU2476927C2 - Method of positioning text in knowledge space based on ontology set

Info

Publication number: RU2476927C2
Application number: RU2009114293/08A
Authority: RU
Inventors: Сергей Александрович Аншуков; Валерий Владимирович Бардин
Original assignee: Сергей Александрович Аншуков; Валерий Владимирович Бардин
Priority date: 2009-04-16
Filing date: 2009-04-16
Publication date: 2013-02-27
Also published as: RU2009114293A

Abstract

FIELD: information technology.

SUBSTANCE: the result is achieved by a method of positioning text in a knowledge space. In the disclosed method, elements corresponding to patterns are selected from the input data; the patterns belong to taxa, and the taxa form taxonomies that are combined into an ontology. Significant taxa are determined and weighted based on conditions assigned to the patterns. A set of weighted vectors that position the input document in the knowledge space is constructed. A set of ontologies is used for positioning. When constructing the sets of vectors, only those elements are considered which correspond to patterns included in one taxon or in taxa having common parent taxa.

EFFECT: high efficiency of search engines, contextual advertisement systems and systems for behavioural targeting on the Internet.

3 cl, 3 dwg

Description

Field of the Invention

The present invention relates to a method for identifying objects by their textual or other descriptions and can be used, for example, in situation analysis, in information retrieval, in building search engines, in contextual advertising systems, and so on.

Introduction

Taxonomies have long been considered a useful tool for classifying objects. Besides making it possible to name classes of objects, they also make it possible to determine degrees of similarity.

In its simplest form, a taxonomy is a hierarchical grouping of individual concepts into more general classes. Two concepts in a taxonomy share the properties of the grouping level that includes both of them, and the degree to which the concepts are similar depends on the relative position of their classes in the hierarchy.

Various authors, for example:

- P. Resnik, "Using information content to evaluate semantic similarity in a taxonomy", Proceedings of the 14th International Joint Conference on Artificial Intelligence, 1995, pp. 448-453;

- Z. Wu et al., "Verb semantics and lexical selection", Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics, 1994, pp. 133-138;

- D. Lin, "An information-theoretic definition of similarity", Proceedings of the 15th International Conference on Machine Learning, 1998, pp. 296-304;

- J. Rice et al., US patent application No. 20080027929, "Computer-based method for finding similar objects using a taxonomy"

have developed methods for converting the intuitive idea of "similarity" into a numerical measure that can be used to measure the similarity of objects.

Finding objects similar to a given description is used in many fields of technology. For example, someone may want to find a patent similar to a given one, or the subject of "similar" clinical drug trials. In bioinformatics there is a need to search for gene products (for example, proteins) similar to a given gene product. In each of these fields (and not only in them), various detailed taxonomies have been developed and are used to classify sets of objects.

Classification by means of a taxonomy can be more complex than the examples above. First, a class of objects often belongs to more than one "parent" class (that is, a class higher in the hierarchy). Second, taxonomies often change over time: new specialized groups are formed, and the content of old groups changes. Third, even with an unchanged taxonomy, the classification of a particular object may change as knowledge about it changes, and users of a taxonomy may disagree about where particular objects belong in it.

Text Analysis Issues

Text analysis systems must solve a number of problems, in particular:

(a) the problem of word forms: a single word in Russian can have up to 40 different word forms, which complicates search;

(b) the problem of synonymy: different words are used to describe the same phenomenon or idea. A user searching with the keyword «врач» ("physician") may fail to find the word «доктор» ("doctor"), which denotes the same concept in this context;

(c) the problem of polysemy: one word can have several unrelated meanings (for example, the Russian word «лук» means both "onion", a plant, and "bow", a weapon). A search is therefore likely to return a document that uses the word in a meaning other than the one the user needs.

The question of how to define similarity measures when comparing documents is also significant.

Similarity measures

The literature describes three different groups of similarity measures that can be applied to taxonomies. The first group, called "term similarity measures", can be used to compute the similarity of individual terms. The other two groups can be applied when an object is described by several terms.

In the context of machine translation, Wu and Palmer, in the above-mentioned article "Verb semantics and lexical selection", described a similarity measure based on the taxonomic depth of the nearest common ancestor of two terms relative to the taxonomic depth of the individual terms. The closer the common ancestor is to the terms, the greater the similarity. The problem with this approach is that some parts of a taxonomy may be highly developed and contain a significant number of terms, while in other parts the density of terms is lower. This difference in term density makes this and other similarity measures based on simply counting edges in the graph insufficiently accurate.
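The depth-based measure described above is commonly written as sim(a, b) = 2·depth(lca) / (depth(a) + depth(b)), where lca is the nearest common ancestor. The following is a minimal sketch of that formula, assuming a toy tree-shaped taxonomy; the `PARENT` table and all function names are hypothetical and are not taken from the cited article.

```python
# Hypothetical toy taxonomy as child -> parent links (a tree: one parent per node).
PARENT = {
    "animal": None,
    "mammal": "animal",
    "bird": "animal",
    "dog": "mammal",
    "cat": "mammal",
    "sparrow": "bird",
}

def path_to_root(term):
    """Return [term, parent, ..., root]."""
    path = []
    while term is not None:
        path.append(term)
        term = PARENT[term]
    return path

def depth(term):
    """Depth counted in nodes, so the root has depth 1."""
    return len(path_to_root(term))

def nearest_common_ancestor(a, b):
    ancestors_of_a = set(path_to_root(a))
    for node in path_to_root(b):  # walks upward from b
        if node in ancestors_of_a:
            return node
    raise ValueError("terms do not share a root")

def depth_similarity(a, b):
    """Depth of the nearest common ancestor relative to the depths of the terms."""
    lca = nearest_common_ancestor(a, b)
    return 2 * depth(lca) / (depth(a) + depth(b))
```

In this toy tree, "dog" and "cat" meet at "mammal" and score higher than "dog" and "sparrow", which only meet at the root. The score depends only on node depths, which is why uneven term density across the taxonomy distorts it.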

The idea of using information content to measure similarity belongs to Resnik (see above). Using the taxonomies of the WordNet project and word frequency estimates obtained from a large corpus of English texts, Resnik computed the semantic similarity of word pairs by selecting the common ancestor with the greatest information content. For words with several meanings, Resnik used the sense that gave the maximum similarity. Using human judgments as the standard, Resnik found that this measure works better than the measures described above, which are based on counting links. Resnik's measure cannot be used when an object is described by several terms rather than one. It also has a number of other shortcomings. Its range of similarity values is not normalized and, more importantly, by choosing the parent nodes with the greatest information content, Resnik underestimates the similarity of objects, focusing on the single most important aspect of similarity at the cost of ignoring all the others.

Lin provided an axiomatic definition of similarity and showed how Resnik's approach can be adapted for use in this situation. Resnik's similarity measure was based solely on the commonality of word meanings, whereas Lin's approach also takes into account the differences between word meanings in order to obtain a normalized similarity coefficient. Lin compared his similarity measure with those of Resnik and of Wu and Palmer and found that it yields similarity scores that correlate better with human judgments than their methods do. However, Lin did not describe how his measure could be used when several terms may be used to describe an object.
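The two information-content measures discussed above can be sketched as follows, using their commonly cited forms: IC(c) = -log p(c), a Resnik-style score equal to the highest IC among common ancestors, and a Lin-style score that normalizes it by the IC of the two terms. The toy taxonomy and the probabilities in `PROB` are invented for illustration only.

```python
import math

# Hypothetical toy taxonomy (child -> parent) and made-up concept probabilities.
PARENT = {"animal": None, "mammal": "animal", "bird": "animal",
          "dog": "mammal", "cat": "mammal", "sparrow": "bird"}
PROB = {"animal": 1.0, "mammal": 0.6, "bird": 0.4,
        "dog": 0.3, "cat": 0.2, "sparrow": 0.2}

def information_content(term):
    """IC(c) = -log p(c); rarer concepts carry more information."""
    return -math.log(PROB[term])

def ancestors(term):
    """The term itself plus all of its ancestors up to the root."""
    found = set()
    while term is not None:
        found.add(term)
        term = PARENT[term]
    return found

def resnik_similarity(a, b):
    """Highest information content among the common ancestors (unnormalized)."""
    common = ancestors(a) & ancestors(b)
    return max(information_content(c) for c in common)

def lin_similarity(a, b):
    """Shared information normalized by the information in both terms."""
    total = information_content(a) + information_content(b)
    if total == 0:  # both terms are the root concept
        return 1.0
    return 2 * resnik_similarity(a, b) / total
```

The Resnik-style score has no fixed upper bound, while the Lin-style score stays between 0 and 1, which matches the normalization issue raised in the text.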

The second group is described in the literature as similarity measures based on the number (frequency) of occurrences of terms that are common to the descriptions of both objects. Measures in this group include Jaccard, Dice and Set Cosine, which are often used in information retrieval systems and differ from one another in how the number of common terms in the descriptions is normalized, as well as the measure described by Keller et al. in the article "Taxonomy-based soft similarity measures in bioinformatics" (Proceedings of the 2004 IEEE International Conference on Fuzzy Systems, Volume 1, 25-29 July 2004, pp. 23-29). None of these similarity measures takes the structure of the taxonomy into account. Any candidate object whose description has no terms in common with the object being compared receives a similarity coefficient of zero, although the two objects may in fact be very similar.
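The three set-overlap measures named above differ only in how the count of shared terms is normalized. A minimal sketch, assuming each object description is given as a plain Python set of terms (the example sets are invented):

```python
import math

def jaccard(a, b):
    """Shared terms divided by the size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def dice(a, b):
    """Twice the shared terms divided by the sum of the set sizes."""
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 0.0

def set_cosine(a, b):
    """Shared terms divided by the geometric mean of the set sizes."""
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))

# Invented example descriptions.
doc_a = {"dog", "cat", "pet"}
doc_b = {"cat", "pet", "food", "bowl"}
```

For descriptions with no shared terms, such as `{"physician"}` and `{"doctor"}`, all three functions return zero, which illustrates the limitation noted in the text: the taxonomy that would relate the two terms is never consulted.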

The third approach to measuring similarity uses a term similarity measure to determine the similarity between individual terms and then combines the results into an overall similarity coefficient. Halkidi et al., in the article "THESUS: Organizing Web document collections based on link semantics" (The VLDB Journal - The International Journal on Very Large Data Bases, Vol. 12, Issue 4 (November 2003), pp. 320-332), describe a similarity measure of this type for use in clustering web pages. Using the Wu-Palmer similarity measure, Halkidi considers each term in the source and base documents separately and finds the most similar term in the other document. Then, for each set of terms (from the source and base documents respectively), the similarity coefficients are determined and averaged. After that, Halkidi combines the similarity coefficients of the two sets of terms without weighting functions. Because Halkidi relies on the Wu-Palmer similarity measure, the method cannot be applied to taxonomies represented by directed acyclic graphs.
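The best-match combination described above can be sketched as follows. This is an illustration under stated assumptions, not the published THESUS algorithm: the pairwise scores in `PAIR_SIM` are invented stand-ins for a real term similarity measure, and the final step is assumed to be a plain unweighted mean of the two directional averages.

```python
# Invented pairwise term similarities; identical terms score 1.0.
PAIR_SIM = {
    frozenset(("dog", "cat")): 0.5,
    frozenset(("dog", "sparrow")): 0.25,
    frozenset(("cat", "sparrow")): 0.25,
}

def term_similarity(a, b):
    if a == b:
        return 1.0
    return PAIR_SIM.get(frozenset((a, b)), 0.0)

def best_match_average(source_terms, target_terms):
    """For each source term take its best match in the target set, then average."""
    best = [max(term_similarity(s, t) for t in target_terms) for s in source_terms]
    return sum(best) / len(best)

def best_match_similarity(terms_a, terms_b):
    """Unweighted mean of the two directional best-match averages (assumption)."""
    return (best_match_average(terms_a, terms_b)
            + best_match_average(terms_b, terms_a)) / 2

doc_a = ["dog", "cat"]
doc_b = ["cat", "sparrow"]
```

With these invented scores the direction from `doc_a` to `doc_b` averages 0.75, the reverse direction averages 0.625, and the combined value is 0.6875.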

Wang et al., in the article "Gene expression correlation and gene ontology-based similarity: An assessment of quantitative relationships" (Computational Intelligence in Bioinformatics and Computational Biology, 2004. Proceedings of the 2004 IEEE Symposium on CIBCB, 7-8 Oct. 2004, pp. 25-31), describe a similarity measure that avoids the problem left unsolved by Halkidi's method by using a generalized form of Lin's information-theoretic similarity to determine the similarity of each pair of terms.

It should be noted that Wang et al. generalize Lin's formula for use in a taxonomy described by a directed graph by choosing the nearest common ancestor with the highest information content. Their approach differs from that of the patent application of Rice et al. in that they apply the similarity measure only to pairs of terms and do not exactly follow Lin's axiomatic definition of similarity, because they consider only part of the shared meaning of the terms. Wang uses a function different from Halkidi's to combine the similarity coefficients of term pairs. Instead of averaging the coefficients of the best matches from both sets of terms, they average the similarity values over all pairs of terms.
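The all-pairs combination mentioned in the last sentence can be sketched in a few lines. The pairwise scores are invented placeholders for whatever term similarity measure is used, and the function name is hypothetical.

```python
from itertools import product

# Invented pairwise term similarities; identical terms score 1.0.
PAIR_SIM = {
    frozenset(("dog", "cat")): 0.5,
    frozenset(("dog", "sparrow")): 0.25,
    frozenset(("cat", "sparrow")): 0.25,
}

def term_similarity(a, b):
    if a == b:
        return 1.0
    return PAIR_SIM.get(frozenset((a, b)), 0.0)

def all_pairs_average(terms_a, terms_b):
    """Average the term similarity over every (a, b) pair, not only best matches."""
    scores = [term_similarity(a, b) for a, b in product(terms_a, terms_b)]
    return sum(scores) / len(scores)
```

Because every pair contributes, weak pairs pull the result down: for `["dog", "cat"]` and `["cat", "sparrow"]` the four pair scores are 0.5, 0.25, 1.0 and 0.25, giving an average of 0.5.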

All similarity measures in this third group artificially separate combinations of terms for which common taxonomic ancestors exist from those for which they do not.

Keller et al. (see above) present several ways of determining similarity using fuzzy measures based on the depth of the information content of the terms. However, these measures require either subjectively chosen weighting coefficients or the solution of polynomial systems of equations of high complexity, which makes these methods inefficient for large volumes of information.

While they address the task of finding "similar" or "close" documents, none of the above methods addresses the task of interpreting the search results. By analogy, one can measure the exact distances between geographical objects and still have no idea of where those objects lie on the Earth's surface.

The key difference between the method used in this invention and the methods described above is that it makes it possible to position a document or documents in the knowledge space, rather than being limited to computing similarity measures for pairs or groups of documents.

Disclosure of the invention

Definitions

Before proceeding to the description of the claimed method, it is useful to define some of the concepts that occur in the description and in the appended claims.

Pattern - verbal, graphic, numerical and other components in their admissible forms and modifications, or a phrase that includes a microcontext.

Microcontext - a part of a pattern that has an independent meaning (for example, a set of characters may be assigned the meaning "date", "price", "full name", etc.).

Directed acyclic graph (DAG) - a directed graph that contains no directed cycles, that is, no paths that start and end at the same vertex.
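Whether a given graph satisfies this definition can be checked mechanically. The sketch below uses Kahn's topological-sort algorithm, a standard technique that is not mentioned in the patent; the adjacency-dictionary representation (node mapped to a list of child nodes) and the example graphs are assumptions made for illustration.

```python
from collections import deque

def is_acyclic(edges):
    """Return True if the directed graph {node: [children]} has no directed cycle."""
    nodes = set(edges)
    for children in edges.values():
        nodes.update(children)

    # Count incoming edges for every node.
    indegree = {node: 0 for node in nodes}
    for children in edges.values():
        for child in children:
            indegree[child] += 1

    # Repeatedly remove nodes that have no remaining incoming edges.
    queue = deque(node for node in nodes if indegree[node] == 0)
    removed = 0
    while queue:
        node = queue.popleft()
        removed += 1
        for child in edges.get(node, ()):
            indegree[child] -= 1
            if indegree[child] == 0:
                queue.append(child)

    # Nodes on a cycle never reach indegree 0, so they are never removed.
    return removed == len(nodes)

# A node with two parents is allowed in a DAG; a loop is not.
multi_parent = {"goods": ["food", "garden"], "food": ["onion"], "garden": ["onion"]}
loop = {"a": ["b"], "b": ["a"]}
```

Here `multi_parent` passes the check even though "onion" has two parents, while `loop` fails it.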

Taxonomy - a directed acyclic graph that represents a hierarchy (tree) of concepts ordered from general categories down to specific ones, where one or more patterns belong to each category. The nodes of the graph are called taxa. Any taxon, except the topmost one in a taxonomy, can belong to one or more taxonomies.

In this application, an Ontology is understood as a set of taxonomies whose nature (meaning) allows a similar interpretation and which, in particular, can be characterized by a single significance coefficient (see Fig. 1).

Knowledge space - a collection of ontologies describing different subject areas.

Vector - a collection of sets of significant categories to which the document under consideration belongs, together with the weights assigned to them. Each document can be described by a set of vectors.

Categorizer - a hardware and software mechanism designed to produce a description of the input text as a set of vectors that position the input document in the knowledge space. The categorizer uses a priori knowledge about the relations and hierarchy of concepts and about the characteristic features of categories.
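The definitions above suggest a simple data model, sketched below. It is only an illustration of the vocabulary (pattern, taxon, taxonomy, ontology, knowledge space, vector), not the claimed method: all class and function names are hypothetical, patterns are reduced to single lowercase words, and a taxon's weight is assumed to be a plain count of pattern hits.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Taxon:
    name: str
    patterns: List[str]
    children: List["Taxon"] = field(default_factory=list)

@dataclass
class Taxonomy:
    root: Taxon

@dataclass
class Ontology:
    name: str
    taxonomies: List[Taxonomy]

def iter_taxa(taxon):
    """Yield a taxon and all taxa below it."""
    yield taxon
    for child in taxon.children:
        yield from iter_taxa(child)

def position(text: str, knowledge_space: List[Ontology]) -> Dict[str, Dict[str, int]]:
    """Return one {taxon name: weight} vector per ontology that matched the text."""
    words = text.lower().split()
    vectors = {}
    for ontology in knowledge_space:
        weights = {}
        for taxonomy in ontology.taxonomies:
            for taxon in iter_taxa(taxonomy.root):
                hits = sum(words.count(p) for p in taxon.patterns)
                if hits:
                    # Assumption: weight = raw hit count; a shared taxon keeps one value.
                    weights[taxon.name] = hits
        if weights:
            vectors[ontology.name] = weights
    return vectors

# Invented knowledge space with two ontologies.
vegetables = Taxon("vegetables", ["onion", "carrot"])
archery = Taxon("archery", ["bow", "arrow"])
knowledge_space = [
    Ontology("goods", [Taxonomy(Taxon("food", [], [vegetables]))]),
    Ontology("sports", [Taxonomy(Taxon("shooting", [], [archery]))]),
]
```

For the text "Fresh onion and carrot soup", this sketch returns a single vector, `{"goods": {"vegetables": 2}}`. A text that mentions both vegetables and archery would receive one vector per ontology.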

АналогиAnalogs

A method of "identifying objects by their descriptions" is known (RF patent No. 2167450, IPC G06F 17/30, published 27.12.2000), which uses a purely statistical similarity measure based on the frequency of the words occurring in a document. Because of the limitations of such statistical methods, the problems of polysemy and synonymy in the source texts remain unsolved, which significantly reduces the quality of identification compared with the proposed invention.

The method of Latent Semantic Analysis (LSA) is known and can be used to determine the similarity of documents (US patent 4839853, "Computer information retrieval using latent semantic structure", Deerwester et al., international classification G06F 17/21; G06F 17/30; G06F 15/40). LSA uses statistical text processing methods: information about the frequency of words and phrases is extracted from text corpora, and information about "concepts" is in turn extracted from it by statistical methods, which makes it possible to determine the "similarity" of documents afterwards. Unlike the proposed invention, LSA does not use taxonomies as the basis for computing similarity and relies on statistical methods alone. For example, the weight of a term is inversely proportional to the frequency of its occurrence in the text, regardless of the subject of the document. LSA does not position documents in the knowledge space.

Approaches to constructing a systematic classification of the knowledge space are described in the book by S. Kordonsky, "Cycles of activity and ideal objects" (Moscow, 2001).

A method of searching for documents using taxonomies is known (US patent 6442545, "Term-level text with mining with taxonomies", Feldman et al., international classification G06F 17/30). It does not provide for the same category belonging to different taxonomies, for combining taxonomies into ontologies, or for applying the notion of a "conditionality coefficient" to an ontology.

Also known are the method of generating taxonomies described in US patent 6360227 ("System and method for generating taxonomies with applications to content-based recommendations", Aggarwal et al., international classification G06F 17/30) and the method of creating categorized document bases described in US patent application 20070106662 ("Categorized document bases", Kimbrough et al.), both of which consider the similarity of pairs of records (documents) with respect to a given taxonomy. Unlike the present invention, they do not solve the problem of positioning documents in the knowledge space and relate only to the search for similar documents.

The closest to the proposed method is US patent application 20080027929 by Rice et al., "Computer-based method for finding similar objects using a taxonomy".

The method presented in this application differs from that of Rice et al. in that it uses the concept of an "ontology" as a set of taxonomies, which makes it possible to handle cases where a taxon belongs to one or more taxonomies. It also uses the concept of a conditionality coefficient, assigned to a taxonomy, to determine whether a pattern may be used when calculating the weight of a category.

It should be noted that documents encountered in real life rarely cover a single topic. Accordingly, a document is usually positioned in the knowledge space not by one vector but by several, each of which can be assigned a weight coefficient.

As indicated in the aforementioned work of S. Kordonsky, the position of the observer determines how significant particular aspects of a document are to that observer, which makes any attempt to determine the single correct "main" topic of a document meaningless.

If we accept that no "single correct" and all-encompassing ontology exists for real texts, the need becomes obvious for a positioning apparatus that produces multiple results and leaves the user free to choose the category most relevant to them at the moment.

Unlike the present invention, none of the analogues described above addresses this problem.

No objects containing the combination of features indicated above have been identified in the prior art. The claimed method can therefore be considered novel.

Nor does the prior art disclose the set of features that distinguish the method from the closest analogue mentioned above. The claimed method can therefore be considered to involve an inventive step.

Thus, creating a positioning mechanism that works with a set of ontologies, each represented by its own set of taxonomies, is a relevant technical result for building a system that functions in practice.

The essence of the invention is a method of positioning documents in a knowledge space represented by a set of ontologies, each of which is in turn represented by a set of taxonomies that group patterns (see Fig. 2). Unlike the closest analogue (the patent application of Rice et al.), taxa are used as generalizations of terms and can describe not only sets of relevant terms but also phrases and expressions treated as features of the corresponding categories. As in the analogue, the same pattern can belong to different categories. Unlike the closest analogue, in this method each category can simultaneously be the root of one tree and a subcategory in other trees.

Unlike all of the methods listed above, in this invention taxonomies are combined into ontologies with common rules for interpreting taxa, and additional conditions of use can be imposed on a pattern. For example, the conditional pattern "financial crisis" can belong to category trees whose top nodes are "Russia", "Germany" and "USA", but in each of the three taxa it is considered only if other patterns that clearly belong to that tree are present.

The knowledge space, ontologies, taxonomies and the forms of their representation, whether as directed acyclic graphs or otherwise, are not the subject of the present invention. The well-known RDF (Resource Description Framework) language and other well-known languages for describing ontologies and taxonomies can be used as the input description of taxonomies, with subsequent translation into the internal representation language of the categorizer (310, Fig. 3). Other forms of input description are also possible; they require dedicated converter software that translates the external representation into the internal representation of the categorizer.

Brief Description of the Drawings

Fig. 1. Diagram showing an example of an ontology that combines taxonomies represented as directed acyclic graphs.

Fig. 2. Diagram showing an example of a taxonomy represented as a directed acyclic graph, with the patterns attached to its taxa.

Fig. 3. Diagram illustrating an embodiment of the invention.

Detailed Description of an Embodiment of the Invention

The input text 301 (Fig. 3) is split into sentences 302. Each sentence is examined for words and expressions 303 that correspond to patterns known to the categorizer 310. Based on the patterns P_H for which relevant words or expressions were found in the text, subtrees (subgraphs) 304 are selected in the available taxonomies and ontologies; each subtree contains the concepts corresponding to a given pattern together with all of their parent concepts. The patterns P_H that were found support the hypothesis that the text belongs to the corresponding categories. Both the categories directly confirmed by patterns P_H and the categories that include those directly confirmed categories are considered. Subtrees consisting of a single taxon leading to a single concept are discarded.
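The matching and subtree-selection steps described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the toy taxonomy `PARENTS`, the pattern table `PATTERNS`, the naive sentence splitter and all function names are hypothetical and are not taken from the patent, which leaves the concrete implementation open.

```python
import re

# Hypothetical toy taxonomy: category -> list of parent categories (a DAG).
PARENTS = {
    "gas transit agreement": ["international economic relations"],
    "international economic relations": ["economy"],
    "government of Ukraine": ["politics"],
    "economy": [],
    "politics": [],
}

# Hypothetical patterns: phrase -> categories it is directly attached to.
PATTERNS = {
    "gas conflict": ["gas transit agreement"],
    "Viktor Yushchenko": ["government of Ukraine"],
}


def split_sentences(text):
    """Naive splitter on ., ! and ?; a real system would use a proper tokenizer."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def find_patterns(text):
    """Return the set of known patterns that occur in at least one sentence."""
    found = set()
    for sentence in split_sentences(text):
        lowered = sentence.lower()
        for phrase in PATTERNS:
            if phrase.lower() in lowered:
                found.add(phrase)
    return found


def ancestors(category):
    """Return the category together with all of its parent categories."""
    seen = set()
    stack = [category]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(PARENTS.get(node, []))
    return seen


def candidate_categories(text):
    """Directly confirmed categories plus every category that includes them."""
    result = set()
    for phrase in find_patterns(text):
        for category in PATTERNS[phrase]:
            result |= ancestors(category)
    return result
```

Substring matching is used here only to keep the sketch short; the description allows patterns to be words, phrases or expressions, so a practical matcher would likely need morphology-aware comparison.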

Next, the weight of each category 305 is determined in each of the resulting subtrees. The weight of a category is calculated as:

[formula image not reproduced]

where

w₁ is the sum of the weights of all concepts directly linked to this category,

w₂ is the weight of all subtrees belonging to this node, calculated as

[formula image not reproduced]

where

w is the sum of the weights of all patterns P_H included in the subtree,

c₁ is the number of branches in the subtree,

o₁ is the number of patterns in the subtree,

n_c is the number of nodes in the subtree.

The weight of a pattern is always equal to

[formula image not reproduced]

where

n_c is the number of categories to which the pattern P_H is directly linked,

K is a coefficient taking values from 0.3 to 3.

The set of vectors leading from the top categories of a taxonomy to the confirmed subcategories, ranked by category weight, gives the position of the document in the knowledge space.

Subtrees 306 that are linked to a common parent category or, where there is none, to the top confirmed categories, and whose weight is below the cutoff threshold of the ontology they belong to, are discarded, that is, treated as insignificant.

The significance of a category in a taxonomy depends on the ontology and is determined by the cutoff number, which is a characteristic of that ontology. Based on the accumulated weight, categories can be ranked as more or less significant for a given document. Only categories belonging to taxonomies of the same ontology can be compared with each other. For example, given the taxonomies "color" and "vehicles", which belong to different ontologies, it is pointless to ask, for the input text "red car", whether the object is more red than it is a car.

Only categories whose weight exceeds the cutoff number specified for the ontology are treated as significant. For example, a text containing only fragments that match the patterns "Viktor Yushchenko", "Yulia Tymoshenko", "gas conflict" and "negotiations with Russia" may generate hypotheses about the significance of the taxa "government of Ukraine", "international economic relations", "gas transportation agreement" and "beekeeping". The hypothesis that the document is about beekeepers (V. Yushchenko being a well-known beekeeper) is rejected because no other pattern in the document supports it. In other words, "Viktor Yushchenko" is considered only in the context of his government activity, not in the context of his personal hobbies.
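The cutoff step can be illustrated with a short sketch. It assumes that category weights have already been computed by the weighting step and are simply passed in; the numeric weights, the ontology name "news" and the function name `significant_categories` are hypothetical values chosen for illustration.

```python
def significant_categories(weights, ontology_of, cutoff):
    """Keep categories whose weight exceeds the cutoff of their own ontology.

    weights:     category -> weight (assumed to be computed elsewhere)
    ontology_of: category -> name of the ontology it belongs to
    cutoff:      ontology name -> cutoff number for that ontology
    Returns ontology name -> list of (category, weight), heaviest first.
    Categories are ranked only against others from the same ontology.
    """
    result = {}
    for category, weight in weights.items():
        ontology = ontology_of[category]
        if weight > cutoff[ontology]:
            result.setdefault(ontology, []).append((category, weight))
    for ontology in result:
        result[ontology].sort(key=lambda item: item[1], reverse=True)
    return result


# Hypothetical weights for the example text about the gas conflict.
weights = {
    "government of Ukraine": 2.0,
    "international economic relations": 1.5,
    "beekeeping": 0.2,
}
ontology_of = {name: "news" for name in weights}
ranked = significant_categories(weights, ontology_of, {"news": 0.5})
```

With these made-up numbers "beekeeping" falls below the cutoff and is dropped, which mirrors the rejected hypothesis in the example. Keeping the result grouped by ontology reflects the rule that categories from different ontologies are never compared.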

Before the weight coefficients are calculated, all patterns that do not satisfy their assigned conditions are excluded from consideration. For example, if the condition requires that a category be confirmed by at least a certain number of patterns, then patterns for which that number is not reached are excluded.
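One possible reading of the minimum-count condition is sketched below: a (pattern, category) match survives only if the category is supported by at least the required number of distinct patterns. The data layout and the name `filter_by_support` are assumptions made for illustration; the patent does not prescribe them.

```python
def filter_by_support(matches, min_support):
    """Drop matches whose category lacks enough distinct supporting patterns.

    matches:     list of (pattern, category) pairs found in the text
    min_support: category -> required number of distinct patterns
                 (categories not listed default to 1)
    """
    support = {}
    for pattern, category in matches:
        support.setdefault(category, set()).add(pattern)
    return [
        (pattern, category)
        for pattern, category in matches
        if len(support[category]) >= min_support.get(category, 1)
    ]


matches = [
    ("Viktor Yushchenko", "government of Ukraine"),
    ("Yulia Tymoshenko", "government of Ukraine"),
    ("Viktor Yushchenko", "beekeeping"),
]
kept = filter_by_support(
    matches, {"government of Ukraine": 2, "beekeeping": 2}
)
```

Here the single match for "beekeeping" is removed before any weights are computed, while both matches for "government of Ukraine" are kept.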

Patterns with identical spelling can carry different specific attributes (usually the name of one of the parent categories) that indicate the scope of the conditionality coefficient 307. This makes it possible to assign different conditionality values to identically spelled patterns. For example, a change in the conditionality of the pattern "шина (motor transport)", where the Russian word means "tire", does not affect the conditionality of the pattern "шина (traumatology)", where the same word means "splint". Conditional patterns can be used, in particular, to describe properties that do not by themselves identify an object or phenomenon precisely and may belong to other classes of objects or phenomena, but that in certain circumstances allow the right object to be selected. For example, a pattern corresponding to "one berth" or "two berths" does not by itself indicate whether the document is about a hotel, a train or a business-class aircraft, but in combination with patterns that establish the topic "road transport" it allows the conclusion that the text concerns not all vehicles but only truck tractors or mobile homes.
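The idea of a scoped conditional pattern can be sketched as follows. The sketch assumes a simplified rule: a conditional pattern is kept only when at least one unconditional pattern with the same scope was also found in the text. The `Pattern` class, its fields and `active_patterns` are hypothetical names, and the conditionality coefficient itself is reduced here to a boolean flag.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pattern:
    text: str          # surface form of the pattern
    scope: str         # parent category that bounds the condition
    conditional: bool  # True if the pattern needs supporting context


def active_patterns(found):
    """Keep unconditional patterns, and conditional ones whose scope is
    confirmed by at least one unconditional pattern in the same scope."""
    confirmed = {p.scope for p in found if not p.conditional}
    return [p for p in found if not p.conditional or p.scope in confirmed]


found = [
    Pattern("two berths", "road transport", True),
    Pattern("two berths", "hotels", True),
    Pattern("truck tractor", "road transport", False),
]
active = active_patterns(found)
```

Because the two "two berths" entries carry different scopes, they are treated as separate patterns, so confirming the road-transport reading does not activate the hotel reading.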

As a result, a set of vectors is determined for the source text. The vectors point to the categories that define the context of the document, and this set is taken as the position of the document in the knowledge space.

Documents can be regarded as similar if they have similar vectors ("documents about the same thing"), or if the vectors of one document form a subset of the vectors of another (an article about the Po-2 aircraft can be a subset of an extensive document "History of USSR aviation").
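The two similarity cases can be sketched with plain set operations. This assumes each vector is stored as a tuple of category names from a top category down to a confirmed subcategory; the Jaccard overlap used for "similar vectors" is a stand-in measure chosen for illustration, since the description does not specify one, and the sample paths are hypothetical.

```python
def is_contained(doc_a, doc_b):
    """True if every vector of doc_a also occurs among the vectors of doc_b."""
    return set(doc_a) <= set(doc_b)


def overlap(doc_a, doc_b):
    """Jaccard overlap of two vector sets, in the range 0..1."""
    a, b = set(doc_a), set(doc_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Hypothetical vectors: paths from a top category to a confirmed subcategory.
po2_article = [("aviation", "USSR aviation", "Po-2")]
aviation_history = [
    ("aviation", "USSR aviation", "Po-2"),
    ("aviation", "USSR aviation", "Tu-104"),
]
```

In this example `is_contained(po2_article, aviation_history)` holds while the reverse does not, which corresponds to the article being a subset of the broader document. This version ignores vector weights; a weighted comparison would be a natural refinement.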

All of the operations described above are carried out using well-known software operations: comparisons, counting of repetitions, calculation of statistical values, matrix operations and so on. The specific form of the corresponding programs depends on the hardware and on the operating system installed on it, and is not the subject of the applicants' patent claims.

Thus, the above description shows that the proposed invention makes it possible to identify various objects (to locate them in the knowledge space) while taking their similarity into account.

The present invention can be used in various fields of information technology, for example in information retrieval, situation assessment and contextual advertising. It can serve as the basis for search engines that, unlike traditional pattern-based search, combine pattern-based search with "concept search", that is, search by the position of documents in the knowledge space. For example, with conventional search the results for the query "crisis in Russia" contain all documents that include the words of the query, whereas with search in the knowledge space the results are relevant to the meaning of the query. For instance, a "mass civil disobedience action in Dalnegorsk" is related to the "crisis in Russia", but conventional search does not return it, because the words "crisis" and "Russia" are not mentioned directly in the document about the "mass civil disobedience action in Dalnegorsk".

In addition, the algorithm can be used as a means of matching documents in contextual advertising; in behavioral advertising (the knowledge space makes it possible to define "customer behavior" as the history of the customer's movement through documents positioned in this space, or through documents corresponding to categories of the space); and in linking a document to relevant documents, that is, documents with a similar set of knowledge space vectors.

The use of the present invention yields a technical result consisting in higher financial efficiency of contextual advertising systems through more relevant advertisements; higher efficiency of information retrieval systems through more accurate, relevant and pertinent results; and better functioning of search engines through more precise targeting of information and advertising.

The examples of implementation of the present invention given above serve only as illustrations and in no way limit the scope of the patent claims, which is defined by the claims below.

Claims

1. A method of positioning texts in a knowledge space, in which (a) elements corresponding to patterns included in taxa that form taxonomies combined into an ontology are extracted from the input data; (b) significant taxa are determined and weighted on the basis of the conditions assigned to the patterns; (c) a set of weighted vectors positioning the input document in the knowledge space is constructed; characterized in that a set of ontologies is used for positioning, and in that, when the sets of vectors are constructed, only those elements are considered which correspond to patterns included in one taxon or in taxa having common parent taxa.

2. The method according to claim 1, characterized in that every taxon except the top taxon of a taxonomy can simultaneously be a child of several parent taxa.

3. The method according to claim 1, characterized in that it uses the concept of a conditionality coefficient as a condition indicating that a pattern may be used only in a given context.