CN108304488A - A method for automatically learning an ontology using a Topic Model - Google Patents
A method for automatically learning an ontology using a Topic Model
- Publication number
- CN108304488A (application CN201810009239.3A)
- Authority
- CN
- China
- Prior art keywords
- concept
- ontology
- concepts
- concept set
- topic
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/36—Creation of semantic tools, e.g. ontology or thesauri
- G06F16/367—Ontology
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
Abstract
Description
Technical Field
The invention relates to a method for ontology construction that uses a Topic Model to generate the basic concept units, so that an ontology can be learned without any ontology seed, achieving the goal of automatic ontology construction.
Background Art
Ontology construction has been applied in many fields, such as artificial intelligence, information extraction, and machine translation. Manually constructing an ontology, however, is extremely time-consuming and labor-intensive: as concepts and domain information keep expanding and being updated, building a large-scale ontology demands ever more manpower and material resources, and hand-building large ontology seeds such as web directories or WordNet requires even greater effort. There is therefore a strong need for automatic ontology construction that can keep pace with this explosion of domain information and reduce the cost of manually building and maintaining ontologies. Automatically constructing ontologies by means of computer data analysis and data mining has consequently become a very meaningful line of research, attracting many researchers to study it in depth.
Automatic ontology construction has become a research field in its own right, and many methods have been proposed for it. Ontologies already have many practical applications, and automatic or semi-automatic machine-learning techniques can help knowledge engineers construct and extend ontologies, greatly reducing the cost of manual construction and maintenance. Most current ontology learning methods concentrate on extending and updating an existing ontology seed, using concepts or lexical units extracted from a document corpus to update and broaden the seed. Some methods do learn ontologies automatically, but most of them are tied to a particular knowledge domain, such as the SKOS model, and therefore have clear limitations.
Many methods learn an ontology from a text corpus. Lexico-syntactic approaches, for example, use natural language processing techniques and existing lexicon resources to learn is-a relations between concepts via so-called Hearst patterns. Their drawback is that such lexical patterns, which the approach needs to occur frequently, do not in fact occur frequently in text, and they can only handle rather vague lexical semantic relations. P. Cimiano, F. M. Suchanek and others have tried to extract more linguistic patterns from web resources such as Wikipedia and WordNet.
Statistical learning methods based on clustering and classification have also been applied to ontology learning; they typically use similarity and dissimilarity measures to establish relations between concepts, but such methods are difficult to put into practice. Ontology learning methods based on information extraction learn the hierarchical structure of an ontology, yet they can only extract very general concepts such as humans, places, or animals, together with their sub-concepts.
The Topic Model is a probabilistic model that has proven very effective at identifying concepts in scientific publications without any prior knowledge, and it is now widely used in text mining. Using a Topic Model for ontology learning is a new line of research. Elias Zavitsanos et al. proposed a statistical automatic ontology learning method that repeatedly trains a Topic Model to obtain concept sets and then uses conditional independence tests to identify relations among the identified concepts, but the method cannot relate concepts across two levels of the hierarchy. Wang Wei et al. proposed two methods for learning ontology structure for the Semantic Web that combine information theory with a Topic Model; they achieve good recall and precision, but must restrict the number of child concept nodes under the nearest root node.
Summary of the Invention
The object of the present invention is to provide a method for automatically learning an ontology that not only determines the relations between concepts accurately, but also decides the depth of the ontology and the point at which learning stops, without any prior knowledge being supplied.
To achieve the above object, the technical solution of the present invention provides a method for automatically learning an ontology using a Topic Model, characterized by comprising the following steps:
Step 1: use the LDA model to extract concepts from a given document corpus, form a concept set from the extracted concepts, and then subdivide the concept hierarchy to produce the hierarchical structure G of the ontology, G = {T, E}, where T = {t1, t2, ..., tm} is a concept set, defined as the upper-level concept set; T' = {t1', t2', ..., tm'} is a sub-concept set, defined as the concept set one level below T, so that T and T' are two consecutive levels; and E is the set of edges, each eij ∈ E indicating that the i-th concept ti of T is connected by an edge to the j-th concept tj' of T' (a minimal illustrative sketch of this structure follows step 4);
Step 2: use the CosTMI similarity measure to identify the semantic similarity between two consecutive levels of the hierarchy G. In the context of the p-th concept tp of the upper-level set T, the semantic similarity CosTMI(ts', tr'; tp) of the s-th concept ts' and the r-th concept tr' of the next-level set T' is

CosTMI(ts', tr'; tp) = Σ_{i=1}^{n} PMI(ws'i, wpi)·PMI(wr'i, wpi) / sqrt( Σ_{i=1}^{n} PMI(ws'i, wpi)² · Σ_{i=1}^{n} PMI(wr'i, wpi)² )

where tp contains the word sequence {wp1, wp2, ..., wpn}; ts' contains the word sequence {ws'1, ws'2, ..., ws'n}; tr' contains the word sequence {wr'1, wr'2, ..., wr'n}; and PMI(·,·) is the pointwise mutual information of two words. For two words w and w', the pointwise mutual information PMI(w, w') is

PMI(w, w') = log( P(w, w') / ( P(w)·P(w') ) )

where P(w, w') = P(w)·P(w'|w);

P(w) = Σ_{j=1}^{k} P(z=j)·P(w|z=j)

where z is a topic, P(z=j) is the probability of topic j, P(w|z=j) is the conditional probability of word w given topic j, and k is the number of concepts;

P(w'|w) = Σ_{j=1}^{k} P(w'|z=j)·P(z=j|w)

where P(w'|z=j) is the conditional probability of w' given topic j, and P(z=j|w) is the conditional probability of topic j given word w;

if CosTMI(ts', tr'; tp) exceeds a certain threshold thc, relations are established between tp and ts', and between tp and tr';
Step 3: compute the standard similarity measure L(ts', tr'; tp):

L(ts', tr'; tp) = | P(ts'|tp) - P(tr'|tp) |

where P(ts'|tp) is the probability that ts' occurs in the lexical context of tp, and P(tr'|tp) is the probability that tr' occurs in the lexical context of tp;

when the standard similarity measure L(ts', tr'; tp) is used to define relations between ontology concepts, every concept learned by the Topic Model corresponds to one ontology concept, and the conditional probability of each concept ts' or tr' in the context of tp is used to compute the semantic similarity between concepts of the same level; the smaller the value, the higher the semantic similarity;
Step 4: determine the hierarchy depth of the ontology.

Suppose the Topic Model has learned three concept levels Th, Tm and Tl, where Th is the highest, Tm the middle, and Tl the lowest level; the entropies of these three variables are written H(Th), H(Tm) and H(Tl), and H(Tl|Tm) is the conditional entropy from information theory. The information gain Δ(I(Th, Tm, Tl)) of the concept sets of two consecutive levels is then defined as:

Δ(I(Th, Tm, Tl)) = H(Th) - H(Tl|Tm)

When Δ(I(Th, Tm, Tl)) falls below a prescribed threshold ω, learning of concept sets with the LDA model stops.
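By way of illustration of the structure G built in step 1 (the Python names below are our own and not part of the claimed method), the layered hierarchy can be held in a small data structure, with each concept represented as the word set produced by one LDA topic:

```python
# A minimal sketch of G = {T, E}: a list of concept layers plus index-tuple edges.
from dataclasses import dataclass, field

@dataclass
class Hierarchy:
    layers: list = field(default_factory=list)  # layers[l] = concept set of level l
    edges: set = field(default_factory=set)     # (l, i, j): t_i of level l -- t_j' of level l+1

G = Hierarchy()
G.layers.append([{"cell", "protein"}])                         # upper-level set T
G.layers.append([{"blood", "cell"}, {"protein", "binding"}])   # next-level set T'
G.edges.add((0, 0, 0))  # e_00: concept t_1 of T linked to t_1' of T'
G.edges.add((0, 0, 1))  # e_01: concept t_1 of T linked to t_2' of T'
```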
Preferably, in step 1, the following rules are observed when subdividing the concept hierarchy to produce the hierarchical structure G:
Rule 1: if ti ∈ T, tj' ∈ T', and NT < NT', then the sub-concept set T' is larger than the concept set T, where NT and NT' are the level numbers of the concept set T and the sub-concept set T' respectively;
Rule 2: if ti ∈ T, tj' ∈ T', and ti ∩ tj' = ∅, then a superordinate-subordinate relation is very likely to exist between ti and tj', where ∅ denotes the empty set.
The present invention proposes a new method that learns an ontology automatically from a given text corpus. We use the concepts generated by the widely used probabilistic Topic Model as the concept units needed to build the ontology. Given these concepts, a way of measuring their similarity is still required in order to define the links between adjacent upper and lower levels of the ontology structure, that is, to create the edges between concepts that form the hierarchical architecture, ensuring that the learned concepts are connected and that their connections are as compact and reasonable as possible. To this end we define two similarity measures, and we also propose a new criterion for judging the depth of the ontology hierarchy, i.e. a new way of deciding when the learning loop should terminate.
The present invention generates concepts by repeatedly applying the LDA model, i.e. the Topic Model, and defines measures that accurately capture the semantic similarity between concepts, thereby building both the concepts of the ontology and the structural levels between them. The validity and practicality of the proposed construction method are verified with the real GENIA corpus and the corresponding GENIA ontology.
Brief Description of the Drawings
Figure 1 shows how the precision varies with the topic dimension;
Figure 2 shows how the precision varies with increasing ontology depth under the CP measure;
Figure 3 shows how the precision varies with increasing ontology depth under the L1 measure.
Detailed Description of the Embodiments
The present invention is further illustrated below with reference to specific embodiments. It should be understood that these embodiments serve only to illustrate the invention and not to limit its scope. Moreover, after reading the teachings of the present invention, those skilled in the art may make various changes or modifications, and such equivalent forms likewise fall within the scope defined by the claims appended to this application.
The method for automatically learning an ontology using a Topic Model provided by the present invention roughly comprises the following steps:
Step 1: use the LDA model to extract concepts from a given document corpus, applying the model repeatedly to learn the concept sets needed to build the ontology.
Step 2: design the CP similarity measure to identify the similarity between concepts of the hierarchy, i.e. the latent links between concepts of adjacent levels, and formulate the L1 criterion to decide the number of levels in the constructed ontology.
Step 3: verify the validity and effectiveness of this ontology construction method experimentally.
The above steps involve the following technical innovations:
1) Ontology construction process
Figure 1 illustrates the ontology construction process. A hierarchical structure G = {T, E} is built, where T = {t1, t2, ..., tm} is a concept set, called a concept layer, produced by the LDA model and defined as the upper-level concept set; T' = {t1', t2', ..., tm'} is a sub-concept set, defined as the concept set one level below T; and E is the set of edges, each eij ∈ E indicating that the i-th concept ti of T is connected by an edge to the j-th concept tj' of T'.
To link the concepts of two adjacent levels, one must first determine the concept layer each concept node belongs to, i.e. which concepts form the higher-level set and which the lower-level set; establishing the links between the two sets is more complicated still. The boundaries between the concepts produced by the LDA model are not particularly sharp, so a measure is needed to assign the concepts to layers and to establish the relations between layers; some concepts may have several parents, and some may have no children. The more concept layers are generated, the tighter the relations between them become, so the number of layers cannot grow without bound, and the number of levels of the constructed ontology must be set deliberately.
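As a concrete illustration of producing one concept layer, the following minimal sketch uses the scikit-learn LDA implementation; the toy documents and the choice of five words per concept are our own assumptions, not values from the patent:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
import numpy as np

docs = ["transcription factors regulate gene expression in blood cells",
        "protein binding activates the promoter region of the gene",
        "cell differentiation depends on signal transduction pathways"]

vec = CountVectorizer(stop_words="english")
X = vec.fit_transform(docs)
vocab = np.array(vec.get_feature_names_out())

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
# Rows of components_ are unnormalized topic-word pseudo-counts; normalizing
# each row gives P(w | z=j). The top words of topic j form concept t_j.
p_w_given_z = lda.components_ / lda.components_.sum(axis=1, keepdims=True)
layer_T = [set(vocab[np.argsort(row)[::-1][:5]]) for row in p_w_given_z]
print(layer_T)  # one concept layer, e.g. the upper-level set T
```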
2) Rules
Before presenting the concrete automatic ontology learning method, two basic rules are first defined. In general, the LDA model is applied repeatedly to generate the concept sets from which the hierarchy is built. The present invention defines rules that constrain the concepts the model produces, to be used when constructing the hierarchical ontology.
Intuitively, the higher a concept sits in the hierarchy, the more abstract it is, and the lower, the more concrete; likewise, the higher the level, the fewer the concepts, and the lower, the more. Based on this common sense, the following rule is defined:
Rule 1: if ti ∈ T, tj' ∈ T', and NT < NT', then the sub-concept set T' is larger than the concept set T, where NT and NT' are the level numbers of the concept set T and the sub-concept set T' respectively.
When the LDA model is applied repeatedly to learn and generate concept sets, it must first be ensured that NT < NT'. This rule is therefore essential to the construction method.
Every concept of every layer learned by LDA from the document corpus consists of words that occur frequently in the literature, and a concept set that is frequent at a high level is very likely to be equally frequent in the low-level concept set, so during ontology construction these identical words could end up linked to each other, which is unreasonable. The following rule is therefore defined:
Rule 2: if ti ∈ T, tj' ∈ T', and ti ∩ tj' = ∅, then a superordinate-subordinate relation is very likely to exist between ti and tj', where ∅ denotes the empty set.
This rule helps us define the similarity measures between concepts that are introduced below.
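A sketch of the two rules as simple predicates over concept layers (each concept being a word set, as in the LDA sketch above); the function names are ours, and reading rule 2 as a disjointness test is our interpretation of the definition above:

```python
def rule1_ok(layer_T, layer_T_prime):
    # Rule 1: the lower layer T' must contain more concepts than the upper
    # layer T, since lower levels hold more, and more specific, concepts.
    return len(layer_T) < len(layer_T_prime)

def rule2_candidate(t_i, t_j_prime):
    # Rule 2: only concepts whose word sets are disjoint (t_i ∩ t_j' = ∅)
    # are kept as candidates for a superordinate-subordinate link, which
    # stops the same high-frequency words from being linked across levels.
    return len(t_i & t_j_prime) == 0
```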
3) Similarity measures
The present invention builds the ontology hierarchy by means of similarity measures; that is, links between concepts are established through their similarity. Two concepts from two concept layers are linked only if their similarity reaches a certain value; otherwise they are considered unrelated. To compute the semantic similarity of two concepts, the concept matrix generated by the LDA model while producing the concept sets is used; each matrix entry is the likelihood of the concept appearing in the ontology.
Similarity between concepts is usually measured with pointwise mutual information (PMI). The present invention defines a new measure of semantic similarity between words w and w', defining PMI via the expectations of the two concepts; each concept is composed of a series of words, which is a particular property of the LDA model. The pointwise mutual information of two words w and w' is PMI(w, w'):

PMI(w, w') = log( P(w, w') / ( P(w)·P(w') ) )

where P(w, w') = P(w)·P(w'|w);

P(w) = Σ_{j=1}^{k} P(z=j)·P(w|z=j)

where z is a topic, P(z=j) is the probability of topic j, P(w|z=j) is the probability of word w given topic j, and k is the number of concepts;

P(w'|w) = Σ_{j=1}^{k} P(w'|z=j)·P(z=j|w)

where P(w'|z=j) is the probability of w' given topic j, and P(z=j|w) is the conditional probability of topic j given word w.
The formula given for the pointwise mutual information of two words prepares for the subsequent organization of the hierarchical structure of concepts in the ontology, and it is also used in defining the semantic similarity between two concepts.
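A minimal sketch of PMI(w, w') computed from the LDA distributions via the marginalizations above; the argument names are ours, and supplying a uniform topic prior for P(z=j) is a common assumption rather than something the patent specifies:

```python
import numpy as np

def pmi(w, w_prime, p_w_given_z, p_z, vocab_index):
    """p_w_given_z: k x V matrix of P(w|z=j); p_z: length-k prior P(z=j)."""
    i, j = vocab_index[w], vocab_index[w_prime]
    p_w = float(p_z @ p_w_given_z[:, i])                   # P(w) = Σ_j P(z=j)P(w|z=j)
    p_wp = float(p_z @ p_w_given_z[:, j])                  # P(w')
    p_z_given_w = p_w_given_z[:, i] * p_z / p_w            # Bayes: P(z=j|w)
    p_wp_given_w = float(p_w_given_z[:, j] @ p_z_given_w)  # P(w'|w) = Σ_j P(w'|z=j)P(z=j|w)
    # PMI = log P(w,w')/(P(w)P(w')), with P(w,w') = P(w)P(w'|w)
    return float(np.log(p_wp_given_w / p_wp))
```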
Each concept produced by LDA corresponds to one concept in the ontology structure. The semantic similarity measure quantifies the semantic similarity of two concepts within the context of a third. In the context of the p-th concept tp of the upper-level set T, the semantic similarity CosTMI(ts', tr'; tp) of the s-th concept ts' and the r-th concept tr' of the next-level set T' is

CosTMI(ts', tr'; tp) = Σ_{i=1}^{n} PMI(ws'i, wpi)·PMI(wr'i, wpi) / sqrt( Σ_{i=1}^{n} PMI(ws'i, wpi)² · Σ_{i=1}^{n} PMI(wr'i, wpi)² )

where tp contains the word sequence {wp1, wp2, ..., wpn}; ts' contains the word sequence {ws'1, ws'2, ..., ws'n}; and tr' contains the word sequence {wr'1, wr'2, ..., wr'n}.
A threshold thct is set in advance; if CosTMI(ts', tr'; tp) exceeds thct, relations are established between tp and ts', and between tp and tr'. Through this definition and the computation of semantic similarity, every concept for which a relation can be established is a concept of the ontology under construction. The threshold thct is a value determined experimentally; the larger the value, the greater the semantic similarity between the two concepts, and conversely the smaller the value, the lower the similarity.
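A sketch of CosTMI under our reading of the reconstructed formula above, i.e. a cosine over position-paired PMI values taken against the parent concept's words, followed by the thresholded linking; all function and argument names are illustrative:

```python
import numpy as np

def costmi(ts_words, tr_words, tp_words, pmi_fn):
    # ts_words, tr_words, tp_words: the ordered top-n word lists of ts', tr', tp.
    v_s = np.array([pmi_fn(w, wp) for w, wp in zip(ts_words, tp_words)])
    v_r = np.array([pmi_fn(w, wp) for w, wp in zip(tr_words, tp_words)])
    return float(v_s @ v_r / (np.linalg.norm(v_s) * np.linalg.norm(v_r)))

def link_children(tp, ts, tr, th_ct, pmi_fn, edges):
    # If the similarity in tp's context exceeds th_ct, link tp to both children.
    if costmi(ts, tr, tp, pmi_fn) > th_ct:
        edges.append((tuple(tp), tuple(ts)))
        edges.append((tuple(tp), tuple(tr)))
```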
The L1 standard similarity measure:

L(ts', tr'; tp) = | P(ts'|tp) - P(tr'|tp) |

where P(ts'|tp) is the probability that ts' occurs in the lexical context of tp, and P(tr'|tp) is the probability that tr' occurs in the lexical context of tp.
When the standard similarity measure L(ts', tr'; tp) is used to define relations between ontology concepts, every concept learned by the Topic Model corresponds to one ontology concept, and the conditional probability of each concept ts' or tr' in the context of tp is used to compute the semantic similarity between concepts of the same level; the smaller the value, the higher the semantic similarity.
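A sketch of the L1 measure under the absolute-difference reading reconstructed above (that reading is ours, since the published formula did not survive extraction); the argument names are illustrative:

```python
def l1_measure(p_ts_given_tp, p_tr_given_tp):
    # L(ts', tr'; tp) = |P(ts'|tp) - P(tr'|tp)|; smaller means more similar.
    return abs(p_ts_given_tp - p_tr_given_tp)
```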
Determining the hierarchy of the ontology:
By rule 1, low-level concepts are more concrete than high-level ones, and in natural language processing we can identify the most abstract concepts, namely those that cannot be subdivided any further. Concepts therefore cannot be subdivided indefinitely during ontology learning, and we propose a new way to determine the appropriate size of the ontology, i.e. the number of ontology levels for a given domain knowledge base.
Suppose the Topic Model has learned three concept levels Th, Tm and Tl, where Th is the highest, Tm the middle, and Tl the lowest level; the entropies of these three variables are written H(Th), H(Tm) and H(Tl), and H(Tl|Tm) is the conditional entropy from information theory. The information gain Δ(I(Th, Tm, Tl)) of the concept sets of two consecutive levels is then defined as:

Δ(I(Th, Tm, Tl)) = H(Th) - H(Tl|Tm)

When Δ(I(Th, Tm, Tl)) falls below the prescribed threshold ω, learning of concept sets with the LDA model stops. The inequality means that once Δ(I(Th, Tm, Tl)) drops below the threshold, the concept distributions of Tm and Tl are already semantically very similar and LDA concept learning has reached the best that can be expected of ontology concept-level learning; the number of concept levels at that point is the number of concept levels of the ontology. In the actual experiments, we set ω close to 0.
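A sketch of the stopping test; representing the highest layer by a probability vector over its topics, and the pair (Tm, Tl) by a joint distribution, is our assumption for illustration only:

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def delta_info_gain(p_th, p_joint_ml):
    # ΔI = H(Th) - H(Tl|Tm), with H(Tl|Tm) = H(Tm,Tl) - H(Tm).
    h_ml = entropy(p_joint_ml)
    h_m = entropy(np.asarray(p_joint_ml).sum(axis=1))  # marginal over Tm
    return entropy(p_th) - (h_ml - h_m)

# Stop refining the hierarchy once delta_info_gain(...) < omega (omega ≈ 0).
```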
The proposed ontology construction method is validated experimentally with the GENIA ontology corresponding to the GENIA corpus. GENIA is a biomedical corpus of 1,999 MEDLINE abstracts, collected using the MeSH terms human and blood cells. The GENIA ontology contains 45 concepts and 42 relations. In our experiments, the GENIA corpus is fed into the LDA model to compute the concepts needed to build the ontology. We compared the proposed method with the method of Zavitsanos et al.; the algorithms were run on a Pentium 4 PC with 2 GB of memory, and we compared CosTMI with the CI method of Zavitsanos et al. The parameter settings of the three measures CI, CP and L1 are given in Table 1.
Table 1. Parameter settings of the similarity measures
Our experimental results are described in detail below. The GENIA ontology hierarchy comprises two different ontologies. We use recall, precision, and the F1 measure to evaluate the execution efficiency of the proposed method and the quality of the resulting ontology structure. The recall Rec is computed as follows:
Rec = nrc / Nr

where nrc is the number of correct concepts the algorithm learns, and Nr is the total number of concepts computed by the model.
The precision Prec is defined as follows:

Prec = npc / Np

where npc is the number of correctly learned concepts, and Np is the total number of concepts the algorithm learns.
The F1 measure is computed as follows:

F1 = 2·Prec·Rec / (Prec + Rec)
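The three metrics, exactly as defined above, in a minimal sketch (argument names ours):

```python
def recall(n_rc, n_r):
    return n_rc / n_r    # correct learned concepts / total model concepts

def precision(n_pc, n_p):
    return n_pc / n_p    # correct learned concepts / all learned concepts

def f1(prec, rec):
    return 2 * prec * rec / (prec + rec)
```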
The results of the two methods on concept learning are compared in Table 2:
Table 2. Results for the concepts learned by the algorithms under the similarity measures
Table 2 shows that our proposed method AOL performs very effectively and can be applied to ontology construction in other knowledge domains; both its precision and its recall are higher than those of the CI method.
Table 3. Results for constructing relations between concepts under different similarity measures
As Tables 2 and 3 show, the comparison results are very satisfactory, and the proposed algorithm proves very effective at constructing domain ontologies. The comparison also shows that our algorithm is indeed slightly weaker in the accuracy of concept identification, probably because it still falls short when identifying very specific concepts.
Figure 1 shows the number of words contained in each concept. During our experiments we found that this number affects the accuracy of ontology construction. The results show that if each concept contains fewer than 10 words, accuracy suffers severely; conversely, the more words a concept contains, the more accurate the constructed ontology. However, more is not always better: our tests indicate that about 16 words per concept gives good results, because if a concept contains too many words, low-frequency words from the corpus creep in, contribute little to the concept's abstraction, and degrade the actual quality of the constructed ontology. As the results in Figure 1 show, the CP and L1 measures proposed in this patent perform very well, which also demonstrates the effectiveness of the lexical semantic similarity measure defined here for measuring conceptual semantic similarity during ontology construction.
We also examined how accuracy changes with the depth of the ontology hierarchy during construction. The results in Figures 2 and 3 show how the F1 accuracy varies with ontology depth under the CP and L1 measures. The experimental parameters were thcp = 0.93 and thfl = 1.24, and the termination criterion was the DMI value defined above, set to ω = 0.01. Figure 2 shows that the F1 measure peaks when the depth of the ontology reaches 7; in Figure 3, it peaks at a depth of 8.
Finally, some factors affecting ontology construction must be mentioned. Automatic ontology construction is an open research field, and there is as yet no fixed standard for evaluating the quality and effect of all ontologies. Moreover, the GENIA ontology used in the experiments of this patent was itself built by domain experts, i.e. by subjective or relatively basic human methods, so measuring a subjective method by objective means is all the more difficult. In addition, evaluating automatically constructed ontologies is more complex and difficult than evaluating the extension and updating of an ontology from a seed ontology.
Claims (2)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810009239.3A CN108304488A (en) | 2018-01-04 | 2018-01-04 | A method for automatically learning an ontology using a Topic Model |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810009239.3A CN108304488A (en) | 2018-01-04 | 2018-01-04 | A method for automatically learning an ontology using a Topic Model |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108304488A true CN108304488A (en) | 2018-07-20 |
Family
ID=62868677
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810009239.3A Pending CN108304488A (en) | 2018-01-04 | 2018-01-04 | A method for automatically learning an ontology using a Topic Model |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108304488A (en) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106547739A (en) * | 2016-11-03 | 2017-03-29 | 同济大学 | A kind of text semantic similarity analysis method |
CN107133283A (en) * | 2017-04-17 | 2017-09-05 | 北京科技大学 | A kind of Legal ontology knowledge base method for auto constructing |
- 2018-01-04: application CN201810009239.3A filed in CN; patent CN108304488A/en, status Pending
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106547739A (en) * | 2016-11-03 | 2017-03-29 | 同济大学 | A kind of text semantic similarity analysis method |
CN107133283A (en) * | 2017-04-17 | 2017-09-05 | 北京科技大学 | A kind of Legal ontology knowledge base method for auto constructing |
Non-Patent Citations (1)
Title |
---|
ZHIJIE LIN, "Terminological ontology learning based on LDA", The 2017 4th International Conference on Systems and Informatics (ICSAI 2017)
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Yoon et al. | Technology opportunity discovery (TOD) from existing technologies and products: A function-based TOD framework | |
CN107330032B (en) | An Implicit Textual Relationship Analysis Method Based on Recurrent Neural Networks | |
CN103778215B (en) | A kind of Stock Market Forecasting method merged based on sentiment analysis and HMM | |
WO2021128158A1 (en) | Method for disambiguating between authors with same name on basis of network representation and semantic representation | |
Guo et al. | A CBR system for injection mould design based on ontology: a case study | |
CN108052593A (en) | A kind of subject key words extracting method based on descriptor vector sum network structure | |
WO2018218708A1 (en) | Deep-learning-based public opinion hotspot category classification method | |
CN110347836B (en) | A Chinese-Vietnamese Bilingual News Sentiment Classification Method Incorporating the Features of Opinion Sentences | |
CN105740349A (en) | Sentiment classification method capable of combining Doc2vce with convolutional neural network | |
CN109189926A (en) | A kind of construction method of technical paper corpus | |
Guo et al. | Research on high creative application of case-based reasoning system on engineering design | |
Lu et al. | Research on classification and similarity of patent citation based on deep learning | |
CN107545033B (en) | A Computational Method for Knowledge Base Entity Classification Based on Representation Learning | |
Gu et al. | Enhancing text classification by graph neural networks with multi-granular topic-aware graph | |
CN107480141A (en) | It is a kind of that allocating method is aided in based on the software defect of text and developer's liveness | |
CN113988083B (en) | Factual information coding and evaluating method for generating shipping news abstract | |
CN113239143A (en) | Power transmission and transformation equipment fault processing method and system fusing power grid fault case base | |
CN108470025A (en) | Partial-Topic probability generates regularization own coding text and is embedded in representation method | |
CN104537418A (en) | From-bottom-to-top high-dimension-data causal network learning method | |
Malik et al. | Software requirement specific entity extraction using transformer models. | |
CN118643168A (en) | Construction plan compliance review system and method based on knowledge graph and big model | |
Sadr et al. | Exploring the efficiency of topic-based models in computing semantic relatedness of geographic terms | |
CN119691193A (en) | A method for constructing industrial knowledge graph based on large language model and thought chain | |
CN107895012B (en) | An Ontology Construction Method Based on Topic Model | |
CN103559510B (en) | Method for recognizing social group behaviors through related topic model |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
RJ01 | Rejection of invention patent application after publication | |
Application publication date: 20180720 |