CN107895012B - An Ontology Construction Method Based on Topic Model

Info

Publication number: CN107895012B
Application number: CN201711112981.9A
Authority: CN
Inventors: 林志杰
Original assignee: Shanghai Dianji University
Current assignee: Shanghai Dianji University
Priority date: 2017-11-10
Filing date: 2017-11-10
Publication date: 2021-10-08
Anticipated expiration: 2037-11-10
Also published as: CN107895012A

Abstract

The invention provides an ontology construction method based on a Topic Model. It proposes the AOL method, which supports automatic construction of domain ontologies, together with an information-based measure of semantic similarity between concepts, used to calculate the semantic similarity between the concepts generated by the LDA model. The AOL method places no limit on the number of child nodes of the root node and does not need a seed ontology to bootstrap ontology learning. Experimental results show that the proposed method for automatic ontology construction using a Topic Model is effective.

Description

Ontology construction method based on Topic Model

Technical Field

The invention relates to an ontology construction method that uses a Topic Model to generate the basic concepts and can learn the ontology without a seed ontology.

Background

In recent years, ontologies have been applied in fields such as artificial intelligence, information extraction, and machine translation. Constructing an ontology manually, however, is time-consuming and labor-intensive, so automatic ontology construction by means of data analysis and data mining has become an important research topic that has attracted extensive study. Most current ontology learning methods focus on expanding and updating an existing seed ontology by extracting concepts or vocabulary units from a document corpus. Some methods learn ontologies automatically, but most of them depend on ontologies from specific knowledge domains, such as SKOS models, and therefore have certain limitations.

The Topic Model is a probabilistic model that has proven very effective in practice at identifying concepts from scientific publications when no prior knowledge is available. Topic Models are now widely applied in text mining.

Elias Zavitsanos et al. proposed an automatic ontology learning method based on a statistical approach: a Topic Model is applied repeatedly to train concept sets, and conditional independence is then used to judge the relations between the identified concepts. However, the method cannot establish relations between concepts across two levels of the hierarchy. Wang Wei et al. proposed two methods, both of which learn the ontology structure for the Semantic Web. By combining information theory with a Topic Model they achieve good recall and precision, but they need to limit the number of sub-concept nodes closest to the root node.

Disclosure of Invention

The invention aims to provide an information-based measure of semantic similarity between concepts, used to calculate the semantic similarity between the concepts generated by an LDA model.

To achieve the above object, the technical solution of the invention is an ontology construction method based on a Topic Model, characterized by comprising the following steps:

Step one: extract concepts from a given document corpus using an LDA model, form a concept set from the extracted concepts, and then subdivide the concepts into levels to generate the hierarchy G of the ontology, wherein T = {t1, t2, …, tm} is a concept set, defined as the upper-level concept set; T' = {t1', t2', …, tm'} is a sub-concept set, defined as the concept set one level below the upper-level concept set T; E is a set of edges, and each eij ∈ E indicates that the i-th concept ti in the concept set T is connected by an edge to the j-th concept tj' in the sub-concept set T';

Step two: use the CosTMI similarity measure to identify the similarity between concepts in the hierarchy G, that is, the potential connections between concepts in adjacent levels, wherein, in the context of the p-th concept tp in the upper-level concept set T, the semantic similarity of the s-th concept ts' and the r-th concept tr' in the lower-level concept set T' is CosTMI(ts', tr'; tp) [CosTMI formula not reproduced in this text].

In the formula, tp contains the word sequence {wp1, wp2, …, wpn}; ts' contains the word sequence {ws'1, ws'2, …, ws'n}; tr' contains the word sequence {wr'1, wr'2, …, wr'n}; PMI() is the pointwise mutual information of two words, and for two words w and w' it is:

$$\mathrm{PMI}(w, w') = \log \frac{P(w, w')}{P(w)\,P(w')}$$

wherein P(w, w') = P(w) P(w' | w);

$$P(w) = \sum_{j=1}^{k} P(w \mid z = j)\,P(z = j)$$

where z is the topic, P(z = j) is the probability of topic j, P(w | z = j) is the conditional probability of word w given topic j, and k is the number of concepts;

$$P(w' \mid w) = \sum_{j=1}^{k} P(w' \mid z = j)\,P(z = j \mid w)$$

in the formula, P(w' | z = j) is the conditional probability of word w' given topic j, and P(z = j | w) is the conditional probability of topic j given word w.
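As a worked step, substituting the factorization P(w, w') = P(w) P(w' | w) into the standard definition of pointwise mutual information cancels P(w), so PMI depends only on P(w' | w) and P(w'). The text does not spell out how the posterior P(z = j | w) is obtained; the second line below assumes it follows from Bayes' rule applied to the LDA quantities already named, which is an assumption rather than a statement of the patented formula.

```latex
\begin{aligned}
\mathrm{PMI}(w, w')
  &= \log\frac{P(w, w')}{P(w)\,P(w')}
   = \log\frac{P(w)\,P(w' \mid w)}{P(w)\,P(w')}
   = \log\frac{P(w' \mid w)}{P(w')}, \\
P(z = j \mid w)
  &= \frac{P(w \mid z = j)\,P(z = j)}{\sum_{i=1}^{k} P(w \mid z = i)\,P(z = i)}
  \qquad \text{(assumed: Bayes' rule)}.
\end{aligned}
```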

Preferably, in step one, the following rules are followed when subdividing the concepts into levels to generate the hierarchy G of the ontology:

Rule 1: if ti ∈ T, tj' ∈ T', and NT < NT', then the concept set T is at a higher level than the sub-concept set T', where NT and NT' are the hierarchy levels of the concept set T and the sub-concept set T', respectively;

Rule 2: if ti ∈ T, tj' ∈ T', and [condition not reproduced in this text], there is a high likelihood of an upper-lower level relationship between ti and tj', where ∅ is the empty set.

The invention provides the AOL method, which supports automatic construction of domain ontologies, and an information-based measure of semantic similarity between concepts, used to calculate the semantic similarity between the concepts generated by an LDA model. The method does not limit the number of child nodes of the root node and does not need a seed ontology to bootstrap learning. Experimental results show that automatic ontology construction using the Topic Model is effective.

The invention constructs the concepts of the ontology and the hierarchy among them by repeatedly applying an LDA model, i.e. a Topic Model, to generate concepts and by defining a measure that accurately captures the semantic similarity between concepts.

Drawings

FIG. 1 shows the process of building the ontology structure;

FIG. 2 shows the accuracy of concepts versus the number of words per concept;

FIG. 3 shows the number of ontology levels versus the F1 measure.

Detailed Description

The invention will be further illustrated with reference to the following specific examples. It should be understood that these examples are for illustrative purposes only and are not intended to limit the scope of the present invention. Further, it should be understood that various changes or modifications of the present invention may be made by those skilled in the art after reading the teaching of the present invention, and such equivalents may fall within the scope of the present invention as defined in the appended claims.

The invention provides an ontology construction method based on a Topic Model, comprising the following steps:

Step one: extract concepts from a given document corpus using an LDA model, then subdivide the concepts into levels to generate the hierarchy of the ontology;

Step two: design the CosTMI similarity measure and use it to identify the similarity between concepts in the hierarchy, that is, the potential relations between concepts in adjacent levels.

These steps involve the following technical innovations:

One) Ontology construction process

FIG. 1 illustrates the ontology construction process. A hierarchy G = {T, E} is constructed, where T = {t1, t2, …, tm} is a concept set produced by the LDA model, called a concept layer, and defined as the upper-level concept set. T' = {t1', t2', …, tm'} is a sub-concept set, defined as the concept set one level below the upper-level concept set T. E is a set of edges; each eij ∈ E indicates that the i-th concept ti in the concept set T is connected by an edge to the j-th concept tj' in the sub-concept set T'.
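The hierarchy G described here can be sketched as a small data structure holding two adjacent levels and the edges between them. This is a minimal illustration only: the names `Concept`, `Hierarchy`, `add_edge`, `children`, and `parents` are hypothetical and do not come from the patent.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Concept:
    """A concept produced by the topic model: an ordered tuple of words."""
    words: tuple


@dataclass
class Hierarchy:
    """Two adjacent levels of G: upper set T, lower set T', and edge set E."""
    upper: list                               # T  = [t_1, ..., t_m]
    lower: list                               # T' = [t_1', ..., t_m']
    edges: set = field(default_factory=set)   # E, stored as (i, j) index pairs

    def add_edge(self, i: int, j: int) -> None:
        """Record e_ij: upper concept t_i is connected to lower concept t_j'."""
        if not (0 <= i < len(self.upper) and 0 <= j < len(self.lower)):
            raise IndexError("edge endpoints must index existing concepts")
        self.edges.add((i, j))

    def children(self, i: int) -> list:
        """Lower-level concepts connected to upper concept t_i."""
        return [self.lower[j] for (a, j) in sorted(self.edges) if a == i]

    def parents(self, j: int) -> list:
        """Upper-level concepts connected to lower concept t_j'."""
        return [self.upper[i] for (i, b) in sorted(self.edges) if b == j]
```

Storing edges as index pairs lets a lower-level concept have several parents, or an upper-level concept have no children, without any special handling.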

To connect the concepts of the upper and lower levels, it is first necessary to determine the level of each concept node, that is, which concepts belong to the upper-level concept set and which to the lower-level one; establishing the connections between the two concept sets is more complicated. The boundaries between concepts produced by the LDA model are not particularly clear, so the concepts must be layered using some measure and the relationships between layers established. Some concepts may have several parents and some may have no children. The more concept layers are generated, the closer the relationship between adjacent layers becomes, so the number of layers cannot be increased without limit, and the number of levels of the ontology has to be set manually.

Two) Related rules

Before describing the automatic ontology learning method in detail, two basic rules are defined. Common practice is to reuse the LDA model repeatedly to generate the concept sets needed to build the hierarchy. The present invention defines rules that constrain which of the generated concepts are used to build the ontology hierarchy.

Intuitively, concepts at higher levels are more abstract and concepts at lower levels more concrete; the higher the level, the fewer the concepts, and vice versa. Based on this common sense, the following rules are defined:

Rule 1: if ti ∈ T, tj' ∈ T', and NT < NT', then the concept set T is at a higher level than the sub-concept set T', where NT and NT' are the hierarchy levels of the concept set T and the sub-concept set T', respectively.

When repeated learning with the LDA model yields the concept sets, NT < NT' must be determined first. This rule is therefore very important to the ontology construction method.

Every concept in each layer that LDA learns from the document corpus is made up of words that occur with high frequency in the documents, and words that occur with high frequency in a higher-level concept set are very likely to occur with equally high frequency in a lower-level concept set. As a result, identical words could end up connected to each other during ontology construction, which is unreasonable. The following rule is therefore defined:

Rule 2: if ti ∈ T, tj' ∈ T', and [condition not reproduced in this text], there is a high likelihood of an upper-lower level relationship between ti and tj', where ∅ is the empty set.

This rule helps define the similarity measure between concepts described below.

Three) Similarity measurement

The invention uses a similarity measure to build the hierarchy of the ontology; that is, relations between concepts are established according to the similarity between them. A connection is established only when the similarity between two concepts in the concept sets of two levels reaches a certain value; otherwise the two concepts are considered unconnected. To calculate the semantic similarity between two concepts, the concept matrix that the LDA model produces while generating the concept sets is used; each entry of the matrix is the probability of the corresponding concept occurring in the ontology.

Similarity between concepts is commonly measured with pointwise mutual information (PMI). The invention defines a new measure of semantic similarity between words w and w', in which PMI is defined through the expectation over two concepts, each concept being composed of a series of words, which is a particular property of the LDA model. The pointwise mutual information of two words w and w' is PMI(w, w'):

$$\mathrm{PMI}(w, w') = \log \frac{P(w, w')}{P(w)\,P(w')}$$

wherein P(w, w') = P(w) P(w' | w);

$$P(w) = \sum_{j=1}^{k} P(w \mid z = j)\,P(z = j)$$

where z is the topic, P(z = j) is the probability of topic j, P(w | z = j) is the conditional probability of word w given topic j, and k is the number of concepts;

$$P(w' \mid w) = \sum_{j=1}^{k} P(w' \mid z = j)\,P(z = j \mid w)$$

in the formula, P(w' | z = j) is the conditional probability of word w' given topic j, and P(z = j | w) is the conditional probability of topic j given word w.

The invention thus provides a formula for the pointwise mutual information of two words, in preparation for the subsequent organization and construction of the concept hierarchy of the ontology; the formula can also be used to define the semantic similarity between other concepts.
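The word-level PMI can be sketched in a few lines of Python from the two quantities an LDA model provides, the topic-word probabilities P(w | z = j) and the topic probabilities P(z = j). This is an illustrative sketch, not the patented implementation: the function names, the `phi` and `topic_prior` structures, and the toy numbers are all hypothetical, PMI is taken in its standard form log[P(w, w') / (P(w) P(w'))], and the posterior P(z = j | w) is assumed to come from Bayes' rule.

```python
import math


def word_prob(w, phi, topic_prior):
    """P(w) = sum_j P(w | z=j) * P(z=j)."""
    return sum(phi[j][w] * topic_prior[j] for j in range(len(topic_prior)))


def topic_posterior(w, phi, topic_prior):
    """P(z=j | w) for every topic j, via Bayes' rule (an assumption)."""
    p_w = word_prob(w, phi, topic_prior)
    return [phi[j][w] * topic_prior[j] / p_w for j in range(len(topic_prior))]


def cond_word_prob(w2, w, phi, topic_prior):
    """P(w' | w) = sum_j P(w' | z=j) * P(z=j | w)."""
    post = topic_posterior(w, phi, topic_prior)
    return sum(phi[j][w2] * post[j] for j in range(len(topic_prior)))


def pmi(w, w2, phi, topic_prior):
    """PMI(w, w') = log(P(w, w') / (P(w) P(w'))), with P(w, w') = P(w) P(w' | w).

    Both words must have non-zero probability under the model.
    """
    p_w = word_prob(w, phi, topic_prior)
    p_w2 = word_prob(w2, phi, topic_prior)
    joint = p_w * cond_word_prob(w2, w, phi, topic_prior)
    return math.log(joint / (p_w * p_w2))


# Toy model: 2 topics over a 3-word vocabulary (made-up numbers).
phi = [
    {"cell": 0.6, "protein": 0.3, "gene": 0.1},   # P(w | z=0)
    {"cell": 0.1, "protein": 0.3, "gene": 0.6},   # P(w | z=1)
]
topic_prior = [0.5, 0.5]                          # P(z=j)

score = pmi("cell", "gene", phi, topic_prior)     # negative: the words favor different topics
```

In this toy model "protein" is equally likely under both topics, so it carries no information about any other word and its PMI with "cell" is zero up to rounding.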

Each concept generated by LDA corresponds to a concept within the ontology structure. The semantic similarity measure quantifies the semantic similarity between two concepts in the context of a third concept. In the context of the p-th concept tp in the upper-level concept set T, the semantic similarity of two concepts ts' and tr' in the lower-level concept set T' is CosTMI(ts', tr'; tp) [CosTMI formula not reproduced in this text].

In the formula, tp contains the word sequence {wp1, wp2, …, wpn}; ts' contains the word sequence {ws'1, ws'2, …, ws'n}; tr' contains the word sequence {wr'1, wr'2, …, wr'n}.

A threshold thct is preset; if the value of CosTMI(ts', tr'; tp) is larger than thct, a relation is established between tp and ts', tr'. The concepts for which relations can be established through this definition and calculation of semantic similarity are the concepts of the constructed ontology. The threshold thct is determined by experiment; a larger CosTMI value indicates greater semantic similarity between two concepts, and a smaller value indicates less.
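The thresholding step can be sketched as follows. Because the CosTMI formula is not available in this text, the sketch takes the similarity as a caller-supplied function `sim(ts, tr, tp)`; that signature, the name `link_concepts`, and the example scores are hypothetical. It also reads the rule as linking the upper concept to both lower concepts of a pair whose score exceeds the threshold, which is one interpretation of the passage.

```python
from itertools import combinations


def link_concepts(upper, lower, sim, threshold):
    """Return the set of edges (p, s) between upper concept p and lower concept s.

    For every upper concept t_p and every unordered pair (t_s', t_r') of
    lower concepts, both lower concepts are linked to t_p when
    sim(t_s', t_r', t_p) is strictly greater than the threshold.
    """
    edges = set()
    for p, tp in enumerate(upper):
        for s, r in combinations(range(len(lower)), 2):
            if sim(lower[s], lower[r], tp) > threshold:
                edges.add((p, s))
                edges.add((p, r))
    return edges


# Made-up scores standing in for CosTMI values (NOT computed by CosTMI).
upper = ["t1"]
lower = ["t1'", "t2'", "t3'"]
scores = {
    ("t1'", "t2'", "t1"): 0.95,
    ("t1'", "t3'", "t1"): 0.40,
    ("t2'", "t3'", "t1"): 0.50,
}
edges = link_concepts(upper, lower, lambda ts, tr, tp: scores[(ts, tr, tp)], threshold=0.93)
```

With the threshold at 0.93, only the pair (t1', t2') passes, so the upper concept is linked to the first two lower concepts and the third stays unconnected.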

The validity and practicality of the proposed ontology construction method are verified below using the real GENIA corpus and the GENIA ontology.

The method was verified experimentally against the GENIA ontology that accompanies the GENIA corpus. The GENIA corpus is a biomedical corpus containing 1,999 medical abstracts collected using the MeSH terms "human" and "blood cells". The GENIA ontology contains 45 concepts and 42 relationships. In the experiment, the GENIA corpus is fed into the LDA model to compute the concepts needed for the ontology to be constructed. The AOL method is compared with the method proposed by Zavitsanos et al.; both algorithms were run on a PC with a Pentium 4 processor and 2 GB of memory. For CosTMI and the CI method of Zavitsanos et al., the thresholds were set to 0.93 and 3 × 10^-6, respectively.

The effectiveness of the proposed algorithm and the quality of the resulting ontology structure are evaluated by recall, precision, and the F1 measure. The results of the two methods are compared in Table 1.

Table 1. Results for concepts (C) and relationships (S) obtained with the similarity measures

Table 1 shows that the proposed AOL method performs well, with both precision and recall higher than those of the CI method, and that it can be applied to ontology construction in other knowledge domains.
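The three evaluation measures can be computed by comparing the learned relationships with those of a reference ontology. The sketch below is a generic illustration, reading "accuracy" in the text as precision (an assumption); the function name and the example edges are hypothetical and are not taken from the GENIA experiment.

```python
def precision_recall_f1(predicted, gold):
    """Compare predicted items (e.g. edges) with a gold-standard set.

    precision = |predicted ∩ gold| / |predicted|
    recall    = |predicted ∩ gold| / |gold|
    F1        = harmonic mean of precision and recall
    """
    predicted, gold = set(predicted), set(gold)
    hits = len(predicted & gold)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Made-up example: 3 learned edges, 4 reference edges, 2 in common.
learned = {("a", "b"), ("a", "c"), ("b", "d")}
reference = {("a", "b"), ("b", "d"), ("c", "e"), ("d", "e")}
p, r, f1 = precision_recall_f1(learned, reference)
```

The same function applies to concepts as well as relationships, as long as both sides are expressed as comparable items.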

FIG. 2 concerns the number of words contained in each concept, which the experiments show affects the accuracy of ontology construction. If each concept contains fewer than 10 words, accuracy is seriously degraded; as the number of words per concept increases, accuracy improves. More is not always better, however: the experiments indicate that 16 words per concept gives the best results. If a concept contains too many words, low-frequency words from the corpus enter the concept; these contribute little to the abstract meaning of the concept in the ontology and can lower the actual quality of the constructed ontology.
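Limiting a concept to a fixed number of words amounts to keeping the most probable words of a topic. A minimal sketch, assuming each topic is available as a mapping from word to P(w | z = j); the function name `top_words` and the example distribution are hypothetical, and the default of 16 simply mirrors the value reported above.

```python
def top_words(topic_word_probs, n=16):
    """Return the n most probable words of one topic, most probable first.

    Ties are broken alphabetically so the result is deterministic.
    """
    ranked = sorted(topic_word_probs.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:n]]


# Made-up topic over a tiny vocabulary.
topic = {"cell": 0.40, "gene": 0.25, "protein": 0.25, "assay": 0.10}
concept = top_words(topic, n=3)
```

Truncating at n keeps low-probability words such as "assay" out of the concept, which is the effect the experiment attributes to a moderate word count.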

FIG. 3 shows in detail how the F1 value of the algorithm changes with the number of ontology levels when the threshold thct of the CosTMI measure is 0.93; the F1 value is highest when the number of ontology levels is 7.

Claims

1. An ontology construction method based on a Topic Model, characterized by comprising the following steps:

Step one: extract concepts from a given document corpus using an LDA model, form a concept set from the extracted concepts, and then subdivide the concepts into levels to generate the hierarchy G of the ontology, wherein T = {t1, t2, …, tm} is a concept set, defined as the upper-level concept set; T' = {t1', t2', …, tm'} is a sub-concept set, defined as the concept set one level below the upper-level concept set T; E is a set of edges, and each eij ∈ E indicates that the i-th concept ti in the concept set T is connected by an edge to the j-th concept tj' in the sub-concept set T', wherein the following rules are followed when subdividing the concepts into levels to generate the hierarchy G of the ontology:

Rule 1: if ti ∈ T, tj' ∈ T', and NT < NT', then the concept level of the concept set T is higher than that of the concept set T', wherein NT and NT' are the hierarchy levels of the concept set T and the concept set T', respectively;

Rule 2: if ti ∈ T, tj' ∈ T', and [condition not reproduced in this text], where ∅ is the empty set;

wherein P(w, w') = P(w) P(w' | w);

$$P(w' \mid w) = \sum_{j=1}^{k} P(w' \mid z = j)\,P(z = j \mid w)$$

in the formula, P(w' | z = j) is the conditional probability of w' when the topic is j, and P(z = j | w) is the conditional probability of topic j when the word is w.