Disclosure of Invention
The object of the invention is to provide a measure of semantic similarity between concepts, used to calculate the semantic similarity between concepts generated by an LDA model.
In order to achieve the above object, the technical solution of the invention is to provide an ontology construction method based on a Topic Model, characterized by comprising the following steps:
step one: extracting concepts from a given document corpus using an LDA model, generating a concept set from the extracted concepts, and then performing concept hierarchy subdivision to generate the hierarchical structure G of the ontology, wherein T = {t1, t2, …, tm} is a concept set, defined as the upper-layer concept set; T' = {t1', t2', …, tm'} is a sub-concept set, defined as the concept set one level below the upper-layer concept set T; E is the set of edges, and each eij ∈ E indicates that the i-th concept ti in the concept set T is connected by an edge to the j-th concept tj' in the sub-concept set T';
step two: identifying the similarity between the concepts in the hierarchical structure G, namely the potential links between concepts in adjacent levels, using the CosTMI similarity measure, wherein, in the context of the p-th concept tp in the upper-layer concept set T, the semantic similarity of the s-th concept ts' and the r-th concept tr' in the lower-layer concept set T' is denoted CosTMI(ts', tr'; tp) and is computed from the pointwise mutual information between the words of these concepts.
In this measure, tp comprises the word sequence {wp1, wp2, …, wpn}; ts' comprises the word sequence {ws'1, ws'2, …, ws'n}; tr' comprises the word sequence {wr'1, wr'2, …, wr'n}; PMI() is the pointwise mutual information of two words, and the pointwise mutual information of two words w and w' is:
$$\mathrm{PMI}(w, w') = \log \frac{P(w, w')}{P(w)\,P(w')}$$
wherein P(w, w') = P(w) P(w' | w), and
$$P(w) = \sum_{j=1}^{k} P(w \mid z = j)\,P(z = j)$$
where z is the topic, P(z = j) is the probability of topic j, P(w | z = j) is the conditional probability of word w given topic j, and k is the number of concepts;
$$P(w' \mid w) = \sum_{j=1}^{k} P(w' \mid z = j)\,P(z = j \mid w)$$
where P(w' | z = j) is the conditional probability of word w' given topic j, and P(z = j | w) is the conditional probability of topic j given word w.
Preferably, in the first step, the following rules are followed when performing concept hierarchy subdivision to generate the ontology-structured hierarchy G:
rule 1: if ti ∈ T, tj' ∈ T' and NT < NT', then the sub-concept set T' has a higher level number than the concept set T, that is, T' lies below T in the hierarchy, wherein NT and NT' are the hierarchy levels of the concept set T and the sub-concept set T', respectively;
rule 2: if ti ∈ T, tj' ∈ T' and ti ∩ tj' = ∅, then there is a high likelihood of an upper-lower level relationship between ti and tj', wherein ∅ is the empty set.
The invention provides an AOL method that supports automatic construction of domain ontologies. It defines a measure of semantic similarity between concepts, used to calculate the semantic similarity between concepts generated by an LDA model. The method does not limit the number of child nodes of the root node and does not require a seed ontology as an initial ontology for learning. Experimental results show that performing automatic ontology construction with the Topic Model is very effective.
The invention constructs the concepts of the ontology and the hierarchical structure among them by repeatedly using an LDA model, that is, a Topic Model, to generate concepts, and by defining a measure that can accurately capture the semantic similarity between concepts.
Detailed Description
The invention will be further illustrated with reference to the following specific examples. It should be understood that these examples are for illustrative purposes only and are not intended to limit the scope of the present invention. Further, it should be understood that various changes or modifications of the present invention may be made by those skilled in the art after reading the teaching of the present invention, and such equivalents may fall within the scope of the present invention as defined in the appended claims.
The invention provides an ontology construction method based on a Topic Model, which comprises the following steps:
step one: concepts are extracted from a given document corpus using an LDA model, and concept hierarchy subdivision is then performed to generate the hierarchical structure of the ontology;
step two: a CosTMI similarity measure is designed and used to identify the similarity between the concepts of the hierarchical structure, namely the potential links between concepts in adjacent levels.
These steps involve the following technical innovations:
One) ontology construction process
FIG. 1 illustrates the ontology construction process. A hierarchical structure G = {T, E} is constructed, where T = {t1, t2, …, tm} is a concept set produced by the LDA model, called a concept layer, and can be defined as the upper-layer concept set. T' = {t1', t2', …, tm'} is a sub-concept set, defined as the concept set one level below the upper-layer concept set T. E is the set of edges; each eij ∈ E indicates that the i-th concept ti in the concept set T is connected by an edge to the j-th concept tj' in the sub-concept set T'.
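The structure described above can be pictured with a small data type. This is only an illustrative sketch: the class name `Hierarchy`, its methods, the 0-based indices and the example words are assumptions made here, not part of the described method.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Hierarchy:
    """Two adjacent layers of the hierarchical structure (hypothetical helper)."""

    upper: list[list[str]]  # T: each concept is represented by its list of words
    lower: list[list[str]]  # T': sub-concepts, same representation
    edges: set[tuple[int, int]] = field(default_factory=set)  # E: (i, j) pairs

    def connect(self, i: int, j: int) -> None:
        """Add the edge e_ij between upper concept i and lower concept j."""
        if not (0 <= i < len(self.upper) and 0 <= j < len(self.lower)):
            raise IndexError("concept index out of range")
        self.edges.add((i, j))

    def parents(self, j: int) -> list[int]:
        """Upper concepts linked to lower concept j (there may be several)."""
        return sorted(i for (i, jj) in self.edges if jj == j)

    def children(self, i: int) -> list[int]:
        """Lower concepts linked to upper concept i (there may be none)."""
        return sorted(j for (ii, j) in self.edges if ii == i)


# Toy example with made-up words.
g = Hierarchy(upper=[["protein"], ["cell"]], lower=[["kinase"], ["lymphocyte"]])
g.connect(0, 0)
g.connect(1, 0)
```

Storing the edges as index pairs keeps the structure neutral about how many parents or children a concept has.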
To build the connections between upper-layer and lower-layer concepts, it is necessary to determine the level to which each concept node belongs, that is, which nodes belong to the upper-layer concept set and which to the lower-layer concept set; establishing the connections between the two layers' concept sets is more complicated. The boundaries between concepts produced by the LDA model are not particularly clear, so the concepts must be layered using some measure and relationships between the layers must be established. Some concepts may have several parents, and some may have no children. The more concept layers are generated, the tighter the relationships between the layers become, so the number of concept layers cannot be increased without limit, and the number of levels of the ontology must be set manually.
Two) related rules
Before the method for automatic ontology learning is described in detail, two basic rules are defined. It is common practice to reuse the LDA model repeatedly to generate the concept sets needed to build the hierarchy. The invention defines rules that restrict which of the concepts generated by the model are used to build the hierarchy of the ontology.
Intuitively, concepts at higher levels are more abstract and concepts at lower levels are more concrete; the higher the level, the fewer the concepts, and vice versa. Based on this common sense, the following rules are defined:
rule 1: if ti ∈ T, tj' ∈ T' and NT < NT', then the sub-concept set T' has a higher level number than the concept set T, that is, T' lies below T in the hierarchy, where NT and NT' are the hierarchy levels of the concept set T and the sub-concept set T', respectively.
When sets of concepts are obtained by repeatedly learning with the LDA model, it must first be determined that NT < NT'. This rule is therefore very important to the ontology construction method.
Each concept in each layer that LDA learns from the document corpus consists of words that occur frequently in the documents, and words that occur frequently in an upper-layer concept set are very likely to occur just as frequently in a lower-layer concept set. Identical words could therefore be linked to each other while the ontology is constructed, which is unreasonable. The following rule is therefore defined:
rule 2: if ti ∈ T, tj' ∈ T' and ti ∩ tj' = ∅, then there is a high likelihood of an upper-lower level relationship between ti and tj', where ∅ is the empty set.
This rule can help us define similarity measures between concepts as described in this patent below.
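One way to apply this rule in practice is to filter candidate concept pairs before any similarity is computed. The sketch below assumes that Rule 2 is read as requiring the word sets of the two concepts to be disjoint; the function names and the toy words are hypothetical.

```python
def shares_vocabulary(upper_concept, lower_concept):
    """True if the two concepts have at least one word in common."""
    return bool(set(upper_concept) & set(lower_concept))


def candidate_pairs(upper, lower):
    """Index pairs (i, j) whose concepts share no words.

    Assumption: only such pairs are considered for an upper-lower link.
    """
    return [
        (i, j)
        for i, t in enumerate(upper)
        for j, s in enumerate(lower)
        if not shares_vocabulary(t, s)
    ]


# Toy example with made-up words.
upper = [["cell", "protein"], ["gene"]]
lower = [["protein", "kinase"], ["receptor"]]
pairs = candidate_pairs(upper, lower)
```

In this toy input the pair (0, 0) is dropped because both concepts contain "protein".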
Three) similarity measurement
The invention uses a similarity measure to construct the hierarchy of the ontology, that is, the links between concepts are established through the similarity between them. A connection between two concepts in the concept sets of two levels is established only when their similarity reaches a certain value; otherwise the two concepts are considered unconnected. To calculate the semantic similarity between two concepts, the concept matrix produced when the LDA model generates the concept set is used; each entry of the matrix is the probability of the concept appearing in the ontology.
In general, similarity between concepts is measured using pointwise mutual information (PMI). The invention defines a new semantic similarity measure between words w and w', in which PMI is defined using the expectation over two concepts, each concept being composed of a series of words, which is a particular property of the LDA model. The pointwise mutual information of two words w and w' is:
$$\mathrm{PMI}(w, w') = \log \frac{P(w, w')}{P(w)\,P(w')}$$
wherein P(w, w') = P(w) P(w' | w), and
$$P(w) = \sum_{j=1}^{k} P(w \mid z = j)\,P(z = j)$$
where z is the topic, P(z = j) is the probability of topic j, P(w | z = j) is the probability of word w given topic j, and k is the number of concepts;
$$P(w' \mid w) = \sum_{j=1}^{k} P(w' \mid z = j)\,P(z = j \mid w)$$
where P(w' | z = j) is the probability of word w' given topic j, and P(z = j | w) is the conditional probability of topic j given word w.
The invention thus provides a formula for the pointwise mutual information of two words, which prepares for the subsequent organization and construction of the hierarchical structure of concepts in the ontology; the formula can also be used to define the semantic similarity between other concepts.
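The PMI computation can be sketched from the quantities an LDA model yields. The names `pz` (topic probabilities P(z = j)) and `phi` (per-topic word probabilities P(w | z = j)) are assumptions made for this example, and P(z = j | w) is obtained here by Bayes' rule, which the text does not spell out.

```python
import math


def word_prob(w, pz, phi):
    """P(w) = sum_j P(w | z=j) P(z=j)."""
    return sum(pz[j] * phi[j][w] for j in range(len(pz)))


def cond_prob(w2, w, pz, phi):
    """P(w2 | w) = sum_j P(w2 | z=j) P(z=j | w), with P(z=j | w) from Bayes' rule."""
    pw = word_prob(w, pz, phi)
    if pw == 0.0:
        return 0.0
    return sum(phi[j][w2] * (pz[j] * phi[j][w] / pw) for j in range(len(pz)))


def pmi(w, w2, pz, phi):
    """PMI(w, w2) = log(P(w, w2) / (P(w) P(w2))), with P(w, w2) = P(w) P(w2 | w)."""
    pw, pw2 = word_prob(w, pz, phi), word_prob(w2, pz, phi)
    joint = pw * cond_prob(w2, w, pz, phi)
    if joint == 0.0 or pw2 == 0.0:
        return float("-inf")  # the two words never occur under the same topic
    return math.log(joint / (pw * pw2))


# Toy model with two equally likely topics over three made-up words.
pz = [0.5, 0.5]
phi = [
    {"a": 0.5, "b": 0.5, "c": 0.0},
    {"a": 0.0, "b": 0.5, "c": 0.5},
]
```

Returning negative infinity for words that never share a topic is a design choice of this sketch; a smoothed model would avoid zero probabilities altogether.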
Each concept generated by LDA corresponds to a concept within the ontology structure. The semantic similarity measure gives the semantic similarity of two concepts within the context of a third concept. In the context of the p-th concept tp in the upper concept set T, the semantic similarity of the s-th concept ts' and the r-th concept tr' in the lower concept set T' is denoted CosTMI(ts', tr'; tp).
In this measure, tp comprises the word sequence {wp1, wp2, …, wpn}; ts' comprises the word sequence {ws'1, ws'2, …, ws'n}; tr' comprises the word sequence {wr'1, wr'2, …, wr'n}.
A threshold thct is preset; if the value of CosTMI(ts', tr'; tp) is greater than thct, a relationship is established between tp and the concepts ts' and tr'. The concepts for which relationships can be established through this definition and calculation of semantic similarity are all concepts of the constructed ontology. The threshold thct is determined experimentally; a larger CosTMI value indicates greater semantic similarity between the two concepts, and a smaller value indicates less.
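The thresholding step can be illustrated as follows. The exact CosTMI formula is not reproduced in this text, so the sketch substitutes an assumed stand-in: each lower-level concept is turned into a vector of average PMI values against the words of the context concept tp, and the two vectors are compared by cosine similarity. The function names and the `pmi` callable are hypothetical; only the threshold 0.93 comes from the experiments reported in this document.

```python
import math


def context_vector(concept, context, pmi):
    """Average PMI between each context word and the words of a concept."""
    if not concept:
        raise ValueError("concept must contain at least one word")
    return [sum(pmi(wp, w) for w in concept) / len(concept) for wp in context]


def cos_tmi(ts, tr, tp, pmi):
    """Assumed stand-in for CosTMI(ts', tr'; tp): cosine of the two PMI vectors."""
    u = context_vector(ts, tp, pmi)
    v = context_vector(tr, tp, pmi)
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    if norm == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(u, v)) / norm


def related(ts, tr, tp, pmi, threshold=0.93):
    """True if the similarity exceeds the preset threshold thct."""
    return cos_tmi(ts, tr, tp, pmi) > threshold


# Toy PMI table with made-up values, keyed by (context word, concept word).
table = {
    ("p1", "a"): 1.0, ("p2", "a"): 2.0,
    ("p1", "b"): 2.0, ("p2", "b"): 4.0,
    ("p1", "c"): 2.0, ("p2", "c"): -1.0,
}
toy_pmi = lambda x, y: table[(x, y)]
```

With this toy table, `["a"]` and `["b"]` point in the same direction relative to the context and pass the threshold, while `["a"]` and `["c"]` are orthogonal and do not.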
The validity and practicality of the proposed ontology construction method are verified below using the real GENIA corpus and the GENIA ontology.
The proposed ontology construction method is verified experimentally using the GENIA corpus and its corresponding GENIA ontology. The GENIA corpus is a biological corpus containing 1,999 MEDLINE abstracts, which were collected using MeSH terms such as human and blood cells. The GENIA ontology contains 45 concepts and 42 relationships. In the experiment, the GENIA corpus is input into the LDA model and the concepts required for the ontology to be constructed are computed. The proposed AOL method and the method proposed by Zavidisanos et al. were both run on a PC with a Pentium 4 processor and 2 GB of memory. The thresholds for CosTMI and for the CI method proposed by Zavidisanos et al. were set to 0.93 and 3×10^-6, respectively.
The effectiveness of the proposed algorithm and the quality of the constructed ontology are evaluated by recall, precision and the F1 measure. The results of running the two methods are compared in Table 1.
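For reference, these evaluation measures can be computed as in the following sketch, which compares a set of predicted items (concepts or relationships) with a gold-standard set. The function name and the toy relationship pairs are made up for illustration and are not taken from the GENIA experiments.

```python
def precision_recall_f1(predicted, gold):
    """Precision, recall and F1 of a predicted set against a gold-standard set."""
    predicted, gold = set(predicted), set(gold)
    hits = len(predicted & gold)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy example: relationships as (parent, child) pairs.
predicted = {("a", "b"), ("a", "c"), ("b", "d")}
gold = {("a", "b"), ("b", "d"), ("c", "d"), ("c", "e")}
p, r, f1 = precision_recall_f1(predicted, gold)
```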
TABLE 1 Results for concepts C and relationships S based on the similarity measures
Table 1 shows that the proposed AOL method performs very well, that its precision and recall are both higher than those of the CI method, and that it can be used for ontology construction in other knowledge domains.
FIG. 2 shows the number of words contained in each concept. In the experiments, the number of words per concept affected the accuracy of ontology construction. The results show that if each concept contains fewer than 10 words, the accuracy of ontology construction is seriously degraded; conversely, if each concept contains more words, the accuracy is higher. However, more words are not always better. Experimental analysis shows that the best results are obtained when each concept contains 16 words. If a concept contains too many words, low-frequency words from the corpus appear in it; such words contribute little to the abstract meaning of a concept in the ontology and can reduce the actual quality of the constructed ontology.
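Limiting a concept to its most probable words can be sketched as below. The function name and the toy probabilities are hypothetical; the default of 16 words follows the experimental finding stated above, and ties are broken alphabetically only to make the result deterministic.

```python
def top_words(word_probs, n=16):
    """Return the n most probable words of one topic, highest probability first."""
    ranked = sorted(word_probs.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:n]]


# Toy topic with made-up word probabilities.
topic = {"cell": 0.4, "gene": 0.3, "protein": 0.25, "the": 0.05}
concept = top_words(topic, n=2)
```

Truncating the ranked list in this way drops the low-probability tail, which is where low-frequency words end up.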
FIG. 3 shows the performance of the algorithm in detail, namely how the F1 value changes with the number of ontology levels when the threshold thct of the CosTMI measure is 0.93. As FIG. 3 shows, the F1 value is highest when the number of ontology levels is 7.