CN107895012B - Ontology construction method based on Topic Model - Google Patents
Ontology construction method based on Topic Model
- Publication number
- CN107895012B (application CN201711112981.9A)
- Authority
- CN
- China
- Prior art keywords
- concept
- concepts
- ontology
- topic
- hierarchy
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Fee Related
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/36—Creation of semantic tools, e.g. ontology or thesauri
- G06F16/367—Ontology
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Computational Linguistics (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Animal Behavior & Ethology (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- General Health & Medical Sciences (AREA)
- Machine Translation (AREA)
Abstract
The invention provides an ontology construction method based on a Topic Model. The proposed method, AOL, supports automatic construction of domain ontologies. It introduces a mutual-information-based measure of semantic similarity between concepts, used to compute the semantic similarity between concepts generated by an LDA model. It does not limit the number of child nodes of the root node, and it does not require a seed ontology as the initial ontology for learning. Experimental results show that automatic ontology construction with the Topic Model is very effective.
Description
Technical Field
The invention relates to an ontology construction method that uses a Topic Model as the unit for generating basic concepts and can learn an ontology without a seed ontology.
Background
In recent years, ontologies have been applied in fields such as artificial intelligence, information extraction, and machine translation. Building an ontology by hand, however, is time-consuming and labor-intensive, so constructing ontologies automatically by means of data analysis and data mining is a significant research topic that has attracted a great deal of work. Most current ontology learning methods focus on extending and updating an existing seed ontology by extracting concepts or lexical units from a document collection. Some methods do learn ontologies automatically, but most of them rely on ontologies from specialized knowledge fields, such as SKOS models, and therefore have certain limitations.
The Topic Model is a probabilistic model that has proven very effective at identifying concepts from scientific publications when no prior knowledge is available. It is now widely applied in text mining.
Elias Zavitsanos et al. propose an automatic ontology learning method based on statistics. It repeatedly uses concept sets trained by a Topic Model and then judges the relations between the identified concepts by conditional independence, but it cannot establish relations between concepts in two levels of the hierarchy. Wang Wei et al. propose two methods, both of which learn an ontology structure for the Semantic Web. By combining information theory with a Topic Model they achieve good recall and precision, but they need to limit the number of sub-concept nodes directly below the root node.
Disclosure of Invention
The invention aims to provide a mutual-information-based measure of semantic similarity between concepts, used to compute the semantic similarity between the concepts generated by an LDA model.
To achieve the above object, the technical solution of the invention is an ontology construction method based on a Topic Model, characterized by comprising the following steps:
Step 1: extract concepts from a given document corpus using an LDA model, form a concept set from the extracted concepts, and then subdivide the concepts into levels to generate the hierarchy G of the ontology, G = {T, E}, where T = {t1, t2, …, tm} is a concept set, defined as the upper-layer concept set; T' = {t1', t2', …, tm'} is a sub-concept set, defined as the concept set one level below the upper-layer concept set T; and E is a set of edges, where each eij ∈ E indicates that the i-th concept ti in T is connected by an edge to the j-th concept tj' in T';
Step 2: use the CosTMI similarity measure to identify the similarity between the concepts in the hierarchy G, that is, the potential connections between concepts in adjacent levels. In the context of the p-th concept tp of the upper-layer concept set T, the semantic similarity of the s-th concept ts' and the r-th concept tr' of the lower-layer concept set T' is denoted CosTMI(ts', tr'; tp); its defining formula is not reproduced in this text.
In that formula, tp contains the word sequence {wp1, wp2, …, wpn}; ts' contains the word sequence {ws'1, ws'2, …, ws'n}; tr' contains the word sequence {wr'1, wr'2, …, wr'n}; and PMI() is the pointwise mutual information of two words. For two words w and w':
$$\mathrm{PMI}(w, w') = \log \frac{P(w, w')}{P(w)\,P(w')}$$
where P(w, w') = P(w) P(w' | w);
$$P(w) = \sum_{j=1}^{k} P(z = j)\, P(w \mid z = j)$$
where z is the topic, P(z = j) is the probability of topic j, P(w | z = j) is the conditional probability of word w given topic j, and k is the number of concepts;
$$P(w' \mid w) = \sum_{j=1}^{k} P(w' \mid z = j)\, P(z = j \mid w)$$
where P(w' | z = j) is the conditional probability of word w' given topic j, and P(z = j | w) is the conditional probability of topic j given word w.
Preferably, in step 1, the following rules are followed when subdividing the concepts into levels to generate the hierarchy G of the ontology:
Rule 1: if ti ∈ T, tj' ∈ T', and NT < NT', then the concept set T is at a higher level than the sub-concept set T', where NT and NT' are the hierarchy levels of the concept set T and the sub-concept set T', respectively;
Rule 2: if ti ∈ T, tj' ∈ T', and a set condition involving ∅ holds (the condition is not reproduced in this text), then a higher-level/lower-level relationship between ti and tj' is highly likely, where ∅ is the empty set.
The proposed method, AOL, supports automatic construction of domain ontologies. It introduces a mutual-information-based measure of semantic similarity between concepts, used to compute the semantic similarity between concepts generated by an LDA model. It does not limit the number of child nodes of the root node, and it does not require a seed ontology as the initial ontology for learning. Experimental results show that automatic ontology construction with the Topic Model is very effective.
The invention builds the concepts of the ontology and the hierarchy among them by repeatedly using an LDA model (a Topic Model) to generate concepts and by defining a measure that can accurately quantify the semantic similarity between concepts.
Drawings
FIG. 1 shows the process of building the ontology structure;
FIG. 2 plots the accuracy of the concepts against the number of words per concept;
FIG. 3 plots the number of ontology levels against the F1 measure.
Detailed Description
The invention will be further illustrated with reference to the following specific examples. It should be understood that these examples are for illustrative purposes only and are not intended to limit the scope of the present invention. Further, it should be understood that various changes or modifications of the present invention may be made by those skilled in the art after reading the teaching of the present invention, and such equivalents may fall within the scope of the present invention as defined in the appended claims.
The invention provides an ontology construction method based on a Topic Model, comprising the following steps:
Step 1: extract concepts from a given document corpus using an LDA model, then subdivide the concepts into levels to generate the hierarchy of the ontology;
Step 2: design the CosTMI similarity measure and use it to identify the similarity between the concepts of the hierarchy, that is, the potential relations between concepts in adjacent levels.
These steps involve the following technical innovations:
(1) Ontology construction process
FIG. 1 illustrates the ontology construction process. A hierarchy G = {T, E} is constructed, where T = {t1, t2, …, tm} is a concept set produced by the LDA model, called a concept layer, and is defined as the upper-layer concept set. T' = {t1', t2', …, tm'} is a sub-concept set, defined as the concept set one level below the upper-layer concept set T. E is a set of edges, where each eij ∈ E indicates that the i-th concept ti in T is connected by an edge to the j-th concept tj' in T'.
To connect upper-layer and lower-layer concepts, it must first be determined which level each concept node belongs to, that is, which concepts belong to the upper-layer concept set and which to the lower-layer concept set; establishing the connections between the two layers is the more complicated part. The boundaries between concepts produced by the LDA model are not sharp, so a measure is needed to assign concepts to layers and to establish the relationships between layers. Some concepts may have several parents, and some may have no children. The more concept layers are generated, the closer the relationship between adjacent layers becomes, so the number of layers cannot grow without limit, and the number of levels of the ontology must be set manually.
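The structure G = {T, E} described above can be sketched as a small data type. This is only an illustration: the class name `Hierarchy`, its field names, and the choice to store each concept as a list of words are assumptions made here, not part of the patent text.

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple


@dataclass
class Hierarchy:
    """Two adjacent concept layers G = {T, E} (illustrative names only)."""
    upper: List[List[str]]  # T: each concept is a list of words
    lower: List[List[str]]  # T': concepts one level below T
    edges: Set[Tuple[int, int]] = field(default_factory=set)  # E: (i, j) links t_i to t_j'

    def connect(self, i: int, j: int) -> None:
        """Add edge e_ij between upper concept i and lower concept j."""
        if not (0 <= i < len(self.upper) and 0 <= j < len(self.lower)):
            raise IndexError("concept index out of range")
        self.edges.add((i, j))

    def parents(self, j: int) -> List[int]:
        """Upper-layer concepts linked to lower concept j (possibly several)."""
        return sorted(i for (i, jj) in self.edges if jj == j)

    def children(self, i: int) -> List[int]:
        """Lower-layer concepts linked to upper concept i (possibly none)."""
        return sorted(j for (ii, j) in self.edges if ii == i)
```

Storing edges as index pairs keeps the sketch consistent with the text's observation that a concept may have several parents or no children.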
(2) Related rules
Before the automatic ontology learning method is described in detail, two basic rules are defined. The usual practice is to reuse the LDA model repeatedly to generate the concept sets needed to build the hierarchy. The invention defines rules that restrict which of the generated concepts are used to build the ontology hierarchy.
Intuitively, concepts at higher levels are more abstract and concepts at lower levels are more concrete; the higher the level, the fewer the concepts, and vice versa. Based on this common sense, the following rules are defined:
Rule 1: if ti ∈ T, tj' ∈ T', and NT < NT', then the concept set T is at a higher level than the sub-concept set T', where NT and NT' are the hierarchy levels of the concept set T and the sub-concept set T', respectively.
When concept sets are obtained by learning repeatedly with the LDA model, NT < NT' must be established first. This rule is therefore very important to the ontology construction method.
Each concept in each layer learned by LDA from the document corpus consists of words that occur frequently in the documents, and words that occur frequently in a high-layer concept are very likely to occur frequently in a low-layer concept as well. Identical words could therefore end up being connected to each other during ontology construction, which is unreasonable. The following rule is thus defined:
Rule 2: if ti ∈ T, tj' ∈ T', and a set condition involving ∅ holds (the condition is not reproduced in this text), then a higher-level/lower-level relationship between ti and tj' is highly likely, where ∅ is the empty set.
This rule helps define the similarity measure between concepts described below.
(3) Similarity measurement
The invention uses a similarity measure to build the hierarchy of the ontology; that is, relations between concepts are established through the similarity between them. A connection is established only when the similarity between two concepts in the two layers' concept sets reaches a certain value; otherwise the two concepts are considered unconnected. To compute the semantic similarity between two concepts, the concept matrix produced by the LDA model while generating the concept set is used; each entry of the matrix is the probability of the concept appearing in the ontology.
In general, similarity between concepts is measured using pointwise mutual information (PMI). The invention defines a new semantic similarity measure in which the PMI between words w and w' is defined through the expectations of two concepts, each concept being composed of a series of words, which is a characteristic property of the LDA model. The pointwise mutual information of two words w and w' is:
$$\mathrm{PMI}(w, w') = \log \frac{P(w, w')}{P(w)\,P(w')}$$
where P(w, w') = P(w) P(w' | w);
$$P(w) = \sum_{j=1}^{k} P(z = j)\, P(w \mid z = j)$$
where z is the topic, P(z = j) is the probability of topic j, P(w | z = j) is the conditional probability of word w given topic j, and k is the number of concepts;
$$P(w' \mid w) = \sum_{j=1}^{k} P(w' \mid z = j)\, P(z = j \mid w)$$
where P(w' | z = j) is the conditional probability of word w' given topic j, and P(z = j | w) is the conditional probability of topic j given word w.
This formula for the pointwise mutual information of two words prepares for the subsequent organization and construction of the concept hierarchy of the ontology, and it can also be used to define the semantic similarity between other concepts.
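The word-level quantities in this section can be sketched in a few lines. The sketch assumes the standard PMI definition log(P(w, w') / (P(w) P(w'))), takes P(w) as the sum over topics of P(z = j) P(w | z = j), takes P(w' | w) as the sum over topics of P(w' | z = j) P(z = j | w), and obtains P(z = j | w) by Bayes' rule. The function names and the dictionary layout of the topic-word distributions are hypothetical.

```python
import math
from typing import Dict, List


def word_prob(w: str, topic_prior: List[float],
              word_given_topic: List[Dict[str, float]]) -> float:
    """P(w) = sum_j P(z=j) * P(w | z=j)."""
    return sum(pz * dist.get(w, 0.0)
               for pz, dist in zip(topic_prior, word_given_topic))


def cond_prob(w2: str, w: str, topic_prior: List[float],
              word_given_topic: List[Dict[str, float]]) -> float:
    """P(w' | w) = sum_j P(w' | z=j) * P(z=j | w), with P(z=j | w) by Bayes' rule."""
    pw = word_prob(w, topic_prior, word_given_topic)
    if pw == 0.0:
        return 0.0
    total = 0.0
    for pz, dist in zip(topic_prior, word_given_topic):
        topic_given_w = pz * dist.get(w, 0.0) / pw  # P(z=j | w)
        total += dist.get(w2, 0.0) * topic_given_w
    return total


def pmi(w: str, w2: str, topic_prior: List[float],
        word_given_topic: List[Dict[str, float]]) -> float:
    """PMI(w, w') = log(P(w, w') / (P(w) P(w'))), with P(w, w') = P(w) P(w' | w)."""
    pw = word_prob(w, topic_prior, word_given_topic)
    pw2 = word_prob(w2, topic_prior, word_given_topic)
    joint = pw * cond_prob(w2, w, topic_prior, word_given_topic)
    if joint == 0.0 or pw == 0.0 or pw2 == 0.0:
        return float("-inf")  # the two words never share a topic
    return math.log(joint / (pw * pw2))
```

With two equally likely toy topics, one over {cell, blood} and one over {gene, protein}, words from the same topic get a positive PMI and words from different topics get negative infinity.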
Each concept generated by LDA corresponds to a concept in the ontology structure. The semantic similarity measure quantifies the semantic similarity of two concepts in the particular context given by a third concept. In the context of the p-th concept tp of the upper-layer concept set T, the semantic similarity of two concepts ts' and tr' of the lower-layer concept set T' is denoted CosTMI(ts', tr'; tp); its defining formula is not reproduced in this text.
In that formula, tp contains the word sequence {wp1, wp2, …, wpn}; ts' contains the word sequence {ws'1, ws'2, …, ws'n}; and tr' contains the word sequence {wr'1, wr'2, …, wr'n}.
A threshold thct is preset. If the value of CosTMI(ts', tr'; tp) exceeds thct, a relation is established between tp and ts', tr'. The concepts for which relations can be established through this definition and computation of semantic similarity are all concepts of the constructed ontology. The threshold thct must be determined by experiment; a larger CosTMI value indicates greater semantic similarity between two concepts, and a smaller value indicates less.
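The thresholding step can be sketched as follows. Because the CosTMI formula itself is not available in this text, the similarity is passed in as a function; `toy_overlap` is a made-up stand-in used only to exercise the code and is not CosTMI. The sketch also reads the text as linking tp to both lower-layer concepts of a pair whose score exceeds the threshold, which is an interpretation.

```python
from itertools import combinations
from typing import Callable, List, Sequence, Set, Tuple

Concept = Sequence[str]
Similarity = Callable[[Concept, Concept, Concept], float]


def link_layers(upper: List[Concept], lower: List[Concept],
                similarity: Similarity, threshold: float) -> Set[Tuple[int, int]]:
    """Return edges (p, s): upper concept p is linked to lower concept s.

    For each upper concept t_p and each pair (t_s', t_r') of lower concepts,
    both lower concepts are linked to t_p when similarity(t_s', t_r', t_p)
    exceeds the threshold.
    """
    edges: Set[Tuple[int, int]] = set()
    for p, tp in enumerate(upper):
        for s, r in combinations(range(len(lower)), 2):
            if similarity(lower[s], lower[r], tp) > threshold:
                edges.add((p, s))
                edges.add((p, r))
    return edges


def toy_overlap(ts: Concept, tr: Concept, tp: Concept) -> float:
    """Stand-in score (NOT CosTMI): share of context words found in both concepts."""
    context = set(tp)
    if not context:
        return 0.0
    return len(set(ts) & set(tr) & context) / len(context)
```

Passing the measure as a parameter keeps the linking logic independent of how the similarity is actually computed.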
The validity and practicality of the proposed ontology construction method are verified below using the real GENIA corpus and the corresponding GENIA ontology.
The GENIA corpus is a biological corpus. It contains 1,999 medical vocabulary entries collected using MeSH terms such as human and blood cells. The GENIA ontology contains 45 concepts and 42 relationships. In the experiment, the GENIA corpus is input to the LDA model to compute the concepts needed for the ontology to be constructed. The proposed AOL method is compared with the method of Zavitsanos et al.; both algorithms were run on a PC with a Pentium 4 processor and 2 GB of memory. CosTMI is compared with the CI (conditional independence) method of Zavitsanos et al., with the thresholds set to 0.93 and 3 × 10^-6, respectively.
The effectiveness of the algorithm and the quality of the resulting ontology structure are evaluated by recall, precision, and the F1 measure. The comparison results for the two methods are shown in Table 1.
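For reference, the three evaluation measures can be computed from a predicted set and a gold-standard set as in the sketch below. The translated text says "accuracy", which is read here as precision. The function name and the example data in the usage note are made up for illustration and are not the patent's experimental results.

```python
from typing import Hashable, Set, Tuple


def precision_recall_f1(predicted: Set[Hashable],
                        gold: Set[Hashable]) -> Tuple[float, float, float]:
    """Precision, recall and F1 of a predicted set against a gold-standard set."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

For example, with three predicted relations of which two appear among four gold relations, precision is 2/3, recall is 1/2, and F1 is 4/7.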
TABLE 1 Results for concepts (C) and relationships (S) under each similarity measure
Table 1 shows that the proposed AOL method performs very effectively and can be used for ontology construction in other knowledge domains; both its precision and its recall are higher than those of the CI method.
FIG. 2 relates to the number of words contained in each concept. The experiments show that this number affects the accuracy of ontology construction. If each concept contains fewer than 10 words, the accuracy of ontology construction suffers seriously; as the number of words per concept grows, the accuracy improves. More words are not always better, however. Experimental analysis shows that 16 words per concept gives the best results. If a concept contains too many words, some low-frequency words from the corpus appear in it; these contribute little to the abstract meaning of the concept and can degrade the actual quality of the constructed ontology.
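Limiting each concept to its most probable words can be sketched as below. The default of 16 follows the value reported in the text; the function name and the representation of a concept as a word-to-probability dictionary are assumptions.

```python
import heapq
from typing import Dict, List


def top_words(word_probs: Dict[str, float], n: int = 16) -> List[str]:
    """Keep the n most probable words of a concept, most probable first."""
    ranked = heapq.nlargest(n, word_probs.items(), key=lambda item: item[1])
    return [word for word, _ in ranked]
```

Truncating in this way drops the low-probability tail of a topic, which is where the low-frequency words mentioned above would otherwise enter a concept.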
FIG. 3 shows in detail how the F1 value changes with the number of ontology levels when the threshold thct of the CosTMI measure is 0.93. The F1 value is highest when the number of ontology levels is 7.
Claims (1)
1. An ontology construction method based on a Topic Model, characterized by comprising the following steps:
Step 1: extract concepts from a given document corpus using an LDA model, form a concept set from the extracted concepts, and then subdivide the concepts into levels to generate the hierarchy G of the ontology, G = {T, E}, where T = {t1, t2, …, tm} is a concept set, defined as the upper-layer concept set; T' = {t1', t2', …, tm'} is a sub-concept set, defined as the concept set one level below the upper-layer concept set T; and E is a set of edges, where each eij ∈ E indicates that the i-th concept ti in T is connected by an edge to the j-th concept tj' in T', wherein the following rules are followed when subdividing the concepts into levels to generate the hierarchy G of the ontology:
Rule 1: if ti ∈ T, tj' ∈ T', and NT < NT', then the concept set T is at a higher level than the concept set T', where NT and NT' are the hierarchy levels of the concept set T and the concept set T', respectively;
Rule 2: if ti ∈ T, tj' ∈ T', and a set condition involving ∅ holds (the condition is not reproduced in this text), then a higher-level/lower-level relationship between ti and tj' is highly likely, where ∅ is the empty set;
Step 2: use the CosTMI similarity measure to identify the similarity between the concepts in the hierarchy G, that is, the potential connections between concepts in adjacent levels. In the context of the p-th concept tp of the upper-layer concept set T, the semantic similarity of the s-th concept ts' and the r-th concept tr' of the lower-layer concept set T' is denoted CosTMI(ts', tr'; tp); its defining formula is not reproduced in this text.
In that formula, tp contains the word sequence {wp1, wp2, …, wpn}; ts' contains the word sequence {ws'1, ws'2, …, ws'n}; tr' contains the word sequence {wr'1, wr'2, …, wr'n}; and PMI() is the pointwise mutual information of two words. For two words w and w':
$$\mathrm{PMI}(w, w') = \log \frac{P(w, w')}{P(w)\,P(w')}$$
where P(w, w') = P(w) P(w' | w);
$$P(w) = \sum_{j=1}^{k} P(z = j)\, P(w \mid z = j)$$
where z is the topic, P(z = j) is the probability of topic j, P(w | z = j) is the conditional probability of word w given topic j, and k is the number of concepts.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711112981.9A CN107895012B (en) | 2017-11-10 | 2017-11-10 | Ontology construction method based on Topic Model |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711112981.9A CN107895012B (en) | 2017-11-10 | 2017-11-10 | Ontology construction method based on Topic Model |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107895012A CN107895012A (en) | 2018-04-10 |
CN107895012B true CN107895012B (en) | 2021-10-08 |
Family
ID=61805185
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201711112981.9A Expired - Fee Related CN107895012B (en) | 2017-11-10 | 2017-11-10 | Ontology construction method based on Topic Model |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107895012B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11003638B2 (en) * | 2018-10-29 | 2021-05-11 | Beijing Jingdong Shangke Information Technology Co., Ltd. | System and method for building an evolving ontology from user-generated content |
CN113312910B (en) * | 2021-05-25 | 2022-10-25 | 华南理工大学 | Ontology learning method, system, device and medium based on topic model |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105095229A (en) * | 2014-04-29 | 2015-11-25 | 国际商业机器公司 | Method for training topic model, method for comparing document content and corresponding device |
US10417301B2 (en) * | 2014-09-10 | 2019-09-17 | Adobe Inc. | Analytics based on scalable hierarchical categorization of web content |
CN106611038A (en) * | 2016-07-28 | 2017-05-03 | 四川用联信息技术有限公司 | Ontology concept-based lexical semantic similarity solving method |
CN106228023B (en) * | 2016-08-01 | 2018-08-28 | 清华大学 | A kind of clinical path method for digging based on ontology and topic model |
-
2017
- 2017-11-10 CN CN201711112981.9A patent/CN107895012B/en not_active Expired - Fee Related
Also Published As
Publication number | Publication date |
---|---|
CN107895012A (en) | 2018-04-10 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee | ||
Granted publication date: 20211008 |