CN107895012B - Ontology construction method based on Topic Model - Google Patents

Info

Publication number
CN107895012B
Authority
CN
China
Prior art keywords
concept
concepts
ontology
topic
hierarchy
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201711112981.9A
Other languages
Chinese (zh)
Other versions
CN107895012A (en)
Inventor
林志杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Dianji University
Original Assignee
Shanghai Dianji University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Dianji University
Priority to CN201711112981.9A
Publication of CN107895012A
Application granted
Publication of CN107895012B
Legal status: Expired - Fee Related
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Animal Behavior & Ethology (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides an ontology construction method based on a Topic Model. The proposed AOL method supports automatic construction of domain ontologies. It defines a measure of semantic similarity between concepts, which is used to compute the semantic similarity between concepts generated by an LDA model; it does not limit the number of child nodes of the root node, and it does not require a seed ontology as an initial learning ontology. Experimental results show that automatic ontology construction using the Topic Model is effective.

Description

Ontology construction method based on Topic Model
Technical Field
The invention relates to an ontology construction method that uses a Topic Model as the unit for generating basic concepts and can learn an ontology without a seed ontology.
Background
In recent years, ontologies have been applied in fields such as artificial intelligence, information extraction, and machine translation. However, constructing an ontology manually is time-consuming and labor-intensive, so constructing ontologies automatically by means of computer data analysis and data mining is a significant research topic that has attracted extensive study. Most current ontology learning methods focus on expanding and updating an existing seed ontology by extracting concepts or vocabulary units from a document collection. Some methods do learn ontologies automatically, but most of them depend on ontologies from specialized knowledge fields, such as SKOS models, which limits their applicability.
The Topic Model is a probabilistic model that has proven very effective at identifying concepts from scientific publications when no prior knowledge is available. Topic Models are now widely applied in text mining.
Elias Zavitsanos et al. proposed a statistical method for automatic ontology learning that repeatedly applies a Topic Model to train concept sets and then uses conditional independence to determine the relations between the identified concepts; however, this method cannot establish relations between concepts belonging to two hierarchical structures. Wang Wei et al. proposed two methods, both of which learn the ontology structure on the basis of the Semantic Web. By combining information theory with a Topic Model they achieve good recall and accuracy, but they need to limit the number of sub-concept nodes closest to the root node.
Disclosure of Invention
The invention aims to provide a measure of semantic similarity between concepts, used to compute the semantic similarity between the concepts generated by an LDA model.
To achieve this object, the technical solution of the invention is an ontology construction method based on a Topic Model, characterized by comprising the following steps:
Step 1: concepts are extracted from a given document corpus using an LDA model, and the extracted concepts form concept sets; concept hierarchy subdivision is then performed to generate the hierarchy G of the ontology, where $T = \{t_1, t_2, \ldots, t_m\}$ is a concept set, defined as the upper-layer concept set; $T' = \{t_1', t_2', \ldots, t_m'\}$ is the sub-concept set, defined as the concept set one level below the upper-layer concept set T; E is a set of edges, and each $e_{ij} \in E$ indicates that the i-th concept $t_i$ in the concept set T is connected by an edge to the j-th concept $t_j'$ in the sub-concept set T';
Step 2: the CosTMI similarity measure is used to identify the similarity between the concepts in the hierarchy G, i.e. the potential links between concepts of adjacent levels. In the context of the p-th concept $t_p$ of the upper-layer concept set T, the semantic similarity of the s-th concept $t_s'$ and the r-th concept $t_r'$ of the lower-layer concept set T' is $\mathrm{CosTMI}(t_s', t_r'; t_p)$:
[Formula image BDA0001464011170000021: definition of CosTMI; not reproduced in the text]
In the formula, $t_p$ contains the word sequence $\{w_{p1}, w_{p2}, \ldots, w_{pn}\}$; $t_s'$ contains the word sequence $\{w_{s'1}, w_{s'2}, \ldots, w_{s'n}\}$; $t_r'$ contains the word sequence $\{w_{r'1}, w_{r'2}, \ldots, w_{r'n}\}$; PMI() is the pointwise mutual information of two words. For two words w and w', the pointwise mutual information PMI(w, w') is:
$$\mathrm{PMI}(w, w') = \log \frac{P(w, w')}{P(w)\,P(w')}$$
where $P(w, w') = P(w)\,P(w' \mid w)$;
$$P(w) = \sum_{j=1}^{k} P(w \mid z = j)\,P(z = j)$$
where z is the topic, $P(z = j)$ is the probability of topic j, $P(w \mid z = j)$ is the conditional probability of word w given topic j, and k is the number of concepts;
$$P(w' \mid w) = \sum_{j=1}^{k} P(w' \mid z = j)\,P(z = j \mid w)$$
where $P(w' \mid z = j)$ is the conditional probability of word w' given topic j, and $P(z = j \mid w)$ is the conditional probability of topic j given word w.
Preferably, in Step 1, the following rules are observed when performing concept hierarchy subdivision to generate the hierarchy G of the ontology:
Rule 1: if $t_i \in T$, $t_j' \in T'$ and $N_T < N_{T'}$, then the concept set T is at a higher level than the sub-concept set T', where $N_T$ and $N_{T'}$ are the hierarchy levels of the concept set T and the sub-concept set T', respectively;
Rule 2: if $t_i \in T$, $t_j' \in T'$ and $t_i \cap t_j' = \varnothing$, then there is a high likelihood of a higher-lower level relationship between $t_i$ and $t_j'$, where $\varnothing$ is the empty set.
The proposed AOL method supports automatic construction of domain ontologies. It defines a measure of semantic similarity between concepts, which is used to compute the semantic similarity between concepts generated by an LDA model; it does not limit the number of child nodes of the root node, and it does not require a seed ontology as an initial learning ontology. Experimental results show that automatic ontology construction using the Topic Model is effective.
The invention constructs the concepts of the ontology and the hierarchy among them by repeatedly applying an LDA model, i.e. a Topic Model, to generate concepts, and by defining a measure that accurately quantifies the semantic similarity between concepts.
Drawings
FIG. 1 shows the process of constructing the ontology structure;
FIG. 2 shows the accuracy of concepts versus the number of words per concept;
FIG. 3 shows the number of ontology levels versus the F1 measure.
Detailed Description
The invention will be further illustrated with reference to the following specific examples. It should be understood that these examples are for illustrative purposes only and are not intended to limit the scope of the present invention. Further, it should be understood that various changes or modifications of the present invention may be made by those skilled in the art after reading the teaching of the present invention, and such equivalents may fall within the scope of the present invention as defined in the appended claims.
The invention provides an ontology construction method based on a Topic Model, comprising the following steps:
Step 1: concepts are extracted from a given document corpus using an LDA model, and concept hierarchy subdivision is then performed to generate the hierarchy of the ontology;
Step 2: a CosTMI similarity measure is designed and used to identify the similarity between the concepts of the hierarchy, i.e. the potential links between concepts of adjacent levels.
These steps involve the following technical innovations:
One) Ontology construction process
FIG. 1 illustrates the ontology construction process. A hierarchy $G = \{T, E\}$ is constructed, where $T = \{t_1, t_2, \ldots, t_m\}$ is a concept set produced by the LDA model, called a concept layer, and defined as the upper-layer concept set; $T' = \{t_1', t_2', \ldots, t_m'\}$ is the sub-concept set, defined as the concept set one level below the upper-layer concept set T; E is a set of edges, and each $e_{ij} \in E$ indicates that the i-th concept $t_i$ in the concept set T is connected by an edge to the j-th concept $t_j'$ in the sub-concept set T'.
To build links between upper-layer and lower-layer concepts, it must be determined which concept level each concept node belongs to, i.e. which nodes belong to the upper-layer concept set and which to the lower-layer concept set; the links between the two layers are comparatively complicated. The boundaries between concepts produced by the LDA model are not sharp, so the concepts must be layered using some measure, and relations between the layers must be established. Some concepts may have several parents, and some may have no children. The more concept layers are generated, the tighter the relations between layers become, so the number of generated layers cannot grow without limit, and the number of levels of the ontology must be set manually.
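The structure described above can be sketched as a small data type. This is a minimal illustration, not the patented implementation: the class name `Hierarchy` and its methods are hypothetical, concepts are modelled as sets of words, and edges are stored as index pairs (i, j) standing for $e_{ij}$.

```python
from dataclasses import dataclass, field


@dataclass
class Hierarchy:
    """Two adjacent concept layers plus the edges between them (hypothetical type)."""
    upper: list                               # T: upper-layer concepts, each a set of words
    lower: list                               # T': lower-layer sub-concepts
    edges: set = field(default_factory=set)   # E: index pairs (i, j) for e_ij

    def connect(self, i: int, j: int) -> None:
        """Record an edge between upper concept i and lower concept j."""
        if not (0 <= i < len(self.upper) and 0 <= j < len(self.lower)):
            raise IndexError("concept index out of range")
        self.edges.add((i, j))

    def parents_of(self, j: int) -> list:
        """Upper concepts linked to lower concept j; there may be several."""
        return sorted(i for (i, jj) in self.edges if jj == j)

    def children_of(self, i: int) -> list:
        """Lower concepts linked to upper concept i; the list may be empty."""
        return sorted(j for (ii, j) in self.edges if ii == i)
```

Storing edges as a set of pairs allows a sub-concept to have several parents and a concept to have no children, both of which the description permits.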
Two) Related rules
Before the automatic ontology learning method is described in detail, two basic rules are defined. The usual practice is to apply the LDA model repeatedly to generate the concept sets needed to build the hierarchy. The invention defines rules that restrict which of the concepts generated by the model are used to build the ontology hierarchy.
Intuitively, concepts at higher levels are more abstract and concepts at lower levels are more concrete; the higher the level, the fewer the concepts, and vice versa. Based on this common sense, the following rules are defined:
Rule 1: if $t_i \in T$, $t_j' \in T'$ and $N_T < N_{T'}$, then the concept set T is at a higher level than the sub-concept set T', where $N_T$ and $N_{T'}$ are the hierarchy levels of the concept set T and the sub-concept set T', respectively.
When concept sets are obtained by learning repeatedly with the LDA model, $N_T < N_{T'}$ must first be established. This rule is therefore essential to the ontology construction method.
Each concept at each layer that LDA learns from the document corpus consists of words that occur in the documents with high frequency, and words that occur with high frequency in a higher-layer concept are very likely to occur with equally high frequency in a lower-layer concept. Identical words would then be linked to each other during ontology construction, which is unreasonable. The following rule is therefore defined:
Rule 2: if $t_i \in T$, $t_j' \in T'$ and $t_i \cap t_j' = \varnothing$, then there is a high likelihood of a higher-lower level relationship between $t_i$ and $t_j'$, where $\varnothing$ is the empty set.
This rule helps define the similarity measure between concepts described below.
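The two rules can be read as simple filters applied before any similarity is computed. The sketch below rests on two assumptions that the text does not spell out: the level values are treated as integer indices in which a smaller value means a higher (more abstract) layer, and the condition of Rule 2, which appears only as an image in the source, is taken to be that the two concepts share no word. All function names are hypothetical.

```python
def satisfies_rule_1(level_upper: int, level_lower: int) -> bool:
    """Rule 1 (as read here): the upper set must have the smaller level index."""
    return level_upper < level_lower


def satisfies_rule_2(t_i, t_j_prime) -> bool:
    """Rule 2 (as read here): the two concepts have no word in common."""
    return set(t_i).isdisjoint(t_j_prime)


def candidate_pairs(upper, lower, level_upper, level_lower):
    """Index pairs (i, j) that pass both rules and may later be linked."""
    if not satisfies_rule_1(level_upper, level_lower):
        return []
    return [
        (i, j)
        for i, t_i in enumerate(upper)
        for j, t_j in enumerate(lower)
        if satisfies_rule_2(t_i, t_j)
    ]
```

For example, an upper concept {"protein", "cell"} is a candidate parent of {"kinase", "receptor"} but not of {"cell", "blood"}, because the latter repeats the word "cell".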
Three) Similarity measurement
The invention uses a similarity measure to construct the hierarchy of the ontology, i.e. links between concepts are established according to the similarity between them. A link is established between two concepts in the two concept layers only when their similarity reaches a certain value; otherwise the two concepts are considered unconnected. To compute the semantic similarity between two concepts, the concept matrix produced by the LDA model while generating the concept sets is used; each matrix entry is the probability of the concept appearing in the ontology.
Similarity between concepts is generally measured using pointwise mutual information (PMI). The invention defines a new semantic similarity measure between words w and w', in which PMI is defined through the expectation over two concepts; each concept consists of a series of words, which is a particular property of the LDA model. The pointwise mutual information of two words w and w' is PMI(w, w'):
$$\mathrm{PMI}(w, w') = \log \frac{P(w, w')}{P(w)\,P(w')}$$
where $P(w, w') = P(w)\,P(w' \mid w)$;
$$P(w) = \sum_{j=1}^{k} P(w \mid z = j)\,P(z = j)$$
where z is the topic, $P(z = j)$ is the probability of topic j, $P(w \mid z = j)$ is the conditional probability of word w given topic j, and k is the number of concepts;
$$P(w' \mid w) = \sum_{j=1}^{k} P(w' \mid z = j)\,P(z = j \mid w)$$
where $P(w' \mid z = j)$ is the conditional probability of word w' given topic j, and $P(z = j \mid w)$ is the conditional probability of topic j given word w.
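The quantities defined above can be computed directly from the parameters of a trained topic model. The following is a minimal sketch under stated assumptions: PMI is taken in its standard form $\log\,[P(w, w') / (P(w)\,P(w'))]$, the posterior $P(z = j \mid w)$ is obtained by Bayes' rule, `topic_prior[j]` stands for $P(z = j)$, and `word_given_topic[j]` is a dictionary holding $P(w \mid z = j)$. These names are illustrative and do not come from the patent.

```python
import math


def word_marginal(w, topic_prior, word_given_topic):
    """P(w) = sum over topics j of P(w | z=j) * P(z=j)."""
    return sum(p_z * word_given_topic[j].get(w, 0.0)
               for j, p_z in enumerate(topic_prior))


def conditional(w_prime, w, topic_prior, word_given_topic):
    """P(w' | w) = sum over topics j of P(w' | z=j) * P(z=j | w)."""
    p_w = word_marginal(w, topic_prior, word_given_topic)
    if p_w == 0.0:
        raise ValueError(f"word {w!r} has zero probability under the model")
    total = 0.0
    for j, p_z in enumerate(topic_prior):
        # Bayes' rule: P(z=j | w) = P(w | z=j) * P(z=j) / P(w)
        posterior = p_z * word_given_topic[j].get(w, 0.0) / p_w
        total += word_given_topic[j].get(w_prime, 0.0) * posterior
    return total


def pmi(w, w_prime, topic_prior, word_given_topic):
    """Pointwise mutual information of two words, using P(w, w') = P(w) P(w' | w)."""
    p_w = word_marginal(w, topic_prior, word_given_topic)
    p_wp = word_marginal(w_prime, topic_prior, word_given_topic)
    if p_wp == 0.0:
        raise ValueError(f"word {w_prime!r} has zero probability under the model")
    joint = p_w * conditional(w_prime, w, topic_prior, word_given_topic)
    if joint == 0.0:
        return float("-inf")  # the two words never share a topic
    return math.log(joint / (p_w * p_wp))
```

With two equally likely topics, one spreading its mass evenly over "a" and "b" and the other concentrated on "c", the words "a" and "b" obtain a PMI of log 2, while "a" and "c" obtain negative infinity because no topic generates both.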
The invention provides this formula for the pointwise mutual information of two words in preparation for subsequently organizing and constructing the hierarchy of concepts in the ontology; the formula can also be used to define the semantic similarity between other concepts.
Each concept generated by LDA corresponds to a concept within the ontology structure. The semantic similarity measure quantifies the semantic similarity of two concepts within the context of a third, specific concept. In the context of the p-th concept $t_p$ of the upper-layer concept set T, the semantic similarity of the two concepts $t_s'$ and $t_r'$ of the lower-layer concept set T' is $\mathrm{CosTMI}(t_s', t_r'; t_p)$:
[Formula image BDA0001464011170000054: definition of CosTMI; not reproduced in the text]
In the formula, $t_p$ contains the word sequence $\{w_{p1}, w_{p2}, \ldots, w_{pn}\}$; $t_s'$ contains the word sequence $\{w_{s'1}, w_{s'2}, \ldots, w_{s'n}\}$; $t_r'$ contains the word sequence $\{w_{r'1}, w_{r'2}, \ldots, w_{r'n}\}$.
A threshold thct is preset: if the value of $\mathrm{CosTMI}(t_s', t_r'; t_p)$ exceeds thct, a relation is established between $t_p$ and $t_s'$, $t_r'$. Through this definition and computation of semantic similarity, the concepts for which relations can be established are the concepts of the constructed ontology. The threshold thct is determined experimentally; a larger CosTMI value indicates greater semantic similarity between two concepts, and a smaller value indicates less.
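The thresholding step can be sketched independently of the similarity formula itself. Because the CosTMI formula is available only as an image in the source, the sketch takes the similarity as a caller-supplied function `similarity(t_s, t_r, t_p)`; this parameter, the function name `link_concepts`, and the representation of edges as index pairs are all assumptions made for illustration. The default threshold of 0.93 is the value reported in the experiments.

```python
from itertools import combinations


def link_concepts(upper, lower, similarity, threshold=0.93):
    """Return edges (p, s) between upper concept p and lower concept s.

    For each upper concept t_p, every pair of lower concepts (t_s, t_r) is
    scored in the context of t_p; when the score exceeds the threshold, both
    lower concepts are linked to t_p.
    """
    edges = set()
    for p, t_p in enumerate(upper):
        for s, r in combinations(range(len(lower)), 2):
            if similarity(lower[s], lower[r], t_p) > threshold:
                edges.add((p, s))
                edges.add((p, r))
    return edges
```

Any function with the same three-argument shape can be passed in, which keeps the linking logic testable with a toy similarity before the real measure is plugged in.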
The validity and practicality of the proposed ontology construction method are verified below using the real GENIA corpus and the GENIA ontology.
The proposed method is verified experimentally using the GENIA ontology that corresponds to the GENIA corpus. The GENIA corpus is a biological corpus containing 1,999 medical vocabularies collected from MeSH, human, and blood cells. The GENIA ontology contains 45 concepts and 42 relations. In the experiment, the GENIA corpus is fed into the LDA model to compute the concepts required for the ontology to be constructed. The AOL method and the method proposed by Zavitsanos et al. were both executed on a PC with a Pentium 4 processor and 2 GB of memory; the thresholds for CosTMI and for the CI method of Zavitsanos et al. were set to 0.93 and $3 \times 10^{-6}$, respectively.
The effectiveness of the proposed algorithm and the quality of the ontology structure are evaluated by recall, accuracy, and the F1 measure. The comparison of the two methods is shown in Table 1.
Table 1. Results for concepts C and relations S based on the similarity measures
[Table image BDA0001464011170000061; not reproduced in the text]
Table 1 shows that the proposed AOL method performs effectively and achieves higher accuracy and recall than the CI method; it can also be applied to ontology construction for other knowledge domains.
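The evaluation measures named above can be computed by comparing the learned concepts or relations with those of the reference ontology. The sketch below assumes that "accuracy" denotes precision in the information-retrieval sense (the fraction of learned items that also appear in the reference) and that items are compared by exact match; the function name and the example items are hypothetical.

```python
def precision_recall_f1(predicted, gold):
    """Precision, recall and F1 of a predicted set against a reference set."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

For instance, if four relations are learned and two of them are among three reference relations, precision is 0.5, recall is 2/3, and F1 is 4/7.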
FIG. 2 shows the effect of the number of words contained in each concept. During the experiments, the number of words per concept affected the accuracy of ontology construction. The results show that if each concept contains fewer than 10 words, accuracy is seriously degraded; conversely, accuracy rises as each concept contains more words. More words are not always better, however: experimental analysis shows that 16 words per concept gives better results. If a concept contains too many words, low-frequency words from the corpus enter the concept; these contribute little to the abstract meaning of the concept in the ontology and can lower the actual quality of the constructed ontology.
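Limiting each concept to a fixed number of words amounts to keeping the most probable words of each topic. The sketch below assumes a topic is given as a dictionary mapping words to $P(w \mid z = j)$ and uses 16 as the default size, following the experimental observation above; the function name `top_words` is hypothetical, and ties are broken alphabetically so the result is deterministic.

```python
def top_words(word_probs: dict, n: int = 16) -> list:
    """Return the n most probable words of one topic, most probable first."""
    ranked = sorted(word_probs.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:n]]
```

Lowering `n` below 10 would correspond to the regime where the experiments report seriously degraded accuracy, while a very large `n` admits the low-frequency words discussed above.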
FIG. 3 details the performance of the algorithm: it shows how the F1 value changes with the number of ontology levels when the threshold thct of the CosTMI measure is 0.93. The F1 value is highest when the number of ontology levels is 7.

Claims (1)

1. An ontology construction method based on a Topic Model, characterized by comprising the following steps:
Step 1: concepts are extracted from a given document corpus using an LDA model, and the extracted concepts form concept sets; concept hierarchy subdivision is then performed to generate the hierarchy G of the ontology, where $T = \{t_1, t_2, \ldots, t_m\}$ is a concept set, defined as the upper-layer concept set; $T' = \{t_1', t_2', \ldots, t_m'\}$ is the sub-concept set, defined as the concept set one level below the upper-layer concept set T; E is a set of edges, and each $e_{ij} \in E$ indicates that the i-th concept $t_i$ in the concept set T is connected by an edge to the j-th concept $t_j'$ in the sub-concept set T'; wherein the following rules are observed when performing concept hierarchy subdivision to generate the hierarchy G of the ontology:
Rule 1: if $t_i \in T$, $t_j' \in T'$ and $N_T < N_{T'}$, then the concept level of the concept set T is higher than that of the sub-concept set T', where $N_T$ and $N_{T'}$ are the hierarchy levels of the concept set T and the sub-concept set T', respectively;
Rule 2: if $t_i \in T$, $t_j' \in T'$ and $t_i \cap t_j' = \varnothing$, then there is a high likelihood of a higher-lower level relationship between $t_i$ and $t_j'$, where $\varnothing$ is the empty set;
Step 2: the CosTMI similarity measure is used to identify the similarity between the concepts in the hierarchy G, i.e. the potential links between concepts of adjacent levels. In the context of the p-th concept $t_p$ of the upper-layer concept set T, the semantic similarity of the s-th concept $t_s'$ and the r-th concept $t_r'$ of the lower-layer concept set T' is $\mathrm{CosTMI}(t_s', t_r'; t_p)$:
[Formula image FDA0003080057400000013: definition of CosTMI; not reproduced in the text]
In the formula, $t_p$ contains the word sequence $\{w_{p1}, w_{p2}, \ldots, w_{pn}\}$; $t_s'$ contains the word sequence $\{w_{s'1}, w_{s'2}, \ldots, w_{s'n}\}$; $t_r'$ contains the word sequence $\{w_{r'1}, w_{r'2}, \ldots, w_{r'n}\}$; PMI() is the pointwise mutual information of two words. For two words w and w', the pointwise mutual information PMI(w, w') is:
$$\mathrm{PMI}(w, w') = \log \frac{P(w, w')}{P(w)\,P(w')}$$
where $P(w, w') = P(w)\,P(w' \mid w)$;
$$P(w) = \sum_{j=1}^{k} P(w \mid z = j)\,P(z = j)$$
where z is the topic, $P(z = j)$ is the probability of topic j, $P(w \mid z = j)$ is the conditional probability of word w given topic j, and k is the number of concepts;
$$P(w' \mid w) = \sum_{j=1}^{k} P(w' \mid z = j)\,P(z = j \mid w)$$
where $P(w' \mid z = j)$ is the conditional probability of word w' given topic j, and $P(z = j \mid w)$ is the conditional probability of topic j given word w.
CN201711112981.9A 2017-11-10 2017-11-10 Ontology construction method based on Topic Model Expired - Fee Related CN107895012B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711112981.9A CN107895012B (en) 2017-11-10 2017-11-10 Ontology construction method based on Topic Model

Publications (2)

Publication Number Publication Date
CN107895012A (en) 2018-04-10
CN107895012B (en) 2021-10-08

Family

ID=61805185

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711112981.9A Expired - Fee Related CN107895012B (en) 2017-11-10 2017-11-10 Ontology construction method based on Topic Model

Country Status (1)

Country Link
CN (1) CN107895012B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11003638B2 (en) * 2018-10-29 2021-05-11 Beijing Jingdong Shangke Information Technology Co., Ltd. System and method for building an evolving ontology from user-generated content
CN113312910B (en) * 2021-05-25 2022-10-25 华南理工大学 Ontology learning method, system, device and medium based on topic model

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105095229A (en) * 2014-04-29 2015-11-25 国际商业机器公司 Method for training topic model, method for comparing document content and corresponding device
US10417301B2 (en) * 2014-09-10 2019-09-17 Adobe Inc. Analytics based on scalable hierarchical categorization of web content
CN106611038A (en) * 2016-07-28 2017-05-03 四川用联信息技术有限公司 Ontology concept-based lexical semantic similarity solving method
CN106228023B (en) * 2016-08-01 2018-08-28 清华大学 A kind of clinical path method for digging based on ontology and topic model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20211008