CN108304488A - A method for automatically learning an ontology using a Topic Model - Google Patents

A method for automatically learning an ontology using a Topic Model Download PDF

Info

Publication number
CN108304488A
CN108304488A (application CN201810009239.3A)
Authority
CN
China
Prior art keywords
concept
ontology
vocabulary
concept set
formula
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810009239.3A
Other languages
Chinese (zh)
Inventor
林志杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Dianji University
Original Assignee
Shanghai Dianji University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Dianji University filed Critical Shanghai Dianji University
Priority to CN201810009239.3A priority Critical patent/CN108304488A/en
Publication of CN108304488A publication Critical patent/CN108304488A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367Ontology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis


Abstract

The present invention provides a method for automatically learning an ontology using a Topic Model. The method supports automatic domain-ontology construction and introduces a measure of semantic similarity between concepts, used to compute the semantic similarity between concepts generated by an LDA model. The method of automatic ontology learning comprises two steps: the first step performs concept identification from a text corpus or web corpus; the second step establishes the relationships between concepts using the semantic-similarity measure CP defined herein. The method requires no seed ontology as an aid for initial ontology learning. Experimental results show that the proposed method of automatic ontology construction using a Topic Model is highly effective.

Description

A method for automatically learning an ontology using a Topic Model
Technical field
The present invention relates to a method of ontology construction that uses a Topic Model to generate basic concept units, so that an ontology can be learned, and thereby constructed, without any ontology seed.
Background technology
Ontology construction has been applied in many fields, such as artificial intelligence, information extraction, and machine translation. But building an ontology manually is extremely time-consuming and labor-intensive; as concepts and domain information continually expand and update, building a large-scale ontology demands ever more manpower, material resources, and effort, as the manually built large-scale ontology seeds such as web directories and WordNet show. There is therefore a strong need to build ontologies automatically, to keep pace with the surge of domain information and to reduce the cost of constructing and maintaining ontologies. Building ontologies automatically by means of computer data analysis and data mining is thus a meaningful line of research, and it has attracted many researchers to extensive in-depth study.
Automatic ontology construction has become a new research field, and many methods have been proposed for it. Ontologies already have many immediate applications, and automatic or semi-automatic construction can help knowledge engineers combine machine-learning techniques to build and extend ontologies, greatly reducing the cost of manual construction and maintenance. Most current ontology-learning methods concentrate on extending and updating an existing ontology seed, extracting concepts or lexical units from document dictionaries to update and expand it. There are also methods for learning an ontology automatically, but most of them are based on ontology construction for a special knowledge domain, such as the SKOS model, and all of them have certain limitations.
There are the method for much learning ontology from text corpus, such as the ontological construction side based on lexico-syntactic Method, these methods are mainly closed using natural language processing technique and existing lexicon resources to learn the is-a between concept System, i.e., so-called Hearst-parterns, but such methods, which have a disadvantage to be exactly that Hearst-parterns is this, needs frequency The vocabulary pattern of numerous appearance will not frequently occur, while he can only handle some very fuzzy lexical semantic relations. The common sense such as P.Cimicano and F.M.Suchanek are gone to extract more using this web search engine of Wikipedia, Wordnet Language mode.
Statistical learning methods based on clustering and classification have also been applied to ontology learning; these methods usually establish conceptual relations using similarity and dissimilarity measures. Their limitation is that clustering- and classification-based ontology learning is difficult to execute. Ontology-learning methods based on information-extraction techniques learn the hierarchical structure of an ontology, but they can only extract very general concepts such as person, place, and animal, together with their sub-concepts.
The Topic Model is a probabilistic model that can identify concepts from scientific publications without any prior knowledge, and its effectiveness has been proven in practice. Topic Models are now widely applied in text mining. Using a Topic Model for ontology learning is a new research approach. Elias Zavitsanos et al. proposed an automatic ontology-learning method based on statistics: it repeatedly reuses the concept sets trained by a Topic Model and then uses conditional independence to judge the connections between the identified concepts, but it cannot establish connections between concepts across two hierarchy levels. Wang Wei et al. proposed two methods, both Semantic-Web-based ontology-learning approaches that combine information theory with Topic Models; they show good recall and precision but need to limit the number of child concept nodes of the nearest root node.
Invention content
The object of the present invention is to provide a method of automatically learning an ontology that can accurately determine the correlation between concepts, learn the ontology without any prior knowledge, and determine during learning both the depth of the ontology and the end point of the ontology-learning process.
To achieve the above object, the technical solution of the present invention is a method of automatically learning an ontology using a Topic Model, characterized by comprising the following steps:
In the first step, concept extraction is performed from a given document corpus using an LDA model; the extracted concepts form concept sets, which are then hierarchically subdivided to produce the hierarchical structure G of the ontology, G = {T, E}, where T = {t1, t2, …, tm} is a concept set, defined as the upper-level concept set; T' = {t1', t2', …, tm'} is the sub-concept set, defined as the concept set one level below the upper-level concept set T, so that T and T' are two successive levels; E is the set of edges, and each eij ∈ E indicates that the i-th concept ti in T is connected by an edge to the j-th concept tj' in T';
In the second step, the CosTMI similarity measure is used to identify the semantic similarity between successive levels in the hierarchical structure G, where CosTMI(ts', tr'; tp) is the semantic similarity of two concepts, the s-th concept ts' and the r-th concept tr' in the lower-level concept set T', in the context of the p-th concept tp in the upper-level concept set T;
In the formula, tp contains the word sequence {wp1, wp2, …, wpn}; ts' contains the word sequence {ws'1, ws'2, …, ws'n}; tr' contains the word sequence {wr'1, wr'2, …, wr'n}; PMI(·) is the point mutual information of two words, and the point mutual information of two words w and w' is PMI(w, w'):
PMI(w, w') = log [ P(w, w') / (P(w) · P(w')) ]
where P(w, w') = P(w) · P(w' | w);
P(w) = Σj=1..k P(z = j) · P(w | z = j), where z is a topic, P(z = j) is the probability that the topic is j, P(w | z = j) is the conditional probability of word w when the topic is j, and k is the number of concepts;
P(w' | w) = Σj=1..k P(w' | z = j) · P(z = j | w), where P(w' | z = j) is the conditional probability of w' when the topic is j, and P(z = j | w) is the conditional probability of topic j when the word is w;
If CosTMI(ts', tr'; tp) exceeds a certain threshold thc, relationships are established between tp and ts', tr';
Third step calculates standard similarity measurement L (ts ', tr ';Tp), In formula, P (ts ' | tp) is (being the probability of the generation of ts ' under tp context vocabulary environment), P (tr ' | tp) be (be on tp Hereafter under vocabulary environment the generation of ts ' probability);
Passing through standard similarity measurement L (ts ', tr ';When tp) defining the relationship between Ontological concept, each pass through Topic Model learn the concept that the concept all corresponds to an ontology, context rings of each concept ts ' or tr ' in tp Conditional probability under border, for calculating the semantic similarity between same layer concept, value is smaller to show that the Semantic Similarity of value is got over It is high;
In the fourth step, the hierarchical structure of the ontology is determined:
Suppose three concept levels Th, Tm, Tl are learned with the Topic Model, where Th is the highest level, Tm the intermediate level, and Tl the lowest level; the entropies of these three variables are denoted H(Th), H(Tm), H(Tl), and H(Tl | Tm) is the conditional entropy in the information area; then the information gain Δ(I(Th, Tm, Tl)) of two successive concept sets is defined as:
Δ(I(Th, Tm, Tl)) = H(Th) - H(Tl | Tm)
When Δ(I(Th, Tm, Tl)) is less than a prescribed threshold ω, learning of concept sets with the LDA model stops.
Preferably, in the first step, the following rules are followed when subdividing concept levels to generate the hierarchical structure G of the ontology:
Rule 1: If ti ∈ T and tj' ∈ T', then NT < NT'; the conclusion is that the sub-concept set T' is more specific than the concept set T, where NT and NT' are the level ranks of the concept set T and the sub-concept set T' respectively;
Rule 2: If ti ∈ T, tj' ∈ T', and ti ∩ tj' = ∅, then a superior-subordinate relationship very likely exists between ti and tj', where ∅ denotes the empty set.
The present invention proposes a new method for automatically learning an ontology from a given text corpus. We use the concepts generated by a widely used probabilistic model, the Topic Model, as the concept units required to build the ontology. Given these concepts, a method is also needed to measure their similarity, in order to define the connections between adjacent upper and lower concept levels of the ontology, that is, to set up edges between concepts and form the hierarchical framework of the ontology. It must be ensured that the learned concepts are related, and that the connections between them are as compact and reasonable as possible. For this purpose we define two similarity measures and propose a new criterion for judging the construction depth of the ontology hierarchy, that is, a new method for deciding when the learning loop terminates.
The present invention generates concept by recycling LDA models i.e. Topic Model models, and definition can be measured accurately generally The measure of Semantic Similarity builds the layer of structure between the concept and concept of ontology between thought.Using true GENIA corpus and the ontology GENIA ontologies verification present invention propose the validity and practicability of body constructing method.
Description of the drawings
Fig. 1 shows the variation of accuracy with the number of topics;
Fig. 2 shows the variation of accuracy with ontology depth under the CP measure;
Fig. 3 shows the variation of accuracy with ontology depth under the L1 measure.
Specific implementation mode
The present invention is further explained below with reference to specific embodiments. It should be understood that these embodiments are intended only to illustrate the present invention, not to limit its scope. In addition, after reading the teachings of the present invention, those skilled in the art can make various changes or modifications, and such equivalent forms likewise fall within the scope defined by the appended claims of this application.
The method of automatically learning an ontology using a Topic Model provided by the present invention generally comprises the following steps:
In the first step, concept extraction is performed from a given document corpus using an LDA model; by repeatedly reusing the model, the concept sets required to build the ontology are learned.
In the second step, the CP similarity measure is designed to identify the similarity between concepts across hierarchy levels, i.e., the potential connections between concepts of adjacent levels; the L1 criterion is formulated to judge the number of levels in the ontology.
In the third step, the validity and practical effect of the ontology-construction method are verified by experiment.
The above steps involve the following technical points:
1) Ontology-construction process
Fig. 1 illustrates the ontology-construction process. A hierarchical structure G is built, G = {T, E}, where T = {t1, t2, …, tm} is a concept set, called a concept level, output by the LDA model and defined as the upper-level concept set; T' = {t1', t2', …, tm'} is the sub-concept set, defined as the concept set one level below T; E is the set of edges, and each eij ∈ E indicates that the i-th concept ti in T is connected by an edge to the j-th concept tj' in T'.
To build the connections between upper- and lower-level concepts, the concept level to which each concept node belongs must be determined: which concepts belong to the higher-level concept set and which to the lower-level one. Establishing the connections between these two levels can be quite complicated. The boundaries between the concepts produced by the LDA model are not especially clear, so a measure is needed to stratify these concepts and to establish the relations between levels. Some concepts may have several parents and some may have no children; the more concept levels are generated, the closer the relations between them become, so the number of generated levels cannot grow without limit, and the number of levels of the ontology would otherwise have to be set manually.
2) Related rules
Before presenting the automatic ontology-learning method, two basic rules are defined. In the general case, the LDA model is applied repeatedly to produce the concept sets used to build the hierarchical structure. The present invention defines rules that constrain the concepts produced by the model for use in building the hierarchical ontology.
Intuitively, the higher the level a concept occupies, the more abstract it is, and conversely the more specific; likewise, the higher the level, the fewer the concepts, and conversely the more. Based on this common sense, the following rule is defined:
Rule 1: If ti ∈ T and tj' ∈ T', then NT < NT'; the conclusion is that the sub-concept set T' is more specific than the concept set T, where NT and NT' are the level ranks of the concept set T and the sub-concept set T' respectively.
When the LDA model is applied repeatedly to learn and generate concept sets, NT < NT' must be determined first. This rule is therefore essential to the ontology-construction method.
Every concept at every level learned from the document corpus with LDA consists of words that occur with high frequency in the documents. A word that occurs with high frequency in a high-level concept set is very likely to occur with equally high frequency in a low-level concept set, so during ontology construction these identical words might establish connections, which is unreasonable. The following rule is therefore defined:
Rule 2: If ti ∈ T, tj' ∈ T', and ti ∩ tj' = ∅, then a superior-subordinate relationship very likely exists between ti and tj', where ∅ denotes the empty set.
This rule supports the similarity measures between concepts defined below.
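Rule 2's disjointness test is straightforward to sketch; the word lists below are hypothetical examples, not concepts from the patent's experiments.

```python
def rule2_candidate(ti, tj_prime):
    """True when ti ∩ tj' = ∅, i.e. the word sets are disjoint and a
    superior-subordinate relationship between the concepts is plausible."""
    return set(ti).isdisjoint(tj_prime)

ti = ["cell", "protein", "gene"]            # upper-level concept
ta = ["receptor", "binding", "ligand"]      # disjoint: candidate link
tb = ["gene", "promoter", "enhancer"]       # shares "gene": no link

print(rule2_candidate(ti, ta))  # True
print(rule2_candidate(ti, tb))  # False
```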
3) Similarity measures
The present invention builds the hierarchical structure of the ontology by means of similarity measures; that is, the connections between concepts are established through the similarity between concepts. Only when two concepts in two concept levels reach a certain similarity value is a connection established between them; otherwise they are considered unconnected. To compute the semantic similarity between two concepts, a concept matrix is generated from the concept sets produced by the LDA model, each matrix entry being the likelihood that a concept appears in the ontology.
The similarity between concepts is usually measured by point mutual information (PMI, Pointwise Mutual Information). The present invention defines a new semantic-similarity measure between words w and w', defining PMI using expectations over the two concepts; each concept consists of a sequence of words, which is a special property of the LDA model. The point mutual information of two words w and w' is PMI(w, w'):
PMI(w, w') = log [ P(w, w') / (P(w) · P(w')) ]
where P(w, w') = P(w) · P(w' | w);
P(w) = Σj=1..k P(z = j) · P(w | z = j), where z is a topic, P(z = j) is the probability that the topic is j, P(w | z = j) is the probability of word w when the topic is j, and k is the number of concepts;
P(w' | w) = Σj=1..k P(w' | z = j) · P(z = j | w), where P(w' | z = j) is the probability of w' when the topic is j, and P(z = j | w) is the conditional probability that the topic is j when the word is w.
The formula for the point mutual information of two words given here prepares for the subsequent organization of the concept hierarchy of the ontology, and the other semantic-similarity measure defined between concepts also uses this formula.
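The topic-model-based PMI decomposition above can be sketched numerically. The small probability tables phi and pz are made-up assumptions standing in for a trained LDA model, and the code implements only the PMI decomposition of the formulas, not the full CosTMI measure.

```python
import numpy as np

# phi[j][w] = P(w | z=j), one row per topic j (k topics, 3 words)
# pz[j]     = P(z=j)
phi = np.array([[0.5, 0.3, 0.2],
                [0.1, 0.2, 0.7]])
pz = np.array([0.6, 0.4])

def p_word(w):
    """P(w) = sum_j P(z=j) * P(w | z=j)."""
    return float(np.dot(pz, phi[:, w]))

def p_topic_given_word(j, w):
    """P(z=j | w) by Bayes' rule from P(w | z=j) and P(z=j)."""
    return pz[j] * phi[j, w] / p_word(w)

def pmi(w, w2):
    """PMI(w, w') = log [ P(w'|w) / P(w') ], with
    P(w'|w) = sum_j P(w'|z=j) * P(z=j|w)."""
    p_w2_given_w = sum(phi[j, w2] * p_topic_given_word(j, w)
                       for j in range(len(pz)))
    return float(np.log(p_w2_given_w / p_word(w2)))

print(pmi(0, 1))  # positive when the words co-occur in the same topics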
Each concept generated by LDA corresponds to a concept inside the ontology. The semantic-similarity measure quantifies the semantic similarity between two concepts in the context of a particular concept. CosTMI(ts', tr'; tp) is the semantic similarity of two concepts, the s-th concept ts' and the r-th concept tr' in the lower-level concept set T', in the context of the p-th concept tp in the upper-level concept set T.
In the formula, tp contains the word sequence {wp1, wp2, …, wpn}; ts' contains the word sequence {ws'1, ws'2, …, ws'n}; tr' contains the word sequence {wr'1, wr'2, …, wr'n}.
A threshold thct is preset; if the value of CosTMI(ts', tr'; tp) exceeds thct, relationships are established between tp and ts', tr'. Through the above definition and the computation of semantic similarity, every concept for which a relationship is established becomes a concept in the ontology structure. The threshold thct is determined experimentally; a larger value indicates a greater semantic similarity between the two concepts, and conversely a smaller one.
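The threshold-based relation building can be sketched generically. The similarity function and scores below are stand-ins, not the patent's actual CosTMI formula; the threshold 0.93 echoes the experimental setting reported later.

```python
def build_edges(tp, lower_concepts, sim, thct):
    """Connect parent tp to each member of every lower-level pair (ts, tr)
    whose similarity sim(ts, tr; tp) exceeds the threshold thct."""
    edges = set()
    n = len(lower_concepts)
    for s in range(n):
        for r in range(s + 1, n):
            if sim(lower_concepts[s], lower_concepts[r], tp) > thct:
                edges.add((tp, lower_concepts[s]))
                edges.add((tp, lower_concepts[r]))
    return edges

# Illustrative precomputed similarity scores for three lower concepts.
scores = {("t1", "t2"): 0.95, ("t1", "t3"): 0.40, ("t2", "t3"): 0.30}
sim = lambda a, b, tp: scores[(a, b)]

print(build_edges("tp", ["t1", "t2", "t3"], sim, thct=0.93))
```

Only the pair (t1, t2) clears the threshold, so tp gains edges to t1 and t2 but not t3.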
L1 standard similarity measure:
In the formula, P(ts' | tp) is the probability of ts' occurring in the context-word environment of tp, and P(tr' | tp) is the probability of tr' occurring in the context-word environment of tp.
When the standard similarity measure L(ts', tr'; tp) is used to define the relationships between ontology concepts, each concept learned by the Topic Model corresponds to a concept of the ontology; the conditional probability of each concept ts' or tr' under the context environment of tp is used to compute the semantic similarity between concepts at the same level, and a smaller value indicates a higher semantic similarity.
Determine the hierarchical structure of ontology:
According to rule 1, the concept of low level is more more specific than high-level concept, and in natural language processing, we can It was found that most abstract concept.Namely those most abstract concepts are the concepts that can not be finely divided again.Therefore study ontology mistake Cheng Zhong, concept cannot be subdivided always, it is proposed that a new method determines the suitable size of ontology, i.e., in given neck The quantity of ontology level is determined in domain knowledge base.
If learning three concept hierarchies Th, Tm, Tl using TopicModel, Th is highest level, and Tm is the intermediate level, Tl is lowest level, the entropys of these three variables is denoted as H (Th), H (Tm), H (Tl), and H (Tl | Tm) is the condition in message area Entropy, then successive two layers of concept set information gain Δ (I (Th, Tm, Tl)) be defined as:
Δ (I (Th, Tm, Tl))=H (Th)-H (Tl | Tm)
When Δ (I (Th, Tm, Tl)) is less than defined threshold value ω, stop utilizing LDA model learning concept set.This is not Equation means that the concept distributional semantic model of Tm and Tl are quite similar when Δ (I (Th, Tm, Tl)) value is less than certain threshold value, LDA concept learnings reach the best expectation of an Ontological concept hierarchy learning, and concept hierarchy quantity is exactly the concept of ontology at this time Level quantity.In actual experiment, we set ω values close to 0.
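The stopping rule Δ(I(Th, Tm, Tl)) = H(Th) - H(Tl | Tm) < ω can be sketched with standard entropy computations; the distributions below are illustrative assumptions, not values from the patent's experiments.

```python
import math

def entropy(p):
    """Shannon entropy H(X) of a distribution p (in bits)."""
    return -sum(x * math.log2(x) for x in p if x > 0)

def conditional_entropy(joint):
    """H(Tl | Tm) from a joint table P(Tm=i, Tl=j)."""
    h = 0.0
    for row in joint:                       # one row per value of Tm
        p_m = sum(row)
        for p in row:
            if p > 0:
                h -= p * math.log2(p / p_m)   # -P(m,l) log P(l|m)
    return h

p_th = [0.5, 0.5]                           # P(Th): top-level distribution
joint_ml = [[0.45, 0.05],                   # P(Tm, Tl): nearly deterministic,
            [0.05, 0.45]]                   # so H(Tl|Tm) is small

gain = entropy(p_th) - conditional_entropy(joint_ml)
omega = 0.01                                # threshold ω, close to 0
stop = gain < omega
print(gain, stop)
```

With this joint table the gain is still well above ω, so subdivision would continue; only when Tm nearly determines Tl and the gain collapses toward 0 does the loop stop.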
The proposed ontology-construction process is verified experimentally with the GENIA corpus and the corresponding GENIA ontology. GENIA is a biological corpus containing 1,999 medical documents, collected using the MeSH terms human and blood cells. The GENIA ontology contains 45 concepts and 42 relations. In the experiments, the GENIA corpus is input to the LDA model to compute the concepts required to build the ontology. We compared the proposed method with the algorithm of Zavitsanos et al.; execution was done on a Pentium 4 PC with 2 GB of memory. We compared CosTMI with the CI method proposed by Zavitsanos et al.; the parameter settings of the three measures CI, CP, and L1 are given in Table 1.
Table 1: parameter settings of the similarity measures
The experimental results are described in detail below. The GENIA ontology hierarchy comprises two different ontologies. Recall, precision, and the F1 measure are used to assess the execution efficiency of the proposed method and the quality of the obtained ontology structure. Recall Rec is computed as:
Rec = nrc / Nr
where nrc is the number of correct learned concepts the algorithm provides and Nr is the total number of concepts computed by the model.
Precision Prec is defined as:
Prec = npc / Np
where npc is the number of correctly learned concepts and Np is the total number of concepts learned by the algorithm.
The F1 measure is computed as:
F1 = 2 · Prec · Rec / (Prec + Rec)
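The three evaluation measures can be sketched directly. The concept counts passed in are illustrative, not the reported experimental numbers; only the 45-concept size of the GENIA ontology comes from the text.

```python
def recall(n_rc, n_r):
    """Rec = n_rc / N_r: correct learned concepts over total concepts."""
    return n_rc / n_r

def precision(n_pc, n_p):
    """Prec = n_pc / N_p: correct concepts over all learned concepts."""
    return n_pc / n_p

def f1(prec, rec):
    """F1 = 2 * Prec * Rec / (Prec + Rec)."""
    return 2 * prec * rec / (prec + rec)

rec = recall(36, 45)      # e.g. 36 of the 45 GENIA concepts recovered
prec = precision(36, 48)  # e.g. 36 correct among 48 learned concepts
print(rec, prec, f1(prec, rec))
```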
The comparison of the two methods on concept learning is shown in Table 2:
Table 2: results of the algorithms' concept learning based on the similarity measures
Table 2 shows that the proposed method AOL performs effectively and can be used for ontology construction in other knowledge domains; its precision and recall are both higher than those of the CI method.
Table 3: results of relation reconstruction under the different similarity measures
As Tables 2 and 3 show, the comparison results are very satisfactory, and the proposed algorithm proves effective for building domain ontologies. The experimental comparison also shows that our algorithm is indeed slightly weaker in the accuracy of concept identification; the reason may be that it still falls short in identifying very specific concepts.
Fig. 1 shows the number of words each concept contains. During the experiments we found that the number of words per concept affects the accuracy of ontology construction. The results show that fewer than 10 words per concept seriously degrades the accuracy; conversely, the more words a concept contains, the higher the accuracy of the constructed ontology. But more is not always better: by experimental testing and analysis, about 16 words per concept gives relatively good results. If a concept contains too many words, low-frequency words from the corpus appear in it; these contribute little to the abstract meaning of the concept in the ontology and instead harm the actual quality of the construction. As the results in Fig. 1 show, the CP and L1 measures proposed in this patent are very effective, which also demonstrates the validity of the defined word-semantic-similarity measure for measuring concept semantic similarity during ontology construction.
We also examined, during ontology construction, the relation between the depth of the ontology hierarchy and the accuracy. The results are shown in Figs. 2 and 3, which mainly present how the accuracy measure F1 changes with ontology depth under the CP and L1 measures. The experimental parameters were set to thcp = 0.93 and thfl = 1.24. The termination criterion of the algorithm is the value of the DMI we defined, set to ω = 0.01. Fig. 2 shows that the accuracy measure F1 reaches its maximum when the ontology depth reaches 7; in Fig. 3, F1 reaches its maximum when the depth reaches 8.
Finally, some factors affecting ontology construction must be mentioned. Automatic ontology construction is an open research field, and there is as yet no fixed standard for assessing the quality and effect of learned ontologies. Moreover, the GENIA ontology used as the comparison baseline in the experiments of this patent was itself built by domain experts, i.e., by a subjective human method, so measuring a subjective method by objective means can be difficult. Also, evaluating an automatically built ontology is more complex and difficult than extending and updating one on the basis of a seed ontology.

Claims (2)

1. a kind of method of automatic study ontology using Topic Model, which is characterized in that include the following steps:
The first step carries out concept extraction using LDA models from given document corpus, is produced generally by the concept being drawn into Set is read, then the hierarchical structure G, G={ T, E } of progress concept hierarchy subdivision generation ontological construction, in formula, T=t1, T2 ..., tm } it is concept set, it is defined as Upper Concept set;T '=t1 ', t2 ' ..., tm ' it is sub- concept set, definition For the next layer of concept set of Upper Concept set T, concept set T and sub- concept set T ' is successive two layers;E is the collection on side It closes, each eij ∈ E indicate that i-th of concept ti and j-th of concept tj ' in sub- concept set T ' in concept set T has side phase Even;
Second step, using CosTMI method for measuring similarity, the Semantic Similarity in identification hierarchical structure G between successive two layers, Wherein, in Upper Concept set T in the context of p-th of concept tp and concept tp, s-th of concept in next layer of concept set T ' Two concepts of ts ' and r-th of concept tr ' semantic similarity CosTMI (ts ', tr ';tp)
In formula, tp includes sequence of words { wp1, wp2 ..., wpn };Ts ' includes sequence of words { ws ' 1, ws ' 2 ..., ws ' n }; Tr ' includes sequence of words { wr ' 1, wr ' 2 ..., wr ' n };PMI () is the point mutual information of two vocabulary, two vocabulary w and w ' Point mutual information be PMI (w, w '), then have:
In formula, P (w, w ')=P (w) P (w ' | w);
In formula, z is theme, the probability that P (z=j) is theme when being j, P (w | z=j) When to be theme be j, the conditional probability of vocabulary w, k is the quantity of concept;
In formula, the conditional probability of P (w ' | z=j) is theme when being j w ', P (z=j | w) is vocabulary when being w, the conditional probability of theme j;
If CosTMI (ts ', tr ';Tp) be more than certain threshold value thc, then in tp and ts ', tr ' opening relationships;
Third step, calculate standard similarity measurement L (ts ', tr ';Tp), In formula, P (ts ' | tp) is (being the probability of the generation of ts ' under tp context vocabulary environment), P (tr ' | tp) be (be on tp Hereafter under vocabulary environment the generation of tr ' probability);
By standard similarity measurement L (ts ', tr ';When tp) defining the relationship between Ontological concept, each pass through Topic Model learns the concept that the concept all corresponds to an ontology, and each concept ts ' or tr ' is under the context environmental of tp Conditional probability, for calculating the semantic similarity between same layer concept, value is smaller to show that the Semantic Similarity of value is higher;
In the fourth step, the hierarchical structure of the ontology is determined:
Suppose three concept layers Th, Tm, Tl are learned with Topic Model, where Th is the highest layer, Tm is the intermediate layer, and Tl is the lowest layer. The entropies of these three variables are denoted H(Th), H(Tm), H(Tl), and H(Tl | Tm) is the conditional entropy of Tl given Tm. The information gain Δ(I(Th, Tm, Tl)) of two successive concept layers is then defined as:
Δ(I(Th, Tm, Tl)) = H(Th) - H(Tl | Tm)
When Δ(I(Th, Tm, Tl)) is smaller than a prescribed threshold ω, stop learning further concept sets with the LDA model.
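The stopping criterion of the fourth step can be sketched with discrete Shannon entropy. The level assignments below are toy samples, not output of Topic Model; only the formula Δ = H(Th) - H(Tl | Tm) and the comparison against ω follow the text:

```python
import math
from collections import Counter

def entropy(samples):
    """Shannon entropy H(X) of a list of outcomes, in nats."""
    n = len(samples)
    return -sum((c / n) * math.log(c / n) for c in Counter(samples).values())

def conditional_entropy(xs, ys):
    """H(X | Y) for paired samples (x_i, y_i)."""
    n = len(xs)
    h = 0.0
    for y in set(ys):
        sub = [x for x, yy in zip(xs, ys) if yy == y]
        h += (len(sub) / n) * entropy(sub)
    return h

# Toy samples of concept assignments at the three layers Th, Tm, Tl.
Th = ["a", "a", "b", "b", "b", "a"]
Tm = ["x", "x", "y", "y", "x", "x"]
Tl = ["p", "p", "q", "q", "p", "p"]

# Delta(I(Th, Tm, Tl)) = H(Th) - H(Tl | Tm), as defined above.
delta = entropy(Th) - conditional_entropy(Tl, Tm)
omega = 0.05  # illustrative value for the threshold omega
print(delta < omega)  # False here: the gain is still large, keep learning
```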
2. The ontology construction method based on Topic Model according to claim 1, characterized in that, in the first step, the following rules are applied when concept-hierarchy subdivision generates the hierarchical structure G of the ontology:
Rule 1: if ti ∈ T and tj' ∈ T', the conclusion is that the sub-concept set T' is finer-grained than the concept set T, where NT and NT' are the hierarchy levels of the concept set T and of the sub-concept set T', respectively;
Rule 2: if ti ∈ T and tj' ∈ T', then a superior-subordinate relationship very likely exists between ti and tj', where ∅ denotes the empty set.
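Rule 2 can be sketched as an overlap test, under the assumption that its condition requires the two concepts' word sets to intersect non-trivially (consistent with the empty-set remark above; the condition itself is not reproduced in the text):

```python
def likely_hypernym_pair(t_i, t_j_sub):
    """Rule 2 sketch: flag a possible superior-subordinate relation between
    an upper concept t_i and a sub-concept t_j' when their word sets share
    at least one vocabulary item. The non-empty-intersection condition is
    an assumption, not the patent's verbatim rule."""
    return len(set(t_i) & set(t_j_sub)) > 0

t_i = ["machine", "learning", "model"]
t_j = ["learning", "rate", "schedule"]
print(likely_hypernym_pair(t_i, t_j))  # True: the concepts share "learning"
```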
CN201810009239.3A 2018-01-04 2018-01-04 A method of utilizing the automatic study ontology of Topic Model Pending CN108304488A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810009239.3A CN108304488A (en) 2018-01-04 2018-01-04 A method of utilizing the automatic study ontology of Topic Model


Publications (1)

Publication Number Publication Date
CN108304488A true CN108304488A (en) 2018-07-20

Family

ID=62868677


Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106547739A (en) * 2016-11-03 2017-03-29 同济大学 A kind of text semantic similarity analysis method
CN107133283A (en) * 2017-04-17 2017-09-05 北京科技大学 A kind of Legal ontology knowledge base method for auto constructing


Non-Patent Citations (1)

Title
ZHIJIE LIN: "Terminological ontology learning based on LDA", The 2017 4th International Conference on Systems and Informatics (ICSAI 2017) *


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20180720