CN103455638A - Behavior knowledge extracting method and device combining reasoning and semi-automatic learning - Google Patents
- Publication number: CN103455638A (also listed as CN201310452292A and CN2013104522928A)
- Authority: CN (China)
- Prior art keywords: behavior, knowledge, template, candidate, behavior knowledge
- Legal status: Pending (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abstract
The invention provides a method and device for extracting behavior knowledge that combine reasoning with semi-automatic learning. Targeting massive open-source text, it uses a small number of behavior knowledge extraction templates together with the semantic associations among behavior knowledge to incrementally obtain behavior preconditions, behavior effects, and temporal relations between behaviors. Each of the three kinds of knowledge is acquired by Bootstrapping, and knowledge reasoning based on the semantic associations among them is applied within the Bootstrapping process. The method improves the efficiency and quality of behavior knowledge extraction and enables automatic behavior modeling and analysis over massive text from different domains.
Description
Technical field
The invention belongs to the field of computer science and technology. It specifically relates to a behavior knowledge extraction method and device that start from a small number of initial extraction templates and combine reasoning with semi-automatic learning to incrementally obtain behavior knowledge from massive text.
Background
Behavior knowledge is an important type of knowledge with significant applications in many fields that involve behavior modeling, analysis, and prediction. With the development and spread of Internet technology, the massive text accumulated online provides data support for behavior knowledge acquisition but also poses serious technical challenges.
Previous work on behavior knowledge extraction generally adopted supervised learning or hand-crafted rules. Representative work includes: Sil et al. ("Extracting action and event semantics from web text," in AAAI Fall Symposium on Common-Sense Knowledge (AAAI-CSK), 2010), who use support vector machines to extract behavior precondition and effect knowledge; and Li et al. ("Automatic construction of domain theory for attack planning," in 2010 IEEE International Conference on Intelligence and Security Informatics (IEEE-ISI), 2010, pp. 65-70), who use hand-crafted templates to extract behavior preconditions and effects. These methods have three main shortcomings: (1) they require a large manually annotated corpus or depend entirely on manually constructed extraction templates, which makes them inefficient; (2) they extract only behavior preconditions and effects and ignore relations between behaviors, in particular the important temporal relations between behaviors; (3) they extract each kind of behavior knowledge in isolation and cannot use the semantic associations among behavior knowledge to let the different kinds expand one another.
Summary of the invention
The technical problem to be solved by the invention is: for massive open-source text, to use a small number of behavior knowledge extraction templates and the semantic associations among behavior knowledge to incrementally obtain from the text three main kinds of behavior knowledge: behavior preconditions, behavior effects, and temporal relations between behaviors.
To solve this problem, the invention proposes a behavior knowledge extraction method comprising the following steps:
S1. Using the co-occurrence relations and semantic relatedness information between templates and behavior knowledge, compute the statistical association between a candidate template and the behavior knowledge set and between candidate behavior knowledge and the template set, and the semantic similarity between candidate behavior knowledge and the behavior knowledge set and between a candidate template and the template set; from these, compute the confidence of the candidate behavior knowledge and candidate templates, and obtain a new behavior knowledge set and template set according to that confidence;
S2. Using the semantic associations among the different kinds of behavior knowledge, expand the behavior knowledge set by knowledge reasoning;
S3. Refine the behavior knowledge, mainly by merging similar items and removing contradictory items, to improve the quality of the extracted behavior knowledge.
According to an embodiment of the invention, step S1 comprises multiple iterations, and each iteration comprises two sub-steps: incrementally obtaining templates and incrementally obtaining behavior knowledge. "Incremental" means that, as iteration proceeds, each round obtains more templates and behavior knowledge than the previous round.
According to an embodiment of the invention, the sub-steps for incrementally obtaining templates are as follows:
S1.1. Based on the behavior knowledge obtained in the previous iteration, obtain a candidate template set from the input text; use the co-occurrence relations between the current behavior knowledge set and each candidate template to compute its statistical association, compute the semantic similarity between the candidate template and the template set obtained in the previous iteration, and from these obtain the confidence of the candidate template.
S1.2. Sort the candidate templates by confidence in descending order and select the top k as the templates obtained in this iteration, where k is the sum of the number of templates from the previous iteration and $n_t$; $n_t$ is the number of templates newly added per iteration, and its value is determined by the specific implementation.
According to an embodiment of the invention, the sub-steps for incrementally obtaining behavior knowledge are as follows:
S1.3. Based on the templates obtained in the current iteration, obtain a candidate behavior knowledge set from the input text; use the co-occurrence relations between the current template set and each candidate behavior knowledge item to compute its statistical association, compute the semantic similarity between the candidate behavior knowledge and the behavior knowledge set obtained in the previous iteration, and from these obtain the confidence of the candidate behavior knowledge.
S1.4. Sort each of the three kinds of behavior knowledge by confidence in descending order and select the top k of each as the behavior knowledge obtained in this iteration, where k is the sum of the number of items of that kind from the previous iteration and $n_k$; $n_k$ is the number of items newly added per iteration for each kind of behavior knowledge, and its value is determined by the specific implementation.
According to an embodiment of the invention, the confidence of templates and behavior knowledge is defined as follows:
[Formulas not reproduced in this text.]
where $C_i(t)$ and $C_i(k)$ denote the confidence of candidate template t and candidate behavior knowledge k in iteration i; $SA_i(\cdot)$ and $SS_i(\cdot)$ denote the statistical association and semantic similarity of a candidate template or behavior knowledge item in iteration i; $\max_{t'} C_i(t')$ and $\max_{k'} C_i(k')$ are the maximum confidence over all templates and over all behavior knowledge in iteration i; and δ is a weight factor with range [0, 1). When δ = 0, the confidence evaluates the reliability of candidate behavior knowledge and templates using the statistical association alone.
According to an embodiment of the invention, the statistical association in iteration i between a candidate template and the behavior knowledge set, and between candidate behavior knowledge and the template set, is computed as follows:
[Formulas not reproduced in this text.]
In the first formula, t denotes a candidate template, $K_{i-1}$ denotes the behavior knowledge set obtained in iteration i-1, and $C_{i-1}(k)$ denotes the confidence of behavior knowledge k in iteration i-1. In the second formula, k denotes candidate behavior knowledge, $T_i$ is the template set of the current iteration, and $C_i(t)$ is the confidence of template t in the current iteration.
According to an embodiment of the invention, the semantic similarity between a candidate template t and the template set $T_{i-1}$ obtained in the previous iteration is computed as follows:
[Formula not reproduced in this text.]
where Sim(t, e) denotes the degree of semantic similarity between templates t and e.
The semantic similarity between candidate behavior knowledge k and the behavior knowledge set $K_{i-1}$ obtained in the previous iteration is computed as follows:
[Formula not reproduced in this text.]
where Sim(k, e) denotes the degree of semantic similarity between behavior knowledge items k and e.
According to an embodiment of the invention, the behavior knowledge in step S2 comprises three kinds: behavior preconditions, behavior effects, and temporal relations between behaviors.
The mutual inference rules among the kinds of behavior knowledge are:
$$\mathrm{Effect}(a_1, s) \wedge \mathrm{Precondition}(a_2, s) \Rightarrow \text{Temporal-relation}(a_1, a_2)$$
$$\mathrm{Precondition}(a_2, s) \wedge \text{Temporal-relation}(a_1, a_2) \Rightarrow \mathrm{Effect}(a_1, s)$$
$$\mathrm{Effect}(a_1, s) \wedge \text{Temporal-relation}(a_1, a_2) \Rightarrow \mathrm{Precondition}(a_2, s)$$
where $a_1$ and $a_2$ denote behaviors, s denotes a state, $\mathrm{Effect}(a_1, s)$ means that s is an effect of $a_1$, $\mathrm{Precondition}(a_2, s)$ means that s is a precondition of $a_2$, and $\text{Temporal-relation}(a_1, a_2)$ means that $a_1$ occurs before $a_2$.
According to an embodiment of the invention, after each iteration ends, the three behavior knowledge sets are expanded on the basis of the precondition, effect, and temporal relation sets obtained in that iteration, as follows. First, for each behavior precondition $(a_2, s)$, check whether state s appears in the effect set; if it does, then for every behavior $a_1$ that has s as an effect, add the behavior pair $(a_1, a_2)$ to the candidate temporal relation set. Second, examine each behavior pair $(a_1, a_2)$ in the temporal relation set: if $(a_1, s)$ is in the effect set, add $(a_2, s)$ to the candidate precondition set; if $(a_2, s)$ is in the precondition set, add $(a_1, s)$ to the candidate effect set. Finally, for each behavior knowledge item k in the candidate precondition, effect, and temporal relation sets, if k is also a candidate obtained on the basis of statistical association, set the confidence of k to 1 and add k to the corresponding behavior knowledge set.
According to an embodiment of the invention, in step S3, merging similar items merges the redundant behaviors, behavior preconditions, and effects obtained during preprocessing; removing contradictory items applies to the temporal relations between behaviors obtained in each iteration of the Bootstrapping step and removes behavior pairs that contradict each other.
In addition, the invention provides a behavior knowledge extraction device comprising the following modules:
A first module, which uses the co-occurrence relations and semantic relatedness information between templates and behavior knowledge to compute the statistical association of candidate behavior knowledge and templates, and the semantic similarity between candidate behavior knowledge and the behavior knowledge set and between a candidate template and the template set; from these it computes the confidence of the candidate behavior knowledge and templates, and obtains new behavior knowledge and templates according to that confidence;
A second module, which uses the semantic associations among the different kinds of behavior knowledge to expand the behavior knowledge by knowledge reasoning;
A third module, which merges similar items and removes contradictory items in the behavior knowledge obtained by the first module, improving the quality of the extracted behavior knowledge.
Compared with the prior art, the proposed method and device use both statistical and semantic information, and combine implicit behavior knowledge acquisition based on knowledge reasoning with explicit behavior knowledge acquisition based on text information extraction. They therefore have clear advantages over existing methods in the effectiveness and reliability of behavior knowledge extraction and in suitability for large-scale text:
A large amount of behavior knowledge is obtained incrementally from a small number of initial extraction templates, which suits behavior knowledge extraction from massive text;
Knowledge reasoning and the Bootstrapping technique are combined, which markedly improves the performance of behavior knowledge extraction;
The Bootstrapping step uses statistical association information and semantic similarity information to estimate the confidence of knowledge, which effectively improves the reliability of behavior knowledge extraction.
Brief description of the drawings
Fig. 1 is a flowchart of the behavior knowledge extraction method proposed by the invention.
Detailed description
To make the purpose, technical solution, and advantages of the invention clearer, the invention is described in further detail below with reference to specific embodiments and the accompanying drawing.
Fig. 1 shows the flowchart of the behavior knowledge extraction method combining reasoning and semi-automatic learning. As shown in Fig. 1, the method comprises the following steps:
S1. Bootstrapping step based on statistical association and semantic similarity.
Specifically, this step uses the co-occurrence relations and semantic relatedness information between templates and behavior knowledge to compute the statistical association between a candidate template and the behavior knowledge set and between candidate behavior knowledge and the template set, and the semantic similarity between candidate behavior knowledge and the behavior knowledge set and between a candidate template and the template set. From these it computes the confidence of the candidate behavior knowledge and candidate templates, and finally obtains a new behavior knowledge set and template set according to the confidence.
The Bootstrapping step refers to the statistical-learning process of starting from a small number of given initial templates and refining the result step by step through iteration. A template is a syntactic pattern for extracting behavior knowledge. For example, the sentence "The terrorists use fertilizer to make explosives." matches the precondition template "need|use <Precondition> to <Verb> <Object>", yielding the precondition knowledge that "fertilizer" is a precondition of "make explosives". The co-occurrence relation refers to how a single template and a single behavior knowledge item occur together, and is measured by non-negative pointwise mutual information (described in detail below).
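The template-matching example above can be illustrated with a minimal sketch. It assumes a flat regular expression in which every slot is a single word; the patent describes templates as syntactic patterns applied to parsed text, so the regex, the names `PRECONDITION_TEMPLATE` and `extract_precondition`, and the one-word slots are all hypothetical simplifications.

```python
import re

# Hypothetical flat-regex rendering of the precondition template
# "need|use <Precondition> to <Verb> <Object>"; each slot is one word here.
PRECONDITION_TEMPLATE = re.compile(
    r"\b(?:need|use)\s+(?P<precondition>\w+)\s+to\s+(?P<verb>\w+)\s+(?P<object>\w+)"
)


def extract_precondition(sentence):
    """Return (behavior, precondition) if the template matches, else None."""
    match = PRECONDITION_TEMPLATE.search(sentence)
    if match is None:
        return None
    # A behavior is expressed in verb + object form.
    behavior = f"{match['verb']} {match['object']}"
    return behavior, match["precondition"]


print(extract_precondition("The terrorists use fertilizer to make explosives."))
# -> ('make explosives', 'fertilizer')
```

A parse-based matcher would handle multi-word noun phrases and inflected verbs, which this sketch does not attempt.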
The semantic relatedness information is derived from the semantic hierarchy between two words in a semantic dictionary (such as WordNet or Tongyici Cilin): the semantic similarity of the two words is computed, and from it the semantic similarity between a template and a template (set) and between behavior knowledge and behavior knowledge (a set). The statistical association of candidate behavior knowledge and templates is measured by the non-negative pointwise mutual information between a single behavior knowledge item and a single template together with the corresponding confidence. The semantic similarity between candidate behavior knowledge and the behavior knowledge set, and between a candidate template and the template set, is measured by the semantic similarity between single behavior knowledge items or single templates together with the corresponding confidence.
S2. Behavior knowledge reasoning step.
This step uses the semantic associations among the different kinds of behavior knowledge to expand the behavior knowledge set by knowledge reasoning. Knowledge reasoning is the process of inferring new behavior knowledge from existing behavior knowledge through the semantic associations among behavior knowledge.
S3. Behavior knowledge refinement step.
"Refinement" means merging similar behavior knowledge and removing contradictory behavior knowledge. This step merges similar items and removes contradictory items in the behavior knowledge obtained in the text preprocessing stage and in the Bootstrapping step, improving the quality of the extracted behavior knowledge. Text preprocessing takes place before the Bootstrapping step: natural language processing tools perform word segmentation, part-of-speech tagging, and syntactic parsing on the massive open-source text, and the states expressed by noun phrases and the behaviors expressed in verb + object form are identified from the parsing results.
Each step is described in detail below.
S1. Bootstrapping step based on statistical association and semantic similarity.
This step comprises multiple iterations; the number of iterations can be chosen according to the specific implementation. Each iteration of the Bootstrapping step mainly comprises two sub-steps: incrementally obtaining templates and incrementally obtaining behavior knowledge. "Incremental" means that, as iteration proceeds, each round obtains more templates and behavior knowledge than the previous round.
The sub-steps for incrementally obtaining templates are as follows:
S1.1. Based on the behavior knowledge obtained in the previous iteration, obtain a candidate template set from the input text; use the co-occurrence relations between the current behavior knowledge set and each candidate template to compute its statistical association, compute the semantic similarity between the candidate template and the template set obtained in the previous iteration, and from these obtain the confidence of the candidate template.
S1.2. Sort the candidate templates by confidence in descending order and select the top k as the templates obtained in this iteration, where k is the sum of the number of templates from the previous iteration and $n_t$; $n_t$ is the number of templates newly added per iteration, and its value is determined by the specific implementation.
Incrementally obtaining behavior knowledge is similar to incrementally obtaining templates and comprises:
S1.3. Based on the templates obtained in the current iteration, obtain a candidate behavior knowledge set from the input text; use the co-occurrence relations between the current template set and each candidate behavior knowledge item to compute its statistical association, compute the semantic similarity between the candidate behavior knowledge and the behavior knowledge set obtained in the previous iteration, and from these obtain the confidence of the candidate behavior knowledge.
S1.4. Sort each of the three kinds of behavior knowledge by confidence in descending order and select the top k of each as the behavior knowledge obtained in this iteration, where k is the sum of the number of items of that kind from the previous iteration and $n_k$; $n_k$ is the number of items newly added per iteration for each kind of behavior knowledge, and its value is determined by the specific implementation.
The behavior knowledge comprises three kinds: behavior precondition knowledge, behavior effect knowledge, and temporal relation knowledge between behaviors.
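The selection rule in S1.2 and S1.4 (keep the top k candidates, with k equal to the previous round's count plus the per-round increment) can be sketched as below. The function name `select_top_k` and the dictionary representation of candidates are assumptions made for illustration only.

```python
def select_top_k(candidates, previous_count, n_new):
    """Keep the k highest-confidence candidates, k = previous_count + n_new.

    candidates: dict mapping a template (or knowledge item) to its confidence.
    previous_count: number of items kept in the previous iteration.
    n_new: per-iteration increment (n_t for templates, n_k for knowledge).
    """
    k = previous_count + n_new
    ranked = sorted(candidates.items(), key=lambda pair: pair[1], reverse=True)
    return dict(ranked[:k])


# Toy example: 2 templates kept last round, 1 new template allowed per round.
scores = {"t1": 0.9, "t2": 0.4, "t3": 0.7, "t4": 0.1}
kept = select_top_k(scores, previous_count=2, n_new=1)
```

Because k grows by a fixed increment each round, the kept set grows with every iteration, which is what the text calls incremental acquisition.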
The confidence of candidate templates and behavior knowledge is computed from two kinds of information: statistical association (SA) and semantic similarity (SS). Specifically, the confidence of templates and behavior knowledge is defined as follows:
[Formulas not reproduced in this text.]
Here, $C_i(t)$ and $C_i(k)$ denote the confidence of candidate template t and candidate behavior knowledge k in iteration i; $SA_i(\cdot)$ and $SS_i(\cdot)$ denote the statistical association and semantic similarity of a candidate template or behavior knowledge item in iteration i; $\max_{t'} C_i(t')$ and $\max_{k'} C_i(k')$ are the maximum confidence over all templates and over all behavior knowledge in iteration i, used for normalization. δ is a weight factor with range [0, 1); when δ = 0, the confidence evaluates the reliability of candidate behavior knowledge and templates using the statistical association alone. Initially, the confidence of the templates is set to 1.
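Since the confidence formulas themselves are not available in this text, the following is only a sketch of one combination that is consistent with the description: a convex mix of SA and SS weighted by δ, followed by division by the maximum score. Both the linear form `(1 - delta) * sa + delta * ss` and the helper names are assumptions, not the patent's definition.

```python
def combine_confidence(sa, ss, delta=0.5):
    """Assumed convex combination of statistical association and semantic similarity.

    With delta = 0 only the statistical association contributes, matching the text.
    """
    if not 0.0 <= delta < 1.0:
        raise ValueError("delta must lie in [0, 1)")
    return (1.0 - delta) * sa + delta * ss


def normalize_by_max(scores):
    """Divide every score by the largest one (assumed role of the max term)."""
    top = max(scores.values(), default=0.0)
    if top <= 0.0:
        return dict(scores)
    return {item: value / top for item, value in scores.items()}


raw = {"t1": combine_confidence(0.8, 0.4), "t2": combine_confidence(0.2, 0.4)}
confidence = normalize_by_max(raw)
```

The half-open range [0, 1) for δ means the statistical term can never be switched off entirely under this assumed form.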
The computation of the statistical association and the semantic similarity of behavior knowledge and templates is described below.
(1) Statistical association
The statistical association is computed from the co-occurrence relations between templates and behavior knowledge, and measures the association between a candidate template and the behavior knowledge set and between candidate behavior knowledge and the template set. To compute the statistical association between a single behavior knowledge item and a single template, the invention defines the non-negative pointwise mutual information ($PMI^+$):
[Formula not reproduced in this text.]
where k denotes a single behavior knowledge item and t denotes a single template. P(k), P(t), and P(k, t) denote the probability that knowledge k occurs, the probability that template t occurs, and the probability that k and t occur together. The value of $PMI^+$ is always non-negative, which prevents the large-magnitude negative values that conventional pointwise mutual information (PMI) can produce from distorting the statistical confidence computation.
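The exact definition of the non-negative pointwise mutual information is not recoverable from this text. A common way to obtain a non-negative variant is to clip conventional PMI at zero (often called positive PMI); the sketch below uses that clipping as an explicit assumption, and `pmi_plus` is a hypothetical name.

```python
import math


def pmi_plus(p_k, p_t, p_kt):
    """Assumed non-negative PMI: max(0, log(P(k, t) / (P(k) * P(t)))).

    p_k, p_t: probabilities that knowledge k and template t occur.
    p_kt: probability that k and t occur together.
    """
    if p_k <= 0.0 or p_t <= 0.0 or p_kt <= 0.0:
        # No co-occurrence evidence: treat the association as zero.
        return 0.0
    return max(0.0, math.log(p_kt / (p_k * p_t)))


strong = pmi_plus(0.1, 0.1, 0.05)   # co-occur more often than chance
weak = pmi_plus(0.5, 0.5, 0.1)      # co-occur less often than chance, clipped to 0
```

Clipping removes the large negative values that plain PMI gives to pairs that rarely co-occur, which is the effect the text attributes to the non-negative variant.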
In each iteration, templates are selected first and the resulting templates are then used to select behavior knowledge. Therefore, the statistical association of candidate behavior knowledge is computed with the templates (and their confidence) in the template set obtained in the current iteration, whereas the statistical association of a candidate template is computed with the knowledge (and its confidence) in the behavior knowledge set obtained in the previous iteration.
The statistical association in iteration i between a candidate template and the behavior knowledge set, and between candidate behavior knowledge and the template set, is computed as follows:
[Formulas (4) and (5) are not reproduced in this text.]
In formula (4), t denotes a candidate template, $K_{i-1}$ denotes the behavior knowledge set obtained in iteration i-1, and $C_{i-1}(k)$ denotes the confidence of behavior knowledge k in iteration i-1. In formula (5), k denotes candidate behavior knowledge, $T_i$ is the template set of the current iteration, and $C_i(t)$ is the confidence of template t in the current iteration.
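Formulas (4) and (5) are missing here, but the text states that the set-level association is built from the pairwise non-negative PMI and the confidence of the already-accepted items. One plausible reading is a confidence-weighted average; the averaging form and the name `statistical_association` are assumptions for illustration.

```python
def statistical_association(candidate, trusted, pairwise_pmi):
    """Assumed confidence-weighted average of pairwise PMI+ values.

    candidate: the candidate template (or knowledge item) being scored.
    trusted: dict mapping each accepted knowledge item (or template) to its confidence.
    pairwise_pmi: callable (candidate, item) -> non-negative PMI value.
    """
    if not trusted:
        return 0.0
    total = sum(pairwise_pmi(candidate, item) * conf for item, conf in trusted.items())
    return total / len(trusted)


# Toy PMI+ table between one candidate template and two accepted knowledge items.
toy_pmi = {("t", "k1"): 2.0, ("t", "k2"): 0.0}
score = statistical_association(
    "t", {"k1": 1.0, "k2": 0.5}, lambda c, item: toy_pmi[(c, item)]
)
```

For a candidate template the trusted set would be $K_{i-1}$ with confidences $C_{i-1}(k)$; for candidate knowledge it would be $T_i$ with confidences $C_i(t)$, following the paragraph above.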
(2) Semantic similarity
The semantic similarity between behavior knowledge items and the semantic similarity between templates are computed along the same lines: first compute the semantic similarity between words, then the semantic similarity between behaviors and between states (including behavior preconditions and effects), and finally the semantic similarity between a template and a template (set) and between behavior knowledge and behavior knowledge (a set).
The invention uses the semantic hierarchy in a general-purpose semantic dictionary (such as WordNet or Tongyici Cilin) to compute the semantic similarity between two words $w_1$ and $w_2$, in the following form:
[Formula not reproduced in this text.]
In the formula, $D(w_1, w_2)$ is the semantic distance between $w_1$ and $w_2$ in the general-purpose semantic dictionary: if $w_1$ and $w_2$ are synonyms, $D(w_1, w_2) = 0$; if they are in a parent-child relation, $D(w_1, w_2) = 1$, and so on; if there is no hypernym-hyponym relation between $w_1$ and $w_2$, $D(w_1, w_2) = \infty$.
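The word-level similarity formula is not available in this text; only the distance $D$ is defined. A minimal sketch, assuming the similarity decreases as $1 / (1 + D)$, shows how the three distance cases map to similarity values. The functional form and the name `word_similarity` are assumptions.

```python
import math


def word_similarity(distance):
    """Assumed mapping from semantic distance D to similarity: 1 / (1 + D).

    distance: 0 for synonyms, 1 for parent-child, 2 for grandparent, and so on;
    math.inf when the two words have no hypernym-hyponym relation.
    """
    if distance < 0:
        raise ValueError("distance must be non-negative")
    if math.isinf(distance):
        return 0.0
    return 1.0 / (1.0 + distance)


synonyms = word_similarity(0)            # 1.0
parent_child = word_similarity(1)        # 0.5
unrelated = word_similarity(math.inf)    # 0.0
```

Any monotonically decreasing function of D with value 0 at infinity would fit the description equally well; this one is chosen only for simplicity.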
The semantic similarity between states $s_1$ and $s_2$ is defined as the semantic similarity $Sim(n_1, n_2)$ between their head nouns $n_1$ and $n_2$. The semantic similarity between behavior $a_1$ (verb $v_1$ + object $o_1$) and behavior $a_2$ (verb $v_2$ + object $o_2$) is the product of $Sim(v_1, v_2)$ and $Sim(o_1, o_2)$.
The semantic similarity between single behavior knowledge items $k_1$ and $k_2$ is computed in two cases. If they are precondition or effect knowledge (of the form "behavior a - state s"), their similarity is the product of $Sim(s_1, s_2)$ and $Sim(a_1, a_2)$. If $k_1$ and $k_2$ are temporal relations between behaviors (of the form "behavior $a_1$ - behavior $a_2$"), their semantic similarity is $Sim(a_1, a_2)$. To compute the semantic similarity between templates, first check whether the syntactic structures represented by the two templates are the same. If they are, the semantic similarity of the two templates is the product of the semantic similarities between the words at the same positions in the syntax tree; if they are not, the semantic similarity is 0.
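The product rules above for behaviors, for precondition or effect knowledge, and for templates can be sketched as follows. The tuple representations (a behavior as `(verb, object)`, a knowledge item as `(behavior, head_noun)`, a template as `(structure, words)`) and the injected `word_sim` callable are hypothetical choices for illustration; the temporal relation case is left out because the text does not spell out how the two behavior pairs are compared.

```python
def behavior_similarity(a1, a2, word_sim):
    """Behaviors are (verb, object) pairs; similarity is the product of both parts."""
    return word_sim(a1[0], a2[0]) * word_sim(a1[1], a2[1])


def state_knowledge_similarity(k1, k2, word_sim):
    """Precondition/effect knowledge as (behavior, state head noun) pairs."""
    return behavior_similarity(k1[0], k2[0], word_sim) * word_sim(k1[1], k2[1])


def template_similarity(t1, t2, word_sim):
    """Templates as (syntactic structure label, tuple of words at fixed positions)."""
    structure1, words1 = t1
    structure2, words2 = t2
    if structure1 != structure2 or len(words1) != len(words2):
        return 0.0  # different syntactic structures are defined to have similarity 0
    product = 1.0
    for w1, w2 in zip(words1, words2):
        product *= word_sim(w1, w2)
    return product


def toy_word_sim(w1, w2):
    """Toy word similarity: identical words 1.0, one listed near-synonym pair 0.5."""
    if w1 == w2:
        return 1.0
    return 0.5 if {w1, w2} == {"make", "build"} else 0.0


sim = behavior_similarity(("make", "bomb"), ("build", "bomb"), toy_word_sim)
```

Because every level multiplies similarities, a single unrelated word pair drives the whole score to 0.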
Building on the semantic similarity between single behavior knowledge items and between single templates, and following an approach similar to the statistical association computation, the semantic similarity between a candidate template t and the template set $T_{i-1}$ obtained in the previous iteration is computed as follows:
[Formula not reproduced in this text.]
where Sim(t, e) denotes the degree of semantic similarity between templates t and e. Similarly, the semantic similarity between candidate behavior knowledge k and the behavior knowledge set $K_{i-1}$ obtained in the previous iteration is computed as follows:
[Formula not reproduced in this text.]
where Sim(k, e) denotes the degree of semantic similarity between behavior knowledge items k and e. Unlike the statistical association computation, the semantic similarity of both candidate behavior knowledge and candidate templates is computed against the behavior knowledge set and template set obtained in the previous iteration.
S2. Behavior knowledge reasoning step.
The invention uses the semantic associations among behavior knowledge to obtain implicit behavior knowledge, automatically expanding the behavior knowledge set obtained in each iteration of the Bootstrapping step.
Specifically, behavior precondition and effect knowledge can be used to expand the temporal relation set, behavior precondition and temporal relation knowledge to expand the behavior effect set, and behavior effect and temporal relation knowledge to expand the behavior precondition set. The mutual inference rules among precondition, effect, and temporal relation knowledge are:
$$\mathrm{Effect}(a_1, s) \wedge \mathrm{Precondition}(a_2, s) \Rightarrow \text{Temporal-relation}(a_1, a_2)$$
$$\mathrm{Precondition}(a_2, s) \wedge \text{Temporal-relation}(a_1, a_2) \Rightarrow \mathrm{Effect}(a_1, s)$$
$$\mathrm{Effect}(a_1, s) \wedge \text{Temporal-relation}(a_1, a_2) \Rightarrow \mathrm{Precondition}(a_2, s)$$
where $a_1$ and $a_2$ denote behaviors, s denotes a state, $\mathrm{Effect}(a_1, s)$ means that s is an effect of $a_1$, $\mathrm{Precondition}(a_2, s)$ means that s is a precondition of $a_2$, and $\text{Temporal-relation}(a_1, a_2)$ means that $a_1$ occurs before $a_2$.
After each iteration ends, the three behavior knowledge sets are expanded on the basis of the precondition, effect, and temporal relation sets obtained in that iteration, as follows. First, for each behavior precondition $(a_2, s)$, check whether state s appears in the effect set; if it does, then for every behavior $a_1$ that has s as an effect, add the behavior pair $(a_1, a_2)$ to the candidate temporal relation set. Second, examine each behavior pair $(a_1, a_2)$ in the temporal relation set: if $(a_1, s)$ is in the effect set, add $(a_2, s)$ to the candidate precondition set; if $(a_2, s)$ is in the precondition set, add $(a_1, s)$ to the candidate effect set. Finally, for each behavior knowledge item k in the candidate precondition, effect, and temporal relation sets, if k is also a candidate obtained on the basis of statistical association, set the confidence of k to 1 and add k to the corresponding behavior knowledge set.
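The expansion procedure of step S2 can be sketched as plain set operations. The sketch assumes preconditions and effects are stored as `(behavior, state)` pairs and temporal relations as `(earlier, later)` pairs; the function names and these representations are illustrative assumptions.

```python
def infer_candidates(preconditions, effects, temporal):
    """Derive candidate knowledge from the three sets, following the described steps.

    preconditions, effects: sets of (behavior, state) pairs.
    temporal: set of (a1, a2) pairs meaning a1 occurs before a2.
    Returns (candidate_preconditions, candidate_effects, candidate_temporal).
    """
    # Effect(a1, s) and Precondition(a2, s) suggest a1 occurs before a2.
    cand_temporal = {
        (a1, a2)
        for (a2, s) in preconditions
        for (a1, s_eff) in effects
        if s == s_eff
    }
    # Effect(a1, s) and a1 before a2 suggest s is a precondition of a2.
    cand_pre = {(a2, s) for (a1, a2) in temporal for (b, s) in effects if b == a1}
    # Precondition(a2, s) and a1 before a2 suggest s is an effect of a1.
    cand_eff = {(a1, s) for (a1, a2) in temporal for (b, s) in preconditions if b == a2}
    return cand_pre, cand_eff, cand_temporal


def promote(inferred, statistical_candidates, knowledge, confidence):
    """Accept inferred items that were also extracted statistically, with confidence 1."""
    for item in inferred & statistical_candidates:
        knowledge.add(item)
        confidence[item] = 1.0


pre = {("make explosives", "fertilizer")}
eff = {("buy fertilizer", "fertilizer")}
cand_pre, cand_eff, cand_temporal = infer_candidates(pre, eff, set())
```

Requiring an inferred item to also appear among the statistically extracted candidates acts as a cross-check between the two acquisition routes before the item is accepted.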
S3, behavior knowledge refinement step.
The refinement of behavior knowledge comprises the merging of similar situation and the removal of contradiction situation.
Wherein, merge similar situation and occur in the pretreatment stage to input text, mainly for behavior and state (comprising behavior prerequisite and result);
Remove the contradiction situation and be for every and take turns the behavior knowledge that iteration obtains, mainly for the sequential relationship between behavior.
Similar cases are merged on the basis of a general semantic dictionary: for two behaviors a_1 and a_2 in the behavior set (each in verb + object form), check whether their verb parts or object parts are synonyms; if they are synonyms of each other, the two behaviors are merged. The states in the state set are merged in the same way. To remove contradictory cases, check whether the sequential relationship set contains both behavior pairs (a_1, a_2) and (a_2, a_1); if it does, remove both (a_1, a_2) and (a_2, a_1).
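A minimal Python sketch of the two refinement operations is given below. The `canonical` mapping is a hypothetical stand-in for the general semantic dictionary (it maps a word to a representative synonym), and treating two behaviors as mergeable when both their verb and their object map to the same representatives is an assumption of this sketch, not a statement of the described method.

```python
from typing import Dict, Iterable, List, Set, Tuple

Behavior = Tuple[str, str]  # (verb, object)


def merge_similar(behaviors: Iterable[Behavior],
                  canonical: Dict[str, str]) -> List[Behavior]:
    """Merge behaviors whose verb and object map to the same representatives."""
    seen: Set[Behavior] = set()
    merged: List[Behavior] = []
    for verb, obj in behaviors:
        key = (canonical.get(verb, verb), canonical.get(obj, obj))
        if key not in seen:
            seen.add(key)
            merged.append(key)
    return merged


def remove_contradictions(sequential: Set[Behavior]) -> Set[Behavior]:
    """Drop every pair (a1, a2) whose reverse (a2, a1) is also present."""
    return {(a1, a2) for (a1, a2) in sequential if (a2, a1) not in sequential}
```

Both pairs of a contradictory couple are dropped, so the surviving set never orders the same two behaviors in both directions.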
The technical scheme proposed by the present invention is further illustrated below with a specific embodiment.
In this embodiment, online news reports related to the Al-Qaeda terrorist organization are used as input. The input text consists of 26699 news web pages from epoch online, BBC, USA Today, the New York Times, Guardian, Washington Post, and Los Angeles Times. To guarantee the quality of the input text, only sentences whose character length is between 4 and 80 are retained, which finally yields 801570 sentences from the input text.
These input texts are first preprocessed: the initial behavior set and state set are generated from the syntactic analysis results, and knowledge refinement is applied to each of the two sets to remove redundant behaviors and states. Then a small number of initial behavior prerequisite and result extraction templates are set, and the confidence level of these initial templates is set to 1. The initial prerequisite and result templates used in this embodiment are as follows:
Prerequisite templates:
1. need|use <Precondition> to <Verb> <Object>
2. have|possess <Precondition> need to <Verb> <Object>
3. <Precondition> [that could|could] be used to|for|in <Verb> <Object>
4. use <Precondition> to <Verb> <Object>
5. can <Verb> <Object>, use <Precondition>
6. be|to <Verb> <Object> use <Precondition>
Result templates:
1. <Verb> <Object> [in order] to have <Effect>
2. cause|obtain <Effect> by <Verb> <Object>
3. <Verb> <Object> [,] cause|obtain <Effect>
4. <Effect> be caused|obtained by <Verb> <Object>
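To illustrate how such a template can be applied to text, the following Python sketch turns a template into a regular expression. It rests on several assumptions that are not taken from the embodiment: the helper name `template_to_regex` is hypothetical, every slot is matched by a single word, matching is done on plain (already lemmatized) text rather than on syntactic analysis results, and optional parts in square brackets are not supported.

```python
import re


def template_to_regex(template: str) -> re.Pattern:
    """Compile a bracket-free template such as
    'need|use <Precondition> to <Verb> <Object>' into a regex."""
    parts = []
    for token in template.split():
        if token.startswith("<") and token.endswith(">"):
            # A slot becomes a named group that captures one word.
            name = token[1:-1]
            parts.append(rf"(?P<{name}>\w+)")
        else:
            # 'a|b' lists alternative trigger words.
            alternatives = "|".join(re.escape(w) for w in token.split("|"))
            parts.append(rf"(?:{alternatives})")
    return re.compile(r"\b" + r"\s+".join(parts) + r"\b")


pattern = template_to_regex("need|use <Precondition> to <Verb> <Object>")
match = pattern.search("terrorists need money to launch attack")
slots = match.groupdict() if match else {}
```

Here `slots` holds the captured prerequisite, verb, and object of the matched sentence.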
When the first round of iteration starts, no behavior knowledge is available yet, so the initial templates are first used to extract each kind of behavior knowledge from the text and to calculate its confidence level. In this embodiment δ = 0.5, the number n_k of behavior knowledge items newly added in each round of iteration is set to 5, and the number n_t of newly added templates is set to 1. When the first round of iteration finishes, the behavior knowledge obtained and its confidence levels are as follows:
Prerequisite knowledge:
Result knowledge:
From the behavior prerequisite and result knowledge obtained in the first round of iteration, knowledge reasoning yields the following sequential relationships between behaviors and their confidence levels:
1. launch attack | find haven | 1.0
2. launch attack | create haven | 1.0
The second round of iteration first uses the behavior knowledge obtained in the first round to acquire new templates from the input text, calculates the confidence level of all candidate templates (including the templates from the first round), and sorts the templates by confidence level. According to the preset n_t, this round adds one template compared with the previous round. The templates of each class newly added in the second round and their confidence levels are as follows:
Result template: <Verb> <Object>, put <Effect> and 1.0
Prerequisite template: be <Precondition> to <Verb> <Object> 1.0
Sequential template: <Verb2> <Object2> to <Verb1> <Object1> 1.0
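The incremental selection used in each round (sort candidates by confidence level and keep the previous count plus the preset increment) can be sketched as follows. The function name `select_top` and the dictionary representation of candidates are assumptions made for this illustration.

```python
from typing import Dict, List


def select_top(candidates: Dict[str, float], previous_count: int,
               increment: int) -> List[str]:
    """Keep the k most confident candidates, with k = previous_count + increment."""
    k = previous_count + increment
    ranked = sorted(candidates.items(), key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:k]]
```

With one template kept in the previous round and an increment of 1, the two most confident candidate templates are kept in the current round. The same routine applies to each kind of behavior knowledge with the increment n_k.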
Then, with the templates obtained in this round, new behavior knowledge is extracted from the input text by the same steps as the behavior knowledge extraction in the first round. This cycle repeats until the preset number of iterations is reached. After the iterations finish, the behavior knowledge finally obtained and its confidence levels are as follows:
Prerequisite knowledge:
Result knowledge:
Sequential relationship:
Based on the input text of this embodiment, the experimental results of the behavior knowledge extraction method proposed by the present invention are as follows (the number of iterations is set to 24, δ is varied in steps of 0.25, and both the case without knowledge reasoning and the case combined with knowledge reasoning are included):
Weight factor | Result knowledge accuracy | Prerequisite knowledge accuracy | Sequential relationship accuracy |
---|---|---|---|
δ=0 (without reasoning) | 0.533 | 0.817 | / |
δ=0 | 0.55 | 0.842 | 0.788 |
δ=0.25 | 0.575 | 0.842 | 0.805 |
δ=0.5 | 0.558 | 0.875 | 0.813 |
δ=0.75 | 0.542 | 0.808 | 0.743 |
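The table above can be read programmatically. The short Python sketch below copies the accuracy values of the rows with knowledge reasoning and finds, for each kind of knowledge, the weight factor with the highest accuracy; the variable names are chosen here for illustration only.

```python
# Accuracy per weight factor (rows with knowledge reasoning), copied from the table:
# (result knowledge, prerequisite knowledge, sequential relationship)
with_reasoning = {
    0.0: (0.55, 0.842, 0.788),
    0.25: (0.575, 0.842, 0.805),
    0.5: (0.558, 0.875, 0.813),
    0.75: (0.542, 0.808, 0.743),
}
# Row "δ=0 (without reasoning)"; no sequential relationships are obtained there.
without_reasoning = (0.533, 0.817)

kinds = ("result", "prerequisite", "sequential")
best_delta = {
    kind: max(with_reasoning, key=lambda d: with_reasoning[d][i])
    for i, kind in enumerate(kinds)
}

# Accuracy gained at δ=0 by adding knowledge reasoning.
gain_result = round(with_reasoning[0.0][0] - without_reasoning[0], 3)
gain_prerequisite = round(with_reasoning[0.0][1] - without_reasoning[1], 3)
```

On these values, δ = 0.25 gives the highest result knowledge accuracy, while δ = 0.5 gives the highest prerequisite knowledge and sequential relationship accuracy.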
The advantages of the method proposed by the present invention are as follows:
Starting from only a small number of initial extraction templates, the present invention incrementally obtains a large amount of behavior knowledge, which saves time and labor and suits behavior knowledge extraction from massive text;
The behavior knowledge extraction method designed by the present invention combines knowledge reasoning with the Bootstrapping technique, which clearly improves the performance of behavior knowledge extraction;
The Bootstrapping step adopted by the present invention uses statistical correlation and semantic similarity information to estimate the confidence level of knowledge, which can effectively improve the reliability of behavior knowledge extraction.
The specific embodiment described above further explains the purpose, technical scheme, and beneficial effects of the present invention. It should be understood that the foregoing is only a specific embodiment of the present invention and does not limit the present invention; any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.
Claims (11)
1. A behavior knowledge extraction method, comprising the following steps:
S1, using the co-occurrence relationship and semantic relevance information between templates and behavior knowledge, calculating the statistical correlation degree between a candidate template and the behavior knowledge set and between candidate knowledge and the template set, and the semantic similarity between candidate behavior knowledge and the behavior knowledge set and between a candidate template and the template set; then calculating the confidence level of the candidate behavior knowledge and templates, and obtaining a new behavior knowledge set and template set according to said confidence level;
S2, using the semantic association between different kinds of behavior knowledge, expanding the behavior knowledge set by knowledge reasoning;
S3, performing knowledge refinement on the behavior knowledge, mainly comprising merging similar cases and removing contradictory cases, so as to improve the quality of behavior knowledge extraction.
2. The behavior knowledge extraction method as claimed in claim 1, characterized in that: said step S1 comprises multiple iterations, and each iteration comprises two sub-steps, namely incrementally obtaining templates and incrementally obtaining behavior knowledge. Incremental means that, as the iterations proceed, each round obtains more templates and behavior knowledge than the previous round.
3. The behavior knowledge extraction method as claimed in claim 2, characterized in that the sub-steps of incrementally obtaining templates are as follows:
S1.1, obtaining a candidate template set from the input text based on the behavior knowledge obtained in the previous round of iteration; using the co-occurrence relationship between the current behavior knowledge set and a candidate template to calculate their statistical correlation degree, calculating the semantic similarity between the candidate template and the template set obtained in the previous round of iteration, and then obtaining the confidence level of the candidate template.
S1.2, sorting the candidate templates by confidence level from high to low, and choosing the top k templates as the templates obtained in this round of iteration, wherein k is the sum of the number of templates of the previous round of iteration and n_t, and n_t is the number of templates newly added in each iteration, whose value is determined by the embodiment.
4. The behavior knowledge extraction method as claimed in claim 2, characterized in that the sub-steps of incrementally obtaining behavior knowledge are as follows:
S1.3, obtaining a candidate behavior knowledge set from the input text based on the templates obtained in this round of iteration; using the co-occurrence relationship between the current template set and candidate behavior knowledge to calculate their statistical correlation degree, calculating the semantic similarity between the candidate behavior knowledge and the behavior knowledge set obtained in the previous round of iteration, and then obtaining the confidence level of the candidate behavior knowledge.
S1.4, sorting each of the three kinds of behavior knowledge by confidence level from high to low, and choosing the top k items as the behavior knowledge obtained in this round of iteration, wherein k is the sum of the number of items of each kind of behavior knowledge in the previous round of iteration and n_k, and n_k is the number of items of each kind of behavior knowledge newly added in each round of iteration, whose value is determined by the embodiment.
5. The behavior knowledge extraction method as claimed in claim 3 or 4, characterized in that the confidence level of said templates and behavior knowledge is defined as follows:
wherein C_i(t) and C_i(k) denote the confidence level of candidate template t and candidate knowledge k, respectively, in the i-th round of iteration; SA_i(·) and SS_i(·) denote the statistical correlation degree and the semantic similarity, respectively, of a candidate template or candidate knowledge in the i-th round of iteration; max_{t'} C_i(t') and max_{k'} C_i(k') are the maximum confidence levels over all templates and over all knowledge, respectively, in the i-th round; and δ is a weight factor whose range is set to [0, 1). When δ is 0, the confidence calculation evaluates the reliability of candidate behavior knowledge and templates using only the statistical correlation degree.
6. The behavior knowledge extraction method as claimed in claim 5, characterized in that the statistical correlation degree between a candidate template and the knowledge set, and between candidate behavior knowledge and the template set, in the i-th round of iteration is calculated by the following formulas:
In the former formula, t denotes a candidate template, K_{i-1} denotes the behavior knowledge set obtained in round i-1, and C_{i-1}(k) denotes the confidence level of behavior knowledge k in round i-1; in the latter formula, k denotes candidate behavior knowledge, T_i is the candidate template set of the current round of iteration, and C_i(t) is the confidence level of template t in the current round of iteration.
7. The behavior knowledge extraction method as claimed in claim 5, characterized in that:
the semantic similarity between a candidate template t and the template set T_{i-1} obtained in the previous round of iteration is calculated by the following formula:
wherein Sim(t, e) denotes the degree of semantic similarity between templates t and e;
the semantic similarity between candidate behavior knowledge k and the behavior knowledge set K_{i-1} obtained in the previous round of iteration is calculated by the following formula:
wherein Sim(k, e) denotes the degree of semantic similarity between behavior knowledge k and e.
8. The behavior knowledge extraction method as claimed in claim 1, characterized in that in said step S2 the behavior knowledge comprises three kinds, namely behavior prerequisite knowledge, behavior result knowledge, and knowledge of the sequential relationships between behaviors.
The mutual inference method between said kinds of behavior knowledge is:
wherein a_1 and a_2 denote behaviors, s denotes a state, Effect(a_1, s) denotes that s is a result of a_1, Precondition(a_2, s) denotes that s is a prerequisite of a_2, and Temporal-relation(a_1, a_2) denotes that a_1 is a behavior that occurs before a_2.
9. The behavior knowledge extraction method as claimed in claim 8, characterized in that after each round of iteration finishes, the three behavior knowledge sets are expanded on the basis of the behavior prerequisite, result, and sequential relationship sets obtained in that round, as follows: first, for each behavior prerequisite knowledge (a_2, s), checking whether state s is present in the result set, and if it is, for each behavior a_1 that has s as its result, adding the behavior pair (a_1, a_2) to the candidate sequential relationship set; second, for each behavior pair (a_1, a_2) in the sequential knowledge set, if (a_1, s) is present in the result set (or (a_2, s) is present in the prerequisite set), adding (a_2, s) to the candidate behavior prerequisite set (or adding (a_1, s) to the candidate behavior result set); finally, for each behavior knowledge k in the candidate behavior prerequisite, result, and sequential relationship sets, if k is also a candidate behavior knowledge obtained on the basis of the statistical correlation degree, setting the confidence level of k to 1 and adding k to the corresponding behavior knowledge set.
10. The behavior knowledge extraction method as claimed in claim 1, characterized in that in said step S3, merging similar cases merges the redundant behaviors, behavior prerequisites, and results obtained by preprocessing; removing contradictory cases is applied to the sequential relationships between behaviors obtained in each round of iteration of the Bootstrapping step and removes behavior pairs that contradict each other.
11. A behavior knowledge extraction device, comprising the following modules:
a first module, configured to use the co-occurrence relationship and semantic relevance information between templates and behavior knowledge to calculate the statistical correlation degree of candidate behavior knowledge and templates, and the semantic similarity between candidate behavior knowledge and the behavior knowledge set and between a candidate template and the template set, then to calculate the confidence level of the candidate behavior knowledge and templates, and to obtain new behavior knowledge and templates according to said confidence level;
a second module, configured to use the semantic association between different kinds of behavior knowledge to expand the behavior knowledge by knowledge reasoning;
a third module, configured to merge similar cases and remove contradictory cases in the behavior knowledge obtained by said first module, so as to improve the quality of behavior knowledge extraction.
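The role of the weight factor δ in the confidence level of claim 5 can be illustrated with a small Python sketch. The defining formula is not reproduced in this text, so the combination below is purely an assumption: a linear interpolation (1 − δ)·SA + δ·SS divided by the largest value of the round. It is chosen only because it agrees with the stated properties (δ lies in [0, 1), and δ = 0 leaves only the statistical correlation degree); the function name `combine_confidence` is likewise hypothetical.

```python
from typing import Dict


def combine_confidence(sa: Dict[str, float], ss: Dict[str, float],
                       delta: float) -> Dict[str, float]:
    """ASSUMED combination of statistical correlation (sa) and semantic
    similarity (ss); not the formula of the claims."""
    if not 0.0 <= delta < 1.0:
        raise ValueError("delta must lie in [0, 1)")
    raw = {c: (1.0 - delta) * sa[c] + delta * ss.get(c, 0.0) for c in sa}
    top = max(raw.values(), default=0.0)
    if top <= 0.0:
        return {c: 0.0 for c in raw}
    # Normalize by the maximum of the round so the best candidate scores 1.
    return {c: value / top for c, value in raw.items()}
```

Under this assumed form, setting δ = 0 ranks candidates by statistical correlation degree alone, which matches the behavior described for δ = 0.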
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN2013104522928A CN103455638A (en) | 2013-09-26 | 2013-09-26 | Behavior knowledge extracting method and device combining reasoning and semi-automatic learning |
Publications (1)
Publication Number | Publication Date |
---|---|
CN103455638A true CN103455638A (en) | 2013-12-18 |
Family
ID=49738001
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN2013104522928A Pending CN103455638A (en) | 2013-09-26 | 2013-09-26 | Behavior knowledge extracting method and device combining reasoning and semi-automatic learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN103455638A (en) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101000626A (en) * | 2007-01-12 | 2007-07-18 | 宋晓伟 | Information storing method and method for converting search inquiry into inquiry statement |
US20090119649A1 (en) * | 2007-11-02 | 2009-05-07 | Klocwork Corp. | Static analysis defect detection in the presence of virtual function calls |
Non-Patent Citations (1)
Title |
---|
ANSHENG GE ET AL: "Action Knowledge Extraction from Web Text", 《2013 IEEE INTERNATIONAL CONFERENCE ON INTELLIGENCE AND SECURITY INFORMATICS (ISI)》 * |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104572982A (en) * | 2014-12-31 | 2015-04-29 | 东软集团股份有限公司 | Personalized recommendation method and system based on question guide |
CN104572982B (en) * | 2014-12-31 | 2017-10-31 | 东软集团股份有限公司 | Personalized recommendation method and system based on problem guiding |
CN109615006A (en) * | 2018-12-10 | 2019-04-12 | 北京市商汤科技开发有限公司 | Character recognition method and device, electronic equipment and storage medium |
CN111401671A (en) * | 2019-01-02 | 2020-07-10 | 中国移动通信有限公司研究院 | Method and device for calculating derivative features in accurate marketing and readable storage medium |
CN111401671B (en) * | 2019-01-02 | 2023-11-21 | 中国移动通信有限公司研究院 | Derived feature calculation method and device in accurate marketing and readable storage medium |
CN114492387A (en) * | 2022-04-18 | 2022-05-13 | 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) | Domain self-adaptive aspect term extraction method and system based on syntactic structure |
CN114492387B (en) * | 2022-04-18 | 2022-07-19 | 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) | Domain self-adaptive aspect term extraction method and system based on syntactic structure |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C02 | Deemed withdrawal of patent application after publication (patent law 2001) | ||
WD01 | Invention patent application deemed withdrawn after publication |
Application publication date: 20131218 |