CN108363816A

CN108363816A - Open entity relation extraction method based on a sentence semantic structure model

Info

Publication number: CN108363816A
Application number: CN201810234056.1A
Authority: CN
Inventors: 罗森林; 尹继泽; 潘丽敏; 郭佳; 吴舟婷
Original assignee: Beijing Institute of Technology BIT
Current assignee: Beijing Institute of Technology BIT
Priority date: 2018-03-21
Filing date: 2018-03-21
Publication date: 2018-08-03

Abstract

The present invention relates to an open entity relation extraction method based on a sentence semantic structure model and belongs to the technical field of computer and information science. The method first extracts the text of the microblog data, performs sentence splitting, word segmentation, stop-word removal, and part-of-speech tagging, and then applies a dependency analysis tool to obtain a dependency parse tree. Second, candidate arguments are determined by base noun phrase recognition rules, and relation word extraction rules and argument extraction rules are combined to obtain entity relation triples, which are screened by a confidence calculation rule to yield candidate entity relation pairs. Third, sentence similarity is computed based on CSM to obtain Sim₁ and based on PV to obtain Sim₂, and the two are fused by similarity weighting into a sentence similarity, from which a sentence similarity matrix is built. Finally, similar sentence groups are divided according to the sentence similarity matrix and a similarity threshold, and the entity relation pairs within each group are merged according to the confidence of the entity relation pairs contained in the group's sentences. Experiments on the NLP&CC microblog evaluation corpus show that computing entity relation confidence, dividing similar sentence groups, and merging the entity relation pairs within each group improves accuracy and recall and removes redundancy.

Description

Open entity relation extraction method based on a sentence semantic structure model

Technical field

The present invention relates to an open entity relation extraction method based on a sentence semantic structure model and belongs to the technical field of computer and information science.

Background art

Open entity relation extraction technology extracts entities and entity relations of unrestricted categories from noisy, unordered web data and outputs them as structured information. Because microblog data is noisy and redundant, extraction results suffer from low accuracy and redundancy, so open entity relation extraction needs to be studied with these characteristics of microblogs in mind. The present invention therefore provides an open entity relation extraction method based on a sentence semantic structure model to improve a system's ability to extract entity relations from noisy, redundant microblog data.

The basic problem that the open entity relation extraction method based on a sentence semantic structure model must solve is: extracting entities and entity relations of unrestricted categories from noisy, unordered web data and outputting them as structured information. Existing open entity relation extraction systems and methods fall into the following groups:

1. The TextRunner and WOE systems

TextRunner was the first open information extraction system. It trains a naive Bayes model on features such as part of speech and base noun phrases to extract the relations between entities. Later work showed that classifiers that model text sequence features, such as linear-chain conditional random fields and Markov logic networks, achieve better results. The WOE system uses Wikipedia data as its training set, and experiments confirmed that exploiting the dependency relations in the data effectively improves on TextRunner. Both TextRunner and WOE first identify named entities and then extract the relations between them.

2. The rule-based methods of ReVerb and Gamallo et al.

ReVerb first determines a verb-centered relation phrase under semantic and syntactic rule constraints, and then extracts entity relation triples through position constraint rules. The method extracts entity relation pairs by part-of-speech tagging, named entity recognition, and matching against hand-written rules. For multilingual open information extraction, Gamallo et al. use rule-based dependency analysis to extract entity relations in English, Portuguese, Galician, and Spanish.

3. Open entity relation extraction systems for Chinese

For Chinese open entity relation extraction there are three main systems: ZORE, UnCORE, and CORE. ZORE performs dependency analysis on a sentence to obtain a dependency parse tree and then iteratively extracts the entity triples of the sentence according to the dependency relations between entities and relation words. UnCORE formulates position constraint rules on the entities and relation indicator words within a sentence to extract candidate relation triples, then screens relation indicator words by information gain and combines this with a type ranking method to obtain the indicator words of each entity relation type, and finally filters the candidate triples using relation words and sentence-pattern rules. CORE first analyzes syntactic structure with the CKIP parser, then identifies the central relation indicator word of the sentence by the "head-driven" principle, and finally uses dependency relations to find the central entity words.

In summary, existing open entity relation extraction methods have difficulty handling the noisy and redundant nature of microblog data, so the present invention proposes an open entity relation extraction method based on a sentence semantic structure model.

Summary of the invention

The purpose of the present invention is to alleviate the low accuracy and redundant results of existing methods when extracting entity relation pairs from microblog data. To improve the overall performance of open entity relation extraction, it proposes an open entity relation extraction method based on a sentence semantic structure model.

The design principle of the present invention is as follows. First, the text of the microblog data is extracted and subjected to sentence splitting, word segmentation, stop-word removal, and part-of-speech tagging, and a dependency analysis tool is then used to obtain a dependency parse tree. Second, candidate arguments are determined by base noun phrase recognition rules, entity relation triples are obtained by combining relation word extraction rules with argument extraction rules, and the triples are screened by a confidence calculation rule to obtain candidate entity relation pairs. Third, sentence similarity is computed based on CSM to obtain Sim₁ and based on PV to obtain Sim₂, and the two are fused by similarity weighting into a sentence similarity, from which the sentence similarity matrix is obtained. Finally, similar sentence groups are divided according to the sentence similarity matrix and a similarity threshold, and the entity relation pairs within each group are merged according to the confidence of the entity relation pairs contained in the group's sentences.

The technical solution of the present invention is implemented by the following steps:

Step 1, the microblog data is preprocessed.

Step 1.1, the text of the microblog data is extracted.

Step 1.2, the text of the microblog data is split into sentences, segmented into words, stripped of stop words, and tagged with parts of speech.

Step 1.3, a dependency analysis tool is used to obtain the dependency parse tree.

Step 2, candidate entity relation pairs are extracted.

Step 2.1, entity relation triples are extracted by combining base noun phrase rules, relation word extraction rules, and argument extraction rules.

Step 2.2, the entity relation triples are screened by a confidence calculation rule to generate the candidate set of entity relation pairs.

Step 3, sentence similarity is calculated.

Step 3.1, sentence similarity is calculated based on CSM to obtain Sim₁.

Step 3.2, sentence similarity is calculated based on PV to obtain Sim₂.

Step 3.3, the two similarities are fused by weighting to obtain the sentence similarity, and from it the sentence similarity matrix.

Step 4, entity relation pairs are merged.

Step 4.1, similar sentence groups are divided according to the sentence similarity matrix and a similarity threshold.

Step 4.2, the entity relation pairs within each group are merged according to the confidence of the entity relation pairs contained in the group's sentences, giving the final result.

Advantageous effect

Compared with existing open entity relation extraction systems and methods, the present invention effectively alleviates the low accuracy and redundancy of results when extracting entity relation pairs from microblog data.

Description of the drawings

Fig. 1 is a schematic diagram of the open entity relation extraction method of the present invention based on a sentence semantic structure model.

Fig. 2 is a schematic diagram of the preprocessing stage of the open entity relation extraction method based on a sentence semantic structure model.

Fig. 3 is an example of a dependency parse.

Fig. 4 is a schematic diagram of the PV-CSM sentence similarity calculation method.

Fig. 5 is a schematic diagram of the sentence similarity calculation method based on the CSM model.

Fig. 6 is the Paragraph Vector framework.

Fig. 7 is a schematic diagram of entity relation pair merging.

Detailed description of the embodiments

To better illustrate the objects and advantages of the present invention, embodiments of the method are described in further detail below with reference to examples.

The detailed process is:

Step 1, microblog data is pre-processed.

Step 1.1, regular expressions are used to filter out the HTML tags and noise symbols in the microblog dataset, the body text is extracted, and traditional and simplified Chinese characters are converted.
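The regular-expression cleaning in Step 1.1 might look like the following minimal sketch. It is an illustration rather than the patented implementation: the patterns, the choice to treat URLs as noise, and the function name `clean_microblog` are all assumptions, and the traditional/simplified conversion is left out because it needs a character mapping table or an external converter.

```python
import html
import re

# Hypothetical noise patterns; the patent does not list its actual expressions.
TAG_RE = re.compile(r"<[^>]+>")                            # HTML tags
URL_RE = re.compile(r"https?://[\w./?=&%#-]+", re.ASCII)   # short links (assumed noise)
SPACE_RE = re.compile(r"\s+")


def clean_microblog(raw: str) -> str:
    """Strip HTML tags, entities, and URLs, then normalize whitespace."""
    text = TAG_RE.sub(" ", raw)
    text = html.unescape(text)      # e.g. "&amp;" -> "&"
    text = URL_RE.sub(" ", text)
    return SPACE_RE.sub(" ", text).strip()
```

The URL pattern is restricted to ASCII characters so that Chinese text directly following a link is not swallowed.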

Step 1.2, the text is split into sentences, and the Harbin Institute of Technology Language Cloud LTP is used to perform word segmentation, part-of-speech tagging, and dependency analysis on each sentence; any text containing fewer than 4 effective words (nouns, verbs, adjectives, numerals, temporal words, etc.) is removed.
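The effective-word filter of Step 1.2 can be sketched as below, assuming each sentence has already been segmented and tagged into `(word, pos)` pairs. The tag set `EFFECTIVE_POS` is an assumption based on LTP-style tags (`n` noun, `v` verb, `a` adjective, `m` numeral, `nt` temporal noun); the patent's own list ends in "etc.", so the real set may be larger.

```python
# Assumed LTP-style tags for "effective" words; the patent's list is open-ended.
EFFECTIVE_POS = {"n", "v", "a", "m", "nt"}


def count_effective_words(tagged):
    """Count tokens whose part-of-speech tag is in the effective set."""
    return sum(1 for _word, pos in tagged if pos in EFFECTIVE_POS)


def keep_sentence(tagged, min_effective=4):
    """Keep a text only if it has at least `min_effective` effective words."""
    return count_effective_words(tagged) >= min_effective
```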

Step 1.3, dependency parsing reveals the syntactic structure of a sentence by analyzing the dependency relations between the components of its linguistic units. Fig. 3 shows the dependency relations between the components of the sentence "The Democrats on the White House budget committee released a report on Monday", as analyzed by the LTP dependency analysis tool provided by Harbin Institute of Technology. The dependency labels and their meanings are listed in Table 1.

Table 1. Dependency syntax labels

Step 2, candidate entity relation pairs are extracted.

Step 2.1, first, base noun phrases are obtained from the part-of-speech tagging results and the noun phrase extraction rules. Next, any verb in the sentence that lies on a VOB (verb-object) or FOB (fronted-object) dependency path is treated as a candidate relation word. Then the base noun phrase components that stand in an SBV (subject-verb), VOB, or FOB relation to a candidate relation word are taken as that verb's arguments, yielding entity relation pairs along two kinds of dependency paths: "SBV - relation word - VOB" and "SBV - FOB - relation word".

Sentences with negative structures need special handling. For example, for "Some university students do not attend the party", the extraction rules above yield the entity relation pair "e1: some university students, e2: party, r: attend", which is incorrect. Negation words must therefore be taken into account; the correct result is "e1: some university students, e2: party, r: do not attend".

Negation words are identified against a predefined negation word set, and each identified negation word is added to the relation word with which it has an ADV dependency path. The negation words (given here as English glosses of the Chinese terms) include: non-, do not have, nothing, not, prevent, do not have, being difficult to, forbid, difficult to, forget, ignore, abandon, prevent, refuse, almost none, almost, unclear.
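The extraction rules of Step 2.1, including the negation handling, can be sketched over a generic dependency parse as follows. Everything here is a hedged illustration: the `Token` structure with a 0-based `head` index (and -1 for the root) is hypothetical and is not LTP's output format, `NEGATION_WORDS` is a small illustrative subset rather than the patent's full set, and the sketch returns head words only, omitting the base noun phrase expansion that the patent applies to arguments.

```python
from dataclasses import dataclass


@dataclass
class Token:
    word: str
    head: int  # index of the head token, -1 for the root (hypothetical format)
    rel: str   # dependency label of the arc to the head, e.g. SBV, VOB, FOB, ADV


# Illustrative subset only; the patent uses a larger negation word set.
NEGATION_WORDS = {"不", "没", "没有", "无"}


def extract_triples(tokens, negations=NEGATION_WORDS):
    """Return (e1, relation, e2) triples for SBV-verb-VOB and SBV-FOB-verb paths."""
    triples = []
    for v_idx, verb in enumerate(tokens):
        # Group the direct dependents of this token by dependency label.
        deps = {}
        for tok in tokens:
            if tok.head == v_idx:
                deps.setdefault(tok.rel, []).append(tok.word)
        objects = deps.get("VOB", []) + deps.get("FOB", [])
        if "SBV" not in deps or not objects:
            continue  # not a candidate relation word
        # Prepend negation words attached to the verb through an ADV arc.
        neg = "".join(w for w in deps.get("ADV", []) if w in negations)
        for subj in deps["SBV"]:
            for obj in objects:
                triples.append((subj, neg + verb.word, obj))
    return triples
```

As a usage example, a hand-built parse of 部分大学生不参加聚会 (a back-translation guess at the patent's "Some university students do not attend the party" example) with tokens 部分 (ATT), 大学生 (SBV), 不 (ADV), 参加 (root), 聚会 (VOB) yields the relation word 不参加 instead of 参加.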

Step 2.2, the confidence of each entity relation pair is computed. When the confidence exceeds the threshold, the corresponding entity relation pair becomes a candidate entity relation pair; otherwise it is rejected.

The selected features and their weights after training on 200 annotated corpus items are shown in Table 2:

Table 2. Features and corresponding weights

In the table, each of x₁…x₁₀ takes the value 1 when the condition described by the feature holds and 0 otherwise. Distance and length in the table both refer to a number of words. The weight Dis corresponding to x₁₁ is computed as shown in Formula 1.

where e₁ and e₂ are the two arguments of the entity relation pair, r is the relation word of the pair, dis(e₁, e₂) is the distance between the two arguments in the sentence, i.e., the number of words between them, dis(e₁, r) is the distance between argument e₁ and relation word r in the sentence, and dis(r, e₂) is the distance between relation word r and argument e₂ in the sentence. Combining the feature weights with the sigmoid function gives the confidence Confidence of an entity relation pair, computed as shown in Formula 2.

where x is a feature from the table, taking the value 0 or 1, and w is its corresponding weight.
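Formula 2 does not survive in this text, but the description (feature weights combined through a sigmoid) suggests a form like the sketch below. This is a reading under stated assumptions, not the patent's exact formula: the confidence is taken to be the sigmoid of the weighted feature sum with no bias term, and the weights and threshold in any call are hypothetical because Table 2 is not reproduced here.

```python
import math


def confidence(features, weights):
    """Sigmoid of the weighted feature sum (assumed form of Formula 2)."""
    if len(features) != len(weights):
        raise ValueError("features and weights must have the same length")
    z = sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


def is_candidate(features, weights, threshold):
    """Keep an entity relation pair only if its confidence exceeds the threshold."""
    return confidence(features, weights) > threshold
```

With all features at 0 the weighted sum is 0 and the confidence is exactly 0.5, so a threshold above 0.5 rejects pairs for which no positive evidence fires.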

Step 3, sentence similarity is calculated.

Step 3.1, the principle of the sentence similarity calculation method based on the CSM model is shown in Fig. 5. To address the sparsity of semantic features in short texts, the method builds on sentence semantic structure analysis and uses an LDA topic model to mine latent topical knowledge, expands the semantic features of each single sentence, then represents the sentence as vectors, and finally computes sentence similarity.

CSM divides the meaning of a sentence into a topic and a comment. The topic is the main object described in the sentence, and the comment is the description of that object. According to the semantic roles that words play in the sentence meaning, cases are divided into basic cases and general cases. For example, in "Xiao Ming broke the classroom window.", "Xiao Ming" is the performer of the action and belongs to the agent case among the basic cases, whereas "classroom" restricts "window" and belongs to the scope case among the general cases. Based on the result of sentence semantic structure analysis, the words of a sentence are divided by semantic role into four classes: basic items under the topic, general items under the topic, basic items under the comment, and general items under the comment. The LDA analysis results are then used to expand the semantic features of the sentence.

The input of the LDA analysis module is the sets of the four classes of words in the text collection, and its output is the distribution of these four classes of words over multiple themes (LDA latent topics). The word distributions are stored in a knowledge base, which later modules use to expand sentence semantic features. The LDA topic model assumes that a text contains multiple themes, that each theme corresponds to multiple words in the text following a multinomial distribution over the vocabulary, and that words belonging to the same theme are latently semantically related. Therefore, adding to a sentence the top N words of the theme most relevant to it expands the semantic features without introducing too much noise.

Expansion of a sentence's semantic features is divided into expansion based on the topic component and expansion based on the comment component.

Semantic feature expansion based on the topic component proceeds as follows. First, the sum of the probability values of all basic items under the topic is computed for theme P_i; then the sum of the probability values of all general items for theme P_i is weighted and added to the previous result, as shown in Formula 3,

where T_mi is the probability value under theme P_i of the m-th basic item under the sentence's topic, and G_ni is the probability value under theme P_i of the n-th general item under the sentence's topic. The theme with the highest value (i.e., the theme most relevant to the sentence) is selected and its top N words are added to the short text. Finally, a topic-based sentence vector is built using VSM: the weight of each original word of the sentence is its TF*IDF value, and the weight of each expansion word is 1.
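The theme scoring and vector expansion just described can be sketched as follows. Since Formula 3 itself is missing from this text, the score is reconstructed from the prose as the sum of the basic-item probabilities plus a weighted sum of the general-item probabilities; the parameter `general_weight` stands in for a weighting symbol that the source does not preserve, and the rule that an expansion word already present in the sentence keeps its TF*IDF weight is an assumption.

```python
def theme_scores(basic_probs, general_probs, general_weight):
    """Score each theme i as sum_m T_mi + general_weight * sum_n G_ni.

    basic_probs[m][i] is T_mi and general_probs[n][i] is G_ni.
    """
    rows = basic_probs + general_probs
    if not rows:
        return []
    n_themes = len(rows[0])
    return [
        sum(r[i] for r in basic_probs)
        + general_weight * sum(r[i] for r in general_probs)
        for i in range(n_themes)
    ]


def expand_vector(tfidf, theme_top_words, n):
    """Build the VSM vector: TF*IDF for original words, 1.0 for expansion words."""
    vector = dict(tfidf)
    for word in theme_top_words[:n]:
        vector.setdefault(word, 1.0)  # assumed: existing words keep TF*IDF
    return vector
```

The best theme is then the index of the largest score, for example `max(range(len(scores)), key=scores.__getitem__)`.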

Semantic feature expansion based on the comment component proceeds analogously and yields the comment-based sentence vector.

To compute the similarity between sentences, the cosine similarity is computed separately for each of the two kinds of feature vectors obtained after semantic feature expansion, and the two values are combined by weighted addition to give the CSM-LDA-based sentence similarity Sim₁, as shown in Formula 4.

where S_A and S_B denote any two sentences; the remaining symbols of the formula denote, respectively, the topic vectors of the two sentences obtained from sentence semantic structure analysis and the comment vectors of the two sentences; the weighting coefficient ω for topic and comment is usually set to 0.5.
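A sketch of the Sim₁ computation follows, with sentence vectors held as sparse word-to-weight dictionaries. Formula 4 is not reproduced in this text, so the combination `omega * topic + (1 - omega) * comment` is an assumed form; it agrees with the description when ω = 0.5, where both parts are weighted equally. The function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two sparse vectors given as {word: weight} dicts."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def sim_csm(topic_a, topic_b, comment_a, comment_b, omega=0.5):
    """Weighted sum of topic-vector and comment-vector cosine similarities."""
    return (omega * cosine(topic_a, topic_b)
            + (1.0 - omega) * cosine(comment_a, comment_b))
```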

Step 3.2, Paragraph Vector (PV) is an unsupervised distributed vector representation method that can handle text of arbitrary length, typically at the sentence or paragraph level, and yields good vector representations of sentences and paragraphs. Just as Word2vec comprises the CBOW and Skip-gram models, PV comprises the PV-DM and PV-DBOW models. The PV model additionally assigns a paragraph id to each sentence or paragraph.

The PV-DM model is a three-layer neural network consisting of an input layer, a projection layer, and an output layer. When PV-DM trains sentence vectors, the paragraph id is treated as an ordinary word, and a randomly generated vector for it is added to matrix D. For the words in the sentence, randomly generated word vectors are added to matrix W. Sentence vectors have the same dimension as word vectors, but the two do not belong to the same space. PV-DM averages or concatenates the sentence vector with the word vectors of the sentence to obtain the input vector, and trains the model by maximizing the probability of the target word. Sentence vector training differs from word vector training in that the hidden-layer input of PV-DM is determined jointly by matrices W and D, and the semantic information of the whole sentence is taken into account during training.

The algorithm comprises two stages, training and inference; see Fig. 6.

(1) Training stage: training yields the word vector matrix W, the softmax weights U and b, and the vectors D of the sentences seen in training. The paragraph ids initialized during training are unique and not shared, whereas word vectors are shared across the entire training corpus. A fixed-length sliding window traverses all words in the dataset; as the window slides, the word vector matrix W and the paragraph id vector matrix D are updated until training ends.

(2) Inference stage: the target sentence is first assigned a new paragraph id. Using the PV model parameters W, U, and b obtained in the training stage, the vector of the target sentence is optimized by gradient descent and the BP algorithm so as to maximize the probability of the target sentence under the current conditions; the vector obtained after convergence is the representation of the sentence to be predicted.

From the resulting sentence vectors, the cosine similarity Sim₂ between sentences is computed.

Step 3.3, Sim₁ and Sim₂ are combined by weighted summation according to Formula 5 to give the similarity Sim(S₁, S₂) between sentences. Repeating the above sentence similarity calculation yields the pairwise similarities of all sentences in the sentence set, which form the sentence similarity matrix.

Sim(S₁,S₂)=α * Sim₁+β*Sim₂ (5)
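Applied to every sentence pair, Formula 5 produces the fused similarity matrix. The sketch below assumes the two input matrices have already been computed as square lists of lists of equal size; the values of α and β are not given in this text, so any values passed in are hypothetical.

```python
def fuse_similarity(sim1, sim2, alpha, beta):
    """Element-wise Sim = alpha * Sim1 + beta * Sim2 over two n x n matrices."""
    n = len(sim1)
    return [
        [alpha * sim1[i][j] + beta * sim2[i][j] for j in range(n)]
        for i in range(n)
    ]
```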

Step 4, entity relation pairs are merged.

Step 4.1, a schematic diagram of entity relation pair merging is shown in Fig. 7. The sentence similarity module yields the sentence similarity matrix, and sentences whose similarity exceeds the threshold are placed in one group. The specific steps for dividing the sentence similarity matrix into similar sentence groups are as follows:

(1) Select a sentence S from the sentence set, add it to similar sentence group 1, and delete S from the sentence set;

(2) Locate the row index i of S in the sentence similarity matrix, add all sentences whose similarity in row i exceeds 0.75 to sentence group 1, and delete them from the sentence set;

(3) Randomly select a sentence S2 from the remaining sentences; if the similarity between S2 and any sentence in sentence group 1 exceeds 0.75, add S2 to sentence group 1; otherwise create a new similar sentence group, add S2 to it, and repeat (2);

(4) Iterate (3) until the sentence set is empty, yielding n similar sentence groups.
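The grouping steps above can be sketched as follows. Two simplifications are assumptions of this sketch and not part of the patent text: sentences are taken in index order instead of at random, so the result is reproducible, and a sentence is compared against every existing group, not only group 1.

```python
def group_sentences(sim, threshold=0.75):
    """Partition sentence indices into similar sentence groups."""
    remaining = list(range(len(sim)))
    groups = []
    while remaining:
        s = remaining.pop(0)
        # Step (3): join an existing group if similar to any of its members.
        home = next(
            (g for g in groups if any(sim[s][m] > threshold for m in g)),
            None,
        )
        if home is not None:
            home.append(s)
            continue
        # Steps (1)-(2): start a new group and pull in everything similar to s.
        group = [s]
        for j in list(remaining):
            if sim[s][j] > threshold:
                group.append(j)
                remaining.remove(j)
        groups.append(group)
    return groups
```

Because membership only requires similarity to one member, groups can chain: a sentence may join a group through a single close neighbour even if it is dissimilar to the group's seed sentence.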

Step 4.2, the candidate entity relation pairs of all sentences in the same group are ranked by confidence, and the top-ranked entity relation pair replaces the candidate entity relation pairs of all sentences in the group, serving as the optimal entity relation pair of that sentence group.
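Selecting the optimal pair for a group reduces to taking the maximum over confidences. In this minimal sketch the `candidates` mapping from sentence index to a list of `(triple, confidence)` tuples is a hypothetical data layout chosen for illustration.

```python
def merge_group(group, candidates):
    """Return the highest-confidence (triple, confidence) pair in a sentence group.

    Returns None if no sentence in the group has a candidate pair.
    """
    pool = [pair for idx in group for pair in candidates.get(idx, [])]
    return max(pool, key=lambda pair: pair[1]) if pool else None
```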

Test results: the open entity relation extraction method based on a sentence semantic structure model was compared with other open entity relation extraction methods on social text (the public corpus of the Chinese microblog opinion element extraction evaluation task released by the 2013 NLP&CC conference). The comparison methods were ZORE (2014) and CORE (2014). The present invention outperforms ZORE and CORE, improving accuracy and removing redundancy, and thus effectively achieves open entity relation extraction; the results are shown in Table 3.

Table 3. Comparative test results

The specific description above further explains the purpose, technical solution, and advantageous effects of the invention. It should be understood that the foregoing is only a specific embodiment of the present invention and is not intended to limit its scope of protection; any modification, equivalent substitution, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims

1. An open entity relation extraction method based on a sentence semantic structure model, characterized in that the method comprises the following steps:

Step 1, preprocessing the microblog data, including: extracting the text of the microblog data; performing sentence splitting, word segmentation, stop-word removal, and part-of-speech tagging on the text; and then obtaining a dependency parse tree with a dependency analysis tool;

Step 2, extracting entity relation triples by combining base noun phrase rules, relation word extraction rules, and argument extraction rules, then screening the entity relation triples by a confidence calculation rule to generate a candidate set of entity relation pairs;

Step 3, calculating sentence similarity based on CSM to obtain Sim₁, calculating sentence similarity based on PV to obtain Sim₂, then fusing the two by similarity weighting to obtain the sentence similarity, and from it the sentence similarity matrix;

Step 4, dividing similar sentence groups according to the sentence similarity matrix and a similarity threshold, then merging the entity relation pairs within each group according to the confidence of the entity relation pairs contained in the group's sentences, obtaining the final result.

2. The open entity relation extraction method based on a sentence semantic structure model according to claim 1, characterized in that: when the confidence of an entity relation pair is calculated in step 2, the selected features include: the relation word lies between the two arguments, the two arguments lie on the same side of the relation word, the ER pair has a VOB path, the ER pair has an FOB path, and the distance between an argument and the relation word.

3. The open entity relation extraction method based on a sentence semantic structure model according to claim 1, characterized in that: when the confidence of an entity relation pair is calculated in step 2, the weight Dis corresponding to the feature "distance between argument and relation word" is computed as shown in Formula 1:

where e₁ and e₂ are the two arguments of the entity relation pair, r is the relation word of the pair, dis(e₁, e₂) is the distance between the two arguments in the sentence, i.e., the number of words between them, dis(e₁, r) is the distance between argument e₁ and relation word r in the sentence, and dis(r, e₂) is the distance between relation word r and argument e₂ in the sentence.

4. The open entity relation extraction method based on a sentence semantic structure model according to claim 1, characterized in that: in step 3 and step 4, sentence similarity is calculated based on CSM to obtain Sim₁ and based on PV to obtain Sim₂, the two are fused by similarity weighting to obtain the sentence similarity and from it the sentence similarity matrix, similar sentence groups are divided according to the sentence similarity matrix and a similarity threshold, and the entity relation pairs within each group are then merged according to the confidence of the entity relation pairs contained in the group's sentences, thereby reducing redundancy in the entity relation results.