CN108959418A

CN108959418A - Character relation extraction method and device, computer device and computer readable storage medium

Info

Publication number: CN108959418A
Application number: CN201810587061.0A
Authority: CN
Inventors: 黄杨琛; 黄九鸣; 贾焰; 韩伟红; 周斌; 徐菁; 张圣栋; 李爱平; 杨朝辉; 赫中翮; 王志超; 周忠诚; 曾琰; 黄谦; 李靖; 李丹
Original assignee: Hunan Xinghan Shuzhi Technology Co ltd; National University of Defense Technology
Current assignee: Hunan Xinghan Shuzhi Technology Co ltd; National University of Defense Technology
Priority date: 2018-06-06
Filing date: 2018-06-06
Publication date: 2018-12-07

Abstract

The invention relates to the technical field of natural language processing and provides a character relation extraction method comprising the following steps: generating a weak label data set containing person pairs by aligning a knowledge base with natural language text data in a corpus; marking the first sentences in the weak label data set that belong to the same person pair as a positive example packet for that person pair's relation; filtering the first sentences in the positive example packet according to a filtering algorithm based on preset relation indicator words to obtain training positive example data; performing feature extraction on the training positive example data and the second sentences in a negative example packet to obtain multi-factor feature vectors of the second sentences; and inputting the multi-factor feature vectors into a relation classifier to obtain a relation classification result for the person pair by a supervised method. Embodiments of the invention also provide a character relation extraction device, a computer device, and a computer-readable storage medium. The method improves the accuracy of character relation extraction, does not require manually designed complex templates, and is more widely applicable.

Description

Character relation extraction method and device, computer device, and computer-readable storage medium

Technical field

The present invention relates to the technical field of natural language processing, and in particular to a character relation extraction method and device, a computer device, and a computer-readable storage medium.

Background art

A large number of person entities, and the relations between them, are buried in the explosively growing volume of electronic text generated on the Internet. Faced with such diverse and heterogeneous data, information extraction technology is needed to meet the demand for quickly obtaining useful information. Relation extraction, an important task within information extraction, was first formally proposed at the 7th Message Understanding Conference (MUC) in 1998; it refers to the process of finding and identifying the semantic relation between two entities in natural language text.

Entity relation extraction breaks through the limitation of the traditional approach, in which semantic relations are obtained by manual reading and comprehension, and replaces it with automatic search and extraction of semantic relations. As an active research area in natural language processing, entity relation extraction has always been an important direction of information extraction research. Early research on relation extraction relied mainly on manually built syntactic and semantic rules and then identified entity relations by pattern matching. Because these methods require a great deal of manual preparation and expert knowledge, researchers began to try machine learning methods.

According to their degree of dependence on labeled data, machine-learning-based relation extraction methods can be divided into supervised learning, semi-supervised learning, distant supervision, and unsupervised learning. Supervised methods treat relation extraction as a classification problem: effective features are designed from the training data, various classification models are built, and the trained classifier is finally used to predict relations. For feature selection, a relation classifier can be trained on a combination of lexical, syntactic, and semantic features; parse trees and dependency trees can be added to form the feature vector; and some studies add the position of relation feature words as a feature for relation classification. In addition, to avoid manual feature engineering, researchers have begun using neural network structures to learn text features automatically from natural language and then extract entity relations; such deep learning methods are also supervised. Supervised relation extraction achieves high precision and recall, but it depends heavily on a predefined relation type system and a labeled data set. Deep learning methods in particular, owing to the nature of neural networks, need a large amount of training data to obtain a good classification network. Semi-supervised methods mainly use bootstrapping, label propagation, and similar techniques. For the relation to be extracted, such a method first sets a few seed instances by hand and then iteratively extracts the corresponding relation templates and more instances from the data.

Compared with supervised methods, semi-supervised methods can greatly reduce the size of the labeled corpus needed for learning, but poor handling of the choice of the initial seed set, or of the noise introduced during iteration, degrades their actual performance. Unsupervised open relation extraction assumes that entity pairs with the same semantic relation have similar context, so the context of each entity pair is used to represent that pair's semantic relation, and the semantic relations of all entity pairs are then clustered. Unsupervised entity relation extraction needs no predefined relation type system and is domain-independent, which is an advantage when processing massive open-domain data. However, the clustering threshold is hard to determine in advance, the accuracy of the extracted results is low, and objective evaluation criteria are still lacking.

In recent years, various large-scale knowledge bases (Knowledge Base, KB), such as Freebase, DBpedia, YAGO, and online encyclopedia knowledge bases, have been built, which is of great value for constructing training data for supervised machine learning methods. In 2009, Mintz et al. first proposed the idea of distant supervision (Distant Supervision, DS) for relation extraction. Distant supervision assumes that if two entities are related in the knowledge base, then every sentence containing both entities expresses that relation. Relation extraction based on distant supervision automatically aligns natural language text with a given knowledge base and then learns relation extraction from the resulting weakly labeled training data.

Fig. 1 shows an example system that performs relation extraction using distant supervision. The system first aligns natural language text with the knowledge base by distant supervision and marks each identified sentence containing a given person entity pair as weak label data for that pair's relation. Then, for a relation query about a person pair, the system extracts the relevant features from the sentences and feeds them into a classifier to decide the relation; finally, according to the relation probabilities in the classification result, the relation facts judged correct are put into the relation knowledge base. This both solves the problem that supervised methods rely excessively on manually labeled data and, to some extent, avoids the low accuracy of unsupervised methods.

However, the basic assumption of distant supervision is not rigorous: not every sentence in the corpus in which an entity pair co-occurs expresses the relation that the pair has in the knowledge base. For example, the co-occurrence sentence "Yao Ming led everyone to the press conference, and Ye Li later also appeared at the scene." does not semantically express the "husband and wife" relation between them. Such sentences, which contain the entity pair but from which no relation feature can be extracted, are noise data produced by distant supervision and should be filtered out. Moreover, current relation extraction research focuses mainly on English resources, chiefly because Chinese corpora require word segmentation and have complex sentence structures and implicit semantics, which makes Chinese character relation extraction more difficult. In addition, Chinese knowledge bases were built relatively late, so research on distant supervision for relation extraction from Chinese corpora is also relatively scarce.

Pan Yun et al. made the first attempt to build a Chinese character relation extraction system using the online resources of the Chinese Interactive Encyclopedia. They trained the model with a label propagation algorithm and obtained an accuracy of about 68%. That method does not denoise the distant supervision data, takes too long to train, and has low accuracy. Huang Beijing et al. used word vectors together with sentence pattern extraction, clustering, and scoring to filter noise sentences out of the original training set obtained in distantly supervised character relation extraction, thereby denoising the training set generated by distant supervision. However, that method requires manual intervention, and the pattern extraction method it uses transfers poorly and is strongly domain-specific.

Summary of the invention

The present invention provides a character relation extraction method intended to solve the following problems of Chinese character relation extraction methods in the prior art: the trained model has low accuracy because the distant supervision data is not denoised, and the pattern extraction method used transfers poorly and suits only a single application scenario.

The invention is realized as follows. A character relation extraction method comprises:

generating a weak label data set containing person pairs by aligning a knowledge base with natural language text data in a corpus;

marking the first sentences in the weak label data set that belong to the same person pair as a positive example packet for that person pair's relation;

filtering the first sentences in the positive example packet according to a filtering algorithm based on preset relation indicator words to obtain training positive example data;

performing feature extraction on the training positive example data and the second sentences in a negative example packet to obtain multi-factor feature vectors of the second sentences;

inputting the multi-factor feature vectors into a relation classifier to obtain a relation classification result for the person pair.

The present invention also provides a character relation extraction device, comprising:

a data generating unit, configured to generate a weak label data set containing person pairs by aligning a knowledge base with natural language text data in a corpus;

a marking unit, configured to mark the first sentences in the weak label data set that belong to the same person pair as a positive example packet for that person pair's relation;

a filtering unit, configured to filter the first sentences in the positive example packet according to a filtering algorithm based on preset relation indicator words to obtain training positive example data;

an extraction unit, configured to perform feature extraction on the training positive example data and the second sentences in a negative example packet to obtain multi-factor feature vectors of the second sentences;

a result acquiring unit, configured to input the multi-factor feature vectors into a relation classifier to obtain a relation classification result for the person pair.

An embodiment of the invention also provides a computer device comprising a processor, where the processor, when executing a computer program stored in a memory, implements the steps of the character relation extraction method described above.

An embodiment of the invention also provides a computer-readable storage medium storing a computer program that, when executed by a processor, implements the steps of the character relation extraction method described above.

The character relation extraction method provided by the invention generates a weak label data set containing person pairs by aligning a knowledge base with natural language text data in a corpus, marks the first sentences in the weak label data set that belong to the same person pair as a positive example packet, and then filters the marked positive example packet with a filtering algorithm based on preset relation indicator words to obtain more accurate training positive example data. A large, high-quality supervised training data set can thus be obtained without manual participation. For feature selection in training, lexical features of the natural language text are combined with syntactic features produced by syntactic dependency analysis, and a character relation classifier is trained on the combined multi-factor feature vectors to classify character relations. This effectively improves the accuracy of character relation extraction, requires no manually designed complex templates, is applicable to extraction tasks for new relation types, and has a wider range of application.

Brief description of the drawings

Fig. 1 shows an example system that performs relation extraction using existing distant supervision technology;

Fig. 2 is an implementation flowchart of a character relation extraction method provided by an embodiment of the invention;

Fig. 3 is a comparison, provided by an embodiment of the invention, of the F1 values of the extraction results for person pairs of different relations obtained with and without noise data;

Fig. 4 is an example of syntactic dependency analysis provided by an embodiment of the invention;

Fig. 5 is an implementation flowchart, provided by an embodiment of the invention, of filtering the first sentences in the positive example packet according to the filtering algorithm based on preset relation indicator words to obtain training positive example data;

Fig. 6 is a diagram, provided by an embodiment of the invention, of the relationship between entity-pair distance and the number of relation triples;

Fig. 7 is an implementation flowchart, provided by an embodiment of the invention, of performing feature extraction on the training positive example data and the second sentences in the negative example packet to obtain the multi-factor feature vectors of the second sentences;

Fig. 8 is a structural diagram of a character relation extraction device provided by an embodiment of the invention;

Fig. 9 is a structural diagram of the filtering unit provided by an embodiment of the invention;

Fig. 10 is a structural diagram of the extraction unit provided by an embodiment of the invention.

Detailed description of the embodiments

To make the technical problem solved, the technical solution adopted, and the technical effect achieved by the invention clearer, the invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here only explain the invention and do not limit it.

Before the specific embodiments are introduced, the main method of the invention is first explained in theory:

Distant supervision was first proposed to solve problems in bioinformatics; later, Mintz et al. were the first to apply it to relation extraction. The idea is to use a given knowledge base D and a corpus C. The knowledge base D contains a large number of relation triples (e₁, r, e₂), where r ∈ R (r is a relation type and R is the set of relation types) and (e₁, e₂) is an entity pair with relation r. Distant supervision aligns D with C: any sentence in C that contains the entity pair of a relation triple in D is regarded as expressing the relation fact that holds between that entity pair in D.

For example, Mintz et al. used Freebase as the structured knowledge base. For each entity pair in the relation instances of Freebase, they found all sentences in Wikipedia containing that pair and extracted the corresponding text features from them to train a relation classifier. However, this distant supervision assumption is too strong and introduces a large amount of noise into the training data. Subsequently, some studies relaxed the assumption through multi-instance learning (Multiple Instance Learning, MIL).

Whereas supervised learning takes a series of individually labeled instances as input, MIL takes a series of labeled "packets" (usually called "bags" in the MIL literature), each containing many instances. When all instances in a packet are negative, the packet is labeled as a negative example packet; when the packet contains at least one positive instance, it is labeled as a positive example packet. Given a series of labeled packets, a classifier can learn either (1) to induce a concept that labels individual instances correctly, or (2) to label a packet directly, without such induction. In a distantly supervised relation extraction model guided by multi-instance learning, it is assumed that at least one of the co-occurrence sentences of an entity pair (e₁, e₂) expresses the relation. All co-occurrence sentences of the same entity pair are therefore put into the same packet, and a packet that contains at least one positive instance is labeled as a positive example of the relation. A packet labeled negative contains only sentences that cannot express the relation; the sentences (instances) in negative example packets are mainly used to let the relation model distinguish positive from negative examples during training. By learning the complementary information among sentences in the same positive example packet during training, multi-instance learning can to some extent alleviate the mislabeling caused by distant supervision noise.
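The packet labeling rule described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the patent's implementation: the function names `build_packets` and `packet_label` and the instance tuple layout are hypothetical, and instance-level labels are shown only to make the rule concrete (in distant supervision they are not observed).

```python
from collections import defaultdict


def build_packets(instances):
    """Group instances into packets keyed by entity pair.

    Each instance is (e1, e2, sentence, is_positive); this layout is an
    assumption made for the sketch.
    """
    packets = defaultdict(list)
    for e1, e2, sentence, is_positive in instances:
        packets[(e1, e2)].append((sentence, is_positive))
    return dict(packets)


def packet_label(packet):
    """MIL rule: positive if at least one instance is positive, else negative."""
    return any(is_positive for _, is_positive in packet)


instances = [
    ("Yao Ming", "Ye Li", "Yao Ming and his wife Ye Li attended the ceremony.", True),
    ("Yao Ming", "Ye Li", "Yao Ming arrived first, and Ye Li appeared later.", False),
    ("Lin Zhiying", "kimi", "Lin Zhiying and kimi were both at the event.", False),
]
packets = build_packets(instances)
labels = {pair: packet_label(packet) for pair, packet in packets.items()}
```

Here the packet for (Yao Ming, Ye Li) is positive because one of its two sentences is positive, while the packet for (Lin Zhiying, kimi) is negative because none of its sentences is.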

Fig. 2 shows an implementation flowchart of a character relation extraction method provided by an embodiment of the invention, comprising:

In step S101, a weak label data set containing person pairs is generated by aligning a knowledge base with natural language text data in a corpus.

In embodiments of the invention, the knowledge base includes Freebase, DBpedia, YAGO, online encyclopedia knowledge bases, and the like; the knowledge base stores person pairs, person entity names, person pair relations, and so on.

As an embodiment of the invention, the corpus includes text data, and the text data includes text, symbols, pictures, and the like; for example, news reports, articles, and papers are all text data.

The natural language text data includes any one or a combination of English, Chinese, French, Russian, and other text data, without specific limitation.

In embodiments of the invention, the weak label data set includes positive example data and negative example data.

In an embodiment of the invention, two person entities form a person pair; for example, Yao Ming and Ye Li are a person pair, and Lin Zhiying and kimi are a person pair.

In the knowledge base, a person pair is generally represented by a triple consisting of the names of the two entities in the pair and the relation between them, for example (Yao Ming, wife, Ye Li), (Yao Ming, husband, Ye Li), (Yao Ming, husband and wife, Ye Li), (Lin Zhiying, father and son, kimi), (Lin Zhiying, son, kimi), and (Lin Zhiying, father, kimi).

In this embodiment, a weak label data set containing 104,593 sentences was constructed, of which 80% (83,675 sentences) was used as training data and the remaining 20% (20,919 sentences) as test data. Five common character relations were chosen for the experiment: father and son, mother and son, brothers, sisters, and husband and wife. Table 1 shows the data distribution of the weak label data set:

Table 1: Distribution of the weak label data set

Through step S101 above, the natural language text data in the corpus is matched against the data stored in the knowledge base to generate a weak label data set containing person pairs. Example 1: a sentence in a news report reads "Yao Ming led everyone to the press conference, and Ye Li and her daughter then also appeared at the scene". Matching this sentence against the Freebase database yields three person pairs and three corresponding character relations. The person pairs are: pair A, Yao Ming and Ye Li; pair B, Ye Li and the daughter; pair C, Yao Ming and the daughter. The corresponding character relations are husband and wife, mother and daughter, and father and daughter, respectively.
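The alignment in step S101 can be sketched as follows. This is a simplified illustration, not the patented procedure: the function name `align` is hypothetical, and entity mentions are found by plain substring matching, which is an assumption standing in for whatever entity recognition a real system would use.

```python
from itertools import combinations


def align(sentences, kb_triples):
    """Return weakly labelled records (e1, relation, e2, sentence).

    A sentence receives the label of every knowledge-base triple whose two
    entities both occur in it (the distant supervision assumption).
    """
    pair_index = {}
    names = set()
    for e1, relation, e2 in kb_triples:
        pair_index.setdefault(frozenset((e1, e2)), []).append((e1, relation, e2))
        names.update((e1, e2))

    weak_labels = []
    for sentence in sentences:
        # Substring matching is an assumption made to keep the sketch short.
        present = sorted(name for name in names if name in sentence)
        for a, b in combinations(present, 2):
            for e1, relation, e2 in pair_index.get(frozenset((a, b)), []):
                weak_labels.append((e1, relation, e2, sentence))
    return weak_labels


kb_triples = [
    ("Yao Ming", "husband and wife", "Ye Li"),
    ("Lin Zhiying", "father and son", "kimi"),
]
sentences = [
    "Yao Ming led everyone to the press conference, and Ye Li and her daughter then also appeared at the scene.",
    "Lin Zhiying gave a speech.",
]
weak_labels = align(sentences, kb_triples)
```

Only the first sentence contains both members of a knowledge-base pair, so it is the only one that receives a weak label; whether it actually expresses the relation is exactly what the later filtering step has to decide.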

In step S102, the first sentences in the weak label data set that belong to the same person pair are marked as a positive example packet for that person pair's relation.

In an embodiment of the invention, a positive example packet is a data packet containing all sentences that describe the same person pair.

For example, to extract the husband and wife relation, pair A in Example 1 above (Yao Ming and Ye Li), whose character relation is husband and wife, belongs to a positive example packet, whereas pairs B and C and their corresponding character relations belong to negative example packets.

In embodiments of the invention, the first sentences comprise at least one sentence and generally several; the exact number depends on the actual situation.

In step S103, the first sentences in the positive example packet are filtered according to a filtering algorithm based on preset relation indicator words to obtain training positive example data.

In embodiments of the invention, the preset relation indicator words include husband and wife, husband, wife, father and son, father, son, mother and son, mother, father and daughter, daughter, brothers, elder brother, younger brother, sisters, elder sister, younger sister, brother and sister, elder sister and younger brother, friend, colleague, and so on, without specific limitation. All these preset relation indicator words constitute the preset relation dictionary.

The training positive example data is the data retained in the positive example packet, generally sentences or words.

For example, the sentence in Example 2, "Yao Ming led everyone to the press conference, and Ye Li then also appeared at the scene.", cannot semantically express the "husband and wife" relation between them. Such a sentence, which contains the entity pair but from which no relation feature can be extracted, is noise data and should be filtered out. By contrast, the sentence in Example 3, "Yao Ming and his wife Ye Li attended the ribbon-cutting ceremony of the Hope Primary School", contains the word "wife", which is in the preset relation dictionary and indicates the husband and wife relation between the person entities Yao Ming and Ye Li, so it is retained as training positive example data.
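The effect of the relation dictionary on Examples 2 and 3 can be illustrated with a small sketch. It shows only the dictionary lookup, not the word weighting of step S1031; the name `filter_positive_packet`, the English dictionary subset, and the pre-segmented input are assumptions made for the example.

```python
# Hypothetical English subset of the preset relation dictionary.
RELATION_DICTIONARY = {"husband", "wife", "father", "son", "mother", "daughter"}


def filter_positive_packet(segmented_sentences, dictionary=RELATION_DICTIONARY):
    """Keep only sentences that contain at least one relation indicator word.

    Each sentence is a list of words, i.e. word segmentation is assumed to
    have been done already.
    """
    return [
        words
        for words in segmented_sentences
        if any(word in dictionary for word in words)
    ]


example_2 = ["Yao Ming", "led", "everyone", "to", "the", "press", "conference",
             "and", "Ye Li", "then", "also", "appeared", "at", "the", "scene"]
example_3 = ["Yao Ming", "and", "his", "wife", "Ye Li", "attended", "the",
             "ribbon-cutting", "ceremony", "of", "the", "Hope", "Primary", "School"]

training_positive = filter_positive_packet([example_2, example_3])
```

Example 2 is dropped as noise because it contains no indicator word, and Example 3 is kept because it contains "wife".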

It is appreciated that in embodiments of the present invention, being not present in preset relation word dictionary by what filter algorithm filters fell In data, i.e. noise data can effectively improve the accuracy rate for making personage's Relation extraction.Referring to Fig. 3 to different relationship personages To the F1 value comparison diagram in the extraction result for having noise free data to obtain, (wherein, abscissa means that five kinds in above-mentioned table 1 Relationship type, ordinate are F1 values, and AVG is the mean value of five kinds of relational results.)

According to Fig. 3 it was determined that the F1 value for the extraction result that the first data of removal noise data obtain significantly improves, take out The accuracy rate of character relation is taken to effectively improve.

In step S104, the second sentence in the trained positive example data and negative example packet is subjected to feature extraction, is obtained Obtain the multiple-factor feature vector of second sentence.

In embodiments of the present invention, the multiple-factor feature vector of second sentence include morphology factor vector sum syntax because Subvector.It is to convert the second sentence to the process of multiple-factor feature vector that second sentence, which is carried out feature extraction,.

The morphology further comprises because of subvector: distance feature, relative seat feature and part of speech feature；Wherein, distance Feature refers to word of two people entities in sentence away from being generally abbreviated as dis；Relative seat feature refers to people entities in sentence Tandem in son, is generally indicated with order；Part of speech feature refers to the quantity of verb and noun after segmenting in sentence, Generally indicated with vn.
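The three lexical features can be computed from a segmented, part-of-speech-tagged sentence as in the sketch below. Several details are assumptions, since the text does not fix them: the function name `lexical_features`, encoding order as 1 or 0, returning vn as a (verbs, nouns) pair, and the tag names "v" and "n" (person names carry a separate tag here and are not counted as nouns).

```python
def lexical_features(tagged_words, e1, e2):
    """Compute dis, order and vn for one sentence.

    tagged_words is a list of (word, pos) pairs; "v" marks verbs and "n"
    marks common nouns (tag names are an assumption of this sketch).
    """
    words = [word for word, _ in tagged_words]
    i1, i2 = words.index(e1), words.index(e2)

    dis = abs(i2 - i1)                 # word distance between the two entities
    order = 1 if i1 < i2 else 0        # 1 when e1 precedes e2 (assumed encoding)
    verbs = sum(1 for _, pos in tagged_words if pos == "v")
    nouns = sum(1 for _, pos in tagged_words if pos == "n")

    return {"dis": dis, "order": order, "vn": (verbs, nouns)}


tagged = [
    ("Yao Ming", "nh"), ("and", "c"), ("wife", "n"),
    ("Ye Li", "nh"), ("attended", "v"), ("ceremony", "n"),
]
features = lexical_features(tagged, "Yao Ming", "Ye Li")
```

For this sentence the two entities are three words apart, "Yao Ming" comes first, and there are one verb and two nouns.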

The syntactic factor vector further comprises a syntactic dependency feature, a feature for the distance between an entity and the core predicate, and an entity context feature. The syntactic dependency feature is the value of the dependency relation to which each person entity belongs in the sentence; it reflects the relation between person entities and is generally denoted parsing-r. An entity is the name of a person in the sentence, and the core predicate is the predicate verb that embodies the core of the sentence; the distance between an entity and the core predicate is generally denoted p-dis. The entity context feature is the weights of the two words before and the two words after a person entity, zero-padded when fewer than two words are available, and is generally denoted context.

The 15 dependency relations defined by the Harbin Institute of Technology Language Cloud are ordered as core (head) relation, punctuation, independent structure, right adjunct, left adjunct, preposition-object, coordination, verb-complement, adverbial, attribute, pivotal construction, fronted object, indirect object, verb-object, and subject-verb, and are assigned the values 0 to 14 in that order.
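The value assignment can be written as a lookup table. In this sketch the tag abbreviations are those commonly used for the Language Technology Platform (LTP) dependency labels; pairing each abbreviation with the relation names in the list above is an assumption, as are the names `DEPENDENCY_ORDER` and `parsing_r`.

```python
# Dependency labels in the order given in the text, from 0 to 14.
# Abbreviations follow the usual LTP tag set (assumed mapping).
DEPENDENCY_ORDER = [
    "HED",  # core (head) relation
    "WP",   # punctuation
    "IS",   # independent structure
    "RAD",  # right adjunct
    "LAD",  # left adjunct
    "POB",  # preposition-object
    "COO",  # coordination
    "CMP",  # verb-complement
    "ADV",  # adverbial
    "ATT",  # attribute
    "DBL",  # pivotal construction
    "FOB",  # fronted object
    "IOB",  # indirect object
    "VOB",  # verb-object
    "SBV",  # subject-verb
]
DEPENDENCY_VALUE = {tag: value for value, tag in enumerate(DEPENDENCY_ORDER)}


def parsing_r(tag):
    """Return the numeric value of a dependency label."""
    return DEPENDENCY_VALUE[tag]
```

Under this mapping, attribute, verb-object, and subject-verb relations take the values 9, 13, and 14 respectively.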

As an embodiment of the invention, the syntactic structure of a sentence describes the phrase structures and dependency structures in the sentence and their functions.

Referring to Fig. 4, take the sentence "This is the eldest daughter of Huang Lei and Sun Li, just over 11 years old, named Huang Yici." as an example. There is an attribute relation between the person entity "Huang Lei" and the relation word "daughter", a subject-verb relation between the relation word "daughter" and the core predicate "named", and a verb-object relation between the core predicate "named" and the person entity "Huang Yici". Through such syntactic dependency analysis, it can be found that the person entities "Huang Lei" and "Huang Yici" both depend on the relation word "daughter". Further, through the coordination relation between "Huang Lei" and "Sun Li", the dependency between the person entities "Sun Li" and "Huang Yici" and the relation word "daughter" can also be obtained. It can be seen that the core predicate plays a key role in delimiting entity boundaries and connecting entity relations; therefore, in a natural language sentence, the distance between an entity and the core predicate is also an implicit relation feature between entities.

In embodiments of the invention, the second sentences comprise at least one sentence and generally several; the exact number depends on the actual situation. It can be understood that the second sentences generally contain more data than the first sentences.

In step S105, the multi-factor feature vectors are input into a relation classifier to obtain the relation classification result for the person pair.

In embodiments of the invention, the relation classifier includes a husband and wife classifier, a father and son classifier, a mother and daughter classifier, a brothers classifier, a sisters classifier, and so on.

In embodiments of the invention, the relation classification result for a person pair is generally represented as a person triple, for example (Yao Ming, husband and wife, Ye Li), (Lin Zhiying, father and son, kimi), (Jia Jingwen, mother and daughter, Bu Bu), (Bu Bu, sisters, Bo Niu), (Jiang Wen, brothers, Jiang Wu), (Xiao Ming, friend, Xiao Hong), and (Lan Lan, colleague, Xiao Huang).

The character relation extraction method provided by embodiments of the invention generates a weak label data set containing person pairs by aligning a knowledge base with natural language text data in a corpus, marks the first sentences in the weak label data set that belong to the same person pair as a positive example packet, and then filters the marked positive example packet with a filtering algorithm based on preset relation indicator words to obtain more accurate training positive example data. A large, high-quality supervised training data set can thus be obtained without manual participation. For feature selection in training, lexical features of the natural language text are combined with syntactic features produced by syntactic dependency analysis, and a character relation classifier is trained on the combined multi-factor feature vectors to classify character relations. This effectively improves the accuracy of character relation extraction, requires no manually designed complex templates, is applicable to extraction tasks for new relation types, and has a wider range of application.

In this embodiment of the present invention, there are multiple character pairs, and step S102 above comprises:

classifying the first sentences in the weakly labeled data set that belong to the same character pair into one positive example packet;

marking the multiple positive example packets respectively.

In this embodiment of the present invention, the natural language text data obtained from the corpus may contain multiple character pairs. In that case, the sentences are classified by character pair, and the positive example packet of each character pair is marked separately.

In the character relation extraction method provided by this embodiment of the present invention, when there are multiple character pairs, the first sentences belonging to the same character pair are first classified into one positive example packet, and the multiple positive example packets are then marked separately. When the corpus contains multiple character pairs, the relations of these pairs can thus be extracted at the same time, which makes the character pair extraction method more intelligent.
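The grouping of sentences into one positive example packet per character pair can be sketched as below. The record layout `(head, tail, relation, sentence)` and the function name `build_positive_packets` are assumptions made for illustration; the sentences are placeholders.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (head, tail, relation, sentence) records, as a weak-labelling step might emit them.
WeakLabel = Tuple[str, str, str, str]


def build_positive_packets(weak_labels: List[WeakLabel]) -> Dict[Tuple[str, str], dict]:
    """Group sentences by character pair and mark each packet with its relation."""
    sentences = defaultdict(list)
    relations = {}
    for head, tail, relation, sentence in weak_labels:
        sentences[(head, tail)].append(sentence)
        relations[(head, tail)] = relation
    return {pair: {"relation": relations[pair], "sentences": sents}
            for pair, sents in sentences.items()}


weak_labels = [
    ("Yao Ming", "Ye Li", "husband-wife", "sentence 1"),
    ("Jiang Wen", "Jiang Wu", "brother", "sentence 2"),
    ("Yao Ming", "Ye Li", "husband-wife", "sentence 3"),
]
packets = build_positive_packets(weak_labels)
print(packets[("Yao Ming", "Ye Li")])
```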

Referring to Fig. 5, as an embodiment of the present invention, step S103 above specifically comprises:

In step S1031, the weight of each word obtained by segmenting the first sentence in the positive example packet is calculated by a preset formula, in which:

$TI(w)$ denotes the weight of any word $w$ obtained by segmenting a given first sentence in the positive example packet,

$tf(w, s)$ denotes the normalized term frequency of the word $w$ in the sentence $s$,

$idf(w, S)$ denotes the inverse document frequency of the word $w$ in the corpus $S$,

$n_{w,j}$ is the number of times the word occurs in the sentence $s$,

$\sum_k n_{k,j}$ is the total number of occurrences of all words in the sentence $s$,

$|S|$ is the total number of first sentences in the weakly labeled positive example data of one relation instance packet,

$|\{j : w \in s, s \in S\}|$ is the number of sentences in the corpus $S$ that contain the word $w$.
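The symbols defined above are those of a TF-IDF weighting. A reconstruction under that assumption is sketched here; the product form of $TI(w)$ and the logarithm in $idf$ follow the standard TF-IDF definition and are assumptions, since the text only names the individual terms:

```latex
\[
  TI(w) = tf(w, s) \times idf(w, S), \qquad
  tf(w, s) = \frac{n_{w,j}}{\sum_{k} n_{k,j}}, \qquad
  idf(w, S) = \log \frac{|S|}{\left|\{\, j : w \in s,\ s \in S \,\}\right|}
\]
```

Under this form, a word that occurs in every sentence of the packet has $idf = \log 1 = 0$ and therefore weight 0, which is consistent with common function words and punctuation receiving a weight of 0.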

In this embodiment of the present invention, word segmentation refers to splitting the first sentence into words. For example, the sentence of Example 3 can be split into: Yao Ming, with, wife, Ye Li, attended, Hope Primary School, ribbon-cutting ceremony, and the sentence-final full stop. It should be noted that the punctuation marks in the sentence are also split out as separate tokens.

In step S1032, the three highest-weighted words of the first sentence are selected, and it is judged whether at least one of these three words is present in the preset relation-word dictionary. If the judgment result is yes, step S1033 is executed; if the judgment result is no, step S1034 is executed and the first sentence is deleted.

In step S1033, the first sentence is retained as a positive example sentence.
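Steps S1031 to S1034 can be sketched as follows. This is an illustrative sketch that assumes the weight is the standard TF-IDF product with a natural logarithm in the inverse document frequency; the toy packet, the relation-word set and the function names are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Set


def tfidf_weights(sentence: List[str], packet: List[List[str]]) -> Dict[str, float]:
    """TF-IDF weight of each word of `sentence`; `sentence` must be a member of `packet`."""
    counts = Counter(sentence)
    total = sum(counts.values())
    weights = {}
    for word, n in counts.items():
        containing = sum(1 for s in packet if word in s)  # >= 1 because sentence is in packet
        weights[word] = (n / total) * math.log(len(packet) / containing)
    return weights


def keep_sentence(sentence: List[str], packet: List[List[str]],
                  relation_words: Set[str], top_k: int = 3) -> bool:
    """Keep the sentence if at least one of its top_k weighted words is a relation word."""
    weights = tfidf_weights(sentence, packet)
    top = sorted(weights, key=weights.get, reverse=True)[:top_k]
    return any(word in relation_words for word in top)


# Toy, already segmented packet for one character pair (made-up sentences).
packet = [
    ["Yao Ming", "with", "wife", "Ye Li", "attended", "ceremony"],
    ["Yao Ming", "with", "Ye Li", "visited", "school"],
    ["Yao Ming", "with", "Ye Li", "watched", "game"],
]
relation_words = {"wife", "husband", "married"}

training_positives = [s for s in packet if keep_sentence(s, packet, relation_words)]
print(training_positives)
```

In this toy packet only the first sentence survives, because it is the only one whose top-weighted words include a relation indicator word.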

In a practical application, for example, after the sentence of Example 3 is segmented, the weight of each word (punctuation marks included) is calculated by the above formula. In order, the weights are:

Weight (Yao Ming)=0.027,

Weight (with)=0.002,

Weight (wife)=0.408,

Weight (Ye Li)=0.018,

Weight ()=0,

Weight (attending)=0.029,

Weight ()=0,

Weight (Hope Primary School)=0.031,

Weight ()=0,

Weight (ribbon-cutting ceremony)=0.012,

Weight (.)=0.

The three highest-weighted words are thus obtained as Yao Ming, Ye Li and wife, all of which are present in the preset relation-word dictionary. The sentence of Example 3, "Yao Ming attended the ribbon-cutting ceremony of Hope Primary School with his wife Ye Li", is therefore retained as a positive example sentence.

The character relation extraction method provided by this embodiment of the present invention further filters the first sentences by the preset formula, which improves the accuracy of character relation extraction.

Referring to Fig. 7, in this embodiment of the present invention, step S104 above specifically comprises:

In step S1041, a lexical factor vector is constructed from the word-structure features of the second sentence;

In step S1042, a syntactic factor vector is constructed from the semantic-relation features of the second sentence;

In step S1043, the lexical factor vector and the syntactic factor vector are merged to obtain the multi-factor feature vector of the second sentence.

Referring to Fig. 6, a quantitative study of the relation between the word distance of a large number of entity pairs and the number of relation triples shows that the point (5, 0.7923) means that relation instances whose two entities are at a word distance of at most 5 account for 79.23% of all relation triples. That is, the number of relation triples rises sharply at first as the word distance increases, but once the word distance between the two entities exceeds 5, the number of relation triples grows more and more slowly as the distance increases further. This indicates that two entities that are closer together are more likely to have an entity relation.
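A word distance of this kind, and the cumulative share plotted in Fig. 6, can be computed along the following lines. The sketch assumes that the word distance counts the tokens strictly between the two entity mentions, and it uses a simplified English token list for Example 3; the distances passed to `share_within` are made up.

```python
from typing import List, Sequence


def word_distance(tokens: Sequence[str], entity_a: str, entity_b: str) -> int:
    """Number of tokens strictly between the first mentions of the two entities."""
    i, j = tokens.index(entity_a), tokens.index(entity_b)
    return abs(i - j) - 1


def share_within(distances: List[int], max_distance: int) -> float:
    """Fraction of entity pairs whose word distance is at most max_distance."""
    return sum(d <= max_distance for d in distances) / len(distances)


tokens = ["Yao Ming", "with", "wife", "Ye Li", "attended",
          "Hope Primary School", "ribbon-cutting ceremony", "."]
example_distance = word_distance(tokens, "Yao Ming", "Ye Li")  # 2: "with" and "wife"
toy_share = share_within([1, 2, 5, 7], 5)                      # 0.75 for these made-up distances
print(example_distance, toy_share)
```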

In another experiment built on the experiment of Fig. 6, the relative position feature between the entities, the part-of-speech feature, the syntactic dependency feature, the entity-to-core-predicate distance feature and the entity context feature were added cumulatively for entity pairs of different relations, and the average performance over the different relations was obtained; see Table 2. As Table 2 shows, the precision (P), recall (R) and F1 value of character relation extraction increase as features are added. (Recall is the ratio of the number of relevant documents retrieved to the number of all relevant documents in the document collection; it measures the completeness of a retrieval system.)

Table 2: Average performance of entity pairs for different relations
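For reference, the three measures reported above are computed from counts of correct and incorrect extractions as sketched below; the counts in the usage line are made up and are not results from the experiment.

```python
from typing import Tuple


def precision_recall_f1(true_pos: int, false_pos: int, false_neg: int) -> Tuple[float, float, float]:
    """Precision, recall and F1 from extraction counts (0.0 when a ratio is undefined)."""
    predicted = true_pos + false_pos
    relevant = true_pos + false_neg
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / relevant if relevant else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical counts: 8 correct triples, 2 wrong triples, 8 missed triples.
p, r, f1 = precision_recall_f1(8, 2, 8)
print(p, r, round(f1, 4))
```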

As a practical application of the present invention, in Example 3 above:

Distance feature dis = 2

Relative position feature order = 1

Part-of-speech feature vn = 5

Syntactic dependency feature parsing-r = (9, 14)

Entity-to-predicate distance feature p-dis = (5, 2)

Entity context feature = (0, 0, 0.002, 0.408, 0.002, 0.408, 0, 0.272). The multi-factor feature vector of the sentence "Yao Ming attended the ribbon-cutting ceremony of Hope Primary School with his wife Ye Li" is therefore:

(2,1,5,9,14,5,2,0,0,0.002,0.408,0.002,0.408,0,0.272)
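The merge of step S1043 amounts to concatenating the two factor vectors. The sketch below rebuilds the vector above from the feature values of Example 3, grouping the features into lexical and syntactic factors as the description of the extraction unit does; the function names are hypothetical.

```python
from typing import List, Sequence


def lexical_factor_vector(distance: float, relative_position: float,
                          pos_feature: float) -> List[float]:
    """Distance, relative position and part-of-speech features."""
    return [distance, relative_position, pos_feature]


def syntactic_factor_vector(dependency: Sequence[float],
                            predicate_distance: Sequence[float],
                            context: Sequence[float]) -> List[float]:
    """Dependency, entity-to-predicate distance and entity context features."""
    return [*dependency, *predicate_distance, *context]


def multi_factor_vector(lexical: Sequence[float], syntactic: Sequence[float]) -> List[float]:
    """Merge the two factor vectors by concatenation."""
    return list(lexical) + list(syntactic)


lexical = lexical_factor_vector(2, 1, 5)
syntactic = syntactic_factor_vector(
    (9, 14), (5, 2), (0, 0, 0.002, 0.408, 0.002, 0.408, 0, 0.272))
vector = multi_factor_vector(lexical, syntactic)
print(vector)
```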

The multi-factor feature vector obtained above is input into the husband-wife relation classifier. Because the sentence provides all of the factor features, the character pair Yao Ming-Ye Li can be accurately extracted as a husband-wife relation, and the extraction result is more precise. When the relation of a new character pair needs to be extracted, the noise data can first be filtered out, the sentence converted into a multi-factor feature vector, and the vector input into the relation classifier to extract the new character relation, without any other operation. The method is therefore suitable for extraction tasks on new relation types and has a wider range of application.

Fig. 8 shows a schematic structural diagram of a character relation extraction device 200 provided by an embodiment of the present invention; for ease of description, only the parts relevant to this embodiment are shown. The character relation extraction device 200 comprises:

A data generating unit 210, configured to generate a weakly labeled data set containing character pairs by aligning a knowledge base with the natural language text data in a corpus.

Table 1: Distribution of the weakly labeled data set

Through the data generating unit 210, the natural language text data in the corpus is matched against the data pre-stored in the knowledge base to generate a weakly labeled data set containing character pairs. Example 1: a sentence in a news report reads "Yao Ming led everyone to the press conference, and Ye Li and her daughter also appeared at the scene". Matching this sentence against the Freebase database yields three character pairs and three corresponding character relations. The three character pairs are: pair A, Yao Ming and Ye Li; pair B, Ye Li and the daughter; pair C, Yao Ming and the daughter. The corresponding character relations are husband-wife, mother-daughter and father-daughter, respectively.
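The alignment of Example 1 can be sketched as a lookup of knowledge-base pairs in each sentence. The dictionary `toy_kb` is a hypothetical stand-in for the knowledge base (it does not reflect the Freebase schema or API), and plain substring matching is a simplifying assumption in place of real entity recognition.

```python
from typing import Dict, List, Tuple

# Hypothetical knowledge base: (head, tail) -> relation.
toy_kb: Dict[Tuple[str, str], str] = {
    ("Yao Ming", "Ye Li"): "husband-wife",
    ("Ye Li", "daughter"): "mother-daughter",
    ("Yao Ming", "daughter"): "father-daughter",
}


def align(sentence: str, kb: Dict[Tuple[str, str], str]) -> List[Tuple[str, str, str, str]]:
    """Emit a weak label for every knowledge-base pair whose two entities occur in the sentence."""
    return [(head, tail, relation, sentence)
            for (head, tail), relation in kb.items()
            if head in sentence and tail in sentence]


sentence = ("Yao Ming led everyone to the press conference, "
            "and Ye Li and her daughter also appeared at the scene")
weak_labels = align(sentence, toy_kb)
for head, tail, relation, _ in weak_labels:
    print(head, relation, tail)
```

Labels produced this way are weak: a sentence that mentions both entities does not necessarily express the relation, which is why the positive example packets are filtered afterwards.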

A marking unit 220, configured to mark the first sentences in the weakly labeled data set that belong to the same character pair as a positive example packet of the relation of that character pair.

A filtering unit 230, configured to filter the first sentences in the positive example packet according to a filtering algorithm based on preset relation indicator words, to obtain training positive example data.

It can be understood that, in this embodiment of the present invention, the filtering algorithm filters out the data that do not appear in the preset relation-word dictionary, namely the noise data, which effectively improves the accuracy of character relation extraction. Fig. 3 shows the F1 values of the extraction results obtained for different character pairs with and without noise data (the abscissa gives the five relation types of Table 1 above, with AVG being the mean over the five relation types; the ordinate is the F1 value).

An extraction unit 240, configured to perform feature extraction on the training positive example data and the second sentences in the negative example packet, to obtain the multi-factor feature vector of the second sentence.

The multi-factor feature vector of the second sentence comprises a lexical factor vector and a syntactic factor vector;

The lexical factor vector further comprises: a distance feature, a relative position feature and a part-of-speech feature;

The syntactic factor vector further comprises: a syntactic dependency feature, an entity-to-core-predicate distance feature and an entity context feature.

A result acquiring unit 250, configured to input the multi-factor feature vector into a relation classifier, to obtain the relation classification result of the character pair.
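The five units above form a linear pipeline. A minimal sketch of that composition follows; the class `RelationExtractionDevice`, its field names and the stub callables in the usage example are hypothetical and only show how the output of one unit feeds the next.

```python
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass
class RelationExtractionDevice:
    generate: Callable[[Any, Any], Any]         # data generating unit 210
    mark: Callable[[Any], Any]                  # marking unit 220
    filter_positive: Callable[[Any], Any]       # filtering unit 230
    extract: Callable[[Any, Any], List[Any]]    # extraction unit 240
    classify: Callable[[Any], Any]              # result acquiring unit 250

    def run(self, knowledge_base: Any, corpus: Any, negative_packet: Any) -> List[Any]:
        weak_labels = self.generate(knowledge_base, corpus)
        packets = self.mark(weak_labels)
        positives = self.filter_positive(packets)
        vectors = self.extract(positives, negative_packet)
        return [self.classify(vector) for vector in vectors]


# Stub units, for illustration only.
device = RelationExtractionDevice(
    generate=lambda kb, corpus: [("A", "B", s) for s in corpus],
    mark=lambda weak: {("A", "B"): [sentence for _, _, sentence in weak]},
    filter_positive=lambda packets: [s for sents in packets.values()
                                     for s in sents if "friend" in s],
    extract=lambda positives, negatives: [[len(s)] for s in positives + negatives],
    classify=lambda vector: vector[0] > 10,
)
results = device.run({}, ["A is a friend of B", "A met B"], ["C saw D"])
print(results)
```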

The character relation extraction device provided by this embodiment of the present invention generates a weakly labeled data set containing character pairs by aligning a knowledge base with the natural language text data in a corpus, and marks the first sentences in the weakly labeled data set that belong to the same character pair as one positive example packet. The marked positive example packets are then filtered by a filtering algorithm based on preset relation indicator words, which yields more accurate training positive example data, so that a large amount of high-quality supervised training data can be obtained without manual participation. In feature selection during training, the lexical features of the natural language text and the syntactic features produced by dependency parsing are considered jointly, and the combined multi-factor feature vector is used to train the character relation classifier that performs the classification. This effectively improves the accuracy of character relation extraction, requires no manually designed complex templates, is suitable for extraction tasks on new relation types, and has a wider range of application.

In this embodiment of the present invention, there are multiple character pairs, and the marking unit 220 above comprises:

A classifying subunit, configured to classify the first sentences in the weakly labeled data set that belong to the same character pair into one positive example packet;

A marking subunit, configured to mark the multiple positive example packets respectively.

In the character relation extraction device provided by this embodiment of the present invention, when there are multiple character pairs, the first sentences belonging to the same character pair are first classified into one positive example packet, and the multiple positive example packets are then marked separately. When the corpus contains multiple character pairs, the relations of these pairs can thus be extracted at the same time, which makes the character pair extraction method more intelligent.

Referring to Fig. 9, the filtering unit 230 above specifically comprises:

A weight calculating subunit 231, configured to calculate, by the preset formula, the weight of each word obtained by segmenting the first sentence in the positive example packet. The formula and its symbols are the same as in step S1031; in particular, $idf(w, S)$ denotes the inverse document frequency of the word $w$ in the corpus $S$, and $n_{w,j}$ is the number of times the word occurs in the sentence $s$.

A judging subunit 232, configured to select the three highest-weighted words of the first sentence and judge whether at least one of these three words is present in the preset relation-word dictionary.

A retaining subunit 233, configured to retain the first sentence as a positive example sentence when at least one of the three highest-weighted words is present in the preset relation-word dictionary.

For the sentence of Example 3, the weights of the segmented words (punctuation marks included) are, in order:

Weight (Yao Ming)=0.027,

Weight (with)=0.002,

Weight (wife)=0.408,

Weight (Ye Li)=0.018,

Weight ()=0,

Weight (attending)=0.029,

Weight ()=0,

Weight (Hope Primary School)=0.031,

Weight ()=0,

Weight (ribbon-cutting ceremony)=0.012,

The character relation extraction device provided by this embodiment of the present invention further filters the first sentences by the preset formula, which improves the accuracy of character relation extraction.

Referring to Fig. 10, the extraction unit 240 above comprises:

A first constructing subunit 241, configured to construct a lexical factor vector from the word-structure features of the second sentence;

A second constructing subunit 242, configured to construct a syntactic factor vector from the semantic-relation features of the second sentence;

An extracting subunit 243, configured to merge the lexical factor vector and the syntactic factor vector to obtain the multi-factor feature vector of the second sentence.

Table 4: Average performance of entity pairs for different relations

In Example 3 above, the relative position feature is order = 1, the part-of-speech feature is vn = 5, the syntactic dependency feature is parsing-r = (9, 14), and the entity-to-predicate distance feature is p-dis = (5, 2), which gives the multi-factor feature vector:

(2,1,5,9,14,5,2,0,0,0.002,0.408,0.002,0.408,0,0.272)

An embodiment of the present invention provides a computer device comprising a processor, the processor being configured to implement the steps of the character relation extraction method provided by the above method embodiments when executing a computer program stored in a memory.

Illustratively, the computer program may be divided into one or more modules, which are stored in the memory and executed by the processor to carry out the present invention. The one or more modules may be a series of computer program instruction segments capable of performing specific functions, the instruction segments describing the execution process of the computer program in the computer device. For example, the computer program may be divided according to the steps of the character relation extraction method provided by the above method embodiments.

Those skilled in the art will understand that the above description of the computer device is only an example and does not limit the computer device, which may comprise more or fewer components than described, combine certain components, or use different components; for example, it may comprise input/output devices, network access devices, buses and so on.

The processor may be a central processing unit (Central Processing Unit, CPU), or another general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and so on. The general-purpose processor may be a microprocessor or any conventional processor. The processor is the control center of the computer device and connects the various parts of the entire computer device through various interfaces and lines.

The memory may be used to store the computer program and/or modules. The processor implements the various functions of the computer device by running or executing the computer program and/or modules stored in the memory and by calling the data stored in the memory. The memory may mainly comprise a program storage area and a data storage area. The program storage area may store an operating system and the application programs required by at least one function (such as a sound playing function or an image playing function); the data storage area may store data created through use of the mobile phone (such as audio data or a phone book). In addition, the memory may comprise a high-speed random access memory, and may also comprise a non-volatile memory, such as a hard disk, an internal memory, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, a flash card (Flash Card), at least one magnetic disk storage device, a flash memory device, or another volatile solid-state storage device.

If the integrated modules/units of the computer device are implemented in the form of software functional units and sold or used as independent products, they may be stored in a computer-readable storage medium. Based on this understanding, all or part of the processes in the methods of the above embodiments may also be completed by a computer program instructing the relevant hardware. The computer program may be stored in a computer-readable storage medium and, when executed by a processor, implements the steps of each of the above character relation extraction method embodiments. The computer program comprises computer program code, which may be in source code form, object code form, an executable file, or some intermediate form. The computer-readable medium may comprise: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disc, a computer memory, a read-only memory (ROM, Read-Only Memory), a random access memory (RAM, Random Access Memory), an electrical carrier signal, a telecommunication signal, a software distribution medium, and so on.

Note that the above describes only the preferred embodiments of the present invention. Those skilled in the art will understand that the present invention is not limited to the specific embodiments described here, and that various obvious changes, readjustments and substitutions can be made without departing from the protection scope of the present invention. Therefore, although the present invention has been described in some detail through the above embodiments, it is not limited to them and may include other equivalent embodiments without departing from the inventive concept; the scope of the present invention is determined by the scope of the appended claims.

Claims

1. A character relation extraction method, characterized by comprising:

generating a weakly labeled data set containing character pairs by aligning a knowledge base with the natural language text data in a corpus;

marking the first sentences in the weakly labeled data set that belong to the same character pair as a positive example packet of the relation of that character pair;

filtering the first sentences in the positive example packet according to a filtering algorithm based on preset relation indicator words, to obtain training positive example data;

performing feature extraction on the training positive example data and the second sentences in a negative example packet, to obtain the multi-factor feature vector of the second sentence;

2. The character relation extraction method according to claim 1, characterized in that there are multiple character pairs, and the marking of the sentences in the weakly labeled data set that belong to the same character pair as a positive example packet of the relation of that character pair comprises:

marking the multiple positive example packets respectively.

3. The character relation extraction method according to claim 1, characterized in that the filtering of the first sentences in the positive example packet according to the filtering algorithm based on preset relation indicator words specifically comprises:

calculating, by a preset formula, the weight of each word obtained by segmenting the first sentence in the positive example packet, the formula being as follows:

$idf(w, S)$ denotes the inverse document frequency of the word $w$ in the corpus $S$,

$n_{w,j}$ is the number of times the word occurs in the sentence $s$,

$|\{j : w \in s, s \in S\}|$ is the number of sentences in the corpus $S$ that contain the word $w$;

selecting the three highest-weighted words of the first sentence, and judging whether at least one of these three words is present in a preset relation-word dictionary;

when the judgment result is yes, retaining the first sentence as a positive example sentence.

4. The character relation extraction method according to claim 1, characterized in that the multi-factor feature vector of the second sentence comprises a lexical factor vector and a syntactic factor vector;

5. The character relation extraction method according to claim 4, characterized in that the performing of feature extraction on the training positive example data and the second sentences in the negative example packet, to obtain the multi-factor feature vector of the second sentence, specifically comprises:

constructing a lexical factor vector from the word-structure features of the second sentence;

constructing a syntactic factor vector from the semantic-relation features of the second sentence;

merging the lexical factor vector and the syntactic factor vector to obtain the multi-factor feature vector of the second sentence.

6. A character relation extraction device, characterized in that the character relation extraction device comprises:

a data generating unit, configured to generate a weakly labeled data set containing character pairs by aligning a knowledge base with the natural language text data in a corpus;

a marking unit, configured to mark the first sentences in the weakly labeled data set that belong to the same character pair as a positive example packet of the relation of that character pair;

a filtering unit, configured to filter the first sentences in the positive example packet according to a filtering algorithm based on preset relation indicator words, to obtain training positive example data;

an extraction unit, configured to perform feature extraction on the training positive example data and the second sentences in a negative example packet, to obtain the multi-factor feature vector of the second sentence;

a result acquiring unit, configured to input the multi-factor feature vector into a relation classifier, to obtain the relation classification result of the character pair.

7. The character relation extraction device according to claim 6, characterized in that there are multiple character pairs, and the marking unit comprises:

a classifying subunit, configured to classify the first sentences in the weakly labeled data set that belong to the same character pair into one positive example packet;

8. The character relation extraction device according to claim 6, characterized in that the filtering unit specifically comprises:

a weight calculating subunit, configured to calculate, by a preset formula, the weight of each word obtained by segmenting the first sentence in the positive example packet, the formula being as follows:

$idf(w, S)$ denotes the inverse document frequency of the word $w$ in the corpus $S$,

$n_{w,j}$ is the number of times the word occurs in the sentence $s$,

a judging subunit, configured to select the three highest-weighted words of the first sentence and judge whether at least one of these three words is present in a preset relation-word dictionary;

a retaining subunit, configured to retain the first sentence as a positive example sentence when at least one of the three highest-weighted words is present in the preset relation-word dictionary.

9. The character relation extraction device according to claim 6, characterized in that the multi-factor feature vector of the second sentence comprises a lexical factor vector and a syntactic factor vector;

10. The character relation extraction device according to claim 9, characterized in that the extraction unit comprises:

a first constructing subunit, configured to construct a lexical factor vector from the word-structure features of the second sentence;

a second constructing subunit, configured to construct a syntactic factor vector from the semantic-relation features of the second sentence;

an extracting subunit, configured to merge the lexical factor vector and the syntactic factor vector to obtain the multi-factor feature vector of the second sentence.

11. A computer device, characterized in that the computer device comprises a processor, the processor being configured to implement the steps of the character relation extraction method according to any one of claims 1-5 when executing a computer program stored in a memory.

12. A computer-readable storage medium on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the steps of the character relation extraction method according to any one of claims 1-5.