CN108959418A - Character relation extraction method and device, computer device and computer readable storage medium - Google Patents
- Publication number
- CN108959418A, CN201810587061.0A
- Authority
- CN
- China
- Prior art keywords
- sentence
- feature
- word
- relationship
- positive example
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2411—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
Abstract
The invention relates to the technical field of natural language processing and provides a character relation extraction method comprising: generating a weakly labeled data set containing person pairs by aligning a knowledge base with the natural language text data in a corpus; marking the first sentences in the weakly labeled data set that belong to the same person pair as a positive example bag for the relation of that person pair; filtering the first sentences in the positive example bag with a filtering algorithm based on preset relation indicator words to obtain training positive example data; performing feature extraction on the training positive example data and on the second sentences in a negative example bag to obtain multi-factor feature vectors of the second sentences; and inputting the multi-factor feature vectors into a relation classifier to obtain the relation classification result for the person pair by a supervised method. Embodiments of the invention also provide a character relation extraction device, a computer device, and a computer-readable storage medium. The character relation extraction method provided by the embodiments improves the accuracy of character relation extraction, requires no manually designed complex templates, and is more widely applicable.
Description
Technical field
The present invention relates to the technical field of natural language processing, and in particular to a character relation extraction method and device, a computer device, and a computer-readable storage medium.
Background art
The explosively growing volume of electronic text on the Internet contains a large number of person entities and information about the relations between them. Faced with such diverse, heterogeneous data, information extraction technology is needed so that people can quickly obtain useful information. Relation extraction, an important task within information extraction, was first formally proposed at the seventh Message Understanding Conference (MUC) in 1998; it refers to the process of identifying the semantic relation between two entities in natural language text.
Entity relation extraction removes the limitation of the traditional approach, in which semantic relations are obtained by manual reading and understanding, and replaces it with automatic lookup and extraction of semantic relations. As an active research area in natural language processing, entity relation extraction has long been an important direction in information extraction research. Early work relied mainly on manually built syntactic and semantic rules and identified the relations between entities by pattern matching. Because these methods require a great deal of manual preparation and expert knowledge, researchers began to try machine learning methods.
According to how much they depend on labeled data, machine-learning-based relation extraction methods can be divided into supervised learning, semi-supervised learning, distant supervision, and unsupervised learning. Supervised methods treat relation extraction as a classification problem: effective features are designed from the training data, a classification model is built, and the trained classifier is then used to predict relations. For feature selection, a relation classifier can be trained on a combination of lexical, syntactic, and semantic features; parse trees and dependency trees can be added to form the feature vector; and some studies also add the position of relation feature words as a feature. To avoid manual feature engineering, researchers have also begun to use neural networks to learn features from natural language text automatically before extracting entity relations; such deep learning methods are likewise supervised. Supervised relation extraction achieves high precision and recall, but it depends heavily on a predefined relation type system and a labeled data set. Deep learning methods in particular, owing to the nature of neural networks, need a large amount of training data to produce a good classification model. Semi-supervised methods mainly use bootstrapping, label propagation, and similar techniques. For each relation to be extracted, a few seed instances are first set by hand, and the relation templates and further instances are then extracted from the data iteratively.
Compared with supervised methods, semi-supervised methods greatly reduce the size of the labeled corpus needed for learning, but their practical performance suffers if the choice of the initial seed set or the noise introduced during iteration is handled poorly. Unsupervised open relation extraction assumes that entity pairs with the same semantic relation have similar context; it represents the semantic relation of each entity pair by the pair's context and clusters the semantic relations of all entity pairs. Unsupervised entity relation extraction needs no predefined relation type system and is domain independent, which is an advantage when processing massive open-domain data. However, its clustering threshold is difficult to determine in advance, the accuracy of its results is low, and an objective evaluation standard is still lacking.
In recent years, various large-scale knowledge bases (Knowledge Base, KB), such as Freebase, DBpedia, YAGO, and online encyclopedias, have been built, which is of great value for constructing training data for supervised machine learning methods. In 2009, Mintz et al. first introduced the idea of distant supervision (Distant Supervision, DS) to relation extraction. Distant supervision assumes that if two entities are related in the knowledge base, then every sentence containing both entities expresses that relation. Relation extraction based on distant supervision automatically aligns natural language text with a given knowledge base and then learns relation extraction from the weakly labeled training data produced by the alignment.
Fig. 1 shows an example system that performs relation extraction using distant supervision. The system first aligns natural language text with the knowledge base and labels each identified sentence that contains a given person entity pair as weakly labeled data for the relation of that pair. Then, for a relation query about a person pair, the system feeds the features extracted from the sentences into a classifier to judge the relation, and finally stores the correct relation facts in the relation knowledge base according to the relation probabilities in the classification results. This avoids the excessive reliance of supervised methods on manually labeled data and, to some extent, the low accuracy of unsupervised methods.
However, the basic assumption of distant supervision is not rigorous: a sentence in the corpus in which an entity pair co-occurs does not necessarily express the pair's relation in the knowledge base. For example, the co-occurrence sentence "Yao Ming led everyone to the press conference, and Ye Li arrived shortly afterwards." does not semantically express the "husband and wife" relation between them. Sentences like this, which contain the entity pair but from which no relation feature can be extracted, are noise produced by distant supervision and should be filtered out. Moreover, current relation extraction research focuses mainly on English resources, chiefly because Chinese text requires word segmentation and has complex sentence structure and implicit semantics, which makes Chinese character relation extraction more difficult. In addition, Chinese knowledge bases were built relatively late, so there is comparatively little research on distant supervision for relation extraction from Chinese corpora.
Pan Yun et al. were the first to build a Chinese character relation extraction system from the online resources of a Chinese interactive encyclopedia. They trained the model with a label propagation algorithm and obtained an accuracy of about 68%; the method does not denoise the distant supervision data, its training takes too long, and its accuracy is not high. Huang Beijing et al. used word vectors together with sentence pattern extraction, clustering, and scoring to filter the noisy sentences out of the original training set obtained by distant supervision, thereby denoising the training set. However, this method requires manual intervention, and the pattern extraction it uses transfers poorly and is strongly domain specific.
Summary of the invention
The present invention provides a character relation extraction method that aims to solve two problems of Chinese character relation extraction methods in the prior art: low accuracy of the trained model because the distant supervision data is not denoised, and poor transferability of the pattern extraction methods used, which limits them to a single application scenario.
The invention is realized as a character relation extraction method comprising:
generating a weakly labeled data set containing person pairs by aligning a knowledge base with the natural language text data in a corpus;
marking the first sentences in the weakly labeled data set that belong to the same person pair as a positive example bag for the relation of that person pair;
filtering the first sentences in the positive example bag according to a filtering algorithm based on preset relation indicator words, to obtain training positive example data;
performing feature extraction on the training positive example data and on the second sentences in a negative example bag, to obtain multi-factor feature vectors of the second sentences;
inputting the multi-factor feature vectors into a relation classifier to obtain the relation classification result for the person pair.
The present invention also provides a character relation extraction device comprising:
a data generation unit, configured to generate a weakly labeled data set containing person pairs by aligning a knowledge base with the natural language text data in a corpus;
a marking unit, configured to mark the first sentences in the weakly labeled data set that belong to the same person pair as a positive example bag for the relation of that person pair;
a filtering unit, configured to filter the first sentences in the positive example bag according to a filtering algorithm based on preset relation indicator words, to obtain training positive example data;
an extraction unit, configured to perform feature extraction on the training positive example data and on the second sentences in a negative example bag, to obtain multi-factor feature vectors of the second sentences;
a result acquisition unit, configured to input the multi-factor feature vectors into a relation classifier to obtain the relation classification result for the person pair.
An embodiment of the present invention also provides a computer device comprising a processor, the processor implementing the steps of the character relation extraction method described above when executing a computer program stored in a memory.
An embodiment of the present invention also provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of the character relation extraction method described above.
The character relation extraction method provided by the invention generates a weakly labeled data set containing person pairs by aligning a knowledge base with the natural language text data in a corpus, marks the first sentences in that data set that belong to the same person pair as a positive example bag, and then filters the marked positive example bag with a filtering algorithm based on preset relation indicator words to obtain more accurate training positive example data. A large, high-quality supervised training data set can thus be obtained without manual involvement. For feature selection during training, the lexical features of the natural language text are combined with the syntactic features produced by syntactic dependency analysis, and the combined multi-factor feature vectors are used to train a character relation classifier that classifies character relations. This effectively improves the accuracy of character relation extraction, requires no complex hand-designed templates, adapts to extraction tasks for new relation types, and has a wider range of application.
Brief description of the drawings
Fig. 1 shows an example system that performs relation extraction using existing distant supervision technology;
Fig. 2 is an implementation flowchart of a character relation extraction method provided by an embodiment of the present invention;
Fig. 3 is a comparison, provided by an embodiment of the present invention, of the F1 values of the extraction results obtained with noisy and with denoised data for person pairs of different relations;
Fig. 4 is an example of syntactic dependency analysis provided by an embodiment of the present invention;
Fig. 5 is an implementation flowchart, provided by an embodiment of the present invention, of filtering the first sentences in the positive example bag according to the filtering algorithm based on preset relation indicator words to obtain training positive example data;
Fig. 6 shows, for an embodiment of the present invention, the relationship between the distance between the entities of a pair and the number of relation triples;
Fig. 7 is an implementation flowchart, provided by an embodiment of the present invention, of performing feature extraction on the training positive example data and the second sentences in the negative example bag to obtain the multi-factor feature vectors of the second sentences;
Fig. 8 is a structural schematic diagram of a character relation extraction device provided by an embodiment of the present invention;
Fig. 9 is a structural schematic diagram of the filtering unit provided by an embodiment of the present invention;
Fig. 10 is a structural schematic diagram of the extraction unit provided by an embodiment of the present invention.
Detailed description of the embodiments
To make the technical problem solved, the technical solution adopted, and the technical effect achieved by the invention clearer, the invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here only explain the invention and do not limit it.
Before the specific embodiments are introduced, the theory behind the main method of the invention is explained:
Distant supervision was first proposed to solve problems in bioinformatics; Mintz et al. later applied it to the relation extraction task for the first time. The idea is to use a given knowledge base D and a corpus C. D contains a large number of relation triples (e1, r, e2), where r ∈ R (r is a relation type and R is the set of relation types) and (e1, e2) is an entity pair having relation r. Distant supervision aligns D with C: any sentence in C that contains the entity pair of a relation triple in D is regarded as expressing the relation fact that holds between that entity pair in D.
For example, Mintz et al. used Freebase as the structured knowledge base. For each entity pair in a Freebase relation instance, they found all sentences in Wikipedia that contain the pair and extracted the corresponding text features to train a relation classifier. However, this distant supervision assumption is too strong and introduces a large amount of noise into the training data. Subsequent work relaxed the assumption through multiple instance learning (Multiple Instance Learning, MIL).
In supervised learning the input is a set of individually labeled instances, whereas in MIL the input is a set of labeled "bags", each containing many instances. A bag is labeled negative when all of its instances are negative, and positive when it contains at least one positive instance. Given a set of labeled bags, the classifier can learn (1) to induce a concept that labels individual instances correctly, or (2) to label a bag directly without such induction. In the distant supervision relation extraction model guided by multiple instance learning, it is assumed that among all co-occurrence sentences of an entity pair (e1, e2), at least one expresses the relation. All co-occurrence sentences of the same entity pair are therefore put into the same bag; if the bag contains at least one positive instance, it is labeled as a positive example of the relation. A bag labeled negative contains only sentences that do not express the relation, and the sentences (instances) in negative bags are used mainly to let the relation model distinguish positive from negative examples during training. By learning the complementary information among the sentences of the same positive bag during training, multiple instance learning alleviates, to some extent, the labeling errors caused by the noise of distant supervision.
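The bag labeling rule described above can be sketched in a few lines of Python. This is only an illustration of the MIL rule; the function name `bag_label` and the boolean encoding of instance labels are assumptions made for the example, not part of the patent.

```python
from typing import Sequence


def bag_label(instance_labels: Sequence[bool]) -> bool:
    """Return True for a positive bag, False for a negative bag.

    Hypothetical helper: a bag is positive when at least one of its
    instances is positive, and negative when every instance is negative.
    """
    return any(instance_labels)


# One sentence in the bag expresses the relation, so the bag is positive.
mixed_bag = bag_label([False, True, False])
# No sentence expresses the relation, so the bag is negative.
all_negative_bag = bag_label([False, False])
```

In practice the instance labels are not observed directly; the sketch only shows how a bag label follows from them under the MIL assumption.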
Fig. 2 shows the implementation flow of a character relation extraction method provided by an embodiment of the present invention, comprising:
In step S101, a weakly labeled data set containing person pairs is generated by aligning a knowledge base with the natural language text data in a corpus.
In embodiments of the present invention, the knowledge base includes Freebase, DBpedia, YAGO, online encyclopedias, and the like; the knowledge base stores person pairs, person entity names, the relations of person pairs, and so on.
As an embodiment of the present invention, the corpus includes text data, and the text data includes text, symbols, pictures, and the like; for example, news reports, articles, and papers are all text data.
The natural language text data may be English, Chinese, French, or Russian text data, any other text data, or a combination thereof, and is not specifically limited.
In embodiments of the present invention, the weakly labeled data set includes positive example data and negative example data.
In one embodiment of the invention, two person entities form a person pair; for example, Yao Ming and Ye Li are a person pair, and Lin Zhiying and kimi are a person pair.
In the knowledge base, a person pair is generally represented by a triple comprising the names of the two entities and the relation of the pair, for example (Yao Ming, wife, Ye Li), (Yao Ming, husband, Ye Li), (Yao Ming, husband and wife, Ye Li), (Lin Zhiying, father and son, kimi), (Lin Zhiying, son, kimi), (Lin Zhiying, father, kimi), and so on.
In this embodiment, a weakly labeled data set of 104,593 sentences is constructed; 80% of the weakly labeled data (83,675 sentences) is used as training data and the remaining 20% (20,919 sentences) as test data. Five common character relations are selected for the experiment: father and son, mother and son, brothers, sisters, and husband and wife. Table 1 shows the data distribution of the weakly labeled data set:
Table 1: Distribution of the weakly labeled data set
Through step S101, the natural language text data in the corpus is matched against the data stored in the knowledge base to generate a weakly labeled data set containing person pairs. Example 1: a sentence in a news report reads "Yao Ming led everyone to the press conference, and Ye Li and her daughter arrived shortly afterwards". Matching this sentence against the Freebase database yields three person pairs and their three corresponding relations. The pairs are: pair A, Yao Ming and Ye Li; pair B, Ye Li and the daughter; pair C, Yao Ming and the daughter. The corresponding relations are husband and wife, mother and daughter, and father and daughter.
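The alignment in step S101 can be illustrated with a minimal Python sketch. It assumes a toy in-memory knowledge base and plain substring matching of entity names; the names `KB` and `align`, the tuple layout, and the matching strategy are all hypothetical, and a real system would rely on word segmentation and entity recognition instead.

```python
# Hypothetical toy knowledge base: (entity 1, entity 2) -> relation.
KB = {
    ("Yao Ming", "Ye Li"): "husband and wife",
    ("Lin Zhiying", "kimi"): "father and son",
}


def align(sentences, kb):
    """Weakly label each sentence that mentions both entities of a KB pair.

    Returns tuples of the form (sentence, e1, relation, e2).
    Substring matching is an assumption made for this sketch.
    """
    weak_labels = []
    for sentence in sentences:
        for (e1, e2), relation in kb.items():
            if e1 in sentence and e2 in sentence:
                weak_labels.append((sentence, e1, relation, e2))
    return weak_labels


corpus = [
    "Yao Ming and his wife Ye Li attended the ribbon-cutting ceremony.",
    "Yao Ming led everyone to the press conference.",
]
weak_data = align(corpus, KB)
```

Only the first sentence mentions both entities of a stored pair, so it is the only one that receives a weak label here.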
In step S102, the first sentences in the weakly labeled data set that belong to the same person pair are marked as a positive example bag for the relation of that person pair.
In one embodiment of the invention, a positive example bag is a bag containing all the sentences that describe the same person pair.
For example, to extract the husband-and-wife relation, pair A in Example 1 (Yao Ming and Ye Li), whose relation is husband and wife, belongs to a positive example bag, while pairs B and C and their corresponding relations form negative example bags.
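A minimal sketch of this grouping step follows. It assumes weakly labeled records of the form (sentence, e1, relation, e2); the function name `build_bags` and the record layout are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def build_bags(weak_data, target_relation):
    """Group sentences by person pair, then split the bags by relation.

    Bags whose pair carries `target_relation` are positive example bags;
    all other bags are treated as negative example bags.
    """
    bags = defaultdict(list)
    relation_of = {}
    for sentence, e1, relation, e2 in weak_data:
        bags[(e1, e2)].append(sentence)
        relation_of[(e1, e2)] = relation
    positive = {p: s for p, s in bags.items() if relation_of[p] == target_relation}
    negative = {p: s for p, s in bags.items() if relation_of[p] != target_relation}
    return positive, negative


s = "Yao Ming led everyone to the press conference, and Ye Li and her daughter arrived shortly afterwards"
weak = [
    (s, "Yao Ming", "husband and wife", "Ye Li"),
    (s, "Ye Li", "mother and daughter", "daughter"),
    (s, "Yao Ming", "father and daughter", "daughter"),
]
positive_bags, negative_bags = build_bags(weak, "husband and wife")
```

With the three pairs of Example 1 and the target relation husband and wife, pair A ends up in the positive bags and pairs B and C in the negative bags.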
In embodiments of the present invention, the first sentences comprise at least one sentence, and generally several; the exact number depends on the actual data.
In step S103, the first sentences in the positive example bag are filtered according to a filtering algorithm based on preset relation indicator words, to obtain training positive example data.
In embodiments of the present invention, the preset relation indicator words include husband and wife, husband, wife, father and son, father, son, mother and son, mother, father and daughter, daughter, brothers, elder brother, younger brother, sisters, elder sister, younger sister, brother and sister, elder sister and younger brother, friend, colleague, and so on, without specific limitation. Together, these preset relation indicator words constitute the preset relation dictionary.
Training positive example data refers to the data retained in the positive example bag, generally sentences or words.
For example, in Example 2, the sentence "Yao Ming led everyone to the press conference, and Ye Li arrived shortly afterwards." does not semantically express the "husband and wife" relation between them. A sentence like this, which contains the entity pair but from which no relation feature can be extracted, is noise and should be filtered out. In Example 3, by contrast, the sentence "Yao Ming and his wife Ye Li attended the ribbon-cutting ceremony of the Hope Primary School" contains the word "wife", which is in the preset relation dictionary and indicates the husband-and-wife relation between the person entities Yao Ming and Ye Li, so the sentence is retained as training positive example data.
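The effect of Examples 2 and 3 can be reproduced with a small Python sketch. It assumes that a sentence is kept whenever it contains any indicator word of the target relation; the actual filtering algorithm of the embodiment may apply further conditions. The dictionary excerpt `RELATION_DICT`, the function `filter_bag`, and the regex-based tokenization of English text are assumptions for illustration (Chinese text would need word segmentation instead).

```python
import re

# Hypothetical excerpt of the preset relation dictionary.
RELATION_DICT = {"husband and wife": {"husband", "wife"}}


def filter_bag(sentences, relation, relation_dict=RELATION_DICT):
    """Keep only the sentences that contain an indicator word for `relation`."""
    indicator_words = relation_dict[relation]
    kept = []
    for sentence in sentences:
        tokens = set(re.findall(r"\w+", sentence.lower()))
        if tokens & indicator_words:
            kept.append(sentence)
    return kept


example_2 = "Yao Ming led everyone to the press conference, and Ye Li arrived shortly afterwards."
example_3 = "Yao Ming and his wife Ye Li attended the ribbon-cutting ceremony of the Hope Primary School"
training_positives = filter_bag([example_2, example_3], "husband and wife")
```

Example 2 contains no indicator word and is dropped as noise, while Example 3 contains "wife" and is kept.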
It will be appreciated that, in embodiments of the present invention, using the filtering algorithm to remove the data that does not match the preset relation dictionary, i.e. the noise, effectively improves the accuracy of character relation extraction. Fig. 3 compares the F1 values of the extraction results obtained with noisy and with denoised data for person pairs of different relations (the horizontal axis shows the five relation types listed in Table 1, the vertical axis is the F1 value, and AVG is the mean over the five relations).
Fig. 3 shows that the F1 value of the extraction results obtained from the first data, with the noise removed, is significantly higher, so the accuracy of character relation extraction is effectively improved.
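For reference, the F1 value is the harmonic mean of precision and recall. The sketch below computes it from counts of true positives, false positives, and false negatives; the function name and the counts used in the example are hypothetical and are not results from the patent.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Compute F1 from true positives, false positives, and false negatives."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: precision = 0.8 and recall = 0.8, so F1 is about 0.8.
example_f1 = f1_score(tp=8, fp=2, fn=2)
```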
In step S104, feature extraction is performed on the training positive example data and on the second sentences in the negative example bag, to obtain the multi-factor feature vectors of the second sentences.
In embodiments of the present invention, the multi-factor feature vector of a second sentence comprises a lexical factor vector and a syntactic factor vector. Performing feature extraction on a second sentence is the process of converting it into a multi-factor feature vector.
The lexical factor vector further comprises a distance feature, a relative position feature, and a part-of-speech feature. The distance feature is the word distance between the two person entities in the sentence, generally abbreviated dis; the relative position feature is the order in which the person entities appear in the sentence, generally denoted order; the part-of-speech feature is the number of verbs and nouns in the sentence after word segmentation, generally denoted vn.
Feature and entity context feature.Wherein, syntax dependence feature refers to everyone object entity sentence affiliated in sentence
Method relationship interdependent value reflects the relationship between people entities, is generally indicated with parsing-r;Entity refers to the name of the personage in sentence
Claim;Core predicate refers to the predicate verb that sentence core is embodied in sentence, the distance between body and core predicate feature, generally with p-
dis;Entity context feature refers to the first two words of people entities and the weight of latter two word, the then zero padding of the situation less than two
Processing, is generally indicated with context.
The 15 dependency relations defined by the Language Technology Platform cloud (LTP-Cloud) of Harbin Institute of Technology are used, in the following order: head (core) relation, punctuation, independent structure, right adjunct, left adjunct, preposition-object, coordination, verb-complement, adverbial-head, attribute-head, pivotal construction, fronted object, indirect object, verb-object, and subject-verb; they are assigned the values 0 to 14 in that order.
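The assignment of the values 0 to 14 can be written as a lookup table. The sketch below assumes that the fifteen relations listed above correspond, in the same order, to the dependency tag abbreviations used by LTP (HED, WP, IS, RAD, LAD, POB, COO, CMP, ADV, ATT, DBL, FOB, IOB, VOB, SBV); that correspondence and the names `DEP_TAGS` and `DEP_VALUE` are assumptions made for illustration.

```python
# Assumed LTP tag abbreviations, in the order the relations are listed in the text.
DEP_TAGS = [
    "HED",  # head (core) relation
    "WP",   # punctuation
    "IS",   # independent structure
    "RAD",  # right adjunct
    "LAD",  # left adjunct
    "POB",  # preposition-object
    "COO",  # coordination
    "CMP",  # verb-complement
    "ADV",  # adverbial-head
    "ATT",  # attribute-head
    "DBL",  # pivotal construction
    "FOB",  # fronted object
    "IOB",  # indirect object
    "VOB",  # verb-object
    "SBV",  # subject-verb
]

# Map each tag to its integer value, 0 through 14.
DEP_VALUE = {tag: value for value, tag in enumerate(DEP_TAGS)}

subject_verb_value = DEP_VALUE["SBV"]
```

A parsing-r feature for an entity would then be the integer looked up for the dependency tag that the parser assigns to that entity.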
As an embodiment of the present invention, the syntactic structure of a sentence describes the phrase structure and dependency structure of the sentence and the functions of those structures.
Referring to Fig. 4, take the sentence "This is the eldest daughter of Huang Lei and Sun Li, just over 11 years old, whose full name is Huang Yici." The person entity "Huang Lei" has an attribute-head relation with the relation word "daughter", the relation word "daughter" has a subject-verb relation with the core predicate "is named", and the core predicate "is named" has a verb-object relation with the person entity "Huang Yici". Syntactic dependency analysis of this kind shows that the person entities "Huang Lei" and "Huang Yici" both depend on the relation word "daughter". Further, through the coordination relation between "Huang Lei" and "Sun Li", the dependency between the person entities "Sun Li" and "Huang Yici" through the relation word "daughter" can also be obtained. The core predicate thus plays a key role in determining entity boundaries and carrying entity relations, so in natural language text the distance between an entity and the core predicate is also an implicit relation feature between entities.
In embodiments of the present invention, the second sentences comprise at least one sentence, and generally several; the exact number depends on the actual data. It will be appreciated that the second sentences generally contain more data than the first sentences.
In step S105, the multi-factor feature vectors are input into a relation classifier to obtain the relation classification result for the person pair.
In embodiments of the present invention, the relation classifier includes a husband-and-wife classifier, a father-and-son classifier, a mother-and-daughter classifier, a brothers classifier, a sisters classifier, and so on.
In embodiments of the present invention, the relation classification result for a person pair is generally represented as a person triple, such as (Yao Ming, husband and wife, Ye Li), (Lin Zhiying, father and son, kimi), (Jia Jingwen, mother and daughter, Bu Bu), (Bu Bu, sisters, Bo Niu), (Jiang Wen, brothers, Jiang Wu), (Xiao Ming, friend, Xiao Hong), (Lan Lan, colleague, Xiao Huang), and so on.
The character relation extraction method provided by the embodiment of the present invention generates a weakly labeled data set containing person pairs by aligning a knowledge base with the natural language text data in a corpus, marks the first sentences in that data set that belong to the same person pair as a positive example bag, and then filters the marked positive example bag with a filtering algorithm based on preset relation indicator words to obtain more accurate training positive example data. A large, high-quality supervised training data set can thus be obtained without manual involvement. For feature selection during training, the lexical features of the natural language text are combined with the syntactic features produced by syntactic dependency analysis, and the combined multi-factor feature vectors are used to train a character relation classifier that classifies character relations. This effectively improves the accuracy of character relation extraction, requires no complex hand-designed templates, adapts to extraction tasks for new relation types, and has a wider range of application.
In an embodiment of the present invention, the person pairs include multiple pairs, and the above step S102 comprises:
classifying the first sentences in the weak label data set that belong to the same person pair into one positive example packet;
marking the multiple positive example packets respectively.
In an embodiment of the present invention, the natural language text data obtained from the corpus may contain multiple person pairs; in that case, the sentences are classified by person pair, and the positive example packet of each person pair is marked separately.
When there are multiple person pairs, the character relation extraction method provided by the embodiment of the present invention first classifies the first sentences belonging to the same person pair into one positive example packet and then marks the multiple positive example packets respectively, so that when multiple person pairs exist in the corpus, the relations of multiple person pairs can be extracted at the same time, which makes the character relation extraction method more intelligent.
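The grouping described for step S102 can be sketched as follows. This is a minimal illustration, not the patented implementation: the tuple layout `(person_a, person_b, sentence)` and the function name `build_positive_packets` are hypothetical, and treating a pair as unordered is an assumption.

```python
from collections import defaultdict

def build_positive_packets(weak_labeled):
    """Group sentences by person pair; each group is one positive example packet.

    weak_labeled: iterable of (person_a, person_b, sentence) tuples
    (hypothetical layout chosen for this sketch).
    """
    packets = defaultdict(list)
    for person_a, person_b, sentence in weak_labeled:
        # Assumption: the pair is unordered, so (A, B) and (B, A) share a packet.
        packets[frozenset((person_a, person_b))].append(sentence)
    return dict(packets)

data = [
    ("Yao Ming", "Ye Li", "s1"),
    ("Ye Li", "Yao Ming", "s2"),
    ("Lin Zhiying", "kimi", "s3"),
]
packets = build_positive_packets(data)
```

Each key of `packets` identifies one person pair, and its value is the list of first sentences that would be marked together.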
Referring to Fig. 5, as an embodiment of the present invention, the above step S103 specifically includes:
In step S1031, the weight of each word obtained by segmenting the first sentence in the positive example packet is calculated by a preset formula, as follows:
$$TI(w) = tf(w, s) \times idf(w, S)$$
$$tf(w, s) = \frac{n_{w,j}}{\sum_{k} n_{k,j}}, \qquad idf(w, S) = \log \frac{|S|}{|\{ j : w \in s_j,\ s_j \in S \}|}$$
where TI(w) denotes the weight of any word w obtained by segmenting a first sentence in the positive example packet,
tf(w, s) denotes the normalized term frequency of word w in sentence s,
idf(w, S) denotes the inverse document frequency of word w in corpus S,
$n_{w,j}$ is the number of times the word occurs in sentence s,
$\sum_{k} n_{k,j}$ is the total number of occurrences of all words in sentence s,
|S| is the total number of first sentences in the weak label positive example data of one relation example packet,
$|\{ j : w \in s_j,\ s_j \in S \}|$ is the number of sentences in corpus S that contain word w.
In an embodiment of the present invention, word segmentation refers to splitting the first sentence into words. For example, the sentence in Example 3 can be split into: Yao Ming, with, wife, Ye Li, attended, Hope Primary School, ribbon-cutting ceremony, and the sentence-final period. It should be noted that punctuation marks in the sentence are also split off as separate words.
In step S1032, the three words with the highest weights in the first sentence are selected, and it is judged whether at least one of these three words is present in the preset relation-word dictionary; when the judgment result is yes, step S1033 is executed; when the judgment result is no, step S1034 is executed and the first sentence is deleted.
In step S1033, the first sentence is retained as a positive example sentence.
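Steps S1031 to S1034 can be sketched as below. This is a sketch under stated assumptions: the weight is read as the standard TF-IDF product suggested by the symbol definitions, the natural logarithm is assumed, and the names `tfidf_weights` and `filter_packet` as well as the toy packet are hypothetical.

```python
import math
from collections import Counter

def tfidf_weights(sentence, packet):
    """TI(w) = tf(w, s) * idf(w, S) for every word of one segmented sentence."""
    counts = Counter(sentence)
    total = sum(counts.values())
    weights = {}
    for word, n in counts.items():
        # Number of sentences in the packet that contain the word (at least 1).
        containing = sum(1 for s in packet if word in s)
        weights[word] = (n / total) * math.log(len(packet) / containing)
    return weights

def filter_packet(packet, relation_words, top_k=3):
    """Keep a sentence if any of its top_k weighted words is a relation word."""
    kept = []
    for sentence in packet:
        weights = tfidf_weights(sentence, packet)
        top = sorted(weights, key=weights.get, reverse=True)[:top_k]
        if any(word in relation_words for word in top):  # S1032 -> S1033
            kept.append(sentence)
        # Otherwise the sentence is dropped (S1034).
    return kept

packet = [
    ["Yao Ming", "with", "wife", "Ye Li", "attended", "ceremony"],
    ["Yao Ming", "led", "everyone", "Ye Li", "appeared"],
    ["Yao Ming", "Ye Li", "attended", "dinner"],
]
kept = filter_packet(packet, {"wife", "husband"})
```

In this toy packet only the first sentence survives, because "wife" is among its three highest-weighted words; the entity names score zero since they occur in every sentence of the packet.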
In practical application, for example, after the sentence of Example 3 is segmented, the weight of each word (including punctuation marks) calculated by the above formula is, in order:
Weight (Yao Ming)=0.027,
Weight (with)=0.002,
Weight (wife)=0.408,
Weight (Ye Li)=0.018,
Weight ()=0,
Weight (attending)=0.029,
Weight ()=0,
Weight (Hope Primary School)=0.031,
Weight ()=0,
Weight (ribbon-cutting ceremony)=0.012,
Weight (.)=0. According to these weights, the three highest-weighted words are "wife", "Hope Primary School" and "attended"; because "wife" is present in the preset relation-word dictionary, the sentence of Example 3, "Yao Ming attended the ribbon-cutting ceremony of Hope Primary School with his wife Ye Li", is retained as a positive example sentence.
The character relation extraction method provided by the embodiment of the present invention further filters the first sentences by the preset formula, which improves the accuracy of character relation extraction.
Referring to Fig. 7, in an embodiment of the present invention, the above step S104 specifically includes:
In step S1041, a lexical factor vector is constructed from the word structure features of the second sentence;
In step S1042, a syntactic factor vector is constructed from the semantic relation features of the second sentence;
In step S1043, the lexical factor vector and the syntactic factor vector are merged to obtain the multi-factor feature vector of the second sentence.
Referring to Fig. 6, a study of the quantitative relationship between the distance within a large number of entity pairs and the number of relation triples shows that the point (5, 0.7923) indicates that relation instances in which the word distance between the two entities is less than or equal to 5 account for 79.23% of all relation triples. That is, in the initial stage the number of relation triples increases sharply as the word distance increases; but once the word distance between the two entities exceeds 5, the number of relation triples grows more and more slowly as the word distance increases. This also shows that two entities that are closer together are more likely to have an entity relation.
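A curve of the kind described for Fig. 6 is a cumulative share of relation instances by word distance. The sketch below shows how such a curve could be computed; the function name `cumulative_share` is hypothetical and the sample distances are made up for illustration, so they do not reproduce the 79.23% figure.

```python
from collections import Counter

def cumulative_share(distances, max_distance):
    """Return [(d, share of instances with word distance <= d)] for d = 0..max_distance."""
    counts = Counter(distances)
    total = len(distances)
    running, curve = 0, []
    for d in range(max_distance + 1):
        running += counts.get(d, 0)
        curve.append((d, running / total))
    return curve

# Made-up word distances between the two entities of ten relation instances.
sample = [0, 1, 1, 2, 3, 3, 4, 5, 8, 12]
curve = cumulative_share(sample, 6)
```

Here `curve[5]` is `(5, 0.8)`, and the share no longer grows at distance 6, which mirrors the flattening described above.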
In another experiment building on the experiment of Fig. 6, the relative position feature between entities, the part-of-speech feature, the syntactic dependency feature, the feature of the distance between entity and core predicate, and the entity context feature were added cumulatively for entities of different relations, and the average performance over the different relation entity pairs was obtained; see Table 2. As can be seen from Table 2, as features are added, the precision (P), recall (R) and F1 value of character relation extraction become higher. (Recall is the ratio of the number of relevant documents retrieved to the number of all relevant documents in the document collection; it measures the completeness of a retrieval system.)
Table 2 Average performance over different relation entity pairs
As a practical application of the present invention, in Example 3 above:
distance feature dis = 2;
relative position feature order = 1;
part-of-speech feature vn = 5;
syntactic dependency feature parsing-r = (9, 14);
entity-to-core-predicate distance feature p-dis = (5, 2);
entity context feature context = (0, 0, 0.002, 0.408, 0.002, 0.408, 0, 0.272).
Therefore the multi-factor feature vector of the sentence "Yao Ming attended the ribbon-cutting ceremony of Hope Primary School with his wife Ye Li" is:
(2, 1, 5, 9, 14, 5, 2, 0, 0, 0.002, 0.408, 0.002, 0.408, 0, 0.272)
The multi-factor feature vector obtained above is input into the husband-wife relation classifier. Because the vector contains all the factor features of the sentence, the relation of the person pair Yao Ming and Ye Li can be accurately extracted as husband-wife, and the extraction result is more precise. When the relation of a new person pair needs to be extracted, the noise data can first be filtered out, the sentence converted into a multi-factor feature vector, and the vector input into the relation classifier to extract the new character relation without any other operation; the method is therefore suitable for extraction tasks on new relation types and has a wider range of application.
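The merge of step S1043 amounts to concatenating the lexical factors with the syntactic factors. A minimal sketch using the feature values quoted for Example 3 follows; the function name `multi_factor_vector` and the argument order are assumptions made for illustration.

```python
def multi_factor_vector(dis, order, vn, parsing_r, p_dis, context):
    """Concatenate the lexical factor vector and the syntactic factor vector."""
    lexical = [dis, order, vn]                       # distance, relative position, part of speech
    syntactic = [*parsing_r, *p_dis, *context]       # dependency, predicate distance, context
    return lexical + syntactic

vec = multi_factor_vector(
    dis=2, order=1, vn=5,
    parsing_r=(9, 14),
    p_dis=(5, 2),
    context=(0, 0, 0.002, 0.408, 0.002, 0.408, 0, 0.272),
)
```

The result has 15 components, in the same order as the vector given for Example 3.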
The character relation extraction method provided by the embodiment of the present invention generates a weak label data set containing person pairs by aligning a knowledge base with the natural language text data in a corpus, marks the first sentences in the weak label data set that belong to the same person pair as one positive example packet, and then filters the marked positive example packet with a filtering algorithm based on preset relation indicator words to obtain more accurate training positive example data, so that a large, high-quality data set for supervised training can be obtained without manual participation. In feature selection during training, the lexical features of the natural language text and the syntactic features generated by syntactic dependency analysis are considered jointly, and a character relation classifier is trained on the combined multi-factor feature vectors to classify character relations. This effectively improves the accuracy of character relation extraction, requires no manually designed complex templates, is suitable for extraction tasks on new relation types, and has a wider range of application.
Fig. 8 shows a schematic structural diagram of a character relation extraction device 200 provided by an embodiment of the present invention; for ease of description, only the parts relevant to the embodiment of the present invention are shown. The character relation extraction device 200 comprises:
a data generating unit 210, configured to generate a weak label data set containing person pairs by aligning a knowledge base with the natural language text data in a corpus.
In an embodiment of the present invention, the knowledge base includes Freebase, DBpedia, YAGO, online encyclopedia knowledge bases and the like; person pairs, person entity names, person-pair relations and the like are pre-stored in the knowledge base.
As an embodiment of the present invention, the corpus includes text data, and the text data includes text, symbols, pictures and the like; for example, news reports, articles and papers all belong to text data.
The natural language text data includes any one or a combination of English text data, Chinese text data, French text data, Russian text data and the like, without specific limitation.
In an embodiment of the present invention, the weak label data set includes positive example data and negative example data.
In one embodiment of the present invention, two person entities form a person pair; for example, Yao Ming and Ye Li are a person pair, and Lin Zhiying and kimi are a person pair.
In the knowledge base, a person pair is generally represented by a triple, which contains the names of the two entities of the pair and the relation of the pair, such as (Yao Ming, wife, Ye Li), (Yao Ming, husband, Ye Li), (Yao Ming, husband-wife, Ye Li), (Lin Zhiying, father-son, kimi), (Lin Zhiying, son, kimi), (Lin Zhiying, father, kimi), etc.
In the embodiment of the present invention, a weak label data set containing 104,593 sentences was constructed, of which 80% of the weak label data (83675 sentences) is used as training data and the remaining 20% (20919 sentences) as test data. Five common character relations were selected for the experiment: father-son, mother-son, brothers, sisters, and husband-wife. Table 1 shows the data distribution of the weak label data set:
Table 1 Distribution of the weak label data set
Through the above data generating unit 210, the natural language text data in the corpus is matched against the data pre-stored in the knowledge base to generate a weak label data set containing person pairs. Example 1: a sentence in a news report reads "Yao Ming led everyone to the press conference, and Ye Li and her daughter also appeared at the scene". Matching this sentence against the Freebase database yields three person pairs and three corresponding character relations. The three person pairs are: person pair A, Yao Ming and Ye Li; person pair B, Ye Li and the daughter; person pair C, Yao Ming and the daughter. The corresponding character relations are husband-wife, mother-daughter and father-daughter, respectively.
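The alignment performed by the data generating unit 210 can be sketched as a lookup of co-occurring entities in the knowledge base. The sketch below is illustrative only: the in-memory dictionary stands in for a real knowledge base such as Freebase, naive substring matching stands in for entity recognition, and the name `align` is hypothetical.

```python
from itertools import combinations

def align(sentences, kb):
    """Weakly label sentences: kb maps frozenset({a, b}) to a relation name.

    Returns (a, relation, b, sentence) tuples for every known pair that
    co-occurs in a sentence.
    """
    entities = {e for pair in kb for e in pair}
    labeled = []
    for sentence in sentences:
        # Assumption: an entity is "mentioned" if its name is a substring.
        mentioned = sorted(e for e in entities if e in sentence)
        for a, b in combinations(mentioned, 2):
            relation = kb.get(frozenset((a, b)))
            if relation is not None:
                labeled.append((a, relation, b, sentence))
    return labeled

kb = {
    frozenset(("Yao Ming", "Ye Li")): "husband-wife",
    frozenset(("Lin Zhiying", "kimi")): "father-son",
}
sentences = [
    "Yao Ming led everyone to the press conference, and Ye Li also appeared at the scene",
    "Lin Zhiying went hiking",
]
weak_labels = align(sentences, kb)
```

The first sentence receives the weak label husband-wife even though it does not express that relation, which is exactly the kind of noise the later filtering step is meant to remove.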
a marking unit 220, configured to mark the first sentences in the weak label data set that belong to the same person pair as a positive example packet of the relation of that person pair.
In one embodiment of the present invention, a positive example packet refers to a data packet containing all sentences that describe the same person pair.
For example, to extract the husband-wife relation, person pair A (Yao Ming and Ye Li) in Example 1 above, whose corresponding character relation is husband-wife, belongs to the positive example packet, while person pairs B and C and their corresponding character relations belong to the negative example packet.
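The division into positive and negative example packets for one target relation can be sketched as follows, using the three person pairs of Example 1. The data layout and the name `split_packets` are hypothetical.

```python
def split_packets(packets, target_relation):
    """Split packets into positive and negative ones for a target relation.

    packets: {(person_a, person_b): (relation, [sentences])} (hypothetical layout).
    """
    positive, negative = {}, {}
    for pair, (relation, sentences) in packets.items():
        bucket = positive if relation == target_relation else negative
        bucket[pair] = sentences
    return positive, negative

packets = {
    ("Yao Ming", "Ye Li"): ("husband-wife", ["s1"]),
    ("Ye Li", "daughter"): ("mother-daughter", ["s1"]),
    ("Yao Ming", "daughter"): ("father-daughter", ["s1"]),
}
positive, negative = split_packets(packets, "husband-wife")
```

With "husband-wife" as the target, person pair A lands in `positive` and person pairs B and C land in `negative`.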
In an embodiment of the present invention, the first sentence includes at least one sentence and generally multiple sentences; the specific number is determined according to actual conditions.
a filter unit 230, configured to filter the first sentences in the positive example packet according to a filtering algorithm based on preset relation indicator words, to obtain training positive example data.
In an embodiment of the present invention, the preset relation indicator words include husband-wife, husband, wife, father-son, father, son, mother-son, mother, father-daughter, daughter, brothers, elder brother, younger brother, sisters, elder sister, younger sister, brother and sister, elder sister and younger brother, friend, colleague and the like, without specific limitation. All these preset relation indicator words constitute the preset relation-word dictionary.
Training positive example data denotes the data in the positive example packet, generally sentences or words.
For example, in Example 2, the sentence "Yao Ming led everyone to the press conference, and Ye Li also appeared at the scene." cannot semantically express the fact of a husband-wife relation between the two. A sentence like this, which contains the entity pair but from which no relation feature can be extracted, is noise data and should be filtered out. By contrast, in Example 3, the sentence "Yao Ming attended the ribbon-cutting ceremony of Hope Primary School with his wife Ye Li" contains the word "wife", which is in the preset relation-word dictionary and indicates the husband-wife relation between the person entities Yao Ming and Ye Li, so the sentence is kept as training positive example data.
It can be understood that, in an embodiment of the present invention, filtering out by the filtering algorithm the data that contains no word of the preset relation-word dictionary, i.e. the noise data, can effectively improve the accuracy of character relation extraction. Fig. 3 shows the F1 values of the extraction results obtained for different person-pair relations with and without noise data (the abscissa shows the five relation types in Table 1 above, AVG is the mean of the five relation results, and the ordinate is the F1 value).
From Fig. 3 it can be determined that the F1 value of the extraction result obtained from data with the noise removed is significantly improved, and the accuracy of character relation extraction is effectively improved.
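The precision, recall and F1 values used in these comparisons follow the standard definitions. A minimal sketch over sets of extracted triples is given below; the function name and the sample triples are hypothetical and are not the experimental data of the patent.

```python
def precision_recall_f1(predicted, gold):
    """Standard precision, recall and F1 over two collections of triples."""
    predicted, gold = set(predicted), set(gold)
    true_positive = len(predicted & gold)
    p = true_positive / len(predicted) if predicted else 0.0
    r = true_positive / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

gold = {
    ("Yao Ming", "husband-wife", "Ye Li"),
    ("Lin Zhiying", "father-son", "kimi"),
    ("Jiang Wen", "brothers", "Jiang Wu"),
    ("Jia Jingwen", "mother-daughter", "Bu Bu"),
}
predicted = {
    ("Yao Ming", "husband-wife", "Ye Li"),
    ("Lin Zhiying", "father-son", "kimi"),
    ("Jiang Wen", "brothers", "Jiang Wu"),
    ("Xiao Ming", "husband-wife", "Xiao Hong"),  # a wrong extraction
}
p, r, f1 = precision_recall_f1(predicted, gold)
```

With three of four predictions correct and three of four gold triples found, all three measures equal 0.75.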
an extraction unit 240, configured to perform feature extraction on the training positive example data and the second sentence in the negative example packet, to obtain the multi-factor feature vector of the second sentence.
The multi-factor feature vector of the second sentence includes a lexical factor vector and a syntactic factor vector;
the lexical factor vector further comprises: a distance feature, a relative position feature and a part-of-speech feature;
the syntactic factor vector further comprises: a syntactic dependency feature, a feature of the distance between entity and core predicate, and an entity context feature.
In an embodiment of the present invention, performing feature extraction on the second sentence is the process of converting the second sentence into a multi-factor feature vector composed of the lexical factor vector and the syntactic factor vector.
Among the components of the lexical factor vector, the distance feature refers to the word distance between the two person entities in the sentence, generally abbreviated as dis; the relative position feature refers to the order in which the person entities appear in the sentence, generally denoted order; the part-of-speech feature refers to the number of verbs and nouns in the sentence after word segmentation, generally denoted vn.
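The three lexical factors can be sketched as below. Several points are assumptions rather than statements of the patent: word distance is counted as the number of words strictly between the two entities, order is encoded as 1 when the first entity precedes the second, and the tags "n" and "v" are hypothetical part-of-speech labels for nouns and verbs. The shortened token list is for illustration and is not meant to reproduce vn = 5.

```python
def lexical_factors(tokens, pos_tags, entity_a, entity_b):
    """Return [dis, order, vn] for one segmented sentence."""
    i, j = tokens.index(entity_a), tokens.index(entity_b)
    dis = abs(i - j) - 1           # words strictly between the two entities (assumed)
    order = 1 if i < j else 0      # assumed encoding of the relative position
    vn = sum(1 for tag in pos_tags if tag in ("n", "v"))  # nouns and verbs
    return [dis, order, vn]

tokens = ["Yao Ming", "with", "wife", "Ye Li", "attended", "ceremony"]
pos_tags = ["nh", "p", "n", "nh", "v", "n"]   # hypothetical tag set
features = lexical_factors(tokens, pos_tags, "Yao Ming", "Ye Li")
```

Under these assumptions the two entities are two words apart and appear in the given order, matching dis = 2 and order = 1.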
Among the components of the syntactic factor vector, the syntactic dependency feature refers to the syntactic dependency value of each person entity in the sentence, which reflects the relation between the person entities and is generally denoted parsing-r; an entity refers to the name of a person in the sentence; the core predicate refers to the predicate verb that embodies the core of the sentence; the feature of the distance between entity and core predicate is generally denoted p-dis; the entity context feature refers to the weights of the two words before and the two words after each person entity, padded with zeros where fewer than two words are available, and is generally denoted context.
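The entity context feature with zero padding can be sketched as follows. Only the opening tokens of Example 3 and their listed weights are used; the function name `context_feature` and the padding positions (front for missing preceding words, back for missing following words) are assumptions.

```python
def context_feature(tokens, weights, entity, window=2):
    """Weights of the `window` words before and after an entity, zero padded."""
    i = tokens.index(entity)
    before = [weights[t] for t in tokens[max(0, i - window):i]]
    after = [weights[t] for t in tokens[i + 1:i + 1 + window]]
    before = [0.0] * (window - len(before)) + before  # pad missing preceding words
    after = after + [0.0] * (window - len(after))     # pad missing following words
    return before + after

tokens = ["Yao Ming", "with", "wife", "Ye Li"]
weights = {"Yao Ming": 0.027, "with": 0.002, "wife": 0.408, "Ye Li": 0.018}
ctx = context_feature(tokens, weights, "Yao Ming")
```

For "Yao Ming", which opens the sentence, the two preceding slots are padded, giving the first four context values quoted for Example 3.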
The 15 dependency relations defined by the Language Technology Platform cloud (LTP-Cloud) of Harbin Institute of Technology are taken in the following order: head (core) relation, punctuation, independent structure, right adjunct, left adjunct, preposition-object, coordinate, complement, adverbial, attribute, pivotal (double) construction, fronting object, indirect object, verb-object, and subject-verb, and are assigned the values 0 to 14 in turn.
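The assignment of values 0 to 14 can be written as a lookup table. The sketch below reads the relation names above as the tag set commonly published for LTP (HED, WP, IS, RAD, LAD, POB, COO, CMP, ADV, ATT, DBL, FOB, IOB, VOB, SBV); that mapping from the translated names to tags is an assumption, and the helper `parsing_r` is hypothetical.

```python
# Dependency tags in the order described in the text; index = assigned value.
LTP_RELATIONS = [
    "HED", "WP", "IS", "RAD", "LAD", "POB", "COO", "CMP",
    "ADV", "ATT", "DBL", "FOB", "IOB", "VOB", "SBV",
]
DEP_VALUE = {tag: value for value, tag in enumerate(LTP_RELATIONS)}

def parsing_r(tags):
    """Map the dependency tags of the person entities to their numeric values."""
    return tuple(DEP_VALUE[tag] for tag in tags)

example = parsing_r(["ATT", "SBV"])
```

Under this reading, an attribute arc and a subject-verb arc map to (9, 14), the values quoted for parsing-r in Example 3.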
As an embodiment of the present invention, the syntactic structure of a sentence describes the phrase structures and dependency structures in the sentence and the functions of those structures.
Referring to Fig. 4, for example, in the sentence "This is the eldest daughter of Huang Lei and Sun Li, who is just over 11 years old and is named Huang Yici.", an attribute relationship exists between the person entity "Huang Lei" and the relation word "daughter", a subject-verb relationship exists between the relation word "daughter" and the core predicate "is named", and a verb-object relationship exists between the core predicate "is named" and the person entity "Huang Yici". Syntactic dependency analysis of this kind shows that the person entities "Huang Lei" and "Huang Yici" both depend on the relation word "daughter". Further, from the coordinate relationship between "Huang Lei" and "Sun Li", a dependency through the relation word "daughter" can also be obtained between the person entities "Sun Li" and "Huang Yici". It can be seen that the core predicate plays a key role in determining entity boundaries and carrying entity relations; therefore, within a natural language sentence, the distance between an entity and the core predicate is also an implicit relation feature between entities.
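The distance between each entity and the core predicate can be sketched as a difference of token positions. This is one possible reading, not a confirmed definition: measuring distance as the absolute index difference is an assumption, and the token "<?>" is a placeholder for a zero-weight token of Example 3 whose surface form is not recoverable from the text.

```python
def predicate_distances(tokens, entities, predicate):
    """Absolute token-index distance from each entity to the core predicate."""
    p = tokens.index(predicate)
    return tuple(abs(tokens.index(entity) - p) for entity in entities)

# Opening tokens of Example 3; "<?>" is a placeholder token (assumption).
tokens = ["Yao Ming", "with", "wife", "Ye Li", "<?>", "attended"]
p_dis = predicate_distances(tokens, ["Yao Ming", "Ye Li"], "attended")
```

With this token order the result is (5, 2), which coincides with the p-dis value quoted for Example 3.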
In an embodiment of the present invention, the second sentence includes at least one sentence and generally multiple sentences; the specific number is determined according to actual conditions. It can be understood that, in general, the second sentence contains more data than the first sentence.
a result acquiring unit 250, configured to input the multi-factor feature vector into a relation classifier to obtain the relation classification result of the person pair.
In an embodiment of the present invention, the relation classifier includes a husband-wife classifier, a father-son classifier, a mother-daughter classifier, a brother classifier, a sister classifier, and so on.
In an embodiment of the present invention, the relation classification result of a person pair is generally represented as a person triple, such as (Yao Ming, husband-wife, Ye Li), (Lin Zhiying, father-son, kimi), (Jia Jingwen, mother-daughter, Bu Bu), (Bu Bu, sisters, Bo Niu), (Jiang Wen, brothers, Jiang Wu), (Xiao Ming, friends, Xiao Hong), (Lan Lan, colleagues, Xiao Huang), etc.
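How a set of per-relation classifiers might turn a feature vector into a triple can be sketched as below. The scoring functions are toy stand-ins for trained classifiers and are purely hypothetical, as are the name `classify_pair` and the threshold; the text does not specify how the classifiers are combined.

```python
def classify_pair(person_a, person_b, vector, classifiers, threshold=0.0):
    """Pick the best-scoring relation and return a triple, or None if none fires.

    classifiers: {relation name: function(vector) -> score} (hypothetical).
    """
    scores = {relation: clf(vector) for relation, clf in classifiers.items()}
    best = max(scores, key=scores.get)
    if scores[best] <= threshold:
        return None
    return (person_a, best, person_b)

# Toy scorers standing in for trained classifiers (not learned from data).
classifiers = {
    "husband-wife": lambda v: v[10] - 0.1,  # reacts to one context weight
    "father-son": lambda v: -1.0,
}
vector = [2, 1, 5, 9, 14, 5, 2, 0, 0, 0.002, 0.408, 0.002, 0.408, 0, 0.272]
triple = classify_pair("Yao Ming", "Ye Li", vector, classifiers)
```

With these toy scorers the result is the triple (Yao Ming, husband-wife, Ye Li).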
The character relation extraction device provided by the embodiment of the present invention generates a weak label data set containing person pairs by aligning a knowledge base with the natural language text data in a corpus, marks the first sentences in the weak label data set that belong to the same person pair as one positive example packet, and then filters the marked positive example packet with a filtering algorithm based on preset relation indicator words to obtain more accurate training positive example data, so that a large, high-quality data set for supervised training can be obtained without manual participation. In feature selection during training, the lexical features of the natural language text and the syntactic features generated by syntactic dependency analysis are considered jointly, and a character relation classifier is trained on the combined multi-factor feature vectors to classify character relations. This effectively improves the accuracy of character relation extraction, requires no manually designed complex templates, is suitable for extraction tasks on new relation types, and has a wider range of application.
In an embodiment of the present invention, the person pairs include multiple pairs, and the above marking unit 220 comprises:
a classifying subunit, configured to classify the first sentences in the weak label data set that belong to the same person pair into one positive example packet;
a marking subunit, configured to mark the multiple positive example packets respectively.
In an embodiment of the present invention, the natural language text data obtained from the corpus may contain multiple person pairs; in that case, the sentences are classified by person pair, and the positive example packet of each person pair is marked separately.
When there are multiple person pairs, the character relation extraction device provided by the embodiment of the present invention first classifies the first sentences belonging to the same person pair into one positive example packet and then marks the multiple positive example packets respectively, so that when multiple person pairs exist in the corpus, the relations of multiple person pairs can be extracted at the same time, which makes character relation extraction more intelligent.
Referring to Fig. 9, the above filter unit 230 specifically includes:
a weight calculating subunit 231, configured to calculate, by a preset formula, the weight of each word obtained by segmenting the first sentence in the positive example packet, the formula being as follows:
$$TI(w) = tf(w, s) \times idf(w, S)$$
$$tf(w, s) = \frac{n_{w,j}}{\sum_{k} n_{k,j}}, \qquad idf(w, S) = \log \frac{|S|}{|\{ j : w \in s_j,\ s_j \in S \}|}$$
where TI(w) denotes the weight of any word w obtained by segmenting a first sentence in the positive example packet,
tf(w, s) denotes the normalized term frequency of word w in sentence s,
idf(w, S) denotes the inverse document frequency of word w in corpus S,
$n_{w,j}$ is the number of times the word occurs in sentence s,
$\sum_{k} n_{k,j}$ is the total number of occurrences of all words in sentence s,
|S| is the total number of first sentences in the weak label positive example data of one relation example packet,
$|\{ j : w \in s_j,\ s_j \in S \}|$ is the number of sentences in corpus S that contain word w.
In an embodiment of the present invention, word segmentation refers to splitting the first sentence into words. For example, the sentence in Example 3 can be split into: Yao Ming, with, wife, Ye Li, attended, Hope Primary School, ribbon-cutting ceremony, and the sentence-final period. It should be noted that punctuation marks in the sentence are also split off as separate words.
a judging subunit 232, configured to select the three words with the highest weights in the first sentence and to judge whether at least one of these three words is present in the preset relation-word dictionary.
a retaining subunit 233, configured to retain the first sentence as a positive example sentence when at least one of the three highest-weighted words is present in the preset relation-word dictionary.
In practical application, for example, after the sentence of Example 3 is segmented, the weight of each word (including punctuation marks) calculated by the above formula is, in order:
Weight (Yao Ming)=0.027,
Weight (with)=0.002,
Weight (wife)=0.408,
Weight (Ye Li)=0.018,
Weight ()=0,
Weight (attending)=0.029,
Weight ()=0,
Weight (Hope Primary School)=0.031,
Weight ()=0,
Weight (ribbon-cutting ceremony)=0.012,
Weight (.)=0. According to these weights, the three highest-weighted words are "wife", "Hope Primary School" and "attended"; because "wife" is present in the preset relation-word dictionary, the sentence of Example 3, "Yao Ming attended the ribbon-cutting ceremony of Hope Primary School with his wife Ye Li", is retained as a positive example sentence.
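Taking the weights listed for Example 3 at face value, the check made by the judging subunit 232 and the retaining subunit 233 can be traced as below. The zero-weight tokens whose surface form is not recoverable are left out (only the period is kept), and the relation-word set is a small hypothetical subset of the preset dictionary.

```python
# Weights as listed for Example 3 (zero-weight tokens other than "." omitted).
weights = {
    "Yao Ming": 0.027,
    "with": 0.002,
    "wife": 0.408,
    "Ye Li": 0.018,
    "attended": 0.029,
    "Hope Primary School": 0.031,
    "ribbon-cutting ceremony": 0.012,
    ".": 0.0,
}
relation_words = {"husband-wife", "husband", "wife"}  # subset, for illustration

# Three highest-weighted words, then the dictionary membership test.
top3 = sorted(weights, key=weights.get, reverse=True)[:3]
retain = any(word in relation_words for word in top3)
```

Sorted by the listed values, the top three are "wife", "Hope Primary School" and "attended"; the sentence is retained because "wife" is a relation word.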
The character relation extraction device provided by the embodiment of the present invention further filters the first sentences by the preset formula, which improves the accuracy of character relation extraction.
Referring to Fig. 10, the above extraction unit 240 comprises:
a first constructing subunit 241, configured to construct a lexical factor vector from the word structure features of the second sentence;
a second constructing subunit 242, configured to construct a syntactic factor vector from the semantic relation features of the second sentence;
an extracting subunit 243, configured to merge the lexical factor vector and the syntactic factor vector to obtain the multi-factor feature vector of the second sentence.
Referring to Fig. 6, a study of the quantitative relationship between the distance within a large number of entity pairs and the number of relation triples shows that the point (5, 0.7923) indicates that relation instances in which the word distance between the two entities is less than or equal to 5 account for 79.23% of all relation triples. That is, in the initial stage the number of relation triples increases sharply as the word distance increases; but once the word distance between the two entities exceeds 5, the number of relation triples grows more and more slowly as the word distance increases. This also shows that two entities that are closer together are more likely to have an entity relation.
In another experiment building on the experiment of Fig. 6, the relative position feature between entities, the part-of-speech feature, the syntactic dependency feature, the feature of the distance between entity and core predicate, and the entity context feature were added cumulatively for entities of different relations, and the average performance over the different relation entity pairs was obtained; see Table 2. As can be seen from Table 2, as features are added, the precision (P), recall (R) and F1 value of character relation extraction become higher. (Recall is the ratio of the number of relevant documents retrieved to the number of all relevant documents in the document collection; it measures the completeness of a retrieval system.)
Table 2 Average performance over different relation entity pairs
As a practical application of the present invention, in Example 3 above:
distance feature dis = 2;
relative position feature order = 1;
part-of-speech feature vn = 5;
syntactic dependency feature parsing-r = (9, 14);
entity-to-core-predicate distance feature p-dis = (5, 2);
entity context feature context = (0, 0, 0.002, 0.408, 0.002, 0.408, 0, 0.272).
Therefore the multi-factor feature vector of the sentence "Yao Ming attended the ribbon-cutting ceremony of Hope Primary School with his wife Ye Li" is:
(2, 1, 5, 9, 14, 5, 2, 0, 0, 0.002, 0.408, 0.002, 0.408, 0, 0.272)
The multi-factor feature vector obtained above is input into the husband-wife relation classifier. Because the vector contains all the factor features of the sentence, the relation of the person pair Yao Ming and Ye Li can be accurately extracted as husband-wife, and the extraction result is more precise. When the relation of a new person pair needs to be extracted, the noise data can first be filtered out, the sentence converted into a multi-factor feature vector, and the vector input into the relation classifier to extract the new character relation without any other operation; the device is therefore suitable for extraction tasks on new relation types and has a wider range of application.
The character relation extraction device provided by the embodiment of the present invention generates a weak label data set containing person pairs by aligning a knowledge base with the natural language text data in a corpus, marks the first sentences in the weak label data set that belong to the same person pair as one positive example packet, and then filters the marked positive example packet with a filtering algorithm based on preset relation indicator words to obtain more accurate training positive example data, so that a large, high-quality data set for supervised training can be obtained without manual participation. In feature selection during training, the lexical features of the natural language text and the syntactic features generated by syntactic dependency analysis are considered jointly, and a character relation classifier is trained on the combined multi-factor feature vectors to classify character relations. This effectively improves the accuracy of character relation extraction, requires no manually designed complex templates, is suitable for extraction tasks on new relation types, and has a wider range of application.
An embodiment of the present invention provides a computer device comprising a processor, the processor being configured to implement the steps of the character relation extraction method provided by the above method embodiments when executing a computer program stored in a memory.
Illustratively, the computer program may be divided into one or more modules, which are stored in the memory and executed by the processor to carry out the present invention. The one or more modules may be a series of computer program instruction segments capable of performing specific functions, the instruction segments describing the execution process of the computer program in the computer device. For example, the computer program may be divided according to the steps of the character relation extraction method provided by the above method embodiments.
Those skilled in the art will understand that the above description of the computer device is only an example and does not constitute a limitation on the computer device, which may include more or fewer components than described above, or combine certain components, or use different components; for example, it may include input/output devices, network access devices, buses and the like.
The processor may be a central processing unit (Central Processing Unit, CPU), or another general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and so on. The general-purpose processor may be a microprocessor or any conventional processor. The processor is the control center of the computer device and connects the various parts of the entire computer device through various interfaces and lines.
The memory may be used to store the computer program and/or modules. The processor implements the various functions of the computer device by running or executing the computer program and/or modules stored in the memory and by calling data stored in the memory. The memory may mainly include a program storage area and a data storage area. The program storage area may store an operating system and the application programs required by at least one function (such as a sound playback function or an image playback function); the data storage area may store data created through use of the mobile phone (such as audio data or a phone book). In addition, the memory may include high-speed random access memory and may also include non-volatile memory, such as a hard disk, internal memory, a plug-in hard disk, a smart media card (SmartMedia Card, SMC), a secure digital (Secure Digital, SD) card, a flash card (Flash Card), at least one magnetic disk storage device, a flash memory device, or another volatile solid-state storage device.
If the integrated modules/units of the computer device are implemented in the form of software functional units and sold or used as independent products, they may be stored in a computer-readable storage medium. Based on this understanding, the present invention may implement all or part of the processes of the above method embodiments by having a computer program instruct the relevant hardware. The computer program may be stored in a computer-readable storage medium and, when executed by a processor, implements the steps of each of the above character relation extraction method embodiments. The computer program includes computer program code, which may be in source code form, object code form, an executable file, or some intermediate form. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disc, computer memory, read-only memory (ROM, Read-Only Memory), random access memory (RAM, Random Access Memory), an electrical carrier signal, an electrical signal, a software distribution medium, and the like.
Note that the above describes only the preferred embodiments of the present invention. Those skilled in the art will understand that the present invention is not limited to the specific embodiments described here, and that various obvious changes, readjustments, and substitutions can be made without departing from the protection scope of the present invention. Therefore, although the present invention has been described in some detail through the above embodiments, it is not limited to them; it may include further equivalent embodiments without departing from the inventive concept, and its scope is determined by the scope of the appended claims.
Claims (12)
1. A character relation extraction method, characterized by comprising:
generating a weak label data set containing character pairs by aligning a knowledge base with natural language text data in a corpus;
marking first sentences in the weak label data set that belong to the same character pair as a positive example packet of the relation of that character pair;
filtering the first sentences in the positive example packet according to a filtering algorithm based on preset relation indicator words, to obtain training positive example data;
performing feature extraction on the training positive example data and second sentences in a negative example packet, to obtain multi-factor feature vectors of the second sentences;
inputting the multi-factor feature vectors into a relation classifier, to obtain a relation classification result for the character pair.
2. The character relation extraction method according to claim 1, characterized in that there are multiple character pairs, and marking the first sentences in the weak label data set that belong to the same character pair as a positive example packet of the relation of that character pair comprises:
grouping the first sentences in the weak label data set that belong to the same character pair into one positive example packet;
marking the multiple positive example packets respectively.
3. The character relation extraction method according to claim 1, characterized in that filtering the first sentences in the positive example packet according to the filtering algorithm based on preset relation indicator words specifically comprises:
calculating, by a preset formula, the weight of each word obtained by segmenting the first sentences in the positive example packet, the formula being as follows:
wherein TI(w) denotes the weight of any word w obtained by segmenting a given first sentence in the positive example packet,
tf(w, s) denotes the normalized term frequency of the word w in sentence s,
idf(w, S) denotes the inverse document frequency of the word w in corpus S,
n_{wj} is the number of times the word occurs in sentence s,
Σ_k n_{kj} is the sum of the occurrence counts of all words appearing in sentence s,
|S| is the total number of first sentences in the weak label positive example data of one relation example packet,
|{j : w ∈ s, s ∈ S}| is the number of all sentences in corpus S that contain the word w;
selecting the three words with the highest weights in the first sentence, and judging whether at least one of these three top-weighted words is present in a preset relation dictionary;
when the judgment result is yes, retaining the first sentence as a positive example sentence.
4. The character relation extraction method according to claim 1, characterized in that the multi-factor feature vector of the second sentence comprises a lexical factor vector and a syntactic factor vector;
the lexical factor vector further comprises: a distance feature, a relative position feature, and a part-of-speech feature;
the syntactic factor vector further comprises: a syntactic dependency feature, a feature of the distance between an entity and the core predicate, and an entity context feature.
5. The character relation extraction method according to claim 4, characterized in that performing feature extraction on the training positive example data and the second sentences in the negative example packet, to obtain the multi-factor feature vectors of the second sentences, specifically comprises:
constructing the lexical factor vector according to word structure features in the second sentence;
constructing the syntactic factor vector according to semantic relation features in the second sentence;
merging the lexical factor vector with the syntactic factor vector to obtain the multi-factor feature vector of the second sentence.
6. A character relation extraction device, characterized in that the character relation extraction device comprises:
a data generation unit, configured to generate a weak label data set containing character pairs by aligning a knowledge base with natural language text data in a corpus;
a marking unit, configured to mark first sentences in the weak label data set that belong to the same character pair as a positive example packet of the relation of that character pair;
a filtering unit, configured to filter the first sentences in the positive example packet according to a filtering algorithm based on preset relation indicator words, to obtain training positive example data;
an extraction unit, configured to perform feature extraction on the training positive example data and second sentences in a negative example packet, to obtain multi-factor feature vectors of the second sentences;
a result acquisition unit, configured to input the multi-factor feature vectors into a relation classifier, to obtain a relation classification result for the character pair.
7. The character relation extraction device according to claim 6, characterized in that there are multiple character pairs, and the marking unit comprises:
a grouping subunit, configured to group the first sentences in the weak label data set that belong to the same character pair into one positive example packet;
a marking subunit, configured to mark the multiple positive example packets respectively.
8. The character relation extraction device according to claim 6, characterized in that the filtering unit specifically comprises:
a weight calculation subunit, configured to calculate, by a preset formula, the weight of each word obtained by segmenting the first sentences in the positive example packet, the formula being as follows:
wherein TI(w) denotes the weight of any word w obtained by segmenting a given first sentence in the positive example packet,
tf(w, s) denotes the normalized term frequency of the word w in sentence s,
idf(w, S) denotes the inverse document frequency of the word w in corpus S,
n_{wj} is the number of times the word occurs in sentence s,
Σ_k n_{kj} is the sum of the occurrence counts of all words appearing in sentence s,
|S| is the total number of first sentences in the weak label positive example data of one relation example packet,
|{j : w ∈ s, s ∈ S}| is the number of all sentences in corpus S that contain the word w;
a judgment subunit, configured to select the three words with the highest weights in the first sentence and judge whether at least one of these three top-weighted words is present in a preset relation dictionary;
a retention subunit, configured to retain the first sentence as a positive example sentence when at least one of the three top-weighted words appears in the preset relation dictionary.
9. The character relation extraction device according to claim 6, characterized in that the multi-factor feature vector of the second sentence comprises a lexical factor vector and a syntactic factor vector;
the lexical factor vector further comprises: a distance feature, a relative position feature, and a part-of-speech feature;
the syntactic factor vector further comprises: a syntactic dependency feature, a feature of the distance between an entity and the core predicate, and an entity context feature.
10. The character relation extraction device according to claim 9, characterized in that the extraction unit comprises:
a first construction subunit, configured to construct the lexical factor vector according to word structure features in the second sentence;
a second construction subunit, configured to construct the syntactic factor vector according to semantic relation features in the second sentence;
an extraction subunit, configured to merge the lexical factor vector with the syntactic factor vector to obtain the multi-factor feature vector of the second sentence.
11. A computer device, characterized in that the computer device comprises a processor, and the processor implements the steps of the character relation extraction method according to any one of claims 1-5 when executing a computer program stored in a memory.
12. A computer-readable storage medium having a computer program stored thereon, characterized in that the computer program, when executed by a processor, implements the steps of the character relation extraction method according to any one of claims 1-5.
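The preset formula referred to in claims 3 and 8 is not reproduced in the claim text. Judging from the symbol definitions given there (a normalized term frequency, an inverse document frequency, the sentence total |S|, and the number of sentences containing w), it is presumably the standard TF-IDF product. The following is a hedged reconstruction under that assumption; in particular, the logarithm and the absence of any smoothing term are assumed rather than confirmed by the text:

```latex
TI(w) = tf(w, s) \times idf(w, S), \qquad
tf(w, s) = \frac{n_{wj}}{\sum_{k} n_{kj}}, \qquad
idf(w, S) = \log \frac{|S|}{\left|\{\, j : w \in s,\ s \in S \,\}\right|}
```

Under this reading, a word that is frequent within one sentence but rare across the other sentences of S receives a high weight, which would explain why the three top-weighted words are the ones checked against the preset relation dictionary.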
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810587061.0A CN108959418A (en) | 2018-06-06 | 2018-06-06 | Character relation extraction method and device, computer device and computer readable storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108959418A (en) | 2018-12-07 |
Family
ID=64493904
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810587061.0A Pending CN108959418A (en) | 2018-06-06 | 2018-06-06 | Character relation extraction method and device, computer device and computer readable storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108959418A (en) |
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20140082003A1 (en) * | 2012-09-17 | 2014-03-20 | Digital Trowel (Israel) Ltd. | Document mining with relation extraction |
CN104657750A (en) * | 2015-03-23 | 2015-05-27 | 苏州大学张家港工业技术研究院 | Method and device for extracting character relation |
CN105678327A (en) * | 2016-01-05 | 2016-06-15 | 北京信息科技大学 | Method for extracting non-taxonomy relations between entities for Chinese patents |
CN106484675A (en) * | 2016-09-29 | 2017-03-08 | 北京理工大学 | Fusion distributed semantic and the character relation abstracting method of sentence justice feature |
Non-Patent Citations (2)
Title |
---|
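YANGCHEN HUANG et al.: "Multi-language person social relation extraction model based on distant supervision", 《2018 IEEE 3RD INTERNATIONAL CONFERENCE ON CLOUD COMPUTING AND BIG DATA ANALYSIS (ICCCBDA)》 * |
HUANG Yangchen et al.: "Multi-factor character relation extraction model based on distant supervision" (基于远程监督的多因子人物关系抽取模型), 《Journal on Communications》 (通信学报) * |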
Cited By (29)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109815296A (en) * | 2018-12-29 | 2019-05-28 | 北京中科闻歌科技股份有限公司 | The personage's construction of knowledge base method, apparatus and storage medium of notarization document |
CN109977235A (en) * | 2019-04-04 | 2019-07-05 | 吉林大学 | A kind of determination method and apparatus of trigger word |
CN109977235B (en) * | 2019-04-04 | 2022-10-25 | 吉林大学 | Method and device for determining trigger word |
CN110334355A (en) * | 2019-07-15 | 2019-10-15 | 苏州大学 | A kind of Relation extraction method, system and associated component |
CN110334355B (en) * | 2019-07-15 | 2023-08-18 | 苏州大学 | Relation extraction method, system and related components |
CN110457603B (en) * | 2019-08-16 | 2021-08-06 | 中国电子信息产业集团有限公司第六研究所 | User relationship extraction method and device, electronic equipment and readable storage medium |
CN110457603A (en) * | 2019-08-16 | 2019-11-15 | 中国电子信息产业集团有限公司第六研究所 | Customer relationship abstracting method, device, electronic equipment and readable storage medium storing program for executing |
CN110516239A (en) * | 2019-08-26 | 2019-11-29 | 贵州大学 | A kind of segmentation pond Relation extraction method based on convolutional neural networks |
CN110674637A (en) * | 2019-09-06 | 2020-01-10 | 腾讯科技(深圳)有限公司 | Character relation recognition model training method, device, equipment and medium |
CN110750994A (en) * | 2019-10-23 | 2020-02-04 | 北京字节跳动网络技术有限公司 | Entity relationship extraction method and device, electronic equipment and storage medium |
CN110825847A (en) * | 2019-10-31 | 2020-02-21 | 北京奇艺世纪科技有限公司 | Method and device for identifying intimacy between target people, electronic equipment and storage medium |
CN110825847B (en) * | 2019-10-31 | 2022-09-02 | 北京奇艺世纪科技有限公司 | Method and device for identifying intimacy between target people, electronic equipment and storage medium |
CN110837732B (en) * | 2019-10-31 | 2024-01-26 | 北京奇艺世纪科技有限公司 | Method and device for identifying intimacy between target persons, electronic equipment and storage medium |
CN110837732A (en) * | 2019-10-31 | 2020-02-25 | 北京奇艺世纪科技有限公司 | Method and device for identifying intimacy between target people, electronic equipment and storage medium |
CN110852107B (en) * | 2019-11-08 | 2023-05-05 | 北京明略软件系统有限公司 | Relation extraction method, device and storage medium |
CN110852107A (en) * | 2019-11-08 | 2020-02-28 | 北京明略软件系统有限公司 | Relationship extraction method, device and storage medium |
CN111104520B (en) * | 2019-11-21 | 2023-06-30 | 新华智云科技有限公司 | Personage entity linking method based on personage identity |
CN111104520A (en) * | 2019-11-21 | 2020-05-05 | 新华智云科技有限公司 | Figure entity linking method based on figure identity |
CN111209737B (en) * | 2019-12-30 | 2022-09-13 | 厦门市美亚柏科信息股份有限公司 | Method for screening out noise document and computer readable storage medium |
CN111209737A (en) * | 2019-12-30 | 2020-05-29 | 厦门市美亚柏科信息股份有限公司 | Method for screening out noise document and computer readable storage medium |
CN111476673A (en) * | 2020-04-02 | 2020-07-31 | 中国人民解放军国防科技大学 | Method, device and medium for aligning users among social networks based on neural network |
CN112347249A (en) * | 2020-10-30 | 2021-02-09 | 中科曙光南京研究院有限公司 | Alarm condition element extraction system and extraction method thereof |
CN112347249B (en) * | 2020-10-30 | 2024-02-27 | 中科曙光南京研究院有限公司 | Alert condition element extraction system and extraction method thereof |
CN112668342A (en) * | 2021-01-08 | 2021-04-16 | 中国科学院自动化研究所 | Remote supervision relation extraction noise reduction system based on twin network |
CN113254549B (en) * | 2021-06-21 | 2021-11-23 | 中国人民解放军国防科技大学 | Character relation mining model training method, character relation mining method and device |
CN113254549A (en) * | 2021-06-21 | 2021-08-13 | 中国人民解放军国防科技大学 | Character relation mining model training method, character relation mining method and device |
CN113361280A (en) * | 2021-06-30 | 2021-09-07 | 北京百度网讯科技有限公司 | Method for training model, prediction method, prediction device, electronic device and storage medium |
CN113361280B (en) * | 2021-06-30 | 2023-10-31 | 北京百度网讯科技有限公司 | Model training method, prediction method, apparatus, electronic device and storage medium |
CN115238217A (en) * | 2022-09-23 | 2022-10-25 | 山东省齐鲁大数据研究院 | Method for extracting numerical information from bulletin text and terminal |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108959418A (en) | Character relation extraction method and device, computer device and computer readable storage medium | |
CN109388795B (en) | Named entity recognition method, language recognition method and system | |
WO2020052405A1 (en) | Corpus annotation set generation method and apparatus, electronic device, and storage medium | |
CN109815336B (en) | Text aggregation method and system | |
CN109670039B (en) | Semi-supervised e-commerce comment emotion analysis method based on three-part graph and cluster analysis | |
US9373075B2 (en) | Applying a genetic algorithm to compositional semantics sentiment analysis to improve performance and accelerate domain adaptation | |
WO2018028077A1 (en) | Deep learning based method and device for chinese semantics analysis | |
CN108874878A (en) | A kind of building system and method for knowledge mapping | |
CN108984683A (en) | Extracting method, system, equipment and the storage medium of structural data | |
CN107766371A (en) | A kind of text message sorting technique and its device | |
CN110442725B (en) | Entity relationship extraction method and device | |
CN107301163B (en) | Formula-containing text semantic parsing method and device | |
US11934781B2 (en) | Systems and methods for controllable text summarization | |
CN110188359B (en) | Text entity extraction method | |
CN111091009B (en) | Document association auditing method based on semantic analysis | |
WO2022121146A1 (en) | Method and apparatus for determining importance of code segment | |
CN111898337A (en) | Single-sentence abstract defect report title automatic generation method based on deep learning | |
CN113051869B (en) | Method and system for realizing identification of text difference content by combining semantic recognition | |
CN112528653A (en) | Short text entity identification method and system | |
CN113626553A (en) | Cascade binary Chinese entity relation extraction method based on pre-training model | |
CN112948570A (en) | Unsupervised automatic domain knowledge map construction system | |
WO2022213864A1 (en) | Corpus annotation method and apparatus, and related device | |
CN117251539B (en) | Patent intelligent retrieval system using generative artificial intelligence | |
US20220318230A1 (en) | Text to question-answer model system | |
CN108197154B (en) | Online subset topic modeling method for interactive document exploration |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20181207 |