CN106484675A

CN106484675A - Person relation extraction method fusing distributed semantics and sentential semantic features

Info

Publication number: CN106484675A
Application number: CN201610866186.8A
Authority: CN
Inventors: 罗森林; 焦龙龙; 潘丽敏; 郭佳; 吴舟婷; 陈倩柔
Original assignee: Beijing Institute of Technology BIT
Current assignee: Beijing Institute of Technology BIT
Priority date: 2016-09-29
Filing date: 2016-09-29
Publication date: 2017-03-08

Abstract

The present invention relates to a person relation extraction method that fuses distributed semantics and sentential semantic features, and belongs to the field of natural language processing. The invention first uses statistical word-frequency features and a Bootstrapping algorithm to train a relation feature lexicon on a small amount of labeled corpus and a large amount of unlabeled corpus respectively. It then constructs the triple instances of a sentence using an element-distance optimization rule, fuses distributed semantic information and sentential semantic information to construct the triple feature space, and finally makes a yes/no binary decision on each triple, obtaining the person relation category by the confidence-maximization principle. The invention achieves automatic generation of the relation feature lexicon and converts the traditional multi-class relation classification problem into a yes/no binary decision problem on triples, which is better suited to traditional machine learning classification algorithms. It also uses distributed semantic information to improve the accuracy of relation classification.

Description

Person relation extraction method fusing distributed semantics and sentential semantic features

Technical field

The present invention relates to a method for automatically extracting person relations from a Chinese text or a Chinese text collection, and belongs to the field of computer science and information extraction technology.

Background art

Person relation extraction is the accurate, fast and automatic extraction of person entities scattered in a text and of the relations between those persons. It is a research topic in the field of information extraction.

Information extraction (IE) has two major tasks: entity recognition (EDR, Entity Detection and Recognition) and relation recognition (RDR, Relation Detection and Recognition). Relation recognition (also called "relation extraction") extracts the relations that exist between entities from text, where the types of these relations are predefined. Person relations are one kind of entity relation and refer to the association between two persons described in a text or text collection. Person relation extraction mainly addresses two problems: 1. obtaining the relation attribute between two persons (relation attribute extraction); 2. calculating the degree of association between two persons (relation strength calculation). In addition, how to organize and display the person relations scattered across a text or text collection also needs to be considered.

Person relation extraction methods fall into two main classes: methods based on pattern recognition and methods based on machine learning.

1. Methods based on pattern recognition:

1) Early pattern-recognition-based person relation extraction: these methods use lexical, syntactic and similar features to build the knowledge base (also called the rule base) needed for recognition, and then perform pattern matching against that knowledge base to extract relations. The most difficult step is establishing the person relation patterns (the person relation rule base). Establishing these patterns relies on linguists and sociologists to analyze the corpus of the target domain carefully and in depth, and to enumerate the various possible person relations when drafting the patterns. The drafting cycle of this approach is too long and its application cost is very high.

2) Improvements on the early methods: to address the problem of purely hand-written person relation patterns, later scholars proposed several solutions.

A) In the FASTUS extraction system proposed by Appelt et al., the concept of a "macro" is introduced so that domain-dependent rules are expressed in an extensible and general way. Users only need to change the parameter settings of the corresponding macro to quickly configure the relation pattern rules for a specific domain task. A macro groups a number of commands together so that they complete a specific task as a single command.

B) The Proteus extraction system proposed by Roman et al. uses a pattern construction method based on sample generalization. The drafted person relation patterns are generalized so that they can be applied to person relation extraction in a wider range of domains.

C) The REES system (Large-Scale Relation and Event Extraction System) built by Aone et al. performs relation extraction by constructing a knowledge base containing more than 100 kinds of person relation patterns.

D) For Chinese, some domestic scholars have also used pattern recognition to extract person relations. To reduce the workload of pattern writers, Jiang Jifa et al. proposed BRPAM, a bootstrapped method for acquiring binary relations and binary relation patterns, which expands the knowledge base (the person relation rule base) by bootstrapping from existing binary relations. Based on this method they designed BRPAM2Texts, an IE system that extracts binary relations from free text. Deng Bo et al. introduced lexical semantic matching into relation pattern matching and proposed a new relation extraction method. Because this method introduces the semantic features of words, the extracted person relations conform better to objective logic and accuracy improves; person relation extraction for different domains can be achieved with the lexicon of the relevant domain.

The pattern-recognition-based person relation extraction methods above still suffer from high development cost and low applicability.

2. Methods based on machine learning:

Machine-learning-based person relation extraction builds a classifier with a machine learning algorithm on a manually labeled corpus and then applies it to judge the category of person relations in the domain corpus. The most widely used machine learning algorithms at present are the MBL algorithm and the SVM algorithm. For example:

A) The Chinese named entity and relation extraction system built by Zhang et al. uses the MBL algorithm to construct classification rules from training data, and extracts entities and relations based on these rules.

B) Zhang, Che Wanxiang et al. use the SVM algorithm to learn relation extraction rules. He Tingting et al. proposed a method that takes a small number of manually selected entity relations as seeds (initial relations) and continuously expands the relation seed set through self-learning to extract entity relations.

C) Liu Lu et al. proposed an entity relation extraction method based on SVM training with positive and negative examples.

Machine-learning-based methods are relatively mature but still have problems. For example, when the corpus is not rich enough, the coverage of feature words is insufficient, which affects classification. Feature selection is crucial for machine learning algorithms, yet existing feature selection does not make full use of sentential semantic features and distributed information, so the feature analysis is not deep enough and the classification results are unsatisfactory.

Summary of the invention

To address the poor classification results caused by difficult feature selection and insufficiently deep feature analysis in machine learning algorithms, the present invention proposes a person relation extraction method that fuses distributed semantics and sentential semantic features, improving the automatic extraction of person relations from a Chinese text or a Chinese text collection.

The technical scheme is as follows:

First, statistical word-frequency features and a Bootstrapping algorithm are used to train a relation feature lexicon on a small amount of labeled corpus and a large amount of unlabeled corpus respectively. Then the triple instances of a sentence are constructed using an element-distance optimization rule, and lexical-level and sentential semantic features are fused to construct the triple feature space. Finally, a yes/no binary decision is made on each triple, and the person relation category is obtained by the confidence-maximization principle. The invention achieves automatic generation of the relation feature lexicon and converts the traditional multi-class relation classification problem into a yes/no binary decision problem on triples, which is better suited to traditional machine learning classification algorithms; it also uses sentential semantic features to improve the accuracy of relation classification. The overall process is shown in Fig. 1.

Step 1: automatic generation of the relation feature lexicon.

Person relation extraction is treated as a classification task. The invention defines eight major categories of person relations (mentor-student, family, superior-subordinate, competitive, friendship, romantic, nominal kinship and caregiving relations) as well as an "other relations" category. Relation feature words characterize the bidirectional relation between the persons described and are crucial for distinguishing the relation attribute between persons. The detailed flow of the algorithm proposed in this patent for automatically generating the relation feature lexicon is introduced below.

Step 1.1: after text preprocessing, the labeled corpus is used for training to obtain the initial seed word set. The detailed flow is as follows:

Step 1.1.1: the corpus is first preprocessed with the word segmentation tool ICTCLAS2013 from the Institute of Computing Technology of the Chinese Academy of Sciences, the automatic Chinese sentential semantic model construction system ACSM (Automatic Chinese Sentential Semantic Model) from the BFS laboratory, and the Scikit-learn toolkit, yielding the word segmentation, part-of-speech tags, named entity recognition results, the TF-IDF value of each word and the sentential semantic structure analysis results. Stop words are then removed, and the labeled corpus is used for training to obtain the initial seed word set.

The sentential semantic structure model (CSM) is a structured, formal representation of the components of sentence meaning and of the combination relations between those components. It expresses abstract sentence meaning as structured data that a computer can process, with the aim of helping the computer understand Chinese sentences from a deep semantic perspective. The model formalizes abstract sentence meaning as a mathematical structure over its components, so that the computer can recognize and process the meaning of Chinese sentences.

The elements of the sentential semantic structure model include: sentential semantic type, topic, comment, semantic cases, predicates, the Chinese tense system, spatio-temporal range information, combination relations between components, and so on. Based on these elements, the model is divided into four levels: the sentence-type layer, the description layer, the object layer and the detail layer. Its basic form is shown in Fig. 2.

From the sentence structure information and semantic information obtained by sentential semantic structure analysis, features that express sentence semantics are extracted; these features can express important information about person entities. Sentential semantic features are constructed from the combination relations between sentential semantic components. Specifically, on the basis of the automatically built sentential semantic structure model, the items corresponding to each semantic case (Table 1) are queried in turn and used as feature words, and feature phrases with more precise semantic expressiveness are formed by combining them in different ways according to the dependencies between semantic cases (see Fig. 2).

Table 1: Description of semantic case types

Step 1.1.2: the labeled corpus is partitioned by the relation category C_i it contains (0 < i < N, where N is the number of relation categories). If a sentence contains multiple relations, it is assigned repeatedly to each of the corresponding categories.

Step 1.1.3: for each category C, nouns and verbs are extracted as candidate seed words and their TF-IDF values are calculated. The criticality of each word with respect to all sentences in the training set is calculated according to formulas (1) and (2):

$$K(\mathit{word}) = \frac{\sum_{i=1}^{|C|} \mathrm{Apear}(\mathit{sen}_i, \mathit{word})}{n} \qquad (1)$$

$$\mathrm{Apear}(\mathit{sen}, \mathit{word}) = \begin{cases} \mathit{word}_{tfidf} & \mathit{word} \in \mathit{sen} \\ 0 & \mathit{word} \notin \mathit{sen} \end{cases} \qquad (2)$$

Here sen_i denotes sentence i, word denotes a candidate seed word, |C| is the total number of sentences in category C, K(word) is the degree of association between the candidate seed word and all sentences in the training set, n is the total number of words contained in all sentences of the category, word_tfidf is the TF-IDF value of the candidate word in the training set, and word ∈ sen means that the word occurs in the sentence.

TF-IDF is a statistical method for assessing how important a word is to one document within a document collection or corpus; word_tfidf denotes the weight of the candidate word in the training set. The main idea of TF-IDF is that if a word or phrase occurs with a high frequency (TF) in one article and seldom occurs in other articles, the word or phrase is considered to have good category-discriminating ability. TF-IDF represents the importance of a word to a text better than a plain word-frequency count, so formula (2) uses TF-IDF to measure the weight of a word.
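The criticality calculation of step 1.1.3 can be sketched as follows. This is a minimal illustration, assuming each sentence is already segmented into a list of tokens and that TF-IDF values are supplied in a dictionary; the function name `criticality` and the toy data are hypothetical and not part of the patent. It computes K(word) as the sum of the word's TF-IDF value over the sentences of one category that contain the word, divided by n, the total number of words in those sentences.

```python
from typing import Dict, List


def criticality(word: str, sentences: List[List[str]], tfidf: Dict[str, float]) -> float:
    """K(word): sum of Apear(sen_i, word) over one category's sentences, divided by n."""
    # n: total number of words contained in all sentences of the category
    n = sum(len(sen) for sen in sentences)
    if n == 0:
        return 0.0
    # Apear(sen, word) is the word's TF-IDF value if the word occurs in sen, else 0
    total = sum(tfidf.get(word, 0.0) for sen in sentences if word in sen)
    return total / n


# Toy, made-up data for one relation category (not from the patent's corpus)
category_sentences = [
    ["Sun", "coach", "Liu"],
    ["coach", "guide", "athlete"],
    ["athlete", "train"],
]
toy_tfidf = {"coach": 0.5, "guide": 0.25, "train": 0.25}

print(criticality("coach", category_sentences, toy_tfidf))
```

In practice the TF-IDF values would come from the preprocessing stage rather than a hand-written dictionary.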

Step 1.1.4: according to the coding information of the Tongyici Cilin synonym thesaurus, the K values of all synonyms of a candidate seed word are summed, and the sum represents the new criticality of that word.

Step 1.1.5: the candidate seed words are sorted by their final K, and a threshold is set. The words whose K exceeds the threshold form the initial seed word set of the category. The threshold generally depends on the number of sentences and is determined experimentally.
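Steps 1.1.4 and 1.1.5 can be sketched as below. The sketch assumes that synonym information is available as plain sets of words (a hypothetical stand-in for groups sharing a thesaurus code) and that a word's own K is included in the sum; both are assumptions, as are the function name and the toy values.

```python
from typing import Dict, List, Set


def select_seed_words(
    k_values: Dict[str, float],
    synonym_groups: List[Set[str]],
    threshold: float,
) -> List[str]:
    """Sum K over each word's synonym group, rank by the result, keep words above threshold."""
    final_k: Dict[str, float] = {}
    for word in k_values:
        # Find the synonym group containing the word; a word without a group stands alone
        group = next((g for g in synonym_groups if word in g), {word})
        final_k[word] = sum(k_values.get(w, 0.0) for w in group | {word})
    ranked = sorted(final_k, key=lambda w: final_k[w], reverse=True)
    return [w for w in ranked if final_k[w] > threshold]


# Toy, made-up criticality values and one hypothetical synonym group
toy_k = {"coach": 0.25, "trainer": 0.125, "friend": 0.125, "table": 0.0625}
toy_groups = [{"coach", "trainer"}]

print(select_seed_words(toy_k, toy_groups, threshold=0.2))
```

Summing over the group lets a rarely used synonym inherit the evidence collected for its more frequent counterparts, which is the effect step 1.1.4 describes.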

Step 1.2: the initial seed word set obtained in step 1.1 is expanded with the Bootstrapping algorithm, which comprises four basic procedures.

Step 1.2.1: nouns and verbs are extracted from the large unlabeled corpus as candidate words.

Step 1.2.2: for the seed word set of each relation category C, the weight M of each candidate word w is calculated using mutual information, as shown in formula (3),

where sword denotes a seed word, F(w) is the number of sentences in the whole corpus that contain w, F(sword) is the number of sentences in the whole corpus that contain the seed word sword, the co-occurrence frequency F(w, sword) is the number of sentences in which the candidate word and the seed word sword occur together, and F_all is the total number of sentences in the whole corpus.

Step 1.2.3: the words satisfying F(w) > F_min(w) and M > M_min are selected and merged with the seed word set to form the new seed word set, where F_min(w) is the minimum number of sentences, set to 5, and M_min is the configured minimum weight.

Step 1.2.4: steps 1.2.2 and 1.2.3 are repeated until no new word satisfying the conditions is produced. The steps above automatically generate the relation feature lexicon for all categories.
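The expansion loop of step 1.2 can be sketched as follows. The mutual-information weight follows the form given in claim 1, M = Σ log(P(w, sword) / (P(w) · P(sword))). The sketch assumes that the probabilities are estimated from sentence counts as P(x) = F(x) / F_all, and that seed words which never co-occur with the candidate are skipped because the logarithm is undefined; both are assumptions. The value F_min = 5 comes from the text, while the default `m_min`, the function names and the toy corpus are hypothetical.

```python
import math
from typing import List, Set


def sentence_count(sentences: List[Set[str]], *words: str) -> int:
    """Number of sentences that contain all of the given words."""
    return sum(1 for s in sentences if all(w in s for w in words))


def mi_weight(w: str, seeds: Set[str], sentences: List[Set[str]]) -> float:
    """M for candidate w: sum over seed words of log(P(w, sword) / (P(w) * P(sword)))."""
    f_all = len(sentences)
    f_w = sentence_count(sentences, w)
    m = 0.0
    for sword in seeds:
        f_ws = sentence_count(sentences, w, sword)
        if f_ws == 0:
            continue  # assumption: skip pairs that never co-occur (log undefined)
        f_s = sentence_count(sentences, sword)
        # With P(x) = F(x) / F_all the ratio simplifies to F(w,s) * F_all / (F(w) * F(s))
        m += math.log((f_ws * f_all) / (f_w * f_s))
    return m


def bootstrap(seeds: Set[str], candidates: Set[str], sentences: List[Set[str]],
              f_min: int = 5, m_min: float = 0.0) -> Set[str]:
    """Repeat steps 1.2.2 and 1.2.3 until no new word satisfies the conditions."""
    seeds = set(seeds)
    while True:
        new_words = {
            w for w in candidates - seeds
            if sentence_count(sentences, w) > f_min and mi_weight(w, seeds, sentences) > m_min
        }
        if not new_words:
            return seeds
        seeds |= new_words


# Toy, made-up unlabeled corpus: each sentence is a set of its nouns and verbs
toy_corpus = [{"coach", "train"}] * 6 + [{"weather"}] * 6
print(bootstrap({"coach"}, {"train", "weather"}, toy_corpus))
```

The loop terminates because the seed set only grows and the candidate set is finite.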

Step 2: construction of the triple feature space. In a sentence, the relation attribute reflected by a relation feature word (relation word) that appears may belong to two persons who appear in that sentence. For example, in "As Liu Xiang's coach, Sun Haiping was quite satisfied with the result of 13.08 seconds", the word "coach" reflects the "mentor-student relation" between "Liu Xiang" and "Sun Haiping". Define <person, relation, person> as a relation triple instance. A yes/no binary decision on whether the relation attribute holds for the triple then converts the multi-class problem into a two-class problem.

Step 2.1: the person names in each sentence are extracted to obtain the sentence's name list <Name₁, Name₂, …, Name_n>, and all names in the list are paired two by two to form the pairs <(Name₁, Name₂), (Name₂, Name₃), …, (Name_n-1, Name_n)>.

Step 2.2: using the relation feature lexicon generated in step 1, the relation feature word list <W₁, W₂, …, W_m> of the sentence is obtained and each word is added to the pairs in turn, exhaustively forming the triple instances <(Name_i, W_k, Name_j)> for every i, j, k satisfying 0 < i <= n, 0 < j <= n, 0 < k <= m.
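The enumeration in steps 2.1 and 2.2 can be sketched as follows. The sketch assumes that a pair consists of two distinct names and that each unordered pair is generated once; the function name `build_triples` is hypothetical.

```python
from itertools import combinations
from typing import List, Tuple


def build_triples(names: List[str], feature_words: List[str]) -> List[Tuple[str, str, str]]:
    """Pair the names two by two and attach every relation feature word to every pair."""
    return [
        (name_a, word, name_b)
        for name_a, name_b in combinations(names, 2)  # assumption: distinct, unordered pairs
        for word in feature_words
    ]


triples = build_triples(["Liu Xiang", "Sun Haiping"], ["coach"])
print(triples)
```

A sentence with n names and m feature words yields n(n-1)/2 · m candidate triples under this assumption, which is why the later distance filtering and binary decision are needed to discard the implausible ones.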

Step 2.3: using the word2vec method from deep learning, the word vector of every element of the feature word list <W₁, W₂, …, W_m> and of the name list <Name₁, Name₂, …, Name_n> is calculated, giving the word vector W_Vec_k of each feature word and the word vector NameVec_i of each name.

Step 2.4: using string matching, the positions of the three elements of a triple instance in the sentence are obtained. For every combination <(pos(Name_i), pos(W_k), pos(Name_j))>, together with the semantic information among <(Name_i, W_k, Name_j)>, the distance d of the triple instance is calculated with formula (4).

Here pos(Name_i) is the character position of Name_i in the sentence, dis(pos(Name_i), pos(Name_j)) is the number of words separating the two person entities, dis(pos(Name_i), pos(W_k)) is the number of words separating relation word k and entity i, and dim(NameVec_i, NameVec_j) is the similarity between the two word vectors: the closer the semantics of the two words, the larger dim(NameVec_i, NameVec_j), and vice versa. The formula combines distributed semantic information with sentential semantic features so that the distance d better represents the positional information of the triple instance. Punctuation marks are counted as 5 characters, and the combination that minimizes d is selected to represent the positional information of the triple instance.

Step 2.5: if the positional information satisfies dis > dis_min, d > d_min (where d_min is the acceptable minimum distance threshold), the triple instance is excluded. This yields the final triple samples, and the corresponding feature vectors are constructed from the positional information.
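The ingredients of the distance in step 2.4 can be sketched as below. Formula (4) itself is not reproduced in the text, so the sketch deliberately stops short of combining the terms into d. It assumes token-level positions, counts a punctuation token as 5 as the text states, and uses cosine similarity as a stand-in for the vector similarity; the similarity measure, the punctuation set, the function names and the toy sentence are all assumptions.

```python
import math
from typing import List, Sequence

PUNCTUATION = set(",.;:!?，。；：、！？")  # assumption: which tokens count as punctuation


def separation(tokens: List[str], i: int, j: int) -> int:
    """Size of the gap between token positions i and j; a punctuation token counts as 5."""
    lo, hi = sorted((i, j))
    return sum(5 if tok in PUNCTUATION else 1 for tok in tokens[lo + 1:hi])


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two word vectors (stand-in for the similarity term)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Toy segmented sentence; positions are found by matching the token strings
toy_tokens = ["as", "Liu Xiang", "'s", "coach", ",", "Sun Haiping", "is", "pleased"]
pos_a = toy_tokens.index("Liu Xiang")
pos_w = toy_tokens.index("coach")
pos_b = toy_tokens.index("Sun Haiping")

print(separation(toy_tokens, pos_a, pos_b))   # gap between the two person names
print(separation(toy_tokens, pos_a, pos_w))   # gap between the first name and the relation word
```

When a name or feature word occurs more than once in a sentence, each combination of positions would be evaluated and the one giving the smallest d kept, as the text describes.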

Step 3: yes/no binary decision on the triples.

In the decision tree trained with C4.5, every result judged "true" has a corresponding confidence factor P⁺. This is the confidence of the candidate relation combination judged "yes", and it can be used to screen relation combination results that conflict with each other. If a triple instance is judged "true", the confidence P⁺ is used as its weight. All triples involving the same two person entities are compared, and the relation attribute with the largest weight is taken as the final person relation result for those two entities.
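The confidence-maximization rule of step 3 can be sketched as follows. The sketch assumes that the classifier output is already available as tuples of (name, relation, name, judged true, confidence) and that a pair of persons is treated as unordered; the tuple layout, the function name and the toy values are hypothetical.

```python
from typing import Dict, FrozenSet, Iterable, Tuple

Judgement = Tuple[str, str, str, bool, float]  # (name_a, relation, name_b, is_true, confidence)


def resolve_relations(judgements: Iterable[Judgement]) -> Dict[FrozenSet[str], str]:
    """For each person pair keep the relation of the 'true' triple with the highest confidence."""
    best: Dict[FrozenSet[str], Tuple[str, float]] = {}
    for name_a, relation, name_b, is_true, confidence in judgements:
        if not is_true:
            continue  # triples judged "false" never compete
        pair = frozenset((name_a, name_b))
        if pair not in best or confidence > best[pair][1]:
            best[pair] = (relation, confidence)
    return {pair: relation for pair, (relation, _) in best.items()}


# Toy, made-up classifier output for one sentence
toy_judgements = [
    ("Liu Xiang", "mentor-student", "Sun Haiping", True, 0.9),
    ("Liu Xiang", "friendship", "Sun Haiping", True, 0.6),
    ("Liu Xiang", "family", "Sun Haiping", False, 0.95),
]
print(resolve_relations(toy_judgements))
```

Because only triples judged "true" compete, a high-confidence "false" decision such as the third toy entry has no influence on the result.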

Beneficial effect

Compared with methods based on machine learning, the present invention offers fast recognition and high accuracy.

Compared with methods based on pattern recognition, the present invention is more widely applicable and more extensible.

Compared with methods based on pattern recognition, the technique adopted by the present invention consumes less computation, so it is suitable not only for desktop computers but also for mobile computing platforms such as mobile phones and tablet computers.

Compared with person relation extraction methods based on semantic patterns, the sentential semantic features of the present invention provide deeper analysis and thus ensure higher recognition accuracy.

Description of the drawings

Fig. 1 is a schematic diagram of the person relation extraction algorithm of the present invention;

Fig. 2 is a structure diagram of the basic form of the sentential semantic structure model;

Fig. 3 is an example (partial) of a decision tree trained with C4.5 for the yes/no binary decision on person relation combinations;

Fig. 4 is a comparison chart of the parameter selection experiment results for the automatic relation feature lexicon generation algorithm.

Specific embodiment

To better illustrate the objects and advantages of the present invention, embodiments of the method are described in further detail below with reference to the accompanying drawings and examples.

The data source is the BFS hot-topic person retrieval corpus, covering "Yao Ming", "Liu Xiang", "Zhou Jielun", "James", "Cheng Long", "Bryant" and "Xie Tingfeng". The labeled corpus totals 1540 texts and contains 2389 sentences with at least two person names; there are 10000 unlabeled sentences. The data source is described in Table 1, where the number of person entities was obtained by manual counting.

Table 1: Data sources of the person relation extraction experiments

To verify the person relation extraction method, three experiments were carried out:

(1) Parameter selection experiment: select the best combination of the thresholds K and M used in initial seed word extraction and in the Bootstrapping algorithm, where K is the threshold on the criticality of initial seed words and M is the threshold on the weight of candidate seed words.

(2) Relation feature lexicon comparison experiment: compare the automatically extracted lexicon with a manually compiled lexicon, to verify that the automatically extracted lexicon has broader coverage and a good degree of match with the relation categories.

(3) Person relation extraction experiment: check the accuracy and completeness of the person relation extraction algorithm proposed in this patent, and compare it with other relation extraction algorithms.

The test procedures are described one by one below. All tests were completed on the same computer, configured as follows: Intel dual-core CPU (3.0 GHz), 4.00 GB of memory, Windows 7 operating system.

For parameter selection and for the extracted person relations, precision, recall and F value are likewise used for evaluation. The calculation is the same as in formulas (5) to (7), with the parameters taking the following meanings:

A) the number of correct person relation attributes that were extracted;

B) the number of incorrect person relation attributes that were extracted;

C) the number of person relation attributes that were not extracted.
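The evaluation measures can be sketched as follows. Formulas (5) to (7) are not reproduced in the text, so the sketch assumes the standard definitions: precision = A / (A + B), recall = A / (A + C), and F as the harmonic mean of the two. The function name and the example counts are hypothetical.

```python
from typing import Tuple


def precision_recall_f(a: int, b: int, c: int) -> Tuple[float, float, float]:
    """a: correct extracted, b: incorrect extracted, c: not extracted (missed)."""
    precision = a / (a + b) if (a + b) else 0.0
    recall = a / (a + c) if (a + c) else 0.0
    f_value = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f_value


# Made-up counts purely for illustration
p, r, f = precision_recall_f(a=8, b=2, c=2)
print(f"precision={p:.3f} recall={r:.3f} F={f:.3f}")
```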

For the relation feature lexicon comparison experiment, an expert scoring strategy is used. Two researchers with long experience in natural language processing score each word with an integer from -3 to +3 according to how well the word matches its relation category, where -3 means a very poor match and +3 means a very good match. The total score and the average are then computed.

In the experiments, word segmentation uses ICTCLAS (Institute of Computing Technology, Chinese Lexical Analysis System), provided by the Institute of Computing Technology of the Chinese Academy of Sciences, as the lexical analysis tool. The person name recognition accuracy of ICTCLAS exceeds 98% (973 evaluation), so this function is used directly to identify person entities.

To train the yes/no binary classification model for triple instances and to make the decisions, 22 lexical-level and sentential semantic features are selected to construct the feature space. Each triple instance is represented as a vector in this space, and the classification model is trained and tested on these vectors. The features are listed in Table 2.

Table 2: Features used to construct the triple instance feature space

1. Parameter selection experiment

A grid search is used. The parameter K is the minimum criticality threshold for initial seed words, and the parameter M is the minimum score threshold used during extraction in the Bootstrapping algorithm. K and M are first normalized and each is varied from 0.1 to 0.9 in steps of 0.05 while the F value is observed coarsely. The results are better when K lies in the interval 0.4 to 0.6 and M lies in the interval 0.5 to 0.7. A finer grid search is then carried out within these intervals to obtain the best parameter combination.

In the experiment, K varies over the interval 0.4 to 0.6 in steps of 0.02 and M varies over the interval 0.5 to 0.64 in steps of 0.02, where M1 = 0.5. The abscissa is the K value and the ordinate is the F value. The results of the parameter selection experiment are shown in Fig. 4. They show that the relation feature lexicon is most suitable when K is 0.52 and M is 0.54.
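The fine grid search can be sketched as follows. The grid bounds and step come from the text; the scoring function is a made-up stand-in that peaks at an arbitrary point, because in the real experiment each (K, M) pair would require generating a lexicon and measuring the F value. The names `grid`, `grid_search` and `toy_score` are hypothetical.

```python
from typing import Callable, List, Tuple


def grid(start: float, stop: float, step: float) -> List[float]:
    """Evenly spaced values from start to stop inclusive, rounded to avoid float drift."""
    count = round((stop - start) / step)
    return [round(start + i * step, 2) for i in range(count + 1)]


def grid_search(evaluate: Callable[[float, float], float],
                k_values: List[float], m_values: List[float]) -> Tuple[float, float, float]:
    """Return (best K, best M, best score) over all combinations."""
    score, k, m = max((evaluate(k, m), k, m) for k in k_values for m in m_values)
    return k, m, score


def toy_score(k: float, m: float) -> float:
    """Made-up surrogate for the F value; NOT experimental data."""
    return 1.0 - abs(k - 0.5) - abs(m - 0.6)


k_grid = grid(0.4, 0.6, 0.02)    # fine grid for K, as described in the text
m_grid = grid(0.5, 0.64, 0.02)   # fine grid for M, as described in the text
print(grid_search(toy_score, k_grid, m_grid))
```

Running a coarse pass first (0.1 to 0.9 in steps of 0.05) and then refining only the promising interval keeps the number of expensive evaluations small.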

2. Relation feature lexicon comparison experiment

The relation feature lexicon is generated automatically with the best parameter combination found in the parameter selection experiment and compared with the manually compiled lexicon that Feng Yangbo used in his person relation extraction system. The total number of words in each lexicon is counted, and the expert scoring strategy is used to compare the two lexicons by score.

The results of the relation feature lexicon comparison experiment are shown in Table 3.

Table 3: Relation feature lexicon comparison results

As Table 3 shows, compared with the manually compiled lexicon, the lexicon obtained by the automatic generation algorithm is 77.6% larger in total word count, and the number of words in every positive score range increases substantially. The total score improves by 152 points, indicating higher overall quality. The average score declines slightly, because each word in the manually compiled lexicon was selected subjectively and is therefore more interpretable. It can be concluded that the coverage and overall quality of the lexicon are greatly improved without significantly reducing the degree of match of individual words.

3. Person relation extraction experiment

The C4.5 model for person relation decisions is first trained on the 2389 labeled sentences. Automatic person relation extraction is then performed. Finally, precision, recall and F value are calculated against the standard relation triples obtained by manual counting.

For the overall comparison, a model is first trained and tested using distributed semantic information combined with sentential semantic features, giving the final result of the algorithm of this patent. Person relation extraction is then performed on the same data source with an entity relation extraction algorithm based on semantic patterns, an entity relation extraction algorithm based on SVM, and an SVM named entity relation extraction algorithm trained on positive and negative examples. The first of these is based on pattern recognition and the other two are based on machine learning.

The results of the person relation extraction experiment are shown in Table 4.

Table 4: Person relation extraction experiment results

As Table 4 shows, the method that combines distributed semantic information with sentential semantic features performs best. This is because distributed semantic information accurately expresses information such as word order and part of speech. Combining it with highly discriminative sentential semantic features greatly improves the expressive power of the feature space. This is also reflected in the comparison experiment, where the overall F value of the algorithm reaches 83.8%, better than the other relation extraction algorithms.

Comparison with existing strong algorithms shows that the algorithm of this patent is clearly better than the entity relation extraction method based on pattern recognition, and is also better than general machine-learning-based entity relation extraction algorithms when applied to person relation extraction. The reasons are as follows. First, the relation feature lexicon is generated automatically, and the improved Bootstrapping algorithm expands the coverage of relation feature words, which has a positive effect on the recall of triple decisions. Second, the traditional multi-class relation classification problem is converted into a yes/no binary decision problem on triples, which is better suited to traditional machine learning classification algorithms. Third, the sentential semantic structure model is used to analyze the samples more deeply, identifying the different semantic components and structural features of a triple, which serve as strong features that effectively constrain the triple instance information and clearly improve precision.

Claims

1. a kind of fusion distributed semantic and sentence justice feature character relation abstracting method, it is characterised in that using statistics word frequency Feature and Bootstrapping algorithm, train in markd language material on a small quantity and in a large number unmarked language material respectively and are closed It is feature lexicon, then the triple example that element constructs sentence apart from optimization rule is combined by distributed semantic information, Fusion morphology layer and sentence justice latent structure triple feature space, finally carry out being non-binary decision to triple, using confidence Degree maximization principle obtains character relation classification, comprises the steps：

Step 1, through pretreatment, the language material to tape label is trained, obtains initial seed word set, then use Bootstrapping algorithm is expanded to initial seed word set, production Methods feature lexicon, is comprised the following steps that：

Step 1.1, carries out division classification, Text Pretreatment to training set language material, training, generates initial seed word set, concrete stream Journey is as follows：

Step 1.1.1, the language material of tape label is divided into corresponding relation classification C_i(0<i<N, N represent relation categorical measure) In, if sentence include multiple relations, will its repeat to be subdivided in corresponding plurality of classes；

Step 1.1.2, preprocess the corpus to obtain word segmentation, part-of-speech tags, named entity recognition results, the TF-IDF value of each word, and the results of sentence-meaning structure analysis;

Step 1.1.3, for each category C, extract nouns and verbs as candidate seed words and compute the criticality K of each of these words; K is computed as follows:

K(word) = \frac{\sum_{i=1}^{|C|} Apear(sen_i, word)}{n}

Apear(sen, word) = \begin{cases} word_{tfidf}, & word \in sen \\ 0, & word \notin sen \end{cases}

where sen_i denotes sentence i; word denotes a candidate seed word; |C| denotes the total number of sentences in category C; K(word) denotes the degree of association between the candidate seed word and all sentences in the training set; n denotes the total number of words contained in all sentences of the category; word_tfidf denotes the TF-IDF value of the candidate word in the training set; and word ∈ sen means that the word occurs in the sentence;
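As an illustration only, the criticality computation of step 1.1.3 might be sketched in Python as follows. The function names `apear` and `criticality`, the representation of sentences as token lists, and the toy TF-IDF values are assumptions made for this sketch and are not given in the source.

```python
from typing import Dict, List


def apear(sentence: List[str], word: str, tfidf: Dict[str, float]) -> float:
    """Return the word's TF-IDF value if it occurs in the sentence, else 0."""
    return tfidf.get(word, 0.0) if word in sentence else 0.0


def criticality(word: str, category: List[List[str]], tfidf: Dict[str, float]) -> float:
    """K(word): summed Apear over all sentences of the category, divided by n."""
    n = sum(len(sentence) for sentence in category)  # total words in the category
    if n == 0:
        return 0.0
    return sum(apear(sentence, word, tfidf) for sentence in category) / n


# Hypothetical toy category with two segmented sentences.
category_sentences = [["Zhang", "wife", "Li"], ["Wang", "married", "wife"]]
toy_tfidf = {"wife": 0.6, "married": 0.3}

k_wife = criticality("wife", category_sentences, toy_tfidf)
k_married = criticality("married", category_sentences, toy_tfidf)
```

In this toy category, "wife" occurs in both sentences and therefore receives a higher K than "married", which occurs in only one.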

Step 1.1.4, according to the coding information of the synonym thesaurus 《同义词词林》 (Tongyici Cilin), sum the K values of all synonyms of a candidate seed word to obtain the new criticality of that word; rank the candidate seed words by the final K, then set a threshold and extract the words whose K exceeds the threshold to form the initial seed word set of the category; the threshold generally depends on the number of sentences and is determined experimentally;
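The synonym merging and thresholding of step 1.1.4 could look roughly like the sketch below. It assumes that each word maps to a single thesaurus code and that words sharing a code are synonyms; the names `merge_synonym_scores`, `select_seeds`, the codes, and the threshold value are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def merge_synonym_scores(k_scores: Dict[str, float], code_of: Dict[str, str]) -> Dict[str, float]:
    """Replace each word's K by the sum of K over all words sharing its thesaurus code."""
    group_total: Dict[str, float] = defaultdict(float)
    for word, k in k_scores.items():
        # Words without a thesaurus code form their own singleton group.
        group_total[code_of.get(word, word)] += k
    return {word: group_total[code_of.get(word, word)] for word in k_scores}


def select_seeds(k_scores: Dict[str, float], code_of: Dict[str, str], threshold: float) -> List[str]:
    """Rank candidates by merged K and keep those strictly above the threshold."""
    merged = merge_synonym_scores(k_scores, code_of)
    ranked = sorted(merged.items(), key=lambda item: item[1], reverse=True)
    return [word for word, k in ranked if k > threshold]


# Hypothetical scores and thesaurus codes.
scores = {"wife": 0.2, "spouse": 0.1, "city": 0.05}
codes = {"wife": "A1", "spouse": "A1"}
seeds = select_seeds(scores, codes, threshold=0.25)
```

Here "wife" and "spouse" share a code, so both receive the combined score and pass the threshold, while "city" is dropped.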

Step 1.2, using the initial seed word set extracted in step 1.1 and a large amount of unlabeled corpus, expand the initial seed word set with the Bootstrapping algorithm to generate the relation feature dictionary, specifically as follows:

Step 1.2.1, extract nouns and verbs from the large unlabeled corpus as candidate words;

Step 1.2.2, for the seed word set of each relation category C in turn, compute the value M by the mutual information method; the formulas are:

M = \sum_{sword \in C} \log \frac{P(w, sword)}{P(w) \cdot P(sword)}

P(w, sword) = \frac{F(w, sword)}{F_{all}}

P(w) = \frac{F(w)}{F_{all}}

P(sword) = \frac{F(sword)}{F_{all}}

where sword denotes a seed word; F(w) denotes the number of sentences in the whole corpus that contain w; F(sword) denotes the number of sentences in the whole corpus that contain the initial word sword; the co-occurrence frequency F(w, sword) denotes the number of sentences in which the candidate word and the initial word sword occur together; and F_all denotes the total number of sentences in the whole corpus;
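A minimal sketch of the M computation, assuming sentences are given as token lists and using the natural logarithm (the source does not state the base). Terms whose counts are zero are skipped because the logarithm would be undefined; that handling, like the function name `mutual_information_score`, is an assumption of this sketch.

```python
import math
from typing import Iterable, List


def mutual_information_score(w: str, seed_words: Iterable[str], sentences: List[List[str]]) -> float:
    """M: sum over seed words of log( P(w, sword) / (P(w) * P(sword)) )."""
    sentence_sets = [set(s) for s in sentences]
    f_all = len(sentence_sets)
    f_w = sum(1 for s in sentence_sets if w in s)
    score = 0.0
    for sword in seed_words:
        f_s = sum(1 for s in sentence_sets if sword in s)
        f_ws = sum(1 for s in sentence_sets if w in s and sword in s)
        if f_w == 0 or f_s == 0 or f_ws == 0:
            continue  # assumption: undefined log terms contribute nothing
        p_ws, p_w, p_s = f_ws / f_all, f_w / f_all, f_s / f_all
        score += math.log(p_ws / (p_w * p_s))
    return score


# Hypothetical unlabeled corpus of four segmented sentences.
corpus = [
    ["Zhang", "wife", "married"],
    ["Wang", "married", "wife"],
    ["Li", "travel"],
    ["Zhao", "city"],
]
m_married = mutual_information_score("married", ["wife"], corpus)
```

With these counts, P(w, sword) = 0.5 and P(w) · P(sword) = 0.25, so M = log 2.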

Step 1.2.3, select the words satisfying F(w) > F_min(w) and M > M_min, and take their union with the seed word set as the new seed word set, where F_min(w) denotes the minimum number of sentences, set to 5, and M_min is the configured minimum weight;

Step 1.2.4, repeat steps 1.2.2 and 1.2.3 until no new word satisfying the conditions is produced; the above steps automatically generate the relation feature dictionary for every category;
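The iteration of steps 1.2.2 to 1.2.4 for a single category can be sketched as a fixed-point loop. This is a simplified illustration rather than the patented implementation: the function name `expand_seeds`, the default `m_min=0.0`, and the toy corpus are assumptions, and the example lowers `f_min` from the stated value of 5 so that a six-sentence corpus is enough to show the behavior.

```python
import math
from typing import List, Set


def expand_seeds(seeds: Set[str], candidates: Set[str], sentences: List[List[str]],
                 f_min: int = 5, m_min: float = 0.0) -> Set[str]:
    """Grow the seed set until no candidate satisfies F(w) > f_min and M > m_min."""
    sentence_sets = [set(s) for s in sentences]
    f_all = len(sentence_sets)

    def freq(*words: str) -> int:
        # Number of sentences containing every given word.
        return sum(1 for s in sentence_sets if all(x in s for x in words))

    seeds = set(seeds)
    while True:
        new_words = set()
        for w in candidates - seeds:
            f_w = freq(w)
            if f_w <= f_min:
                continue
            m = 0.0
            for sword in seeds:
                f_ws, f_s = freq(w, sword), freq(sword)
                if f_ws and f_s:
                    # Equals log(P(w, sword) / (P(w) * P(sword))).
                    m += math.log(f_ws * f_all / (f_w * f_s))
            if m > m_min:
                new_words.add(w)
        if not new_words:
            return seeds
        seeds |= new_words


# Hypothetical corpus; f_min is lowered to 1 only because the corpus is tiny.
toy_corpus = [
    ["wife", "married"], ["wife", "married"],
    ["married", "wedding"], ["married", "wedding"],
    ["city", "road"], ["city", "road"],
]
expanded = expand_seeds({"wife"}, {"married", "wedding", "city"}, toy_corpus, f_min=1)
```

In this example "married" is added in the first round because it co-occurs with "wife", "wedding" is added in the second round because it co-occurs with "married", and "city" is never added.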

Step 2, construct the triple feature space: define <person-relation-person> as a relation triple instance; by making a yes/no binary decision on whether a person relation attribute holds, the multi-class problem is converted into a binary classification problem, specifically as follows:

Step 2.1, extract the named entities in each sentence to obtain the person name list <Name₁, Name₂, …, Name_n> of the sentence, and pair all names in the list two by two to form the pair relations <(Name₁, Name₂), (Name₂, Name₃), …, (Name_{n-1}, Name_n)>;

Step 2.2, using the relation feature dictionary generated in step 1, obtain the relation feature words <W₁, W₂, …, W_m> in the sentence and add them to the pair relations in turn, exhaustively forming the triple instances <(Name_i, W_k, Name_j)>, where each i, j, k satisfies 0 < i <= n, 0 < j <= n, 0 < k <= m;
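Steps 2.1 and 2.2 amount to an exhaustive enumeration, which might be sketched as below. The sketch assumes that "paired two by two" means every unordered pair of distinct names; the function name `build_triples` and the sample names are hypothetical.

```python
from itertools import combinations, product
from typing import List, Tuple


def build_triples(names: List[str], relation_words: List[str]) -> List[Tuple[str, str, str]]:
    """Combine every unordered name pair with every relation feature word."""
    pairs = list(combinations(names, 2))  # assumption: unordered pairs of distinct names
    return [(a, w, b) for (a, b), w in product(pairs, relation_words)]


# Hypothetical sentence with three person names and two relation feature words.
triples = build_triples(["Zhang", "Li", "Wang"], ["wife", "friend"])
```

Three names yield three pairs, and two feature words per pair give six candidate triples, which later steps filter and classify.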

Step 2.3, compute the word vector of each element of the feature word list <W₁, W₂, …, W_m> and of the person name list <Name₁, Name₂, …, Name_n>, obtaining the word vector W_Vec of each feature word and the word vector NameVec_i of each person name;

Step 2.4, using the word2vec method from deep learning and string matching, obtain respectively the word vector of each word in the data set and the positions in the sentence of the three elements of each triple instance; for each position combination <(pos(Name_i), pos(W_k), pos(Name_j))>, together with the semantic information of the word combination <(Name_i, W_k, Name_j)>, compute the distance D as follows:

D = \frac{dis(pos(Name_i), pos(Name_j))}{dim(NameVec_i, NameVec_j)} + \frac{dis(pos(Name_i), pos(W_k))}{dim(NameVec_i, W\_Vec_k)} + \frac{dis(pos(W_k), pos(Name_j))}{dim(NameVec_j, W\_Vec_k)}

where pos(Name_i) denotes the character position of Name_i in the sentence; dis(pos(Name_i), pos(Name_j)) denotes the number of words separating the two person entities; dis(pos(Name_i), pos(W_k)) denotes the number of words separating relation word k and entity i; dim(NameVec_i, NameVec_j) denotes the similarity between the two word vectors, which is larger the more semantically similar the two words are and smaller otherwise; a punctuation mark is counted as 5 characters; and the combination that minimizes D is selected to represent the position information of the triple instance;
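The distance D and the choice of the minimizing position combination can be illustrated with the sketch below. Several points are assumptions rather than statements of the source: cosine similarity stands in for dim, positions are token indices, similarities are assumed positive so the divisions are well defined, and the 5-character weighting of punctuation is omitted for brevity. The names `cosine`, `distance` and `best_positions` are hypothetical.

```python
import math
from itertools import product
from typing import List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity, used here as an assumed stand-in for dim(., .)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def distance(pos_i: int, pos_k: int, pos_j: int,
             vec_i: Sequence[float], vec_k: Sequence[float], vec_j: Sequence[float]) -> float:
    """D: three positional gaps, each divided by the matching vector similarity."""
    return (abs(pos_i - pos_j) / cosine(vec_i, vec_j)
            + abs(pos_i - pos_k) / cosine(vec_i, vec_k)
            + abs(pos_k - pos_j) / cosine(vec_j, vec_k))


def best_positions(occ_i: List[int], occ_k: List[int], occ_j: List[int],
                   vec_i: Sequence[float], vec_k: Sequence[float],
                   vec_j: Sequence[float]) -> Tuple[int, int, int]:
    """Among all occurrence combinations, return the one with minimal D."""
    return min(product(occ_i, occ_k, occ_j),
               key=lambda p: distance(p[0], p[1], p[2], vec_i, vec_k, vec_j))


# Hypothetical case: Name_i occurs at tokens 0 and 10, W_k at 2, Name_j at 4.
unit = [1.0, 0.0]  # identical toy vectors, so every similarity equals 1
chosen = best_positions([0, 10], [2], [4], unit, unit, unit)
d_chosen = distance(*chosen, unit, unit, unit)
```

With all similarities equal to 1, D reduces to the sum of the three positional gaps, so the occurrence of Name_i at position 0 is preferred over the one at position 10.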

Step 2.5, if the position information satisfies dis > dis_min, D > d_min (where d_min denotes the acceptable minimum distance threshold), exclude the triple instance; this yields the final triple samples;

Step 2.6, construct the feature space from a combination of 22 lexical-level and sentence-meaning features, and represent each triple instance as a vector in this space;

Step 3, make a yes/no binary decision on each triple to obtain the final person relation determination.

2. The person relation extraction method fusing distributed semantics and sentence-meaning features according to claim 1, characterized in that the similarity between word vectors in step 2.4 is computed by the following formula:

where NameVec_i and W_Vec_k denote two vectors, namely the word vector of a person name and the word vector of a feature word, respectively.

3. The person relation extraction method fusing distributed semantics and sentence-meaning features according to claim 1, characterized in that the yes/no binary decision on triples in step 3 proceeds as follows:

In the decision tree trained by C4.5, a corresponding confidence coefficient P⁺ is obtained for each result judged "true"; this is the confidence of a candidate relation combination judged "yes", and it is used to screen conflicting relation combination results. If a triple instance is judged "true", its confidence P⁺ is taken as its weight; all triples containing the same two person entities are compared, and the relation attribute with the maximum weight is assigned to them as the final person relation determination for the two person entities.
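The confidence maximization rule can be sketched independently of the classifier. The sketch below assumes that the C4.5 decision tree has already produced, for each triple, a true/false judgment and a confidence P⁺; the tuple layout, the function name `resolve_relations`, and the sample values are hypothetical.

```python
from typing import Dict, FrozenSet, List, Tuple

# (name_a, relation, name_b, judged_true, confidence P+)
Judgment = Tuple[str, str, str, bool, float]


def resolve_relations(judged: List[Judgment]) -> Dict[FrozenSet[str], str]:
    """For each pair of persons, keep the 'true' relation with the highest confidence."""
    best: Dict[FrozenSet[str], Tuple[str, float]] = {}
    for name_a, relation, name_b, is_true, confidence in judged:
        if not is_true:
            continue  # triples judged false never contribute a relation
        pair = frozenset((name_a, name_b))
        if pair not in best or confidence > best[pair][1]:
            best[pair] = (relation, confidence)
    return {pair: relation for pair, (relation, _) in best.items()}


# Hypothetical classifier output for one sentence.
judgments = [
    ("Zhang", "spouse", "Li", True, 0.91),
    ("Zhang", "friend", "Li", True, 0.62),
    ("Zhang", "colleague", "Wang", False, 0.80),
    ("Li", "friend", "Wang", True, 0.55),
]
relations = resolve_relations(judgments)
```

The two conflicting "true" judgments for Zhang and Li are resolved in favor of "spouse", and the pair Zhang and Wang receives no relation because its only triple was judged false.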

4. The person relation extraction method fusing distributed semantics and sentence-meaning features according to claim 1, characterized in that step 2.6 constructs the feature space from the sentence structure information and semantic information obtained with the sentence-meaning structure model, specifically as follows:

From the sentence structure information and semantic information obtained by sentence-meaning structure model analysis, features capable of expressing the semantics of the sentence are extracted. Sentence-meaning features are constructed from the combination relations between sentence-meaning components: on the basis of the automatically built sentence-meaning structure model, the items corresponding to the semantic cases (Table 1) are looked up in turn and used as feature words, and different combinations are constructed according to the dependency relations among the semantic cases to form feature phrases with more precise semantic expressiveness.

Table 1. Description of semantic case types