CN106484675A - Person relation extraction method fusing distributed semantics and sentence-meaning features - Google Patents


Publication number
CN106484675A
Authority
CN
China
Prior art keywords
sentence
word
name
relation
triple
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610866186.8A
Other languages
Chinese (zh)
Inventor
罗森林
焦龙龙
潘丽敏
郭佳
吴舟婷
陈倩柔
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Institute of Technology BIT
Original Assignee
Beijing Institute of Technology BIT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Institute of Technology BIT
Priority to CN201610866186.8A
Publication of CN106484675A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/30: Semantic analysis
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/205: Parsing
    • G06F40/211: Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars

Abstract

The present invention relates to a person relation extraction method that fuses distributed semantics and sentence-meaning features, and belongs to the field of natural language processing. The method first uses statistical word-frequency features and the Bootstrapping algorithm to train a relation feature lexicon on a small labeled corpus and a large unlabeled corpus, respectively. It then constructs the triple instances of each sentence according to an element-distance optimization rule, and fuses distributed semantic information with sentence-level semantic information to construct the triple feature space. Finally, it makes a yes/no binary decision on each triple and obtains the person relation category by the confidence maximization principle. The invention generates the relation feature lexicon automatically, converts the traditional multi-class relation classification problem into a yes/no binary decision problem on triples, which suits conventional machine learning classifiers better, and uses distributed semantic information to improve the accuracy of relation classification.

Description

Person relation extraction method fusing distributed semantics and sentence-meaning features
Technical Field
The present invention relates to a method for automatically extracting person relations from a Chinese text or a collection of Chinese texts, and belongs to the fields of computer science and information extraction technology.
Background Art
Person relation extraction means extracting the person entities scattered in text, together with the relations between them, accurately, quickly, and automatically. It is a research topic in the field of information extraction.
Information extraction (IE) has two main tasks: entity detection and recognition (EDR) and relation detection and recognition (RDR). Relation recognition (also called "relation extraction") extracts the relations that hold between entities in text, where the relation types are predefined. A person relation is one kind of entity relation: the association between two persons described in a text or text collection. Person relation extraction mainly addresses two problems: 1. obtaining the relation attribute between two persons (relation attribute extraction); 2. computing the degree of association between two persons (relation strength computation). In addition, how to organize and display the person relations scattered across texts and text collections also needs to be considered.
Person relation extraction methods fall into two main classes: methods based on pattern recognition and methods based on machine learning.
1. Methods based on pattern recognition:
1) Early pattern-recognition methods: These methods use lexical, syntactic, and similar features to build the knowledge base (also called a rule base) needed for recognition, and then perform pattern matching against that knowledge base to extract relations. The hardest step is establishing the person relation patterns (the person relation rule base). Building them requires linguists and sociologists to analyze the corpus of the target domain carefully and in depth, enumerate the possible person relations, and write the patterns. The development cycle is too long and the application cost is very high.
2) Improvements on the early methods: To address the purely hand-written person relation patterns of the early methods, later researchers proposed several solutions.
A) The FASTUS extraction system proposed by Appelt et al. introduces the concept of a "macro" so that domain-dependent rules can be expressed in an extensible, general way. A macro groups a number of commands together so that they complete a specific task as a single command. Users only need to change the parameter settings of the relevant macros to quickly configure the relation pattern rules for a specific domain task.
B) The Proteus extraction system proposed by Roman et al. builds person relation extraction patterns by sample-based generalization: the written patterns are generalized so that they apply to person relation extraction in a wider range of domains.
C) The REES system (Large-Scale Relation and Event Extraction System) built by Aone et al. performs relation extraction with a knowledge base containing more than 100 person relation patterns.
D) For Chinese, some researchers in China have also used pattern recognition to extract person relations. To reduce the workload of pattern writers, Jiang Jifa et al. proposed BRPAM, a bootstrapping method for acquiring binary relations and binary relation patterns, which expands the knowledge base (the person relation rule base) by bootstrapping from existing binary relations. Based on this method they designed BRPAM2Texts, an IE system that extracts binary relations from free text. Deng Bo et al. introduced lexical semantic matching into relation pattern matching and proposed a new relation extraction method. Because it uses the semantic features of words, the extracted person relations are more consistent with objective logic and the accuracy improves, and person relations in different domains can be extracted by using a lexicon for the relevant domain.
The pattern-recognition methods above still suffer from high development cost and low applicability.
2. Methods based on machine learning:
These methods use a machine learning algorithm to build a classifier from a manually labeled corpus and then apply it to decide the category of person relations in the domain corpus. The most widely used algorithms at present are MBL and SVM. For example:
A) The Chinese named entity and relation extraction system built by Zhang et al. uses the MBL algorithm to build classification rules from the training data and extracts entities and relations with those rules.
B) Zhang, Che Wanxiang, and others use the SVM algorithm to learn relation extraction rules. He Tingting et al. proposed a method that takes a small number of manually selected entity relations as seeds (initial relations) and extracts entity relations by continually expanding the seed set through self-learning.
C) Liu Lu et al. proposed an entity relation extraction method based on SVM training with positive and negative examples.
Machine learning methods are relatively mature, but problems remain. For example, when the corpus is not rich enough, the feature words do not cover enough cases, which hurts classification. Feature selection is crucial for machine learning algorithms, yet existing feature selection does not make full use of sentence-meaning features or distributed information, so the feature analysis is not deep enough and classification performance is poor.
Summary of the Invention
To address the poor classification performance caused by difficult feature selection and shallow feature analysis in machine learning algorithms, the present invention proposes a person relation extraction method that fuses distributed semantics and sentence-meaning features, improving the automatic extraction of person relations from a Chinese text or a collection of Chinese texts.
The technical solution is as follows:
First, statistical word-frequency features and the Bootstrapping algorithm are used to train a relation feature lexicon on a small labeled corpus and a large unlabeled corpus, respectively. Next, the triple instances of each sentence are constructed according to an element-distance optimization rule, and lexical-layer and sentence-meaning features are fused to construct the triple feature space. Finally, a yes/no binary decision is made on each triple, and the person relation category is obtained by the confidence maximization principle. The invention generates the relation feature lexicon automatically, converts the traditional multi-class relation classification problem into a yes/no binary decision problem on triples, which suits conventional machine learning classifiers better, and uses sentence-meaning features to improve the accuracy of relation classification. The overall procedure is shown in Fig. 1.
Step 1: automatic generation of the relation feature lexicon.
Person relation extraction is treated as a classification task. The invention defines eight major classes of person relations (master-apprentice, family, superior-subordinate, competitive, friend, romantic, nominal kinship, and nurse relations) plus an "other" class. Relation feature words describe the bidirectional relation between persons and are crucial for determining the relation attribute between them. The procedure of the proposed algorithm for automatically generating the relation feature lexicon is described below.
Step 1.1: After text preprocessing, train on the labeled corpus to obtain the initial seed word set, as follows:
Step 1.1.1: Preprocess the corpus with the word segmentation tool ICTCLAS2013 of the Institute of Computing Technology, Chinese Academy of Sciences, the BFS laboratory's automatic Chinese sentential semantic structure model building system ACSM (Automatic Chinese Sentential Semantic Model), and the tool scikit-learn, obtaining the word segmentation, part-of-speech tags, named entity recognition results, the TF-IDF value of each word, and the sentence-meaning structure analysis results. Stop words are then removed, and the labeled corpus is used for training to obtain the initial seed word set.
The sentential semantic structure model (CSM) is a structured, formal representation of the components of sentence meaning and the combination relations between them. It expresses abstract sentence meaning as structured data that a computer can process, with the aim of helping computers understand Chinese sentences from a deep semantic perspective. The model formalizes abstract sentence meaning as a mathematical structure over its components, so that a computer can recognize and process the meaning of Chinese sentences.
The elements of the sentential semantic structure model include: sentence-meaning type, topic, comment, semantic cases, predicate, Chinese tense system, spatio-temporal scope information, component combination relations, and so on. Based on these elements, the model is divided into four layers: sentence-type layer, description layer, object layer, and detail layer. Its basic form is shown in Fig. 2.
From the sentence structure information and semantic information produced by the sentential semantic structure model, features that express the sentence semantics are extracted; these features can carry important information about the person entities. Sentence-meaning features are constructed from the combination relations between sentence-meaning components: on the basis of the automatically built model, the items corresponding to the semantic cases (Table 1) are looked up in turn and used as feature words, and feature phrases with more precise semantic expressiveness are formed by combining them in different ways according to the dependencies between semantic cases (see Fig. 2).
Table 1. Description of semantic case types
Step 1.1.2: Divide the labeled corpus by the relation category C_i (0<i<N, where N is the number of relation categories) it contains. If a sentence contains several relations, it is assigned to each of the corresponding categories.
Step 1.1.3: For each category C, extract nouns and verbs as candidate seed words, compute their TF-IDF values, and compute the key degree of each word with respect to all sentences in the training set according to formulas (1) and (2).
$$K(word) = \frac{\sum_{i=1}^{|C|} Apear(sen_i, word)}{n} \qquad (1)$$
$$Apear(sen, word) = \begin{cases} word_{tfidf}, & word \in sen \\ 0, & word \notin sen \end{cases} \qquad (2)$$
Here sen_i denotes sentence i, word denotes a candidate seed word, |C| denotes the total number of sentences in category C, K(word) denotes the degree of association between the candidate seed word and all sentences in the training set, n denotes the total number of words contained in all sentences of the category, word_tfidf denotes the TF-IDF value of the candidate word in the training set, and word ∈ sen means that the word occurs in the sentence.
TF-IDF is a statistical measure of how important a word is to one document in a document collection or corpus; word_tfidf is the weight of the candidate word in the training set. The main idea of TF-IDF is that if a word or phrase occurs with high frequency (TF) in one article and rarely in other articles, it discriminates well between categories. TF-IDF represents the importance of a word to a text better than raw word frequency, so the word weight in formula (2) is measured by TF-IDF.
Step 1.1.4: According to the coding information of the synonym thesaurus Tongyici Cilin (《同义词词林》), add up the K values of all synonyms of a candidate seed word to obtain the word's new key degree.
Step 1.1.5: Sort the candidate seed words by their final K, set a threshold, and take the words whose K exceeds the threshold as the initial seed word set of the category. The threshold generally depends on the number of sentences and is determined experimentally.
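Steps 1.1.3 to 1.1.5 can be sketched in Python as follows. This is only an illustrative sketch, not the patent's implementation: the function names, the plain-dictionary TF-IDF input, and the representation of thesaurus entries as sets of synonyms are all assumptions made for the example.

```python
from collections import defaultdict


def key_degree(sentences, tfidf):
    """K(word) = sum_i Apear(sen_i, word) / n for one relation category C.

    sentences: list of token lists, one per sentence of category C
    tfidf:     dict mapping a word to its TF-IDF value (assumed precomputed)
    """
    n = sum(len(sen) for sen in sentences)  # total words in the category's sentences
    if n == 0:
        return {}
    totals = defaultdict(float)
    for sen in sentences:
        for word in set(sen):  # Apear contributes once per sentence containing the word
            totals[word] += tfidf.get(word, 0.0)
    return {word: total / n for word, total in totals.items()}


def merge_synonyms(k, synonym_groups):
    """Step 1.1.4: a word's new K is the sum of K over its synonym group.

    synonym_groups is a hypothetical stand-in for the thesaurus codes.
    """
    merged = dict(k)
    for group in synonym_groups:
        group_total = sum(k.get(word, 0.0) for word in group)
        for word in group:
            if word in k:
                merged[word] = group_total
    return merged


def select_seeds(k, threshold):
    """Step 1.1.5: keep words whose K exceeds the threshold, highest K first."""
    return sorted((w for w, v in k.items() if v > threshold), key=lambda w: -k[w])


sentences = [["coach", "train"], ["coach", "match"]]
tfidf = {"coach": 0.5, "train": 0.2, "match": 0.1}
k = key_degree(sentences, tfidf)
merged = merge_synonyms(k, [{"coach", "train"}])
seeds = select_seeds(merged, threshold=0.1)
```

In this toy input, "coach" occurs in both sentences, so its K is (0.5 + 0.5) / 4; treating "train" as its synonym adds that word's K to both entries before thresholding.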
Step 1.2: Expand the initial seed word set obtained in Step 1.1 with the Bootstrapping algorithm, in four basic steps.
Step 1.2.1: Extract nouns and verbs from the large unlabeled corpus as candidate words.
Step 1.2.2: For the seed word set of each relation category C and each candidate word w, compute the weight M of the candidate word using mutual information, as shown in formula (3),
where sword denotes a seed word, F(w) denotes the number of sentences in the whole corpus that contain w, F(sword) denotes the number of sentences in the whole corpus that contain the initial word sword, the co-occurrence frequency F(w, sword) denotes the number of sentences in which the candidate word and the initial word sword occur together, and F_all denotes the total number of sentences in the whole corpus.
Step 1.2.3: Select the words that satisfy F(w) > F_min(w) and M > M_min and merge them with the seed word set to form the new seed word set, where F_min(w) is the minimum number of sentences, set to 5, and M_min is a preset minimum weight.
Step 1.2.4: Repeat Steps 1.2.2 and 1.2.3 until no new word satisfies the conditions. The steps above automatically generate the relation feature lexicon for all categories.
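The expansion loop of Step 1.2 can be sketched as below. Formula (3) is not reproduced in this text, so the sketch assumes the standard pointwise mutual information, log2(F(w, sword) · F_all / (F(w) · F(sword))), and assumes that a candidate's weight M is its maximum PMI over the current seed words. Both choices, and all function names, are assumptions for illustration rather than the patent's definition.

```python
import math


def pmi(f_w, f_s, f_ws, f_all):
    """Pointwise mutual information from sentence counts (assumed form of M)."""
    if f_ws == 0:
        return float("-inf")
    return math.log2(f_ws * f_all / (f_w * f_s))


def expand_seeds(sentences, seeds, candidates, f_min=5, m_min=0.5):
    """Grow the seed set until no candidate satisfies F(w) > f_min and M > m_min."""
    sent_sets = [set(sen) for sen in sentences]
    f_all = len(sent_sets)

    def count(*words):
        # Number of sentences that contain every given word.
        return sum(1 for s in sent_sets if all(w in s for w in words))

    seeds = set(seeds)
    while True:
        accepted = set()
        for w in set(candidates) - seeds:
            f_w = count(w)
            if f_w <= f_min:
                continue
            # Assumption: aggregate over seed words by taking the maximum PMI.
            m = max(
                (pmi(f_w, count(s), count(w, s), f_all) for s in seeds),
                default=float("-inf"),
            )
            if m > m_min:
                accepted.add(w)
        if not accepted:
            return seeds
        seeds |= accepted


corpus = [["coach", "mentor"]] * 6 + [["weather"]] * 6
lexicon = expand_seeds(corpus, {"coach"}, {"mentor", "weather"})
```

The loop stops because the seed set only grows and the candidate set is finite; m_min here is an arbitrary value chosen for the toy corpus.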
Step 2: construction of the triple feature space. In a sentence, the relation attribute reflected by a relation feature word (relation word) may hold between two persons who appear in that sentence. For example, in "As Liu Xiang's coach, Sun Haiping was quite satisfied with the result of 13.08 seconds", the word "coach" reflects the master-apprentice relation between "Liu Xiang" and "Sun Haiping". Define <person, relation, person> as a relation triple instance. Deciding which relation attribute holds then becomes a yes/no binary decision on each triple, which converts the multi-class problem into a two-class problem.
Step 2.1: Extract the person-name entities in each sentence to obtain the sentence's name list <Name_1, Name_2, …, Name_n>, and pair all names in the list two by two to form the pair relations <(Name_1, Name_2), (Name_2, Name_3), …, (Name_{n-1}, Name_n)>.
Step 2.2: Use the relation feature lexicon generated in Step 1 to obtain the relation feature words <W_1, W_2, …, W_m> in the sentence, add each of them to each pair, and exhaustively form the triple instances <(Name_i, W_k, Name_j)> for all i, j, k with 0 < i <= n, 0 < j <= n, 0 < k <= m.
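Steps 2.1 and 2.2 amount to a small enumeration, sketched below. The sketch assumes that "pairing two by two" means every unordered pair of distinct names, and that the lexicon is a mapping from relation category to a set of feature words; the function name and data layout are hypothetical.

```python
from itertools import combinations


def build_triples(names, sentence_tokens, lexicon):
    """Enumerate candidate <person, relation word, person> triples for one sentence.

    names:           person names recognized in the sentence
    sentence_tokens: segmented words of the sentence
    lexicon:         dict mapping relation category -> set of feature words
    """
    unique_names = list(dict.fromkeys(names))  # drop repeats, keep order
    all_feature_words = set().union(*lexicon.values()) if lexicon else set()
    feature_words = [w for w in dict.fromkeys(sentence_tokens) if w in all_feature_words]
    return [
        (a, w, b)
        for a, b in combinations(unique_names, 2)  # assumption: all unordered pairs
        for w in feature_words
    ]


triples = build_triples(
    ["Liu Xiang", "Sun Haiping"],
    ["As", "Liu Xiang", "coach", "Sun Haiping", "satisfied"],
    {"master-apprentice": {"coach", "mentor"}},
)
```

A sentence with n names and m feature words yields n(n-1)/2 · m candidates under this reading, which is why the later distance filter and binary decision are needed to prune them.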
Step 2.3: Use the word2vec method from deep learning to compute the word vector of each element of the feature word list <W_1, W_2, …, W_m> and the name list <Name_1, Name_2, …, Name_n>, obtaining the word vector W_Vec_k of each feature word and the word vector NameVec_i of each name.
Step 2.4: Use string matching to obtain the positions in the sentence of the three elements of each triple instance. For each combination <(pos(Name_i), pos(W_k), pos(Name_j))>, together with the semantic information among <(Name_i, W_k, Name_j)>, compute the distance d of the triple instance with formula (4),
where pos(Name_i) denotes the character position of Name_i in the sentence, dis(pos(Name_i), pos(Name_j)) denotes the number of characters separating the two person entities, dis(pos(Name_i), pos(W_k)) denotes the number of characters separating relation word k and entity i, and dim(NameVec_i, NameVec_j) denotes the similarity between two word vectors: the closer the meanings of the two words, the larger dim(NameVec_i, NameVec_j), and vice versa. The formula combines distributed semantic information with sentence-meaning features so that the distance d better represents the positional information of the triple instance. Each punctuation mark counts as 5 characters. The combination that minimizes d is selected to represent the positional information of the triple instance.
Step 2.5: If the positional information satisfies dis > dis_min, d > d_min (where d_min denotes the acceptable minimum distance threshold), the triple instance is excluded. This yields the final triple samples, and the corresponding feature vectors are constructed from the positional information.
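The two ingredients of the distance in Step 2.4 can be sketched separately. Formula (4) itself is not reproduced in this text, so the sketch does not combine them into d. It assumes cosine similarity as the vector similarity (the text only says "similarity"), and the punctuation set and function names are hypothetical.

```python
import math

# Assumed punctuation set; each mark counts as 5 characters, as stated in Step 2.4.
PUNCTUATION = set("，。、；：！？,.;:!?")


def find_all(sentence, term):
    """Start positions of every occurrence of term (plain string matching)."""
    positions, start = [], sentence.find(term)
    while start != -1:
        positions.append(start)
        start = sentence.find(term, start + 1)
    return positions


def gap(sentence, pos_a, len_a, pos_b, len_b):
    """Weighted number of characters between two spans of the sentence."""
    if pos_a > pos_b:
        pos_a, len_a, pos_b, len_b = pos_b, len_b, pos_a, len_a
    between = sentence[pos_a + len_a:pos_b]
    return sum(5 if ch in PUNCTUATION else 1 for ch in between)


def cosine(u, v):
    """Cosine similarity of two vectors (assumed similarity measure)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


sentence = "作为刘翔的教练，孙海平很满意"
name_gap = gap(sentence, find_all(sentence, "刘翔")[0], 2, find_all(sentence, "孙海平")[0], 3)
word_gap = gap(sentence, find_all(sentence, "教练")[0], 2, find_all(sentence, "刘翔")[0], 2)
```

When a name or feature word occurs several times, `find_all` returns every position, so a caller can evaluate each combination and keep the one with the smallest distance, as the step describes.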
Step 3: yes/no binary decision on triples.
In the decision tree trained with C4.5, every result judged "true" has a corresponding confidence coefficient P+. This is the confidence of the candidate relation combination judged "yes" and can be used to screen conflicting relation combination results. If a triple instance is judged "true", its confidence P+ is taken as its weight. All triples containing a given pair of person entities are compared, and the relation attribute with the largest weight is taken as the final person relation decision for that pair.
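The confidence maximization rule of Step 3 can be sketched as a simple selection over classifier outputs. The input tuple layout and the function name are assumptions; the C4.5 decision tree that produces the "true"/"false" judgment and the confidence P+ is not modeled here.

```python
def resolve_relations(judged):
    """Pick, for each pair of persons, the relation with the highest confidence.

    judged: iterable of (name_a, relation, name_b, is_true, confidence) tuples,
            a hypothetical representation of the binary classifier's output.
    """
    best = {}
    for name_a, relation, name_b, is_true, confidence in judged:
        if not is_true:
            continue  # only triples judged "true" carry a weight P+
        pair = frozenset((name_a, name_b))  # the pair is unordered
        if pair not in best or confidence > best[pair][1]:
            best[pair] = (relation, confidence)
    return best


decisions = resolve_relations([
    ("Liu Xiang", "master-apprentice", "Sun Haiping", True, 0.91),
    ("Sun Haiping", "friend", "Liu Xiang", True, 0.62),
    ("Liu Xiang", "family", "Sun Haiping", False, 0.80),
])
```

Using an unordered pair as the key means triples that list the two persons in either order compete for the same final decision.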
Beneficial Effects
Compared with methods based on machine learning, the present invention offers fast recognition and high accuracy.
Compared with methods based on pattern recognition, the present invention is more widely applicable and more extensible.
Compared with methods based on pattern recognition, the technique used by the present invention consumes less computation and is suitable not only for desktop computers but also for mobile computing platforms such as mobile phones and tablet computers.
Compared with person relation extraction methods based on semantic patterns, the sentence-meaning features of the present invention provide deeper analysis and therefore higher recognition accuracy.
Brief Description of the Drawings
Fig. 1 is a schematic diagram of the person relation extraction algorithm of the present invention;
Fig. 2 is a structural diagram of the basic form of the sentential semantic structure model;
Fig. 3 is an example (partial) of a decision tree trained with C4.5 for the yes/no binary decision on person relation combinations;
Fig. 4 is a comparison of the results of the parameter selection experiment for the automatic relation feature lexicon generation algorithm.
Detailed Description of the Embodiments
To better illustrate the objects and advantages of the present invention, embodiments of the method are described in further detail below with reference to the accompanying drawings and examples.
The data source is the BFS hot-person retrieval corpus, covering "Yao Ming", "Liu Xiang", "Zhou Jielun", "James", "Cheng Long" (Jackie Chan), "Bryant", and "Xie Tingfeng". The labeled corpus totals 1540 texts, with 2389 sentences containing at least two person names; there are 10000 unlabeled sentences. The data source is described in Table 1; the number of person entities was obtained by manual counting.
Table 1. Data source for the person relation extraction experiments
To verify the person relation extraction method, three experiments were carried out:
(1) Parameter selection experiment: select the best combination of the thresholds K and M used in initial seed word extraction and in the Bootstrapping algorithm, where K and M are the thresholds on the key degree of initial seed words and on the weight of candidate seed words, respectively.
(2) Relation feature lexicon comparison experiment: compare the automatically extracted lexicon with a manually compiled lexicon to verify that the automatically extracted lexicon is more extensive and matches the relation categories well.
(3) Person relation extraction performance experiment: test the accuracy and comprehensiveness of the proposed person relation extraction algorithm and compare it with other relation extraction algorithms.
The test procedures are described one by one below. All tests were run on the same computer, configured as follows: Intel dual-core CPU (3.0 GHz), 4.00 GB of memory, Windows 7 operating system.
For the parameter selection and person relation extraction experiments, precision, recall, and F value are used for evaluation. They are computed as in formulas (5) to (7), with the parameters defined as follows:
A) the number of correct person relation attributes extracted;
B) the number of incorrect person relation attributes extracted;
C) the number of person relation attributes not extracted.
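Formulas (5) to (7) are not reproduced in this text. A minimal sketch, assuming they are the standard definitions of precision, recall, and the balanced F value in terms of the counts A, B, and C defined above:

```python
def precision_recall_f(a, b, c):
    """Standard precision, recall, and F value (assumed form of formulas (5)-(7)).

    a: correct relation attributes extracted
    b: incorrect relation attributes extracted
    c: relation attributes that were not extracted
    """
    precision = a / (a + b) if (a + b) else 0.0
    recall = a / (a + c) if (a + c) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_value = 2 * precision * recall / (precision + recall)
    return precision, recall, f_value


p, r, f = precision_recall_f(8, 2, 2)
```

With 8 correct, 2 incorrect, and 2 missed attributes, precision and recall are both 8/10, so the F value is also 0.8.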
For the relation feature lexicon comparison experiment, an expert scoring strategy is used: two researchers with long experience in natural language processing score each word with an integer from -3 to +3 according to how well the word matches its relation category, where -3 means a very poor match and +3 a very good match. The total score and the average score are then computed.
In the experiments, word segmentation uses ICTCLAS (Institute of Computing Technology, Chinese Lexical Analysis System), provided by the Institute of Computing Technology of the Chinese Academy of Sciences, as the lexical analysis tool. The person-name recognition accuracy of ICTCLAS exceeds 98% (973 evaluation), so this function is used directly to identify person entities.
To train the yes/no binary classification model for triple instances and make decisions, a feature space is constructed from 22 lexical-layer and sentence-meaning features. Each triple instance is represented as a vector in this space, and the classification model is trained and tested on these vectors. The features are listed in Table 2.
Table 2. Features used to construct the triple instance feature space
1. Parameter selection experiment
A grid search is used. The parameter K is the minimum key-degree threshold for initial seed words, and the parameter M is the minimum score threshold used when extracting words in the Bootstrapping algorithm. K and M are first normalized and each varied from 0.1 to 0.9 in steps of 0.05 while the F value is observed roughly; the results are better when K lies in the interval 0.4~0.6 and M in the interval 0.5~0.7. A finer grid search is then carried out within these intervals to obtain the best parameter combination.
In the experiment, K varies over 0.4~0.6 in steps of 0.02, and M varies over 0.5~0.64 in steps of 0.02, with M1 = 0.5. The abscissa is the K value and the ordinate is the F value. The results are shown in Fig. 4: the relation feature lexicon is most suitable when K is 0.52 and M is 0.54.
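The fine grid search over K and M can be sketched as below. The `evaluate` callable stands in for the full pipeline (lexicon generation plus F-value measurement) and is hypothetical; the toy scoring function is invented purely so the example runs and says nothing about the real F-value surface.

```python
def grid(start, stop, step):
    """Inclusive grid of values rounded to two decimals to avoid float drift."""
    count = int(round((stop - start) / step))
    return [round(start + i * step, 2) for i in range(count + 1)]


def grid_search(evaluate, k_values, m_values):
    """Return the (K, M) pair with the highest score and that score."""
    best_pair, best_score = None, float("-inf")
    for k in k_values:
        for m in m_values:
            score = evaluate(k, m)
            if score > best_score:
                best_pair, best_score = (k, m), score
    return best_pair, best_score


def toy_f_value(k, m):
    # Invented stand-in for the measured F value, peaking at (0.52, 0.54).
    return 1.0 - abs(k - 0.52) - abs(m - 0.54)


k_values = grid(0.40, 0.60, 0.02)  # 0.40, 0.42, ..., 0.60
m_values = grid(0.50, 0.64, 0.02)  # 0.50, 0.52, ..., 0.64
best_pair, best_score = grid_search(toy_f_value, k_values, m_values)
```

The same helper covers the coarse pass by calling `grid(0.1, 0.9, 0.05)` for both parameters before narrowing the intervals.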
2. Relation feature lexicon comparison experiment
The relation feature lexicon is generated automatically with the best parameter combination from the parameter selection experiment and compared with the manually compiled lexicon used in the scholar Feng Yangbo's person relation extraction system. The total number of words in each lexicon is counted, and the expert scoring strategy is used to compare the quality of the two lexicons.
The results of the relation feature lexicon comparison experiment are shown in Table 3.
Table 3. Results of the relation feature lexicon comparison experiment
As Table 3 shows, compared with the manually compiled lexicon, the automatically generated lexicon has 77.6% more words in total, and the number of words in every positive score range increases considerably. The total score is 152 points higher, indicating higher overall quality. The average score drops slightly, because every word in the manually compiled lexicon was chosen subjectively and is therefore more illustrative of its category. In summary, the coverage and overall quality of the lexicon improve greatly without a significant drop in the matching degree of individual words.
3. Person relation extraction experiment
First, the C4.5 model for person relation decisions is trained on the 2389 labeled sentences. Automatic person relation extraction is then carried out. Finally, precision, recall, and F value are computed against the standard relation triples obtained by manual counting.
For the overall comparison, the model is first trained and tested with distributed semantic information combined with sentence-meaning features, giving the final performance of the proposed algorithm. Then an entity relation extraction algorithm based on semantic patterns, an entity relation extraction algorithm based on SVM, and an SVM named entity relation extraction algorithm trained on positive and negative examples are each applied to the same data source for person relation extraction. The first is a pattern-recognition algorithm; the other two are machine learning algorithms.
The results of the person relation extraction performance experiment are shown in Table 4.
Table 4. Results of the person relation extraction performance experiment
As Table 4 shows, the method that combines distributed semantic information with sentence-meaning features performs better. This is because distributed semantic information accurately expresses information such as word order and part of speech, and combining it with the strongly discriminative sentence-meaning features greatly improves the expressiveness of the feature space. This is reflected in the comparison experiment: the overall F value of the algorithm reaches 83.8%, better than the other relation extraction algorithms.
Comparison with existing strong algorithms shows that the proposed algorithm clearly outperforms the entity relation extraction method based on pattern recognition and also outperforms general machine-learning-based entity relation extraction algorithms when applied to person relation extraction. The reasons are as follows. First, the relation feature lexicon is generated automatically, and the improved Bootstrapping algorithm expands the coverage of relation feature words, which helps the recall of triple decisions. Second, the traditional multi-class relation classification problem is converted into a yes/no binary decision problem on triples, which suits conventional machine learning classifiers better. Third, the sentential semantic structure model allows deeper analysis of the samples: it identifies the different semantic components and structural features of a triple and uses them as strong features to constrain the triple instance information effectively, which clearly improves precision.

Claims (4)

1. A person-relation extraction method fusing distributed semantics and sentence-meaning features, characterized in that: using word-frequency statistics and the Bootstrapping algorithm, relation feature lexicons are obtained by training on a small amount of labeled corpus and a large amount of unlabeled corpus respectively; triple instances of a sentence are then constructed by combining distributed semantic information with an element-distance optimization rule; a triple feature space is constructed by fusing lexical-level and sentence-meaning features; finally a yes/no binary judgment is made on each triple and the person-relation category is obtained by the confidence-maximization principle; the method comprising the following steps:
Step 1: after preprocessing, train on the labeled corpus to obtain an initial seed word set, then expand the initial seed word set with the Bootstrapping algorithm to generate the relation feature lexicon, specifically:
Step 1.1: divide the training corpus by relation category, preprocess the text, and train to generate the initial seed word set, as follows:
Step 1.1.1: assign the labeled corpus to the corresponding relation categories C_i (0<i<N, where N denotes the number of relation categories); if a sentence contains multiple relations, assign it repeatedly to each of the corresponding categories;
Step 1.1.2: preprocess the corpus to obtain the word segmentation, part-of-speech tags, named-entity recognition results, the TF-IDF value of each word, and the sentence-meaning structure analysis results;
Step 1.1.3: for each category C, extract nouns and verbs as candidate seed words and compute the keyness K of these words, where K is computed as follows:
$$ K(word) = \frac{\sum_{i=1}^{|C|} Apear(sen_i, word)}{n} $$
$$ Apear(sen, word) = \begin{cases} word_{tfidf}, & word \in sen \\ 0, & word \notin sen \end{cases} $$
where sen_i denotes sentence i, word denotes a candidate seed word, |C| denotes the total number of sentences in category C, K(word) denotes the degree of association between the candidate seed word and all sentences in the training set, n denotes the total number of words contained in all sentences of the category, word_tfidf denotes the TF-IDF value of the candidate word in the training set, and word ∈ sen means that the word occurs in the sentence;
Step 1.1.4: according to the coding information of Tongyici Cilin (《同义词词林》, a Chinese synonym thesaurus), add up the K values of all synonyms of a candidate seed word to give the new keyness of that word; rank the candidate seed words by the final K, then set a threshold and extract the words whose K exceeds the threshold to form the initial seed word set of the category; the threshold generally depends on the number of sentences and is determined by experiment;
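Steps 1.1.3 and 1.1.4 can be sketched roughly as follows. This is only an illustrative sketch, not the patented implementation: the function names (`keyness`, `merge_synonyms`, `select_seeds`) are hypothetical, sentences are assumed to be already segmented and filtered to nouns and verbs, TF-IDF values are assumed to be precomputed, and synonym groups are assumed to be supplied as plain lists rather than read from the thesaurus codes.

```python
from collections import defaultdict


def keyness(sentences, tfidf):
    """Compute K(word) for one relation category C.

    sentences: list of token lists (assumed pre-filtered to nouns/verbs).
    tfidf: dict mapping word -> TF-IDF value in the training set.
    """
    # n is the total number of words in all sentences of the category.
    n = sum(len(sentence) for sentence in sentences)
    scores = defaultdict(float)
    for sentence in sentences:
        # Apear(sen, word) is the word's TF-IDF if it occurs in the sentence.
        for word in set(sentence):
            scores[word] += tfidf.get(word, 0.0)
    return {word: total / n for word, total in scores.items()}


def merge_synonyms(k, synonym_groups):
    """Give each word the summed K of its synonym group (assumed input format)."""
    merged = dict(k)
    for group in synonym_groups:
        present = [word for word in group if word in k]
        total = sum(k[word] for word in present)
        for word in present:
            merged[word] = total
    return merged


def select_seeds(k, threshold):
    """Return words whose K exceeds the threshold, highest K first."""
    kept = [word for word, value in k.items() if value > threshold]
    return sorted(kept, key=lambda word: -k[word])
```

The threshold is left as a parameter because the text states only that it depends on the number of sentences and is found by experiment.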
Step 1.2: using the initial seed word set extracted in step 1.1 and a large unlabeled corpus, expand the initial seed word set with the Bootstrapping algorithm to generate the relation feature lexicon, specifically:
Step 1.2.1: extract nouns and verbs from the large unlabeled corpus as candidate words;
Step 1.2.2: for the seed word set of each relation category C in turn, compute the value M by the mutual-information method, using the formula:
$$ M = \sum_{sword \subset C} \log \frac{P(w, sword)}{P(w) \cdot P(sword)} $$
$$ P(w, sword) = \frac{F(w, sword)}{F_{all}} $$
$$ P(w) = \frac{F(w)}{F_{all}} $$
$$ P(sword) = \frac{F(sword)}{F_{all}} $$
where sword denotes a seed word; F(w) denotes the number of sentences in the whole corpus that contain w; F(sword) denotes the number of sentences in the whole corpus that contain the initial word sword; the co-occurrence frequency F(w, sword) denotes the number of sentences in which the candidate word and the initial word sword occur together; F_all denotes the total number of sentences in the whole corpus;
Step 1.2.3: select the words satisfying F(w) > F_min(w) and M > M_min and merge them with the seed word set to form the new seed word set, where F_min(w) denotes the minimum sentence count, set to 5, and M_min is a preset minimum weight;
Step 1.2.4: repeat steps 1.2.2 and 1.2.3 until no new word satisfying the conditions is produced; the above steps automatically generate the relation feature lexicon of each category;
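The expansion loop of steps 1.2.2 to 1.2.4 might look like the sketch below. It is a hedged illustration under several assumptions that are not in the source: the function name `expand_seeds` and the `max_rounds` safety cap are hypothetical, sentences are token lists, and seed words that never co-occur with the candidate are skipped because the logarithm of zero is undefined (the source does not say how this case is handled).

```python
import math


def expand_seeds(sentences, seeds, candidates, f_min=5, m_min=0.0, max_rounds=20):
    """Expand a seed word set by mutual information against the seeds.

    sentences: list of token lists from the unlabeled corpus.
    seeds: initial seed words of one relation category.
    candidates: nouns and verbs extracted from the corpus.
    f_min: minimum sentence count (5 in the text); m_min: minimum weight.
    max_rounds: hypothetical safety cap on the number of iterations.
    """
    sentence_sets = [set(sentence) for sentence in sentences]
    f_all = len(sentence_sets)
    seeds = set(seeds)
    candidates = set(candidates)

    def freq(*words):
        # Number of sentences containing all of the given words.
        return sum(1 for s in sentence_sets if all(w in s for w in words))

    for _ in range(max_rounds):
        added = set()
        for w in candidates - seeds:
            f_w = freq(w)
            if f_w <= f_min:
                continue
            m = 0.0
            for sword in seeds:
                f_sword = freq(sword)
                f_joint = freq(w, sword)
                if f_joint == 0 or f_sword == 0:
                    continue  # assumption: skip pairs with no co-occurrence
                p_joint = f_joint / f_all
                m += math.log(p_joint / ((f_w / f_all) * (f_sword / f_all)))
            if m > m_min:
                added.add(w)
        if not added:
            break  # no new word satisfies the conditions
        seeds |= added
    return seeds
```

Within one round, M is computed against the seed set as it stood at the start of the round; newly accepted words only influence the next round.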
Step 2: construct the triple feature space; define <person-relation-person> as a relation triple instance; by making a yes/no binary judgment on whether a person-relation attribute holds, the multi-class classification problem is converted into a binary classification problem, specifically:
Step 2.1: extract the person-name entities in each sentence to obtain the sentence's name list <Name_1, Name_2, …, Name_n>, and pair all names in the list two by two to form the pairs <(Name_1, Name_2), (Name_2, Name_3), …, (Name_{n-1}, Name_n)>;
Step 2.2: using the relation feature lexicon generated in step 1, obtain the relation feature words <W_1, W_2, …, W_m> in the sentence, add them to the pairs in turn, and exhaustively form the triple instances <(Name_i, W_k, Name_j)>, where each i, j, k satisfies 0<i<=n, 0<j<=n, 0<k<=m;
Step 2.3: compute the word vector of each element of the feature word list <W_1, W_2, …, W_m> and of the name list <Name_1, Name_2, …, Name_n>, obtaining the word vector W_Vec of each feature word and the word vector NameVec_i of each name;
Step 2.4: using the word2vec method from deep learning and string matching, obtain respectively the word vector of each word in the data set and the positions in the sentence of the three elements of each triple instance; for each combination of position information <(pos(Name_i), pos(W_k), pos(Name_j))>, together with the semantic information of the word combination <(Name_i, W_k, Name_j)>, compute the distance D by the following formula:
$$ D = \frac{dis(pos(Name_i), pos(Name_j))}{dim(NameVec_i, NameVec_j)} + \frac{dis(pos(Name_i), pos(W_k))}{dim(NameVec_i, W\_Vec_k)} + \frac{dis(pos(W_k), pos(Name_j))}{dim(NameVec_j, W\_Vec_k)} $$
where pos(Name_i) denotes the character position of Name_i in the sentence; dis(pos(Name_i), pos(Name_j)) denotes the number of words separating the two person entities; dis(pos(Name_i), pos(W_k)) denotes the number of words separating relation word k and entity i; dim(NameVec_i, NameVec_j) denotes the similarity between the two word vectors, which is larger the more semantically similar the two words are and smaller otherwise; a punctuation mark counts as 5 characters; the combination that minimizes D is selected to represent the position information of the triple instance;
Step 2.5: if, in the position information, dis > dis_min, D > d_min (d_min denotes the acceptable minimum distance threshold), exclude the triple instance; this yields the final triple samples;
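A rough sketch of steps 2.1 to 2.5 is given below. Several points are assumptions made only for illustration: the similarity dim is taken to be cosine similarity (the formula itself is not reproduced in this text), dis is approximated by the absolute difference of token indices (the 5-character punctuation rule is ignored), non-positive similarities are clamped to a small `eps` to avoid division by zero, and the two thresholds of step 2.5 are collapsed into a single hypothetical cutoff `d_max`. All function names are hypothetical.

```python
import itertools
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (assumed form of dim)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def positions(tokens, word):
    """All token indices at which the word occurs (simple string matching)."""
    return [i for i, token in enumerate(tokens) if token == word]


def triple_distance(tokens, name_i, word_k, name_j, vec, eps=1e-6):
    """Return (D, (pos_i, pos_k, pos_j)) for the position combination minimizing D."""
    def sim(a, b):
        return max(cosine(vec[a], vec[b]), eps)

    best = None
    for pi, pk, pj in itertools.product(
        positions(tokens, name_i), positions(tokens, word_k), positions(tokens, name_j)
    ):
        d = (
            abs(pi - pj) / sim(name_i, name_j)
            + abs(pi - pk) / sim(name_i, word_k)
            + abs(pk - pj) / sim(name_j, word_k)
        )
        if best is None or d < best[0]:
            best = (d, (pi, pk, pj))
    return best


def build_triples(tokens, names, feature_words, vec, d_max):
    """Enumerate (Name_i, W_k, Name_j) triples and drop those with D above d_max."""
    kept = []
    for name_i, name_j in itertools.combinations(names, 2):
        for word_k in feature_words:
            best = triple_distance(tokens, name_i, word_k, name_j, vec)
            if best is not None and best[0] <= d_max:
                kept.append(((name_i, word_k, name_j), best[0]))
    return kept
```

In practice the vectors in `vec` would come from a trained word2vec model; here they are simply a dictionary from word to list of floats so that the sketch stays self-contained.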
Step 2.6: construct the feature space from a combination of 22 lexical-level and sentence-meaning features, and represent each triple instance as a vector in this space;
Step 3: make a yes/no binary judgment on each triple to obtain the final person-relation determination.
2. The person-relation extraction method fusing distributed semantics and sentence-meaning features according to claim 1, characterized in that the similarity between word vectors in step 2.4 is calculated by the formula:
where NameVec_i and W_Vec_k denote two vectors, namely the word vector of a person-name word and the word vector of a feature word, respectively.
3. The person-relation extraction method fusing distributed semantics and sentence-meaning features according to claim 1, characterized in that the yes/no binary judgment on triples in step 3 is performed as follows:
In the decision tree trained with C4.5, a confidence coefficient P+ is obtained for each result judged "true"; this is the confidence of a candidate relation combination judged "yes" and is used to screen relation-combination results that conflict. If a triple instance is judged "true", its confidence P+ is taken as its weight; all triples in which the two person entities occur are compared, and the relation attribute with the maximum weight is assigned to them as the final person-relation determination for the two person entities.
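The confidence-maximization step can be illustrated with the short sketch below. It assumes the classifier output is already available as tuples of (name, relation, name, judged-true flag, confidence P+); how C4.5 derives P+ is not shown, and the function name `resolve_relations` and the use of the ordered name pair as the key are hypothetical choices.

```python
def resolve_relations(judged):
    """Pick, for each name pair, the relation judged true with the highest P+.

    judged: iterable of (name_i, relation, name_j, is_true, confidence).
    Returns a dict mapping (name_i, name_j) -> relation.
    """
    best = {}
    for name_i, relation, name_j, is_true, confidence in judged:
        if not is_true:
            continue  # only triples judged "true" carry a weight
        pair = (name_i, name_j)
        if pair not in best or confidence > best[pair][1]:
            best[pair] = (relation, confidence)
    return {pair: relation for pair, (relation, _) in best.items()}
```

A pair for which no triple is judged true simply does not appear in the result.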
4. The person-relation extraction method fusing distributed semantics and sentence-meaning features according to claim 1, characterized in that step 2.6 uses the sentence structure information and semantic information obtained from the sentence-meaning structure model to construct the feature space, specifically:
Features capable of expressing sentence semantics are extracted from the sentence structure information and semantic information obtained by sentence-meaning structure model analysis. The sentence-meaning features are constructed from the combinational relationships between sentence-meaning components: on the basis of the automatically built sentence-meaning structure model, the entries corresponding to the semantic cases (Table 1) are queried in turn as feature words, and different combinations are constructed according to the dependency relationships of the semantic cases to form feature phrases with more precise semantic expressiveness.
Table 1. Description of semantic case types
CN201610866186.8A 2016-09-29 2016-09-29 Fusion distributed semantic and the character relation abstracting method of sentence justice feature Pending CN106484675A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610866186.8A CN106484675A (en) 2016-09-29 2016-09-29 Fusion distributed semantic and the character relation abstracting method of sentence justice feature

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610866186.8A CN106484675A (en) 2016-09-29 2016-09-29 Fusion distributed semantic and the character relation abstracting method of sentence justice feature

Publications (1)

Publication Number Publication Date
CN106484675A true CN106484675A (en) 2017-03-08

Family

ID=58268976

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610866186.8A Pending CN106484675A (en) 2016-09-29 2016-09-29 Fusion distributed semantic and the character relation abstracting method of sentence justice feature

Country Status (1)

Country Link
CN (1) CN106484675A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104657750A (en) * 2015-03-23 2015-05-27 苏州大学张家港工业技术研究院 Method and device for extracting character relation
CN104933027A (en) * 2015-06-12 2015-09-23 华东师范大学 Open Chinese entity relation extraction method using dependency analysis
CN105608070A (en) * 2015-12-21 2016-05-25 中国科学院信息工程研究所 Character relationship extraction method oriented to headline

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
张晗: "融合句义特征的人名消歧及人物关系抽取技术研究", 《中国优秀硕士学位论文全文数据库 信息科技辑》 *

Cited By (48)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106951558B (en) * 2017-03-31 2020-06-12 广东睿盟计算机科技有限公司 Data processing method of tax intelligent consultation platform based on deep search
CN106951558A (en) * 2017-03-31 2017-07-14 广东睿盟计算机科技有限公司 A kind of data processing method of the tax intelligent consulting platform based on deep search
CN107341171A (en) * 2017-05-03 2017-11-10 刘洪利 Extract the method and system of data (gene) feature templates method and application template
CN107341171B (en) * 2017-05-03 2021-07-27 刘洪利 Method for extracting data feature template and method and system for applying template
CN108664615A (en) * 2017-05-12 2018-10-16 华中师范大学 A kind of knowledge mapping construction method of discipline-oriented educational resource
CN107506346A (en) * 2017-07-10 2017-12-22 北京享阅教育科技有限公司 A kind of Chinese reading grade of difficulty method and system based on machine learning
CN110019829B (en) * 2017-09-19 2021-05-07 绿湾网络科技有限公司 Data attribute determination method and device
CN110019829A (en) * 2017-09-19 2019-07-16 小草数语(北京)科技有限公司 Data attribute determines method, apparatus
CN107908749A (en) * 2017-11-17 2018-04-13 哈尔滨工业大学(威海) A kind of personage's searching system and method based on search engine
CN107908749B (en) * 2017-11-17 2020-04-10 哈尔滨工业大学(威海) Character retrieval system and method based on search engine
US10664660B2 (en) 2017-11-23 2020-05-26 Beijing Baidu Netcom Science And Technology Co., Ltd. Method and device for extracting entity relation based on deep learning, and server
CN107798136A (en) * 2017-11-23 2018-03-13 北京百度网讯科技有限公司 Entity relation extraction method, apparatus and server based on deep learning
CN107798136B (en) * 2017-11-23 2020-12-01 北京百度网讯科技有限公司 Entity relation extraction method and device based on deep learning and server
CN107957991A (en) * 2017-12-05 2018-04-24 湖南星汉数智科技有限公司 A kind of entity attribute information extraction method and device relied on based on syntax
CN108073708A (en) * 2017-12-20 2018-05-25 北京百度网讯科技有限公司 Information output method and device
CN108363816A (en) * 2018-03-21 2018-08-03 北京理工大学 Open entity relation extraction method based on sentence justice structural model
WO2019227614A1 (en) * 2018-06-01 2019-12-05 平安科技(深圳)有限公司 Method and device for obtaining triple of samples, computer device and storage medium
CN108959418A (en) * 2018-06-06 2018-12-07 中国人民解放军国防科技大学 Character relation extraction method and device, computer device and computer readable storage medium
CN108829854B (en) * 2018-06-21 2021-08-31 北京百度网讯科技有限公司 Method, apparatus, device and computer-readable storage medium for generating article
CN108805290A (en) * 2018-06-28 2018-11-13 国信优易数据有限公司 A kind of determination method and device of entity class
CN109215797B (en) * 2018-09-05 2022-04-08 山东管理学院 Method and system for extracting non-classification relation of traditional Chinese medicine medical case based on extended association rule
CN109215797A (en) * 2018-09-05 2019-01-15 山东管理学院 Chinese medicine case non-categorical Relation extraction method and system based on extension correlation rule
CN110969005B (en) * 2018-09-29 2023-10-31 航天信息股份有限公司 Method and device for determining similarity between entity corpora
CN110969005A (en) * 2018-09-29 2020-04-07 航天信息股份有限公司 Method and device for determining similarity between entity corpora
CN109710932A (en) * 2018-12-22 2019-05-03 北京工业大学 A kind of medical bodies Relation extraction method based on Fusion Features
CN111444713A (en) * 2019-01-16 2020-07-24 清华大学 Method and device for extracting entity relationship in news event
CN111444713B (en) * 2019-01-16 2022-04-29 清华大学 Method and device for extracting entity relationship in news event
CN109815497A (en) * 2019-01-23 2019-05-28 四川易诚智讯科技有限公司 Based on the interdependent character attribute abstracting method of syntax
CN109815497B (en) * 2019-01-23 2023-04-18 四川易诚智讯科技有限公司 Character attribute extraction method based on syntactic dependency
CN109992667A (en) * 2019-03-26 2019-07-09 新华三大数据技术有限公司 A kind of file classification method and device
CN109992667B (en) * 2019-03-26 2021-06-08 新华三大数据技术有限公司 Text classification method and device
CN110084371A (en) * 2019-03-27 2019-08-02 平安国际智慧城市科技股份有限公司 Model iteration update method, device and computer equipment based on machine learning
CN110084371B (en) * 2019-03-27 2021-01-15 平安国际智慧城市科技股份有限公司 Model iteration updating method and device based on machine learning and computer equipment
CN110532358B (en) * 2019-07-05 2023-08-22 东南大学 Knowledge base question-answering oriented template automatic generation method
CN110532358A (en) * 2019-07-05 2019-12-03 东南大学 A kind of template automatic generation method towards knowledge base question and answer
CN110458236A (en) * 2019-08-14 2019-11-15 有米科技股份有限公司 A kind of Advertising Copy style recognition methods and system
CN110674637A (en) * 2019-09-06 2020-01-10 腾讯科技(深圳)有限公司 Character relation recognition model training method, device, equipment and medium
CN110781301A (en) * 2019-09-25 2020-02-11 中国科学院信息工程研究所 Character information extraction method for character attribute sparse page
CN112579748A (en) * 2019-09-30 2021-03-30 北京国双科技有限公司 Method and device for extracting specific event relation from inquiry record
CN112711700A (en) * 2019-10-24 2021-04-27 富驰律法(北京)科技有限公司 Method and system for recommending case for fair litigation
CN110825847B (en) * 2019-10-31 2022-09-02 北京奇艺世纪科技有限公司 Method and device for identifying intimacy between target people, electronic equipment and storage medium
CN110825847A (en) * 2019-10-31 2020-02-21 北京奇艺世纪科技有限公司 Method and device for identifying intimacy between target people, electronic equipment and storage medium
CN111027324A (en) * 2019-12-05 2020-04-17 电子科技大学广东电子信息工程研究院 Method for extracting open type relation based on syntax mode and machine learning
CN111027324B (en) * 2019-12-05 2023-11-21 电子科技大学广东电子信息工程研究院 Open relation extraction method based on syntactic pattern and machine learning
CN113191118A (en) * 2021-05-08 2021-07-30 山东省计算中心(国家超级计算济南中心) Text relation extraction method based on sequence labeling
CN113191118B (en) * 2021-05-08 2023-07-18 山东省计算中心(国家超级计算济南中心) Text relation extraction method based on sequence annotation
CN114492420A (en) * 2022-04-02 2022-05-13 北京中科闻歌科技股份有限公司 Text classification method, device and equipment and computer readable storage medium
CN115796280A (en) * 2023-01-31 2023-03-14 南京万得资讯科技有限公司 Entity identification entity linking system suitable for high efficiency and controllability in financial field

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20170308