CN107102989A - Entity disambiguation method based on word vectors and convolutional neural networks - Google Patents

Entity disambiguation method based on word vectors and convolutional neural networks

Info

Publication number
CN107102989A
CN107102989A (application number CN201710373502.2A; granted as CN107102989B)
Authority
CN
China
Prior art keywords
entity
disambiguation
candidate
term vector
knowledge base
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201710373502.2A
Other languages
Chinese (zh)
Other versions
CN107102989B (en)
Inventor
张雷
高扬
唐驰
谢俊元
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University
Original Assignee
Nanjing University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University filed Critical Nanjing University
Priority to CN201710373502.2A priority Critical patent/CN107102989B/en
Publication of CN107102989A publication Critical patent/CN107102989A/en
Application granted granted Critical
Publication of CN107102989B publication Critical patent/CN107102989B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/237: Lexical tools
    • G06F40/247: Thesauruses; Synonyms
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/30: Semantic analysis
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods

Abstract

The present invention provides an entity disambiguation method based on word vectors and convolutional neural networks, comprising four stages: an entity recognition stage, an entity semantic representation stage, a neural network training stage, and an entity classification stage. The method relies on word vectors trained with word2vec and on convolutional neural networks to construct semantic feature vectors for the context of the entity to be disambiguated and for the summary information of the candidate entities in the knowledge base, respectively. In the entity classification stage, the cosine similarity between the feature vectors is computed, and the candidate entity with the highest similarity is taken as the final target entity of the entity to be disambiguated. The method substantially improves the semantic representation of entities and thereby the accuracy of the subsequent disambiguation.

Description

Entity disambiguation method based on word vectors and convolutional neural networks
Technical field
The invention belongs to the field of Internet information technology, and relates in particular to an entity disambiguation method, more particularly to an entity disambiguation method based on word vectors and convolutional neural networks.
Background art
With the spread of the mobile Internet, platforms such as microblogs, blogs, post bars, forums, major news portals and government websites have greatly facilitated people's lives. The vast majority of the data on these platforms exists in unstructured or semi-structured form, which leaves a large number of ambiguous entities in the data. If these ambiguous entities can be disambiguated accurately, later use of the data becomes far more convenient.
Most mainstream entity disambiguation algorithms are built on the bag-of-words model, whose inherent limitations prevent them from fully exploiting the semantic information of the context, so there is still considerable room to improve disambiguation quality. Word embedding has been a focus of machine learning in recent years; its core idea is to construct a distributed representation for each word, which bridges the gap between words. Convolutional neural networks, a branch of neural network models, can effectively capture local features and then model globally. Modeling word embeddings with a convolutional neural network therefore yields semantic features more effective than the bag-of-words model. Moreover, thanks to local receptive fields and weight sharing, a convolutional neural network has far fewer parameters and trains quickly; the core of Google's AlphaGo, for instance, consists of two convolutional neural networks.
The present invention combines word vectors and convolutional neural networks: it constructs separate semantic representations for the context of the entity to be disambiguated and for the summary information of the knowledge-base entities, and trains a convolutional neural network for prediction, greatly improving the semantic descriptive power of entity contexts.
Summary of the invention
Object of the invention: addressing the difficulty that existing entity disambiguation methods have in exploiting the semantic information of the context, the present invention provides an entity disambiguation method based on word vectors and convolutional neural networks, intended to capture contextual semantic information to aid entity disambiguation.
Technical scheme:
An entity disambiguation method based on word vectors and convolutional neural networks, comprising the steps of:
Step 1: collect, according to the application scenario, a text set containing the entities to be disambiguated, preprocess the text set, and determine each entity to be disambiguated in the text set together with its contextual features;
Step 2: build, according to domain knowledge, a knowledge base of the entities to be disambiguated, search the knowledge base, and determine the candidate entity set of each entity to be disambiguated and the description features of each candidate entity in the set;
Step 3: take the word vectors of the nouns in a fixed-size window centered on the entity to be disambiguated and assemble them into a word vector matrix, used as the contextual semantic feature of the entity to be disambiguated; from the summary information of each entity in the knowledge base, take the word vectors of the 20 nouns with the highest TF-IDF weights and assemble them into a word vector matrix, used as the semantic feature of the knowledge-base entity;
Step 4: form a training set from the known unambiguous entities in the text joined with the knowledge-base target entities and candidate entities, input it to a convolutional neural network model for training, and adjust the parameters of the model;
Step 5: input each sample, formed by an entity to be disambiguated and its knowledge-base candidate entity set, to the convolutional neural network model obtained in step 4, obtaining semantic feature vectors for the entity to be disambiguated and for each knowledge-base entity in the candidate entity set;
Step 6: based on the semantic feature vectors, compute the cosine similarity between the entity to be disambiguated and each entity in the knowledge-base candidate entity set; take the candidate entity with the highest similarity as the final target entity of the entity to be disambiguated.
The preprocessing in step 1 performs part-of-speech tagging and word segmentation on the text set with the Chinese Academy of Sciences Chinese word segmenter ICTCLAS, then filters out stop words according to a stop-word list, and builds a noun dictionary for proper nouns and entity names that are otherwise hard to recognize.
In step 2, the Chinese word segmenter ICTCLAS is called to perform part-of-speech tagging and word segmentation on the entity descriptions in the knowledge base, and stop words are filtered out according to the stop-word list.
Assembling the word vectors of the nouns in a fixed-size window centered on the entity to be disambiguated into a word vector matrix in step 3 is specifically:
1) call Google's deep learning tool word2vec to train on the Chinese Wikipedia corpus, obtaining a word vector table L in which each word vector has 200 dimensions, every dimension being a real number;
2) for each noun w_i in the context context_e = {w_1, w_2, …, w_i, …, w_K} of the entity e to be disambiguated, look up the word vector table L to obtain its word vector v_i;
3) from the word vectors of the context words of entity e, build the context word vector matrix [v_1, v_2, v_3, …, v_i, …, v_K] of entity e;
4) end.
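The window-and-lookup procedure of steps 1)-3) above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the toy 4-dimensional embedding table stands in for the 200-dimensional word2vec table L, and skipping out-of-vocabulary nouns is one convention the text does not specify.

```python
import numpy as np

# Toy embedding table standing in for the word2vec-trained table L
# (4 dimensions here for brevity; the patent uses 200).
EMBED_DIM = 4
rng = np.random.default_rng(0)
vector_table = {w: rng.standard_normal(EMBED_DIM)
                for w in ["apple", "company", "fruit", "phone", "tree"]}

def context_matrix(context_nouns, table, dim=EMBED_DIM):
    """Stack the word vectors of the window's nouns into a K x dim matrix.

    Nouns missing from the table are skipped (an assumption; zero
    vectors would be another reasonable convention).
    """
    rows = [table[w] for w in context_nouns if w in table]
    return np.vstack(rows) if rows else np.zeros((0, dim))

# Nouns from a window centered on the mention; "unknown" is dropped.
M = context_matrix(["apple", "phone", "company", "unknown"], vector_table)
print(M.shape)  # (3, 4)
```

Each row of M corresponds to one in-vocabulary noun of the window, in window order, which matches the matrix [v_1, …, v_K] described above.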
Taking, in step 3, the word vectors of the 20 nouns with the highest TF-IDF weights from the summary information of each entity in the knowledge base to form a word vector matrix is specifically:
1) for each noun w_i in the description features of each candidate entity e_i in the candidate entity set E = {e_1, e_2, …, e_n}, look up the word vector table L to obtain its word vector v_i;
2) from the word vectors of the nouns in the description features, build the word vector matrix of the entity description;
3) end.
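The TF-IDF selection described above (keep the 20 highest-weighted nouns of each entity description) can be sketched with a small self-contained scorer. The exact TF-IDF variant and the tie-breaking order are assumptions; the patent only states that the top-weighted nouns are kept.

```python
import math
from collections import Counter

def top_tfidf_nouns(doc_nouns, corpus_nouns, k=20):
    """Rank one description's nouns by TF-IDF and keep the top k.

    doc_nouns: the nouns of the description being scored.
    corpus_nouns: one noun list per entity description in the knowledge base.
    If the description has fewer than k distinct nouns, all are returned.
    """
    n_docs = len(corpus_nouns)
    tf = Counter(doc_nouns)                      # term frequency
    def idf(w):                                  # smoothed inverse document frequency
        df = sum(1 for d in corpus_nouns if w in d)
        return math.log(n_docs / (1 + df))
    scored = {w: tf[w] * idf(w) for w in tf}
    return [w for w, _ in sorted(scored.items(), key=lambda x: -x[1])][:k]

corpus = [["apple", "fruit", "tree"],
          ["apple", "phone", "company"],
          ["tree", "forest"]]
print(top_tfidf_nouns(corpus[1], corpus, k=2))  # ['phone', 'company']
```

"apple" occurs in two of the three descriptions, so its IDF (and hence its weight) is lowest and it is dropped first, which is exactly the behavior the weighting is meant to produce.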
The convolutional neural network training of step 4 is specifically:
1) the semantic feature of each entity to be disambiguated, together with the semantic features of its candidate entity set, forms one training sample and is input to the neural network model;
2) convolve the semantic feature of the entity to be disambiguated, with the number of convolution kernels (feature maps) set to 200 and the kernel size set to [2, 200], i.e. a matrix of height 2 and width 200;
3) apply 1-max pooling to the output of each convolution kernel, obtaining one feature per kernel;
4) the 200 kernel features form an intermediate result, which is fed to a fully connected layer of size 50, finally yielding a 50-dimensional semantic feature vector;
5) for the semantic features of the candidate entity set, first sum and average them, then feed the result to a fully connected layer, likewise of size 50, finally yielding a 50-dimensional semantic feature vector;
6) the loss function Loss_e of each training sample in the neural network is defined as:
Loss_e = max(0, 1 - sim(e, e_ε) + sim(e, e'))
where e_ε denotes the target entity of the entity e to be disambiguated and e' denotes any other candidate in the candidate entity set; the intent is to maximize the gap between the semantic-feature-vector similarity of the target entity and that of any other candidate entity;
the overall loss function is defined as: Loss = Σ_e Loss_e;
7) the parameters of the neural network are initialized from the uniform distribution U(-0.01, 0.01);
8) the activation function of the neural network is the hyperbolic tangent tanh;
9) the parameters of the neural network are updated by stochastic gradient descent;
10) end.
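The forward pass and ranking loss of steps 2)-7) can be sketched in plain NumPy. The dimensions (200-dimensional vectors, 200 kernels of size [2, 200], a 50-dimensional fully connected layer), the U(-0.01, 0.01) initialization, the tanh activation, and the hinge loss come from the text; the exact placement of the activations and the random toy input are assumptions, and gradient updates are omitted.

```python
import numpy as np

rng = np.random.default_rng(1)
D, N_KERNELS, FC = 200, 200, 50

# Parameters initialized from U(-0.01, 0.01) as stated in step 7).
W_conv = rng.uniform(-0.01, 0.01, size=(N_KERNELS, 2, D))  # kernels of size [2, 200]
W_fc = rng.uniform(-0.01, 0.01, size=(FC, N_KERNELS))

def mention_feature(matrix):
    """Convolve the K x 200 context matrix with each [2, 200] kernel,
    apply 1-max pooling, then a tanh fully connected layer -> 50-d vector."""
    K = matrix.shape[0]
    conv = np.array([[np.tanh(np.sum(W_conv[f] * matrix[i:i + 2]))
                      for i in range(K - 1)]           # slide over adjacent noun pairs
                     for f in range(N_KERNELS)])
    pooled = conv.max(axis=1)                          # 1-max pooling -> 200-d
    return np.tanh(W_fc @ pooled)                      # fully connected -> 50-d

def hinge_loss(sim_target, sim_other):
    """Loss_e = max(0, 1 - sim(e, e_eps) + sim(e, e'))."""
    return max(0.0, 1.0 - sim_target + sim_other)

x = rng.standard_normal((10, D))  # a toy 10-noun context window
v = mention_feature(x)
print(v.shape)  # (50,)
```

The loss is zero once the target entity's similarity exceeds every other candidate's by a margin of 1, which is precisely the gap the training objective maximizes.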
The entity classification stage of step 6 is specifically:
1) read the semantic feature vector a of the entity e to be disambiguated from the file system;
2) read the set of semantic feature vectors B = {b_1, b_2, …, b_n} of the candidate entity set E = {e_1, e_2, …, e_n} from the file system;
3) traverse the candidate entity set and compute the cosine similarity between the feature vector of e and each feature vector in B;
4) choose the entity with the highest similarity as the final prediction;
5) end.
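The classification stage above amounts to an arg-max over cosine similarities. A minimal sketch (the 2-dimensional toy vectors are illustrative only; the patent's feature vectors are 50-dimensional):

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity of two feature vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def best_candidate(a, candidates):
    """Index and similarity of the candidate vector most similar to a."""
    sims = [cosine(a, b) for b in candidates]
    i = int(np.argmax(sims))
    return i, sims[i]

a = np.array([1.0, 0.0])                     # feature vector of the mention
B = [np.array([0.0, 1.0]),                   # orthogonal candidate
     np.array([1.0, 0.1]),                   # nearly parallel candidate
     np.array([-1.0, 0.0])]                  # opposite candidate
idx, sim = best_candidate(a, B)
print(idx)  # 1
```

np.argmax returns the first maximal index, so in the event of an exact similarity tie the earlier candidate in the set wins; the patent does not specify a tie-breaking rule.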
Beneficial effects: the entity disambiguation method of the invention based on word vectors and convolutional neural networks builds separate semantic representations for the entity to be disambiguated and for the knowledge-base candidate entities. A neural network model is trained on the training set; during disambiguation, the entity to be disambiguated is input to the trained model, and the candidate entity most similar to it is output as the final target entity.
Brief description of the drawings
To illustrate the present invention more clearly, the drawings required by the invention are briefly described below:
Fig. 1 is the flow chart of the entity disambiguation method of the invention based on word vectors and convolutional neural networks.
Fig. 2 is the structure diagram of the convolutional neural network model.
Fig. 3 is the flow chart of the entity classification stage.
Embodiment
The present invention is further described below with reference to the drawings.
The flow chart of the entity disambiguation method of the invention based on word vectors and convolutional neural networks is shown in Fig. 1.
Step 0 is the initial state of the entity disambiguation method of the invention;
In the entity recognition stage (steps 1-6):
Step 1 collects, according to the application scenario, a text set containing the entities to be disambiguated;
Step 2 builds, according to domain knowledge, the knowledge base of the entities to be disambiguated;
Step 3 calls the Chinese Academy of Sciences Chinese word segmenter ICTCLAS to perform part-of-speech tagging and word segmentation on the text set, then filters out stop words according to the stop-word list, and creates a noun dictionary for proper nouns and entity names that are otherwise hard to recognize;
Step 4 calls the Chinese word segmenter ICTCLAS to perform part-of-speech tagging and word segmentation on the entity descriptions in the knowledge base, and filters out stop words according to the stop-word list;
Step 5 determines, according to the application scenario, each entity of interest to be disambiguated and its contextual features;
Step 6 generates the candidate entities: the knowledge base is searched, and the mention of the entity to be disambiguated in the text is compared with the entity mentions in the knowledge base; if they are identical, those entities are regarded as candidate entities of the mention in the text, thereby determining the candidate entity set of each entity to be disambiguated and the description features of each candidate entity in the set;
In the entity semantic representation stage (steps 7-10):
Step 7 takes the word vectors of the nouns in a fixed-size window centered on the entity to be disambiguated and assembles them into a word vector matrix (after the text set has been part-of-speech tagged and segmented, the words tagged /n); the window size is 10;
1) call Google's deep learning tool word2vec to train on the Chinese Wikipedia corpus, obtaining a word vector table L in which each word vector has 200 dimensions, every dimension being a real number;
2) for each noun w_i in the context context_e = {w_1, w_2, …, w_i, …, w_K} of the entity e to be disambiguated, look up the word vector table L to obtain its word vector v_i;
3) from the word vectors of the context words of entity e, build the context word vector matrix [v_1, v_2, v_3, …, v_i, …, v_K] of entity e;
4) end.
Step 8 takes, from the summary information of each entity in the knowledge base, the word vectors of the 20 nouns with the highest TF-IDF weights to form a word vector matrix; if there are fewer than 20 nouns, all available nouns are taken;
1) for each noun w_i in the description features of each candidate entity e_i in the candidate entity set E = {e_1, e_2, …, e_n}, look up the word vector table L to obtain its word vector v_i;
2) from the word vectors of the nouns in the description features, build the word vector matrix of the entity description;
3) end.
Step 9 uses the word vector matrix of step 7 as the contextual semantic feature of the entity to be disambiguated;
Step 10 uses the word vector matrix of step 8 as the semantic feature of the knowledge-base entity;
In the neural network training stage (steps 11-12):
Step 11 forms the training set from the known unambiguous entities in the text joined with the knowledge-base entities;
Step 12 inputs the training set of step 11 to the convolutional neural network model for training and adjusts the parameters of the model;
1) the semantic representation of each entity to be disambiguated, together with the semantic features of its candidate entity set, forms one training sample and is input to the neural network model;
2) convolve the semantic feature of the entity to be disambiguated, with the number of convolution kernels (feature maps) set to 200 and the kernel size set to [2, 200], i.e. a matrix of height 2 and width 200;
3) apply 1-max pooling to the output of each convolution kernel, obtaining one feature per kernel;
4) the 200 kernel features form an intermediate result, which is fed to a fully connected layer of size 50, finally yielding a 50-dimensional semantic feature vector;
5) for the semantic features of the candidate entity set, first sum and average them, then feed the result to a fully connected layer, likewise of size 50, finally yielding a 50-dimensional semantic feature vector;
6) the loss function Loss_e of each training sample in the neural network is defined as:
Loss_e = max(0, 1 - sim(e, e_ε) + sim(e, e'))
where e_ε denotes the target entity of the entity e to be disambiguated and e' denotes any other candidate in the candidate entity set; the intent is to maximize the gap between the semantic-feature-vector similarity of the target entity and that of any other candidate entity;
the overall loss function is defined as: Loss = Σ_e Loss_e;
7) the parameters of the neural network are initialized from the uniform distribution U(-0.01, 0.01);
8) the activation function of the neural network is the hyperbolic tangent tanh;
9) the parameters of the neural network are updated by stochastic gradient descent;
10) end.
In the entity classification stage (steps 13-14):
Step 13 reads from the text the sample set of entities to be disambiguated and knowledge-base candidate entities;
Step 14 traverses the sample set read in step 13, inputs each sample to the convolutional neural network model trained in step 12, and outputs the classification results;
Step 15 is the end step of the entity disambiguation method of the invention based on word vectors and convolutional neural networks;
Fig. 2 is a detailed overview of the neural network structure of step 12 of the neural network training stage in Fig. 1, comprising the following parts:
Word vector matrices: the word vector matrix of the context of the entity to be disambiguated and the word vector matrix of the knowledge-base entity description features serve as the input of the convolutional neural network;
Convolutional layer: the context word vector matrix of the entity to be disambiguated is convolved with 200 different kernels, yielding one feature per kernel;
1-max pooling layer: 1-max pooling is applied to the features output by the convolutional layer, yielding a 200-dimensional intermediate result;
Fully connected layer: a fully connected layer of size 50 is attached to the above intermediate result; the summed and averaged word vectors of the knowledge-base candidate entities are likewise fed to a fully connected layer of size 50, yielding two 50-dimensional semantic feature vectors;
Similarity computation: the cosine similarity of the two semantic feature vectors is computed;
Fig. 3 describes in detail the flow of step 14 of the entity classification stage in Fig. 1:
Step 16 is the initial state of Fig. 3;
Step 17 reads the trained neural network model from the file system;
Step 18 reads from the text the sample set of entities to be disambiguated and knowledge-base candidate entities;
Step 19 inputs the sample set to the convolutional neural network model and, once the semantic feature vectors have been obtained, traverses the knowledge-base candidate entity set and computes the cosine similarity between the semantic feature vector of the entity to be disambiguated and that of each candidate entity;
Step 20 outputs the entity with the highest similarity as the final target entity;
Step 21 is the end state of Fig. 3;
Specifically: 1) read the semantic feature vector a of the entity e to be disambiguated from the file system;
2) read the set of semantic feature vectors B = {b_1, b_2, …, b_n} of the candidate entity set E = {e_1, e_2, …, e_n} from the file system;
3) traverse the candidate entity set and compute the cosine similarity between the feature vector of e and each feature vector in B;
4) choose the entity with the highest similarity as the final prediction;
5) end.
In summary, the present invention combines word vectors and convolutional neural networks: word vector matrices are constructed from the context of the entity to be disambiguated and from the summary information of the knowledge-base candidate entities, and are input to a convolutional neural network model; the model is trained and its parameters adjusted; in the prediction phase, the most similar entity is output as the target entity. This overcomes the lexical gap, and hence the insufficient semantic representation power, of the traditional bag-of-words model, and further improves the accuracy of entity disambiguation.
The above is only a preferred embodiment of the present invention. It should be noted that a person of ordinary skill in the art can make several improvements and modifications without departing from the principles of the invention, and such improvements and modifications should also be regarded as falling within the scope of protection of the invention.

Claims (7)

1. An entity disambiguation method based on word vectors and convolutional neural networks, characterized by comprising the steps of:
Step 1: collecting, according to the application scenario, a text set containing the entities to be disambiguated, preprocessing the text set, and determining each entity to be disambiguated in the text set together with its contextual features;
Step 2: building, according to domain knowledge, a knowledge base of the entities to be disambiguated, searching the knowledge base, and determining the candidate entity set of each entity to be disambiguated and the description features of each candidate entity in the set;
Step 3: taking the word vectors of the nouns in a fixed-size window centered on the entity to be disambiguated and assembling them into a word vector matrix, used as the contextual semantic feature of the entity to be disambiguated; from the summary information of each entity in the knowledge base, taking the word vectors of the 20 nouns with the highest TF-IDF weights and assembling them into a word vector matrix, used as the semantic feature of the knowledge-base entity;
Step 4: forming a training set from the known unambiguous entities in the text joined with the knowledge-base target entities and candidate entities, inputting it to a convolutional neural network model for training, and adjusting the parameters of the model;
Step 5: inputting each sample, formed by an entity to be disambiguated and its knowledge-base candidate entity set, to the convolutional neural network model obtained in step 4, obtaining semantic feature vectors for the entity to be disambiguated and for each knowledge-base entity in the candidate entity set;
Step 6: based on the semantic feature vectors, computing the cosine similarity between the entity to be disambiguated and each entity in the knowledge-base candidate entity set; taking the candidate entity with the highest similarity as the final target entity of the entity to be disambiguated.
2. The entity disambiguation method according to claim 1, characterized in that the preprocessing in step 1 performs part-of-speech tagging and word segmentation on the text set with the Chinese Academy of Sciences Chinese word segmenter ICTCLAS, then filters out stop words according to a stop-word list, and builds a noun dictionary for proper nouns and entity names that are otherwise hard to recognize.
3. The entity disambiguation method according to claim 1, characterized in that in step 2 the Chinese word segmenter ICTCLAS is called to perform part-of-speech tagging and word segmentation on the entity descriptions in the knowledge base, and stop words are filtered out according to the stop-word list.
4. The entity disambiguation method according to claim 1, characterized in that assembling the word vectors of the nouns in a fixed-size window centered on the entity to be disambiguated into a word vector matrix in step 3 is specifically:
1) call Google's deep learning tool word2vec to train on the Chinese Wikipedia corpus, obtaining a word vector table L in which each word vector has 200 dimensions, every dimension being a real number;
2) for each noun w_i in the context context_e = {w_1, w_2, …, w_i, …, w_K} of the entity e to be disambiguated, look up the word vector table L to obtain its word vector v_i;
3) from the word vectors of the context words of entity e, build the context word vector matrix [v_1, v_2, v_3, …, v_i, …, v_K] of entity e;
4) end.
5. The entity disambiguation method according to claim 1, characterized in that taking, in step 3, the word vectors of the 20 nouns with the highest TF-IDF weights from the summary information of each entity in the knowledge base to form a word vector matrix is specifically:
1) for each noun w_i in the description features of each candidate entity e_i in the candidate entity set E = {e_1, e_2, …, e_n}, look up the word vector table L to obtain its word vector v_i;
2) from the word vectors of the nouns in the description features, build the word vector matrix of the entity description;
3) end.
6. The entity disambiguation method according to claim 1, characterized in that the convolutional neural network training of step 4 proceeds as follows:
1) the semantic feature of each entity to be disambiguated, together with the semantic features of its candidate entity set, forms one training sample and is input to the neural network model;
2) convolve the semantic feature of the entity to be disambiguated, with the number of convolution kernels (feature maps) set to 200 and the kernel size set to [2, 200], i.e. a matrix of height 2 and width 200;
3) apply 1-max pooling to the output of each convolution kernel, obtaining one feature per kernel;
4) the 200 kernel features form an intermediate result, which is fed to a fully connected layer of size 50, finally yielding a 50-dimensional semantic feature vector;
5) for the semantic features of the candidate entity set, first sum and average them, then feed the result to a fully connected layer, likewise of size 50, finally yielding a 50-dimensional semantic feature vector;
6) the loss function Loss_e of each training sample in the neural network is defined as:
Loss_e = max(0, 1 - sim(e, e_ε) + sim(e, e'))
where e_ε denotes the target entity of the entity e to be disambiguated and e' denotes any other candidate in the candidate entity set; the intent is to maximize the gap between the semantic-feature-vector similarity of the target entity and that of any other candidate entity;
the overall loss function is defined as: Loss = Σ_e Loss_e;
7) the parameters of the neural network are initialized from the uniform distribution U(-0.01, 0.01);
8) the activation function of the neural network is the hyperbolic tangent tanh;
9) the parameters of the neural network are updated by stochastic gradient descent;
10) end.
7. The entity disambiguation method according to claim 1, characterized in that the entity classification stage of step 6 proceeds as follows:
1) read the semantic feature vector a of the entity e to be disambiguated from the file system;
2) read the set of semantic feature vectors B = {b_1, b_2, …, b_n} of the candidate entity set E = {e_1, e_2, …, e_n} from the file system;
3) traverse the candidate entity set and compute the cosine similarity between the feature vector of e and each feature vector in B;
4) choose the entity with the highest similarity as the final prediction;
5) end.
CN201710373502.2A 2017-05-24 2017-05-24 Entity disambiguation method based on word vector and convolutional neural network Active CN107102989B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710373502.2A CN107102989B (en) 2017-05-24 2017-05-24 Entity disambiguation method based on word vector and convolutional neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710373502.2A CN107102989B (en) 2017-05-24 2017-05-24 Entity disambiguation method based on word vector and convolutional neural network

Publications (2)

Publication Number Publication Date
CN107102989A true CN107102989A (en) 2017-08-29
CN107102989B CN107102989B (en) 2020-09-29

Family

ID=59670296

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710373502.2A Active CN107102989B (en) 2017-05-24 2017-05-24 Entity disambiguation method based on word vector and convolutional neural network

Country Status (1)

Country Link
CN (1) CN107102989B (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104572892A (en) * 2014-12-24 2015-04-29 中国科学院自动化研究所 Text classification method based on recurrent convolutional networks
CN106295796A (en) * 2016-07-22 2017-01-04 浙江大学 Entity linking method based on deep learning
CN106547735A (en) * 2016-10-25 2017-03-29 复旦大学 Construction and usage of context-aware dynamic word and character vectors based on deep learning
CN106570170A (en) * 2016-11-09 2017-04-19 武汉泰迪智慧科技有限公司 Integrated text classification and named entity recognition method and system based on deep recurrent neural networks

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Du Jingjun et al., "Named entity disambiguation method based on Chinese Wikipedia", Journal of Hangzhou Dianzi University *

Cited By (81)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107562729A (en) * 2017-09-14 2018-01-09 云南大学 Party building text representation method based on neural network and theme enhancement
CN107562729B (en) * 2017-09-14 2020-12-08 云南大学 Party building text representation method based on neural network and theme enhancement
CN107730002A (en) * 2017-10-13 2018-02-23 国网湖南省电力公司 A kind of communication network shutdown remote control parameter intelligent fuzzy comparison method
CN107730002B (en) * 2017-10-13 2020-06-02 国网湖南省电力公司 Intelligent fuzzy comparison method for remote control parameters of communication gateway machine
CN107729509A (en) * 2017-10-23 2018-02-23 中国电子科技集团公司第二十八研究所 The chapter similarity decision method represented based on recessive higher-dimension distributed nature
CN107729509B (en) * 2017-10-23 2020-07-07 中国电子科技集团公司第二十八研究所 Discourse similarity determination method based on recessive high-dimensional distributed feature representation
CN110019792A (en) * 2017-10-30 2019-07-16 阿里巴巴集团控股有限公司 Text classification method and device and classifier model training method
CN108280061B (en) * 2018-01-17 2021-10-26 北京百度网讯科技有限公司 Text processing method and device based on ambiguous entity words
CN108280061A (en) * 2018-01-17 2018-07-13 北京百度网讯科技有限公司 Text processing method and device based on ambiguous entity words
US20190220749A1 (en) * 2018-01-17 2019-07-18 Beijing Baidu Netcom Science And Technology Co., Ltd. Text processing method and device based on ambiguous entity words
US11455542B2 (en) * 2018-01-17 2022-09-27 Beijing Baidu Netcom Science And Technology Co., Ltd. Text processing method and device based on ambiguous entity words
CN108304552A (en) * 2018-02-01 2018-07-20 浙江大学 A named entity linking method based on knowledge-base feature extraction
CN108335731A (en) * 2018-02-09 2018-07-27 辽宁工程技术大学 A computer-vision-based diet recommendation method for invalids
CN108399230A (en) * 2018-02-13 2018-08-14 上海大学 A Chinese financial news text classification method based on convolutional neural networks
CN108446269A (en) * 2018-03-05 2018-08-24 昆明理工大学 A kind of Word sense disambiguation method and device based on term vector
CN108573047A (en) * 2018-04-18 2018-09-25 广东工业大学 A kind of training method and device of Module of Automatic Chinese Documents Classification
CN108563766A (en) * 2018-04-19 2018-09-21 天津科技大学 The method and device of food retrieval
CN108959242A (en) * 2018-05-08 2018-12-07 中国科学院信息工程研究所 A target entity recognition method and device based on Chinese character and part-of-speech features
CN108647785A (en) * 2018-05-17 2018-10-12 普强信息技术(北京)有限公司 A kind of neural network method for automatic modeling, device and storage medium
CN108647191A (en) * 2018-05-17 2018-10-12 南京大学 A sentiment dictionary construction method based on supervised sentiment texts and term vectors
CN108804595B (en) * 2018-05-28 2021-07-27 中山大学 Short text representation method based on word2vec
CN108804595A (en) * 2018-05-28 2018-11-13 中山大学 A kind of short text representation method based on word2vec
CN110555208A (en) * 2018-06-04 2019-12-10 北京三快在线科技有限公司 ambiguity elimination method and device in information query and electronic equipment
CN108921213A (en) * 2018-06-28 2018-11-30 国信优易数据有限公司 A kind of entity classification model training method and device
CN108921213B (en) * 2018-06-28 2021-06-22 国信优易数据股份有限公司 Entity classification model training method and device
CN108805290A (en) * 2018-06-28 2018-11-13 国信优易数据有限公司 A kind of determination method and device of entity class
CN109101579B (en) * 2018-07-19 2021-11-23 深圳追一科技有限公司 Customer service robot knowledge base ambiguity detection method
CN109101579A (en) * 2018-07-19 2018-12-28 深圳追一科技有限公司 Customer service robot knowledge base ambiguity detection method
CN108920467A (en) * 2018-08-01 2018-11-30 北京三快在线科技有限公司 Polysemous word vector learning method and device, and search result display method
CN109325108B (en) * 2018-08-13 2022-05-27 北京百度网讯科技有限公司 Query processing method, device, server and storage medium
US11216618B2 (en) 2018-08-13 2022-01-04 Beijing Baidu Netcom Science And Technology Co., Ltd. Query processing method, apparatus, server and storage medium
CN109325108A (en) * 2018-08-13 2019-02-12 北京百度网讯科技有限公司 Query processing method, device, server and storage medium
CN109241294A (en) * 2018-08-29 2019-01-18 国信优易数据有限公司 A kind of entity link method and device
CN109214007A (en) * 2018-09-19 2019-01-15 哈尔滨理工大学 A Chinese sentence word sense disambiguation method based on convolutional neural networks
CN109299462A (en) * 2018-09-20 2019-02-01 武汉理工大学 Short text similarity calculating method based on multidimensional convolution feature
CN109614615A (en) * 2018-12-04 2019-04-12 联想(北京)有限公司 Entity matching method, device and electronic equipment
CN109740728A (en) * 2018-12-10 2019-05-10 杭州世平信息科技有限公司 A sentencing calculation method based on multiple neural network ensembles
CN109740728B (en) * 2018-12-10 2019-11-01 杭州世平信息科技有限公司 A sentencing calculation method based on multiple neural network ensembles
CN109635114A (en) * 2018-12-17 2019-04-16 北京百度网讯科技有限公司 Method and apparatus for handling information
CN109933788A (en) * 2019-02-14 2019-06-25 北京百度网讯科技有限公司 Type determination method, apparatus, device and medium
WO2020228376A1 (en) * 2019-05-16 2020-11-19 华为技术有限公司 Text processing method and model training method and apparatus
CN110598846B (en) * 2019-08-15 2022-05-03 北京航空航天大学 Hierarchical recurrent neural network decoder and decoding method
CN110598846A (en) * 2019-08-15 2019-12-20 北京航空航天大学 Hierarchical recurrent neural network decoder and decoding method
CN110705292B (en) * 2019-08-22 2022-11-29 成都信息工程大学 Entity name extraction method based on knowledge base and deep learning
CN110705292A (en) * 2019-08-22 2020-01-17 成都信息工程大学 Entity name extraction method based on knowledge base and deep learning
CN110569506A (en) * 2019-09-05 2019-12-13 清华大学 Medical named entity recognition method based on medical dictionary
CN110705295A (en) * 2019-09-11 2020-01-17 北京航空航天大学 Entity name disambiguation method based on keyword extraction
CN110705295B (en) * 2019-09-11 2021-08-24 北京航空航天大学 Entity name disambiguation method based on keyword extraction
CN110674304A (en) * 2019-10-09 2020-01-10 北京明略软件系统有限公司 Entity disambiguation method and device, readable storage medium and electronic equipment
CN110826331B (en) * 2019-10-28 2023-04-18 南京师范大学 Intelligent construction method of place name labeling corpus based on interactive and iterative learning
CN110826331A (en) * 2019-10-28 2020-02-21 南京师范大学 Intelligent construction method of place name labeling corpus based on interactive and iterative learning
CN110852106B (en) * 2019-11-06 2024-05-03 腾讯科技(深圳)有限公司 Named entity processing method and device based on artificial intelligence and electronic equipment
CN110852106A (en) * 2019-11-06 2020-02-28 腾讯科技(深圳)有限公司 Named entity processing method and device based on artificial intelligence and electronic equipment
CN110852108A (en) * 2019-11-11 2020-02-28 中山大学 Joint training method, apparatus and medium for entity recognition and entity disambiguation
CN113010633B (en) * 2019-12-20 2023-01-31 海信视像科技股份有限公司 Information interaction method and equipment
CN113010633A (en) * 2019-12-20 2021-06-22 海信视像科技股份有限公司 Information interaction method and equipment
CN111241298A (en) * 2020-01-08 2020-06-05 腾讯科技(深圳)有限公司 Information processing method, apparatus and computer readable storage medium
CN111241298B (en) * 2020-01-08 2023-10-10 腾讯科技(深圳)有限公司 Information processing method, apparatus, and computer-readable storage medium
CN111241824A (en) * 2020-01-09 2020-06-05 中国搜索信息科技股份有限公司 Method for identifying Chinese metaphor information
CN111310481A (en) * 2020-01-19 2020-06-19 百度在线网络技术(北京)有限公司 Speech translation method, device, computer equipment and storage medium
CN111597804A (en) * 2020-05-15 2020-08-28 腾讯科技(深圳)有限公司 Entity recognition model training method and related device
CN111597804B (en) * 2020-05-15 2023-03-10 腾讯科技(深圳)有限公司 Method and related device for training entity recognition model
CN111709243B (en) * 2020-06-19 2023-07-07 南京优慧信安科技有限公司 Knowledge extraction method and device based on deep learning
CN111709243A (en) * 2020-06-19 2020-09-25 南京优慧信安科技有限公司 Knowledge extraction method and device based on deep learning
CN112069826B (en) * 2020-07-15 2021-12-07 浙江工业大学 Vertical domain entity disambiguation method fusing topic model and convolutional neural network
CN112069826A (en) * 2020-07-15 2020-12-11 浙江工业大学 Vertical domain entity disambiguation method fusing topic model and convolutional neural network
CN112100356A (en) * 2020-09-17 2020-12-18 武汉纺织大学 Knowledge base question-answer entity linking method and system based on similarity
CN112257443A (en) * 2020-09-30 2021-01-22 华泰证券股份有限公司 MRC-based company entity disambiguation method combined with knowledge base
CN112257443B (en) * 2020-09-30 2024-04-02 华泰证券股份有限公司 MRC-based company entity disambiguation method combined with knowledge base
CN112464669B (en) * 2020-12-07 2024-02-09 宁波深擎信息科技有限公司 Stock entity word disambiguation method, computer device, and storage medium
CN112464669A (en) * 2020-12-07 2021-03-09 宁波深擎信息科技有限公司 Stock entity word disambiguation method, computer device and storage medium
CN112966117A (en) * 2020-12-28 2021-06-15 成都数之联科技有限公司 Entity linking method
CN112580351B (en) * 2020-12-31 2022-04-19 成都信息工程大学 Machine-generated text detection method based on self-information loss compensation
CN112580351A (en) * 2020-12-31 2021-03-30 成都信息工程大学 Machine-generated text detection method based on self-information loss compensation
CN113761218A (en) * 2021-04-27 2021-12-07 腾讯科技(深圳)有限公司 Entity linking method, device, equipment and storage medium
CN113761218B (en) * 2021-04-27 2024-05-10 腾讯科技(深圳)有限公司 Method, device, equipment and storage medium for entity linking
CN113283236B (en) * 2021-05-31 2022-07-19 北京邮电大学 Entity disambiguation method in complex Chinese text
CN113283236A (en) * 2021-05-31 2021-08-20 北京邮电大学 Entity disambiguation method in complex Chinese text
CN113704416B (en) * 2021-10-26 2022-03-04 深圳市北科瑞声科技股份有限公司 Word sense disambiguation method and device, electronic equipment and computer-readable storage medium
CN113704416A (en) * 2021-10-26 2021-11-26 深圳市北科瑞声科技股份有限公司 Word sense disambiguation method and device, electronic equipment and computer-readable storage medium
WO2023202170A1 (en) * 2022-04-21 2023-10-26 北京沃东天骏信息技术有限公司 Product word disambiguation method and apparatus

Also Published As

Publication number Publication date
CN107102989B (en) 2020-09-29

Similar Documents

Publication Publication Date Title
CN107102989A (en) A kind of entity disambiguation method based on term vector, convolutional neural networks
CN107133213B (en) Method and system for automatically extracting text abstract based on algorithm
US10997370B2 (en) Hybrid classifier for assigning natural language processing (NLP) inputs to domains in real-time
CN107944559B (en) Method and system for automatically identifying entity relationship
CN110032632A (en) Intelligent customer service answering method, device and storage medium based on text similarity
CN103927302B (en) A text classification method and system
CN109241283A (en) A text classification method based on multi-angle capsule network
CN106599029A (en) Chinese short text clustering method
CN106815252A (en) A kind of searching method and equipment
CN104778256B (en) A fast incremental clustering method for consultations in domain question-answering systems
CN113392209B (en) Text clustering method based on artificial intelligence, related equipment and storage medium
CN106407280A (en) Query target matching method and device
WO2017193685A1 (en) Method and device for data processing in social network
CN112163425A (en) Text entity relation extraction method based on multi-feature information enhancement
CN103646099A (en) Paper recommendation method based on multilayer graphs
CN110347776A (en) Point-of-interest name matching method, device, equipment and storage medium
CN105893362A (en) A method for acquiring knowledge point semantic vectors and a method and a system for determining correlative knowledge points
CN109582761A (en) A Chinese intelligent question answering method based on word similarity for network platforms
CN107092605A (en) A kind of entity link method and device
Liu et al. Structured alignment networks for matching sentences
CN110851570A (en) Unsupervised keyword extraction method based on Embedding technology
CN113761890A (en) BERT context sensing-based multi-level semantic information retrieval method
Fahrni et al. HITS' Monolingual and Cross-lingual Entity Linking System at TAC 2013.
Lu et al. Feature words selection for knowledge-based word sense disambiguation with syntactic parsing
Feifei et al. Bert-based Siamese Network for Semantic Similarity

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant