CN107102989B - Entity disambiguation method based on word vector and convolutional neural network - Google Patents

Entity disambiguation method based on word vector and convolutional neural network

Info

Publication number
CN107102989B
CN107102989B (application CN201710373502.2A)
Authority
CN
China
Prior art keywords
entity
disambiguated
word
neural network
word vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710373502.2A
Other languages
Chinese (zh)
Other versions
CN107102989A (en)
Inventor
张雷
高扬
唐驰
谢俊元
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University
Original Assignee
Nanjing University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University filed Critical Nanjing University
Priority to CN201710373502.2A
Publication of CN107102989A
Application granted
Publication of CN107102989B

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/237: Lexical tools
    • G06F40/247: Thesauruses; Synonyms
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/30: Semantic analysis
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides an entity disambiguation method based on word vectors and a convolutional neural network. Relying on word vectors trained by word2vec and a convolutional neural network, the method constructs semantic feature vectors for the context of the entity to be disambiguated and for the abstract information of the candidate entities in a knowledge base. In the entity classification stage, the cosine similarity of the feature vectors is calculated, and the candidate entity with the highest similarity is taken as the final target entity of the entity to be disambiguated. The method greatly improves the semantic representation of entities and thereby the accuracy of the subsequent disambiguation.

Description

Entity disambiguation method based on word vector and convolutional neural network
Technical Field
The invention belongs to the technical field of internet information and relates to an entity disambiguation method, in particular to one based on word vectors and a convolutional neural network.
Background
With the spread of the mobile internet, microblogs, blogs, posts, forums, news websites, government websites and the like have greatly simplified people's lives. Most of the data on these platforms exists in unstructured or semi-structured form, so it contains a large number of ambiguous entity mentions. If these ambiguous entities can be disambiguated accurately, later use of the data becomes far more convenient.
Most mainstream entity disambiguation algorithms are built on bag-of-words models, whose inherent limitations prevent the algorithms from fully exploiting the semantic information of the context, leaving considerable room for improvement. Word embedding has been a hot topic in machine learning in recent years; its core idea is to construct a distributed representation for each word, which avoids the vocabulary gap between words. The convolutional neural network is a branch of neural network models that can effectively capture local features and then model them globally. If a convolutional neural network is used to model word embeddings, semantic features more effective than those of a bag-of-words model can be obtained. Moreover, thanks to local receptive fields and weight sharing, the number of parameters in a convolutional neural network is greatly reduced and training is fast; the core of Google's AlphaGo, for instance, is two convolutional neural networks.
The invention combines word vectors with a convolutional neural network, constructs semantic representations for the context of the entity to be disambiguated and for the entity abstract information in the knowledge base, and trains the convolutional neural network for prediction. This greatly improves the semantic description of the entity context.
Disclosure of Invention
The purpose of the invention is as follows: given that existing entity disambiguation methods struggle to exploit contextual semantic information, the invention provides an entity disambiguation method based on word vectors and a convolutional neural network that captures contextual semantics to aid disambiguation.
The technical scheme is as follows:
An entity disambiguation method based on word vectors and a convolutional neural network comprises the following steps:
Step 1: collect a text set containing entities to be disambiguated according to the application scenario, preprocess it, and determine each entity to be disambiguated and its context features in the text set;
Step 2: construct a knowledge base of the entities to be disambiguated according to domain knowledge, search the knowledge base, and determine the candidate entity set of each entity to be disambiguated and the description features of each candidate entity in the set;
Step 3: take the word vectors of the nouns in a fixed-size window centered on the entity to be disambiguated to form a word vector matrix as the context semantic feature of the entity to be disambiguated; compute TF·IDF over the abstract information of each entity in the knowledge base and take the word vectors of the 20 highest-weighted nouns to form a word vector matrix as the semantic feature of the knowledge-base entity;
Step 4: combine known unambiguous entities in the text with their target entities and candidate entities in the knowledge base to form a training set, input the training set into a convolutional neural network model for training, and tune the parameters of the model;
Step 5: input a sample consisting of each entity to be disambiguated and its knowledge-base candidate entity set into the convolutional neural network model obtained in step 4 to obtain the semantic feature vectors of the entity to be disambiguated and of each entity in the candidate set;
Step 6: calculate the cosine similarity between the entity to be disambiguated and each entity in the knowledge-base candidate entity set based on the semantic feature vectors, and take the candidate entity with the highest similarity as the final target entity of the entity to be disambiguated.
The preprocessing in step 1 uses the Chinese word segmentation program ICTCLAS of the Chinese Academy of Sciences to perform part-of-speech tagging and word segmentation on the text set, then filters out stop words according to a stop word list, and creates a noun dictionary for proper nouns and entity names that are difficult to recognize.
In step 2, the Chinese word segmentation program ICTCLAS is called to perform part-of-speech tagging and word segmentation on the entity descriptions in the knowledge base, and stop words are filtered according to the stop word list.
Forming a word vector matrix from the word vectors of the nouns in a fixed-size window centered on the entity to be disambiguated, as described in step 3, specifically comprises:
1) call Google's deep learning program word2vec to train on a Wikipedia corpus, obtaining a word vector table L; each word vector is 200-dimensional and each dimension is a real number;
2) for each noun w_i in the context {w_1, w_2, …, w_i, …, w_K} of the entity e to be disambiguated, query the word vector table L to obtain its word vector v_i;
3) construct the context word vector matrix [v_1, v_2, v_3, …, v_i, …, v_K] of the entity e to be disambiguated from the word vectors of its context words;
4) end.
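For illustration, the construction above can be sketched in a few lines of Python. This is a minimal sketch, not the patented implementation: it assumes a gensim word vector table in place of Google's word2vec tool, and a context that has already been segmented and POS-tagged (nouns carry the /n tag); the window size of 10 follows the detailed description below.

import numpy as np
from gensim.models import KeyedVectors

def context_matrix(wv, tagged_context, entity_index, window=10):
    # wv: word vector table L (200-dim KeyedVectors); tagged_context: list
    # of (word, pos) pairs; entity_index: position of the entity to be
    # disambiguated. Returns the matrix [v_1, ..., v_K] of noun vectors
    # inside the fixed-size window centered on the entity.
    lo = max(0, entity_index - window)
    hi = min(len(tagged_context), entity_index + window + 1)
    vectors = [wv[w] for w, pos in tagged_context[lo:hi]
               if pos == "n" and w in wv]
    return np.stack(vectors) if vectors else np.zeros((0, wv.vector_size))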
Forming a word vector matrix from the word vectors of the 20 highest-weighted nouns after computing TF·IDF over the abstract information of each knowledge-base entity, as described in step 3, specifically comprises:
1) for each noun w_i in the description of each candidate entity e_i in the candidate entity set E = {e_1, e_2, …, e_n}, query the word vector table L to obtain its word vector v_i;
2) construct the word vector matrix of the entity description from the word vectors of the nouns in the description features;
3) end.
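The TF·IDF selection that precedes this lookup can likewise be sketched as follows. The exact weighting variant is not specified in the text, so the standard tf x log(N/df) form is assumed here, and the function name is hypothetical.

import math
from collections import Counter

def top_k_nouns(doc_nouns, all_abstracts_nouns, k=20):
    # doc_nouns: nouns of one entity abstract; all_abstracts_nouns: noun
    # lists of every abstract in the knowledge base. Returns the k
    # highest-weighted nouns (all of them if fewer than k exist).
    n_docs = len(all_abstracts_nouns)
    df = Counter()
    for doc in all_abstracts_nouns:
        df.update(set(doc))                     # document frequency per noun
    tf = Counter(doc_nouns)
    weight = {w: (tf[w] / len(doc_nouns)) * math.log(n_docs / (1 + df[w]))
              for w in tf}
    return sorted(weight, key=weight.get, reverse=True)[:k]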
The convolutional neural network learning and training of step 4 specifically comprises:
1) each training sample, consisting of the semantic features of an entity to be disambiguated and the semantic features of its candidate entity set, is input into the neural network model;
2) the semantic features to be disambiguated are convolved; the number of convolution kernel feature maps is set to 200, and the kernel size is [2, 200], i.e., a matrix of length 2 and width 200;
3) the convolution result of each kernel is pooled with 1-max pooling to obtain one feature per kernel;
4) the 200 kernel features form an intermediate result, which is input into a fully connected layer of size 50, finally yielding a 50-dimensional semantic feature vector;
5) the semantic features of the candidate entity set are summed and averaged, then input into a fully connected layer, also of size 50, finally yielding a 50-dimensional semantic feature vector;
6) the loss function Loss_e of each training sample in the neural network is defined as:
Loss_e = max(0, 1 - sim(e, e*) + sim(e, e′))
where e* denotes the target entity of the entity e to be disambiguated and e′ denotes any other candidate entity in the candidate entity set; the loss thus maximizes the margin between the similarity of the semantic feature vectors of the target entity and of any other candidate entity;
the global loss function is defined as Loss = Σ_e Loss_e;
7) the parameters of the neural network are initialized from the uniform distribution U(-0.01, 0.01);
8) all activation functions in the neural network are tanh (hyperbolic tangent);
9) the parameters of the neural network are updated by stochastic gradient descent;
10) end.
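A minimal PyTorch sketch of this network and loss follows, with the settings stated above (200 kernels of size [2, 200], 1-max pooling, fully connected layers of size 50, tanh activations, uniform initialization). PyTorch itself, the class name, and the tensor shapes are assumptions; the patent does not name an implementation framework.

import torch
import torch.nn as nn

class DisambiguationCNN(nn.Module):
    def __init__(self, dim=200, n_kernels=200, out_dim=50):
        super().__init__()
        self.conv = nn.Conv2d(1, n_kernels, kernel_size=(2, dim))  # [2, 200] kernels
        self.fc_context = nn.Linear(n_kernels, out_dim)            # size-50 layer
        self.fc_entity = nn.Linear(dim, out_dim)                   # size-50 layer
        for p in self.parameters():            # substep 7): U(-0.01, 0.01)
            nn.init.uniform_(p, -0.01, 0.01)

    def context_vector(self, matrix):
        # matrix: (batch, K, dim) context word vector matrix
        h = torch.tanh(self.conv(matrix.unsqueeze(1))).squeeze(3)
        h = h.max(dim=2).values                # 1-max pooling per kernel
        return torch.tanh(self.fc_context(h))  # 50-dim semantic vector

    def entity_vector(self, word_matrix):
        # word_matrix: (batch, n_words, dim) description word vectors,
        # summed and averaged before the fully connected layer
        return torch.tanh(self.fc_entity(word_matrix.mean(dim=1)))

def sample_loss(ctx, target, negative):
    # Loss_e = max(0, 1 - sim(e, e*) + sim(e, e'))
    cos = nn.functional.cosine_similarity
    return torch.clamp(1 - cos(ctx, target) + cos(ctx, negative), min=0)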
The entity classification of step 6 specifically comprises:
1) read the semantic feature vector a of the entity e to be disambiguated from the file system;
2) read the set of semantic feature vectors B = {b_1, b_2, …, b_n} of the candidate entity set E = {e_1, e_2, …, e_n} from the file system;
3) traverse the candidate entity set and compute the cosine similarity l_i between a and each feature vector b_i in B:
l_i = (a · b_i) / (‖a‖ ‖b_i‖);
4) select the entity with the highest similarity, e* = argmax_i l_i, as the final prediction result;
5) end.
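As a sketch, this classification step reduces to a normalized dot product and an argmax; a and B are assumed here to be NumPy arrays loaded from the file system.

import numpy as np

def classify(a, B):
    # a: semantic feature vector of the entity to be disambiguated
    # B: (n, 50) matrix stacking the candidate vectors b_1 ... b_n
    a = a / np.linalg.norm(a)
    B = B / np.linalg.norm(B, axis=1, keepdims=True)
    l = B @ a                     # l_i = (a . b_i) / (||a|| ||b_i||)
    return int(np.argmax(l))      # index of the final target entity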
Beneficial effects: the entity disambiguation method based on word vectors and a convolutional neural network constructs semantic representations for the entity to be disambiguated and for the candidate entities in the knowledge base. The neural network model is trained on the training set; at disambiguation time, the entity to be disambiguated is input into the trained model, and its most similar candidate entity is output as the final target entity.
Drawings
For a more clear description of the invention, reference will now be made to the accompanying drawings, which form a part hereof and in which:
FIG. 1 is a block diagram of the entity disambiguation method based on word vectors and a convolutional neural network of the present invention.
Fig. 2 is a block diagram of a convolutional neural network model.
FIG. 3 is a flow chart of an entity classification phase.
Detailed Description
The present invention will be further described with reference to the accompanying drawings.
The flow chart of the entity disambiguation method based on the word vector and the convolutional neural network is shown in FIG. 1.
Step 0 is the initial state of the entity disambiguation method of the invention;
in the entity identification phase (steps 1-6):
step 1, collecting a text set containing an entity to be disambiguated according to an application scene;
step 2, constructing a knowledge base of the entity to be disambiguated according to the domain knowledge;
step 3, calling the Chinese word segmentation program ICTCLAS of the Chinese Academy of Sciences to perform part-of-speech tagging and word segmentation on the text set, then filtering out stop words according to a stop word list, and creating a noun dictionary for proper nouns and entity names that are difficult to recognize;
step 4, calling the Chinese word segmentation program ICTCLAS to perform part-of-speech tagging and word segmentation on the entity descriptions in the knowledge base, and filtering out stop words according to the stop word list;
step 5, determining each concerned entity to be disambiguated and the context characteristics thereof according to the application scene;
step 6, generating candidate entities: search the knowledge base and compare the name of each entity to be disambiguated in the text with the names of the entities in the knowledge base; entities with the same name are taken as candidate entities for that mention, which determines the candidate entity set of each entity to be disambiguated and the description features of each candidate entity in the set;
in the entity semantic representation phase (step 7-10):
step 7, forming a word vector matrix from the word vectors of the nouns in a fixed-size window centered on the entity to be disambiguated (after part-of-speech tagging and word segmentation of the text set, nouns are the words tagged with /n); the window size is 10;
1) call Google's deep learning program word2vec to train on a Wikipedia corpus, obtaining a word vector table L; each word vector is 200-dimensional and each dimension is a real number;
2) for each noun w_i in the context {w_1, w_2, …, w_i, …, w_K} of the entity e to be disambiguated, query the word vector table L to obtain its word vector v_i;
3) construct the context word vector matrix [v_1, v_2, v_3, …, v_i, …, v_K] of the entity e to be disambiguated from the word vectors of its context words;
4) end.
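The word vector table L of substep 1) can be obtained with, for example, the gensim implementation of word2vec. Google's original word2vec is a command-line tool, so the snippet below is only a stand-in sketch, and the corpus variable is a placeholder.

from gensim.models import Word2Vec

sentences = [["以", "分词", "后", "的", "维基", "语料", "替换"]]  # placeholder corpus
model = Word2Vec(sentences, vector_size=200,   # 200-dimensional vectors
                 window=5, min_count=1, workers=4)
model.wv.save("word_vector_table_L.kv")        # persist the table L
v = model.wv[sentences[0][0]]                  # lookup: v_i = L[w_i]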
step 8, forming a word vector matrix from the word vectors of the 20 highest-weighted nouns after computing TF·IDF over the abstract information of each entity in the knowledge base; if fewer than 20 nouns exist, all of them are taken;
1) for each noun w_i in the description of each candidate entity e_i in the candidate entity set E = {e_1, e_2, …, e_n}, query the word vector table L to obtain its word vector v_i;
2) construct the word vector matrix of the entity description from the word vectors of the nouns in the description features;
3) end.
Step 9, the word vector matrix in the step 7 is used as the context semantic feature of the entity to be disambiguated;
step 10, the word vector matrix in the step 8 is used as the semantic feature of the knowledge base entity;
in the neural network learning training phase (step 11-12):
step 11, combining known unambiguous entities in the text with knowledge base entities to form a training set;
step 12, inputting the training set in the step 11 into a convolutional neural network model for training, and adjusting parameters in the model;
1) each training sample, consisting of the semantic features of an entity to be disambiguated and the semantic features of its candidate entity set, is input into the neural network model;
2) the semantic features to be disambiguated are convolved; the number of convolution kernel feature maps is set to 200, and the kernel size is [2, 200], i.e., a matrix of length 2 and width 200;
3) the convolution result of each kernel is pooled with 1-max pooling to obtain one feature per kernel;
4) the 200 kernel features form an intermediate result, which is input into a fully connected layer of size 50, finally yielding a 50-dimensional semantic feature vector;
5) the semantic features of the candidate entity set are summed and averaged, then input into a fully connected layer, also of size 50, finally yielding a 50-dimensional semantic feature vector;
6) the loss function Loss_e of each training sample in the neural network is defined as:
Loss_e = max(0, 1 - sim(e, e*) + sim(e, e′))
where e* denotes the target entity of the entity e to be disambiguated and e′ denotes any other candidate entity in the candidate entity set; the loss thus maximizes the margin between the similarity of the semantic feature vectors of the target entity and of any other candidate entity;
the global loss function is defined as Loss = Σ_e Loss_e;
7) the parameters of the neural network are initialized from the uniform distribution U(-0.01, 0.01);
8) all activation functions in the neural network are tanh (hyperbolic tangent);
9) the parameters of the neural network are updated by stochastic gradient descent;
10) end.
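Substeps 7) to 9) amount to the following training loop, reusing the DisambiguationCNN and sample_loss sketched earlier; the learning rate, epoch count, and shape of training_samples are assumptions not fixed by the text.

import torch

model = DisambiguationCNN()                    # parameters init to U(-0.01, 0.01)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)  # stochastic gradient descent
training_samples = []  # assumed iterable of (context, target, negative) tensor triples

for epoch in range(10):
    for ctx_m, target_m, negative_m in training_samples:
        ctx = model.context_vector(ctx_m)
        loss = sample_loss(ctx, model.entity_vector(target_m),
                           model.entity_vector(negative_m)).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()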
In the entity classification phase (steps 13-14):
step 13, reading a sample set of the entity to be disambiguated and the knowledge base candidate entity in the text;
step 14, traversing the sample set read in step 13, inputting each sample into the convolutional neural network model obtained by training in step 12, and outputting a classification result;
step 15 is the end step of the entity disambiguation method based on word vectors and a convolutional neural network of the present invention.
FIG. 2 details the structure of the neural network used in step 12 of the learning and training phase of FIG. 1; it comprises the following components:
word vector matrices: the context word vector matrix of the entity to be disambiguated and the word vector matrix of the knowledge-base entity description features serve as the inputs of the convolutional neural network;
convolutional layer: the context word vector matrix of the entity to be disambiguated is convolved with 200 different convolution kernels to obtain one feature per kernel;
1-max pooling layer: the output features of the convolutional layer are 1-max pooled to obtain a 200-dimensional intermediate result;
fully connected layers: a fully connected layer of size 50 is attached to the intermediate result, and another fully connected layer of size 50 is attached to the averaged word vectors of the knowledge-base candidate entity, yielding two 50-dimensional semantic feature vectors;
similarity calculation: the cosine similarity of the two semantic feature vectors is computed.
FIG. 3 is a detailed flow description of step 14 in the entity classification phase of FIG. 1:
step 16 is the start state diagram of FIG. 3;
step 17, reading a trained neural network model in the file system;
step 18, reading a sample set of the entity to be disambiguated and the knowledge base candidate entity in the text;
step 19, inputting the sample set into a convolutional neural network model to obtain semantic feature vectors, traversing the knowledge base candidate entity set, and calculating the cosine similarity of the semantic feature vectors of the entity to be disambiguated and each candidate entity;
step 20, outputting the entity with the highest similarity as a final target entity;
step 21 is the ending state diagram of FIG. 3;
Specifically, the method comprises the following steps:
1) read the semantic feature vector a of the entity e to be disambiguated from the file system;
2) read the set of semantic feature vectors B = {b_1, b_2, …, b_n} of the candidate entity set E = {e_1, e_2, …, e_n} from the file system;
3) traverse the candidate entity set and compute the cosine similarity l_i between a and each feature vector b_i in B:
l_i = (a · b_i) / (‖a‖ ‖b_i‖);
4) select the entity with the highest similarity, e* = argmax_i l_i, as the final prediction result;
5) end.
In summary, the present invention combines word vectors and a convolutional neural network: word vector matrices are constructed for the context of the entity to be disambiguated and for the abstract information of the candidate entities in the knowledge base, these matrices are input into a convolutional neural network model, and the model is trained and its parameters tuned. In the prediction phase, the most similar entity is output as the target entity. This overcomes the weak semantic representation caused by the vocabulary gap of the traditional bag-of-words model and thereby improves the accuracy of entity disambiguation.
The above description is only of the preferred embodiments of the present invention, and it should be noted that: it will be apparent to those skilled in the art that various modifications and adaptations can be made without departing from the principles of the invention and these are intended to be within the scope of the invention.

Claims (3)

1. An entity disambiguation method based on word vectors and a convolutional neural network, characterized in that the method comprises the following steps:
Step 1: collect a text set containing entities to be disambiguated according to the application scenario, preprocess it, and determine each entity to be disambiguated and its context features in the text set;
Step 2: construct a knowledge base of the entities to be disambiguated according to domain knowledge, search the knowledge base, and determine the candidate entity set of each entity to be disambiguated and the description features of each candidate entity in the set;
Step 3: take the word vectors of the nouns in a fixed-size window centered on the entity to be disambiguated to form a word vector matrix as the context semantic feature of the entity to be disambiguated; compute TF·IDF over the abstract information of each entity in the knowledge base and take the word vectors of the 20 highest-weighted nouns to form a word vector matrix as the semantic feature of the knowledge-base entity;
Forming a word vector matrix from the word vectors of the nouns in a fixed-size window centered on the entity to be disambiguated specifically comprises:
1) call Google's deep learning program word2vec to train on a Wikipedia corpus, obtaining a word vector table L; each word vector is 200-dimensional and each dimension is a real number;
2) for each noun w_i in the context {w_1, w_2, …, w_i, …, w_K} of the entity e to be disambiguated, query the word vector table L to obtain its word vector v_i;
3) construct the context word vector matrix [v_1, v_2, v_3, …, v_i, …, v_K] of the entity e to be disambiguated from the word vectors of its context words;
4) end;
Forming a word vector matrix from the word vectors of the 20 highest-weighted nouns after computing TF·IDF over the abstract information of each knowledge-base entity specifically comprises:
1) for each noun w_i in the description of each candidate entity e_i in the candidate entity set E = {e_1, e_2, …, e_n}, query the word vector table L to obtain its word vector v_i;
2) construct the word vector matrix of the entity description from the word vectors of the nouns in the description features;
3) end;
Step 4: combine known unambiguous entities in the text with their target entities and candidate entities in the knowledge base to form a training set, input the training set into a convolutional neural network model for training, and tune the parameters of the model; specifically:
1) each training sample, consisting of the semantic features of an entity to be disambiguated and the semantic features of its candidate entity set, is input into the neural network model;
2) the semantic features to be disambiguated are convolved; the number of convolution kernel feature maps is set to 200, and the kernel size is [2, 200], i.e., a matrix of length 2 and width 200;
3) the convolution result of each kernel is pooled with 1-max pooling to obtain one feature per kernel;
4) the 200 kernel features form an intermediate result, which is input into a fully connected layer of size 50, finally yielding a 50-dimensional semantic feature vector;
5) the semantic features of the candidate entity set are summed and averaged, then input into a fully connected layer, also of size 50, finally yielding a 50-dimensional semantic feature vector;
6) the loss function Loss_e of each training sample in the neural network is defined as:
Loss_e = max(0, 1 - sim(e, e*) + sim(e, e′))
where e* denotes the target entity of the entity e to be disambiguated and e′ denotes any other candidate entity in the candidate entity set; the loss thus maximizes the margin between the similarity of the semantic feature vectors of the target entity and of any other candidate entity;
the global loss function is defined as Loss = Σ_e Loss_e;
7) the parameters of the neural network are initialized from the uniform distribution U(-0.01, 0.01);
8) all activation functions in the neural network are tanh (hyperbolic tangent);
9) the parameters of the neural network are updated by stochastic gradient descent;
10) end;
Step 5: input a sample consisting of each entity to be disambiguated and its knowledge-base candidate entity set into the convolutional neural network model obtained in step 4 to obtain the semantic feature vectors of the entity to be disambiguated and of each entity in the candidate set;
Step 6: calculate the cosine similarity between the entity to be disambiguated and each entity in the knowledge-base candidate entity set based on the semantic feature vectors, and take the candidate entity with the highest similarity as the final target entity of the entity to be disambiguated; specifically:
1) read the semantic feature vector a of the entity e to be disambiguated from the file system;
2) read the set of semantic feature vectors B = {b_1, b_2, …, b_n} of the candidate entity set E = {e_1, e_2, …, e_n} from the file system;
3) traverse the candidate entity set and compute the cosine similarity l_i between a and each feature vector b_i in B:
l_i = (a · b_i) / (‖a‖ ‖b_i‖);
4) select the entity with the highest similarity, e* = argmax_i l_i, as the final prediction result;
5) end.
2. The entity disambiguation method of claim 1, characterized in that the preprocessing in step 1 uses the Chinese word segmentation program ICTCLAS of the Chinese Academy of Sciences to perform part-of-speech tagging and word segmentation on the text set, then filters out stop words according to a stop word list, and creates a noun dictionary for proper nouns and entity names that are difficult to recognize.
3. The entity disambiguation method of claim 1, characterized in that step 2 calls the Chinese word segmentation program ICTCLAS to perform part-of-speech tagging and word segmentation on the entity descriptions in the knowledge base and filters out stop words according to the stop word list.
CN201710373502.2A 2017-05-24 2017-05-24 Entity disambiguation method based on word vector and convolutional neural network Active CN107102989B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710373502.2A CN107102989B (en) 2017-05-24 2017-05-24 Entity disambiguation method based on word vector and convolutional neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710373502.2A CN107102989B (en) 2017-05-24 2017-05-24 Entity disambiguation method based on word vector and convolutional neural network

Publications (2)

Publication Number Publication Date
CN107102989A CN107102989A (en) 2017-08-29
CN107102989B true CN107102989B (en) 2020-09-29

Family

ID=59670296

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710373502.2A Active CN107102989B (en) 2017-05-24 2017-05-24 Entity disambiguation method based on word vector and convolutional neural network

Country Status (1)

Country Link
CN (1) CN107102989B (en)

Families Citing this family (55)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107562729B (en) * 2017-09-14 2020-12-08 云南大学 Party building text representation method based on neural network and theme enhancement
CN107730002B (en) * 2017-10-13 2020-06-02 国网湖南省电力公司 Intelligent fuzzy comparison method for remote control parameters of communication gateway machine
CN107729509B (en) * 2017-10-23 2020-07-07 中国电子科技集团公司第二十八研究所 Discourse similarity determination method based on recessive high-dimensional distributed feature representation
CN110019792A (en) * 2017-10-30 2019-07-16 阿里巴巴集团控股有限公司 File classification method and device and sorter model training method
CN108280061B (en) * 2018-01-17 2021-10-26 北京百度网讯科技有限公司 Text processing method and device based on ambiguous entity words
CN108304552B (en) * 2018-02-01 2021-01-08 浙江大学 Named entity linking method based on knowledge base feature extraction
CN108335731A (en) * 2018-02-09 2018-07-27 辽宁工程技术大学 A kind of invalid diet's recommendation method based on computer vision
CN108399230A (en) * 2018-02-13 2018-08-14 上海大学 A kind of Chinese financial and economic news file classification method based on convolutional neural networks
CN108446269B (en) * 2018-03-05 2021-11-23 昆明理工大学 Word sense disambiguation method and device based on word vector
CN108573047A (en) * 2018-04-18 2018-09-25 广东工业大学 A kind of training method and device of Module of Automatic Chinese Documents Classification
CN108563766A (en) * 2018-04-19 2018-09-21 天津科技大学 The method and device of food retrieval
CN108959242B (en) * 2018-05-08 2021-07-27 中国科学院信息工程研究所 Target entity identification method and device based on part-of-speech characteristics of Chinese characters
CN108647785A (en) * 2018-05-17 2018-10-12 普强信息技术(北京)有限公司 A kind of neural network method for automatic modeling, device and storage medium
CN108647191B (en) * 2018-05-17 2021-06-25 南京大学 Sentiment dictionary construction method based on supervised sentiment text and word vector
CN108804595B (en) * 2018-05-28 2021-07-27 中山大学 Short text representation method based on word2vec
CN110555208B (en) * 2018-06-04 2021-11-19 北京三快在线科技有限公司 Ambiguity elimination method and device in information query and electronic equipment
CN108805290B (en) * 2018-06-28 2021-03-12 国信优易数据股份有限公司 Entity category determination method and device
CN108921213B (en) * 2018-06-28 2021-06-22 国信优易数据股份有限公司 Entity classification model training method and device
CN109101579B (en) * 2018-07-19 2021-11-23 深圳追一科技有限公司 Customer service robot knowledge base ambiguity detection method
CN108920467B (en) * 2018-08-01 2021-04-27 北京三快在线科技有限公司 Method and device for learning word meaning of polysemous word and search result display method
CN109325108B (en) * 2018-08-13 2022-05-27 北京百度网讯科技有限公司 Query processing method, device, server and storage medium
CN109241294A (en) * 2018-08-29 2019-01-18 国信优易数据有限公司 A kind of entity link method and device
CN109214007A (en) * 2018-09-19 2019-01-15 哈尔滨理工大学 A kind of Chinese sentence meaning of a word based on convolutional neural networks disappears qi method
CN109299462B (en) * 2018-09-20 2022-11-29 武汉理工大学 Short text similarity calculation method based on multi-dimensional convolution characteristics
CN109614615B (en) * 2018-12-04 2022-04-22 联想(北京)有限公司 Entity matching method and device and electronic equipment
CN109740728B (en) * 2018-12-10 2019-11-01 杭州世平信息科技有限公司 A kind of measurement of penalty calculation method based on a variety of neural network ensembles
CN109635114A (en) * 2018-12-17 2019-04-16 北京百度网讯科技有限公司 Method and apparatus for handling information
CN109933788B (en) * 2019-02-14 2023-05-23 北京百度网讯科技有限公司 Type determining method, device, equipment and medium
CN110263324B (en) * 2019-05-16 2021-02-12 华为技术有限公司 Text processing method, model training method and device
CN110598846B (en) * 2019-08-15 2022-05-03 北京航空航天大学 Hierarchical recurrent neural network decoder and decoding method
CN110705292B (en) * 2019-08-22 2022-11-29 成都信息工程大学 Entity name extraction method based on knowledge base and deep learning
CN110569506A (en) * 2019-09-05 2019-12-13 清华大学 Medical named entity recognition method based on medical dictionary
CN110705295B (en) * 2019-09-11 2021-08-24 北京航空航天大学 Entity name disambiguation method based on keyword extraction
CN110674304A (en) * 2019-10-09 2020-01-10 北京明略软件系统有限公司 Entity disambiguation method and device, readable storage medium and electronic equipment
CN110826331B (en) * 2019-10-28 2023-04-18 南京师范大学 Intelligent construction method of place name labeling corpus based on interactive and iterative learning
CN110852106B (en) * 2019-11-06 2024-05-03 腾讯科技(深圳)有限公司 Named entity processing method and device based on artificial intelligence and electronic equipment
CN110852108B (en) * 2019-11-11 2022-03-29 中山大学 Joint training method, apparatus and medium for entity recognition and entity disambiguation
CN113010633B (en) * 2019-12-20 2023-01-31 海信视像科技股份有限公司 Information interaction method and equipment
CN111241298B (en) * 2020-01-08 2023-10-10 腾讯科技(深圳)有限公司 Information processing method, apparatus, and computer-readable storage medium
CN111241824B (en) * 2020-01-09 2020-11-24 中国搜索信息科技股份有限公司 Method for identifying Chinese metaphor information
CN111310481B (en) * 2020-01-19 2021-05-18 百度在线网络技术(北京)有限公司 Speech translation method, device, computer equipment and storage medium
CN111597804B (en) * 2020-05-15 2023-03-10 腾讯科技(深圳)有限公司 Method and related device for training entity recognition model
CN111709243B (en) * 2020-06-19 2023-07-07 南京优慧信安科技有限公司 Knowledge extraction method and device based on deep learning
CN112069826B (en) * 2020-07-15 2021-12-07 浙江工业大学 Vertical domain entity disambiguation method fusing topic model and convolutional neural network
CN112100356A (en) * 2020-09-17 2020-12-18 武汉纺织大学 Knowledge base question-answer entity linking method and system based on similarity
CN112257443B (en) * 2020-09-30 2024-04-02 华泰证券股份有限公司 MRC-based company entity disambiguation method combined with knowledge base
CN112464669B (en) * 2020-12-07 2024-02-09 宁波深擎信息科技有限公司 Stock entity word disambiguation method, computer device, and storage medium
CN112966117A (en) * 2020-12-28 2021-06-15 成都数之联科技有限公司 Entity linking method
CN112580351B (en) * 2020-12-31 2022-04-19 成都信息工程大学 Machine-generated text detection method based on self-information loss compensation
CN113761218B (en) * 2021-04-27 2024-05-10 腾讯科技(深圳)有限公司 Method, device, equipment and storage medium for entity linking
CN113283236B (en) * 2021-05-31 2022-07-19 北京邮电大学 Entity disambiguation method in complex Chinese text
CN113361283B (en) * 2021-06-28 2024-09-24 东南大学 Paired entity joint disambiguation method for Web form
CN113704416B (en) * 2021-10-26 2022-03-04 深圳市北科瑞声科技股份有限公司 Word sense disambiguation method and device, electronic equipment and computer-readable storage medium
CN114298028B (en) * 2021-12-13 2024-09-03 盈嘉互联(北京)科技有限公司 BIM semantic disambiguation method and system
CN116976324A (en) * 2022-04-21 2023-10-31 北京沃东天骏信息技术有限公司 Disambiguation method and disambiguation device for product words

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104572892A (en) * 2014-12-24 2015-04-29 中国科学院自动化研究所 Text classification method based on recurrent convolutional network
CN106295796A (en) * 2016-07-22 2017-01-04 浙江大学 Entity linking method based on deep learning
CN106547735A (en) * 2016-10-25 2017-03-29 复旦大学 Construction and use of context-aware dynamic word or character vectors based on deep learning
CN106570170A (en) * 2016-11-09 2017-04-19 武汉泰迪智慧科技有限公司 Integrated text classification and named entity recognition method and system based on deep recurrent neural network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Named Entity Disambiguation Method Based on Chinese Wikipedia; Du Jingjun et al.; Journal of Hangzhou Dianzi University; 2012-12-31; Vol. 32, No. 6; pp. 57-60 *

Also Published As

Publication number Publication date
CN107102989A (en) 2017-08-29

Similar Documents

Publication Publication Date Title
CN107102989B (en) Entity disambiguation method based on word vector and convolutional neural network
WO2021114745A1 (en) Named entity recognition method employing affix perception for use in social media
Kumar et al. Identifying clickbait: A multi-strategy approach using neural networks
CN104615767B (en) Training method, search processing method and the device of searching order model
CN106570141B (en) Approximate repeated image detection method
CN104391942B (en) Short text feature extension method based on semantic graphs
CN112084331A (en) Text processing method, text processing device, model training method, model training device, computer equipment and storage medium
CN107085581A (en) Short text classification method and device
CN107480143A (en) Dialogue topic dividing method and system based on context dependence
CN106095749A (en) Text keyword extraction method based on deep learning
Tuan et al. Multimodal fusion with BERT and attention mechanism for fake news detection
CN110347790B (en) Text duplicate checking method, device and equipment based on attention mechanism and storage medium
CN112434533B (en) Entity disambiguation method, entity disambiguation device, electronic device, and computer-readable storage medium
CN110222328B (en) Method, device and equipment for labeling participles and parts of speech based on neural network and storage medium
Zhang et al. Relation classification via BiLSTM-CNN
CN106933787A (en) Method for computing judgment document similarity, search device and computer equipment
CN108304377B (en) Extraction method of long-tail words and related device
CN110569405A (en) method for extracting government affair official document ontology concept based on BERT
CN105760363B (en) Word sense disambiguation method and device for text file
CN105740448B (en) Topic-oriented temporal summarization method for multiple microblogs
CN117251551B (en) Natural language processing system and method based on large language model
JP7181999B2 (en) SEARCH METHOD AND SEARCH DEVICE, STORAGE MEDIUM
CN110390104B (en) Irregular text transcription method and system for voice dialogue platform
Le Huy et al. Keyphrase extraction model: a new design and application on tourism information
CN113076744A (en) Cultural relic knowledge relation extraction method based on convolutional neural network

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant