CN112215000A - Text classification method based on entity replacement - Google Patents

Text classification method based on entity replacement

Info

Publication number
CN112215000A
Authority
CN
China
Prior art keywords: document, vector, entity, disambiguation, anchor
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011131161.6A
Other languages
Chinese (zh)
Other versions
CN112215000B (en)
Inventor
刘洪涛
章家涵
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University of Post and Telecommunications
Original Assignee
Chongqing University of Post and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University of Post and Telecommunications
Priority to CN202011131161.6A
Publication of CN112215000A
Application granted
Publication of CN112215000B
Legal status: Active

Classifications

    • G PHYSICS
      • G06 COMPUTING; CALCULATING OR COUNTING
        • G06F ELECTRIC DIGITAL DATA PROCESSING
          • G06F40/00 Handling natural language data
            • G06F40/20 Natural language analysis
              • G06F40/279 Recognition of textual entities
                • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
                  • G06F40/295 Named entity recognition
            • G06F40/30 Semantic analysis
          • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
            • G06F16/30 Information retrieval of unstructured textual data
              • G06F16/35 Clustering; Classification
        • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
          • G06N3/00 Computing arrangements based on biological models
            • G06N3/02 Neural networks
              • G06N3/04 Architecture, e.g. interconnection topology
                • G06N3/044 Recurrent networks, e.g. Hopfield networks
                • G06N3/045 Combinations of networks

Abstract

The invention claims a text classification method based on entity replacement, belonging to the field of natural language processing, and specifically comprising the following steps: (1) detecting the anchor phrases in a document using an external knowledge base and querying the entity set corresponding to each anchor phrase; (2) averaging the document word vectors to obtain a context vector of the document; (3) computing the attention weights of the entities corresponding to each anchor phrase under the context vector to obtain a disambiguation vector for each anchor phrase; (4) replacing the anchor phrases at their original positions with the disambiguation entity vectors and inputting the sequence into a long short-term memory network to obtain a disambiguated document representation vector, inputting this vector into a fully connected layer of the neural network, and computing with a classifier the probability of each text belonging to each category to train the network; (5) predicting the category of the text to be classified with the trained model, and outputting the category with the highest probability as the prediction. The method eliminates semantic ambiguity of words in the document while retaining word-order and context information, so the text content can be classified more accurately.

Description

Text classification method based on entity replacement
Technical Field
The invention belongs to the field of natural language processing, and particularly relates to a text classification method based on entity replacement.
Background
Text classification is an important task in natural language processing: given a text object, it is assigned to one of a set of predefined categories according to the characteristics of the text. It is widely applied in scenarios such as topic classification, spam detection, and sentiment classification. In recent years, deep learning and machine learning have made great progress in natural language processing, and recent studies show that neural-network-based models outperform traditional models (e.g., naive Bayes) on text classification tasks. Typical neural text classification models are word-based: they use the words of the target document as model input, map the words into a continuous vector space (word embeddings), and combine these vectors by methods such as summing, averaging, convolutional neural networks (CNN), or recurrent neural networks (RNN) to capture the semantics of the document.
In addition to the above methods, some studies attempt to capture semantic information using entities in a knowledge base (KB). This approach represents a document with a set of entities (an "entity bag") related to the document. The benefit of using entities is that, unlike words, entities provide unambiguous semantic information because they are uniquely identified in the knowledge base, whereas words may be semantically ambiguous (e.g., "apple" may refer to the fruit or to Apple Inc., with different meanings in different contexts). However, as with the earlier bag-of-words approach, simply representing a document as a set of entities loses word-order information; moreover, some non-entity descriptive words also carry rich information.
Disclosure of Invention
The present invention aims to solve the above problems of the prior art. The proposed text classification method based on entity replacement resolves semantic ambiguity by finding a suitable entity to replace each semantically ambiguous word in the original text, while retaining the word-order information and descriptive information of the original text. The technical scheme of the invention is as follows:
A text classification method based on entity replacement, comprising the following steps:
S1, detecting the anchor phrases in a document using an external knowledge base and querying the entity set corresponding to each anchor phrase;
S2, averaging the document word vectors, obtained from a pre-trained embedding matrix, to obtain a context vector of the document;
S3, computing, for each anchor phrase, the attention weights of its corresponding entities under the document context vector to obtain a disambiguation vector for each anchor phrase;
S4, replacing the anchor phrases at their original positions with disambiguation entity vectors and inputting the sequence into a long short-term memory (LSTM) network to obtain a disambiguated document representation vector, inputting this vector into a fully connected layer of the neural network, and computing with a classifier the probability that each text belongs to each category to train the network;
S5, predicting the category of the text to be classified with the trained model, and outputting the category with the highest probability as the prediction.
Further, in step S1, detecting the anchor phrases in the document using the external knowledge base and querying the entity set corresponding to each anchor phrase comprises the following steps:
S11, defining an entity as a determined, unambiguous object in the knowledge base, and an "anchor phrase" as its literal surface text; one anchor phrase may correspond to multiple entities, and one entity may be represented by multiple anchor phrases;
S12, collecting all anchor phrases in the external corpus Wikipedia; for each anchor phrase s, taking all entities \{e_1, e_2, \ldots, e_K\} linked to it as its entity dictionary; all anchor phrases together with their entity dictionaries form the Wikipedia dictionary;
S13, extracting all n-gram phrases (n ≤ k) in a document T, where an n-gram phrase is a phrase formed by n consecutive words; if an n-gram exists as an anchor phrase in the Wikipedia dictionary and has at least two corresponding entities, it is added to the candidate anchor phrases; for n-grams with conflicting coverage, a "first longest" rule is adopted, i.e., the longest n-gram phrase that appears first is selected. All anchor phrases in a document are represented as:

U(T) = \{c_1, c_2, \ldots\}

and the entity set corresponding to the i-th anchor phrase is represented as:

E(c_i) = \{e_1, e_2, \ldots\}
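As an illustration of step S1, a minimal Python sketch of the dictionary lookup with the "first longest" rule is given below. The function and variable names are hypothetical, and the Wikipedia dictionary is assumed to be pre-built as a mapping from surface phrases to candidate entity lists:

    def detect_anchor_phrases(tokens, wiki_dict, k=5):
        """Extract candidate anchor phrases from a token list (step S13).

        wiki_dict maps a surface phrase to its candidate entity list; only
        phrases with at least two candidate entities are kept. Conflicting
        coverage is resolved with the "first longest" rule: scan left to
        right and, at each position, take the longest matching n-gram.
        """
        anchors = []  # list of (start, end, phrase, entities)
        i = 0
        while i < len(tokens):
            match = None
            for n in range(min(k, len(tokens) - i), 0, -1):  # longest first
                phrase = " ".join(tokens[i:i + n])
                entities = wiki_dict.get(phrase)
                if entities is not None and len(entities) >= 2:
                    match = (i, i + n, phrase, entities)
                    break
            if match:
                anchors.append(match)
                i = match[1]  # skip past the matched phrase
            else:
                i += 1
        return anchors

    # Toy usage with a two-entry dictionary (illustrative data):
    wiki_dict = {"apple": ["Apple_Inc.", "Apple_(fruit)"]}
    print(detect_anchor_phrases("i ate an apple".split(), wiki_dict))
    # -> [(3, 4, 'apple', ['Apple_Inc.', 'Apple_(fruit)'])]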
further, in step S2, averaging the document word vectors to obtain a context vector of the document, including the following steps:
s21, pre-training by using a Wikipedia2Vec tool to obtain an embedded matrix of words and entities, and enabling the word vector of the ith word in the document
Figure BDA0002735206930000035
Representing x as a d-dimensional vector),
Figure BDA0002735206930000036
representing d-dimensional space, d representing the degree of dimension, and the length of the document being n, the sentence is represented as:
x1:n=[x1;x2;...;xn]
s22, averaging the word vectors of the document T to obtain the context vector of the document, wherein the calculation formula is as follows:
Figure BDA0002735206930000031
where C is a context vector for the document.
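A minimal sketch of steps S21 and S22, assuming the pre-trained Wikipedia2Vec word vectors for the document have already been looked up into a matrix (names are illustrative):

    import numpy as np

    def context_vector(word_vectors: np.ndarray) -> np.ndarray:
        """Average the document's word vectors to get the context vector C (S22).

        word_vectors: array of shape (n, d), one row per word vector x_i.
        """
        return word_vectors.mean(axis=0)

    # e.g. a toy document of n=4 words with d=300 dimensions:
    C = context_vector(np.random.rand(4, 300))  # C has shape (300,)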
Further, in step S3, computing for each anchor phrase the attention weights of its corresponding entities under the document context vector to obtain the disambiguation vector of each anchor phrase comprises the following steps:
S31, obtaining the vector representations of the entities matched in step S1 from the embedding matrix pre-trained with the Wikipedia2Vec tool in step S21. Let the j-th entity vector corresponding to the i-th anchor phrase in the document be e_{ij} \in \mathbb{R}^d.
S32, for each anchor phrase, calculating the attention weights of its corresponding entity vectors under the context vector obtained in step S2, and then summing the weighted entity vectors to obtain the disambiguation vector of the anchor phrase:

\alpha_{ij} = \frac{\exp(C^{\top} e_{ij})}{\sum_{k=1}^{v} \exp(C^{\top} e_{ik})}

z_i = \sum_{j=1}^{v} \alpha_{ij} e_{ij}

where \alpha_{ij} is the attention weight, under the context C, of the j-th entity corresponding to the i-th anchor phrase of the document, v is the number of entities corresponding to the i-th anchor phrase, and z_i is the disambiguation vector of the i-th anchor phrase.
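A sketch of the attention computation in steps S31 and S32; the dot-product score normalized by softmax used here is one concrete form consistent with the formulas above:

    import numpy as np

    def disambiguation_vector(C: np.ndarray, entity_vectors: np.ndarray) -> np.ndarray:
        """Attention-weighted sum of candidate entity vectors for one anchor phrase.

        C: (d,) document context vector.
        entity_vectors: (v, d) matrix of e_i1 ... e_iv.
        Returns z_i = sum_j alpha_ij * e_ij.
        """
        scores = entity_vectors @ C                 # dot-product scores, shape (v,)
        scores = scores - scores.max()              # numerical stability
        alpha = np.exp(scores) / np.exp(scores).sum()  # softmax -> alpha_ij
        return alpha @ entity_vectors               # z_i, shape (d,)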
Further, in step S4, replacing the anchor phrases at their original positions with disambiguation entity vectors and inputting the sequence into a long short-term memory (LSTM) network to obtain a disambiguated document representation vector, inputting this vector into a fully connected layer of the neural network, and computing with a classifier the probability that each text belongs to each category to train the network, comprises the following steps:
S41, replacing the anchor phrases of the original document with the corresponding disambiguation vectors obtained in step S3; the document can then be represented as T = [x_1; \ldots; z_1; \ldots; z_v; \ldots; x_n], where z_v denotes the last disambiguation vector and x_n the last original word vector; for convenience of description this is written as [l_1; \ldots; l_r], where r is the number of vectors contained after replacement;
S42, for the document T, the word vectors and disambiguation vectors are input into a bidirectional LSTM network in order: the forward direction receives l_1, \ldots, l_r in sequence and the backward direction receives l_r, \ldots, l_1 in sequence; the hidden states of each position in the forward and backward directions are computed and summed to obtain the final disambiguated document representation vector:

\overrightarrow{h_i} = f(\overrightarrow{h_{i-1}}, l_i)

\overleftarrow{h_i} = f(\overleftarrow{h_{i+1}}, l_i)

o = \overrightarrow{h_r} + \overleftarrow{h_1}

where l_i is the i-th vector in the document representation, f is the hidden-state computation function of the LSTM, \overrightarrow{h_i} is the hidden state of the i-th vector in the forward LSTM, \overleftarrow{h_i} is the hidden state of the i-th vector in the backward LSTM, and o is the disambiguated vector of the document;
S43, inputting the disambiguated document vector into the fully connected layer and computing the probability that the document belongs to each category with softmax normalization; finally, the negative log-likelihood is used as the loss function, the model parameters are updated iteratively by stochastic gradient descent with back-propagation, and the model is trained by minimizing the loss:

p = softmax(W_c o + b_c)

L(\theta) = -\sum_{(x, y)} \log p(y \mid x; \theta)

where W_c is the weight matrix of the fully connected layer, b_c is the bias term, softmax is the normalization operation, p is the vector of probabilities of the document belonging to each category, x is a document in the training set, y is its true category label, the sum runs over all training pairs (x, y), and \theta denotes the model parameters.
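The following PyTorch sketch condenses steps S41 to S43; the framework and hyperparameters are illustrative choices not specified by the invention, and the final forward and backward hidden states are summed to form o, matching the description above:

    import torch
    import torch.nn as nn

    class EntityReplaceClassifier(nn.Module):
        """BiLSTM over the replaced sequence [l_1; ...; l_r], then softmax (S42-S43)."""

        def __init__(self, d: int, hidden: int, num_classes: int):
            super().__init__()
            self.hidden = hidden
            self.lstm = nn.LSTM(d, hidden, bidirectional=True, batch_first=True)
            self.fc = nn.Linear(hidden, num_classes)  # plays the role of W_c, b_c

        def forward(self, seq: torch.Tensor) -> torch.Tensor:  # seq: (batch, r, d)
            out, _ = self.lstm(seq)                # (batch, r, 2*hidden)
            fwd = out[:, -1, :self.hidden]         # final forward hidden state
            bwd = out[:, 0, self.hidden:]          # final backward hidden state
            o = fwd + bwd                          # summed document vector o
            return self.fc(o)                      # logits; softmax applied in the loss

    model = EntityReplaceClassifier(d=300, hidden=128, num_classes=5)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    loss_fn = nn.CrossEntropyLoss()                # softmax + negative log-likelihood

    logits = model(torch.randn(1, 20, 300))        # one toy "document" of r=20 vectors
    loss = loss_fn(logits, torch.tensor([2]))      # true category label y
    loss.backward()                                # back-propagation
    optimizer.step()                               # stochastic gradient descent update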
The invention has the following advantages and beneficial effects:
the invention provides a text classification method based on entity replacement, which is characterized in that a knowledge base and an attention mechanism are utilized to find out a proper entity to replace a semantically fuzzy word in an original text, and a document expression vector after ambiguity removal is obtained. The semantic ambiguity problem is solved, and simultaneously the word order information and the descriptive information in the original text are kept. Therefore, the understanding of the model to the semantics of the documents is improved, and the documents are classified more reliably and accurately.
The main innovation of the method is that the semantically unclear phrases or words at the corresponding positions in the document original text are replaced by the unambiguous entities in the knowledge base, so that the method only finds out the entities and considers the entities as an unordered set, and the language order information and other descriptive information are kept. For each ambiguous phrase, the most likely entity of the phrase is found out by using an attention mechanism, thereby improving the accuracy of determining the entity.
Drawings
FIG. 1 is a flow chart of a preferred embodiment of the present invention;
FIG. 2 is a network structure diagram of the text classification method based on entity replacement according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be described in detail and clearly with reference to the accompanying drawings. The described embodiments are only some of the embodiments of the present invention.
The technical scheme for solving the technical problems is as follows:
the invention mainly provides a text classification method based on entity replacement. The process flow shown in fig. 1 is used. An entity set related to the document is found out by using a knowledge base, a correct entity is selected by using an attention mechanism shown in fig. 2, a semantic fuzzy word in the original text is replaced, a document expression vector after ambiguity removal is obtained, and the language sequence information and the descriptive information in the original text are also retained while the semantic fuzzy problem is solved.
The text classification method based on entity replacement comprises the following steps:
s1, detecting the anchor phrases in the document by using an external knowledge base and inquiring an entity set corresponding to each anchor phrase;
in this embodiment, the sub-steps of specifically implementing S1 are as follows:
s11, defining the entity as the determined unambiguous object in the knowledge base; an "anchor phrase" is a literal. One anchor phrase may correspond to a plurality of entities, and an entity may also be represented by a plurality of anchor phrases;
s12 collecting all anchor phrases in the external corpus Wikipedia, for each anchor phrase S, all entities { e } connected with it1,e2,...eKAs its physical dictionary. All the anchor phrases and the entity dictionary form a Wikipedia dictionary together;
s13, extracting all n-grams phrases (n ≦ k) in the document T, and adding an n-gram to the candidate anchor phrase if the n-gram can exist as an anchor phrase in the Wikipedia dictionary and there are at least two corresponding entities. All anchor phrases in a document are represented as:
U(T)={c1,c2,...}
the entity set corresponding to the ith anchor phrase is represented as:
E(ci)={e1,e2,...}
s2, averaging the word vectors of the documents to obtain context vectors of the documents;
in this embodiment, the sub-steps of specifically implementing S2 are as follows:
s21, pre-training by using Wikipedia2Vec tool to obtain embedded matrix of words and entitiesLet the word vector of the ith word in the document
Figure BDA0002735206930000061
If the document length is n, the sentence is represented as:
x1:n=[x1;x2;...;xn]
s22: averaging the word vectors of the document T to obtain a context vector of the document, wherein the calculation formula is as follows:
Figure BDA0002735206930000062
where C is a context vector for the document.
S3, respectively calculating the attention weight of the entity corresponding to each anchor phrase under the document context expression vector to obtain a disambiguation vector of each entity;
in this embodiment, the sub-steps of specifically implementing S3 are as follows:
and S31, obtaining a vector representation corresponding to the entity matched in the step S1 by means of the embedding matrix pre-trained by the Wikipedia2Vec tool in the step S21. Let the jth entity vector corresponding to the ith anchor phrase in the document
Figure BDA0002735206930000073
S32, for each anchor phrase, calculating the attention weight of the corresponding entity vector under the context expression vector obtained in the step S2, and then weighting and summing the entity vectors to obtain the disambiguation vector of each anchor phrase. The calculation formula is as follows:
Figure BDA0002735206930000071
Figure BDA0002735206930000072
wherein alpha isijAnchoring short for ith documentThe attention weight of the jth entity corresponding to the language under the context C, v is the number of entities corresponding to the ith anchor phrase of the document, and ziA disambiguation vector for the ith anchor phrase of the document.
S4, replacing the entity on the original position with a disambiguation entity vector and inputting a long-time memory network to obtain a document expression vector after disambiguation, inputting the document expression vector into a full connection layer of a neural network, and using a classifier to calculate the probability of each text belonging to each category to train the network;
in this embodiment, the sub-steps of specifically implementing S4 are as follows:
s41, replacing the anchor phrase of the original document with the corresponding disambiguation vector obtained in the step S3, and then the document can be represented as T ═ x1;...;z1;...;zv;...;xn]For convenience of description, it is marked as [ l1;...;lr]Wherein r is the number of vectors contained after replacement;
s42, for the document T, the word vector and the disambiguation vector are input into a bidirectional long-time and short-time memory network in sequence, and for the forward direction of the long-time and short-time memory network, the word vector and the disambiguation vector are input into a bidirectional long-time and short-time memory network in sequence1,...,lrFor the reverse direction of the long-short term memory network, l is input in sequencer,...,l1(ii) a And calculating hidden layer state values of each word in the forward direction and the reverse direction, and summing the hidden layer state values to obtain a final disambiguation document representation vector. The calculation formula is as follows:
Figure BDA0002735206930000081
Figure BDA0002735206930000082
Figure BDA0002735206930000083
Figure BDA0002735206930000084
wherein liIs the ith vector in the document representation, f is a calculation function of the hidden layer state in the long-time and short-time memory network,
Figure BDA0002735206930000085
the ith vector in the document is represented as a hidden state vector in a forward long-and-short memory network,
Figure BDA0002735206930000086
representing the hidden layer state vector of the ith vector in the document in a reverse long-short time memory network, wherein o is a disambiguation vector of the document;
s43, inputting the disambiguation vector of the document into the full-link layer, using softmax normalization to calculate the probability of the document belonging to each category, finally using the log-likelihood function as a loss function, using the random gradient descent, using the back propagation iteration to update the model parameters, training the model by using the minimized loss function, and calculating
The formula is as follows:
p=softmax(Wco+bc)
Figure BDA0002735206930000087
wherein, WcIs a full connection layer weight matrix, bcFor bias terms, softmax is a normalization operation, p is the probability of a document belonging to each category, x is the document in the training set, y is its true category label, and θ is the model parameter.
And S5, predicting the category of the text to be predicted by using the trained model, and taking the category with the highest probability as the predicted category to be output.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.
The above examples are to be construed as merely illustrative and not limitative of the remainder of the disclosure. After reading the description of the invention, the skilled person can make various changes or modifications to the invention, and these equivalent changes and modifications also fall into the scope of the invention defined by the claims.

Claims (5)

1. A text classification method based on entity replacement, characterized by comprising the following steps:
S1, detecting the anchor phrases in a document using an external knowledge base and querying the entity set corresponding to each anchor phrase;
S2, averaging the document word vectors, obtained from a pre-trained embedding matrix, to obtain a context vector of the document;
S3, computing, for each anchor phrase, the attention weights of its corresponding entities under the document context vector to obtain a disambiguation vector for each anchor phrase;
S4, replacing the anchor phrases at their original positions with disambiguation entity vectors and inputting the sequence into a long short-term memory (LSTM) network to obtain a disambiguated document representation vector, inputting this vector into a fully connected layer of the neural network, and computing with a classifier the probability that each text belongs to each category to train the network;
S5, predicting the category of the text to be classified with the trained model, and outputting the category with the highest probability as the prediction.
2. The text classification method based on entity replacement according to claim 1, wherein in step S1, detecting the anchor phrases in the document using the external knowledge base and querying the entity set corresponding to each anchor phrase comprises the following steps:
S11, defining an entity as a determined, unambiguous object in the knowledge base, and an "anchor phrase" as its literal surface text; one anchor phrase may correspond to multiple entities, and one entity may be represented by multiple anchor phrases;
S12, collecting all anchor phrases in the external corpus Wikipedia; for each anchor phrase s, taking all entities \{e_1, e_2, \ldots, e_K\} linked to it as its entity dictionary; all anchor phrases together with their entity dictionaries form the Wikipedia dictionary;
S13, extracting all n-gram phrases (n ≤ k) in a document T, where an n-gram phrase is a phrase formed by n consecutive words; if an n-gram exists as an anchor phrase in the Wikipedia dictionary and has at least two corresponding entities, it is added to the candidate anchor phrases; for n-grams with conflicting coverage, a "first longest" rule is adopted, i.e., the longest n-gram phrase that appears first is selected. All anchor phrases in a document are represented as:

U(T) = \{c_1, c_2, \ldots\}

and the entity set corresponding to the i-th anchor phrase is represented as:

E(c_i) = \{e_1, e_2, \ldots\}
3. The text classification method based on entity replacement according to claim 2, wherein in step S2, averaging the document word vectors to obtain a context vector of the document comprises the following steps:
S21, pre-training with the Wikipedia2Vec tool to obtain an embedding matrix of words and entities. Let the word vector of the i-th word in the document be x_i \in \mathbb{R}^d, where \mathbb{R}^d denotes the d-dimensional vector space and d is the embedding dimension. With document length n, the sentence is represented as:

x_{1:n} = [x_1; x_2; \ldots; x_n]

S22, averaging the word vectors of the document T to obtain the context vector of the document:

C = \frac{1}{n} \sum_{i=1}^{n} x_i

where C is the context vector of the document.
4. The text classification method based on entity replacement according to claim 3, wherein in step S3, computing for each anchor phrase the attention weights of its corresponding entities under the document context vector to obtain the disambiguation vector of each anchor phrase comprises the following steps:
S31, obtaining the vector representations of the entities matched in step S1 from the embedding matrix pre-trained with the Wikipedia2Vec tool in step S21. Let the j-th entity vector corresponding to the i-th anchor phrase in the document be e_{ij} \in \mathbb{R}^d.
S32, for each anchor phrase, calculating the attention weights of its corresponding entity vectors under the context vector obtained in step S2, and then summing the weighted entity vectors to obtain the disambiguation vector of the anchor phrase:

\alpha_{ij} = \frac{\exp(C^{\top} e_{ij})}{\sum_{k=1}^{v} \exp(C^{\top} e_{ik})}

z_i = \sum_{j=1}^{v} \alpha_{ij} e_{ij}

where \alpha_{ij} is the attention weight, under the context C, of the j-th entity corresponding to the i-th anchor phrase of the document, v is the number of entities corresponding to the i-th anchor phrase, and z_i is the disambiguation vector of the i-th anchor phrase.
5. The text classification method based on entity replacement according to claim 4, wherein in step S4, replacing the anchor phrases at their original positions with disambiguation entity vectors and inputting the sequence into a long short-term memory (LSTM) network to obtain a disambiguated document representation vector, inputting this vector into a fully connected layer of the neural network, and computing with a classifier the probability that each text belongs to each category to train the network comprises the following steps:
S41, replacing the anchor phrases of the original document with the corresponding disambiguation vectors obtained in step S3; the document can then be represented as T = [x_1; \ldots; z_1; \ldots; z_v; \ldots; x_n], where z_v denotes the last disambiguation vector and x_n the last original word vector; for convenience of description this is written as [l_1; \ldots; l_r], where r is the number of vectors contained after replacement;
S42, for the document T, the word vectors and disambiguation vectors are input into a bidirectional LSTM network in order: the forward direction receives l_1, \ldots, l_r in sequence and the backward direction receives l_r, \ldots, l_1 in sequence; the hidden states of each position in the forward and backward directions are computed and summed to obtain the final disambiguated document representation vector:

\overrightarrow{h_i} = f(\overrightarrow{h_{i-1}}, l_i)

\overleftarrow{h_i} = f(\overleftarrow{h_{i+1}}, l_i)

o = \overrightarrow{h_r} + \overleftarrow{h_1}

where l_i is the i-th vector in the document representation, f is the hidden-state computation function of the LSTM, \overrightarrow{h_i} is the hidden state of the i-th vector in the forward LSTM, \overleftarrow{h_i} is the hidden state of the i-th vector in the backward LSTM, and o is the disambiguated vector of the document;
S43, inputting the disambiguated document vector into the fully connected layer and computing the probability that the document belongs to each category with softmax normalization; finally, the negative log-likelihood is used as the loss function, the model parameters are updated iteratively by stochastic gradient descent with back-propagation, and the model is trained by minimizing the loss:

p = softmax(W_c o + b_c)

L(\theta) = -\sum_{(x, y)} \log p(y \mid x; \theta)

where W_c is the weight matrix of the fully connected layer, b_c is the bias term, softmax is the normalization operation, p is the vector of probabilities of the document belonging to each category, x is a document in the training set, y is its true category label, the sum runs over all training pairs (x, y), and \theta denotes the model parameters.
CN202011131161.6A 2020-10-21 2020-10-21 Text classification method based on entity replacement Active CN112215000B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011131161.6A CN112215000B (en) 2020-10-21 2020-10-21 Text classification method based on entity replacement

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011131161.6A CN112215000B (en) 2020-10-21 2020-10-21 Text classification method based on entity replacement

Publications (2)

Publication Number Publication Date
CN112215000A 2021-01-12
CN112215000B CN112215000B (en) 2022-08-23

Family

ID=74056225

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011131161.6A Active CN112215000B (en) 2020-10-21 2020-10-21 Text classification method based on entity replacement

Country Status (1)

Country Link
CN (1) CN112215000B (en)

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102207945A (en) * 2010-05-11 2011-10-05 天津海量信息技术有限公司 Knowledge network-based text indexing system and method
CN103177075A (en) * 2011-12-30 2013-06-26 微软公司 Knowledge-based entity detection and disambiguation
CN103150382A (en) * 2013-03-14 2013-06-12 中国科学院计算技术研究所 Automatic short text semantic concept expansion method and system based on open knowledge base
CN106716402A (en) * 2014-05-12 2017-05-24 迪飞奥公司 Entity-centric knowledge discovery
CN108549723A (en) * 2018-04-28 2018-09-18 北京神州泰岳软件股份有限公司 A kind of text concept sorting technique, device and server
CN108984745A (en) * 2018-07-16 2018-12-11 福州大学 A kind of neural network file classification method merging more knowledge mappings
CN111199155A (en) * 2018-10-30 2020-05-26 飞狐信息技术(天津)有限公司 Text classification method and device
CN109657238A (en) * 2018-12-10 2019-04-19 宁波深擎信息科技有限公司 Context identification complementing method, system, terminal and the medium of knowledge based map
CN110825848A (en) * 2019-06-10 2020-02-21 北京理工大学 Text classification method based on phrase vectors
CN111209410A (en) * 2019-12-27 2020-05-29 中国地质大学(武汉) Anchor point-based dynamic knowledge graph representation learning method and system
CN111488455A (en) * 2020-04-03 2020-08-04 上海携旅信息技术有限公司 Model training method, text classification method, system, device and medium

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
SHENGZE HU et al.: "Entity Linking via Symmetrical Attention-Based Neural Network and Entity Structural Features", Symmetry *
WEI SHEN et al.: "Entity Linking with a Knowledge Base: Issues, Techniques, and Solutions", IEEE Transactions on Knowledge and Data Engineering *
牛翊童 (NIU Yitong): "Research on Named Entity Disambiguation Methods Based on Knowledge Graphs" (基于知识图谱的命名实体消歧方法研究), Computer Products and Circulation (计算机产品与流通) *

Also Published As

Publication number Publication date
CN112215000B (en) 2022-08-23

Similar Documents

Publication Publication Date Title
Kim et al. Two-stage multi-intent detection for spoken language understanding
CN110245229B (en) Deep learning theme emotion classification method based on data enhancement
CN109726389B (en) Chinese missing pronoun completion method based on common sense and reasoning
WO2021109671A1 (en) Fine-granularity sentiment analysis method supporting cross-language transfer
McDonald et al. Identifying gene and protein mentions in text using conditional random fields
CN111738003B (en) Named entity recognition model training method, named entity recognition method and medium
CN111324752B (en) Image and text retrieval method based on graphic neural network structure modeling
CN107832306A (en) A kind of similar entities method for digging based on Doc2vec
CN109800437A (en) A kind of name entity recognition method based on Fusion Features
CN110263325B (en) Chinese word segmentation system
CN110309514A (en) A kind of method for recognizing semantics and device
US20180357531A1 (en) Method for Text Classification and Feature Selection Using Class Vectors and the System Thereof
US20230122900A1 (en) Speech recognition method and apparatus
CN113591483A (en) Document-level event argument extraction method based on sequence labeling
CN111104509B (en) Entity relationship classification method based on probability distribution self-adaption
CN110489523B (en) Fine-grained emotion analysis method based on online shopping evaluation
CN108491382A (en) A kind of semi-supervised biomedical text semantic disambiguation method
CN109408802A (en) A kind of method, system and storage medium promoting sentence vector semanteme
CN111222330B (en) Chinese event detection method and system
CN113128203A (en) Attention mechanism-based relationship extraction method, system, equipment and storage medium
CN111666752A (en) Circuit teaching material entity relation extraction method based on keyword attention mechanism
CN114781375A (en) Military equipment relation extraction method based on BERT and attention mechanism
Huang et al. Text classification with document embeddings
Yu et al. Stance detection in Chinese microblogs with neural networks
CN112699685A (en) Named entity recognition method based on label-guided word fusion

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant