CN110516070B - Chinese question classification method based on text error correction and neural network - Google Patents


Info

Publication number
CN110516070B
CN110516070B (application CN201910801515.4A)
Authority
CN
China
Prior art keywords
vector
matrix vector
text data
attention
Chinese question
Prior art date
Legal status
Active
Application number
CN201910801515.4A
Other languages
Chinese (zh)
Other versions
CN110516070A (en)
Inventor
杨一何
刘晋
Current Assignee
Shanghai Maritime University
Original Assignee
Shanghai Maritime University
Priority date
Filing date
Publication date
Application filed by Shanghai Maritime University filed Critical Shanghai Maritime University
Priority to CN201910801515.4A priority Critical patent/CN110516070B/en
Publication of CN110516070A publication Critical patent/CN110516070A/en
Application granted granted Critical
Publication of CN110516070B publication Critical patent/CN110516070B/en
Legal status: Active

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks


Abstract

The invention discloses a Chinese question classification method based on text error correction and a neural network, which comprises the following steps: acquiring text data of a Chinese question; correcting errors in the text data to obtain error-corrected text data; preprocessing the error-corrected text data to obtain a Chinese question matrix vector; inputting the Chinese question matrix vector into a bidirectional gated recurrent unit layer to obtain an intermediate semantic matrix vector; obtaining an attention matrix vector according to the attention weights corresponding to the intermediate semantic matrix vector; inputting the attention matrix vector into a convolutional neural network layer to obtain a global feature matrix vector; inputting the global feature matrix vector into a fully connected layer to obtain the probability distribution over categories; and obtaining the Chinese question classification result based on the probability distribution. By applying the embodiments of the invention, the input question is first error-corrected, and a bidirectional gated recurrent unit network model, an attention mechanism and a convolutional neural network model are then combined, so that the classification is more accurate.

Description

Chinese question classification method based on text error correction and neural network
Technical Field
The invention relates to the technical field of intelligent information processing and computers, in particular to a Chinese question classification method based on text error correction and a neural network.
Background
With the rapid development of science and technology in the internet era and the arrival of massive data, the results returned by a keyword search engine must be screened manually, which is time-consuming and laborious for users. A question-answering system, by contrast, can rapidly determine the user's intention and return the most accurate answer to the user from hundreds of candidate answers.
Chinese question classification is the first step of a question-answering system and one of the key technologies for enabling a question-answering system to answer accurately. By classifying the Chinese question, the question-answering system can effectively narrow the answer range and determine how the question should be processed, making its answers more accurate and reliable.
At present, research on Chinese question classification methods falls mainly into three types: methods based on rule matching and feature extraction, methods based on traditional machine learning, and methods based on deep learning. Methods based on rule matching and feature extraction define a set of rules for the features of different Chinese questions and classify a Chinese question by analyzing how well it matches the rules. There are many Chinese question classification methods based on machine learning; common ones include naive Bayes classification, support vector machine classification and the like, but they still require features to be extracted manually, which introduces a certain subjectivity into Chinese question classification. Chinese question classification methods based on deep learning also include several approaches, such as convolutional neural network classification and recurrent neural network classification; compared with the traditional methods, deep learning achieves higher accuracy. Current research and applications show that recurrent neural networks such as long short-term memory networks and bidirectional gated recurrent units can better learn the contextual semantic information of a Chinese question, but are not good at extracting local features such as key information. Methods based on convolutional neural networks can better learn the local features in a sentence and extract its key information, but may lose the positional information of the words.
At present, Chinese questions are classified into many types, the classification performance of a single classification method still cannot fully meet practical requirements, and little current research fully exploits the complementary advantages of recurrent neural networks and convolutional neural networks.
In addition, most current methods that classify Chinese questions with a deep learning neural network hardly consider grammatical errors, extra characters, incorrect characters and the like in the input question, and feed all such inputs into the model for prediction or training. As a result, the trained model deviates considerably on similar questions and makes incorrect predictions.
Therefore, traditional Chinese question classification suffers from low classification accuracy caused by grammatical errors, incorrect characters and extra characters in the input question, as well as by the inherent shortcomings of a single classification method.
Disclosure of Invention
The invention aims to provide a Chinese question classification method based on text error correction and a neural network, so as to solve the problem that the classification accuracy is not high enough because the input question contains grammatical errors, incorrect characters and extra characters and because the existing classification methods are single.
In order to achieve the above object, the present invention provides a Chinese question classification method based on text error correction and a neural network, the method comprising:
acquiring text data of a Chinese question;
correcting errors in the text data to obtain error-corrected text data;
preprocessing the error-corrected text data to obtain a Chinese question matrix vector;
inputting the Chinese question matrix vector into a bidirectional gated recurrent unit layer to obtain an intermediate semantic matrix vector;
obtaining an attention matrix vector according to the attention weights corresponding to the intermediate semantic matrix vector;
inputting the attention matrix vector into a convolutional neural network layer to obtain a global feature matrix vector;
inputting the global feature matrix vector into a fully connected layer to obtain the probability distribution over categories;
and obtaining the Chinese question classification result based on the probability distribution.
In one implementation of the present invention, the step of correcting errors in the text data to obtain the error-corrected text data includes:
detecting erroneous characters in the text data of the Chinese question to obtain a candidate set of suspected erroneous characters;
traversing the obtained candidate set of suspected erroneous characters and replacing the erroneous characters using a similar-pronunciation dictionary to obtain a candidate set of error-corrected text data;
and calculating the sentence perplexity of the obtained candidate set of error-corrected text data through a language model, comparing and ranking the calculation results, and obtaining the error-corrected text data.
In one implementation, the step of preprocessing the error-corrected text data to obtain the Chinese question matrix vector includes:
removing target characters from the error-corrected text data to obtain preprocessed text data, wherein the target characters include preset characters and preset symbols;
performing word segmentation on the obtained preprocessed text data to obtain a word segmentation result;
and vectorizing the word segmentation result using a word vector tool to obtain the Chinese question matrix vector.
In one implementation, the step of inputting the Chinese question matrix vector into the bidirectional gated recurrent unit layer to obtain the intermediate semantic matrix vector includes:
inputting the Chinese question matrix vector into the bidirectional gated recurrent unit layer, wherein the bidirectional gated recurrent unit layer comprises a plurality of gated recurrent units, and a single word vector is input into each gated recurrent unit according to the order of the words in the Chinese question;
and concatenating the outputs of the gated recurrent units to generate the intermediate semantic matrix vector.
In one implementation, the step of inputting the attention matrix vector into the convolutional neural network layer to obtain the global feature matrix vector includes:
inputting the attention matrix vector into the convolutional neural network layer so that the convolutional neural network layer performs the calculation with convolution kernels of different lengths, obtaining convolution results;
performing maximum pooling on the convolution results to extract local features;
and concatenating the extracted local features to obtain the global feature matrix vector.
In one implementation, the step of inputting the global feature matrix vector into the fully connected layer to obtain the probability distribution over categories includes:
inputting the global feature matrix vector into the fully connected layer to obtain an output vector;
and converting the output vector into the probability distribution over the corresponding categories according to the normalized exponential function.
Preferably, the step of obtaining the attention matrix vector according to the attention weights corresponding to the intermediate semantic matrix vector includes:
calculating the attention values using the normalized exponential function according to the output of the gated recurrent unit at the current time step and the attention weight parameters;
calculating the attention vector at the current time step according to the current time step and the total length of the sequence;
and concatenating the attention vectors at the different time steps to obtain the attention matrix vector.
By applying the Chinese question classification method based on text error correction and a neural network according to the embodiments of the present invention, text error correction is first performed on the input question before classification, and after error correction a bidirectional gated recurrent unit network model, a self-attention model and a convolutional neural network model are combined. This solves the problem in the prior art that the classification accuracy is limited by grammatical errors, incorrect characters and extra characters in the input question and by the inherent shortcomings of a single classification method, so that the classification is more accurate.
Drawings
FIG. 1 is a schematic flow chart of an embodiment of the present invention.
Fig. 2 is a schematic structural diagram according to an embodiment of the present invention.
Detailed Description
The embodiments of the present invention are described below with reference to specific embodiments, and other advantages and effects of the present invention will be easily understood by those skilled in the art from the disclosure of the present specification. The invention is capable of other and different embodiments and of being practiced or of being carried out in various ways, and its several details are capable of modification in various respects, all without departing from the spirit and scope of the present invention.
Please refer to fig. 1-2. It should be noted that the drawings provided in the present embodiment are only for illustrating the basic idea of the present invention, and the components related to the present invention are only shown in the drawings rather than drawn according to the number, shape and size of the components in actual implementation, and the type, quantity and proportion of the components in actual implementation may be changed freely, and the layout of the components may be more complicated.
The invention provides a Chinese question classification method based on text error correction and neural network, as shown in fig. 1-2, comprising:
s110, Chinese question text data is obtained.
It should be noted that the obtained text data of the Chinese question may be a search query entered by the user in a search engine, a question entered in a question-answering system, or a text-form Chinese question entered directly by the user, which is not limited herein.
And S120, correcting the text data to obtain corrected text data.
The characters in the text data are detected using the language model perplexity: a likelihood probability value is calculated for each character, and a character is added to the candidate set of suspected erroneous characters if its likelihood probability value is lower than the sentence-level average.
The obtained candidate set of suspected erroneous characters is traversed, and the erroneous characters are replaced using a similar-pronunciation dictionary to obtain a candidate set of error-corrected text data. The sentence perplexity of each candidate in the set is then calculated by a language model, the calculation results are compared and ranked, and the error-corrected text data is obtained. The language model may be an N-gram language model, a maximum entropy model, a neural network model, or the like, and is not limited herein.
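The detect, substitute, and rank-by-perplexity flow described above can be sketched as follows. This is a minimal illustration rather than the patented implementation: the character language model is a toy add-one-smoothed unigram model estimated from a three-sentence corpus, and the similar-pronunciation dictionary contains only a few hypothetical entries.

```python
import math

# Toy character language model: P(char) estimated from a tiny corpus.
# A real system would use an N-gram or neural LM, as noted in the text.
corpus = "什么是支持向量机 什么是神经网络 如何训练神经网络"
counts = {}
for ch in corpus.replace(" ", ""):
    counts[ch] = counts.get(ch, 0) + 1
total = sum(counts.values())

def char_prob(ch):
    # Add-one smoothing so unseen characters get a small nonzero probability.
    return (counts.get(ch, 0) + 1) / (total + 1000)

def perplexity(sentence):
    log_p = sum(math.log(char_prob(ch)) for ch in sentence)
    return math.exp(-log_p / len(sentence))

# Hypothetical similar-pronunciation (homophone) dictionary.
similar_pron = {"身": ["神"], "只": ["支"]}

def correct(sentence):
    # Step 1: detect suspected erroneous characters (probability below the
    # sentence-level average), forming the suspect candidate set.
    avg = sum(char_prob(ch) for ch in sentence) / len(sentence)
    suspects = [i for i, ch in enumerate(sentence) if char_prob(ch) < avg]
    # Step 2: build candidate sentences by homophone substitution.
    candidates = [sentence]
    for i in suspects:
        for rep in similar_pron.get(sentence[i], []):
            candidates.append(sentence[:i] + rep + sentence[i + 1:])
    # Step 3: rank candidates by language-model perplexity (lower is better).
    return min(candidates, key=perplexity)

print(correct("什么是身经网络"))  # the homophone substitution 身 -> 神 lowers perplexity
```

A production system would replace char_prob with a full N-gram or neural language model and use a pinyin-keyed confusion dictionary, but the three-step flow stays the same.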
S130, preprocessing the corrected text data to obtain a Chinese question matrix vector.
It can be understood that, before word segmentation is performed on the text data of the Chinese question, the special characters and punctuation in the Chinese question are first removed, and only the textual characters are retained.
Word segmentation is then performed to obtain a word sequence w_1, w_2, ..., w_{n-1}, w_n, where n is the number of words in the Chinese question and w_i is a word in the Chinese question; each w_i is a word in the domain dictionary.
It can be understood that a pre-trained word vector model is used to vectorize the error-corrected and segmented Chinese question, yielding a two-dimensional matrix vector, namely the word vector matrix of the Chinese question S_1, S_2, ..., S_{n-1}, S_n, where S_i ∈ R^d is the word vector of each word in the Chinese question, d is the dimension of the word vectors, and n is the length of the Chinese question; the dimension of the whole Chinese question matrix vector is d × n. The word vector tool is the Google word vector tool (word2vec).
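Step S130 can be sketched as below. The domain dictionary, the greedy forward-maximum-matching segmenter, and the 4-dimensional word vectors are hypothetical stand-ins for a real segmenter and a pre-trained word2vec model.

```python
import re

# Hypothetical domain dictionary with tiny 4-dimensional word vectors (d = 4).
# A real system would load pre-trained word2vec vectors instead.
word_vectors = {
    "什么": [0.1, 0.2, 0.0, 0.3],
    "是": [0.5, 0.1, 0.1, 0.0],
    "神经网络": [0.2, 0.9, 0.4, 0.1],
}

def preprocess(question):
    # Remove everything that is not a CJK character (punctuation, symbols).
    return re.sub(r"[^\u4e00-\u9fff]", "", question)

def segment(text):
    # Greedy forward maximum matching against the dictionary (toy segmenter).
    words, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):
            if text[i:j] in word_vectors or j == i + 1:
                words.append(text[i:j])  # fall back to a single character
                i = j
                break
    return words

def to_matrix(question):
    # d x n word vector matrix: one vector S_i per word w_i (zeros if unknown).
    return [word_vectors.get(w, [0.0] * 4) for w in segment(preprocess(question))]

matrix = to_matrix("什么是神经网络？")
print(len(matrix), len(matrix[0]))  # n = 3 words, d = 4 dimensions
```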
S140, inputting the Chinese question matrix vector into a bidirectional gated recurrent unit layer to obtain an intermediate semantic matrix vector.
The word vector matrix of the Chinese question S_1, S_2, ..., S_{n-1}, S_n is input into the bidirectional gated recurrent unit layer. The word vector of each word corresponds to one gated recurrent unit, and the layer receives the single word vectors one by one, in the order of the words in the Chinese question. The sequence features of the input Chinese question are expressed with directed gated recurrent units, i.e. a forward gated recurrent unit and a backward gated recurrent unit; the network structure of the backward unit is the same as that of the forward unit, only the order of the input sequence is reversed. The input Chinese question word vector matrix is therefore fed into both the forward and the backward gated recurrent units, i.e. the bidirectional gated recurrent unit.
Each gated recurrent unit outputs a vector value of a specified dimension, and the outputs of all units are concatenated to generate an intermediate semantic matrix vector H_1, H_2, ..., H_{n-1}, H_n containing contextual semantic information, where H_i ∈ R^k and k is the specified dimension.
The process by which each gated recurrent unit outputs a vector of the specified dimension is as follows.
A gated recurrent unit involves four computations. Let R_t, Z_t, H̃_t and H_t respectively denote the reset vector, the update vector, the candidate memory cell and the output at time t, where t corresponds to the input order of the Chinese question word vectors.
The reset vector R_t is calculated as:
R_t = σ(W_r S_t + U_r H_{t-1} + B_r)
where the two variables W_r and U_r are weight parameters, H_{t-1} is the output of the gated recurrent unit at the previous time step, S_t is the word vector input at the current time step, B_r is a bias parameter, and σ is the activation function.
The update vector Z_t is calculated as:
Z_t = σ(W_z S_t + U_z H_{t-1} + B_z)
where the two variables W_z and U_z are weight parameters and B_z is a bias parameter.
The candidate memory cell H̃_t is calculated as:
H̃_t = tanh(W S_t + U (R_t ⊙ H_{t-1}) + B)
where W and U are weight parameters, B is a bias parameter, R_t is the reset vector calculated above, ⊙ denotes the element-wise product, and tanh is the hyperbolic tangent function.
The output H_t at the current time step is calculated as:
H_t = (1 - Z_t) ⊙ H_{t-1} + Z_t ⊙ H̃_t
where Z_t is the update vector calculated above and H̃_t is the candidate memory cell calculated above.
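The four GRU computations map directly onto code. The sketch below is a single plain-Python GRU step with toy 2-dimensional identity weights (all parameter names are illustrative); σ is the logistic sigmoid, ⊙ the element-wise product, and the output convention H_t = (1 - Z_t) ⊙ H_{t-1} + Z_t ⊙ H̃_t follows a standard GRU formulation.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def add(*vs):
    return [sum(xs) for xs in zip(*vs)]

def mul(a, b):  # element-wise product, the ⊙ in the formulas
    return [x * y for x, y in zip(a, b)]

def gru_step(S_t, H_prev, P):
    # R_t = sigma(W_r S_t + U_r H_{t-1} + B_r)
    R = [sigmoid(x) for x in add(matvec(P["Wr"], S_t), matvec(P["Ur"], H_prev), P["Br"])]
    # Z_t = sigma(W_z S_t + U_z H_{t-1} + B_z)
    Z = [sigmoid(x) for x in add(matvec(P["Wz"], S_t), matvec(P["Uz"], H_prev), P["Bz"])]
    # Candidate memory cell: H~_t = tanh(W S_t + U (R_t ⊙ H_{t-1}) + B)
    Hc = [math.tanh(x) for x in add(matvec(P["W"], S_t), matvec(P["U"], mul(R, H_prev)), P["B"])]
    # H_t = (1 - Z_t) ⊙ H_{t-1} + Z_t ⊙ H~_t
    return [(1 - z) * h + z * hc for z, h, hc in zip(Z, H_prev, Hc)]

# Toy 2-dimensional identity parameters; a real layer learns these.
I = [[1.0, 0.0], [0.0, 1.0]]
P = {k: I for k in ("Wr", "Ur", "Wz", "Uz", "W", "U")}
P.update({k: [0.0, 0.0] for k in ("Br", "Bz", "B")})

H = gru_step([0.5, -0.2], [0.0, 0.0], P)
print(H)  # output vector of the specified dimension k = 2
```

A bidirectional layer simply runs a second `gru_step` chain over the reversed word sequence and concatenates the two outputs per position.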
S150, obtaining an attention matrix vector according to the attention weights corresponding to the intermediate semantic matrix vector.
The attention weight a_{tj} is calculated as:
a_{tj} = softmax(W_{a2} tanh(W_{a1} H_t))
where W_{a2} and W_{a1} are weight parameters and H_t is the output of the gated recurrent unit at the current time step.
The attention vector C_t at the current time step is calculated as:
C_t = Σ_{j=1}^{T_h} a_{tj} H_j
where C_t is the output of the self-attention mechanism at the current time step, T_h is the total length of the sequence, t is the current time step, and a_{tj} is the attention weight at the current time step.
The attention vectors at the different time steps are concatenated to obtain the attention matrix vector C_1, C_2, ..., C_{T_h}.
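A minimal plain-Python sketch of this self-attention step: softmax-normalized weights are computed from tanh(W_a1 H_t) projected by W_a2, and each attention vector C_t is the weighted sum of the GRU outputs. The weight matrices and the dimensions (k = 2, T_h = 3) are illustrative assumptions, not the patent's trained parameters.

```python
import math

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    return [e / sum(exps) for e in exps]

def self_attention(H, Wa1, Wa2):
    # H: GRU outputs H_1..H_{T_h}, each of dimension k.
    Th, k = len(H), len(H[0])
    C = []
    for t in range(Th):
        u = [math.tanh(x) for x in matvec(Wa1, H[t])]  # tanh(W_a1 H_t)
        a = softmax(matvec(Wa2, u))                    # weights over j = 1..T_h
        # C_t = sum_j a_{tj} H_j, a weighted sum of all GRU outputs
        C.append([sum(a[j] * H[j][i] for j in range(Th)) for i in range(k)])
    return C  # attention matrix vector C_1..C_{T_h}

# Toy dimensions: k = 2, T_h = 3; all weights are illustrative.
H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
Wa1 = [[1.0, 0.0], [0.0, 1.0]]              # k x k
Wa2 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # T_h x k
C = self_attention(H, Wa1, Wa2)
print(len(C), len(C[0]))  # T_h = 3 attention vectors, each of dimension k = 2
```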
S160, inputting the attention matrix vector into a convolutional neural network layer to obtain a global feature matrix vector.
It will be appreciated that the attention matrix vector C_1, C_2, ..., C_{T_h} is used as the input of the convolutional neural network layer, where the convolutional neural network layer performs the calculation with convolution kernels of different lengths, maximum pooling is applied to the convolution results to extract local features, and the pooled results are concatenated to obtain the global feature vector as the output of the convolutional neural network layer.
In the attention matrix vector, each C_t has the specified dimension k, and the length of the matrix vector is the specified sequence length T_h; the attention matrix vector can therefore be regarded as a k × T_h two-dimensional matrix, each column of which is the attention vector of one word.
The specific steps of extracting features from the input with different convolution kernels are as follows:
n different convolution kernel sizes are set, n ≥ 2, where the length of each kernel is fixed to the attention vector dimension k and the width m is variable, 1 ≤ m ≤ T_h.
The different convolution kernels perform the convolution calculation on the input simultaneously; the calculation formula of a convolution kernel is:
T_i = f(W C_{i:i+m-1} + b)
where W is the weight matrix, m is the width of the convolution kernel, b is a bias parameter, f is the activation function, and C_{i:i+m-1} is the concatenation of the attention vectors at several time steps:
C_{i:i+m-1} = [C_i; C_{i+1}; ...; C_{i+m-1}]
where C_i is the attention vector at time step i.
Maximum pooling is applied to the convolution results to extract the local features:
T̂ = max(T_i)
where max takes the maximum value over the T_i.
The pooled results are concatenated to obtain the global feature vector as the output of the convolutional neural network layer:
T = [T̂_1; T̂_2; ...; T̂_n]
where n is the number of different convolution kernels and T is the output of the convolutional neural network layer.
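The multi-kernel convolution and max-pooling stage can be sketched as follows; the two kernel widths, the ReLU activation f, and all weights below are illustrative assumptions rather than the patent's parameters.

```python
def conv_maxpool(C, kernels):
    # C: attention vectors C_1..C_{T_h}, each of dimension k.
    # kernels: list of (W, b, m) with W a flat weight list over m stacked vectors.
    features = []
    for W, b, m in kernels:
        outs = []
        for i in range(len(C) - m + 1):
            # Concatenate C_i..C_{i+m-1} and apply T_i = f(W C_{i:i+m-1} + b).
            window = [x for c in C[i:i + m] for x in c]
            outs.append(max(0.0, sum(w * x for w, x in zip(W, window)) + b))  # ReLU
        features.append(max(outs))  # max pooling over all window positions
    return features  # global feature vector T, one pooled entry per kernel

C = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # T_h = 3, k = 2
kernels = [
    ([1.0, 1.0], 0.0, 1),                  # kernel width m = 1
    ([0.5, 0.5, 0.5, 0.5], 0.0, 2),        # kernel width m = 2
]
T = conv_maxpool(C, kernels)
print(T)  # -> [2.0, 1.5]
```

Each kernel of width m slides over the T_h positions, and pooling keeps only the strongest response, so the output length equals the number of kernels, not the sentence length.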
S170, inputting the global feature matrix vector into a fully connected layer to obtain the probability distribution over categories.
It should be noted that the calculated global feature vector is used as the input of the fully connected layer, and the output vector of the fully connected layer is calculated as follows:
F = σ(W_f T + B_f)
where W_f is a weight parameter, B_f is a bias parameter, σ is the activation function, and T is the global feature vector.
The output vector is then converted into the probability distribution over the corresponding categories according to the normalized exponential function (the Softmax classification function used for probability classification):
O_i = exp(F_i) / Σ_j exp(F_j)
where F_j is the j-th dimension of the fully connected layer output vector and O_i is the probability of the i-th classification result.
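The final stage reduces to a fully connected layer followed by the normalized exponential (softmax) function and an argmax over the category probabilities; the 3-category weight matrix below is a hypothetical example.

```python
import math

def fully_connected(T, Wf, Bf):
    # F = sigma(W_f T + B_f), with the logistic sigmoid as the activation.
    return [1.0 / (1.0 + math.exp(-(sum(w * x for w, x in zip(row, T)) + b)))
            for row, b in zip(Wf, Bf)]

def softmax(F):
    # O_i = exp(F_i) / sum_j exp(F_j), shifted by max(F) for stability
    m = max(F)
    exps = [math.exp(x - m) for x in F]
    return [e / sum(exps) for e in exps]

T = [2.0, 1.5]                              # global feature vector from the CNN layer
Wf = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]   # 3 hypothetical categories
Bf = [0.0, 0.0, 0.0]
O = softmax(fully_connected(T, Wf, Bf))
label = max(range(len(O)), key=O.__getitem__)  # index of the largest probability
print(O, label)
```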
S180, obtaining the Chinese question classification result based on the probability distribution.
The global feature matrix vector is processed as the input of the fully connected layer, and the output of the fully connected layer is converted into the probability distribution over the corresponding categories using the normalized exponential function; the category with the largest probability value is the predicted category, i.e. the classification result of the Chinese question (Label denotes the label of the classification result). Therefore, the index of the maximum output value is taken as the category of the Chinese question, and the Chinese question classification result is obtained.
It can be understood that, after error correction, the invention first passes the corrected Chinese question through the bidirectional gated recurrent unit layer to convert the word vector matrix of the Chinese question into an intermediate semantic matrix vector containing contextual semantic information. A self-attention mechanism is then used to effectively strengthen the role of the key words in the Chinese question, the convolutional neural network and the pooling operation extract the local features of the attention matrix vector, and finally the Chinese question classification result is output through the fully connected layer and the normalized exponential function. By combining the bidirectional gated recurrent unit network model, the self-attention model and the convolutional neural network model, the advantages of the different models can be brought into full play, so that the classification is more accurate.
The foregoing embodiments are merely illustrative of the principles and utilities of the present invention and are not intended to limit the invention. Any person skilled in the art may modify or change the above embodiments without departing from the spirit and scope of the present invention. Accordingly, all equivalent modifications or changes made by those skilled in the art without departing from the spirit and technical ideas of the present invention shall be covered by the claims of the present invention.

Claims (7)

1. A Chinese question classification method based on text error correction and neural network is characterized by comprising the following steps:
acquiring text data of Chinese question sentences;
correcting the text data to obtain corrected text data;
preprocessing the error-corrected text data to obtain a Chinese question matrix vector;
inputting the Chinese question matrix vector into a bidirectional gating circulation unit layer to obtain a middle semantic matrix vector;
obtaining an attention matrix vector according to the attention weight corresponding to the intermediate semantic matrix vector;
inputting the attention moment matrix vector into a convolutional neural network layer to obtain a global feature matrix vector;
inputting the global feature matrix vector to a full-connection layer to obtain probability distribution of each category;
and obtaining a Chinese question classification result based on the probability distribution.
2. The method for classifying Chinese question sentences based on text error correction and neural network as claimed in claim 1, wherein the step of correcting the text data to obtain the text data after error correction comprises:
detecting error characters in the text data of the Chinese question sentence to obtain a suspected error character candidate set;
traversing the obtained suspected wrong character candidate set, and replacing wrong characters by using a sound-like dictionary to obtain an error-corrected text data candidate set;
and calculating sentence confusion of the obtained text data candidate set after error correction through a language model, comparing and sequencing calculation results, and obtaining error-corrected text data.
3. The method as claimed in claim 2, wherein the step of preprocessing the text data after error correction to obtain a matrix vector of chinese question comprises:
removing target characters in the error-corrected text data to obtain preprocessed text data, wherein the target characters comprise preset characters and preset symbols;
performing word segmentation on the obtained preprocessed text data to obtain word segmentation processing results;
and vectorizing the word segmentation processing result by using a word vector tool to obtain a Chinese question matrix vector.
4. The method for classifying Chinese question sentences based on text error correction and neural networks as claimed in claim 3, wherein said step of inputting the matrix vectors of Chinese question sentences into a bidirectional gated cyclic unit layer to obtain intermediate semantic matrix vectors comprises:
inputting the matrix vector of the Chinese question sentence into a bidirectional gating circulation unit layer, wherein the bidirectional gating circulation unit layer comprises a plurality of gating circulation units, and a single word vector is input into each gating circulation unit layer according to the sequence of words in the Chinese question sentence;
and splicing the output of each gating circulation unit to generate an intermediate semantic matrix vector.
5. The method as claimed in claim 4, wherein the step of inputting the attention moment matrix vector to the convolutional neural network layer to obtain a global feature matrix vector comprises:
inputting the attention moment array vector into a convolutional neural network layer to enable the convolutional neural network layer to calculate convolutional kernels with different lengths, and obtaining a convolution result;
performing maximum pooling processing on the convolution result, and extracting local features;
and splicing the extracted local features to obtain a global feature matrix vector.
6. The method for classifying Chinese question sentences based on text error correction and neural network as claimed in claim 1, wherein said step of inputting said global feature matrix vector to a fully connected layer to obtain probability distribution of each category comprises:
inputting the global feature matrix vector to a full connection layer to obtain an output vector;
and converting the output vector into probability distribution of corresponding categories according to the normalized exponential function.
7. The method for classifying Chinese question sentences based on text error correction and a neural network as claimed in claim 1, wherein the step of obtaining the attention matrix vector from the attention weights corresponding to the intermediate semantic matrix vector comprises:
calculating an attention value with the normalized exponential function from the output of the gated recurrent unit at the current moment and the attention weight parameter;
calculating the attention vector at the current moment from the attention value at the current moment and the total length of the time sequence;
and splicing the attention vectors at different moments to obtain the attention matrix vector.
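The attention step of claim 7 can be sketched as follows. The claim does not spell out the scoring function, so this example assumes the common dot-product form: each GRU output is scored against a learned weight vector, the scores are normalized with softmax across the sequence, and the weighted outputs are spliced into the attention matrix.

```python
import numpy as np

def attention(H, w):
    """Score each GRU output row of H against the attention weight vector w,
    softmax-normalize the scores over the time sequence, and weight H."""
    scores = H @ w                       # one attention value per moment
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()                 # normalized exponential over all moments
    return alpha[:, None] * H            # attention vectors, spliced row-wise

rng = np.random.default_rng(0)
H = rng.standard_normal((5, 16))  # intermediate semantic matrix (BiGRU outputs)
M = attention(H, rng.standard_normal(16))
assert M.shape == H.shape         # attention matrix, same shape as input
```

The attention weights let the later convolutional layer focus on the words most indicative of the question's category.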
CN201910801515.4A 2019-08-28 2019-08-28 Chinese question classification method based on text error correction and neural network Active CN110516070B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910801515.4A CN110516070B (en) 2019-08-28 2019-08-28 Chinese question classification method based on text error correction and neural network

Publications (2)

Publication Number Publication Date
CN110516070A CN110516070A (en) 2019-11-29
CN110516070B true CN110516070B (en) 2022-09-30

Family

ID=68627740

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910801515.4A Active CN110516070B (en) 2019-08-28 2019-08-28 Chinese question classification method based on text error correction and neural network

Country Status (1)

Country Link
CN (1) CN110516070B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111079416B (en) * 2019-12-03 2024-02-02 河海大学常州校区 Chinese text correction method based on shared control gate structure
CN111079410B (en) * 2019-12-23 2023-12-22 五八有限公司 Text recognition method, device, electronic equipment and storage medium
CN111159360B (en) * 2019-12-31 2022-12-02 合肥讯飞数码科技有限公司 Method and device for obtaining query topic classification model and query topic classification
CN111259658B (en) * 2020-02-05 2022-08-19 中国科学院计算技术研究所 General text classification method and system based on category dense vector representation
CN113468414A (en) * 2021-06-07 2021-10-01 广州华多网络科技有限公司 Commodity searching method and device, computer equipment and storage medium
CN113255332B (en) * 2021-07-15 2021-12-24 北京百度网讯科技有限公司 Training and text error correction method and device for text error correction model

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CA3055379C (en) * 2017-03-10 2023-02-21 Eduworks Corporation Automated tool for question generation
CN107797985B (en) * 2017-09-27 2022-02-25 百度在线网络技术(北京)有限公司 Method and device for establishing synonymous identification model and identifying synonymous text
CN108595602A (en) * 2018-04-20 2018-09-28 昆明理工大学 The question sentence file classification method combined with depth model based on shallow Model

Also Published As

Publication number Publication date
CN110516070A (en) 2019-11-29

Similar Documents

Publication Publication Date Title
CN110516070B (en) Chinese question classification method based on text error correction and neural network
CN108984526B (en) Document theme vector extraction method based on deep learning
CN110298037B (en) Convolutional neural network matching text recognition method based on enhanced attention mechanism
CN110245229B (en) Deep learning theme emotion classification method based on data enhancement
CN109376242B (en) Text classification method based on cyclic neural network variant and convolutional neural network
CN107943784B (en) Relationship extraction method based on generation of countermeasure network
CN109918491B (en) Intelligent customer service question matching method based on knowledge base self-learning
WO2023024412A1 (en) Visual question answering method and apparatus based on deep learning model, and medium and device
CN112100346B (en) Visual question-answering method based on fusion of fine-grained image features and external knowledge
CN111914097A (en) Entity extraction method and device based on attention mechanism and multi-level feature fusion
CN109977199B (en) Reading understanding method based on attention pooling mechanism
CN110232122A (en) A kind of Chinese Question Classification method based on text error correction and neural network
CN111966812B (en) Automatic question answering method based on dynamic word vector and storage medium
CN112052684A (en) Named entity identification method, device, equipment and storage medium for power metering
CN110019822B (en) Few-sample relation classification method and system
CN113255366B (en) Aspect-level text emotion analysis method based on heterogeneous graph neural network
CN111581364B (en) Chinese intelligent question-answer short text similarity calculation method oriented to medical field
CN113486645A (en) Text similarity detection method based on deep learning
CN111984791A (en) Long text classification method based on attention mechanism
CN113094502A (en) Multi-granularity takeaway user comment sentiment analysis method
CN110852071A (en) Knowledge point detection method, device, equipment and readable storage medium
CN108875024B (en) Text classification method and system, readable storage medium and electronic equipment
CN113806543A (en) Residual jump connection-based text classification method for gated cyclic unit
CN111723572B (en) Chinese short text correlation measurement method based on CNN convolutional layer and BilSTM
CN110334204B (en) Exercise similarity calculation recommendation method based on user records

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant