CN111813938A - Record question-answer classification method based on ERNIE and DPCNN - Google Patents

Record question-answer classification method based on ERNIE and DPCNN

Info

Publication number
CN111813938A
Authority
CN
China
Prior art keywords
model
ernie
word
dpcnn
sentence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010654746.XA
Other languages
Chinese (zh)
Inventor
王莎莎
彭鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hunan University
Original Assignee
Hunan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hunan University
Priority to CN202010654746.XA
Publication of CN111813938A
Legal status: Pending

Classifications

    • G  PHYSICS
        • G06  COMPUTING; CALCULATING OR COUNTING
            • G06F  ELECTRIC DIGITAL DATA PROCESSING
                • G06F16/00  Information retrieval; Database structures therefor; File system structures therefor
                    • G06F16/30  Information retrieval of unstructured textual data
                        • G06F16/33  Querying
                            • G06F16/332  Query formulation
                                • G06F16/3329  Natural language query formulation or dialogue systems
                            • G06F16/3331  Query processing
                                • G06F16/334  Query execution
                                    • G06F16/3346  Query execution using probabilistic model
                        • G06F16/35  Clustering; Classification
                • G06F40/00  Handling natural language data
                    • G06F40/20  Natural language analysis
                        • G06F40/279  Recognition of textual entities
            • G06N  COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
                • G06N3/00  Computing arrangements based on biological models
                    • G06N3/02  Neural networks
                        • G06N3/04  Architecture, e.g. interconnection topology
                            • G06N3/045  Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Probability & Statistics with Applications (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention relates to a record question-answer classification method based on ERNIE and DPCNN, which mainly comprises: preprocessing a record text data set, inputting the processed data into an ERNIE model for training to obtain a word vector sequence, and then inputting the obtained word vector sequence into a DPCNN model for training to classify the question-answer pairs.

Description

Record question-answer classification method based on ERNIE and DPCNN
Technical Field
The invention belongs to the technical field of natural language processing, and discloses a method for classifying question-answer pairs in a record based on ERNIE and DPCNN.
Background
As time goes on, more and more record texts accumulate, and these texts contain a large amount of key case information. How to efficiently obtain case information of important value from such data has become a research hotspot. Question answering, search, information extraction, and crime analysis over records are all application fields of natural language processing; classification of the question-answer pairs in a record is the foundation of these technologies, so the requirement on classification accuracy is very high.
With the rapid development of deep learning, the accuracy of text classification technology has improved continuously, with good results in, for example, spam classification and intent recognition. However, with the development of the internet, criminals use network technology to commit crimes in increasingly diverse ways, so correspondingly higher accuracy is urgently needed in the classification of case information.
Most traditional text classification techniques adopt RNN or CNN models, but these models have significant drawbacks. The RNN is a recurrent neural network: the result of the next layer depends on the output of the previous layer, so in general processing must proceed word by word, which is clearly not amenable to parallelization. Meanwhile, in recent years some researchers have used very deep CNNs to classify text on large-scale training data, which undoubtedly increases training complexity. The BERT model published in 2018, although it achieved good results in the natural language field, remains unsatisfactory in accuracy for Chinese classification, since it predicts only from context and does not consider the prior knowledge contained in Chinese sentences.
Disclosure of Invention
The purpose of the invention is to further improve the accuracy of record question-answer classification, so as to facilitate subsequent data processing of records such as information extraction and retrieval.
In order to achieve the above purpose, the invention provides a record question-answer classification method based on ERNIE and DPCNN, which mainly comprises the following steps:
step 1, dividing the question-answer pairs in a record text into labeled categories, where the category divisions differ across case record types;
step 2, data preprocessing: preprocessing the original record text T to obtain a data set T', where T = {T1, T2, T3, …, Ti, …, Tlen(T)}, len(T) denotes the number of record texts, and Ti represents the i-th record text; T' = {t1, t2, t3, …, tj, …, tlen(T')}, where tj represents the j-th record question-answer pair and len(T') denotes the number of question-answer pairs;
step 3, vectorizing the data set T' by using an ERNIE model;
step 3.1, first processing the data set into the data format required by the ERNIE model, and using a split function to divide each tj into two parts, content and category;
step 3.2, performing word segmentation on the content using the tokenizer in the model to obtain W = {w1, w2, w3, …, wk, …, wlen(W)}, where W denotes one sample (i.e., the content) in the data set and wk represents the k-th word in the sentence;
step 3.3, prepending [CLS] to the beginning of sample W to obtain W', and setting the fixed sentence length seg_size;
step 3.4, converting W' into a vector W'' using the model; if the length of W'' is less than seg_size it is padded with 0, otherwise it is truncated;
step 3.5, storing the processed data so that it only needs to be loaded each time the model is debugged, improving the efficiency of data processing;
step 3.6, establishing an iterator with batch samples per group, and then feeding the processed data into the ERNIE model to obtain a word vector sequence;
step 4, performing convolution, pooling, and related operations on the vectorized sequence using a DPCNN model;
step 4.1, performing two convolution operations on the word vector sequence obtained from ERNIE training using the DPCNN model, activating with the nonlinear activation function ReLU after each convolution, and performing a padding operation (boundary filling) to prevent boundary information from being lost;
step 4.2, after the operation of 4.1, performing a max-pooling operation on the obtained data to obtain a feature vector X;
step 4.3, setting a judgment threshold length; if the actual sentence length is greater than length, performing two convolution operations on the feature vector X, applying ReLU and padding to the convolved data each time to obtain a feature vector F, and performing feature fusion of X and F to obtain a feature vector X';
step 4.4, performing a max-pooling operation on the new feature vector X' to obtain a feature vector X'', and reducing the dimension of X'' through a fully connected layer to obtain a feature vector X''' = {x1, x2, x3, …, xN}, where N represents the number of text categories;
step 4.5, inputting the feature vector X''' into the softmax layer to obtain the predicted question-answer category probability values P = {p1, p2, p3, …, pm, …, pN}, where pm represents the probability value of the m-th category; Max(P) gives the final classification of the question-answer pair;
step 5, training the ERNIE-DPCNN model using a training set and a validation set;
Parameters in the model are optimized by gradient descent, cross entropy is defined as the loss function of the model, and an improve_num parameter is set; if the model's performance has not improved after training exceeds this value, training of the model is finished;
step 6, probabilistic classification of the record question-answer text;
Data to be predicted is input into the trained ERNIE-DPCNN model for processing; the feature vector obtained from the model is input into the softmax layer and, as in step 4.5, the category with the maximum probability is taken as the category of the predicted data.
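As a minimal illustration of this prediction step, the following sketch ties the stages together; encode, ernie, and dpcnn_head are hypothetical names for the components sketched in the detailed description below, not APIs disclosed by the invention.

```python
import torch

# Sketch of step 6: classifying a new record question-answer pair with the
# trained ERNIE-DPCNN model. encode, ernie, and dpcnn_head are hypothetical
# components (sketched in the detailed description below), not fixed APIs.
def predict(text, ernie, dpcnn_head, id2label):
    input_ids, input_mask, _ = encode(text)           # preprocessing as in step 3
    with torch.no_grad():
        vecs = ernie(torch.tensor([input_ids]))       # word vector sequence
        probs = dpcnn_head(vecs)                      # P = {p1, ..., pN} via softmax
    return id2label[int(probs.argmax(dim=1))]         # Max(P) -> predicted category
```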
Drawings
FIG. 1 is a schematic flow chart of a record question-answer classification method based on ERNIE and DPCNN of the present invention;
FIG. 2 is an internal structure diagram of the ERNIE-DPCNN model of the present invention.
Detailed Description
A specific implementation process of the present invention is described below with reference to FIG. 1 and FIG. 2:
(1) Record question-answer classification
As shown in FIG. 1, the question-answer pairs in the record text are first divided into categories such as "personal situation", "crime process", and "physical condition". The category divisions differ across case record types: for example, drug case records have categories such as "drug source" and "drug-taking method", while theft case records have categories such as "theft means" and "theft location";
(2) data pre-processing
Since in the original record a question and its answer may appear as separate adjacent entries, the original record text T is preprocessed, where T = {T1, T2, T3, …, Ti, …, Tlen(T)}, len(T) denotes the number of record texts, and Ti represents the i-th record text; the questions and answers in the record are grouped into pairs to obtain the data set T' = {t1, t2, t3, …, tj, …, tlen(T')}, where tj represents the j-th record question-answer pair and len(T') denotes the number of question-answer pair samples;
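As one possible illustration of this grouping, a minimal sketch follows; the line markers "问:" and "答:" and the line format are assumptions for illustration, since the actual record layout is not fixed by the invention.

```python
# Hypothetical sketch of the pairing in step (2). The line markers "问:"
# (question) and "答:" (answer) are assumed; real records may differ.
def build_qa_pairs(record_lines):
    """Group adjacent question/answer lines of one record text Ti into pairs tj."""
    pairs, question = [], None
    for line in record_lines:
        line = line.strip()
        if line.startswith("问:"):               # a new question starts
            question = line[len("问:"):]
        elif line.startswith("答:") and question is not None:
            answer = line[len("答:"):]
            pairs.append((question, answer))      # one question-answer pair tj
            question = None
    return pairs

# Example: two lines of a record text yield one question-answer pair.
demo = ["问:案发当晚你在哪里?", "答:我在家里。"]
print(build_qa_pairs(demo))  # [('案发当晚你在哪里?', '我在家里。')]
```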
(3) vectorization of data set T' based on ERNIE model
(3.1) The data set is processed into the data format required by the ERNIE model, namely seg_len, input_ids, input_mask, and label, where seg_len denotes the actual sentence length, input_ids represents each word of the sentence by its id in the ERNIE vocabulary, input_mask marks each real word position in the sentence with 1, and label is the id of the sentence's label in the corresponding category vocabulary. Then each tj is divided into two parts, content and category, using a split function;
(3.2) Word segmentation is performed using the tokenizer in the model to obtain W = {w1, w2, w3, …, wk, …, wlen(W)}, where W denotes the content of each sample in the data set and wk represents the k-th word in the sentence;
(3.3) [CLS] is prepended to sample W to obtain W', and the fixed sentence length seg_size is set; W' is converted into the vector W'' (the input_ids) using the ERNIE model, padded with 0 if the length of W'' is less than seg_size and truncated otherwise; input_mask marks each real word position in the sentence with 1 and fills the remaining positions with 0;
(3.4) An iterator is established with batch samples per group, and the processed data is fed into the ERNIE model for training to obtain a word vector sequence;
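A minimal sketch of steps (3.1)-(3.4) is given below; the toy vocabulary, the character-level split, and the pad/unknown ids are assumptions for illustration and do not reproduce the actual ERNIE vocabulary or tokenizer.

```python
# Minimal sketch of steps (3.1)-(3.4): tokenize, prepend [CLS], convert to
# input_ids via a vocabulary, and pad or truncate to seg_size. The toy
# vocabulary below is an assumption; the real ERNIE model ships its own.
vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2}      # hypothetical tiny vocabulary

def encode(content, seg_size=32):
    words = list(content)                          # character-level split, common for Chinese
    words = ["[CLS]"] + words                      # step 3.3: prepend [CLS]
    ids = [vocab.get(w, vocab["[UNK]"]) for w in words]
    seg_len = min(len(ids), seg_size)              # actual sentence length
    ids = ids[:seg_size] + [vocab["[PAD]"]] * max(0, seg_size - len(ids))  # pad/truncate
    input_mask = [1] * seg_len + [0] * (seg_size - seg_len)  # 1 marks real words
    return ids, input_mask, seg_len

input_ids, input_mask, seg_len = encode("你为什么要这样做?")
```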
(4) performing operations such as convolution, pooling and the like on word vector sequence based on DPCNN model
(4.1) Two convolution operations are performed on the word vector sequence obtained from ERNIE training using the DPCNN model, as shown in FIG. 2, with convolution kernels of size 3 and 250 convolution kernels; the size n_out of the word vector sequence after convolution is given by

n_out = ⌊(n_in + 2p - f) / s⌋ + 1 (1)

where n_in represents the size of the original word vector sequence, p the padding size, f the convolution kernel size, and s the stride. After each convolution the nonlinear activation function ReLU is applied; in addition, to prevent boundary information from being lost, a padding operation (boundary filling) is required;
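As a quick check on formula (1), the following helper (a sketch, with parameter defaults chosen for illustration) computes the post-convolution length:

```python
# Sketch: output length of a one-dimensional convolution per formula (1).
def conv_out_len(n_in, f=3, p=1, s=1):
    return (n_in + 2 * p - f) // s + 1

# With kernel size f=3, padding p=1, and stride s=1 the length is preserved:
assert conv_out_len(32) == 32
# The down-sampling pooling of step (4.2) (size 3, stride 2) roughly halves it:
assert conv_out_len(32, f=3, p=0, s=2) == 15
```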
(4.2) After the above operations, a max-pooling (down-sampling) operation is performed on the resulting data using a pooling layer of size 3 with stride 2; that is, the pooling layer takes the maximum of every 3 consecutive internal vectors to generate a new internal representation of the document, and after this operation the feature vector X is obtained;
(4.3) length is set as a judgment condition; if the actual sentence length is greater than length, two convolution operations are performed on the feature vector X, as shown in FIG. 2, with ReLU and padding applied to the data after each convolution to obtain a feature vector F, and the feature vectors X and F are fused to obtain a feature vector X';
(4.4) Max pooling is performed on the new feature vector X' to obtain a feature vector X'', and the dimension of X'' is reduced through a fully connected layer to obtain the reduced feature vector X''' = {x1, x2, x3, …, xN}, where N stands for the number of text categories; the feature vector X''' is input into the softmax layer, formulated as

pm = exp(xm) / Σj exp(xj), j = 1, …, N (2)

which yields the predicted question-answer category probability values P = {p1, p2, p3, …, pm, …, pN}, where pm represents the probability value of the m-th category; Max(P) gives the final classification of the question-answer pair;
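A condensed PyTorch sketch of the computation in (4.1)-(4.4) follows; the kernel size 3, the 250 kernels, and the pooling of size 3 with stride 2 come from the text, while the embedding dimension, class count, and weight sharing across the repeated blocks are assumptions made for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DPCNNHead(nn.Module):
    """Sketch of steps (4.1)-(4.4): two convolutions, down-sampling pooling,
    repeated residual blocks while the sequence is long enough, then a fully
    connected layer and softmax. Weight sharing across repeats is a
    simplification, not mandated by the text."""
    def __init__(self, emb_dim=768, n_filters=250, n_classes=10):
        super().__init__()
        self.region = nn.Conv1d(emb_dim, n_filters, kernel_size=3, padding=1)
        self.conv = nn.Conv1d(n_filters, n_filters, kernel_size=3, padding=1)
        self.fc = nn.Linear(n_filters, n_classes)

    def forward(self, x):                      # x: (batch, seq_len, emb_dim) from ERNIE
        x = x.transpose(1, 2)                  # -> (batch, emb_dim, seq_len)
        x = F.relu(self.region(x))             # step 4.1: two convolutions with ReLU
        x = F.relu(self.conv(x))
        x = F.max_pool1d(x, kernel_size=3, stride=2)  # step 4.2: pooling -> X
        while x.size(2) > 2:                   # step 4.3: sentence still long enough
            f = F.relu(self.conv(x))           # two more convolutions -> F
            f = F.relu(self.conv(f))
            x = F.max_pool1d(x + f, kernel_size=3, stride=2)  # fuse X and F, pool (4.4)
        x = x.max(dim=2).values                # collapse remaining positions
        return F.softmax(self.fc(x), dim=1)    # step 4.5: probability values P

probs = DPCNNHead()(torch.randn(4, 32, 768))   # e.g. a batch of 4 encoded sentences
```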
(5) ERNIE-DPCNN model training
Parameters in the model are optimized by gradient descent, and cross entropy is defined as the loss function loss of the model; the multi-class cross-entropy formula is as follows:
H(D,Y)=-∑D(x)log Y(x) (3)
where D is the true value and Y is the predicted value; an improve_num parameter is set, and when model training exceeds this value without improvement in model performance, training of the model ends.
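A training-loop sketch consistent with step (5) is given below; the SGD optimizer, learning rate, per-batch validation, and data loaders are assumptions, while the cross-entropy loss of formula (3) and the improve_num early-stopping rule follow the text.

```python
import torch
import torch.nn as nn

def train(model, train_loader, val_loader, improve_num=1000):
    """Sketch of step (5): gradient descent with the multi-class cross entropy
    of formula (3), H(D, Y) = -sum D(x) log Y(x), and early stopping once the
    validation loss has not improved for improve_num batches. `model` is
    assumed to output probabilities P (as in the DPCNN sketch above)."""
    nll = nn.NLLLoss()                         # applied to log P -> cross entropy
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
    best_loss, last_improve, step = float("inf"), 0, 0
    while True:
        for inputs, labels in train_loader:
            optimizer.zero_grad()
            loss = nll(torch.log(model(inputs) + 1e-12), labels)
            loss.backward()
            optimizer.step()
            step += 1
            with torch.no_grad():              # validate (every batch, for brevity)
                val_loss = sum(nll(torch.log(model(x) + 1e-12), y).item()
                               for x, y in val_loader) / len(val_loader)
            if val_loss < best_loss:
                best_loss, last_improve = val_loss, step
            elif step - last_improve > improve_num:
                return model                   # early stopping: no recent improvement
```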

Claims (4)

1. A record question-answer classification method based on ERNIE and DPCNN is characterized by comprising the following steps:
step 1, classifying the record question-answer pairs;
step 2, preprocessing data;
step 3, vectorizing the data set T' based on the ERNIE model;
step 4, performing convolution, pooling, and related operations on the word vector sequence based on the DPCNN model;
and 5, training an ERNIE-DPCNN model.
2. The ERNIE and DPCNN-based record question-answer classification method according to claim 1, characterized in that in step 3 the data set is processed into the data format required by the ERNIE model, namely seg_len, input_ids, input_mask, and label, where seg_len denotes the actual sentence length, input_ids represents each word of the sentence by its id in the ERNIE vocabulary, input_mask marks each real word position in the sentence with 1, and label is the id of the sentence's label in the corresponding category vocabulary; then each tj is divided into two parts, content and category, using a split function; the specific content of this step is as follows:
word segmentation is performed using the tokenizer in the model to obtain W = {w1, w2, w3, …, wk, …, wlen(W)}, where W denotes each sample (i.e., the content) in the data set and wk represents the k-th word in the sentence;
[CLS] is prepended to sample W to obtain W', the fixed sentence length seg_size is set, and W' is converted into the vector W'' (the input_ids) using the model; if the length of W'' is less than seg_size it is padded with 0, otherwise it is truncated; input_mask marks each real word position in the sentence with 1 and fills the remaining positions with 0;
an iterator is established with batch samples per group, and the processed data is then fed into the ERNIE model for training to obtain a word vector sequence.
3. The ERNIE and DPCNN-based record question-answer classification method according to claim 1, characterized in that in step 4 the DPCNN model is used to perform convolution, pooling, and related operations on the word vector sequence obtained from ERNIE training, with the following specific content:
first, two convolution operations are performed on the word vector sequence, with convolution kernels of size 3 and 250 convolution kernels, the size of the word vector sequence after convolution being n_out; after each convolution the nonlinear activation function ReLU is applied, and in addition, to prevent boundary information from being lost, a padding operation (boundary filling) is required;
after these operations, a max-pooling (down-sampling) operation is performed on the resulting data using a pooling layer of size 3 with stride 2, i.e., the pooling layer takes the maximum of every 3 consecutive internal vectors to generate a new internal representation of the document, and after this operation the feature vector X is obtained;
length is set as a judgment condition; if the actual sentence length is greater than length, two convolution operations are performed on the feature vector X, with ReLU and padding likewise applied after each convolution to obtain a feature vector F; the feature vectors X and F are fused to obtain a feature vector X'; max pooling is performed on the new feature vector X' to obtain a feature vector X'', and the dimension of X'' is reduced through a fully connected layer to obtain the reduced feature vector X''' = {x1, x2, x3, …, xN}, where N represents the number of text categories; the feature vector X''' is input into the softmax layer to obtain the predicted question-answer category probability values P = {p1, p2, p3, …, pm, …, pN}, where pm represents the probability value of the m-th category and Max(P) is the final classification of the question-answer pair.
4. The method according to claim 1, characterized in that in step 5, in training the ERNIE-DPCNN model, parameters in the model are optimized by gradient descent, cross entropy is defined as the loss function loss of the model, and an improve_num parameter is set; when the number of training iterations exceeds this value without improvement in model performance, model training ends.
CN202010654746.XA 2020-07-09 2020-07-09 Record question-answer classification method based on ERNIE and DPCNN Pending CN111813938A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010654746.XA CN111813938A (en) 2020-07-09 2020-07-09 Record question-answer classification method based on ERNIE and DPCNN

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010654746.XA CN111813938A (en) 2020-07-09 2020-07-09 Record question-answer classification method based on ERNIE and DPCNN

Publications (1)

Publication Number Publication Date
CN111813938A true CN111813938A (en) 2020-10-23

Family

ID=72843254

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010654746.XA Pending CN111813938A (en) 2020-07-09 2020-07-09 Record question-answer classification method based on ERNIE and DPCNN

Country Status (1)

Country Link
CN (1) CN111813938A (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180240012A1 (en) * 2017-02-17 2018-08-23 Wipro Limited Method and system for determining classification of text
CN111160017A (en) * 2019-12-12 2020-05-15 北京文思海辉金信软件有限公司 Keyword extraction method, phonetics scoring method and phonetics recommendation method

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180240012A1 (en) * 2017-02-17 2018-08-23 Wipro Limited Method and system for determining classification of text
CN111160017A (en) * 2019-12-12 2020-05-15 北京文思海辉金信软件有限公司 Keyword extraction method, phonetics scoring method and phonetics recommendation method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
RIE JOHNSON et al.: "Deep Pyramid Convolutional Neural Networks for Text Categorization", Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics *
LI Zhoujun et al.: "A Survey of Pre-training Techniques for Natural Language Processing", Computer Science (《计算机科学》) *

Similar Documents

Publication Publication Date Title
CN108984526B (en) Document theme vector extraction method based on deep learning
CN109766277B (en) Software fault diagnosis method based on transfer learning and DNN
CN112115238B (en) Question-answering method and system based on BERT and knowledge base
CN109766544B (en) Document keyword extraction method and device based on LDA and word vector
CN113254599A (en) Multi-label microblog text classification method based on semi-supervised learning
CN112231447B (en) Method and system for extracting Chinese document events
CN111488739A (en) Implicit discourse relation identification method based on multi-granularity generated image enhancement representation
CN109684626A (en) Method for recognizing semantics, model, storage medium and device
CN110647612A (en) Visual conversation generation method based on double-visual attention network
CN111966812B (en) Automatic question answering method based on dynamic word vector and storage medium
CN110750635B (en) French recommendation method based on joint deep learning model
CN110851594A (en) Text classification method and device based on multi-channel deep learning model
CN109492105B (en) Text emotion classification method based on multi-feature ensemble learning
CN112434164B (en) Network public opinion analysis method and system taking topic discovery and emotion analysis into consideration
KR20200105057A (en) Apparatus and method for extracting inquiry features for alalysis of inquery sentence
Sheshikala et al. Natural language processing and machine learning classifier used for detecting the author of the sentence
CN112036705A (en) Quality inspection result data acquisition method, device and equipment
CN112100212A (en) Case scenario extraction method based on machine learning and rule matching
CN113988079A (en) Low-data-oriented dynamic enhanced multi-hop text reading recognition processing method
CN114742047A (en) Text emotion recognition method based on maximum probability filling and multi-head attention mechanism
CN111159405B (en) Irony detection method based on background knowledge
CN112632272A (en) Microblog emotion classification method and system based on syntactic analysis
CN116821297A (en) Stylized legal consultation question-answering method, system, storage medium and equipment
Nouhaila et al. Arabic sentiment analysis based on 1-D convolutional neural network
Zhao et al. Commented content classification with deep neural network based on attention mechanism

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20201023