CN111813938A - Record question-answer classification method based on ERNIE and DPCNN - Google Patents

Record question-answer classification method based on ERNIE and DPCNN

Info

Publication number
CN111813938A
Authority
CN
China
Prior art keywords
model
ernie
word
dpcnn
sentence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010654746.XA
Other languages
Chinese (zh)
Inventor
王莎莎
彭鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hunan University
Original Assignee
Hunan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hunan University
Priority to CN202010654746.XA
Publication of CN111813938A
Legal status: Pending

Classifications

    • G  PHYSICS
        • G06  COMPUTING; CALCULATING OR COUNTING
            • G06F  ELECTRIC DIGITAL DATA PROCESSING
                • G06F16/00  Information retrieval; Database structures therefor; File system structures therefor
                    • G06F16/30  Information retrieval of unstructured textual data
                        • G06F16/33  Querying
                            • G06F16/332  Query formulation
                                • G06F16/3329  Natural language query formulation or dialogue systems
                            • G06F16/3331  Query processing
                                • G06F16/334  Query execution
                                    • G06F16/3346  Query execution using probabilistic model
                        • G06F16/35  Clustering; Classification
                • G06F40/00  Handling natural language data
                    • G06F40/20  Natural language analysis
                        • G06F40/279  Recognition of textual entities
            • G06N  COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
                • G06N3/00  Computing arrangements based on biological models
                    • G06N3/02  Neural networks
                        • G06N3/04  Architecture, e.g. interconnection topology
                            • G06N3/045  Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Probability & Statistics with Applications (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention relates to a record question-answer classification method based on ERNIE and DPCNN, which mainly comprises: preprocessing a record text data set, inputting the processed data into an ERNIE model for training to obtain a word vector sequence, and then inputting the obtained word vector sequence into a DPCNN model for training to classify the question-answer pairs.

Description

Record question-answer classification method based on ERNIE and DPCNN
Technical Field
The invention belongs to the technical field of natural language processing, and discloses a method for classifying question-answer pairs in a record based on ERNIE and DPCNN.
Background
As time goes on, more and more record texts accumulate, and these texts contain a large amount of key case information. How to efficiently obtain case information of important value from such data has become a research hotspot. Question answering, search, information extraction, and crime analysis over records are all application fields of natural language processing; classification of the question-answer pairs in a record is the foundation of these technologies, so the requirement on classification accuracy is very high.
With the rapid development of deep learning, the accuracy of text classification technology has improved continuously, with good results in, for example, spam classification and intent recognition. However, with the development of the internet, criminals use network technology to commit crimes in increasingly diverse ways, so correspondingly higher accuracy is urgently needed in the classification of case information.
Most traditional text classification techniques adopt RNN or CNN models, but these models have significant drawbacks. The RNN is a recurrent neural network: the result of the next layer depends on the output of the previous layer, so in general processing must proceed word by word, which is clearly not amenable to parallelization. Meanwhile, in recent years some researchers have used very deep CNNs to classify text on large-scale training data, which undoubtedly increases training complexity. The BERT model published in 2018, although it achieved good results in the natural language field, remains unsatisfactory in accuracy for Chinese classification, since it predicts only from context and does not consider the prior knowledge contained in Chinese sentences.
Disclosure of Invention
The purpose of the invention is to further improve the accuracy of record question-answer classification, so as to facilitate subsequent data processing of records such as information extraction and retrieval.
In order to achieve the above purpose, the invention provides a record question-answer classification method based on ERNIE and DPCNN, which mainly comprises the following steps:
step 1, dividing the question-answer pairs in a record text into labeled categories, where the category divisions differ across case record types;
step 2, data preprocessing: preprocessing the original record text T to obtain a data set T', where T = {T1, T2, T3, …, Ti, …, Tlen(T)}, len(T) denotes the number of record texts, and Ti represents the i-th record text; T' = {t1, t2, t3, …, tj, …, tlen(T')}, where tj represents the j-th record question-answer pair and len(T') denotes the number of question-answer pairs;
step 3, vectorizing the data set T' by using an ERNIE model;
step 3.1, first processing the data set into the data format required by the ERNIE model, and using a split function to divide each tj into two parts, content and category;
step 3.2, performing word segmentation on the content using the tokenizer in the model to obtain W = {w1, w2, w3, …, wk, …, wlen(W)}, where W denotes one sample (i.e., the content) in the data set and wk represents the k-th word in the sentence;
step 3.3, prepending [CLS] to the beginning of sample W to obtain W', and setting the fixed sentence length seg_size;
step 3.4, converting W' into a vector W'' using the model; if the length of W'' is less than seg_size it is padded with 0, otherwise it is truncated;
step 3.5, storing the processed data so that it only needs to be loaded each time the model is debugged, improving the efficiency of data processing;
step 3.6, establishing an iterator with batch samples per group, and then feeding the processed data into the ERNIE model to obtain a word vector sequence;
step 4, performing convolution, pooling, and related operations on the vectorized sequence using a DPCNN model;
step 4.1, performing two convolution operations on the word vector sequence obtained from ERNIE training using the DPCNN model, activating with the nonlinear activation function ReLU after each convolution, and performing a padding operation (boundary filling) to prevent boundary information from being lost;
step 4.2, after the operation of 4.1, performing a max-pooling operation on the obtained data to obtain a feature vector X;
step 4.3, setting a judgment threshold length; if the actual sentence length is greater than length, performing two convolution operations on the feature vector X, applying ReLU and padding to the convolved data each time to obtain a feature vector F, and performing feature fusion of X and F to obtain a feature vector X';
step 4.4, performing a max-pooling operation on the new feature vector X' to obtain a feature vector X'', and reducing the dimension of X'' through a fully connected layer to obtain a feature vector X''' = {x1, x2, x3, …, xN}, where N represents the number of text categories;
step 4.5, inputting the feature vector X''' into the softmax layer to obtain the predicted question-answer category probability values P = {p1, p2, p3, …, pm, …, pN}, where pm represents the probability value of the m-th category; Max(P) gives the final classification of the question-answer pair;
step 5, training the ERNIE-DPCNN model using a training set and a validation set;
Parameters in the model are optimized by gradient descent, cross entropy is defined as the loss function of the model, and an improve_num parameter is set; if the model's performance has not improved after training exceeds this value, training of the model is finished;
step 6, probabilistic classification of the record question-answer text;
Data to be predicted is input into the trained ERNIE-DPCNN model for processing; the feature vector obtained from the model is input into the softmax layer and, as in step 4.5, the category with the maximum probability is taken as the category of the predicted data.
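As a minimal illustration of this prediction step, the following sketch ties the stages together; encode, ernie, and dpcnn_head are hypothetical names for the components sketched in the detailed description below, not APIs disclosed by the invention.

```python
import torch

# Sketch of step 6: classifying a new record question-answer pair with the
# trained ERNIE-DPCNN model. encode, ernie, and dpcnn_head are hypothetical
# components (sketched in the detailed description below), not fixed APIs.
def predict(text, ernie, dpcnn_head, id2label):
    input_ids, input_mask, _ = encode(text)           # preprocessing as in step 3
    with torch.no_grad():
        vecs = ernie(torch.tensor([input_ids]))       # word vector sequence
        probs = dpcnn_head(vecs)                      # P = {p1, ..., pN} via softmax
    return id2label[int(probs.argmax(dim=1))]         # Max(P) -> predicted category
```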
Drawings
FIG. 1 is a schematic flow chart of a record question-answer classification method based on ERNIE and DPCNN of the present invention;
FIG. 2 is an internal structure diagram of the ERNIE-DPCNN model of the present invention.
Detailed Description
A specific implementation process of the present invention is described below with reference to FIG. 1 and FIG. 2:
(1) Record question-answer classification
As shown in FIG. 1, the question-answer pairs in the record text are first divided into categories such as "personal situation", "crime process", and "physical condition". The category divisions differ across case record types: for example, drug case records have categories such as "drug source" and "drug-taking method", while theft case records have categories such as "theft means" and "theft location";
(2) data pre-processing
Since in the original record a question and its answer may appear as separate adjacent entries, the original record text T is preprocessed, where T = {T1, T2, T3, …, Ti, …, Tlen(T)}, len(T) denotes the number of record texts, and Ti represents the i-th record text; the questions and answers in the record are grouped into pairs to obtain the data set T' = {t1, t2, t3, …, tj, …, tlen(T')}, where tj represents the j-th record question-answer pair and len(T') denotes the number of question-answer pair samples;
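As one possible illustration of this grouping, a minimal sketch follows; the line markers "问:" and "答:" and the line format are assumptions for illustration, since the actual record layout is not fixed by the invention.

```python
# Hypothetical sketch of the pairing in step (2). The line markers "问:"
# (question) and "答:" (answer) are assumed; real records may differ.
def build_qa_pairs(record_lines):
    """Group adjacent question/answer lines of one record text Ti into pairs tj."""
    pairs, question = [], None
    for line in record_lines:
        line = line.strip()
        if line.startswith("问:"):               # a new question starts
            question = line[len("问:"):]
        elif line.startswith("答:") and question is not None:
            answer = line[len("答:"):]
            pairs.append((question, answer))      # one question-answer pair tj
            question = None
    return pairs

# Example: two lines of a record text yield one question-answer pair.
demo = ["问:案发当晚你在哪里?", "答:我在家里。"]
print(build_qa_pairs(demo))  # [('案发当晚你在哪里?', '我在家里。')]
```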
(3) vectorization of data set T' based on ERNIE model
(3.1) The data set is processed into the data format required by the ERNIE model, namely seg_len, input_ids, input_mask, and label, where seg_len denotes the actual sentence length, input_ids represents each word of the sentence by its id in the ERNIE vocabulary, input_mask marks each real word position in the sentence with 1, and label is the id of the sentence's label in the corresponding category vocabulary. Then each tj is divided into two parts, content and category, using a split function;
(3.2) Word segmentation is performed using the tokenizer in the model to obtain W = {w1, w2, w3, …, wk, …, wlen(W)}, where W denotes the content of each sample in the data set and wk represents the k-th word in the sentence;
(3.3) [CLS] is prepended to sample W to obtain W', and the fixed sentence length seg_size is set; W' is converted into the vector W'' (the input_ids) using the ERNIE model, padded with 0 if the length of W'' is less than seg_size and truncated otherwise; input_mask marks each real word position in the sentence with 1 and fills the remaining positions with 0;
(3.4) An iterator is established with batch samples per group, and the processed data is fed into the ERNIE model for training to obtain a word vector sequence;
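A minimal sketch of steps (3.1)-(3.4) is given below; the toy vocabulary, the character-level split, and the pad/unknown ids are assumptions for illustration and do not reproduce the actual ERNIE vocabulary or tokenizer.

```python
# Minimal sketch of steps (3.1)-(3.4): tokenize, prepend [CLS], convert to
# input_ids via a vocabulary, and pad or truncate to seg_size. The toy
# vocabulary below is an assumption; the real ERNIE model ships its own.
vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2}      # hypothetical tiny vocabulary

def encode(content, seg_size=32):
    words = list(content)                          # character-level split, common for Chinese
    words = ["[CLS]"] + words                      # step 3.3: prepend [CLS]
    ids = [vocab.get(w, vocab["[UNK]"]) for w in words]
    seg_len = min(len(ids), seg_size)              # actual sentence length
    ids = ids[:seg_size] + [vocab["[PAD]"]] * max(0, seg_size - len(ids))  # pad/truncate
    input_mask = [1] * seg_len + [0] * (seg_size - seg_len)  # 1 marks real words
    return ids, input_mask, seg_len

input_ids, input_mask, seg_len = encode("你为什么要这样做?")
```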
(4) performing operations such as convolution, pooling and the like on word vector sequence based on DPCNN model
(4.1) Two convolution operations are performed on the word vector sequence obtained from ERNIE training using the DPCNN model, as shown in FIG. 2, with convolution kernels of size 3 and 250 convolution kernels; the size n_out of the word vector sequence after convolution is given by

n_out = ⌊(n_in + 2p - f) / s⌋ + 1 (1)

where n_in represents the size of the original word vector sequence, p the padding size, f the convolution kernel size, and s the stride. After each convolution the nonlinear activation function ReLU is applied; in addition, to prevent boundary information from being lost, a padding operation (boundary filling) is required;
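As a quick check on formula (1), the following helper (a sketch, with parameter defaults chosen for illustration) computes the post-convolution length:

```python
# Sketch: output length of a one-dimensional convolution per formula (1).
def conv_out_len(n_in, f=3, p=1, s=1):
    return (n_in + 2 * p - f) // s + 1

# With kernel size f=3, padding p=1, and stride s=1 the length is preserved:
assert conv_out_len(32) == 32
# The down-sampling pooling of step (4.2) (size 3, stride 2) roughly halves it:
assert conv_out_len(32, f=3, p=0, s=2) == 15
```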
(4.2) After the above operations, a max-pooling (down-sampling) operation is performed on the resulting data using a pooling layer of size 3 with stride 2; that is, the pooling layer takes the maximum of every 3 consecutive internal vectors to generate a new internal representation of the document, and after this operation the feature vector X is obtained;
(4.3) length is set as a judgment condition; if the actual sentence length is greater than length, two convolution operations are performed on the feature vector X, as shown in FIG. 2, with ReLU and padding applied to the data after each convolution to obtain a feature vector F, and the feature vectors X and F are fused to obtain a feature vector X';
(4.4) Max pooling is performed on the new feature vector X' to obtain a feature vector X'', and the dimension of X'' is reduced through a fully connected layer to obtain the reduced feature vector X''' = {x1, x2, x3, …, xN}, where N stands for the number of text categories; the feature vector X''' is input into the softmax layer, formulated as

pm = exp(xm) / Σj exp(xj), j = 1, …, N (2)

which yields the predicted question-answer category probability values P = {p1, p2, p3, …, pm, …, pN}, where pm represents the probability value of the m-th category; Max(P) gives the final classification of the question-answer pair;
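A condensed PyTorch sketch of the computation in (4.1)-(4.4) follows; the kernel size 3, the 250 kernels, and the pooling of size 3 with stride 2 come from the text, while the embedding dimension, class count, and weight sharing across the repeated blocks are assumptions made for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DPCNNHead(nn.Module):
    """Sketch of steps (4.1)-(4.4): two convolutions, down-sampling pooling,
    repeated residual blocks while the sequence is long enough, then a fully
    connected layer and softmax. Weight sharing across repeats is a
    simplification, not mandated by the text."""
    def __init__(self, emb_dim=768, n_filters=250, n_classes=10):
        super().__init__()
        self.region = nn.Conv1d(emb_dim, n_filters, kernel_size=3, padding=1)
        self.conv = nn.Conv1d(n_filters, n_filters, kernel_size=3, padding=1)
        self.fc = nn.Linear(n_filters, n_classes)

    def forward(self, x):                      # x: (batch, seq_len, emb_dim) from ERNIE
        x = x.transpose(1, 2)                  # -> (batch, emb_dim, seq_len)
        x = F.relu(self.region(x))             # step 4.1: two convolutions with ReLU
        x = F.relu(self.conv(x))
        x = F.max_pool1d(x, kernel_size=3, stride=2)  # step 4.2: pooling -> X
        while x.size(2) > 2:                   # step 4.3: sentence still long enough
            f = F.relu(self.conv(x))           # two more convolutions -> F
            f = F.relu(self.conv(f))
            x = F.max_pool1d(x + f, kernel_size=3, stride=2)  # fuse X and F, pool (4.4)
        x = x.max(dim=2).values                # collapse remaining positions
        return F.softmax(self.fc(x), dim=1)    # step 4.5: probability values P

probs = DPCNNHead()(torch.randn(4, 32, 768))   # e.g. a batch of 4 encoded sentences
```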
(5) ERNIE-DPCNN model training
Parameters in the model are optimized by gradient descent, and cross entropy is defined as the loss function loss of the model; the multi-class cross-entropy formula is as follows:
H(D,Y)=-∑D(x)log Y(x) (3)
where D is the true value and Y is the predicted value; an improve_num parameter is set, and when model training exceeds this value without improvement in model performance, training of the model ends.
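A training-loop sketch consistent with step (5) is given below; the SGD optimizer, learning rate, per-batch validation, and data loaders are assumptions, while the cross-entropy loss of formula (3) and the improve_num early-stopping rule follow the text.

```python
import torch
import torch.nn as nn

def train(model, train_loader, val_loader, improve_num=1000):
    """Sketch of step (5): gradient descent with the multi-class cross entropy
    of formula (3), H(D, Y) = -sum D(x) log Y(x), and early stopping once the
    validation loss has not improved for improve_num batches. `model` is
    assumed to output probabilities P (as in the DPCNN sketch above)."""
    nll = nn.NLLLoss()                         # applied to log P -> cross entropy
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
    best_loss, last_improve, step = float("inf"), 0, 0
    while True:
        for inputs, labels in train_loader:
            optimizer.zero_grad()
            loss = nll(torch.log(model(inputs) + 1e-12), labels)
            loss.backward()
            optimizer.step()
            step += 1
            with torch.no_grad():              # validate (every batch, for brevity)
                val_loss = sum(nll(torch.log(model(x) + 1e-12), y).item()
                               for x, y in val_loader) / len(val_loader)
            if val_loss < best_loss:
                best_loss, last_improve = val_loss, step
            elif step - last_improve > improve_num:
                return model                   # early stopping: no recent improvement
```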

Claims (4)

1. A record question-answer classification method based on ERNIE and DPCNN is characterized by comprising the following steps:
step 1, classifying the record question-answer pairs;
step 2, preprocessing data;
step 3, vectorizing the data set T' based on the ERNIE model;
step 4, performing convolution, pooling, and related operations on the word vector sequence based on the DPCNN model;
and 5, training an ERNIE-DPCNN model.
2. The ERNIE and DPCNN-based record question-answer classification method according to claim 1, characterized in that in step 3 the data set is processed into the data format required by the ERNIE model, namely seg_len, input_ids, input_mask, and label, where seg_len denotes the actual sentence length, input_ids represents each word of the sentence by its id in the ERNIE vocabulary, input_mask marks each real word position in the sentence with 1, and label is the id of the sentence's label in the corresponding category vocabulary; then each tj is divided into two parts, content and category, using a split function; the specific content of this step is as follows:
word segmentation is performed using the tokenizer in the model to obtain W = {w1, w2, w3, …, wk, …, wlen(W)}, where W denotes each sample (i.e., the content) in the data set and wk represents the k-th word in the sentence;
[CLS] is prepended to sample W to obtain W', the fixed sentence length seg_size is set, and W' is converted into the vector W'' (the input_ids) using the model; if the length of W'' is less than seg_size it is padded with 0, otherwise it is truncated; input_mask marks each real word position in the sentence with 1 and fills the remaining positions with 0;
an iterator is established with batch samples per group, and the processed data is then fed into the ERNIE model for training to obtain a word vector sequence.
3. The ERNIE and DPCNN-based record question-answer classification method according to claim 1, characterized in that in step 4 the DPCNN model is used to perform convolution, pooling, and related operations on the word vector sequence obtained from ERNIE training, with the following specific content:
first, two convolution operations are performed on the word vector sequence, with convolution kernels of size 3 and 250 convolution kernels, the size of the word vector sequence after convolution being n_out; after each convolution the nonlinear activation function ReLU is applied, and in addition, to prevent boundary information from being lost, a padding operation (boundary filling) is required;
after these operations, a max-pooling (down-sampling) operation is performed on the resulting data using a pooling layer of size 3 with stride 2, i.e., the pooling layer takes the maximum of every 3 consecutive internal vectors to generate a new internal representation of the document, and after this operation the feature vector X is obtained;
length is set as a judgment condition; if the actual sentence length is greater than length, two convolution operations are performed on the feature vector X, with ReLU and padding likewise applied after each convolution to obtain a feature vector F; the feature vectors X and F are fused to obtain a feature vector X'; max pooling is performed on the new feature vector X' to obtain a feature vector X'', and the dimension of X'' is reduced through a fully connected layer to obtain the reduced feature vector X''' = {x1, x2, x3, …, xN}, where N represents the number of text categories; the feature vector X''' is input into the softmax layer to obtain the predicted question-answer category probability values P = {p1, p2, p3, …, pm, …, pN}, where pm represents the probability value of the m-th category and Max(P) is the final classification of the question-answer pair.
4. The method according to claim 1, characterized in that in step 5, in training the ERNIE-DPCNN model, parameters in the model are optimized by gradient descent, cross entropy is defined as the loss function loss of the model, and an improve_num parameter is set; when the number of training iterations exceeds this value without improvement in model performance, model training ends.
CN202010654746.XA 2020-07-09 2020-07-09 Record question-answer classification method based on ERNIE and DPCNN Pending CN111813938A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010654746.XA CN111813938A (en) 2020-07-09 2020-07-09 Record question-answer classification method based on ERNIE and DPCNN

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010654746.XA CN111813938A (en) 2020-07-09 2020-07-09 Record question-answer classification method based on ERNIE and DPCNN

Publications (1)

Publication Number Publication Date
CN111813938A true CN111813938A (en) 2020-10-23

Family

ID=72843254

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010654746.XA Pending CN111813938A (en) 2020-07-09 2020-07-09 Record question-answer classification method based on ERNIE and DPCNN

Country Status (1)

Country Link
CN (1) CN111813938A (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180240012A1 (en) * 2017-02-17 2018-08-23 Wipro Limited Method and system for determining classification of text
CN111160017A (en) * 2019-12-12 2020-05-15 北京文思海辉金信软件有限公司 Keyword extraction method, phonetics scoring method and phonetics recommendation method

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180240012A1 (en) * 2017-02-17 2018-08-23 Wipro Limited Method and system for determining classification of text
CN111160017A (en) * 2019-12-12 2020-05-15 北京文思海辉金信软件有限公司 Keyword extraction method, phonetics scoring method and phonetics recommendation method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
RIE JOHNSON et al.: "Deep Pyramid Convolutional Neural Networks for Text Categorization", Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics *
LI Zhoujun et al.: "A Survey of Pre-training Techniques for Natural Language Processing", Computer Science (《计算机科学》) *

Similar Documents

Publication Publication Date Title
CN108984526B (en) Document theme vector extraction method based on deep learning
CN109766277B (en) Software fault diagnosis method based on transfer learning and DNN
CN112115238B (en) Question-answering method and system based on BERT and knowledge base
CN109766544B (en) Document keyword extraction method and device based on LDA and word vector
CN113254599A (en) Multi-label microblog text classification method based on semi-supervised learning
CN112231447B (en) Method and system for extracting Chinese document events
CN111488739A (en) Implicit discourse relation identification method based on multi-granularity generated image enhancement representation
CN109684626A (en) Method for recognizing semantics, model, storage medium and device
CN110647612A (en) Visual conversation generation method based on double-visual attention network
CN111966812B (en) Automatic question answering method based on dynamic word vector and storage medium
CN110750635B (en) French recommendation method based on joint deep learning model
CN110851594A (en) Text classification method and device based on multi-channel deep learning model
CN109492105B (en) Text emotion classification method based on multi-feature ensemble learning
CN112434164B (en) Network public opinion analysis method and system taking topic discovery and emotion analysis into consideration
KR20200105057A (en) Apparatus and method for extracting inquiry features for alalysis of inquery sentence
Sheshikala et al. Natural language processing and machine learning classifier used for detecting the author of the sentence
CN112036705A (en) Quality inspection result data acquisition method, device and equipment
CN112100212A (en) Case scenario extraction method based on machine learning and rule matching
CN113988079A (en) Low-data-oriented dynamic enhanced multi-hop text reading recognition processing method
CN114742047A (en) Text emotion recognition method based on maximum probability filling and multi-head attention mechanism
CN111159405B (en) Irony detection method based on background knowledge
CN112632272A (en) Microblog emotion classification method and system based on syntactic analysis
CN116821297A (en) Stylized legal consultation question-answering method, system, storage medium and equipment
Nouhaila et al. Arabic sentiment analysis based on 1-D convolutional neural network
Zhao et al. Commented content classification with deep neural network based on attention mechanism

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20201023