CN113468324A - Text classification method and system based on BERT pre-training model and convolutional network - Google Patents
Text classification method and system based on BERT pre-training model and convolutional network
- Publication number
- CN113468324A (application CN202110621401.9A)
- Authority
- CN
- China
- Prior art keywords
- text
- data
- training
- neural network
- bert pre
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
Abstract
The invention provides a text classification method and system based on a BERT pre-training model and a convolutional network, which comprises the following steps: step 1: acquiring paper text data of each field and the category label data of the field to which each paper belongs from a database; step 2: removing noise in the text, dividing the paper text data and category label data into a training set and a test set, and storing them in text files; step 3: performing word embedding on the text data in the training set and the test set by using a BERT pre-training model; step 4: embedding words of each text segment into a text matrix as the input of a convolutional neural network, and performing text feature extraction on the sentence matrix by using the convolutional neural network; step 5: inputting the features extracted by the convolutional neural network into a fully-connected neural network layer for classification. The invention performs classification through the fully-connected neural network, so that users can quickly and accurately classify paper documents according to subject field.
Description
Technical Field
The invention relates to the technical field of deep learning and natural language processing, in particular to a text classification method and system based on a BERT pre-training model and a convolution network.
Background
Patent document CN112487189A (application number: CN202011445448.6) discloses a graph convolutional network enhanced implicit discourse relation classification method for electronic texts. The method first introduces a BERT pre-training model to provide more effective dynamic word vector representations, thereby improving the overall representation at the discourse level; it then introduces a graph neural network to model word-level relations between sentences, so that the implicit relation type between sentence pairs can be predicted more accurately.
At present, there are many technologies for classifying texts based on deep learning methods, for example, using only a traditional convolutional neural network or a recurrent neural network for text classification. However, these methods struggle to find numerical representations that effectively capture the text for computer calculation, and their classification accuracy is low. Since the release of the BERT pre-training model, applying it to natural language processing tasks has greatly improved their accuracy. The BERT pre-training model essentially learns the semantic features of the whole text by letting each word embedding acquire information from the entire input text through a multi-head attention mechanism. In the BERT paper, the authors recommend using the embedding of the special token at the beginning of the input sentence for classification, but this approach does not effectively utilize the overall information of the text and reduces the accuracy of text classification. Therefore, the invention adds a convolution layer on top of the BERT pre-training model to extract feature information of the whole text, and finally uses a fully-connected neural network layer to classify the extracted features, thereby improving classification precision.
Disclosure of Invention
In view of the defects in the prior art, the invention aims to provide a text classification method and system based on a BERT pre-training model and a convolutional network.
The text classification method based on the BERT pre-training model and the convolution network provided by the invention comprises the following steps:
step 1: acquiring thesis text data of each field and category label data of the field to which the thesis belongs in a database;
step 2: removing noise in the text, dividing paper text data and category label data into a training set and a test set, and storing the training set and the test set in a text file;
step 3: word embedding is carried out on text data in a training set and a testing set by using a BERT pre-training model, and each word in the text is represented by using different vectors;
step 4: embedding words of each text segment into a text matrix as input of a convolutional neural network, and performing text feature extraction on the sentence matrix by using the convolutional neural network;
step 5: inputting the features extracted by the convolutional neural network into a fully-connected neural network layer for classification.
Preferably, the unique identification of all domain papers is obtained from a database: paper ID data;
and acquiring the text data and the category label data of the paper from the database according to the acquired paper ID data, and storing the text data and the category label data in the memory.
Preferably, the text data stored in the memory is subjected to data cleaning, and characters and spaces which do not meet preset requirements in the text are removed.
Preferably, the cleaned text data and the category label data are in one-to-one correspondence;
all data are divided into a training data set and a testing data set according to a ratio of seven to three, and finally stored in a text file.
Preferably, a network is built by using the PyTorch deep learning framework, a BERT pre-training model is loaded, word embedding is performed on the text data of the training set and the test set, and each word in the text is represented by a different high-dimensional vector.
The text classification system based on the BERT pre-training model and the convolution network provided by the invention comprises:
module M1: acquiring thesis text data of each field and category label data of the field to which the thesis belongs in a database;
module M2: removing noise in the text, dividing paper text data and category label data into a training set and a test set, and storing the training set and the test set in a text file;
module M3: word embedding is carried out on text data in a training set and a testing set by using a BERT pre-training model, and each word in the text is represented by using different vectors;
module M4: embedding words of each text segment into a text matrix as input of a convolutional neural network, and performing text feature extraction on a sentence matrix by using the convolutional neural network;
module M5: and inputting the features extracted by the convolutional neural network into a fully-connected neural network layer for classification.
Preferably, the unique identification of all domain papers is obtained from a database: paper ID data;
and acquiring the text data and the category label data of the paper from the database according to the acquired paper ID data, and storing the text data and the category label data in the memory.
Preferably, the text data stored in the memory is subjected to data cleaning, and characters and spaces which do not meet preset requirements in the text are removed.
Preferably, the cleaned text data and the category label data are in one-to-one correspondence;
all data are divided into a training data set and a testing data set according to a ratio of seven to three, and finally stored in a text file.
Preferably, a network is built by using the PyTorch deep learning framework, a BERT pre-training model is loaded, word embedding is performed on the text data of the training set and the test set, and each word in the text is represented by a different high-dimensional vector.
Compared with the prior art, the invention has the following beneficial effects:
(1) The paper text classification technique of the invention, based on the BERT pre-training model and the convolutional network, can embed paper texts from every field and represent the text in a form the computer can compute with;
(2) The invention uses a convolutional neural network structure to extract text features, so that a computer can obtain the semantic features of each text segment, which facilitates subsequent processing and application of the text semantics;
(3) The invention classifies the features extracted by the convolutional neural network with a fully-connected neural network, so that users can quickly and accurately classify paper documents according to subject field.
Drawings
Other features, objects and advantages of the invention will become more apparent upon reading of the detailed description of non-limiting embodiments with reference to the following drawings:
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a schematic diagram of embedding of an input text by the BERT pre-training model;
FIG. 3 is a schematic diagram of the convolution of a text embedding matrix;
FIG. 4 is a schematic diagram of classification by the fully-connected neural network.
Detailed Description
The present invention will be described in detail with reference to specific examples. The following examples will assist those skilled in the art in further understanding the invention, but do not limit the invention in any way. It should be noted that various changes and modifications that would be obvious to those skilled in the art can be made without departing from the spirit of the invention, all of which fall within the scope of the present invention.
Example 1
The paper text classification method based on the BERT pre-training model and the convolutional network provided by the invention comprises the following steps:
step 1: acquiring the text information (including titles and abstracts) of papers in various fields and the category label information of the fields to which the papers belong from a database;
step 2: removing noise in the text, namely some characters without practical significance or redundant space characters and the like in the text, dividing the paper text data and the category label data into a training set and a test set, and storing the training set and the test set in the text file;
step 3: word embedding is performed on text data in the training set and the test set using a BERT (Bidirectional Encoder Representations from Transformers) pre-training model, i.e. each word in the text is represented by a different vector;
step 4: the word embeddings of each text segment form a text matrix used as the input of a CNN (convolutional neural network), and text feature extraction is performed on the sentence matrix by using the convolutional neural network;
step 5: the features extracted by the convolutional neural network are input into a basic fully-connected neural network layer for classification.
Specifically, the step 1 includes:
step 1.1: acquiring unique identifications of all domain papers, namely ID information of all the papers, from a database;
step 1.2: and acquiring the text data and the category label data of the papers from the database according to the acquired paper ID data, and storing the text data and the category label data in a memory.
Specifically, the step 2 includes:
step 2.1: carrying out data cleaning on the text data stored in the memory, and removing some characters without practical significance and redundant spaces in the text;
step 2.2: the cleaned text data and the corresponding category labels are in one-to-one correspondence;
step 2.3: and all data are divided into a training data set and a testing data set according to a ratio of seven to three, and finally stored in a text file.
Specifically, the step 3 includes:
step 3.1: text data in the training set and the test set is word-embedded using a BERT pre-training model, i.e., a high-dimensional vector is used to represent each word in the text.
Specifically, the step 4 includes:
step 4.1: embedding words of each piece of paper text data into a text matrix to be used as computer data representation of the text data;
step 4.2: the text matrix is used as input of CNN (convolutional neural network), and the convolutional neural network is used for extracting the characteristics of the text from the sentence matrix.
Specifically, the step 5 includes:
step 5.1: and inputting the features extracted by the convolutional neural network into a basic fully-connected neural network layer for classification.
The system for classifying the thesis text based on the BERT pre-training model and the convolutional network comprises the following modules:
module M1: acquiring the text information (including titles and abstracts) of papers in various fields and the category label information of the fields to which the papers belong from a database;
module M2: removing noise in the text, namely some characters without practical significance or redundant space characters and the like in the text, dividing the paper text data and the category label data into a training set and a test set, and storing the training set and the test set in the text file;
module M3: word embedding is performed on text data in the training set and the test set using a BERT (Bidirectional Encoder Representations from Transformers) pre-training model, i.e. each word in the text is represented by a different vector;
module M4: embedding words of each text segment into a text matrix as input of a CNN (convolutional neural network), and performing text feature extraction on a sentence matrix by using the convolutional neural network;
module M5: and inputting the features extracted by the convolutional neural network into a basic fully-connected neural network layer for classification.
Specifically, the module M1 includes:
step M1.1: acquiring unique identifications of all domain papers, namely ID information of all the papers, from a database;
step M1.2: and acquiring the text data and the category label data of the papers from the database according to the acquired paper ID data, and storing the text data and the category label data in a memory.
Specifically, the module M2 includes:
step M2.1: carrying out data cleaning on the text data stored in the memory, and removing some characters without practical significance and redundant spaces in the text;
step M2.2: the cleaned text data and the corresponding category labels are in one-to-one correspondence;
step M2.3: and all data are divided into a training data set and a testing data set according to a ratio of seven to three, and finally stored in a text file.
Specifically, the module M3 includes:
step M3.1: text data in the training set and the test set is word-embedded using a BERT pre-training model, i.e., a high-dimensional vector is used to represent each word in the text.
Specifically, the module M4 includes:
step M4.1: embedding words of each piece of paper text data into a text matrix to be used as computer data representation of the text data;
step M4.2: the text matrix is used as input of CNN (convolutional neural network), and the convolutional neural network is used for extracting the characteristics of the text from the sentence matrix.
Specifically, the module M5 includes:
step M5.1: and inputting the features extracted by the convolutional neural network into a basic fully-connected neural network layer for classification.
Example 2
Example 2 is a modification of example 1.
The paper text classification technique based on the BERT pre-training model and the convolutional network provided by this embodiment operates on paper titles, abstracts, text data and the corresponding category label data obtained from a database. Word embedding is performed on the paper text data with the BERT pre-training model so that word semantics are represented by vectors, the word vectors form a semantic matrix of the whole text, a convolutional network then extracts features from the text semantic matrix, and finally the extracted semantic features are used to classify the text. Specifically, as shown in FIG. 1, the method comprises the following steps:
step S1: acquiring text information of papers in each field and category label information of the fields to which the papers belong from a database;
step S2: removing noise data in the text, dividing the paper text data and the category label data into a training set and a testing set, and storing the training set and the testing set in a text file;
step S3: word embedding is carried out on the text data in the training set and the testing set by using a BERT pre-training model, namely, each word in the text is represented by using different vectors;
step S4: embedding words of each text segment into a text matrix as input of a convolutional neural network, and performing text feature extraction on a sentence matrix by using the convolutional neural network;
step S5: and inputting the features extracted by the convolutional neural network into a basic fully-connected neural network layer for classification.
Step S1 includes: the paper data of all fields, including the title and abstract of each paper, is obtained from the Ncovpaper database. There are two databases in total for obtaining paper text data and category label data: the first is a relational database based on MySQL, and the second is a non-relational database based on ElasticSearch. Both databases store information including the title, abstract and category of each paper. Specifically, the method comprises the following steps:
step S101: the title and summary data of all 240987 papers and the domain category data to which these papers belong are obtained from the Ncovpaper database. After the relevant information of the thesis is acquired, the thesis information is stored in the memory in the following format for later use:
[{“label”:label0,“title”:title0,“abstract”:abstract0},
{“label”:label1,“title”:title1,“abstract”:abstract1},…]
where "label" represents the category of the paper, "title" represents the title information of the paper, and "abstract" represents the abstract information of the paper.
Step S2 includes: removing noise data in the text, then dividing the paper text data and the category label data into a training set and a testing set, and storing the training set and the testing set in the text file, specifically:
step S201: according to the data obtained from the Ncovpaper database, the following steps are carried out on the data:
The title, abstract and category information of each paper are acquired in turn, and the title and abstract are concatenated into a single text segment. A text cleaning function is then used to remove noise from the concatenated string by regular expression matching and replacement. Taking the removal of redundant spaces in the text as an example, with "text" denoting the concatenated string, the example code is as follows:
text = re.sub(r"[ ]+", " ", text)
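A fuller cleaning function can be sketched as follows; it is only illustrative, and the exact set of characters treated as noise is an assumption rather than a requirement of this embodiment:

import re

def clean_text(text):
    # Assumed noise set: control characters and other non-printable symbols are replaced by spaces
    text = re.sub(r"[\x00-\x1f\x7f]", " ", text)
    # Collapse runs of whitespace into a single space and trim both ends
    text = re.sub(r"\s+", " ", text)
    return text.strip()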
step S202: dividing the cleaned text and the category label data into a training set and a test set according to a seven-to-three ratio and storing the training set and the test set in a text file respectively, in order to ensure that the distribution of each category in the training set and the data set is approximately the same, dividing the data of each category according to the seven-to-three ratio, and then merging the data into the training set and the data set respectively, specifically according to the following steps:
classifying all data according to the labels according to the categories, then classifying the data in a seven-to-three ratio in a plurality of categories which are classified, if arrays 'train _ set' and 'test _ set' are used for storing a training set and a test set, the array 'temp' is used for temporarily storing the data of a certain label, and the number of text categories is assumed to be NlabelThen the Python scripts that divide each category are as follows:
for i in range(N_label):
    temp = []
    for elem in raw_data:
        if elem['label'] == i:
            temp.append(elem)
    length = len(temp)
    train_set.extend(temp[:int(length * 0.7)])
    test_set.extend(temp[int(length * 0.7):])
This completes the per-category division of all the data into the training set and the test set.
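As an alternative sketch, the same stratified seven-to-three split could be produced with scikit-learn's train_test_split; this library is not part of the described embodiment and is shown only for comparison:

from sklearn.model_selection import train_test_split

train_set, test_set = train_test_split(
    raw_data,
    test_size=0.3,                                  # seven-to-three ratio
    stratify=[elem["label"] for elem in raw_data],  # keep per-category proportions identical
    random_state=42,
)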
Step S203: the divided training set data and test set data are held in memory and then written to text files in JSON format. A storage example is as follows:
[
{“label”:label0,“title”:title0,“abstract”:abstract0},
{“label”:label1,“title”:title1,“abstract”:abstract1},
…
]
where "label" represents the category of the paper, "title" represents the title information of the paper, and "abstract" represents the abstract information of the paper.
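A minimal sketch of writing the in-memory training set to such a JSON text file (the file name is chosen only for illustration):

import json

with open("train_set.json", "w", encoding="utf-8") as f:
    # ensure_ascii=False keeps non-English titles and abstracts human-readable in the file
    json.dump(train_set, f, ensure_ascii=False, indent=2)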
Step S3 includes: building a network by using the PyTorch deep learning framework and loading a BERT pre-training model, performing word embedding on the text data of the training set and the test set, and representing each word in the text by a different high-dimensional vector. Specifically, the method comprises the following steps:
step S301: building a network by using a Pythrch deep learning framework and loading a BERT pre-training model, wherein the papers comprise multiple languages such as English, Chinese, Spanish and Japanese, and the loaded model is a BERT multi-language pre-training model, and the basic configuration of the model is as follows:
{
"architectures":[
"BertForMaskedLM"
],
"attention_probs_dropout_prob":0.1,
"directionality":"bidi",
"hidden_act":"gelu",
"hidden_dropout_prob":0.1,
"hidden_size":768,
"initializer_range":0.02,
"intermediate_size":3072,
"layer_norm_eps":1e-12,
"max_position_embeddings":512,
"model_type":"bert",
"num_attention_heads":12,
"num_hidden_layers":12,
"pad_token_id":0,
"pooler_fc_size":768,
"pooler_num_attention_heads":12,
"pooler_num_fc_layers":3,
"pooler_size_per_head":128,
"pooler_type":"first_token_transform",
"type_vocab_size":2,
"vocab_size":105879
}
Here, the "architectures" field gives the structure name of the pre-training model, and "attention_probs_dropout_prob" is the probability with which the model randomly zeroes neural network units when extracting features from a sentence. "hidden_size" is the dimension of the vector each word is embedded into, "max_position_embeddings" is the maximum sentence length accepted by the model, "pad_token_id" is the filling value used when a sentence is shorter than the set length, and "vocab_size" is the size of the vocabulary.
Step S302: after the BERT pre-training model is loaded successfully, word embedding is performed on the text data of the training set and the test set, each word in the text is represented by a different high-dimensional vector, and all the resulting vectors are concatenated into a numerical matrix that serves as the input of the next layer. The specific word embedding and concatenation process is shown in FIG. 2.
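A minimal sketch of this embedding step with the Hugging Face transformers library is given below; the checkpoint name bert-base-multilingual-cased and the 512-token maximum length are assumptions consistent with the multilingual configuration above, not a prescription of this embodiment:

import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")  # assumed multilingual checkpoint
bert = BertModel.from_pretrained("bert-base-multilingual-cased")

text = "Paper title. Paper abstract."  # illustrative concatenated title and abstract
inputs = tokenizer(text, return_tensors="pt", truncation=True,
                   padding="max_length", max_length=512)
with torch.no_grad():
    outputs = bert(**inputs)
sentence_matrix = outputs.last_hidden_state  # shape (1, 512, 768): one embedding vector per token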
Step S4 includes: the word embeddings of each text segment obtained with the BERT pre-training model in step S3 form a text matrix, which is then used as the input of a convolutional neural network, and the convolutional neural network performs text feature extraction on the sentence matrix, specifically:
step S401: the convolutional neural network layer is designed because the input of the convolutional network is a sentenceMatrix, so its channel is 1, if Nch annelsRepresents the number of convolution kernels, (N)kw,Nkh) Represents the convolution kernel size, (N)pw,Nph) Representing the pooled size, the convolutional network structure is as follows:
self.cnn_layer = nn.Sequential(
    nn.Conv2d(in_channels=1, out_channels=N_channels, kernel_size=(N_kw, N_kh)),
    nn.PReLU(),
    nn.MaxPool2d(kernel_size=(N_pw, N_ph))
)
where in_channels is the number of input channels, out_channels is the number of output channels (which also equals the number of convolution kernels), and kernel_size is the size of the convolution kernel.
Step S402: all the channel results after convolution and pooling are concatenated into a single vector that serves as the input of the next linear classification layer; the specific flow is shown in FIG. 3.
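A minimal sketch of this forward pass, assuming it sits in the forward method of the module that defines self.cnn_layer above and that batch size, token count and embedding dimension take the illustrative values used earlier:

# sentence_matrix: (batch, tokens, hidden), e.g. (8, 512, 768) output by the BERT layer
x = sentence_matrix.unsqueeze(1)      # (8, 1, 512, 768): add the single input channel expected by Conv2d
feature_maps = self.cnn_layer(x)      # (8, N_channels, H', W') after convolution, PReLU and max-pooling
features = feature_maps.flatten(1)    # concatenate all channel results into one vector per text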
Step S5 includes: inputting the features extracted by the convolutional neural network into a basic fully-connected neural network layer for classification, specifically:
step S501: the output vector dimension of the convolution layer above is higher, so three layers are totally arranged in the linear classification layer, and the high-dimensional vector is classified. The specific network structure flow is shown in fig. 4.
Firstly, the paper text classification technique based on the BERT pre-training model and the convolutional network represents paper texts of various fields as numerical matrices. Secondly, the convolutional network extracts text features from the numerical matrix of each text, so that a computer can capture its semantic features. Finally, the invention uses the fully-connected neural network to classify the text based on the features extracted by the convolutional network, achieving higher classification accuracy and helping users conveniently and quickly perform preliminary classification of large numbers of paper documents.
Those skilled in the art will appreciate that, in addition to implementing the systems, apparatus, and various modules thereof provided by the present invention in purely computer readable program code, the same procedures can be implemented entirely by logically programming method steps such that the systems, apparatus, and various modules thereof are provided in the form of logic gates, switches, application specific integrated circuits, programmable logic controllers, embedded microcontrollers and the like. Therefore, the system, the device and the modules thereof provided by the present invention can be considered as a hardware component, and the modules included in the system, the device and the modules thereof for implementing various programs can also be considered as structures in the hardware component; modules for performing various functions may also be considered to be both software programs for performing the methods and structures within hardware components.
The foregoing description of specific embodiments of the present invention has been presented. It is to be understood that the present invention is not limited to the specific embodiments described above, and that various changes or modifications may be made by one skilled in the art within the scope of the appended claims without departing from the spirit of the invention. The embodiments and features of the embodiments of the present application may be combined with each other arbitrarily without conflict.
Claims (10)
1. A text classification method based on a BERT pre-training model and a convolution network is characterized by comprising the following steps:
step 1: acquiring thesis text data of each field and category label data of the field to which the thesis belongs in a database;
step 2: removing noise in the text, dividing paper text data and category label data into a training set and a test set, and storing the training set and the test set in a text file;
step 3: word embedding is carried out on text data in a training set and a testing set by using a BERT pre-training model, and each word in the text is represented by using different vectors;
step 4: embedding words of each text segment into a text matrix as input of a convolutional neural network, and performing text feature extraction on the sentence matrix by using the convolutional neural network;
step 5: inputting the features extracted by the convolutional neural network into a fully-connected neural network layer for classification.
2. The text classification method based on the BERT pre-training model and the convolutional network as claimed in claim 1, wherein the unique identifiers of all domain papers are obtained from the database: paper ID data;
and acquiring the text data and the category label data of the paper from the database according to the acquired paper ID data, and storing the text data and the category label data in the memory.
3. The text classification method based on the BERT pre-training model and the convolutional network as claimed in claim 1, wherein the text data stored in the memory is subjected to data cleaning to remove characters and spaces in the text which do not meet the preset requirements.
4. The text classification method based on the BERT pre-training model and the convolutional network as claimed in claim 3, wherein the text data after cleaning and the class label data are in one-to-one correspondence;
all data are divided into a training data set and a testing data set according to a ratio of seven to three, and finally stored in a text file.
5. The text classification method based on the BERT pre-training model and the convolutional network as claimed in claim 1, wherein a network is built by using a PyTorch deep learning framework and the BERT pre-training model is loaded, text data of the training set and the test set are word-embedded, and each word in the text is represented by a different high-dimensional vector.
6. A text classification system based on a BERT pre-trained model and a convolutional network, comprising:
module M1: acquiring thesis text data of each field and category label data of the field to which the thesis belongs in a database;
module M2: removing noise in the text, dividing paper text data and category label data into a training set and a test set, and storing the training set and the test set in a text file;
module M3: word embedding is carried out on text data in a training set and a testing set by using a BERT pre-training model, and each word in the text is represented by using different vectors;
module M4: embedding words of each text segment into a text matrix as input of a convolutional neural network, and performing text feature extraction on a sentence matrix by using the convolutional neural network;
module M5: and inputting the features extracted by the convolutional neural network into a fully-connected neural network layer for classification.
7. The text classification system based on the BERT pre-training model and the convolutional network as claimed in claim 6, wherein the unique identifiers of all domain papers are obtained from the database: paper ID data;
and acquiring the text data and the category label data of the paper from the database according to the acquired paper ID data, and storing the text data and the category label data in the memory.
8. The text classification system based on the BERT pre-training model and the convolutional network as claimed in claim 6, wherein the text data stored in the memory is subjected to data cleaning to remove characters and spaces in the text which do not meet the preset requirements.
9. The text classification system based on the BERT pre-trained model and the convolutional network as claimed in claim 8, wherein the cleaned text data and the class label data are in one-to-one correspondence;
all data are divided into a training data set and a testing data set according to a ratio of seven to three, and finally stored in a text file.
10. The text classification system based on the BERT pre-training model and the convolutional network as claimed in claim 6, wherein a network is built by using a PyTorch deep learning framework and the BERT pre-training model is loaded, text data of the training set and the test set are word-embedded, and each word in the text is represented by a different high-dimensional vector.
Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
---|---|---|---
CN202110621401.9A | 2021-06-03 | 2021-06-03 | Text classification method and system based on BERT pre-training model and convolutional network

Publications (1)

Publication Number | Publication Date
---|---
CN113468324A (en) | 2021-10-01

Family

ID=77872216

Country Status (1)

Country | Link
---|---
CN (1) | CN113468324A (en)
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110334210A (en) * | 2019-05-30 | 2019-10-15 | 哈尔滨理工大学 | A kind of Chinese sentiment analysis method merged based on BERT with LSTM, CNN |
CN111177376A (en) * | 2019-12-17 | 2020-05-19 | 东华大学 | Chinese text classification method based on BERT and CNN hierarchical connection |
CN111651605A (en) * | 2020-06-04 | 2020-09-11 | 电子科技大学 | Lung cancer leading edge trend prediction method based on multi-label classification |
CN112100387A (en) * | 2020-11-13 | 2020-12-18 | 支付宝(杭州)信息技术有限公司 | Training method and device of neural network system for text classification |
CN112100388A (en) * | 2020-11-18 | 2020-12-18 | 南京华苏科技有限公司 | Method for analyzing emotional polarity of long text news public sentiment |
CN112765358A (en) * | 2021-02-23 | 2021-05-07 | 西安交通大学 | Taxpayer industry classification method based on noise label learning |
CN112861524A (en) * | 2021-04-07 | 2021-05-28 | 中南大学 | Deep learning-based multilevel Chinese fine-grained emotion analysis method |
Non-Patent Citations (2)

Title |
---|
Zhang Nan: "Deep Learning Natural Language Processing in Practice" (《深度学习自然语言处理实战》), 31 August 2020 *
Gao Zhiqiang, China Railway Publishing House Co., Ltd. *
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114154493A (en) * | 2022-01-28 | 2022-03-08 | 北京芯盾时代科技有限公司 | Short message category identification method and device |
CN115658886A (en) * | 2022-09-20 | 2023-01-31 | 广东技术师范大学 | Intelligent liver cancer staging method, system and medium based on semantic text |
CN117455421A (en) * | 2023-12-25 | 2024-01-26 | 杭州青塔科技有限公司 | Subject classification method and device for scientific research projects, computer equipment and storage medium |
CN117455421B (en) * | 2023-12-25 | 2024-04-16 | 杭州青塔科技有限公司 | Subject classification method and device for scientific research projects, computer equipment and storage medium |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication |
| SE01 | Entry into force of request for substantive examination |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20211001