CN113468324A - Text classification method and system based on BERT pre-training model and convolutional network - Google Patents
Text classification method and system based on BERT pre-training model and convolutional network
- Publication number
- CN113468324A (application CN202110621401.9A)
- Authority
- CN
- China
- Prior art keywords
- text
- data
- training
- neural network
- bert pre
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
Abstract
The invention provides a text classification method and system based on a BERT pre-training model and a convolutional network, which comprises the following steps: step 1: acquiring paper text data of each field and the category label data of the field to which each paper belongs from a database; step 2: removing noise in the text, dividing the paper text data and category label data into a training set and a test set, and storing them in text files; step 3: performing word embedding on the text data in the training set and the test set by using a BERT pre-training model; step 4: embedding words of each text segment into a text matrix as the input of a convolutional neural network, and performing text feature extraction on the sentence matrix by using the convolutional neural network; step 5: inputting the features extracted by the convolutional neural network into a fully-connected neural network layer for classification. The invention performs classification through the fully-connected neural network, so that users can quickly and accurately classify paper documents according to subject field.
Description
Technical Field
The invention relates to the technical field of deep learning and natural language processing, in particular to a text classification method and system based on a BERT pre-training model and a convolution network.
Background
Patent document CN112487189A (application number: CN202011445448.6) discloses a graph convolutional network enhanced implicit discourse relation classification method for electronic texts. The method first introduces a BERT pre-training model to provide more effective dynamic word vector representations, thereby improving the overall representation at the discourse level; it then introduces a graph neural network to model word-level relations between sentences, so that the implicit relation type between sentence pairs can be predicted more accurately.
At present, there are many technologies for classifying texts based on deep learning methods, for example, using only a traditional convolutional neural network or a recurrent neural network for text classification. However, these methods struggle to find numerical representations that effectively capture the text for computer calculation, and their classification accuracy is low. Since the release of the BERT pre-training model, applying it to natural language processing tasks has greatly improved their accuracy. The BERT pre-training model essentially learns the semantic features of the whole text by letting each word embedding acquire information from the entire input text through a multi-head attention mechanism. In the BERT paper, the authors recommend using the embedding of the special token at the beginning of the input sentence for classification, but this approach does not effectively utilize the overall information of the text and reduces the accuracy of text classification. Therefore, the invention adds a convolution layer on top of the BERT pre-training model to extract feature information of the whole text, and finally uses a fully-connected neural network layer to classify the extracted features, thereby improving classification precision.
Disclosure of Invention
In view of the defects in the prior art, the invention aims to provide a text classification method and system based on a BERT pre-training model and a convolutional network.
The text classification method based on the BERT pre-training model and the convolution network provided by the invention comprises the following steps:
step 1: acquiring thesis text data of each field and category label data of the field to which the thesis belongs in a database;
step 2: removing noise in the text, dividing paper text data and category label data into a training set and a test set, and storing the training set and the test set in a text file;
step 3: word embedding is carried out on text data in a training set and a testing set by using a BERT pre-training model, and each word in the text is represented by using different vectors;
step 4: embedding words of each text segment into a text matrix as input of a convolutional neural network, and performing text feature extraction on the sentence matrix by using the convolutional neural network;
step 5: inputting the features extracted by the convolutional neural network into a fully-connected neural network layer for classification.
Preferably, the unique identification of all domain papers is obtained from a database: paper ID data;
and acquiring the text data and the category label data of the paper from the database according to the acquired paper ID data, and storing the text data and the category label data in the memory.
Preferably, the text data stored in the memory is subjected to data cleaning, and characters and spaces which do not meet preset requirements in the text are removed.
Preferably, the cleaned text data and the category label data are in one-to-one correspondence;
all data are divided into a training data set and a testing data set according to a ratio of seven to three, and finally stored in a text file.
Preferably, a network is built by using the PyTorch deep learning framework, a BERT pre-training model is loaded, word embedding is performed on the text data of the training set and the test set, and each word in the text is represented by a different high-dimensional vector.
The text classification system based on the BERT pre-training model and the convolution network provided by the invention comprises:
module M1: acquiring thesis text data of each field and category label data of the field to which the thesis belongs in a database;
module M2: removing noise in the text, dividing paper text data and category label data into a training set and a test set, and storing the training set and the test set in a text file;
module M3: word embedding is carried out on text data in a training set and a testing set by using a BERT pre-training model, and each word in the text is represented by using different vectors;
module M4: embedding words of each text segment into a text matrix as input of a convolutional neural network, and performing text feature extraction on a sentence matrix by using the convolutional neural network;
module M5: and inputting the features extracted by the convolutional neural network into a fully-connected neural network layer for classification.
Preferably, the unique identification of all domain papers is obtained from a database: paper ID data;
and acquiring the text data and the category label data of the paper from the database according to the acquired paper ID data, and storing the text data and the category label data in the memory.
Preferably, the text data stored in the memory is subjected to data cleaning, and characters and spaces which do not meet preset requirements in the text are removed.
Preferably, the cleaned text data and the category label data are in one-to-one correspondence;
all data are divided into a training data set and a testing data set according to a ratio of seven to three, and finally stored in a text file.
Preferably, a network is built by using the PyTorch deep learning framework, a BERT pre-training model is loaded, word embedding is performed on the text data of the training set and the test set, and each word in the text is represented by a different high-dimensional vector.
Compared with the prior art, the invention has the following beneficial effects:
(1) The paper text classification technique of the invention, based on the BERT pre-training model and the convolutional network, can embed paper texts from every field and represent the text in a form the computer can compute with;
(2) The invention uses a convolutional neural network structure to extract text features, so that a computer can obtain the semantic features of each text segment, which facilitates subsequent processing and application of the text semantics;
(3) The invention classifies the features extracted by the convolutional neural network with a fully-connected neural network, so that users can quickly and accurately classify paper documents according to subject field.
Drawings
Other features, objects and advantages of the invention will become more apparent upon reading of the detailed description of non-limiting embodiments with reference to the following drawings:
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a schematic diagram of embedding of an input text by the BERT pre-training model;
FIG. 3 is a schematic diagram of the convolution of a text embedding matrix;
FIG. 4 is a schematic diagram of classification by the fully-connected neural network.
Detailed Description
The present invention will be described in detail with reference to specific examples. The following examples will assist those skilled in the art in further understanding the invention, but do not limit the invention in any way. It should be noted that various changes and modifications that would be obvious to those skilled in the art can be made without departing from the spirit of the invention, all of which fall within the scope of the present invention.
Example 1
The paper text classification method based on the BERT pre-training model and the convolutional network provided by the invention comprises the following steps:
step 1: acquiring the text information (including titles and abstracts) of papers in various fields and the category label information of the fields to which the papers belong from a database;
step 2: removing noise in the text, namely some characters without practical significance or redundant space characters and the like in the text, dividing the paper text data and the category label data into a training set and a test set, and storing the training set and the test set in the text file;
step 3: word embedding is performed on text data in the training set and the test set using a BERT (Bidirectional Encoder Representations from Transformers) pre-training model, i.e. each word in the text is represented by a different vector;
step 4: the word embeddings of each text segment form a text matrix used as the input of a CNN (convolutional neural network), and text feature extraction is performed on the sentence matrix by using the convolutional neural network;
step 5: the features extracted by the convolutional neural network are input into a basic fully-connected neural network layer for classification.
Specifically, the step 1 includes:
step 1.1: acquiring unique identifications of all domain papers, namely ID information of all the papers, from a database;
step 1.2: and acquiring the text data and the category label data of the papers from the database according to the acquired paper ID data, and storing the text data and the category label data in a memory.
Specifically, the step 2 includes:
step 2.1: carrying out data cleaning on the text data stored in the memory, and removing some characters without practical significance and redundant spaces in the text;
step 2.2: the cleaned text data and the corresponding category labels are in one-to-one correspondence;
step 2.3: and all data are divided into a training data set and a testing data set according to a ratio of seven to three, and finally stored in a text file.
Specifically, the step 3 includes:
step 3.1: text data in the training set and the test set is word-embedded using a BERT pre-training model, i.e., a high-dimensional vector is used to represent each word in the text.
Specifically, the step 4 includes:
step 4.1: embedding words of each piece of paper text data into a text matrix to be used as computer data representation of the text data;
step 4.2: the text matrix is used as input of CNN (convolutional neural network), and the convolutional neural network is used for extracting the characteristics of the text from the sentence matrix.
Specifically, the step 5 includes:
step 5.1: and inputting the features extracted by the convolutional neural network into a basic fully-connected neural network layer for classification.
The system for classifying the thesis text based on the BERT pre-training model and the convolutional network comprises the following modules:
module M1: acquiring the text information (including titles and abstracts) of papers in various fields and the category label information of the fields to which the papers belong from a database;
module M2: removing noise in the text, namely some characters without practical significance or redundant space characters and the like in the text, dividing the paper text data and the category label data into a training set and a test set, and storing the training set and the test set in the text file;
module M3: word embedding is performed on text data in the training set and the test set using a BERT (Bidirectional Encoder Representations from Transformers) pre-training model, i.e. each word in the text is represented by a different vector;
module M4: embedding words of each text segment into a text matrix as input of a CNN (convolutional neural network), and performing text feature extraction on a sentence matrix by using the convolutional neural network;
module M5: and inputting the features extracted by the convolutional neural network into a basic fully-connected neural network layer for classification.
Specifically, the module M1 includes:
step M1.1: acquiring unique identifications of all domain papers, namely ID information of all the papers, from a database;
step M1.2: and acquiring the text data and the category label data of the papers from the database according to the acquired paper ID data, and storing the text data and the category label data in a memory.
Specifically, the module M2 includes:
step M2.1: carrying out data cleaning on the text data stored in the memory, and removing some characters without practical significance and redundant spaces in the text;
step M2.2: the cleaned text data and the corresponding category labels are in one-to-one correspondence;
step M2.3: and all data are divided into a training data set and a testing data set according to a ratio of seven to three, and finally stored in a text file.
Specifically, the module M3 includes:
step M3.1: text data in the training set and the test set is word-embedded using a BERT pre-training model, i.e., a high-dimensional vector is used to represent each word in the text.
Specifically, the module M4 includes:
step M4.1: embedding words of each piece of paper text data into a text matrix to be used as computer data representation of the text data;
step M4.2: the text matrix is used as input of CNN (convolutional neural network), and the convolutional neural network is used for extracting the characteristics of the text from the sentence matrix.
Specifically, the module M5 includes:
step M5.1: and inputting the features extracted by the convolutional neural network into a basic fully-connected neural network layer for classification.
Example 2
Example 2 is a modification of example 1.
The paper text classification technique based on the BERT pre-training model and the convolutional network provided by this embodiment operates on paper titles, abstracts, text data and the corresponding category label data obtained from a database. Word embedding is performed on the paper text data with the BERT pre-training model so that word semantics are represented by vectors, the word vectors form a semantic matrix of the whole text, a convolutional network then extracts features from the text semantic matrix, and finally the extracted semantic features are used to classify the text. Specifically, as shown in FIG. 1, the method comprises the following steps:
step S1: acquiring text information of papers in each field and category label information of the fields to which the papers belong from a database;
step S2: removing noise data in the text, dividing the paper text data and the category label data into a training set and a testing set, and storing the training set and the testing set in a text file;
step S3: word embedding is carried out on the text data in the training set and the testing set by using a BERT pre-training model, namely, each word in the text is represented by using different vectors;
step S4: embedding words of each text segment into a text matrix as input of a convolutional neural network, and performing text feature extraction on a sentence matrix by using the convolutional neural network;
step S5: and inputting the features extracted by the convolutional neural network into a basic fully-connected neural network layer for classification.
Step S1 includes: the paper data of all fields, including the title and abstract of each paper, is obtained from the Ncovpaper database. There are two databases in total for obtaining paper text data and category label data: the first is a relational database based on MySQL, and the second is a non-relational database based on ElasticSearch. Both databases store information including the title, abstract and category of each paper. Specifically, the method comprises the following steps:
step S101: the title and summary data of all 240987 papers and the domain category data to which these papers belong are obtained from the Ncovpaper database. After the relevant information of the thesis is acquired, the thesis information is stored in the memory in the following format for later use:
[{“label”:label0,“title”:title0,“abstract”:abstract0},
{“label”:label1,“title”:title1,“abstract”:abstract1},…]
where "label" represents the category of the paper, "title" represents the title information of the paper, and "abstract" represents the abstract information of the paper.
Step S2 includes: removing noise data in the text, then dividing the paper text data and the category label data into a training set and a testing set, and storing the training set and the testing set in the text file, specifically:
step S201: according to the data obtained from the Ncovpaper database, the following steps are carried out on the data:
The title, abstract and category information of each paper are acquired in turn, and the title and abstract are concatenated into a single text segment. A text cleaning function is then used to remove noise from the concatenated string by regular expression matching and replacement. Taking the removal of redundant spaces in the text as an example, with "text" denoting the concatenated string, the example code is as follows:
text = re.sub(r"[ ]+", " ", text)
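A fuller cleaning function can be sketched as follows; it is only illustrative, and the exact set of characters treated as noise is an assumption rather than a requirement of this embodiment:

import re

def clean_text(text):
    # Assumed noise set: control characters and other non-printable symbols are replaced by spaces
    text = re.sub(r"[\x00-\x1f\x7f]", " ", text)
    # Collapse runs of whitespace into a single space and trim both ends
    text = re.sub(r"\s+", " ", text)
    return text.strip()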
step S202: dividing the cleaned text and the category label data into a training set and a test set according to a seven-to-three ratio and storing the training set and the test set in a text file respectively, in order to ensure that the distribution of each category in the training set and the data set is approximately the same, dividing the data of each category according to the seven-to-three ratio, and then merging the data into the training set and the data set respectively, specifically according to the following steps:
classifying all data according to the labels according to the categories, then classifying the data in a seven-to-three ratio in a plurality of categories which are classified, if arrays 'train _ set' and 'test _ set' are used for storing a training set and a test set, the array 'temp' is used for temporarily storing the data of a certain label, and the number of text categories is assumed to be NlabelThen the Python scripts that divide each category are as follows:
for i in range(N_label):
    temp = []
    for elem in raw_data:
        if elem['label'] == i:
            temp.append(elem)
    length = len(temp)
    train_set.extend(temp[:int(length * 0.7)])
    test_set.extend(temp[int(length * 0.7):])
This completes the per-category division of all the data into the training set and the test set.
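As an alternative sketch, the same stratified seven-to-three split could be produced with scikit-learn's train_test_split; this library is not part of the described embodiment and is shown only for comparison:

from sklearn.model_selection import train_test_split

train_set, test_set = train_test_split(
    raw_data,
    test_size=0.3,                                  # seven-to-three ratio
    stratify=[elem["label"] for elem in raw_data],  # keep per-category proportions identical
    random_state=42,
)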
Step S203: the divided training set data and test set data are held in memory and then written to text files in JSON format. A storage example is as follows:
[
{“label”:label0,“title”:title0,“abstract”:abstract0},
{“label”:label1,“title”:title1,“abstract”:abstract1},
…
]
where "label" represents the category of the paper, "title" represents the title information of the paper, and "abstract" represents the abstract information of the paper.
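A minimal sketch of writing the in-memory training set to such a JSON text file (the file name is chosen only for illustration):

import json

with open("train_set.json", "w", encoding="utf-8") as f:
    # ensure_ascii=False keeps non-English titles and abstracts human-readable in the file
    json.dump(train_set, f, ensure_ascii=False, indent=2)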
Step S3 includes: building a network by using the PyTorch deep learning framework and loading a BERT pre-training model, performing word embedding on the text data of the training set and the test set, and representing each word in the text by a different high-dimensional vector. Specifically, the method comprises the following steps:
step S301: building a network by using a Pythrch deep learning framework and loading a BERT pre-training model, wherein the papers comprise multiple languages such as English, Chinese, Spanish and Japanese, and the loaded model is a BERT multi-language pre-training model, and the basic configuration of the model is as follows:
{
"architectures":[
"BertForMaskedLM"
],
"attention_probs_dropout_prob":0.1,
"directionality":"bidi",
"hidden_act":"gelu",
"hidden_dropout_prob":0.1,
"hidden_size":768,
"initializer_range":0.02,
"intermediate_size":3072,
"layer_norm_eps":1e-12,
"max_position_embeddings":512,
"model_type":"bert",
"num_attention_heads":12,
"num_hidden_layers":12,
"pad_token_id":0,
"pooler_fc_size":768,
"pooler_num_attention_heads":12,
"pooler_num_fc_layers":3,
"pooler_size_per_head":128,
"pooler_type":"first_token_transform",
"type_vocab_size":2,
"vocab_size":105879
}
Here, the "architectures" field gives the structure name of the pre-training model, and "attention_probs_dropout_prob" is the probability with which the model randomly zeroes neural network units when extracting features from a sentence. "hidden_size" is the dimension of the vector each word is embedded into, "max_position_embeddings" is the maximum sentence length accepted by the model, "pad_token_id" is the filling value used when a sentence is shorter than the set length, and "vocab_size" is the size of the vocabulary.
Step S302: after the BERT pre-training model is loaded successfully, word embedding is performed on the text data of the training set and the test set, each word in the text is represented by a different high-dimensional vector, and all the resulting vectors are concatenated into a numerical matrix that serves as the input of the next layer. The specific word embedding and concatenation process is shown in FIG. 2.
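A minimal sketch of this embedding step with the Hugging Face transformers library is given below; the checkpoint name bert-base-multilingual-cased and the 512-token maximum length are assumptions consistent with the multilingual configuration above, not a prescription of this embodiment:

import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")  # assumed multilingual checkpoint
bert = BertModel.from_pretrained("bert-base-multilingual-cased")

text = "Paper title. Paper abstract."  # illustrative concatenated title and abstract
inputs = tokenizer(text, return_tensors="pt", truncation=True,
                   padding="max_length", max_length=512)
with torch.no_grad():
    outputs = bert(**inputs)
sentence_matrix = outputs.last_hidden_state  # shape (1, 512, 768): one embedding vector per token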
Step S4 includes: the word embeddings of each text segment obtained with the BERT pre-training model in step S3 form a text matrix, which is then used as the input of a convolutional neural network, and the convolutional neural network performs text feature extraction on the sentence matrix, specifically:
step S401: the convolutional neural network layer is designed because the input of the convolutional network is a sentenceMatrix, so its channel is 1, if Nch annelsRepresents the number of convolution kernels, (N)kw,Nkh) Represents the convolution kernel size, (N)pw,Nph) Representing the pooled size, the convolutional network structure is as follows:
self.cnn_layer = nn.Sequential(
    nn.Conv2d(in_channels=1, out_channels=N_channels, kernel_size=(N_kw, N_kh)),
    nn.PReLU(),
    nn.MaxPool2d(kernel_size=(N_pw, N_ph))
)
where in_channels is the number of input channels, out_channels is the number of output channels (which also equals the number of convolution kernels), and kernel_size is the size of the convolution kernel.
Step S402: all the channel results after convolution and pooling are concatenated into a single vector that serves as the input of the next linear classification layer; the specific flow is shown in FIG. 3.
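A minimal sketch of this forward pass, assuming it sits in the forward method of the module that defines self.cnn_layer above and that batch size, token count and embedding dimension take the illustrative values used earlier:

# sentence_matrix: (batch, tokens, hidden), e.g. (8, 512, 768) output by the BERT layer
x = sentence_matrix.unsqueeze(1)      # (8, 1, 512, 768): add the single input channel expected by Conv2d
feature_maps = self.cnn_layer(x)      # (8, N_channels, H', W') after convolution, PReLU and max-pooling
features = feature_maps.flatten(1)    # concatenate all channel results into one vector per text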
Step S5 includes: inputting the features extracted by the convolutional neural network into a basic fully-connected neural network layer for classification, specifically:
step S501: the output vector dimension of the convolution layer above is higher, so three layers are totally arranged in the linear classification layer, and the high-dimensional vector is classified. The specific network structure flow is shown in fig. 4.
Firstly, the paper text classification technique based on the BERT pre-training model and the convolutional network represents paper texts of various fields as numerical matrices. Secondly, the convolutional network extracts text features from the numerical matrix of each text, so that a computer can capture its semantic features. Finally, the invention uses the fully-connected neural network to classify the text based on the features extracted by the convolutional network, achieving higher classification accuracy and helping users conveniently and quickly perform preliminary classification of large numbers of paper documents.
Those skilled in the art will appreciate that, in addition to implementing the systems, apparatus, and various modules thereof provided by the present invention in purely computer readable program code, the same procedures can be implemented entirely by logically programming method steps such that the systems, apparatus, and various modules thereof are provided in the form of logic gates, switches, application specific integrated circuits, programmable logic controllers, embedded microcontrollers and the like. Therefore, the system, the device and the modules thereof provided by the present invention can be considered as a hardware component, and the modules included in the system, the device and the modules thereof for implementing various programs can also be considered as structures in the hardware component; modules for performing various functions may also be considered to be both software programs for performing the methods and structures within hardware components.
The foregoing description of specific embodiments of the present invention has been presented. It is to be understood that the present invention is not limited to the specific embodiments described above, and that various changes or modifications may be made by one skilled in the art within the scope of the appended claims without departing from the spirit of the invention. The embodiments and features of the embodiments of the present application may be combined with each other arbitrarily without conflict.
Claims (10)
1. A text classification method based on a BERT pre-training model and a convolution network is characterized by comprising the following steps:
step 1: acquiring thesis text data of each field and category label data of the field to which the thesis belongs in a database;
step 2: removing noise in the text, dividing paper text data and category label data into a training set and a test set, and storing the training set and the test set in a text file;
step 3: word embedding is carried out on text data in a training set and a testing set by using a BERT pre-training model, and each word in the text is represented by using different vectors;
step 4: embedding words of each text segment into a text matrix as input of a convolutional neural network, and performing text feature extraction on the sentence matrix by using the convolutional neural network;
step 5: inputting the features extracted by the convolutional neural network into a fully-connected neural network layer for classification.
2. The text classification method based on the BERT pre-training model and the convolutional network as claimed in claim 1, wherein the unique identifiers of all domain papers are obtained from the database: paper ID data;
and acquiring the text data and the category label data of the paper from the database according to the acquired paper ID data, and storing the text data and the category label data in the memory.
3. The text classification method based on the BERT pre-training model and the convolutional network as claimed in claim 1, wherein the text data stored in the memory is subjected to data cleaning to remove characters and spaces in the text which do not meet the preset requirements.
4. The text classification method based on the BERT pre-training model and the convolutional network as claimed in claim 3, wherein the text data after cleaning and the class label data are in one-to-one correspondence;
all data are divided into a training data set and a testing data set according to a ratio of seven to three, and finally stored in a text file.
5. The text classification method based on the BERT pre-training model and the convolutional network as claimed in claim 1, wherein a network is built by using a PyTorch deep learning framework and the BERT pre-training model is loaded, text data of the training set and the test set are word-embedded, and each word in the text is represented by a different high-dimensional vector.
6. A text classification system based on a BERT pre-trained model and a convolutional network, comprising:
module M1: acquiring thesis text data of each field and category label data of the field to which the thesis belongs in a database;
module M2: removing noise in the text, dividing paper text data and category label data into a training set and a test set, and storing the training set and the test set in a text file;
module M3: word embedding is carried out on text data in a training set and a testing set by using a BERT pre-training model, and each word in the text is represented by using different vectors;
module M4: embedding words of each text segment into a text matrix as input of a convolutional neural network, and performing text feature extraction on a sentence matrix by using the convolutional neural network;
module M5: and inputting the features extracted by the convolutional neural network into a fully-connected neural network layer for classification.
7. The text classification system based on the BERT pre-training model and the convolutional network as claimed in claim 6, wherein the unique identifiers of all domain papers are obtained from the database: paper ID data;
and acquiring the text data and the category label data of the paper from the database according to the acquired paper ID data, and storing the text data and the category label data in the memory.
8. The text classification system based on the BERT pre-training model and the convolutional network as claimed in claim 6, wherein the text data stored in the memory is subjected to data cleaning to remove characters and spaces in the text which do not meet the preset requirements.
9. The text classification system based on the BERT pre-trained model and the convolutional network as claimed in claim 8, wherein the cleaned text data and the class label data are in one-to-one correspondence;
all data are divided into a training data set and a testing data set according to a ratio of seven to three, and finally stored in a text file.
10. The text classification system based on the BERT pre-training model and the convolutional network as claimed in claim 6, wherein a network is built by using a PyTorch deep learning framework and the BERT pre-training model is loaded, text data of the training set and the test set are word-embedded, and each word in the text is represented by a different high-dimensional vector.
Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
---|---|---|---
CN202110621401.9A | 2021-06-03 | 2021-06-03 | Text classification method and system based on BERT pre-training model and convolutional network

Publications (1)

Publication Number | Publication Date
---|---
CN113468324A (en) | 2021-10-01

Family

ID=77872216

Country Status (1)

Country | Link
---|---
CN (1) | CN113468324A (en)
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110334210A (en) * | 2019-05-30 | 2019-10-15 | 哈尔滨理工大学 | A kind of Chinese sentiment analysis method merged based on BERT with LSTM, CNN |
CN111177376A (en) * | 2019-12-17 | 2020-05-19 | 东华大学 | Chinese text classification method based on BERT and CNN hierarchical connection |
CN111651605A (en) * | 2020-06-04 | 2020-09-11 | 电子科技大学 | Lung cancer leading edge trend prediction method based on multi-label classification |
CN112100387A (en) * | 2020-11-13 | 2020-12-18 | 支付宝(杭州)信息技术有限公司 | Training method and device of neural network system for text classification |
CN112100388A (en) * | 2020-11-18 | 2020-12-18 | 南京华苏科技有限公司 | Method for analyzing emotional polarity of long text news public sentiment |
CN112765358A (en) * | 2021-02-23 | 2021-05-07 | 西安交通大学 | Taxpayer industry classification method based on noise label learning |
CN112861524A (en) * | 2021-04-07 | 2021-05-28 | 中南大学 | Deep learning-based multilevel Chinese fine-grained emotion analysis method |
Non-Patent Citations (2)

Title |
---|
Zhang Nan: "Deep Learning Natural Language Processing in Practice" (《深度学习自然语言处理实战》), 31 August 2020 *
Gao Zhiqiang, China Railway Publishing House Co., Ltd. *
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114154493A (en) * | 2022-01-28 | 2022-03-08 | 北京芯盾时代科技有限公司 | Short message category identification method and device |
CN115658886A (en) * | 2022-09-20 | 2023-01-31 | 广东技术师范大学 | Intelligent liver cancer staging method, system and medium based on semantic text |
CN117455421A (en) * | 2023-12-25 | 2024-01-26 | 杭州青塔科技有限公司 | Subject classification method and device for scientific research projects, computer equipment and storage medium |
CN117455421B (en) * | 2023-12-25 | 2024-04-16 | 杭州青塔科技有限公司 | Subject classification method and device for scientific research projects, computer equipment and storage medium |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication |
| SE01 | Entry into force of request for substantive examination |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20211001