CN113468324A - Text classification method and system based on BERT pre-training model and convolutional network - Google Patents

Text classification method and system based on BERT pre-training model and convolutional network

Info

Publication number
CN113468324A
Authority
CN
China
Prior art keywords
text
data
training
neural network
bert pre
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110621401.9A
Other languages
Chinese (zh)
Inventor
唐果
曹安蕲
傅洛伊
王新兵
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Jiaotong University
Original Assignee
Shanghai Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Jiaotong University filed Critical Shanghai Jiaotong University
Priority to CN202110621401.9A
Publication of CN113468324A
Legal status: Pending


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a text classification method and system based on a BERT pre-training model and a convolutional network, comprising the following steps: step 1: acquiring paper text data of each field and the category label data of the field to which each paper belongs from a database; step 2: removing noise from the text, dividing the paper text data and category label data into a training set and a test set, and storing them in text files; step 3: performing word embedding on the text data in the training set and the test set with the BERT pre-training model; step 4: forming a text matrix from the word embeddings of each text segment as the input of a convolutional neural network, and extracting text features from the sentence matrix with the convolutional neural network; step 5: inputting the features extracted by the convolutional neural network into a fully-connected neural network layer for classification. The invention classifies with a fully-connected neural network, enabling users to quickly and accurately classify paper documents by subject field.

Description

Text classification method and system based on BERT pre-training model and convolutional network
Technical Field
The invention relates to the technical fields of deep learning and natural language processing, and in particular to a text classification method and system based on a BERT pre-training model and a convolutional network.
Background
Patent document CN112487189A (application number: CN202011445448.6) discloses a graph convolutional network-enhanced method for classifying implicit discourse relations in electronic texts. First, a BERT pre-training model is introduced to provide more efficient dynamic word-vector representations, improving the overall discourse-level representation; second, a graph neural network is introduced to model word-level relationships between sentences, allowing the implicit relation type between sentence pairs to be predicted more accurately.
At present, there are many technologies for classifying text based on deep learning, for example using only a traditional convolutional neural network or a recurrent neural network. However, these methods struggle to find numerical values that effectively represent text data for computation, and their classification accuracy is low. Since the release of the BERT pre-training model, applying it to natural language processing tasks has greatly improved their accuracy. The BERT pre-training model essentially learns the semantic features of the whole text by letting each word embedding acquire information about the entire input text through a multi-head attention mechanism. In the BERT paper, the authors recommend using the embedding of the sentence-initial symbol for classification, but this method does not effectively exploit the overall information of the text and reduces text classification accuracy. The invention therefore adds a convolution layer on top of the BERT pre-training model to extract feature information from the whole text, and finally classifies the extracted features with a fully-connected neural network layer, improving classification precision.
Disclosure of Invention
Aiming at the defects in the prior art, the invention aims to provide a text classification method and system based on a BERT pre-training model and a convolution network.
The text classification method based on the BERT pre-training model and the convolution network provided by the invention comprises the following steps:
step 1: acquiring thesis text data of each field and category label data of the field to which the thesis belongs in a database;
step 2: removing noise in the text, dividing paper text data and category label data into a training set and a test set, and storing the training set and the test set in a text file;
step 3: word embedding is carried out on the text data in the training set and the test set by using a BERT pre-training model, and each word in the text is represented by a different vector;
step 4: the word embeddings of each text segment are formed into a text matrix as the input of a convolutional neural network, and text feature extraction is performed on the sentence matrix by using the convolutional neural network;
step 5: the features extracted by the convolutional neural network are input into a fully-connected neural network layer for classification.
Preferably, the unique identification of all domain papers is obtained from a database: paper ID data;
and acquiring the text data and the category label data of the paper from the database according to the acquired paper ID data, and storing the text data and the category label data in the memory.
Preferably, the text data stored in the memory is subjected to data cleaning, and characters and spaces which do not meet preset requirements in the text are removed.
Preferably, the cleaned text data and the category label data are in one-to-one correspondence;
all data are divided into a training data set and a testing data set according to a ratio of seven to three, and finally stored in a text file.
Preferably, a network is built by using the PyTorch deep learning framework, a BERT pre-training model is loaded, word embedding is carried out on the text data of the training set and the test set, and each word in the text is represented by a different high-dimensional vector.
The text classification system based on the BERT pre-training model and the convolution network provided by the invention comprises:
module M1: acquiring thesis text data of each field and category label data of the field to which the thesis belongs in a database;
module M2: removing noise in the text, dividing paper text data and category label data into a training set and a test set, and storing the training set and the test set in a text file;
module M3: word embedding is carried out on text data in a training set and a testing set by using a BERT pre-training model, and each word in the text is represented by using different vectors;
module M4: embedding words of each text segment into a text matrix as input of a convolutional neural network, and performing text feature extraction on a sentence matrix by using the convolutional neural network;
module M5: and inputting the features extracted by the convolutional neural network into a fully-connected neural network layer for classification.
Preferably, the unique identification of all domain papers is obtained from a database: paper ID data;
and acquiring the text data and the category label data of the paper from the database according to the acquired paper ID data, and storing the text data and the category label data in the memory.
Preferably, the text data stored in the memory is subjected to data cleaning, and characters and spaces which do not meet preset requirements in the text are removed.
Preferably, the cleaned text data and the category label data are in one-to-one correspondence;
all data are divided into a training data set and a testing data set according to a ratio of seven to three, and finally stored in a text file.
Preferably, a network is built by using the PyTorch deep learning framework, a BERT pre-training model is loaded, word embedding is carried out on the text data of the training set and the test set, and each word in the text is represented by a different high-dimensional vector.
Compared with the prior art, the invention has the following beneficial effects:
(1) the paper text classification technology of the invention, based on the BERT pre-training model and the convolutional network, can embed the paper text of each field and represent the text in a form that a computer can compute;
(2) the invention uses a convolutional neural network structure to extract text features, so that a computer can obtain the semantic features of each text segment, which facilitates subsequent processing and application of the text semantics;
(3) the invention classifies the features extracted by the convolutional neural network with a fully-connected neural network, so that users can quickly and accurately classify paper documents by subject field.
Drawings
Other features, objects and advantages of the invention will become more apparent upon reading of the detailed description of non-limiting embodiments with reference to the following drawings:
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a schematic diagram of embedding of an input text by the BERT pre-training model;
FIG. 3 is a schematic diagram of the convolution of a text embedding matrix;
fig. 4 is a schematic diagram of classification of a fully-connected neural network.
Detailed Description
The present invention will be described in detail with reference to specific examples. The following examples will assist those skilled in the art in further understanding the invention, but are not intended to limit the invention in any way. It should be noted that various changes and modifications, obvious to those skilled in the art, can be made without departing from the spirit of the invention, all of which fall within the scope of the present invention.
Example 1
The paper text classification method based on the BERT pre-training model and the convolutional network provided by the invention comprises the following steps:
step 1: acquiring the text information (including titles and abstracts) of papers in various fields and the category label information of the fields to which the papers belong from a database;
step 2: removing noise in the text, namely characters without practical meaning and redundant space characters, dividing the paper text data and the category label data into a training set and a test set, and storing them in text files;
step 3: word-embedding the text data in the training set and the test set using a BERT (Bidirectional Encoder Representations from Transformers) pre-training model, i.e. representing each word in the text with a different vector;
step 4: embedding the words of each text segment into a text matrix as the input of a CNN (convolutional neural network), and performing text feature extraction on the sentence matrix with the convolutional neural network;
step 5: inputting the features extracted by the convolutional neural network into a basic fully-connected neural network layer for classification.
Specifically, the step 1 includes:
step 1.1: acquiring unique identifications of all domain papers, namely ID information of all the papers, from a database;
step 1.2: and acquiring the text data and the category label data of the papers from the database according to the acquired paper ID data, and storing the text data and the category label data in a memory.
Specifically, the step 2 includes:
step 2.1: carrying out data cleaning on the text data stored in the memory, and removing some characters without practical significance and redundant spaces in the text;
step 2.2: the cleaned text data and the corresponding category labels are in one-to-one correspondence;
step 2.3: and all data are divided into a training data set and a testing data set according to a ratio of seven to three, and finally stored in a text file.
Specifically, the step 3 includes:
step 3.1: text data in the training set and the test set is word-embedded using a BERT pre-training model, i.e., a high-dimensional vector is used to represent each word in the text.
Specifically, the step 4 includes:
step 4.1: embedding words of each piece of paper text data into a text matrix to be used as computer data representation of the text data;
step 4.2: the text matrix is used as input of CNN (convolutional neural network), and the convolutional neural network is used for extracting the characteristics of the text from the sentence matrix.
Specifically, the step 5 includes:
step 5.1: and inputting the features extracted by the convolutional neural network into a basic fully-connected neural network layer for classification.
The system for classifying the thesis text based on the BERT pre-training model and the convolutional network comprises the following modules:
module M1: acquiring the text information (including titles and abstracts) of papers in various fields and the category label information of the fields to which the papers belong from a database;
module M2: removing noise in the text, namely characters without practical meaning and redundant space characters, dividing the paper text data and the category label data into a training set and a test set, and storing them in text files;
module M3: word-embedding the text data in the training set and the test set using a BERT (Bidirectional Encoder Representations from Transformers) pre-training model, i.e. representing each word in the text with a different vector;
module M4: embedding words of each text segment into a text matrix as input of a CNN (convolutional neural network), and performing text feature extraction on a sentence matrix by using the convolutional neural network;
module M5: and inputting the features extracted by the convolutional neural network into a basic fully-connected neural network layer for classification.
Specifically, the module M1 includes:
step M1.1: acquiring unique identifications of all domain papers, namely ID information of all the papers, from a database;
step M1.2: and acquiring the text data and the category label data of the papers from the database according to the acquired paper ID data, and storing the text data and the category label data in a memory.
Specifically, the module M2 includes:
step M2.1: carrying out data cleaning on the text data stored in the memory, and removing some characters without practical significance and redundant spaces in the text;
step M2.2: the cleaned text data and the corresponding category labels are in one-to-one correspondence;
step M2.3: and all data are divided into a training data set and a testing data set according to a ratio of seven to three, and finally stored in a text file.
Specifically, the module M3 includes:
step M3.1: text data in the training set and the test set is word-embedded using a BERT pre-training model, i.e., a high-dimensional vector is used to represent each word in the text.
Specifically, the module M4 includes:
step M4.1: embedding words of each piece of paper text data into a text matrix to be used as computer data representation of the text data;
step M4.2: the text matrix is used as input of CNN (convolutional neural network), and the convolutional neural network is used for extracting the characteristics of the text from the sentence matrix.
Specifically, the module M5 includes:
step M5.1: and inputting the features extracted by the convolutional neural network into a basic fully-connected neural network layer for classification.
Example 2
Example 2 is a modification of example 1.
The paper text classification technology based on the BERT pre-training model and the convolutional network provided by this embodiment obtains paper titles, abstracts, text data, and category label data from a database; performs word embedding on the paper text data with a BERT pre-training model so that word semantics are represented by vectors and the word vectors form a semantic matrix of the whole text; then extracts features from the text semantic matrix with a convolutional network; and finally classifies the text using the extracted semantic features. Specifically, as shown in fig. 1, the method comprises the following steps:
step S1: acquiring text information of papers in each field and category label information of the fields to which the papers belong from a database;
step S2: removing noise data in the text, dividing the paper text data and the category label data into a training set and a testing set, and storing the training set and the testing set in a text file;
step S3: word embedding is carried out on the text data in the training set and the testing set by using a BERT pre-training model, namely, each word in the text is represented by using different vectors;
step S4: embedding words of each text segment into a text matrix as input of a convolutional neural network, and performing text feature extraction on a sentence matrix by using the convolutional neural network;
step S5: and inputting the features extracted by the convolutional neural network into a basic fully-connected neural network layer for classification.
Step S1 includes: obtaining the paper data of all fields, including the title and abstract of each paper, from the Ncovpaper database. The paper text data and category label data come from two databases: the first is a relational database based on MySQL, and the second is a non-relational database based on ElasticSearch. Both databases store the title, abstract, and category information of the papers. Specifically, the method comprises the following steps:
step S101: the title and abstract data of all 240987 papers and the domain category data to which these papers belong are obtained from the Ncovpaper database. After the relevant paper information is acquired, it is stored in memory in the following format for later use:
[{“label”:label0,“title”:title0,“abstract”:abstract0},
{“label”:label1,“title”:title1,“abstract”:abstract1},…]
wherein, “label” represents the category of the paper, “title” represents the title information of the paper, and “abstract” represents the abstract information of the paper.
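As an illustrative sketch of this retrieval step, the following query against the MySQL-based source builds records in exactly this format; the table and column names and the use of the pymysql client are assumptions for demonstration, since the patent only names MySQL and ElasticSearch as the underlying databases:
import pymysql

# Hypothetical connection parameters and schema -- the patent does not specify them.
conn = pymysql.connect(host="localhost", user="reader",
                       password="***", database="ncovpaper")
raw_data = []
with conn.cursor() as cur:
    cur.execute("SELECT label, title, abstract FROM papers")
    for label, title, abstract in cur.fetchall():
        # one record per paper, matching the in-memory format shown above
        raw_data.append({"label": label, "title": title, "abstract": abstract})
conn.close()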
Step S2 includes: removing noise data from the text, then dividing the paper text data and the category label data into a training set and a test set, and storing them in text files, specifically:
step S201: the following processing is performed on the data obtained from the Ncovpaper database:
sequentially acquire the title, abstract, and category information of each paper, and splice the title and abstract into one text segment; then use a text-cleaning function to remove text noise from the spliced string. Taking the removal of redundant spaces in the text as an example, regular-expression matching and replacement can be used; with "text" denoting the spliced text, example code is as follows:
import re
text = re.sub(r"[ ]+", " ", text)  # collapse runs of spaces into a single space
step S202: divide the cleaned text and the category label data into a training set and a test set at a seven-to-three ratio and store them in separate text files. To ensure that the category distribution is approximately the same in the training set and the test set, the data of each category are split at the seven-to-three ratio and then merged into the training set and the test set respectively, specifically according to the following steps:
group all the data by label, then split each of the resulting categories at the seven-to-three ratio. Let the arrays 'train_set' and 'test_set' store the training set and the test set, let the array 'temp' temporarily hold the data of a given label, and let the number of text categories be N_label; the Python script that divides each category is then as follows:
for i in range(N_label):
    temp = []
    for elem in raw_data:
        if elem['label'] == i:
            temp.append(elem)
    length = len(temp)
    train_set.extend(temp[:int(length * 0.7)])  # first 70% of the category -> training set
    test_set.extend(temp[int(length * 0.7):])   # remaining 30% -> test set
This completes the per-category division of all the data.
Step S203: The divided training-set and test-set data are held in memory and then written to text files in the JSON format. A storage example is as follows:
[
{“label”:label0,“title”:title0,“abstract”:abstract0},
{“label”:label1,“title”:title1,“abstract”:abstract1},
]
wherein, “label” represents the category of the paper, “title” represents the title information of the paper, and “abstract” represents the abstract information of the paper.
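A minimal sketch of writing and reading these JSON text files follows; the file name "train_set.json" is an illustrative assumption:
import json

# write the divided data to a text file in JSON format
with open("train_set.json", "w", encoding="utf-8") as f:
    json.dump(train_set, f, ensure_ascii=False)

# later, load the training data back for word embedding
with open("train_set.json", "r", encoding="utf-8") as f:
    train_set = json.load(f)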
Step S3 includes: building a network with the PyTorch deep learning framework and loading a BERT pre-training model, performing word embedding on the text data of the training set and the test set, and representing each word in the text with a different high-dimensional vector. Specifically, the method comprises the following steps:
step S301: build a network using the PyTorch deep learning framework and load a BERT pre-training model. Because the papers include multiple languages such as English, Chinese, Spanish, and Japanese, the loaded model is the BERT multilingual pre-training model, whose basic configuration is as follows:
{
"architectures":[
"BertForMaskedLM"
],
"attention_probs_dropout_prob":0.1,
"directionality":"bidi",
"hidden_act":"gelu",
"hidden_dropout_prob":0.1,
"hidden_size":768,
"initializer_range":0.02,
"intermediate_size":3072,
"layer_norm_eps":1e-12,
"max_position_embeddings":512,
"model_type":"bert",
"num_attention_heads":12,
"num_hidden_layers":12,
"pad_token_id":0,
"pooler_fc_size":768,
"pooler_num_attention_heads":12,
"pooler_num_fc_layers":3,
"pooler_size_per_head":128,
"pooler_type":"first_token_transform",
"type_vocab_size":2,
"vocab_size":105879
}
wherein, the "architectures" field gives the structure name of the pre-training model, and the "attention_probs_dropout_prob" field is the probability with which the model randomly zeroes neural network units when extracting features from a sentence. "hidden_size" is the dimension of the vector embedded for each word, the "max_position_embeddings" field is the maximum sentence length the model accepts, the "pad_token_id" field is the value used to pad a sentence shorter than the set length, and the "vocab_size" field is the size of the vocabulary.
Step S302: after the BERT pre-training model is loaded successfully, word embedding is performed on the text data of the training set and the test set, a different high-dimensional vector is used to represent each word in the text, and all the obtained vectors are spliced into a numeric matrix used as the input of the next layer. The specific word-embedding and splicing process is shown in fig. 2.
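As a minimal sketch of this embedding step, the following code loads a multilingual BERT model and produces the text matrix; the use of the HuggingFace transformers library and the model name "bert-base-multilingual-cased" are assumptions for illustration, since the patent specifies only PyTorch and a BERT multilingual pre-training model:
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")
model = BertModel.from_pretrained("bert-base-multilingual-cased")
model.eval()

text = "A sample paper title spliced with its abstract."
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
    outputs = model(**inputs)

# last_hidden_state has shape (batch, seq_len, hidden_size=768):
# one 768-dimensional vector per token, spliced into the text matrix
# that is fed to the convolutional layer.
text_matrix = outputs.last_hidden_state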
Step S4 includes: embedding the words of each text segment obtained with the BERT pre-training model in step S3 into a text matrix, using it as the input of a convolutional neural network, and performing text feature extraction on the sentence matrix with the convolutional neural network, specifically:
step S401: design the convolutional neural network layer. Because the input of the convolutional network is a sentence matrix, it has a single channel. If N_channels denotes the number of convolution kernels, (N_kw, N_kh) the convolution kernel size, and (N_pw, N_ph) the pooling size, the convolutional network structure is as follows:
self.cnn_layer = nn.Sequential(
    nn.Conv2d(in_channels=1, out_channels=N_channels, kernel_size=(N_kw, N_kh)),
    nn.PReLU(),
    nn.MaxPool2d(kernel_size=(N_pw, N_ph))
)
where in_channels represents the input channel count, out_channels represents the output channel count (which also equals the number of convolution kernels), and kernel_size represents the size of the convolution kernel.
Step S402: all the channel results after convolution and pooling are spliced into a single vector to be used as the input of the next linear classification layer, and the specific flow is shown in fig. 3.
Step S5 includes: inputting the features extracted by the convolutional neural network into a basic fully-connected neural network layer for classification, specifically:
step S501: because the output vector of the convolution layer above is high-dimensional, the linear classification layer consists of three layers in total, which classify the high-dimensional vector. The specific network structure flow is shown in fig. 4.
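As a minimal sketch of this three-layer linear classification layer, the hidden widths 512 and 128 below are illustrative assumptions; the patent fixes only the number of layers:
self.classifier = nn.Sequential(
    nn.Linear(flat_dim, 512),  # flat_dim = N_channels * H_out * W_out from the convolution layer
    nn.PReLU(),
    nn.Linear(512, 128),
    nn.PReLU(),
    nn.Linear(128, N_label)    # one output per paper category
)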
First, the paper text classification technology based on the BERT pre-training model and the convolutional network can represent paper texts in various fields as numerical matrices. Second, the convolutional network performs text feature extraction on the numerical matrix of the text, so that a computer can understand the semantic features of the text. Finally, the invention classifies the features extracted by the convolutional network with a fully-connected neural network, achieving high classification accuracy and helping users conveniently and quickly perform a preliminary classification of a large number of paper documents.
Those skilled in the art will appreciate that, in addition to implementing the systems, apparatus, and various modules thereof provided by the present invention in purely computer readable program code, the same procedures can be implemented entirely by logically programming method steps such that the systems, apparatus, and various modules thereof are provided in the form of logic gates, switches, application specific integrated circuits, programmable logic controllers, embedded microcontrollers and the like. Therefore, the system, the device and the modules thereof provided by the present invention can be considered as a hardware component, and the modules included in the system, the device and the modules thereof for implementing various programs can also be considered as structures in the hardware component; modules for performing various functions may also be considered to be both software programs for performing the methods and structures within hardware components.
The foregoing description of specific embodiments of the present invention has been presented. It is to be understood that the present invention is not limited to the specific embodiments described above, and that various changes or modifications may be made by one skilled in the art within the scope of the appended claims without departing from the spirit of the invention. The embodiments and features of the embodiments of the present application may be combined with each other arbitrarily without conflict.

Claims (10)

1. A text classification method based on a BERT pre-training model and a convolution network is characterized by comprising the following steps:
step 1: acquiring thesis text data of each field and category label data of the field to which the thesis belongs in a database;
step 2: removing noise in the text, dividing paper text data and category label data into a training set and a test set, and storing the training set and the test set in a text file;
step 3: word embedding is carried out on the text data in the training set and the test set by using a BERT pre-training model, and each word in the text is represented by a different vector;
step 4: the word embeddings of each text segment are formed into a text matrix as the input of a convolutional neural network, and text feature extraction is performed on the sentence matrix by using the convolutional neural network;
step 5: the features extracted by the convolutional neural network are input into a fully-connected neural network layer for classification.
2. The text classification method based on the BERT pre-training model and the convolutional network as claimed in claim 1, wherein the unique identifiers of all domain papers are obtained from the database: paper ID data;
and acquiring the text data and the category label data of the paper from the database according to the acquired paper ID data, and storing the text data and the category label data in the memory.
3. The text classification method based on the BERT pre-training model and the convolutional network as claimed in claim 1, wherein the text data stored in the memory is subjected to data cleaning to remove characters and spaces in the text which do not meet the preset requirements.
4. The text classification method based on the BERT pre-training model and the convolutional network as claimed in claim 3, wherein the text data after cleaning and the class label data are in one-to-one correspondence;
all data are divided into a training data set and a testing data set according to a ratio of seven to three, and finally stored in a text file.
5. The text classification method based on the BERT pre-training model and the convolutional network as claimed in claim 1, wherein a network is built by using the PyTorch deep learning framework and the BERT pre-training model is loaded, text data of the training set and the test set are word-embedded, and each word in the text is represented by a different high-dimensional vector.
6. A text classification system based on a BERT pre-trained model and a convolutional network, comprising:
module M1: acquiring thesis text data of each field and category label data of the field to which the thesis belongs in a database;
module M2: removing noise in the text, dividing paper text data and category label data into a training set and a test set, and storing the training set and the test set in a text file;
module M3: word embedding is carried out on text data in a training set and a testing set by using a BERT pre-training model, and each word in the text is represented by using different vectors;
module M4: embedding words of each text segment into a text matrix as input of a convolutional neural network, and performing text feature extraction on a sentence matrix by using the convolutional neural network;
module M5: and inputting the features extracted by the convolutional neural network into a fully-connected neural network layer for classification.
7. The text classification system based on the BERT pre-training model and the convolutional network as claimed in claim 6, wherein the unique identifiers of all domain papers are obtained from the database: paper ID data;
and acquiring the text data and the category label data of the paper from the database according to the acquired paper ID data, and storing the text data and the category label data in the memory.
8. The text classification system based on the BERT pre-training model and the convolutional network as claimed in claim 6, wherein the text data stored in the memory is subjected to data cleaning to remove characters and spaces in the text which do not meet the preset requirements.
9. The text classification system based on the BERT pre-trained model and the convolutional network as claimed in claim 8, wherein the cleaned text data and the class label data are in one-to-one correspondence;
all data are divided into a training data set and a testing data set according to a ratio of seven to three, and finally stored in a text file.
10. The text classification system based on the BERT pre-training model and the convolutional network as claimed in claim 6, wherein the network is built by using the PyTorch deep learning framework and the BERT pre-training model is loaded, text data of the training set and the test set are word-embedded, and each word in the text is represented by a different high-dimensional vector.
CN202110621401.9A 2021-06-03 2021-06-03 Text classification method and system based on BERT pre-training model and convolutional network Pending CN113468324A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110621401.9A CN113468324A (en) 2021-06-03 2021-06-03 Text classification method and system based on BERT pre-training model and convolutional network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110621401.9A CN113468324A (en) 2021-06-03 2021-06-03 Text classification method and system based on BERT pre-training model and convolutional network

Publications (1)

Publication Number Publication Date
CN113468324A (en) 2021-10-01

Family

ID=77872216

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110621401.9A Pending CN113468324A (en) 2021-06-03 2021-06-03 Text classification method and system based on BERT pre-training model and convolutional network

Country Status (1)

Country Link
CN (1) CN113468324A (en)


Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110334210A (en) * 2019-05-30 2019-10-15 哈尔滨理工大学 A kind of Chinese sentiment analysis method merged based on BERT with LSTM, CNN
CN111177376A (en) * 2019-12-17 2020-05-19 东华大学 Chinese text classification method based on BERT and CNN hierarchical connection
CN111651605A (en) * 2020-06-04 2020-09-11 电子科技大学 Lung cancer leading edge trend prediction method based on multi-label classification
CN112100387A (en) * 2020-11-13 2020-12-18 支付宝(杭州)信息技术有限公司 Training method and device of neural network system for text classification
CN112100388A (en) * 2020-11-18 2020-12-18 南京华苏科技有限公司 Method for analyzing emotional polarity of long text news public sentiment
CN112765358A (en) * 2021-02-23 2021-05-07 西安交通大学 Taxpayer industry classification method based on noise label learning
CN112861524A (en) * 2021-04-07 2021-05-28 中南大学 Deep learning-based multilevel Chinese fine-grained emotion analysis method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
张楠: 《深度学习自然语言处理实战》 (Deep Learning for Natural Language Processing in Practice), 31 August 2020 *
高志强, 中国铁道出版社有限公司 (China Railway Publishing House Co., Ltd.) *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114154493A (en) * 2022-01-28 2022-03-08 北京芯盾时代科技有限公司 Short message category identification method and device
CN115658886A (en) * 2022-09-20 2023-01-31 广东技术师范大学 Intelligent liver cancer staging method, system and medium based on semantic text
CN117455421A (en) * 2023-12-25 2024-01-26 杭州青塔科技有限公司 Subject classification method and device for scientific research projects, computer equipment and storage medium
CN117455421B (en) * 2023-12-25 2024-04-16 杭州青塔科技有限公司 Subject classification method and device for scientific research projects, computer equipment and storage medium

Similar Documents

Publication Publication Date Title
Gong et al. Natural language inference over interaction space
Toledo et al. Information extraction from historical handwritten document images with a context-aware neural model
CN113468324A (en) Text classification method and system based on BERT pre-training model and convolutional network
Campos et al. Biomedical named entity recognition: a survey of machine-learning tools
CN109858039B (en) Text information identification method and identification device
Rosca et al. Sequence-to-sequence neural network models for transliteration
CN111930942B (en) Text classification method, language model training method, device and equipment
Li et al. Word embedding and text classification based on deep learning methods
CN111563375B (en) Text generation method and device
CN110597961A (en) Text category labeling method and device, electronic equipment and storage medium
CN112528649B (en) English pinyin identification method and system for multi-language mixed text
CN110457483B (en) Long text generation method based on neural topic model
US9460076B1 (en) Method for unsupervised learning of grammatical parsers
CN111414746A (en) Matching statement determination method, device, equipment and storage medium
CN113836938A (en) Text similarity calculation method and device, storage medium and electronic device
CN111985243A (en) Emotion model training method, emotion analysis device and storage medium
CN110941958A (en) Text category labeling method and device, electronic equipment and storage medium
WO2021223882A1 (en) Prediction explanation in machine learning classifiers
CN113486178A (en) Text recognition model training method, text recognition device and medium
CN114528835A (en) Semi-supervised specialized term extraction method, medium and equipment based on interval discrimination
CN117453851A (en) Text index enhanced question-answering method and system based on knowledge graph
CN114036950A (en) Medical text named entity recognition method and system
CN117113937A (en) Electric power field reading and understanding method and system based on large-scale language model
Alosaimy et al. Tagging classical Arabic text using available morphological analysers and part of speech taggers
CN112948580B (en) Text classification method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20211001
