CN113139053B - Text classification method based on self-supervision contrast learning - Google Patents

Info

Publication number
CN113139053B
CN113139053B (application CN202110406702.XA)
Authority
CN
China
Prior art keywords
sample
text
classification model
texts
self
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110406702.XA
Other languages
Chinese (zh)
Other versions
CN113139053A (en)
Inventor
程良伦
王德培
张伟文
李睿濠
谭骏铭
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong University of Technology
Original Assignee
Guangdong University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong University of Technology filed Critical Guangdong University of Technology
Priority to CN202110406702.XA priority Critical patent/CN113139053B/en
Publication of CN113139053A publication Critical patent/CN113139053A/en
Application granted granted Critical
Publication of CN113139053B publication Critical patent/CN113139053B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G06F16/35 Clustering; Classification (G06F16/30: information retrieval of unstructured textual data)
    • G06N3/044 Recurrent networks, e.g. Hopfield networks (G06N3/04: neural-network architectures)
    • G06N3/045 Combinations of networks (G06N3/04: neural-network architectures)

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a text classification method based on self-supervised contrastive learning, relating to the technical field of natural language processing. The method comprises the following steps: acquiring sample texts and the category label corresponding to each sample text; dividing the sample texts into a training set, a verification set, and a test set, and constructing an initial classification model; preprocessing all sample texts; inputting all the preprocessed sample texts into the initial classification model, and pre-training the initial classification model with a self-supervised contrastive learning method based on the sample texts in the training set; adjusting the pre-trained initial classification model with the sample texts in the verification set; testing the adjusted initial classification model with the sample texts in the test set to obtain a final classification model; and inputting the text to be classified into the final classification model to obtain its classification result. The invention achieves fast learning with a small amount of labeled data, at low data cost and with accurate classification results.

Description

Text classification method based on self-supervision contrast learning
Technical Field
The invention relates to the technical field of natural language processing, in particular to a text classification method based on self-supervision contrast learning.
Background
Currently, most text classification techniques are based on deep neural networks, which require large amounts of labeled data to classify text. Obtaining large amounts of labeled data entails high economic overhead and intensive manual labor, and labeling accuracy is hard to guarantee. As the application fields of machine learning gradually expand, labeled data in many domains is severely scarce. Self-supervised learning has made great progress in image processing tasks and can improve the generalization performance of a model while requiring less data and fewer labels. How to apply self-supervised learning in the field of natural language processing therefore remains an open problem.
Chinese patent CN112395419A, published on 23 February 2021, provides a training method and device for a text classification model, and a text classification method and device, wherein the method comprises the following steps: determining a first vector group and a second vector group set according to the first sample text of the sample text set and the label set; inputting the first vector group and the second vector group set into a word-level attention layer to obtain a third vector set and a fourth vector set; inputting the third vector set and the fourth vector set into a sentence-level attention layer to obtain a first text vector set related to the tag set; inputting the first text vector set into a fully connected layer to obtain a prediction label of the first text; and training the text classification model based on the prediction labels and the first label group corresponding to the first sample text in the label set until a training stop condition is reached. These steps improve the accuracy of the text classification model to a certain extent, but training the model requires acquiring a large number of accurately labeled sample texts and label sets, so the data cost is high; moreover, label accuracy directly affects classification accuracy.
Disclosure of Invention
The invention provides a text classification method based on self-supervised contrastive learning, which achieves fast learning with a small amount of labeled data, classifies texts to be classified, and offers low data cost and accurate classification results.
In order to solve the technical problems, the technical scheme of the invention is as follows:
the invention provides a text classification method based on self-supervision contrast learning, which comprises the following steps:
s1: acquiring sample texts and category labels corresponding to each sample text; dividing the sample text into a training set, a verification set and a test set and constructing an initial classification model;
s2: preprocessing all sample texts;
s3: inputting all the preprocessed sample texts into an initial classification model, and pre-training the initial classification model by using a self-supervision contrast learning method based on the sample texts in a training set; adjusting the initial classification model after pre-training by using sample texts in the verification set; testing the adjusted initial classification model by using the sample text in the test set to obtain a final classification model;
s4: and inputting the text to be classified into a final classification model to obtain a classification result of the text to be classified.
Preferably, the sample text is obtained from an existing Cnews dataset.
Preferably, the category labels corresponding to the sample texts are obtained by manual labeling, by semi-automatic labeling with auxiliary tools, or by fully automatic labeling using rules and dictionaries.
Preferably, the specific preprocessing method is as follows:
Text sentence segmentation: the text is split into sentences according to punctuation marks;
Word segmentation: Chinese is segmented into words according to semantics, and English is split into words by spaces;
Stop word removal: stop words, punctuation marks, and numbers that contribute little to classification are removed.
Preferably, in the step S3, the specific method for obtaining the final classification model is as follows:
s3.1: based on the preprocessed sample texts, word vector representation forms of all the sample texts are obtained;
s3.2: extracting features of all sample texts in the word vector representation form;
s3.3: pooling operation is carried out on the sample text after feature extraction, and a pooled training set, verification set and test set are obtained;
s3.4: based on the sample text in the pooled training set, pre-training the initial classification model by using a self-supervision learning method; continuously adjusting the initial classification model by using the sample text in the pooled verification set through setting a first loss function, and finishing adjustment when the value of the first loss function is minimum;
s3.5: testing the adjusted initial classification model by using sample texts in the pooled test data set; and setting a second loss function, and when the value of the second loss function is minimum, completing the test to obtain a final classification model.
Preferably, in S3.1, the specific method for obtaining the word vector representation form of the sample text is as follows:
Word vector training is carried out on all the preprocessed sample texts using a word embedding technique, and each sample text is vectorized and encoded as x_i = {w_1, w_2, …, w_j}, where x_i represents the vector of the i-th sample text and w_j represents the word vector of the j-th word in the i-th sample text.
Preferably, in S3.2, feature extraction is performed on all sample texts using a multi-layer CNN, and the sample texts are divided into positive-class and negative-class sample texts according to the features.
Preferably, in S3.3, the pooling operation, specifically, the maximum pooling operation, is performed on the sample text after feature extraction.
Preferably, in S3.4, the first loss function is:

L_1 = -log[ exp(f^T f^+) / ( exp(f^T f^+) + Σ_{m=1}^{N} exp(f^T f_m) ) ]

where x represents a sample text vector, x^+ represents a positive-class sample text vector, x_m represents the m-th negative-class sample text vector, N represents the number of negative-class sample texts, f represents the encoder, f^T represents the transposed encoding of the sample text vector x, f_m represents the encoding result of the m-th negative-class sample text, f^+ represents the encoding result of the positive-class sample text, and exp() represents the exponential function with base e.
Preferably, in S3.5, the second loss function is the cross-entropy:

L_2 = -Σ_i Σ_{c=1}^{C} y_{i,c} log(ŷ_{i,c})

where C represents the number of categories of the sample texts, c is a particular category, y_i represents the true label of the i-th sample text, and ŷ_i represents the predicted label of the i-th sample text.
Compared with the prior art, the technical scheme of the invention has the beneficial effects that:
according to the method, a small amount of sample texts with category labels are divided into a training set, a verification set and a test set, firstly, based on sample texts of the training set, an initial classification model is pre-trained by using a self-supervision comparison learning method, the initial classification model which is trained is adjusted by using sample texts of the verification set, and finally, the adjusted initial classification model is tested by using sample texts of the test set, so that a final classification model is obtained; inputting the text to be classified into a final classification model to obtain a classification result of the text to be classified. In the complete training process, the invention uses a small amount of sample text with category labels as input to train the initial classification model, thereby greatly reducing the dependence on a large amount of data with accurate labels, reducing the repeated labor of manual labels, realizing quick learning under a small amount of data with labels, having low data cost and accurate classification result.
Drawings
FIG. 1 is a flow chart of a text classification method based on self-supervised contrast learning according to an embodiment;
FIG. 2 is a flow chart of a method of obtaining a final classification model according to an embodiment;
FIG. 3 is a schematic diagram of the learning speed and classification accuracy of the conventional TextRNN with 500 sample texts per class;
FIG. 4 is a schematic diagram of the learning speed and classification accuracy of the conventional TextCNN with 500 sample texts per class;
FIG. 5 is a schematic diagram of the learning speed and classification accuracy of the conventional SCL-RNN with 500 sample texts per class;
FIG. 6 is a schematic diagram of the learning speed and classification accuracy of the method of the embodiment with 500 sample texts per class;
FIG. 7 is a schematic diagram of the learning speed and classification accuracy of the conventional TextRNN with 1000 sample texts per class;
FIG. 8 is a schematic diagram of the learning speed and classification accuracy of the conventional TextCNN with 1000 sample texts per class;
FIG. 9 is a schematic diagram of the learning speed and classification accuracy of the conventional SCL-RNN with 1000 sample texts per class;
FIG. 10 is a schematic diagram of the learning speed and classification accuracy of the method of the embodiment with 1000 sample texts per class.
Detailed Description
The drawings are for illustrative purposes only and are not to be construed as limiting the present patent;
for the purpose of better illustrating the embodiments, certain elements of the drawings may be omitted, enlarged or reduced and do not represent the actual product dimensions;
it will be appreciated by those skilled in the art that certain well-known structures in the drawings and descriptions thereof may be omitted.
The technical scheme of the invention is further described below with reference to the accompanying drawings and examples.
Examples
The invention provides a text classification method based on self-supervised contrastive learning, which achieves fast learning with a small amount of labeled data, classifies texts to be classified, and offers low data cost and accurate classification results.
In order to solve the technical problems, the technical scheme of the invention is as follows:
the embodiment provides a text classification method based on self-supervision contrast learning, as shown in fig. 1, the method comprises the following steps:
s1: acquiring sample texts and category labels corresponding to each sample text; dividing the sample text into a training set, a verification set and a test set and constructing an initial classification model; in the embodiment, an initial classification model is constructed by using an unsupervised clustering method;
s2: preprocessing all sample texts;
s3: inputting all the preprocessed sample texts into an initial classification model, and pre-training the initial classification model by using a self-supervision contrast learning method based on the sample texts in a training set; adjusting the initial classification model after pre-training by using sample texts in the verification set; testing the adjusted initial classification model by using the sample text in the test set to obtain a final classification model;
s4: and inputting the text to be classified into a final classification model to obtain a classification result of the text to be classified.
The sample text is obtained from an existing Cnews dataset.
The category labels corresponding to the sample texts are obtained by manual labeling, by semi-automatic labeling with auxiliary tools, or by fully automatic labeling using rules and dictionaries.
The specific method for preprocessing comprises the following steps:
Text sentence segmentation: the text is split into sentences according to punctuation marks;
Word segmentation: Chinese is segmented into words according to semantics, and English is split into words by spaces;
Stop word removal: stop words, punctuation marks, and numbers that contribute little to classification are removed.
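As an illustrative sketch only, the three preprocessing steps above might be implemented as follows; the jieba segmenter and the tiny stop-word list are assumptions, since the embodiment names no particular tools:

```python
import re
import jieba  # assumed Chinese word segmenter; the embodiment does not name one

STOP_WORDS = {"的", "了", "是", "在"}  # illustrative stop-word list only

def preprocess(text: str) -> list[list[str]]:
    # Text sentence segmentation: split into sentences on punctuation marks.
    sentences = [s for s in re.split(r"[。！？!?.]", text) if s.strip()]
    tokenized = []
    for sentence in sentences:
        # Word segmentation: Chinese by semantics via jieba; English tokens
        # fall out of the same call, separated at spaces.
        words = jieba.lcut(sentence)
        # Stop word removal: drop stop words, punctuation marks, and numbers.
        words = [w for w in words
                 if w.strip()
                 and w not in STOP_WORDS
                 and not re.fullmatch(r"[\W\d_]+", w)]
        tokenized.append(words)
    return tokenized
```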
As shown in fig. 2, in the step S3, a specific method for obtaining the final classification model is as follows:
s3.1: based on the preprocessed sample texts, word vector representation forms of all the sample texts are obtained;
Word vector training is carried out on all the preprocessed sample texts using a word embedding technique, and each sample text is vectorized and encoded as x_i = {w_1, w_2, …, w_j}, where x_i represents the vector of the i-th sample text and w_j represents the word vector of the j-th word in the i-th sample text.
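A minimal sketch of this word-embedding step, assuming gensim's Word2Vec as the embedding technique (the embodiment does not name a specific one); the toy corpus and the 100-dimensional size are illustrative assumptions:

```python
from gensim.models import Word2Vec

# Toy tokenized corpus standing in for the preprocessed sample texts.
corpus = [["股市", "小幅", "上涨"], ["球队", "夺得", "冠军"]]

# Train word vectors over the corpus.
model = Word2Vec(sentences=corpus, vector_size=100, window=5, min_count=1)

# Vectorize the i-th sample text as x_i = {w_1, w_2, ..., w_j}.
x_0 = [model.wv[word] for word in corpus[0]]
```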
S3.2: extracting features of all sample texts in the word vector representation form;
Features of all the sample texts are extracted with a multi-layer CNN, and the sample texts are divided into positive-class sample texts and negative-class sample texts according to the features. In this embodiment, the basic parameters of the CNN network are set as follows: the input dimension is 500, dropout = 0.2, filter = 256, kernel_size = 3/4/5, and the ReLU activation function is used.
S3.3: performing maximum pooling operation on the sample text after feature extraction to obtain a pooled training set, a pooled verification set and a pooled test set;
s3.4: based on the sample text in the pooled training set, pre-training the initial classification model by using a self-supervision learning method; continuously adjusting the initial classification model by using the sample text in the pooled verification set through setting a first loss function, and finishing adjustment when the value of the first loss function is minimum;
The first loss function is:

L_1 = -log[ exp(f^T f^+) / ( exp(f^T f^+) + Σ_{m=1}^{N} exp(f^T f_m) ) ]

where x represents a sample text vector, x^+ represents a positive-class sample text vector, x_m represents the m-th negative-class sample text vector, N represents the number of negative-class sample texts, f represents the encoder, f^T represents the transposed encoding of the sample text vector x, f_m represents the encoding result of the m-th negative-class sample text, f^+ represents the encoding result of the positive-class sample text, and exp() represents the exponential function with base e.
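A minimal sketch of this first loss function under the definitions above; PyTorch and dot-product similarity are assumptions:

```python
import torch

def first_loss(f_x: torch.Tensor, f_pos: torch.Tensor,
               f_negs: torch.Tensor) -> torch.Tensor:
    # f_x: (d,) encoding f of the sample text vector x
    # f_pos: (d,) encoding f+ of the positive-class sample text
    # f_negs: (N, d) encodings f_m of the N negative-class sample texts
    pos = torch.exp(f_x @ f_pos)           # exp(f^T f+)
    negs = torch.exp(f_negs @ f_x).sum()   # sum over exp(f^T f_m)
    return -torch.log(pos / (pos + negs))
```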
S3.5: testing the adjusted initial classification model by using sample texts in the pooled test data set; and setting a second loss function, and when the value of the second loss function is minimum, completing the test to obtain a final classification model.
The second loss function is:

L_2 = -Σ_i Σ_{c=1}^{C} y_{i,c} log(ŷ_{i,c})

where C represents the number of categories of the sample texts, c is a particular category, y_i represents the true label of the i-th sample text, and ŷ_i represents the predicted label of the i-th sample text.
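Likewise, a minimal sketch of the second loss function as multi-class cross-entropy over one-hot labels; PyTorch and the mean reduction over samples are assumptions:

```python
import torch

def second_loss(y_true: torch.Tensor, y_pred: torch.Tensor) -> torch.Tensor:
    # y_true: (num_texts, C) one-hot true labels y_i
    # y_pred: (num_texts, C) predicted class probabilities for each sample text
    eps = 1e-12  # numerical-stability guard (implementation detail, not in the patent)
    return -(y_true * torch.log(y_pred + eps)).sum(dim=1).mean()
```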
In an implementation, sample texts are obtained from the Cnews dataset, taking 500 and 1000 sample texts per class respectively, with parameter settings kept consistent. The text classification method based on self-supervised contrastive learning of this embodiment (SCL-CNN) is compared with the conventional TextRNN, TextCNN, and SCL-RNN. When the number of sample texts per class is 500, the results are shown in the following table:
Model (LR=0.001)   Batch-size   Epochs   Acc
TextRNN            64           50       55.6
TextCNN            64           50       83.0
SCL-RNN            64           50       91.7
SCL-CNN            64           50       97.7
When the number of sample texts per class is 1000, the results are shown in the following table:
Model (LR=0.001)   Batch-size   Epochs   Acc
TextRNN            64           50       57.45
TextCNN            64           50       85.7
SCL-RNN            64           50       95.35
SCL-CNN            64           50       97.6
Batch-size and Epochs in the tables are parameters kept consistent across methods, and Acc denotes classification accuracy. As the tables show, the classification accuracy of the text classification method based on self-supervised contrastive learning (SCL-CNN) provided by this embodiment is higher than that of the conventional TextRNN, TextCNN, and SCL-RNN. Moreover, increasing the sample texts from 500 to 1000 per class brings no further improvement in classification accuracy, indicating that the method can accurately classify texts to be classified with only a small amount of labeled data.
The left side of each of FIGS. 3-6 shows the learning speed with 500 sample texts per class, with Epochs on the abscissa and the value of the second LOSS function on the ordinate; when training with the method provided by this embodiment, the inflection point appears at a far smaller Epochs value than with the conventional TextRNN, TextCNN, and SCL-RNN. The right side of each of FIGS. 3-6 shows the classification accuracy with 500 sample texts per class, with Epochs on the abscissa and classification accuracy on the ordinate; the maximum classification accuracy is reached sooner when training with the method provided by this embodiment.
When the sample texts are increased from 500 to 1000 per class, the left side of each of FIGS. 7-10 shows the learning speed with 1000 sample texts per class; again, the inflection point appears at a far smaller Epochs value with the method of this embodiment than with the conventional TextRNN, TextCNN, and SCL-RNN. The right side of each of FIGS. 7-10 shows the classification accuracy with 1000 sample texts per class; the maximum classification accuracy is again reached sooner with the method of this embodiment.
In summary, the learning speed of the method provided by this embodiment is far faster than that of the conventional TextRNN, TextCNN, and SCL-RNN.
Comparing FIG. 6 and FIG. 10, when the sample texts are increased from 500 to 1000 per class, neither the Epochs value at the inflection point nor the Epochs value at which the maximum classification accuracy is reached increases significantly, which indicates that the method achieves fast learning with a small amount of labeled data.
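For illustration only, the sketches above can be tied together into a hypothetical pre-training loop; every identifier here comes from the earlier sketches, the toy data is random, and none of this is the patent's reference implementation:

```python
import torch

# `encoder` and `first_loss` are the sketches defined above.
# A toy (anchor, positive, negatives) triple stands in for the pooled training set (S3.4).
pretrain_batches = [(torch.randn(500, 100),       # anchor sample text x
                     torch.randn(500, 100),       # positive-class sample text x+
                     torch.randn(3, 500, 100))]   # N = 3 negative-class sample texts

optimizer = torch.optim.Adam(encoder.parameters(), lr=0.001)  # LR from the tables

for epoch in range(50):  # Epochs = 50, matching the tables above
    for x, x_pos, x_negs in pretrain_batches:
        f_x = encoder(x.unsqueeze(0)).squeeze(0)        # (d,)
        f_pos = encoder(x_pos.unsqueeze(0)).squeeze(0)  # (d,)
        f_negs = encoder(x_negs)                        # (N, d)
        loss = first_loss(f_x, f_pos, f_negs)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```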
It is to be understood that the above examples of the present invention are provided by way of illustration only and not by way of limitation of the embodiments of the present invention. Other variations or modifications will be apparent to those of ordinary skill in the art from the above description. It is neither necessary nor possible to exhaustively list all embodiments here. Any modification, equivalent replacement, or improvement made within the spirit and principles of the invention shall fall within the protection scope of the claims of the invention.

Claims (7)

1. A text classification method based on self-supervised contrast learning, the method comprising the steps of:
s1: acquiring sample texts and category labels corresponding to each sample text; dividing the sample text into a training set, a verification set and a test set and constructing an initial classification model;
s2: preprocessing all sample texts;
s3: inputting all the preprocessed sample texts into an initial classification model, and pre-training the initial classification model by using a self-supervision contrast learning method based on the sample texts in a training set; adjusting the initial classification model after pre-training by using sample texts in the verification set; testing the adjusted initial classification model by using the sample text in the test set to obtain a final classification model; the specific method comprises the following steps:
s3.1: based on the preprocessed sample texts, word vector representation forms of all the sample texts are obtained;
s3.2: extracting features of all sample texts in the word vector representation form;
s3.3: pooling operation is carried out on the sample text after feature extraction, and a pooled training set, verification set and test set are obtained;
s3.4: based on the sample text in the pooled training set, pre-training the initial classification model by using a self-supervision learning method; continuously adjusting the initial classification model by using the sample text in the pooled verification set through setting a first loss function, and finishing adjustment when the value of the first loss function is minimum;
the first loss function is:

L_1 = -log[ exp(f^T f^+) / ( exp(f^T f^+) + Σ_{m=1}^{N} exp(f^T f_m) ) ]

where x represents a sample text vector, x^+ represents a positive-class sample text vector, x_m represents the m-th negative-class sample text vector, N represents the number of negative-class sample texts, f represents the encoder, f^T represents the transposed encoding of the sample text vector x, f_m represents the encoding result of the m-th negative-class sample text, f^+ represents the encoding result of the positive-class sample text, and exp() represents the exponential function with base e;
s3.5: testing the adjusted initial classification model by using sample texts in the pooled test data set; setting a second loss function, and when the value of the second loss function is minimum, completing the test to obtain a final classification model;
the second loss function is:

L_2 = -Σ_i Σ_{c=1}^{C} y_{i,c} log(ŷ_{i,c})

where C represents the number of categories of the sample texts, c is a particular category, y_i represents the true label of the i-th sample text, and ŷ_i represents the predicted label of the i-th sample text;
s4: and inputting the text to be classified into a final classification model to obtain a classification result of the text to be classified.
2. A method of classifying text based on self-supervised contrast learning as recited in claim 1, wherein the sample text is obtained from an existing Cnews dataset.
3. The text classification method based on self-supervised contrast learning as set forth in claim 2, wherein the category labels corresponding to the sample texts are obtained by manual labeling, by semi-automatic labeling with auxiliary tools, or by fully automatic labeling using rules and dictionaries.
4. A method of classifying text based on self-supervised contrast learning as recited in claim 3, wherein the preprocessing includes sentence segmentation, word segmentation, and stop word removal for the sample text.
5. The text classification method based on self-supervised contrast learning of claim 4, wherein in S3.1, the specific method for obtaining the word vector representation form of the sample text is as follows:
Word vector training is carried out on all the preprocessed sample texts using a word embedding technique, and each sample text is vectorized and encoded as x_i = {w_1, w_2, …, w_j}, where x_i represents the vector of the i-th sample text and w_j represents the word vector of the j-th word in the i-th sample text.
6. The text classification method based on self-supervised contrast learning as claimed in claim 5, wherein in S3.2, feature extraction is performed on all sample texts using a multi-layer CNN, and the sample texts are divided into positive-class and negative-class sample texts according to the features.
7. The text classification method based on self-supervised contrast learning of claim 6, wherein in S3.3, the pooling operation, specifically, the maximum pooling operation, is performed on the sample text after feature extraction.
CN202110406702.XA 2021-04-15 2021-04-15 Text classification method based on self-supervision contrast learning Active CN113139053B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110406702.XA CN113139053B (en) 2021-04-15 2021-04-15 Text classification method based on self-supervision contrast learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110406702.XA CN113139053B (en) 2021-04-15 2021-04-15 Text classification method based on self-supervision contrast learning

Publications (2)

Publication Number Publication Date
CN113139053A CN113139053A (en) 2021-07-20
CN113139053B (en) 2024-03-05

Family

ID=76812978

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110406702.XA Active CN113139053B (en) 2021-04-15 2021-04-15 Text classification method based on self-supervision contrast learning

Country Status (1)

Country Link
CN (1) CN113139053B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113642618B (en) * 2021-07-27 2024-03-01 上海展湾信息科技有限公司 Method and equipment for training screw device state prediction model
CN114330312B (en) * 2021-11-03 2024-06-14 腾讯科技(深圳)有限公司 Title text processing method, title text processing device, title text processing program, and recording medium
CN114357168B (en) * 2021-12-31 2022-08-02 成都信息工程大学 Text classification method
CN114548321B (en) * 2022-03-05 2024-06-25 昆明理工大学 Self-supervision public opinion comment viewpoint object classification method based on contrast learning
CN117421595A (en) * 2023-10-25 2024-01-19 广东技术师范大学 System log anomaly detection method and system based on deep learning technology

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111191732A (en) * 2020-01-03 2020-05-22 天津大学 Target detection method based on full-automatic learning
CN111274405A (en) * 2020-02-26 2020-06-12 北京工业大学 Text classification method based on GCN
CN111897961A (en) * 2020-07-22 2020-11-06 深圳大学 Text classification method and related components of wide neural network model
CN111950269A (en) * 2020-08-21 2020-11-17 清华大学 Text statement processing method and device, computer equipment and storage medium
CN112100387A (en) * 2020-11-13 2020-12-18 支付宝(杭州)信息技术有限公司 Training method and device of neural network system for text classification
CN112101328A (en) * 2020-11-19 2020-12-18 四川新网银行股份有限公司 Method for identifying and processing label noise in deep learning
CN112348792A (en) * 2020-11-04 2021-02-09 广东工业大学 X-ray chest radiography image classification method based on small sample learning and self-supervision learning
CN112381116A (en) * 2020-10-21 2021-02-19 福州大学 Self-supervision image classification method based on contrast learning
CN112464879A (en) * 2020-12-10 2021-03-09 山东易视智能科技有限公司 Ocean target detection method and system based on self-supervision characterization learning
WO2021051560A1 (en) * 2019-09-17 2021-03-25 平安科技(深圳)有限公司 Text classification method and apparatus, electronic device, and computer non-volatile readable storage medium

Also Published As

Publication number Publication date
CN113139053A (en) 2021-07-20


Legal Events

Code   Description
PB01   Publication
SE01   Entry into force of request for substantive examination
GR01   Patent grant