CN113869398B - Unbalanced text classification method, device, equipment and storage medium

Unbalanced text classification method, device, equipment and storage medium

Info

Publication number
CN113869398B
CN113869398B (application CN202111128220.9A)
Authority
CN
China
Prior art keywords: text, data, training, feature, processed
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111128220.9A
Other languages
Chinese (zh)
Other versions
CN113869398A (en)
Inventors
司世景 (Si Shijing)
王健宗 (Wang Jianzong)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202111128220.9A priority Critical patent/CN113869398B/en
Publication of CN113869398A publication Critical patent/CN113869398A/en
Application granted granted Critical
Publication of CN113869398B publication Critical patent/CN113869398B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS; G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/24 Classification techniques (G06F18/00 Pattern recognition; G06F18/20 Analysing)
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation (G06F18/00 Pattern recognition; G06F18/20 Analysing)
    • G06F16/906 Clustering; Classification (G06F16/00 Information retrieval; G06F16/90 Details of database functions independent of the retrieved data types)
    • G06F40/242 Dictionaries (G06F40/00 Handling natural language data; G06F40/20 Natural language analysis; G06F40/237 Lexical tools)
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates (G06F40/00 Handling natural language data; G06F40/20 Natural language analysis; G06F40/279 Recognition of textual entities)
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/045 Combinations of networks (G06N3/00 Computing arrangements based on biological models; G06N3/02 Neural networks; G06N3/04 Architecture, e.g. interconnection topology)


Abstract

The application relates to the technical field of artificial intelligence and discloses an unbalanced text classification method, device, equipment and storage medium. The method comprises: acquiring a text training data set; training a feature extractor by using a contrastive learning algorithm based on the text training data set to obtain a feature extraction model; performing equalization processing on the text training data set by using an equalization algorithm to obtain an equalized data set; extracting features of each text in the equalized data set by using the feature extraction model to obtain feature vectors; training a classifier based on the feature vectors to obtain a classification model; and acquiring text data to be processed, which is processed by the feature extraction model and the classification model to obtain the corresponding category. The application also relates to blockchain technology: the text data to be processed and its corresponding category data may be stored in a blockchain. With the application, even when trained on an unbalanced training set, the feature extraction model and the classification model still achieve good text classification accuracy.

Description

Unbalanced text classification method, device, equipment and storage medium
Technical Field
The present application relates to the field of artificial intelligence, and in particular, to a method, apparatus, device, and storage medium for unbalanced text classification.
Background
Text classification is currently one of the most common and important tasks in natural language processing. In text classification, sample balance has a great influence on the final classification result. In practice, data imbalance is very common: data of different categories is rarely distributed uniformly and is unbalanced in most cases. If unbalanced class samples are used directly for learning and classification, the model learns well on classes that appear frequently, but learns poorly and generalizes poorly on classes that appear rarely. In the prior art, unbalanced samples are usually handled by resampling, data synthesis or re-weighting, which solves part of the data imbalance problem, but the performance gains of these methods are not significant. Therefore, how to train a classifier on an unbalanced training set so that it still classifies text accurately has become an urgent problem to be solved.
Disclosure of Invention
The application provides an unbalanced text classification method, device, equipment and storage medium, to solve the prior-art problem that a classification model trained on an unbalanced training set classifies poorly.
In order to solve the above problems, the present application provides an unbalanced text classification method, including:
acquiring a text training data set;
training a feature extractor by using a contrastive learning algorithm based on the text training data set to obtain a feature extraction model;
performing equalization processing on the text training data set by using an equalization algorithm to obtain an equalized data set;
extracting features of each text in the equalized data set by using the feature extraction model to obtain corresponding feature vectors;
training a classifier based on the feature vectors to obtain a classification model;
and acquiring text data to be processed, wherein the text data to be processed is processed by the feature extraction model to obtain a corresponding vector to be processed, and the vector to be processed is processed by the classification model to obtain the category corresponding to the text data to be processed.
Further, before acquiring the text training data set, the method further comprises:
performing data augmentation on each piece of text data to obtain corresponding augmented data, and storing the text data and the augmented data in the text training data set;
and the training the feature extractor by using a contrastive learning algorithm comprises:
based on the augmented data, combining a contrastive learning algorithm into the feature extractor, so that first feature vectors corresponding to homologous augmented data obtained by the feature extractor are drawn together, and first feature vectors corresponding to augmented data from different sources are pushed apart.
Further, the combining a contrastive learning algorithm into the feature extractor, so that first feature vectors corresponding to homologous augmented data obtained by the feature extractor are drawn together and first feature vectors corresponding to augmented data from different sources are pushed apart, comprises:
calculating a loss function of the feature extractor by using a noise contrastive estimation function and an inner product function in the contrastive learning algorithm, so that homologous first feature vectors gather in an embedding space and first feature vectors from different sources stay far apart in the embedding space.
Further, the performing equalization processing on the text training data set by using an equalization algorithm comprises:
acquiring each category in the text training data set and the corresponding amount of text data;
calculating the average of the amounts of text data corresponding to the categories to obtain an average data amount;
comparing the amount of text data of each category with the average data amount;
if the amount is smaller than the average data amount, taking the text data corresponding to that category as text data to be augmented, and calculating the difference between the amount of text data corresponding to that category and the average data amount to obtain the augmentation number;
and based on the augmentation number, augmenting the text data to be augmented by using a synthetic minority oversampling algorithm or a data augmentation tool.
Further, before the feature extraction is performed on each text in the equalized data set by using the feature extraction model, the method further comprises:
performing word segmentation on the text by using jieba word segmentation to obtain a plurality of corresponding words;
and vectorizing the words to obtain word vectors corresponding to the words.
Further, the performing feature extraction on each text in the equalized data set by using the feature extraction model to obtain corresponding feature vectors comprises:
performing convolution on the word vectors with convolution kernels;
processing the convolution results with an activation function to obtain corresponding feature maps;
and performing pooling on the feature maps to obtain the feature vectors corresponding to the word vectors.
Further, the training the classifier based on the feature vectors to obtain a classification model comprises:
training the classifier by using a mini-batch gradient descent algorithm based on the feature vectors to obtain the classification model.
In order to solve the above problems, the present application also provides an unbalanced text classification apparatus, the apparatus comprising:
the acquisition module is used for acquiring a text training data set;
the feature extraction model training module is used for training a feature extractor by using a contrastive learning algorithm based on the text training data set to obtain a feature extraction model;
the equalization processing module is used for performing equalization processing on the text training data set by using an equalization algorithm to obtain an equalized data set;
the feature extraction module is used for performing feature extraction on each text in the equalized data set by using the feature extraction model to obtain corresponding feature vectors;
the classification model training module is used for training the classifier based on the feature vectors to obtain a classification model;
and the classification module is used for acquiring text data to be processed, wherein the text data to be processed is processed by the feature extraction model to obtain a corresponding vector to be processed, and the vector to be processed is processed by the classification model to obtain the category corresponding to the text data to be processed.
In order to solve the above problems, the present application also provides a computer apparatus comprising:
At least one processor; and
A memory communicatively coupled to the at least one processor; wherein,
The memory stores instructions executable by the at least one processor to enable the at least one processor to perform the unbalanced text classification method as described above.
In order to solve the above-mentioned problems, the present application also provides a non-volatile computer readable storage medium having stored thereon computer readable instructions which, when executed by a processor, implement the unbalanced text classification method as described above.
Compared with the prior art, the unbalanced text classification method, the unbalanced text classification device, the unbalanced text classification equipment and the storage medium have at least the following beneficial effects:
a feature extractor is trained by using a contrastive learning algorithm based on the text training data set to obtain a feature extraction model, and introducing the contrastive learning algorithm enables the features extracted by the feature extraction model to better represent the corresponding words; equalization processing is performed on the text training data set by using an equalization algorithm to obtain an equalized data set, which facilitates the subsequent training of the classification model, and feature extraction is performed on each text in the equalized data set by using the feature extraction model to obtain corresponding feature vectors; the classifier is trained based on the feature vectors to obtain a classification model; by training the feature extractor and the classifier separately, a better feature extraction model and a better classification model can be obtained. Finally, text data to be processed is acquired and processed by the feature extraction model to obtain a corresponding vector to be processed, and the vector to be processed is processed by the classification model to obtain the category corresponding to the text data to be processed. In this way, a feature extraction model and a classification model can be trained well on unbalanced data, and the data to be processed can be classified accurately by using them.
Drawings
In order to more clearly illustrate the solution of the present application, a brief description will be given below of the drawings required for the description of the embodiments of the present application, and it will be apparent that the drawings in the following description are some embodiments of the present application, and that other drawings may be obtained according to these drawings without the need for inventive effort for a person of ordinary skill in the art.
FIG. 1 is a flow chart illustrating a method for classifying unbalanced text according to an embodiment of the present application;
FIG. 2 is a schematic block diagram of an unbalanced text classification device according to an embodiment of the present application;
FIG. 3 is a schematic structural diagram of a computer device according to an embodiment of the application.
Detailed Description
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs; the terminology used in the description of the applications herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application; the terms "comprising" and "having" and any variations thereof in the description of the application and the claims and the description of the drawings above are intended to cover a non-exclusive inclusion. The terms first, second and the like in the description and in the claims or in the above-described figures, are used for distinguishing between different objects and not necessarily for describing a sequential or chronological order.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the application. The appearances of such phrases in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those skilled in the art will appreciate, either explicitly or implicitly, that the embodiments described herein may be combined with other embodiments.
The application provides an unbalanced text classification method. Referring to fig. 1, a flow chart of an unbalanced text classification method according to an embodiment of the application is shown.
In this embodiment, the unbalanced text classification method includes:
s1, acquiring a text training data set;
Specifically, the text training data set is obtained by retrieving it from a database or by directly receiving a text training data set input by a user; the text training data set includes a large amount of text data, and the text data in it is labeled.
Further, the acquiring the text training data set comprises:
sending a call request to the database, the call request carrying a signature verification token;
and receiving the signature verification result returned by the database, and retrieving the text training data set from the database when the verification passes.
By encrypting the database and verifying the signature whenever data is extracted from it, the security of the data is ensured.
Further, before acquiring the text training data set, the method further comprises:
performing data augmentation on each piece of text data to obtain corresponding augmented data, and storing the text data and the augmented data in the text training data set;
and the training the feature extractor by using a contrastive learning algorithm comprises:
based on the augmented data, combining a contrastive learning algorithm into the feature extractor, so that first feature vectors corresponding to homologous augmented data obtained by the feature extractor are drawn together, and first feature vectors corresponding to augmented data from different sources are pushed apart.
Specifically, the feature grabbing capability and stability of the feature extractor are improved by introducing a contrastive learning algorithm, specifically a contrastive sample loss, when training the feature extractor. After the contrastive learning algorithm is introduced, data augmentation needs to be performed on the text data: for example, parts of the original text are replaced with synonyms, randomly inserted, randomly swapped or randomly deleted to obtain independent augmented data. Introducing the contrastive learning algorithm makes the feature vectors produced by the feature extractor for homologous texts gather together while those for texts from different sources move apart, so that the resulting feature extraction model captures the feature information of each text to the greatest extent.
Meanwhile, replacing parts of a sentence with synonyms avoids, to a certain extent, the overfitting that direct oversampling would cause, increases the model's familiarity with the unbalanced texts, and provides a good initial value for subsequent model learning, thereby optimizing the model.
By introducing a contrastive learning algorithm, the feature extraction model obtained through training can capture the feature information of each text to the greatest extent.
Still further, the combining a contrastive learning algorithm into the feature extractor, so that first feature vectors corresponding to homologous augmented data obtained by the feature extractor are drawn together and first feature vectors corresponding to augmented data from different sources are pushed apart, comprises:
calculating a loss function of the feature extractor by using a noise contrastive estimation function and an inner product function in the contrastive learning algorithm, and performing feature extraction on the text data by using the feature extraction model combined with the contrastive learning algorithm, so that homologous first feature vectors gather in the embedding space and first feature vectors from different sources stay far apart in the embedding space. Whether data are homologous is judged by whether they belong to the same text data: multiple different augmented data derived from the same text data all belong to the same source.
By combining the noise contrastive estimation function $L_{NCE}$ with the inner-product similarity $S_{SimCLR}$, the contrastive sample loss $L_{con}$ is obtained, that is,

$$L_{con} = -\sum_{i=1}^{N} \log \frac{\exp\left(s\left(v_i^{(1)}, v_i^{(2)}\right)\right)}{\sum_{v \in \{v^{(2)}\} \cup \{v_{-i}^{(1)}\}} \exp\left(s\left(v_i^{(1)}, v\right)\right)}$$

where $v_i^{(1)}$ is the $i$-th sample of the first augmentation, $v^{(2)}$ is the data set produced by the second augmentation and $v_j^{(2)}$ its $j$-th sample, $v_{-i}^{(1)}$ is the first-augmentation data set with the $i$-th sample removed, $\exp(\cdot)$ is the exponential function with base $e$, and $s$ is the similarity measure, i.e. the inner product, consistent with $S_{SimCLR}$, so that $s(v_i^{(1)}, v_i^{(2)})$ is the similarity between the $i$-th samples of the two augmentations; the denominator runs over the set consisting of $v^{(2)}$ and $v_{-i}^{(1)}$ together; $i, j = 1, 2, \ldots, N$.
By using the combination of the noise contrastive estimation function and the inner product function as the loss function of the feature extractor, the feature extraction model obtained by the final training extracts features better: the first feature vectors corresponding to homologous augmented data gather together, and the first feature vectors corresponding to augmented data from different sources move apart.
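For illustration only, the contrastive sample loss above can be sketched as follows; this is a minimal sketch under stated assumptions (PyTorch, plain inner-product similarity without a temperature term), and the function and variable names are illustrative rather than taken from the patent:

```python
import torch
import torch.nn.functional as F

def contrastive_sample_loss(z1: torch.Tensor, z2: torch.Tensor) -> torch.Tensor:
    """z1, z2: (N, d) feature vectors of the two augmented views of the same N texts."""
    n = z1.size(0)
    sim_12 = z1 @ z2.t()                               # s(v_i^(1), v_j^(2)) for all pairs
    sim_11 = z1 @ z1.t()                               # s(v_i^(1), v_j^(1)) for all pairs
    mask = torch.eye(n, dtype=torch.bool, device=z1.device)
    sim_11 = sim_11.masked_fill(mask, float('-inf'))   # exclude s(v_i^(1), v_i^(1))
    logits = torch.cat([sim_12, sim_11], dim=1)        # denominator set: v^(2) plus v_-i^(1)
    targets = torch.arange(n, device=z1.device)        # positives on the diagonal of sim_12
    return F.cross_entropy(logits, targets)            # mean of the -log terms over i
```

The cross-entropy over the concatenated similarity matrix reproduces the fraction above: the positive pair similarity sits in the numerator, and all second-augmentation samples plus the other first-augmentation samples form the denominator.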
S2, training a feature extractor by using a contrastive learning algorithm based on the text training data set to obtain a feature extraction model;
specifically, feature extraction is performed on each text in the text training data set by a feature extractor, such as TextCNN or the encoder in a Transformer, to obtain the feature vector corresponding to the text. The feature extractor extracts deeper feature vectors from the text, and the contrastive learning algorithm makes those feature vectors represent the text better. The contrastive learning algorithm provides the loss function of the feature extractor, which is trained continuously on the text training data set until the loss function converges, finally yielding the feature extraction model. In each round of training, the loss function is calculated and the parameters of the feature extractor are updated based on the extracted vectors and the loss; training proceeds for multiple rounds until the loss function converges, and the final parameters of the feature extractor give the feature extraction model.
The guiding principle of the contrastive learning algorithm is: by automatically constructing similar and dissimilar instances, a representation learning model is learned such that similar instances end up closer together in the projection space and dissimilar instances end up farther apart.
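As a sketch of this training loop, assuming the contrastive_sample_loss sketched above, a PyTorch feature extractor, and hypothetical augment1/augment2 helpers that produce two independent augmentations of a batch of texts:

```python
import torch

def train_feature_extractor(extractor, loader, augment1, augment2,
                            epochs=10, lr=1e-3):
    optimizer = torch.optim.Adam(extractor.parameters(), lr=lr)
    for _ in range(epochs):                      # iterate until the loss converges
        for batch in loader:                     # a batch of raw training texts
            z1 = extractor(augment1(batch))      # features of the first augmentation
            z2 = extractor(augment2(batch))      # features of the second augmentation
            loss = contrastive_sample_loss(z1, z2)
            optimizer.zero_grad()
            loss.backward()                      # update the extractor's parameters
            optimizer.step()
    return extractor                             # the trained feature extraction model
```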
S3, performing equalization processing on the text training data set by using an equalization algorithm to obtain an equalized data set;
specifically, it is difficult to ensure that the labels of the data in the training text set are balanced, and if a text classifier is trained on unbalanced data, the trained classifier performs poorly on low-frequency classes. Therefore, the text training data set is balanced by using an equalization algorithm, so that the samples of each category are balanced and the classifier classifies better overall. The equalization algorithm may be a synthetic minority oversampling algorithm, resampling, data synthesis or re-weighting, among others, finally producing the equalized data set.
Further, the performing equalization processing on the text training data set by using an equalization algorithm comprises:
acquiring each category in the text training data set and the corresponding amount of text data;
calculating the average of the amounts of text data corresponding to the categories to obtain an average data amount;
comparing the amount of text data of each category with the average data amount;
if the amount is smaller than the average data amount, taking the text data corresponding to that category as text data to be augmented, and calculating the difference between the amount of text data corresponding to that category and the average data amount to obtain the augmentation number;
and based on the augmentation number, augmenting the text data to be augmented by using a synthetic minority oversampling algorithm or a data augmentation tool.
Specifically, since the text training data set is labeled, the data in it can be divided into a plurality of categories according to the labels, and the amount of text data contained in each category is obtained; the average of the per-category amounts of text data is calculated to obtain the average data amount, and the text data to be augmented and the augmentation number are determined based on the average data amount; based on the augmentation number, the text data to be augmented is augmented up to the average data amount, achieving equalization of the text training data set.
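For illustration only, this per-category bookkeeping can be sketched as below (a minimal sketch; the function name is hypothetical, and `labels` is assumed to hold one category label per training text):

```python
from collections import Counter

def augmentation_plan(labels):
    """Return, per under-represented category, how many samples to synthesize."""
    counts = Counter(labels)
    average = sum(counts.values()) / len(counts)      # the average data amount
    # Categories below the average are augmented up to it; the difference is
    # the augmentation number for that category.
    return {cat: round(average - n) for cat, n in counts.items() if n < average}

# e.g. augmentation_plan(["a", "a", "a", "b"]) returns {"b": 1}
```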
The text data to be augmented is augmented by a synthetic minority oversampling algorithm (SMOTE), a nearest-neighbour-based technique in which the distance between data points in the feature space is measured by Euclidean distance. The oversampling percentage indicates how many synthetic samples to create and is always a multiple of 100; if the oversampling percentage is 100, a new sample is created for each instance, so the number of minority-class instances doubles.
The synthesis proceeds as follows: 1) for each sample of the minority class, its k nearest neighbours among all minority-class samples are found, judged by Euclidean distance; 2) one sample is randomly selected from the k neighbours, a random number between 0 and 1 is generated, the difference between the randomly selected neighbour and the sample is calculated, multiplied by the random number, and added to the sample; step 2) is repeated N times to synthesize N new samples.
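A minimal sketch of these two steps, assuming the minority-class texts have already been embedded as NumPy vectors (names and defaults are illustrative):

```python
import numpy as np

def smote(minority: np.ndarray, n_new: int, k: int = 5, seed: int = 0) -> np.ndarray:
    """minority: (m, d) feature vectors of one minority class; returns n_new synthetic rows."""
    rng = np.random.default_rng(seed)
    synthetic = []
    for _ in range(n_new):
        x = minority[rng.integers(len(minority))]
        # Step 1): k nearest neighbours of x, judged by Euclidean distance.
        dists = np.linalg.norm(minority - x, axis=1)
        neighbours = np.argsort(dists)[1:k + 1]            # skip x itself
        # Step 2): pick one neighbour, scale the difference by a random number
        # in [0, 1), and add it to x.
        x_nn = minority[rng.choice(neighbours)]
        synthetic.append(x + rng.random() * (x_nn - x))
    return np.stack(synthetic)
```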
Alternatively, the EDA data augmentation toolkit can be used. EDA is a simple data augmentation technique with four augmentation operations: synonym replacement, random insertion, random swap and random deletion. Synonym replacement randomly selects n non-stop words in the sentence and replaces each selected word with one of its randomly selected synonyms. Random insertion picks an arbitrary non-stop word in the sentence, randomly selects one of its synonyms, and inserts it at an arbitrary position in the sentence, repeated n times. Random swap randomly selects two words in the sentence and exchanges their positions, repeated n times. Random deletion removes each word in the sentence with probability p.
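A toy sketch of the four EDA operations follows, assuming a `synonyms` dictionary mapping a word to its synonym list is available; real EDA implementations also skip stop words, which is omitted here for brevity:

```python
import random

def synonym_replacement(words, synonyms, n=1):
    words = words[:]
    for i in random.sample(range(len(words)), min(n, len(words))):
        if synonyms.get(words[i]):
            words[i] = random.choice(synonyms[words[i]])   # replace with a synonym
    return words

def random_insertion(words, synonyms, n=1):
    words = words[:]
    for _ in range(n):
        pool = [s for w in words for s in synonyms.get(w, [])]
        if pool:                                           # insert a synonym anywhere
            words.insert(random.randrange(len(words) + 1), random.choice(pool))
    return words

def random_swap(words, n=1):
    words = words[:]
    for _ in range(n):
        if len(words) > 1:
            i, j = random.sample(range(len(words)), 2)
            words[i], words[j] = words[j], words[i]        # exchange two positions
    return words

def random_deletion(words, p=0.1):
    kept = [w for w in words if random.random() > p]       # drop each word with prob. p
    return kept if kept else [random.choice(words)]        # never return an empty sentence
```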
By augmenting the text data to be augmented with the synthetic minority oversampling algorithm, an equalized data set is finally obtained, which facilitates the training of subsequent models.
S4, extracting features of each text in the equalized data set by using the feature extraction model to obtain corresponding feature vectors;
features of each text in the equalized data set are extracted by using the trained feature extraction model, such as TextCNN or a Transformer encoder, to obtain the corresponding feature vectors.
Further, before the feature extraction is performed on each text in the equalized data set by using the feature extraction model, the method further comprises:
performing word segmentation on the text by using jieba word segmentation to obtain a plurality of corresponding words;
and vectorizing the words to obtain word vectors corresponding to the words.
The application directly uses the jieba word segmentation toolkit: by importing the jieba package, each input text can be segmented, realizing the word segmentation of the text.
For example, the sentence "the flow after investigation is loss assessment" is segmented by the accurate mode of jieba word segmentation into "investigation / after / flow / is / loss assessment". In the embodiment of the application, each word is separated individually.
Then, each word is mapped into a high-dimensional vector space by using Word2Vec to obtain the word vector corresponding to each word.
After word segmentation, each word is converted into a corresponding vector, which facilitates further processing.
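As a brief sketch of this segmentation-plus-vectorization step, assuming the jieba and gensim packages (the corpus and hyperparameters below are illustrative):

```python
import jieba
from gensim.models import Word2Vec

corpus = ["文本分类是自然语言处理中的常见任务",
          "样本不均衡会影响分类效果"]                     # illustrative texts
tokenized = [list(jieba.cut(text)) for text in corpus]    # accurate mode by default
# Map every word into a high-dimensional vector space.
w2v = Word2Vec(sentences=tokenized, vector_size=100, window=5, min_count=1)
word_vectors = [[w2v.wv[w] for w in sent] for sent in tokenized]
```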
Still further, the performing word segmentation on the text by using jieba word segmentation to obtain a plurality of corresponding words comprises:
scanning the text data based on a preset Trie and identifying a plurality of segmentation combinations of the words in the text;
constructing a directed acyclic graph based on all identified segmentation combinations, searching for the maximum-probability path on the directed acyclic graph by dynamic programming, determining the segmentation combination with the maximum probability, and segmenting the text based on that segmentation combination;
and segmenting unrecognized words by using a hidden Markov model.
Specifically, the Trie, also called a dictionary tree or prefix tree, is a common data structure used for fast string matching in a list of strings. The text is scanned based on a preset Trie: by scan-matching the text against the Trie, the plurality of segmentation combinations of the words in the text are identified. The multiple segmentation combinations form a directed acyclic graph in which each node is a segmented word.
Then dynamic programming is used to search the directed acyclic graph for the maximum-probability path. When the Trie is generated from the dictionary, the occurrence count of each word is converted into a frequency. For the given segmentation combinations, the occurrence frequency of each combination, that is, the probability of each node in the directed acyclic graph, is looked up, and the maximum-probability path is computed over the constructed graph; the main function that computes it is calc. The calc function is a bottom-up dynamic programming routine: starting from the last word of the sentence to be processed and traversing its words in reverse order, it computes the log-probability score of each segmentation combination of the sentence. The segmentation combination with the highest log-probability score is then saved and output, and the sentence to be processed is segmented based on that maximum-probability segmentation combination.
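For illustration, a compact sketch of such a reverse-order dynamic program follows; this is a simplified illustration in the spirit of jieba's calc, not its actual source, and `dag` and `word_logprob` are assumed inputs:

```python
import math

MIN_LOG = math.log(1e-12)     # log-score assigned to words unseen in the dictionary

def calc(sentence, dag, word_logprob):
    """dag[i] lists every end index j such that sentence[i:j+1] is a candidate word."""
    n = len(sentence)
    route = {n: (0.0, 0)}
    for i in range(n - 1, -1, -1):                # traverse in reverse order
        route[i] = max(
            (word_logprob.get(sentence[i:j + 1], MIN_LOG) + route[j + 1][0], j)
            for j in dag[i]
        )
    return route                                  # route[i] = (best score, best end index)

def best_segmentation(sentence, route):
    i, words = 0, []
    while i < len(sentence):
        j = route[i][1]                           # end of the best word starting at i
        words.append(sentence[i:j + 1])
        i = j + 1
    return words
```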
Because the dictionary is finite and cannot contain all words, words that do not appear in the dictionary are segmented with a hidden Markov model. The hidden Markov model labels Chinese characters with four states, BEMS: B is the begin position, E the end position, M the middle position, and S a character that forms a word on its own. jieba labels Chinese words with these four states; for example, "Beijing" (北京) can be labeled BE, i.e. 北/B 京/E, where 北 is the begin position and 京 is the end position, and the word is split accordingly. Segmenting the sentence to be processed in this way yields the segmentation combination closest to the real situation.
Further, the performing feature extraction on each text in the equalized data set by using the feature extraction model to obtain corresponding feature vectors comprises:
performing convolution on the word vectors with convolution kernels;
processing the convolution results with an activation function to obtain corresponding feature maps;
and performing pooling on the feature maps to obtain the feature vectors corresponding to the word vectors.
Specifically, when the TextCNN feature extractor is used, the word vectors are first convolved with convolution kernels. The width of a convolution kernel is the same as the width of the word vectors, so the kernel moves only in the height direction, which means the smallest unit of each slide of the convolution kernel is a word.
Before the convolution, a padding operation is also performed on the original word vectors so that the vectors have the same length.
After the convolution, the convolution results are passed through an activation function to obtain the feature maps, and the feature maps are pooled and concatenated into a new feature vector.
Extracting the feature vectors of the words through convolution, activation and pooling facilitates the subsequent training of the classifier, and the extracted feature vectors represent the text well.
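A minimal TextCNN-style extractor sketch in PyTorch follows (the hyperparameters are illustrative): convolve over the word vectors, apply an activation function to obtain the feature maps, max-pool, and concatenate into one feature vector:

```python
import torch
import torch.nn as nn

class TextCNNExtractor(nn.Module):
    def __init__(self, embed_dim=100, n_filters=64, kernel_heights=(2, 3, 4)):
        super().__init__()
        # Each kernel is as wide as a word vector, so it slides only in height.
        self.convs = nn.ModuleList(
            nn.Conv2d(1, n_filters, (h, embed_dim)) for h in kernel_heights
        )

    def forward(self, x):                        # x: (batch, seq_len, embed_dim)
        x = x.unsqueeze(1)                       # add a channel dimension
        maps = [torch.relu(conv(x)).squeeze(3) for conv in self.convs]  # feature maps
        pooled = [m.max(dim=2).values for m in maps]                    # max-pool over time
        return torch.cat(pooled, dim=1)          # (batch, n_filters * len(kernel_heights))
```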
S5, training the classifier based on the feature vectors to obtain a classification model.
The classifier is trained by using the feature vectors corresponding to the equalized data set to obtain the final classifier. The classifier in the embodiment of the application may be a Softmax classifier.
Further, the training the classifier based on the feature vectors to obtain a classification model comprises:
training the classifier by using a mini-batch gradient descent algorithm based on the feature vectors to obtain the classification model.
Mini-batch gradient descent is a compromise between batch gradient descent and stochastic gradient descent: each iteration updates the parameters using batch_size samples. Thanks to matrix operations, optimizing the neural network parameters on a batch is not much slower than on a single sample, while using a batch for each update greatly reduces the number of iterations needed to converge, and the converged result is closer to the effect of full gradient descent.
Training the classifier with a mini-batch gradient descent algorithm makes the training process faster while still producing a good classification model.
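For illustration, a short sketch of training a Softmax classifier with mini-batch gradient descent in PyTorch (the batch size, learning rate and epoch count are illustrative):

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

def train_classifier(features, labels, n_classes, batch_size=32, epochs=10):
    classifier = nn.Linear(features.size(1), n_classes)    # linear layer + softmax loss
    optimizer = torch.optim.SGD(classifier.parameters(), lr=0.1)
    loader = DataLoader(TensorDataset(features, labels),
                        batch_size=batch_size, shuffle=True)
    for _ in range(epochs):
        for x, y in loader:                    # one parameter update per mini-batch
            loss = nn.functional.cross_entropy(classifier(x), y)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return classifier                          # the classification model
```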
S6, acquiring text data to be processed, wherein the text data to be processed is processed by the feature extraction model to obtain a corresponding vector to be processed, and the vector to be processed is processed by the classification model to obtain the category corresponding to the text data to be processed.
Specifically, after the feature extraction model and the classifier are obtained, text data to be processed can be received directly from user input or retrieved directly from the database for processing. The text data to be processed is input into the feature extraction model to obtain the corresponding vector to be processed; the vector to be processed is then input into the final classifier, that is, the classification model, which outputs the category corresponding to the text data to be processed, namely the class to which the classifier assigns the highest probability. The cooperation of the trained feature extraction model and the classification model improves the accuracy of text classification.
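A sketch of this inference path, assuming the extractor and classifier sketched above and hypothetical tokenize/embed helpers:

```python
import torch

@torch.no_grad()
def classify(text, tokenize, embed, extractor, classifier):
    words = tokenize(text)                            # e.g. jieba word segmentation
    vecs = torch.stack([embed(w) for w in words])     # word vectors, shape (L, E)
    feature = extractor(vecs.unsqueeze(0))            # the vector to be processed
    probs = classifier(feature).softmax(dim=1)        # class probabilities
    return probs.argmax(dim=1).item()                 # the highest-probability category
```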
It is emphasized that, to further guarantee the privacy and security of the data, the text data to be processed and all category data obtained from it may also be stored in a node of a blockchain.
The blockchain is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks generated in association by cryptographic methods, each block containing a batch of network transaction information used to verify the validity (anti-counterfeiting) of its information and to generate the next block. A blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
The embodiment of the application can acquire and process the relevant data based on artificial intelligence technology. Artificial intelligence (AI) is the theory, method, technique and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain the best results.
Artificial intelligence infrastructure technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, mechatronics and the like. Artificial intelligence software technologies mainly include computer vision, robotics, biometric recognition, speech processing, natural language processing and machine learning/deep learning.
With the application, a feature extractor is trained by using a contrastive learning algorithm based on the text training data set to obtain a feature extraction model, and introducing the contrastive learning algorithm enables the features extracted by the feature extraction model to better represent the corresponding words; equalization processing is performed on the text training data set by using an equalization algorithm to obtain an equalized data set, which facilitates the subsequent training of the classification model, and feature extraction is performed on each text in the equalized data set by using the feature extraction model to obtain corresponding feature vectors; the classifier is trained based on the feature vectors to obtain a classification model; by training the feature extractor and the classifier separately, better models are obtained. Finally, text data to be processed is acquired and processed by the feature extraction model and the classification model to obtain the corresponding category. In this way, a feature extraction model and a classification model can be trained well on unbalanced data, and the data to be processed can be classified accurately by using them.
FIG. 2 shows a functional block diagram of the unbalanced text classification apparatus according to the present application.
The unbalanced text classification apparatus 100 of the present application may be installed in an electronic device. Depending on the implemented functions, the unbalanced text classification apparatus 100 may include an acquisition module 101, a feature extraction model training module 102, an equalization processing module 103, a feature extraction module 104, a classification model training module 105 and a classification module 106. A module of the application, which may also be referred to as a unit, is a series of computer program segments that are stored in the memory of the electronic device, can be executed by the processor of the electronic device, and perform a fixed function.
In the present embodiment, the functions concerning the respective modules/units are as follows:
An acquisition module 101 for acquiring a text training data set;
The acquisition module 101 obtains the text training data set by retrieving it from a database or by directly receiving a text training data set input by a user; the text training data set includes a large amount of text data.
Further, the unbalanced text classification apparatus 100 includes an augmentation module, and the feature extraction model training module 102 includes an augmentation training submodule;
the augmentation module is used for performing data augmentation on each piece of text data to obtain corresponding augmented data, and storing the text data and the augmented data in the text training data set;
the augmentation training submodule is configured to combine a contrastive learning algorithm into the feature extractor based on the augmented data, so that first feature vectors corresponding to homologous augmented data obtained by the feature extractor are drawn together, and first feature vectors corresponding to augmented data from different sources are pushed apart.
Specifically, the augmentation training submodule improves the feature grabbing capability and stability of the feature extractor by introducing a contrastive learning algorithm, specifically a contrastive sample loss, when training the feature extractor. After the contrastive learning algorithm is introduced, data augmentation needs to be performed on the text data: for example, parts of the original text are replaced with synonyms, randomly inserted, randomly swapped or randomly deleted to obtain independent augmented data. Introducing the contrastive learning algorithm makes the feature vectors produced by the feature extractor for homologous texts gather together while those for texts from different sources move apart, so that the resulting feature extraction model captures the feature information of each text to the greatest extent.
Through the cooperation of the augmentation module and the augmentation training submodule, the feature extraction model obtained by training can capture the feature information of each text to the greatest extent.
Still further, the augmentation training submodule includes a calculation unit;
the calculation unit is used for calculating the loss function of the feature extractor by using the noise contrastive estimation function and the inner product function in the contrastive learning algorithm, and performing feature extraction on the text data by using the feature extraction model combined with the contrastive learning algorithm, so that homologous first feature vectors gather in the embedding space and first feature vectors from different sources stay far apart in the embedding space. Whether data are homologous is judged by whether they belong to the same text data: multiple different augmented data derived from the same text data all belong to the same source.
By combining the noise contrastive estimation function $L_{NCE}$ with the inner-product similarity $S_{SimCLR}$, the contrastive sample loss $L_{con}$ is calculated, that is,

$$L_{con} = -\sum_{i=1}^{N} \log \frac{\exp\left(s\left(v_i^{(1)}, v_i^{(2)}\right)\right)}{\sum_{v \in \{v^{(2)}\} \cup \{v_{-i}^{(1)}\}} \exp\left(s\left(v_i^{(1)}, v\right)\right)}$$

with the same notation as in the method embodiment above: $v_i^{(1)}$ and $v_i^{(2)}$ are the $i$-th samples of the first and second augmentations, $v^{(2)}$ is the second-augmentation data set, $v_{-i}^{(1)}$ is the first-augmentation data set without the $i$-th sample, and $s$ is the inner-product similarity $S_{SimCLR}$; $i, j = 1, 2, \ldots, N$.
The calculation unit uses the combination of the noise contrastive estimation function and the inner product function as the loss function of the feature extractor, so that the feature extraction model obtained by the final training extracts features better: the first feature vectors corresponding to homologous augmented data gather together, and the first feature vectors corresponding to augmented data from different sources move apart.
The feature extraction model training module 102 is configured to train the feature extractor by using a contrastive learning algorithm based on the text training data set to obtain a feature extraction model;
specifically, the feature extraction model training module 102 performs feature extraction on each text in the text training data set by a feature extractor, such as TextCNN or the encoder in a Transformer, to obtain the feature vector corresponding to the text. The feature extractor extracts deeper feature vectors from the text, and the contrastive learning algorithm makes those feature vectors represent the text better. The contrastive learning algorithm provides the loss function of the feature extractor, which is trained continuously on the text training data set until the loss function converges, finally yielding the feature extraction model. In each round of training, the loss function is calculated and the parameters of the feature extractor are updated based on the extracted vectors and the loss; training proceeds for multiple rounds until the loss function converges, and the final parameters of the feature extractor give the feature extraction model.
The equalization processing module 103 is configured to perform equalization processing on the text training data set by using an equalization algorithm to obtain an equalized data set;
specifically, the equalization processing module 103 performs equalization processing on the text training data set by using an equalization algorithm, so that the samples of each category are balanced and the classifier classifies better overall. The equalization algorithm may be a synthetic minority oversampling algorithm, resampling, data synthesis or re-weighting, among others, finally producing the equalized data set.
Further, the equalization processing module 103 includes a data amount acquisition submodule, an average calculation submodule, a data amount comparison submodule, a difference calculation submodule and a data augmentation submodule;
the data amount acquisition submodule is used for acquiring each category in the text training data set and the corresponding amount of text data;
the average calculation submodule is used for calculating the average of the amounts of text data corresponding to the categories to obtain the average data amount;
the data amount comparison submodule is used for comparing the amount of text data of each category with the average data amount;
the difference calculation submodule is used for taking the text data corresponding to a category as text data to be augmented if its amount is smaller than the average data amount, and calculating the difference between the amount of text data corresponding to that category and the average data amount to obtain the augmentation number;
and the data augmentation submodule is used for augmenting the text data to be augmented, based on the augmentation number, by using a synthetic minority oversampling algorithm or a data augmentation tool.
Specifically, since the text training data set is labeled, the data in it can be divided into a plurality of categories according to the labels, and the data amount acquisition submodule obtains the amount of text data contained in each category; the average calculation submodule calculates the average of the per-category amounts of text data to obtain the average data amount, and the data amount comparison submodule and the difference calculation submodule determine the text data to be augmented and the augmentation number based on the average data amount; the data augmentation submodule augments the text data to be augmented, based on the augmentation number, up to the average data amount, achieving equalization of the text training data set.
The data augmentation submodule augments the text data to be augmented by the synthetic minority oversampling algorithm, a nearest-neighbour-based technique in which distances between data points in the feature space are measured by Euclidean distance.
Through the cooperation of the data amount acquisition submodule, the average calculation submodule, the data amount comparison submodule, the difference calculation submodule and the data augmentation submodule, an equalized data set is finally obtained, which facilitates the training of subsequent models.
The feature extraction module 104 is configured to perform feature extraction on each text in the equalized data set by using the feature extraction model to obtain corresponding feature vectors;
the feature extraction module 104 performs feature extraction on each text in the equalized data set by using the trained feature extraction model, such as TextCNN or a Transformer encoder, to obtain the corresponding feature vectors.
Further, the unbalanced text classification apparatus 100 further includes a word segmentation module and a vectorization module;
the word segmentation module is used for performing word segmentation on the text by using jieba word segmentation to obtain a plurality of corresponding words;
the vectorization module is used for vectorizing the words to obtain the word vectors corresponding to the words.
Specifically, the word segmentation module directly uses the jieba word segmentation toolkit: by importing the jieba package, each input text can be segmented, realizing the word segmentation of the text.
The vectorization module maps each word into a high-dimensional vector space by using Word2Vec to obtain the word vector corresponding to each word.
Through the cooperation of the word segmentation module and the vectorization module, each word is converted into a corresponding vector after word segmentation, which facilitates further processing.
Further, the feature extraction module 104 includes a convolution submodule, an activation submodule and a pooling submodule;
the convolution submodule is used for performing convolution on the word vectors with convolution kernels;
the activation submodule is used for processing the convolution results with an activation function to obtain the corresponding feature maps;
and the pooling submodule is used for performing pooling on the feature maps to obtain the feature vectors corresponding to the word vectors.
Specifically, when the TextCNN feature extractor is used, the convolution submodule convolves the word vectors with convolution kernels; the width of a convolution kernel is the same as the width of the word vectors, so the kernel moves only in the height direction, which means the smallest unit of each slide of the convolution kernel is a word.
Before the convolution, a padding operation is also performed on the original word vectors so that the vectors have the same length.
After the convolution, the activation submodule passes the convolution results through an activation function to obtain the feature maps, and the pooling submodule pools and concatenates the feature maps into a new feature vector.
Through the cooperation of the convolution submodule, the activation submodule and the pooling submodule, the feature vectors of the words are extracted by convolution, activation and pooling, which facilitates the subsequent training of the classifier, and the extracted feature vectors represent the text well.
The classification model training module 105 is configured to train the classifier based on the feature vectors to obtain a classification model;
specifically, the classification model training module 105 trains the classifier by using the feature vectors corresponding to the equalized data set to obtain the final classifier. The classifier in the embodiment of the application may be a Softmax classifier.
Further, the classification model training module includes a mini-batch gradient descent training submodule;
the mini-batch gradient descent training submodule is used for training the classifier by using a mini-batch gradient descent algorithm based on the feature vectors to obtain the classification model.
The mini-batch gradient descent training submodule trains the classifier with a mini-batch gradient descent algorithm, making the training process faster while still producing a good classification model.
The classification module 106 is configured to acquire text data to be processed, where the text data to be processed is processed by the feature extraction model to obtain a corresponding vector to be processed, and the vector to be processed is processed by the classification model to obtain the category corresponding to the text data to be processed.
Specifically, after the feature extraction model and the classifier are obtained, the classification module 106 may directly receive text data to be processed input by a user or obtain it directly from the database. The text data to be processed is input into the feature extraction model to obtain the corresponding vector to be processed; the vector to be processed is then input into the final classifier, that is, the classification model, which outputs the category corresponding to the text data to be processed, namely the class to which the classifier assigns the highest probability. The cooperation of the trained feature extraction model and the classification model improves the accuracy of text classification.
With this arrangement, the unbalanced text classification device 100 improves the accuracy of classifying the data to be processed through the cooperation of the acquisition module 101, the feature extraction model training module 102, the equalization processing module 103, the feature extraction module 104, the classification model training module 105 and the classification module 106.
To solve the above technical problems, an embodiment of the present application further provides a computer device. Referring specifically to FIG. 3, FIG. 3 is a basic structural block diagram of the computer device according to this embodiment.
The computer device 4 comprises a memory 41, a processor 42 and a network interface 43 communicatively connected to each other via a system bus. It should be noted that only a computer device 4 having components 41-43 is shown in the figure, but it should be understood that not all of the illustrated components need to be implemented, and more or fewer components may be implemented instead. Those skilled in the art will appreciate that the computer device here is a device capable of automatically performing numerical calculation and/or information processing according to preset or stored instructions; its hardware includes, but is not limited to, a microprocessor, an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), a digital signal processor (DSP), an embedded device, and the like.
The computer device may be a desktop computer, a notebook computer, a palmtop computer, a cloud server, or another computing device. The computer device can perform human-machine interaction with a user through a keyboard, a mouse, a remote controller, a touch pad, a voice control device, or the like.
The memory 41 includes at least one type of readable storage medium, including flash memory, hard disk, multimedia card, card-type memory (e.g., SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disk, optical disk, and the like. In some embodiments, the memory 41 may be an internal storage unit of the computer device 4, such as a hard disk or memory of the computer device 4. In other embodiments, the memory 41 may also be an external storage device of the computer device 4, such as a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, or a flash card provided on the computer device 4. Of course, the memory 41 may also comprise both an internal storage unit and an external storage device of the computer device 4. In this embodiment, the memory 41 is typically used to store the operating system and various application software installed on the computer device 4, such as computer readable instructions of the unbalanced text classification method. Further, the memory 41 may be used to temporarily store various types of data that have been output or are to be output.
In some embodiments, the processor 42 may be a central processing unit (CPU), a controller, a microcontroller, a microprocessor, or another data processing chip. The processor 42 is typically used to control the overall operation of the computer device 4. In this embodiment, the processor 42 is configured to execute the computer readable instructions stored in the memory 41 and to process data, for example to execute the computer readable instructions of the unbalanced text classification method.
The network interface 43 may comprise a wireless network interface or a wired network interface, and the network interface 43 is typically used to establish a communication connection between the computer device 4 and other electronic devices.
When the processor executes the computer readable instructions stored in the memory, the steps of the unbalanced text classification method of the above embodiment are implemented: a text training data set is acquired; a feature extractor is trained by using a contrastive learning algorithm based on the text training data set to obtain a feature extraction model, and introducing the contrastive learning algorithm makes the features extracted by the feature extraction model better characterize the corresponding words; the text training data set is equalized by using an equalization algorithm to obtain an equalized data set, which facilitates the subsequent training of the classification model; the feature extraction model then extracts features from each text in the equalized data set to obtain corresponding feature vectors; a classifier is trained based on the feature vectors to obtain a classification model, and training the feature extractor and the classifier separately yields a better feature extraction model and a better classification model; finally, text data to be processed is acquired, processed by the feature extraction model to obtain a corresponding vector to be processed, and the vector to be processed is processed by the classification model to obtain the category corresponding to the text data to be processed. In this way, a feature extraction model and a classification model can be trained from unbalanced data and used to classify the data to be processed accurately.
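As a non-authoritative illustration of the contrastive learning step (claim 2 below computes the loss from a noise contrastive estimation function and an inner product function), an InfoNCE-style loss on pairs of homologous augmented texts could be sketched as follows; the temperature value and function name are assumptions:

```python
# InfoNCE-style contrastive loss sketch: inner products score similarity,
# homologous augmented pairs are pulled together, all others pushed apart.
import torch
import torch.nn.functional as F

def info_nce_loss(z1, z2, temperature=0.5):
    # z1, z2: (batch, dim) first feature vectors of two augmentations per text
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    z = torch.cat([z1, z2], dim=0)            # (2 * batch, dim)
    sim = z @ z.t() / temperature             # inner-product similarity scores
    sim.fill_diagonal_(float('-inf'))         # a sample is not its own positive
    batch = z1.size(0)
    # the positive for row i is its homologous counterpart at i +/- batch
    targets = torch.cat([torch.arange(batch) + batch, torch.arange(batch)])
    return F.cross_entropy(sim, targets)
```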
The present application also provides another embodiment, namely a computer readable storage medium storing computer readable instructions that are executable by at least one processor to cause the at least one processor to perform the steps of the unbalanced text classification method described above: a text training data set is acquired; a feature extractor is trained by using a contrastive learning algorithm based on the text training data set to obtain a feature extraction model, and introducing the contrastive learning algorithm makes the features extracted by the feature extraction model better characterize the corresponding words; the text training data set is equalized by using an equalization algorithm to obtain an equalized data set, which facilitates the subsequent training of the classification model; the feature extraction model extracts features from each text in the equalized data set to obtain corresponding feature vectors; a classifier is trained based on the feature vectors to obtain a classification model; finally, text data to be processed is acquired, processed by the feature extraction model to obtain a corresponding vector to be processed, and the vector to be processed is processed by the classification model to obtain the category corresponding to the text data to be processed. Training the feature extractor and the classifier separately thus makes it possible to obtain a good feature extraction model and a good classification model from unbalanced data and to classify the data to be processed accurately.
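The equalization step summarized above (compare each category's text data amount with the average and augment the deficit, per claim 1 below) might be sketched as follows; the augment callback stands in for the synthetic minority oversampling algorithm or a data augmentation tool, and all names are illustrative assumptions:

```python
# Equalization sketch: augment each minority category up to the mean size.
import random
from collections import defaultdict

def equalize(dataset, augment):
    # dataset: list of (text, label) pairs; augment(text) -> augmented text
    by_class = defaultdict(list)
    for text, label in dataset:
        by_class[label].append(text)
    mean_count = sum(len(v) for v in by_class.values()) / len(by_class)
    balanced = list(dataset)
    for label, texts in by_class.items():
        deficit = int(mean_count) - len(texts)   # the augmentation quantity
        for _ in range(max(0, deficit)):
            balanced.append((augment(random.choice(texts)), label))
    return balanced
```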
From the above description of the embodiments, it will be clear to those skilled in the art that the method of the above embodiments may be implemented by means of software plus a necessary general-purpose hardware platform, or by means of hardware, although in many cases the former is the preferred implementation. Based on this understanding, the technical solution of the present application, in essence or in the part contributing to the prior art, may be embodied in the form of a software product stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) and comprising instructions for causing a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, a network device, or the like) to perform the method according to the embodiments of the present application.
It is apparent that the above-described embodiments are only some, rather than all, of the embodiments of the present application, and the preferred embodiments shown in the drawings do not limit the scope of the claims. This application may be embodied in many different forms; these embodiments are provided so that the disclosure of the present application will be thorough and complete. Although the application has been described in detail with reference to the foregoing embodiments, those skilled in the art may still modify the technical solutions described in the foregoing embodiments or substitute equivalents for some of their elements. All equivalent structures made using the contents of the specification and drawings of the present application, whether applied directly or indirectly in other related technical fields, likewise fall within the scope of protection of the present application.

Claims (8)

1. A method of unbalanced text classification, the method comprising:
acquiring a text training data set, wherein before the acquiring of the text training data set, the method further comprises: performing data augmentation on each piece of text data to obtain corresponding augmented data, and storing the text data and the augmented data into the text training data set;
training a feature extractor by using a contrastive learning algorithm based on the text training data set to obtain a feature extraction model, wherein the training of the feature extractor by using the contrastive learning algorithm comprises: based on the augmented data, incorporating the contrastive learning algorithm into the feature extractor so that first feature vectors, obtained by the feature extractor, that correspond to homologous augmented data are gathered together, while first feature vectors corresponding to non-homologous augmented data are kept apart from each other;
performing equalization processing on the text training data set by using an equalization algorithm to obtain an equalized data set, wherein the performing of the equalization processing on the text training data set by using the equalization algorithm comprises: acquiring each category in the text training data set and the corresponding text data amount; calculating an average value of the text data amounts corresponding to the categories to obtain an average data amount; comparing the text data amount of each category with the average data amount; if the text data amount of a category is smaller than the average data amount, taking the text data corresponding to the category as text data to be augmented, and calculating the difference between the text data amount corresponding to the category and the average data amount to obtain an augmentation quantity; and based on the augmentation quantity, adopting a synthetic minority oversampling algorithm or a data augmentation tool to augment the text data to be augmented;
extracting features of each text in the equalized data set by using the feature extraction model to obtain a corresponding feature vector;
training a classifier based on the feature vector to obtain a classification model;
obtaining text data to be processed, wherein the text data to be processed is processed by the feature extraction model to obtain corresponding vectors to be processed, and the vectors to be processed are processed by the classification model to obtain categories corresponding to the text data to be processed.
2. The unbalanced text classification method according to claim 1, wherein the incorporating of the contrastive learning algorithm into the feature extractor so that the first feature vectors corresponding to the homologous augmented data obtained by the feature extractor are gathered together while the first feature vectors corresponding to the non-homologous augmented data are kept apart comprises:
calculating a loss function of the feature extractor by using a noise contrastive estimation function and an inner product function in the contrastive learning algorithm, so that the homologous first feature vectors are gathered in an embedding space and the non-homologous first feature vectors are kept apart in the embedding space.
3. The unbalanced text classification method of claim 1, further comprising, prior to the extracting of the features of each text in the equalized data set using the feature extraction model:
performing word segmentation on the text by using the jieba word segmentation tool to obtain a plurality of corresponding words;
performing vectorization on the words to obtain word vectors corresponding to the words.
4. The method of claim 3, wherein the extracting of the features of each text in the equalized data set by using the feature extraction model to obtain a corresponding feature vector comprises:
convolving the word vectors with a convolution kernel to obtain a convolution output;
processing the convolution output by adopting an activation function to obtain a corresponding feature map; and
performing pooling processing on the feature map to obtain the feature vector corresponding to the word vectors.
5. The unbalanced text classification method of claim 1, wherein the training of a classifier based on the feature vector to obtain a classification model comprises:
training the classifier by using a mini-batch gradient descent algorithm based on the feature vector to obtain the classification model.
6. An unbalanced text classification apparatus for implementing the unbalanced text classification method according to any one of claims 1 to 5, the unbalanced text classification apparatus comprising:
the acquisition module is used for acquiring a text training data set;
the feature extraction model training module is used for training a feature extractor by utilizing a contrastive learning algorithm based on the text training data set to obtain a feature extraction model;
the equalization processing module is used for performing equalization processing on the text training data set by using an equalization algorithm to obtain an equalized data set;
the feature extraction module is used for performing feature extraction on each text in the equalized data set by using the feature extraction model to obtain a corresponding feature vector;
The classification model training module is used for training the classifier based on the feature vector to obtain a classification model;
The classification module is used for obtaining text data to be processed, the text data to be processed is processed by the feature extraction model to obtain corresponding vectors to be processed, and the vectors to be processed are processed by the classification model to obtain categories corresponding to the text data to be processed.
7. A computer device, the computer device comprising:
At least one processor; and
A memory communicatively coupled to the at least one processor; wherein,
The memory stores computer readable instructions that when executed by the processor implement the unbalanced text classification method of any one of claims 1 to 5.
8. A computer readable storage medium having stored thereon computer readable instructions which when executed by a processor implement the unbalanced text classification method of any one of claims 1 to 5.
CN202111128220.9A 2021-09-26 2021-09-26 Unbalanced text classification method, device, equipment and storage medium Active CN113869398B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111128220.9A CN113869398B (en) 2021-09-26 2021-09-26 Unbalanced text classification method, device, equipment and storage medium


Publications (2)

Publication Number Publication Date
CN113869398A (en) 2021-12-31
CN113869398B (en) 2024-06-21

Family

ID=78994530

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111128220.9A Active CN113869398B (en) 2021-09-26 2021-09-26 Unbalanced text classification method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113869398B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114637847A * 2022-03-15 2022-06-17 Ping An Technology (Shenzhen) Co., Ltd. Model training method, text classification method and device, equipment and medium
CN117391585A * 2023-11-01 2024-01-12 Shenzhen Zhengye Jiukun Information Technology Co., Ltd. Warehouse information management method and system of industrial Internet

Citations (1)

Publication number Priority date Publication date Assignee Title
CN108628971A * 2018-04-24 2018-10-09 Shenzhen Qianhai WeBank Co., Ltd. Text classification method, text classifier and storage medium for imbalanced data sets

Family Cites Families (7)

Publication number Priority date Publication date Assignee Title
CN110019792A * 2017-10-30 2019-07-16 Alibaba Group Holding Ltd. Text classification method and device and classifier model training method
CN110097130B * 2019-05-07 2022-12-13 Shenzhen Tencent Computer Systems Co., Ltd. Training method, device and equipment for classification task model and storage medium
CN110717039B * 2019-09-17 2023-10-13 Ping An Technology (Shenzhen) Co., Ltd. Text classification method and apparatus, electronic device, and computer-readable storage medium
CN111275129A * 2020-02-17 2020-06-12 Ping An Technology (Shenzhen) Co., Ltd. Method and system for selecting image data augmentation strategy
CN111786951B * 2020-05-28 2022-08-26 Dongfanghong Satellite Mobile Communication Co., Ltd. Traffic data feature extraction method, malicious traffic identification method and network system
CN112528029A * 2020-12-29 2021-03-19 Ping An Puhui Enterprise Management Co., Ltd. Text classification model processing method and device, computer equipment and storage medium
CN113837942A * 2021-09-26 2021-12-24 Ping An Technology (Shenzhen) Co., Ltd. Super-resolution image generation method, device, equipment and storage medium based on SRGAN


Non-Patent Citations (1)

Title
Ting Chen et al., "A Simple Framework for Contrastive Learning of Visual Representations", Proceedings of the 37th International Conference on Machine Learning, 2020, pp. 1-11 (cited by examiner).



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code; Ref country code: HK; Ref legal event code: DE; Ref document number: 40062788
GR01 Patent grant