CN113869398A - Unbalanced text classification method, device, equipment and storage medium - Google Patents

Unbalanced text classification method, device, equipment and storage medium Download PDF

Info

Publication number
CN113869398A
CN113869398A (application CN202111128220.9A)
Authority
CN
China
Prior art keywords
text
data
training
feature
processed
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111128220.9A
Other languages
Chinese (zh)
Inventor
司世景
王健宗
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202111128220.9A priority Critical patent/CN113869398A/en
Publication of CN113869398A publication Critical patent/CN113869398A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/90 Details of database functions independent of the retrieved data types
    • G06F 16/906 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/237 Lexical tools
    • G06F 40/242 Dictionaries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G06F 40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biophysics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application relates to the technical field of artificial intelligence and discloses an unbalanced text classification method, apparatus, device and storage medium. The method comprises: obtaining a text training data set; training a feature extractor with a contrastive learning algorithm based on the text training data set to obtain a feature extraction model; equalizing the text training data set with an equalization algorithm to obtain an equalized data set; extracting features from each text in the equalized data set with the feature extraction model to obtain feature vectors; training a classifier based on the feature vectors to obtain a classification model; and obtaining text data to be processed, which is processed by the feature extraction model and the classification model to obtain the corresponding category. The application also relates to blockchain technology: the text data to be processed and the corresponding category data are stored in a blockchain. With the method and apparatus, the feature extraction model and the classification model trained on an unbalanced training set achieve better text classification accuracy.

Description

Unbalanced text classification method, device, equipment and storage medium
Technical Field
The present application relates to the field of artificial intelligence, and in particular, to an unbalanced text classification method, apparatus, device, and storage medium.
Background
Text classification is currently one of the most common and important tasks in natural language processing. The balance of samples across classes has a great influence on the final classification result. In practice, data imbalance is very common: the data of different classes are, in most cases, not ideally balanced but unbalanced. If class-imbalanced samples are used directly for learning and classification, the model learns the high-frequency classes well but learns the low-frequency classes poorly and generalizes poorly on them. In the prior art, unbalanced samples are usually handled by resampling, data synthesis or re-weighting; these methods solve part of the data imbalance problem, but their performance gains are not significant. Therefore, how to train, on an unbalanced training set, a classifier that improves the accuracy of text classification has become a problem to be solved urgently.
Disclosure of Invention
The application provides an unbalanced text classification method, device, equipment and storage medium, which are used for solving the problem that classification effect of a classification model obtained based on unbalanced training set training in the prior art is poor.
In order to solve the above problem, the present application provides an unbalanced text classification method, including:
acquiring a text training data set;
training a feature extractor by using a contrastive learning algorithm based on the text training data set to obtain a feature extraction model;
carrying out equalization processing on the text training data set by using an equalization algorithm to obtain an equalization data set;
extracting the features of each text in the balanced data set by using the feature extraction model to obtain corresponding feature vectors;
training a classifier based on the feature vectors to obtain a classification model;
and acquiring text data to be processed, wherein the text data to be processed is processed by the feature extraction model to obtain a corresponding vector to be processed, and the vector to be processed is processed by the classification model to obtain a category corresponding to the text data to be processed.
Further, before obtaining the text training data set, the method further includes:
performing data augmentation on each text data to obtain corresponding augmented data, and storing the text data and the augmented data into the text training data set;
the training of the feature extractor by using a contrastive learning algorithm comprises:
incorporating a contrastive learning algorithm into the feature extractor based on the augmented data, so that the first feature vectors obtained by the feature extractor for augmented data of the same source are pulled together, and the first feature vectors for augmented data of different sources are pushed apart.
Further, the incorporating of a contrastive learning algorithm into the feature extractor so that the first feature vectors corresponding to augmented data of the same source are pulled together and the first feature vectors corresponding to augmented data of different sources are pushed apart comprises:
calculating a loss function of the feature extractor using a noise contrastive estimation function and an inner product function of the contrastive learning algorithm, so that homologous first feature vectors are gathered in the embedding space and heterologous first feature vectors are kept far apart in the embedding space.
Further, the equalizing the text training data set by using an equalization algorithm includes:
acquiring each category in the text training data set and the corresponding text data volume;
carrying out average calculation on the text data volume corresponding to each category to obtain an average data volume;
comparing the text data amount of each category with the average data amount;
if the text data amount is smaller than the average data amount, the text data corresponding to the category is used as the text data to be augmented, and the difference value between the text data amount corresponding to the category and the average data amount is calculated to obtain the augmentation quantity;
and based on the augmentation quantity, augmenting the text data to be augmented using a synthetic minority oversampling technique or a data augmentation tool.
Further, before the performing feature extraction on each text in the balanced dataset by using the feature extraction model, the method further includes:
performing word segmentation on the text using Jieba word segmentation to obtain a plurality of corresponding words;
and vectorizing the words to obtain word vectors corresponding to the words.
Further, the extracting features of the texts in the equalized data set by using the feature extraction model to obtain corresponding feature vectors includes:
convolving the word vectors with convolution kernels;
processing the convolution outputs with an activation function to obtain corresponding feature maps;
and pooling the feature maps to obtain the feature vectors corresponding to the word vectors.
Further, the training the classifier based on the feature vector to obtain a classification model includes:
and training the classifier by using a mini-batch gradient descent algorithm based on the feature vectors to obtain the classification model.
In order to solve the above problem, the present application also provides an unbalanced text classification apparatus, the apparatus including:
the acquisition module is used for acquiring a text training data set;
the feature extraction model training module is used for training a feature extractor by using a contrastive learning algorithm based on the text training data set to obtain a feature extraction model;
the equalization processing module is used for carrying out equalization processing on the text training data set by using an equalization algorithm to obtain an equalization data set;
the feature extraction module is used for extracting features of each text in the balanced data set by using the feature extraction model to obtain corresponding feature vectors;
the classification model training module is used for training a classifier based on the feature vector to obtain a classification model;
and the classification module is used for acquiring text data to be processed, the text data to be processed is processed by the feature extraction model to obtain a corresponding vector to be processed, and the vector to be processed is processed by the classification model to obtain a category corresponding to the text data to be processed.
In order to solve the above problem, the present application also provides a computer device, including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the unbalanced text classification method as described above.
To solve the above problem, the present application also provides a non-volatile computer-readable storage medium having computer-readable instructions stored thereon, which when executed by a processor implement the unbalanced text classification method as described above.
Compared with the prior art, the unbalanced text classification method, the device, the equipment and the storage medium provided by the embodiment of the application have the following beneficial effects:
The method trains a feature extractor with a contrastive learning algorithm based on a text training data set to obtain a feature extraction model; by introducing the contrastive learning algorithm, the features extracted by the feature extraction model better represent the corresponding words. The text training data set is equalized with an equalization algorithm to obtain an equalized data set, which facilitates the subsequent training of the classification model; the feature extraction model then extracts features from each text in the equalized data set to obtain the corresponding feature vectors, and the classifier is trained on these feature vectors to obtain a classification model. Training the feature extractor and the classifier separately yields a better feature extraction model and a better classification model. Finally, text data to be processed are obtained; the feature extraction model processes them into the corresponding vector to be processed, and the classification model processes that vector into the category corresponding to the text data to be processed. In this way, a feature extraction model and a classification model can be trained on unbalanced data, and together they classify the data to be processed accurately.
Drawings
In order to more clearly illustrate the solution of the present application, the drawings required for describing the embodiments of the present application are briefly introduced below. It is obvious that the drawings in the following description show only some embodiments of the present application, and that those skilled in the art can obtain other drawings from them without inventive effort.
Fig. 1 is a schematic flowchart of an unbalanced text classification method according to an embodiment of the present application;
fig. 2 is a schematic block diagram of an unbalanced text classification apparatus according to an embodiment of the present application;
fig. 3 is a schematic structural diagram of a computer device according to an embodiment of the present application.
Detailed Description
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs; the terminology used in the description of the application herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application; the terms "including" and "having," and any variations thereof, in the description and claims of this application and the description of the above figures are intended to cover non-exclusive inclusions. The terms "first," "second," and the like in the description and claims of this application or in the above-described drawings are used for distinguishing between different objects and not for describing a particular order.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the application. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. One skilled in the art will explicitly or implicitly appreciate that the embodiments described herein can be combined with other embodiments.
The application provides an unbalanced text classification method. Fig. 1 is a schematic flow chart of an unbalanced text classification method according to an embodiment of the present application.
In this embodiment, the unbalanced text classification method includes:
s1, acquiring a text training data set;
Specifically, the text training data set is obtained from a database, or a text training data set input by the user is received directly. The text training data set comprises a large amount of text data, and the text data in it are labeled.
Further, the acquiring the text training data set includes:
sending a call request to a database, wherein the call request carries a signature verification token;
and receiving a signature verification result returned by the database, and retrieving the text training data set from the database when the verification passes.
The database is encrypted, and data extracted from it must pass signature verification, which ensures the security of the data.
Further, before obtaining the text training data set, the method further includes:
performing data augmentation on each text data to obtain corresponding augmented data, and storing the text data and the augmented data into the text training data set;
the training of the feature extractor by using a contrastive learning algorithm comprises:
incorporating a contrastive learning algorithm into the feature extractor based on the augmented data, so that the first feature vectors obtained by the feature extractor for augmented data of the same source are pulled together, and the first feature vectors for augmented data of different sources are pushed apart.
Specifically, when the feature extractor is trained, its feature-capturing ability and stability are improved by introducing a contrastive learning algorithm, specifically a contrastive sample loss. After the contrastive learning algorithm is introduced, the text data need to be augmented, for example by synonym replacement, random insertion, random swapping or random deletion of parts of the original text, to obtain independent augmented data. Under the contrastive learning algorithm, the vectors produced by the feature extractor for homologous texts are pulled together and those for heterologous texts are pushed apart, so that the resulting feature extraction model can capture the feature information of each text to the maximum extent.
Meanwhile, replacing parts of a sentence with synonyms avoids, to a certain extent, the model overfitting caused by direct oversampling, increases the model's familiarity with the unbalanced texts, provides a good initial value for subsequent model learning, and thus optimizes the model.
By introducing a comparison learning algorithm, the feature extraction model obtained by training can capture feature information of each text to the maximum extent.
Still further, the incorporating of a contrastive learning algorithm into the feature extractor so that the first feature vectors corresponding to augmented data of the same source are pulled together and the first feature vectors corresponding to augmented data of different sources are pushed apart comprises:
calculating a loss function of the feature extractor using a noise contrastive estimation function and an inner product function of the contrastive learning algorithm, and performing feature extraction on the text data with the feature extraction model combined with the contrastive learning algorithm, so that homologous first feature vectors are gathered in the embedding space and heterologous first feature vectors are kept far apart in the embedding space. Whether data are homologous is judged by whether they belong to the same text data: several different augmented data obtained by augmenting the same text data are homologous.
The contrastive sample loss $L_{con}$ is calculated by combining the noise contrastive estimation function $L_{NCE}$ with the inner product function $S_{SimCLR}$, that is,

$$L_{con} = L_{NCE} = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{\exp\left(s\left(v_i^{(1)}, v_i^{(2)}\right)\right)}{\sum_{j=1}^{N}\exp\left(s\left(v_i^{(1)}, v_j^{(2)}\right)\right)}$$

$$s\left(v_i^{(1)}, v_j^{(2)}\right) = S_{SimCLR}\left(v_i^{(1)}, v_j^{(2)}\right) = \left(v_i^{(1)}\right)^{\top} v_j^{(2)}$$

where $v_i^{(1)}$ is the $i$-th sample under the first data augmentation, $v_j^{(2)}$ is the $j$-th sample under the second data augmentation, $s(\cdot,\cdot)$ is the similarity measure, i.e. the inner product (identical to the SimCLR similarity $S_{SimCLR}$), $\exp(\cdot)$ is the exponential function with base $e$, and $i, j = 1, 2, \ldots, N$. The numerator scores the homologous pair $(v_i^{(1)}, v_i^{(2)})$, while the denominator scores $v_i^{(1)}$ against all samples of the second augmentation, so minimizing $L_{con}$ pulls homologous pairs together and pushes heterologous pairs apart.
By using the combination of a noise contrastive estimation function and an inner product function as the loss function of the feature extractor, the finally trained feature extraction model extracts features more effectively: the first feature vectors corresponding to homologous augmented data are pulled together, and the first feature vectors corresponding to heterologous augmented data are pushed apart.
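For concreteness, the following is a minimal PyTorch sketch of a loss of this form; it is an illustration under the assumptions above (inner-product similarity, N homologous pairs per batch), not the patent's reference implementation, and all names are illustrative.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(v1: torch.Tensor, v2: torch.Tensor) -> torch.Tensor:
    # v1, v2: (N, d) feature vectors under the first and second augmentation;
    # row i of v1 and row i of v2 come from the same source text (homologous)
    sim = v1 @ v2.t()                   # sim[i, j] = s(v_i^(1), v_j^(2)), the inner product
    targets = torch.arange(v1.size(0))  # the positive pair for row i is column i
    # cross_entropy(sim, targets) = -(1/N) * sum_i log( exp(sim[i,i]) / sum_j exp(sim[i,j]) ),
    # i.e. exactly the L_NCE form given above
    return F.cross_entropy(sim, targets)
```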
S2, training a feature extractor by using a contrastive learning algorithm based on the text training data set to obtain a feature extraction model;
Specifically, feature extraction is performed on each text in the text training data set by a feature extractor such as TextCNN or the encoder in a Transformer, to obtain the feature vector corresponding to the text. Through the contrastive learning algorithm, the feature extractor can extract deeper and better feature vectors that represent the text more faithfully. The contrastive learning loss is used as the loss function of the feature extractor, and training continues on the text training data set until the loss function converges, finally yielding the feature extraction model. In each training round the loss function is calculated and the parameters of the feature extractor are updated based on the extracted vectors and the loss; after multiple rounds, when the loss function has converged, the final parameters of the feature extractor are obtained, giving the feature extraction model.
The guiding principle of the contrastive learning algorithm is: by automatically constructing similar and dissimilar instances, learn a representation model under which similar instances lie relatively close in the projection space and dissimilar instances lie relatively far apart.
S3, carrying out equalization processing on the text training data set by using an equalization algorithm to obtain an equalization data set;
Specifically, since it is difficult to ensure that the labels of the data in the training text set are balanced, a classifier trained directly on the unbalanced data classifies low-frequency class samples poorly. The text training data set is therefore equalized with an equalization algorithm so that the samples of each class are balanced and the classifier achieves a better overall classification effect. The equalization algorithm may be a synthetic minority oversampling technique, resampling, data synthesis, re-weighting or the like; an equalized data set is finally obtained.
Further, the equalizing the text training data set by using an equalization algorithm includes:
acquiring each category in the text training data set and the corresponding text data volume;
carrying out average calculation on the text data volume corresponding to each category to obtain an average data volume;
comparing the text data amount of each category with the average data amount;
if the text data amount is smaller than the average data amount, the text data corresponding to the category is used as the text data to be augmented, and the difference value between the text data amount corresponding to the category and the average data amount is calculated to obtain the augmentation quantity;
and based on the augmentation quantity, augmenting the text data to be augmented using a synthetic minority oversampling technique or a data augmentation tool.
Specifically, because the text training data set is provided with the labels, the data in the text training data set can be divided into a plurality of categories according to the labels, and the text data amount contained in each category is obtained; calculating the average value of the text data amount of each category to obtain the average data amount, and determining the text data to be augmented and the augmentation quantity based on the average data amount; and based on the augmentation quantity, augmenting the text data to be augmented to achieve average data volume and realize the equalization processing of the text training data set.
The text data to be augmented are augmented by a synthetic minority oversampling technique, which is a nearest-neighbour-based technique that judges the distance between data points in the feature space by Euclidean distance. The oversampling percentage indicates the number of synthetic samples to be created and is always a multiple of 100. If the oversampling percentage is 100, one new sample is created for each instance, so the number of minority-class instances doubles.
The synthetic minority oversampling technique works as follows: 1) for a minority-class sample, find its k nearest neighbours among all samples of the minority class, judged by Euclidean distance; 2) randomly select one of the k neighbours, generate a random number between 0 and 1, calculate the difference between the randomly selected neighbour and the sample, multiply the difference by the random number, and add the result to the sample; repeat step 2) N times to synthesize N new samples.
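As a hedged illustration of these two steps (assuming the minority-class texts have already been embedded as fixed-length feature vectors; the function and variable names are illustrative):

```python
import numpy as np

def smote(minority: np.ndarray, n_new: int, k: int = 5, seed: int = 0) -> np.ndarray:
    # minority: (m, d) feature vectors of one minority class
    rng = np.random.default_rng(seed)
    synthetic = []
    for _ in range(n_new):
        x = minority[rng.integers(len(minority))]
        # step 1): k nearest neighbours by Euclidean distance (excluding x itself)
        dist = np.linalg.norm(minority - x, axis=1)
        neighbours = np.argsort(dist)[1:k + 1]
        # step 2): pick a random neighbour and interpolate with a factor in [0, 1)
        nn = minority[rng.choice(neighbours)]
        synthetic.append(x + rng.random() * (nn - x))
    return np.stack(synthetic)
```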
Data augmentation can also be performed with the EDA data augmentation toolkit. EDA (Easy Data Augmentation) is a simple data augmentation technique with four operations: synonym replacement, random insertion, random swap and random deletion. Synonym replacement randomly selects n non-stop words in the sentence and replaces each with a randomly selected synonym. Random insertion randomly finds a non-stop word in the sentence, randomly selects one of its synonyms, and inserts it at an arbitrary position in the sentence, repeated n times. Random swap randomly selects two words in the sentence and exchanges their positions, repeated n times. Random deletion deletes each word in the sentence, at random, with probability p.
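A minimal sketch of two of the four EDA operations follows (random swap and random deletion; synonym replacement and random insertion additionally require a synonym dictionary and are omitted). The function names are illustrative, not EDA's actual API:

```python
import random

def random_swap(words: list, n: int = 1) -> list:
    words = list(words)                             # work on a copy
    for _ in range(n):
        i, j = random.sample(range(len(words)), 2)  # pick two distinct positions
        words[i], words[j] = words[j], words[i]     # and exchange them
    return words

def random_delete(words: list, p: float = 0.1) -> list:
    kept = [w for w in words if random.random() > p]  # drop each word with probability p
    return kept if kept else [random.choice(words)]   # never return an empty sentence
```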
The text data to be augmented are augmented with the synthetic minority oversampling technique, and an equalized data set is finally obtained to facilitate the training of the subsequent model.
S4, extracting the features of each text in the balanced data set by using the feature extraction model to obtain corresponding feature vectors;
Feature extraction is performed on each text in the equalized data set by the trained feature extraction model, such as TextCNN or a Transformer encoder, to obtain the corresponding feature vector.
Further, before the performing feature extraction on each text in the balanced dataset by using the feature extraction model, the method further includes:
performing word segmentation on the text using Jieba word segmentation to obtain a plurality of corresponding words;
and vectorizing the words to obtain word vectors corresponding to the words.
This application directly uses the Jieba word segmentation toolkit: by importing the toolkit, each input text can be segmented, realizing the word segmentation of the text.
For example, performing word segmentation on "the flow after investigation is damage assessment" with the precise mode of Jieba yields "investigation / after / flow / damage assessment". In this embodiment, the words are separated directly.
Word2Vec is then used to map each word into a high-dimensional vector space, giving the word vector corresponding to each word.
After word segmentation, each word is thus converted into a corresponding vector, which facilitates subsequent processing.
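A hedged sketch of this segmentation-plus-vectorisation step, using the Jieba and gensim libraries (the sample sentence and the dimensions are illustrative placeholders, not taken from the patent):

```python
import jieba
from gensim.models import Word2Vec

corpus = ["查勘之后的流程是定损"]                             # illustrative raw texts
sentences = [list(jieba.cut(t)) for t in corpus]             # precise-mode segmentation
w2v = Word2Vec(sentences, vector_size=128, window=5, min_count=1)
word_vectors = [[w2v.wv[w] for w in s] for s in sentences]   # one vector per word
```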
Still further, the performing of word segmentation on the text using Jieba word segmentation to obtain a plurality of corresponding words comprises:
scanning the text data based on a preset Trie tree, and identifying various segmentation combinations of words in the text;
constructing a directed acyclic graph based on all identified segmentation combinations, dynamically planning and searching a maximum probability path by using the directed acyclic graph, determining a segmentation combination with a maximum probability, and segmenting words of the text based on the segmentation combination with the maximum probability;
and for the unrecognized words, performing segmentation by adopting a hidden Markov model.
Specifically, the Trie, also called a dictionary tree or prefix tree, is a common data structure used for fast string matching over a list of strings. The text is scanned based on a preset Trie tree to identify the various segmentation combinations of the words in it: the text is matched against the Trie tree to generate the possible segmentation combinations. These combinations form a directed acyclic graph, in which each node is a segmented word.
Dynamic programming is then used to find the maximum-probability path in the directed acyclic graph. When the Trie tree is generated from the dictionary, the occurrence count of each word is converted into a frequency. For the given segmentation combinations, the occurrence frequency of each combination, i.e. the probability of each node in the directed acyclic graph, is looked up. The main function for computing the maximum-probability path is calc, which works on the constructed directed acyclic graph. calc is a bottom-up dynamic programming routine: it traverses the sentence to be processed word by word in reverse order, starting from the last word, and computes the log-probability score of each segmentation combination. The combination with the highest log-probability score is stored and output; this is the segmentation combination with the maximum probability, and the sentence to be processed is segmented based on it.
Because the dictionary is limited and cannot contain all words, words that do not appear in the dictionary are segmented with a hidden Markov model. The hidden Markov model tags Chinese characters with four states, BEMS: B denotes the begin position, E the end position, M a middle position, and S a single character. Jieba tags Chinese words with these four states; for example, "北京" (Beijing) can be tagged BE, i.e. 北/B 京/E: 北 is the begin position and 京 the end position of the split. In this way the sentence to be processed is segmented, and the segmentation combination closest to the real situation is obtained.
Further, the extracting features of the texts in the equalized data set by using the feature extraction model to obtain corresponding feature vectors includes:
convolving the word vectors with convolution kernels;
processing the convolution outputs with an activation function to obtain corresponding feature maps;
and pooling the feature maps to obtain the feature vectors corresponding to the word vectors.
Specifically, when the TextCNN feature extractor is used, the word vectors are convolved with convolution kernels whose width equals the width of the word vectors; each convolution kernel therefore moves only in the height direction, and the minimum sliding step of each kernel is one word.
Before convolution, a padding operation is also applied to the original word vectors to make the vector lengths consistent.
After convolution, the convolution outputs are processed through an activation function to obtain feature maps, and the feature maps are pooled and concatenated to form a new feature vector.
Extracting the feature vectors of the words through convolution, activation and pooling facilitates the subsequent training of the classifier, and the extracted feature vectors represent the texts well.
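A hedged PyTorch sketch of such a TextCNN-style extractor follows; the hyperparameters are illustrative, not taken from the patent, and the input is assumed to be the padded word vectors described above:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextCNNExtractor(nn.Module):
    def __init__(self, embed_dim: int = 128, n_filters: int = 100,
                 kernel_heights=(2, 3, 4)):
        super().__init__()
        # each kernel spans h consecutive words and the full word-vector width,
        # so it can only slide in the height (word) direction
        self.convs = nn.ModuleList(
            nn.Conv2d(1, n_filters, (h, embed_dim)) for h in kernel_heights)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, embed_dim) padded word vectors
        x = x.unsqueeze(1)                                     # (batch, 1, seq, dim)
        maps = [F.relu(c(x)).squeeze(3) for c in self.convs]   # activation -> feature maps
        pooled = [F.max_pool1d(m, m.size(2)).squeeze(2) for m in maps]  # pooling
        return torch.cat(pooled, dim=1)                        # concatenated feature vector
```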
And S5, training the classifier based on the feature vectors to obtain a classification model.
The classifier is trained with the feature vectors corresponding to the equalized data set to obtain the final classifier. In the embodiment of the present application, the classifier may be a Softmax classifier.
Further, the training the classifier based on the feature vector to obtain a classification model includes:
and training the classifier by using a mini-batch gradient descent algorithm based on the feature vectors to obtain the classification model.
Mini-batch gradient descent is a compromise between batch gradient descent and stochastic gradient descent: in each iteration, batch_size samples are used to update the parameters. Thanks to matrix operations, optimizing the neural network parameters on one batch at a time is not much slower than on a single data point, while using a batch per update greatly reduces the number of iterations required for convergence, and the convergence result is closer to the effect of full gradient descent.
Training the classifier with the mini-batch gradient descent algorithm makes the training process faster while a better classification model is obtained.
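A minimal sketch of mini-batch training of a Softmax classifier (a linear layer plus cross-entropy loss is equivalent to a Softmax classifier); the hyperparameters and names are illustrative assumptions:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

def train_classifier(features: torch.Tensor, labels: torch.Tensor,
                     n_classes: int, batch_size: int = 64, epochs: int = 10) -> nn.Linear:
    clf = nn.Linear(features.size(1), n_classes)   # logits; CE loss applies the Softmax
    opt = torch.optim.SGD(clf.parameters(), lr=0.1)
    loader = DataLoader(TensorDataset(features, labels),
                        batch_size=batch_size, shuffle=True)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for xb, yb in loader:                      # one parameter update per mini-batch
            opt.zero_grad()
            loss_fn(clf(xb), yb).backward()
            opt.step()
    return clf
```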
And S6, acquiring text data to be processed, wherein the text data to be processed is processed by the feature extraction model to obtain a corresponding vector to be processed, and the vector to be processed is processed by the classification model to obtain a category corresponding to the text data to be processed.
Specifically, after the feature extraction model and the classifier are obtained, text data to be processed input by the user can be received directly, or the text data to be processed can be obtained from a database. The text data to be processed are input into the feature extraction model to obtain the corresponding vector to be processed; the vector to be processed is then input into the final classifier, i.e. the classification model, which outputs the category corresponding to the text data to be processed, namely the class with the highest probability produced by the classifier. Through the cooperation of the trained feature extraction model and the classification model, the accuracy of text classification is improved.
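Continuing the sketches above, inference then reduces to a few lines; here `extractor` is the trained TextCNNExtractor, `clf` the trained classifier, and `embed` a hypothetical helper (not defined in the patent) that segments a text and stacks its word vectors into shape (1, seq_len, dim):

```python
import torch

with torch.no_grad():
    vec = extractor(embed(text_to_process))  # vector to be processed
    category = clf(vec).argmax(dim=1)        # class with the highest probability
```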
It should be emphasized that, to further ensure the privacy and security of the data, the text data to be processed and all the corresponding category data obtained from them can also be stored in the nodes of a blockchain.
The blockchain referred to in this application is a novel application mode of computer technologies such as distributed data storage, peer-to-peer transmission, consensus mechanisms and encryption algorithms. A blockchain is essentially a decentralized database: a series of data blocks linked by cryptographic methods, each containing the information of a batch of network transactions, used to verify the validity (anti-counterfeiting) of the information and to generate the next block. A blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
The embodiment of the application can acquire and process related data based on an artificial intelligence technology. Among them, Artificial Intelligence (AI) is a theory, method, technique and application system that simulates, extends and expands human Intelligence using a digital computer or a machine controlled by a digital computer, senses the environment, acquires knowledge and uses the knowledge to obtain the best result.
The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a robot technology, a biological recognition technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and the like.
A feature extractor is trained with a contrastive learning algorithm based on a text training data set to obtain a feature extraction model; by introducing the contrastive learning algorithm, the features extracted by the feature extraction model better represent the corresponding words. The text training data set is equalized with an equalization algorithm to obtain an equalized data set, which facilitates the subsequent training of the classification model; the feature extraction model then extracts features from each text in the equalized data set to obtain the corresponding feature vectors, and the classifier is trained on these feature vectors to obtain a classification model. Training the feature extractor and the classifier separately yields a better feature extraction model and a better classification model. Finally, text data to be processed are obtained and passed through the feature extraction model to obtain the corresponding vector to be processed, and the vector to be processed is passed through the classification model to obtain the category corresponding to the text data to be processed. In this way, the feature extraction model and the classification model can be trained on unbalanced data, and together they classify the data to be processed accurately.
Fig. 2 is a functional block diagram of the unbalanced text classification apparatus according to the present application.
The unbalanced text classification apparatus 100 of the present application may be installed in an electronic device. According to the realized functions, the unbalanced text classification apparatus 100 may include an obtaining module 101, a feature extraction model training module 102, a balance processing module 103, a feature extraction module 104, a classification model training module 105, and a classification module 106. A module, which may also be referred to as a unit in this application, refers to a series of computer program segments that can be executed by a processor of an electronic device and that can perform a fixed function, and that are stored in a memory of the electronic device.
In the present embodiment, the functions regarding the respective modules/units are as follows:
an obtaining module 101, configured to obtain a text training data set;
the obtaining module 101 obtains a text training data set from a database, or directly receives a text training data set input by a user, where the text training data set includes a large amount of text data.
Further, the unbalanced text classification device 100 includes an augmentation module, and the feature extraction model training module 102 includes an augmentation training sub-module;
the augmentation module is used for augmenting each text data to obtain corresponding augmentation data, and storing the text data and the augmentation data into the text training data set;
the augmentation training submodule is used for incorporating a contrastive learning algorithm into the feature extractor based on the augmented data, so that the first feature vectors obtained by the feature extractor for augmented data of the same source are pulled together and the first feature vectors for augmented data of different sources are pushed apart.
Specifically, when the feature extractor is trained, the augmentation training submodule improves its feature-capturing ability and stability by introducing a contrastive learning algorithm, specifically a contrastive sample loss. After the contrastive learning algorithm is introduced, the text data need to be augmented, for example by synonym replacement, random insertion, random swapping or random deletion of parts of the original text, to obtain independent augmented data. Under the contrastive learning algorithm, the vectors produced by the feature extractor for homologous texts are pulled together and those for heterologous texts are pushed apart, so that the resulting feature extraction model can capture the feature information of each text to the maximum extent.
Through the cooperation of the augmentation module and the augmentation training submodule, the feature extraction model obtained through training can capture feature information of each text to the maximum extent.
Still further, the augmented training submodule comprises a computing unit;
the computing unit is used for calculating a loss function of the feature extractor using a noise contrastive estimation function and an inner product function of the contrastive learning algorithm, and for performing feature extraction on the text data with the feature extraction model combined with the contrastive learning algorithm, so that homologous first feature vectors are gathered in the embedding space and heterologous first feature vectors are kept far apart in the embedding space. Whether data are homologous is judged by whether they belong to the same text data: several different augmented data obtained by augmenting the same text data are homologous.
The contrastive sample loss $L_{con}$ is calculated by combining the noise contrastive estimation function $L_{NCE}$ with the inner product function $S_{SimCLR}$, that is,

$$L_{con} = L_{NCE} = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{\exp\left(s\left(v_i^{(1)}, v_i^{(2)}\right)\right)}{\sum_{j=1}^{N}\exp\left(s\left(v_i^{(1)}, v_j^{(2)}\right)\right)}$$

$$s\left(v_i^{(1)}, v_j^{(2)}\right) = S_{SimCLR}\left(v_i^{(1)}, v_j^{(2)}\right) = \left(v_i^{(1)}\right)^{\top} v_j^{(2)}$$

where $v_i^{(1)}$ is the $i$-th sample under the first data augmentation, $v_j^{(2)}$ is the $j$-th sample under the second data augmentation, $s(\cdot,\cdot)$ is the similarity measure, i.e. the inner product (identical to the SimCLR similarity $S_{SimCLR}$), $\exp(\cdot)$ is the exponential function with base $e$, and $i, j = 1, 2, \ldots, N$. The numerator scores the homologous pair $(v_i^{(1)}, v_i^{(2)})$, while the denominator scores $v_i^{(1)}$ against all samples of the second augmentation, so minimizing $L_{con}$ pulls homologous pairs together and pushes heterologous pairs apart.
By using the combination of a noise contrastive estimation function and an inner product function as the loss function of the feature extractor, the computing unit makes the finally trained feature extraction model extract features more effectively: the first feature vectors corresponding to homologous augmented data are pulled together, and the first feature vectors corresponding to heterologous augmented data are pushed apart.
The feature extraction model training module 102 is configured to train a feature extractor with a contrastive learning algorithm based on the text training data set to obtain a feature extraction model;
specifically, the feature extraction model training module 102 performs feature extraction on each text in the text training data set through a feature extractor such as TextCNN or the encoder in a Transformer, to obtain the feature vector corresponding to the text. Through the contrastive learning algorithm, the feature extractor can extract deeper and better feature vectors that represent the text more faithfully. The contrastive learning loss is used as the loss function of the feature extractor, and training continues on the text training data set until the loss function converges, finally yielding the feature extraction model. In each training round the loss function is calculated and the parameters of the feature extractor are updated based on the extracted vectors and the loss; after multiple rounds, when the loss function has converged, the final parameters of the feature extractor are obtained, giving the feature extraction model.
The equalization processing module 103 is configured to perform equalization processing on the text training data set by using an equalization algorithm to obtain an equalization data set;
Specifically, the equalization processing module 103 equalizes the text training data set with an equalization algorithm so that the samples of each class are balanced and the classifier achieves a better overall classification effect. The equalization algorithm may be a synthetic minority oversampling technique, resampling, data synthesis, re-weighting or the like; an equalized data set is finally obtained.
Further, the equalization processing module 103 includes a data amount obtaining sub-module, an average calculating sub-module, a data amount comparing sub-module, a difference calculating sub-module and a data amplifying sub-module;
the data volume acquisition submodule is used for acquiring each category in the text training data set and the corresponding text data volume;
the average calculation submodule is used for carrying out average calculation on the text data volume corresponding to each category to obtain average data volume;
the data quantity comparison submodule is used for comparing the text data quantity of each category with the average data quantity;
the difference calculation submodule is used for taking the text data corresponding to a category as text data to be augmented if its text data amount is smaller than the average data amount, and for calculating the difference between the text data amount corresponding to the category and the average data amount to obtain the augmentation quantity;
and the data augmentation submodule is used for augmenting the text data to be augmented, based on the augmentation quantity, using a synthetic minority oversampling technique or a data augmentation tool.
Specifically, the text training data set is provided with labels, so that the data in the text training data set can be divided into a plurality of categories according to the labels, and the data volume acquisition submodule acquires the text data volume contained in each category; the average calculation submodule calculates the average value of the text data volume of each category to obtain the average data volume, and the data volume comparison submodule and the difference value calculation submodule determine the text data to be augmented and the augmentation quantity based on the average data volume; and the data augmentation submodule augments the text data to be augmented based on the augmentation quantity so as to achieve average data volume and realize balanced processing on the text training data set.
The data augmentation submodule augments the text data to be augmented with a synthetic minority oversampling technique, which is a nearest-neighbour-based technique that judges the distance between data points in the feature space by Euclidean distance.
An equalized data set is finally obtained through the cooperation of the data amount acquisition submodule, the average calculation submodule, the data amount comparison submodule, the difference calculation submodule and the data augmentation submodule, to facilitate the training of the subsequent model.
A feature extraction module 104, configured to perform feature extraction on each text in the balanced data set by using the feature extraction model to obtain a corresponding feature vector;
the feature extraction module 104 performs feature extraction on each text in the balanced data set by using a trained feature extraction model such as TextCNN, Transform, encoder, or the like to obtain a corresponding feature vector.
Further, the unbalanced text classification apparatus 100 further includes a word segmentation module and a vectorization module;
the word segmentation module is used for performing word segmentation on the text using Jieba word segmentation to obtain a plurality of corresponding words;
and the vectorization module is used for vectorizing the words to obtain word vectors corresponding to the words.
Specifically, the word segmentation module directly uses the Jieba word segmentation toolkit: by importing the toolkit, each input text can be segmented, realizing the word segmentation of the text.
And the vectorization module adopts Word2Vec to map each Word to a high-dimensional vector space to obtain a Word vector corresponding to each Word.
After word segmentation is performed through the cooperation of the word segmentation module and the vectorization module, each word is converted into a corresponding vector, which facilitates subsequent processing.
Further, the feature extraction module 104 includes a convolution sub-module, an activation sub-module, and a pooling sub-module;
the convolution submodule is used for convolving the word vectors with convolution kernels;
the activation submodule is used for processing the convolution outputs with an activation function to obtain corresponding feature maps;
and the pooling submodule is used for pooling the feature maps to obtain the feature vectors corresponding to the word vectors.
Specifically, with the TextCNN feature extractor, the convolution submodule convolves the word vectors with convolution kernels whose width equals the width of the word vectors; each convolution kernel therefore moves only in the height direction, and the minimum sliding step of each kernel is one word.
Before convolution, a padding operation is also applied to the original word vectors to make the vector lengths consistent.
After convolution, the activation submodule processes the convolution outputs through an activation function to obtain feature maps, and the pooling submodule pools the feature maps and concatenates them to form a new feature vector.
Through the cooperation of the convolution submodule, the activation submodule and the pooling submodule, the feature vectors of the words are extracted by convolution, activation and pooling, which facilitates the subsequent training of the classifier, and the extracted feature vectors represent the texts well.
A classification model training module 105, configured to train a classifier based on the feature vector to obtain a classification model;
specifically, the classification model training module 105 trains the classifier by using the feature vectors corresponding to the equalized data set to obtain the final classifier. In the embodiment of the present application, the classifier may be a Softmax classifier.
Further, the classification model training module comprises a mini-batch gradient descent training submodule;
the mini-batch gradient descent training submodule is used for training the classifier with a mini-batch gradient descent algorithm based on the feature vectors to obtain the classification model.
The mini-batch gradient descent training submodule trains the classifier with the mini-batch gradient descent algorithm, so that the training process is faster while a better classification model is obtained.
The classification module 106 is configured to obtain text data to be processed, where the text data to be processed is processed by the feature extraction model to obtain a corresponding vector to be processed, and the vector to be processed is processed by the classification model to obtain a category corresponding to the text data to be processed.
Specifically, after the feature extraction model and the classifier are obtained, the classification module 106 can directly receive the text data to be processed input by the user or obtain them from a database. It inputs the text data to be processed into the feature extraction model to obtain the corresponding vector to be processed, then inputs the vector to be processed into the final classifier, i.e. the classification model, and outputs the category corresponding to the text data to be processed, namely the class with the highest probability produced by the classifier. Through the cooperation of the trained feature extraction model and the classification model, the accuracy of text classification is improved.
With this arrangement, the unbalanced text classification device 100 improves the accuracy of classifying the data to be processed through the cooperation of the acquisition module 101, the feature extraction model training module 102, the equalization processing module 103, the feature extraction module 104, the classification model training module 105 and the classification module 106.
In order to solve the technical problem, an embodiment of the present application further provides a computer device. Referring to fig. 3, fig. 3 is a block diagram of a basic structure of a computer device according to the present embodiment.
The computer device 4 comprises a memory 41, a processor 42 and a network interface 43, which are communicatively connected to each other via a system bus. It is noted that only a computer device 4 having components 41-43 is shown, but it should be understood that not all of the illustrated components need be implemented; more or fewer components may be implemented instead. As will be understood by those skilled in the art, the computer device is a device capable of automatically performing numerical calculation and/or information processing according to preset or stored instructions; its hardware includes, but is not limited to, a microprocessor, an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA), a Digital Signal Processor (DSP), an embedded device, and the like.
The computer device can be a desktop computer, a notebook, a palmtop computer, a cloud server or another computing device. The computer device can interact with a user through a keyboard, a mouse, a remote controller, a touch panel, a voice control device or the like.
The memory 41 includes at least one type of readable storage medium including a flash memory, a hard disk, a multimedia card, a card type memory (e.g., SD or DX memory, etc.), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a Read Only Memory (ROM), an Electrically Erasable Programmable Read Only Memory (EEPROM), a Programmable Read Only Memory (PROM), a magnetic memory, a magnetic disk, an optical disk, etc. In some embodiments, the memory 41 may be an internal storage unit of the computer device 4, such as a hard disk or a memory of the computer device 4. In other embodiments, the memory 41 may also be an external storage device of the computer device 4, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like, which are provided on the computer device 4. Of course, the memory 41 may also include both internal and external storage devices of the computer device 4. In this embodiment, the memory 41 is generally used for storing an operating system installed in the computer device 4 and various types of application software, such as computer readable instructions of the unbalanced text classification method. Further, the memory 41 may also be used to temporarily store various types of data that have been output or are to be output.
The processor 42 may be a Central Processing Unit (CPU), a controller, a microcontroller, a microprocessor, or another data processing chip in some embodiments. The processor 42 is typically used to control the overall operation of the computer device 4. In this embodiment, the processor 42 is configured to execute the computer readable instructions stored in the memory 41 or to process data, for example to execute the computer readable instructions of the unbalanced text classification method.
The network interface 43 may comprise a wireless network interface or a wired network interface, and the network interface 43 is generally used for establishing communication connection between the computer device 4 and other electronic devices.
In this embodiment, the steps of the unbalanced text classification method of the above embodiments are implemented when the processor executes the computer readable instructions stored in the memory. A text training data set is obtained, and a feature extractor is trained with a contrast learning algorithm based on the text training data set to obtain a feature extraction model; by introducing the contrast learning algorithm, the features extracted by the feature extraction model are better and more representative of the corresponding words. The text training data set is equalized with an equalization algorithm to obtain an equalized data set, which facilitates the subsequent training of the classification model, and feature extraction is performed on each text in the equalized data set with the feature extraction model to obtain the corresponding feature vectors. The classifier is trained based on the feature vectors to obtain a classification model; training the feature extractor and the classifier separately yields a better feature extraction model and a better classification model. Finally, the text data to be processed is obtained, processed by the feature extraction model to obtain the corresponding vector to be processed, and the vector to be processed is processed by the classification model to obtain the category corresponding to the text data to be processed. In this way, the feature extraction model and the classification model can be trained from unbalanced data, and the data to be processed can be classified accurately by means of the two models.
The present application further provides another embodiment: a computer-readable storage medium storing computer-readable instructions executable by at least one processor, so that the at least one processor performs the steps of the above unbalanced text classification method. A text training data set is obtained, and a feature extractor is trained with a contrast learning algorithm based on the text training data set to obtain a feature extraction model; by introducing the contrast learning algorithm, the features extracted by the feature extraction model are better and more representative of the corresponding words. The text training data set is equalized with an equalization algorithm to obtain an equalized data set, which facilitates the subsequent training of the classification model, and feature extraction is performed on each text in the equalized data set with the feature extraction model to obtain the corresponding feature vectors. The classifier is trained based on the feature vectors to obtain a classification model; training the feature extractor and the classifier separately yields a better feature extraction model and a better classification model. Finally, the text data to be processed is obtained and processed by the feature extraction model and the classification model to obtain the corresponding category. In this way, the feature extraction model and the classification model can be trained from unbalanced data, and the data to be processed can be classified accurately by means of the two models.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present application may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (such as a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present application.
It is to be understood that the above-described embodiments illustrate only some embodiments of the invention and do not limit its scope; the appended drawings show preferred embodiments of the application and likewise do not limit the scope of the patent. This application may be embodied in many different forms, and these embodiments are provided so that the disclosure of the application will be thorough and complete. Although the present application has been described in detail with reference to the foregoing embodiments, those skilled in the art may still modify the technical solutions recorded in the foregoing embodiments or substitute equivalents for some of their features. Any equivalent structure made using the contents of the specification and drawings of the present application, whether applied directly or indirectly in other related technical fields, falls within the protection scope of the present application.

Claims (10)

1. A method of unbalanced text classification, the method comprising:
acquiring a text training data set;
training a feature extractor by using a contrast learning algorithm based on the text training data set to obtain a feature extraction model;
carrying out equalization processing on the text training data set by using an equalization algorithm to obtain an equalization data set;
extracting the features of each text in the balanced data set by using the feature extraction model to obtain corresponding feature vectors;
training a classifier based on the feature vectors to obtain a classification model;
and acquiring text data to be processed, wherein the text data to be processed is processed by the feature extraction model to obtain a corresponding vector to be processed, and the vector to be processed is processed by the classification model to obtain a category corresponding to the text data to be processed.
2. The unbalanced text classification method of claim 1, further comprising, before obtaining the text training data set:
performing data augmentation on each text data to obtain corresponding augmented data, and storing the text data and the augmented data into the text training data set;
the training of the feature extractor by using a contrast learning algorithm comprises:
based on the augmented data, a contrast learning algorithm is combined into the feature extractor, so that first feature vectors corresponding to the augmented data of the same source obtained by the feature extractor are mutually gathered, and first feature vectors corresponding to the augmented data of different sources are mutually far away.
3. The unbalanced text classification method of claim 2, wherein combining the contrast learning algorithm into the feature extractor so that the first feature vectors corresponding to the augmented data of the same source are mutually gathered and the first feature vectors corresponding to the augmented data of different sources are mutually far away comprises:
and calculating a loss function of the feature extractor by using a noise contrastive estimation function and an inner product function in the contrast learning algorithm, so that the first feature vectors of the same source are gathered in an embedding space and the first feature vectors of different sources are far apart in the embedding space.
4. The unbalanced text classification method of claim 1, wherein the equalizing the text training data set using an equalization algorithm comprises:
acquiring each category in the text training data set and the corresponding text data volume;
carrying out average calculation on the text data volume corresponding to each category to obtain an average data volume;
comparing the text data amount of each category with the average data amount;
if the text data amount is smaller than the average data amount, the text data corresponding to the category is used as the text data to be augmented, and the difference value between the text data amount corresponding to the category and the average data amount is calculated to obtain the augmentation quantity;
and based on the augmentation quantity, augmenting the text data to be augmented by adopting a synthetic minority oversampling (SMOTE) algorithm or a data augmentation tool.
5. The unbalanced text classification method of claim 1, wherein before the performing the feature extraction on each text in the balanced data set by using the feature extraction model, the method further comprises:
carrying out word segmentation processing on the text by using the jieba word segmentation tool to obtain a plurality of corresponding words;
and vectorizing the words to obtain word vectors corresponding to the words.
6. The unbalanced text classification method of claim 5, wherein the performing feature extraction on each text in the balanced data set by using the feature extraction model to obtain a corresponding feature vector comprises:
obtaining convolution outputs by performing convolution on the word vector with convolution kernels;
processing the convolution outputs by adopting an activation function to obtain the corresponding feature maps;
and performing pooling processing on the feature maps to obtain the feature vector corresponding to the word vector.
7. The unbalanced text classification method of claim 1, wherein the training a classifier based on the feature vectors to obtain a classification model comprises:
and training the classifier by using a small batch gradient descent algorithm based on the feature vector to obtain the classification model.
8. An unbalanced text classification apparatus, the apparatus comprising:
the acquisition module is used for acquiring a text training data set;
the feature extraction model training module is used for training a feature extractor by using a comparison learning algorithm based on the text training data set to obtain a feature extraction model;
the equalization processing module is used for carrying out equalization processing on the text training data set by using an equalization algorithm to obtain an equalization data set;
the feature extraction module is used for extracting features of each text in the balanced data set by using the feature extraction model to obtain corresponding feature vectors;
the classification model training module is used for training a classifier based on the feature vector to obtain a classification model;
and the classification module is used for acquiring text data to be processed, the text data to be processed is processed by the feature extraction model to obtain a corresponding vector to be processed, and the vector to be processed is processed by the classification model to obtain a category corresponding to the text data to be processed.
9. A computer device, characterized in that the computer device comprises:
at least one processor; and
a memory communicatively connected to the at least one processor; wherein
the memory stores computer readable instructions which, when executed by the processor, implement the unbalanced text classification method of any of claims 1 to 7.
10. A computer-readable storage medium having computer-readable instructions stored thereon which, when executed by a processor, implement the unbalanced text classification method of any one of claims 1 to 7.
CN202111128220.9A 2021-09-26 2021-09-26 Unbalanced text classification method, device, equipment and storage medium Pending CN113869398A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111128220.9A CN113869398A (en) 2021-09-26 2021-09-26 Unbalanced text classification method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN113869398A true CN113869398A (en) 2021-12-31

Family

ID=78994530

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111128220.9A Pending CN113869398A (en) 2021-09-26 2021-09-26 Unbalanced text classification method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113869398A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023173555A1 (en) * 2022-03-15 2023-09-21 平安科技(深圳)有限公司 Model training method and apparatus, text classification method and apparatus, device, and medium

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110019792A (en) * 2017-10-30 2019-07-16 阿里巴巴集团控股有限公司 File classification method and device and sorter model training method
CN108628971A (en) * 2018-04-24 2018-10-09 深圳前海微众银行股份有限公司 File classification method, text classifier and the storage medium of imbalanced data sets
CN110097130A (en) * 2019-05-07 2019-08-06 深圳市腾讯计算机系统有限公司 Training method, device, equipment and the storage medium of classification task model
CN110717039A (en) * 2019-09-17 2020-01-21 平安科技(深圳)有限公司 Text classification method and device, electronic equipment and computer-readable storage medium
CN111275129A (en) * 2020-02-17 2020-06-12 平安科技(深圳)有限公司 Method and system for selecting image data augmentation strategy
CN111786951A (en) * 2020-05-28 2020-10-16 东方红卫星移动通信有限公司 Traffic data feature extraction method, malicious traffic identification method and network system
CN112528029A (en) * 2020-12-29 2021-03-19 平安普惠企业管理有限公司 Text classification model processing method and device, computer equipment and storage medium
CN113837942A (en) * 2021-09-26 2021-12-24 平安科技(深圳)有限公司 Super-resolution image generation method, device, equipment and storage medium based on SRGAN

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
TING CHEN et al.: "A Simple Framework for Contrastive Learning of Visual Representations", The 37th International Conference on Machine Learning, 31 December 2020 (2020-12-31), pages 1-11 *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code (Ref country code: HK; Ref legal event code: DE; Ref document number: 40062788; Country of ref document: HK)