CN112732913B - Method, device, equipment and storage medium for classifying unbalanced samples


Info

Publication number
CN112732913B
Authority
CN
China
Prior art keywords
corpus
classification model
training
initial
loss function
Prior art date
Legal status
Active
Application number
CN202011617671.4A
Other languages
Chinese (zh)
Other versions
CN112732913A (en)
Inventor
陈昊 (Chen Hao)
Current Assignee
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd
Priority to CN202011617671.4A
Priority to PCT/CN2021/090432 (published as WO2022142010A1)
Publication of CN112732913A
Application granted
Publication of CN112732913B


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval of unstructured textual data
    • G06F16/35: Clustering; Classification
    • G06F16/36: Creation of semantic tools, e.g. ontology or thesauri
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/22: Matching criteria, e.g. proximity measures

Abstract

The application discloses a method, a device, equipment and a storage medium for classifying unbalanced samples, belonging to the technical field of artificial intelligence. The method obtains a training corpus comprising a minority-class training corpus and a majority-class training corpus, trains an initial first classification model on the minority-class corpus, inverts the loss function of the initial model according to a preset adjustment rule, and then iteratively updates the model on the majority-class corpus to obtain the final first classification model, which is used to classify the corpus to be classified. In addition, the application also relates to blockchain technology: the corpus to be classified can be stored in a blockchain. By inverting the loss function of the classification model through the preset adjustment rule and then training a classification model with exclusivity, the accuracy of the unbalanced-sample classification model is improved.

Description

Method, device, equipment and storage medium for classifying unbalanced samples
Technical Field
The application belongs to the technical field of artificial intelligence, and particularly relates to a method, a device, equipment and a storage medium for classifying unbalanced samples.
Background
In the field of machine learning, most models are trained on data sets with balanced class distributions, and very few model designs and training methods apply to biased data sets. In some specific situations, however, the data sets are strongly biased: in the field of natural language classification, for example, offensive language accounts for only a very small proportion of the data, yet in actual business scenarios offensive language must be classified accurately.
Designing and training classification models for biased data sets has long been a difficulty in academia and industry. A currently common approach is to modify the loss function used for model training, namely adding extra weights to the minority training-sample classes, which effectively forces the model to pay more attention to those samples. Because such classes contain few training samples, setting appropriate weights can yield a more robust classification model overall and better performance on classification tasks. In practical applications, however, samples fall into many classes, so calculating the sample weight values is extremely complex and the weights are hard to balance across classes. Moreover, for training samples with very similar features, a classification model trained by weight setting struggles to tell them apart; that is, classifying a biased data set through weight setting alone cannot meet the classification-accuracy requirements of many scenarios.
Disclosure of Invention
The embodiment of the application aims to provide a method, a device, computer equipment and a storage medium for classifying unbalanced samples, so as to solve the technical problems that existing classification methods for biased data sets are complex and their classification accuracy cannot meet requirements.
In order to solve the above technical problems, an embodiment of the present application provides a method for classifying unbalanced samples, which adopts the following technical scheme:
a method of classifying an unbalanced sample, comprising:
acquiring training corpus from a preset corpus, wherein the training corpus comprises a first training corpus and a second training corpus, the first training corpus is a minority class training corpus, and the second training corpus is a majority class training corpus;
training a preset first classification model through a first training corpus to obtain an initial first classification model;
adjusting the loss function of the initial first classification model based on a preset adjustment rule, importing the second training corpus into the initial first classification model with the loss function adjusted, and carrying out iterative updating on the initial first classification model based on a back propagation algorithm to obtain a first classification model;
receiving a corpus classifying instruction, obtaining a corpus to be classified corresponding to the corpus classifying instruction, and classifying the corpus to be classified through a first classifying model.
Further, the language type of the training corpus is a first language, and after the step of obtaining the training corpus from the preset corpus, the method further includes:
calculating the corpus similarity of the first training corpus and the second training corpus, and comparing the corpus similarity with a preset similarity threshold value;
if the corpus similarity is greater than or equal to a preset similarity threshold, translating the training corpus into a second language, wherein the corpus similarity of the first training corpus and the second training corpus is smaller than the preset similarity threshold in the language environment of the second language.
Further, the preset first classification model includes an encoding layer and a decoding layer, and the step of training the preset first classification model through a first training corpus to obtain an initial first classification model specifically includes:
extracting corpus characteristics of the first training corpus through the coding layer, and carrying out vector coding on the corpus characteristics to obtain feature vectors;
performing feature mapping on the feature vector and feature labels pre-stored in a decoding layer to obtain a feature mapping result;
and iterating a preset first classification model based on the feature mapping result to obtain an initial first classification model.
Further, iterating the preset first classification model based on the feature mapping result to obtain an initial first classification model, which specifically includes:
Constructing a loss function of an initial first classification model to obtain a first loss function, wherein the first loss function comprises an adversarial factor;
calculating the error between the feature mapping result and a preset mapping result based on the first loss function to obtain a mapping error;
and iterating a preset first classification model based on the mapping error and a back propagation algorithm to obtain an initial first classification model.
Further, the method comprises the steps of adjusting the loss function of the initial first classification model based on a preset adjustment rule, importing the second training corpus into the initial first classification model with the loss function adjusted, and iteratively updating the initial first classification model based on a back propagation algorithm to obtain a first classification model, wherein the method specifically comprises the following steps:
inverting the loss function of the initial first classification model based on a preset adjustment rule to obtain a second loss function;
the second training corpus is imported into an initial first classification model with the loss function reversed, and a classification result is obtained;
calculating the error between the classification result and a preset classification result based on the second loss function to obtain a classification error;
comparing the classification error with a preset classification error threshold, and if the classification error is smaller than or equal to the preset classification error threshold, iteratively updating the initial first classification model after the loss function is inverted through a back propagation algorithm until the classification error is larger than the preset classification error threshold;
And outputting a first classification model with the classification error larger than a preset classification error threshold value.
Further, after the step of adjusting the loss function of the initial first classification model based on the preset adjustment rule, introducing the second training corpus into the initial first classification model with the loss function adjusted, and iteratively updating the initial first classification model based on the back propagation algorithm to obtain the first classification model, the method further comprises:
vectorizing the first training corpus to obtain a corpus vector of the first training corpus;
vector splicing is carried out on the language material vector and the feature vector, and a language material feature matrix is obtained;
importing the corpus feature matrix into a preset second classification model, and performing a convolution operation on the corpus feature matrix through the convolution kernels of the second classification model to obtain a convolution operation result;
and carrying out iterative updating on the second classification model based on the convolution operation result, and outputting the trained second classification model.
Further, the step of iteratively updating the second classification model based on the convolution operation result and outputting the trained second classification model specifically includes:
constructing a loss function of a second classification model based on a cross-entropy loss function and a Levenshtein distance function to obtain an initial third loss function, wherein the initial third loss function comprises a cross-entropy factor and a Levenshtein factor;
assigning the same initial weight value to the cross-entropy factor and the Levenshtein factor of the initial third loss function respectively;
adjusting the initial weight values of the cross-entropy factor and the Levenshtein factor based on the convolution operation result until the output of the initial third loss function reaches the minimum value, so as to obtain the third loss function;
and calculating the operation error of the convolution operation result through the third loss function, and carrying out iterative updating on the second classification model by adopting a back propagation algorithm based on the operation error to obtain a trained second classification model.
In order to solve the above technical problems, the embodiment of the present application further provides a device for classifying unbalanced samples, which adopts the following technical scheme:
a sorting apparatus of unbalanced samples, comprising:
the corpus acquisition module is used for acquiring training corpuses from a preset corpus, wherein the training corpuses comprise first training corpuses and second training corpuses, the first training corpuses are minority class training corpuses, and the second training corpuses are majority class training corpuses;
the first model training module is used for training a preset first classification model through a first training corpus to obtain an initial first classification model;
The inversion training module is used for adjusting the loss function of the initial first classification model based on a preset adjustment rule, importing the second training corpus into the initial first classification model with the loss function adjusted, and carrying out iterative updating on the initial first classification model based on a back propagation algorithm to obtain a first classification model;
the corpus classifying module is used for receiving the corpus classifying instruction, acquiring the corpus to be classified corresponding to the corpus classifying instruction, and classifying the corpus to be classified through the first classifying model.
In order to solve the above technical problems, the embodiment of the present application further provides a computer device, which adopts the following technical schemes:
a computer device comprising a memory and a processor, the memory having stored therein computer readable instructions which, when executed by the processor, implement the steps of the method for classifying unbalanced samples as described above.
In order to solve the above technical problems, an embodiment of the present application further provides a computer readable storage medium, which adopts the following technical schemes:
a computer readable storage medium having computer readable instructions stored thereon which, when executed by a processor, implement the steps of the method for classifying unbalanced samples as described above.
Compared with the prior art, the embodiment of the application has the following main beneficial effects:
the application discloses a method, a device, equipment and a storage medium for classifying unbalanced samples, belonging to the technical field of artificial intelligence. The application inverts the loss function of the classification model through a preset adjustment rule and then trains a classification model with exclusivity: when performing a classification task, the model responds to and outputs only the data types corresponding to the minority-class training corpus and does not respond to the data types corresponding to the majority-class training corpus, improving the accuracy of the unbalanced-sample classification model.
Drawings
In order to more clearly illustrate the solution of the present application, the drawings required for describing the embodiments are briefly introduced below. The drawings described below are obviously only some embodiments of the present application; a person of ordinary skill in the art may obtain other drawings from them without inventive effort.
FIG. 1 illustrates an exemplary system architecture diagram in which the present application may be applied;
FIG. 2 illustrates a flow chart of one embodiment of a method of classifying unbalanced samples according to the present application;
FIG. 3 shows a schematic structural view of an embodiment of a sorting apparatus for unbalanced samples according to the present application;
fig. 4 shows a schematic structural diagram of an embodiment of a computer device according to the application.
Detailed Description
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs; the terminology used in the description of the applications herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application; the terms "comprising" and "having" and any variations thereof in the description of the application and the claims and the description of the drawings above are intended to cover a non-exclusive inclusion. The terms first, second and the like in the description and in the claims or in the above-described figures, are used for distinguishing between different objects and not necessarily for describing a sequential or chronological order.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the application. The appearances of such phrases in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those of skill in the art will explicitly and implicitly appreciate that the embodiments described herein may be combined with other embodiments.
In order to make the person skilled in the art better understand the solution of the present application, the technical solution of the embodiment of the present application will be clearly and completely described below with reference to the accompanying drawings.
As shown in fig. 1, a system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 is used as a medium to provide communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.
The user may interact with the server 105 via the network 104 using the terminal devices 101, 102, 103 to receive or send messages or the like. Various communication client applications, such as a web browser application, a shopping class application, a search class application, an instant messaging tool, a mailbox client, social platform software, etc., may be installed on the terminal devices 101, 102, 103.
The terminal devices 101, 102, 103 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smartphones, tablet computers, electronic book readers, MP3 (Moving Picture Experts Group Audio Layer III) players, MP4 (Moving Picture Experts Group Audio Layer IV) players, laptop and desktop computers, and the like.
The server 105 may be a server providing various services, such as a background server providing support for pages displayed on the terminal devices 101, 102, 103.
It should be noted that, the method for classifying the unbalanced samples provided in the embodiments of the present application is generally executed by a server, and accordingly, the device for classifying the unbalanced samples is generally disposed in the server.
It should be understood that the number of terminal devices, networks and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
With continued reference to fig. 2, a flow chart of one embodiment of a method of classifying unbalanced samples according to the present application is shown. The method for classifying the unbalanced samples comprises the following steps:
S201, acquiring training corpus from a preset corpus, wherein the training corpus comprises a first training corpus and a second training corpus, the first training corpus is a minority class training corpus, and the second training corpus is a majority class training corpus.
Specifically, a training corpus is obtained from a preset corpus, wherein the training corpus comprises a first training corpus and a second training corpus; the first training corpus is a minority-class training corpus and the second training corpus is a majority-class training corpus. In a specific embodiment of the application, the majority-class training corpus may contain more than 100,000 samples while the minority-class training corpus may contain as few as 10. The training corpus can be in any language; in a specific embodiment of the present application, it is a Chinese corpus.
S202, training a preset first classification model through a first training corpus to obtain an initial first classification model.
For classification scenarios with severely imbalanced samples, a classification model trained by simply mixing the majority-class and minority-class training corpora cannot achieve the required classification accuracy. The application instead trains an exclusive classification model: when performing a classification task, the model responds to and outputs only the data types corresponding to the minority-class training corpus and does not respond to the data types corresponding to the majority-class training corpus, which improves the accuracy of the unbalanced-sample classification model.
Specifically, the preset first classification model may be a classical Transformer model, and the Transformer model is trained on the minority-class training corpus to obtain an initial first classification model. The Transformer structure includes an encoder (encoding layer) and a decoder (decoding layer). The encoding layer extracts corpus features from the minority-class training corpus and vector-encodes them into feature vectors; the feature vectors are mapped against feature labels pre-stored in the decoding layer to obtain a feature mapping result; the preset first classification model is then iterated based on the feature mapping result to obtain an initial first classification model that responds to and outputs the data types corresponding to the minority-class training corpus.
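As a sketch of this training step, the following PyTorch code shows a minimal Transformer-style encoder paired with a linear decoding layer, trained only on minority-class samples. The model dimensions, tokenization, and label count are illustrative assumptions, not details taken from the application.

import torch
import torch.nn as nn

class MinorityClassifier(nn.Module):
    # Minimal encoder/decoder pair: the encoder turns token ids into feature
    # vectors, and the linear "decoding layer" maps them onto pre-stored labels.
    def __init__(self, vocab_size=30000, d_model=128, n_heads=4, n_layers=2, n_labels=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.decoder = nn.Linear(d_model, n_labels)

    def forward(self, token_ids):
        feats = self.encoder(self.embed(token_ids))   # corpus features -> feature vectors
        return self.decoder(feats.mean(dim=1))        # feature mapping result (logits)

Training this model with a standard loss on the minority-class corpus alone corresponds to obtaining the initial first classification model described above.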
S203, adjusting the loss function of the initial first classification model based on a preset adjustment rule, importing the second training corpus into the initial first classification model with the loss function adjusted, and iteratively updating the initial first classification model based on a back propagation algorithm to obtain the first classification model.
Wherein adjusting the loss function of the initial first classification model based on the preset adjustment rule refers to inverting the loss function of the initial first classification model. For example, if the loss function of the initial first classification model, i.e. the first loss function, is L_1, then in a specific embodiment of the application the inverted loss function of the initial first classification model is L_2 = 1 - L_1.
Specifically, after the initial first classification model is trained, its loss function is inverted based on the preset adjustment rule, the majority-class training corpus is imported into the inverted-loss initial model, and the model is reverse-iterated based on the back propagation algorithm to obtain the first classification model. In other words, the inverted loss function encourages the initial model to learn in the direction of excluding the majority-class training corpus, yielding an exclusive first classification model that cannot respond to the data types corresponding to the majority-class corpus; by separating minority-class data types from majority-class data types, the accuracy of the unbalanced-sample classification model is improved.
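A minimal sketch of the loss inversion, under the assumption that the "preset adjustment rule" is exactly the formula L_2 = 1 - L_1 given above:

def make_inverted_loss(first_loss_fn):
    # Wraps the first loss L1 into the inverted second loss L2 = 1 - L1.
    # Minimizing L2 on majority-class data pushes the model away from fitting
    # that data, which is what produces the exclusivity described above.
    def second_loss_fn(logits, targets):
        return 1.0 - first_loss_fn(logits, targets)
    return second_loss_fn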
S204, receiving a corpus classifying instruction, obtaining the corpus to be classified corresponding to the corpus classifying instruction, and classifying the corpus to be classified through a first classifying model.
Specifically, when a corpus classifying instruction from a user is received, the corpus to be classified corresponding to the instruction is obtained and classified by the trained first classification model. If the first classification model produces no output, the input corpus is considered to belong to a data type corresponding to the majority class; if it produces an output, the input corpus is considered to belong to a data type corresponding to the minority class. The first classification model thus distinguishes minority-class data types from majority-class data types.
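The application does not spell out how "no output" is detected at inference time; the sketch below approximates it with a confidence threshold, which is purely an assumption for illustration:

import torch

@torch.no_grad()
def classify_corpus(model, token_ids, threshold=0.5):
    probs = torch.softmax(model(token_ids), dim=-1)
    confidence, label = probs.max(dim=-1)
    # Low confidence is read as "the model does not respond", i.e. the input
    # is treated as belonging to a majority-class data type (coded as -1).
    return [int(l) if c >= threshold else -1
            for c, l in zip(confidence, label)]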
In this embodiment, the electronic device (e.g., the server/terminal device shown in fig. 1) on which the method for classifying the unbalanced sample operates may receive the corpus classification instruction through a wired connection manner or a wireless connection manner. It should be noted that the wireless connection may include, but is not limited to, 3G/4G connections, wiFi connections, bluetooth connections, wiMAX connections, zigbee connections, UWB (ultra wideband) connections, and other now known or later developed wireless connection means.
The application discloses a classification method for unbalanced samples, belonging to the technical field of artificial intelligence. The method obtains a training corpus comprising a first training corpus (a minority-class training corpus) and a second training corpus (a majority-class training corpus); an initial first classification model is obtained by training on the minority-class corpus; the loss function of the initial model is inverted through a preset adjustment rule; the majority-class corpus is then imported into the inverted-loss initial model, which is iterated to obtain the first classification model; finally, the corpus to be classified is obtained and classified by the first classification model. By inverting the loss function through the preset adjustment rule and then training, a classification model with exclusivity is obtained: during classification it responds to and outputs only the data types corresponding to the minority-class training corpus and does not respond to those of the majority-class corpus, improving the accuracy of the unbalanced-sample classification model.
Further, the language type of the training corpus is a first language, and after the step of obtaining the training corpus from the preset corpus, the method further includes:
calculating the corpus similarity of the first training corpus and the second training corpus, and comparing the corpus similarity with a preset similarity threshold value;
if the corpus similarity is greater than or equal to a preset similarity threshold, translating the training corpus into a second language, wherein the corpus similarity of the first training corpus and the second training corpus is smaller than the preset similarity threshold in the language environment of the second language.
In a specific embodiment of the present application, the training data set may be severely imbalanced; for example, class A may contain 100,000 corpora while class B contains only a few dozen, and some corpora in class A may be grammatically very similar to those in class B. For example:
corpus one: "today weather is very hot" refers to the corpus type.
Corpus II: "does today weather is very hot", which is a type of query corpus.
In view of the above, when a conventional classification model is trained, the high similarity of the two corpora may make it difficult for the model to learn to distinguish them, leaving the model accuracy insufficient. In the present application, this problem can be solved by corpus translation.
First, the Chinese corpus is translated into a language that enlarges the feature distinction between corpora; in the specific embodiment of the application, the Chinese corpus is translated into German.
For example, the two corpora translate into German as "Es ist heute sehr heiß" and "Wie heiß es heute ist" respectively. It can be seen that translating the corpora into German effectively increases the distinction between the two while preserving the original information.
Then, the classification problem is converted into a similarity measurement problem; that is, the original traditional classification problem becomes the problem of measuring whether the processed input corpus is similar to a certain class of corpus. This is illustrated as follows:
Corpus two: "Is the weather very hot today?"
Corpus three: "Are you happy today?"
Corpus two and corpus three are unclassified corpora, and the classified corpus is corpus four: "Is the weather cold today?" Obviously corpus two is closer to corpus four than corpus three is, so corpus two belongs to the data set of corpus four and corpus three does not.
In a specific embodiment of the present application, a language translation mapping table is preset in the server, in which translation mapping relationships between a first language and a second language, for example "Chinese-German", are recorded. The translation mapping relationship may be determined according to the structural differences between languages: a first language and a second language that form a mapping pair differ greatly in language structure. After identifying the first language of the training corpus, the server may look up the corresponding second language in the table and translate the corpus accordingly. For example, if the server identifies the first language of the corpus as Chinese and the table maps "Chinese" to "German", the Chinese corpus is translated into German.
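A sketch of the language translation mapping table follows. The table contents and the translate() helper are hypothetical, since the application only states that languages with large structural differences are paired:

TRANSLATION_MAP = {"zh": "de"}  # e.g. Chinese -> German, per the embodiment above

def widen_corpus_distinction(corpus, source_lang, translate):
    # Look up the second language for the identified first language and
    # translate every training sample; if no mapping exists, keep the corpus.
    target_lang = TRANSLATION_MAP.get(source_lang)
    if target_lang is None:
        return corpus
    return [translate(text, source_lang, target_lang) for text in corpus]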
In the above embodiment, translating the training corpus enlarges the feature difference between the majority-class and minority-class training corpora, so that the trained first classification model responds only to data of the types corresponding to the minority-class training corpus. In a specific embodiment of the application, since the input corpus is Chinese, a language with a relatively large difference in grammatical structure from Chinese needs to be selected for translation.
Further, the preset first classification model includes an encoding layer and a decoding layer, and the step of training the preset first classification model through a first training corpus to obtain an initial first classification model specifically includes:
extracting corpus characteristics of the first training corpus through the coding layer, and carrying out vector coding on the corpus characteristics to obtain feature vectors;
performing feature mapping on the feature vector and feature labels pre-stored in a decoding layer to obtain a feature mapping result;
And iterating a preset first classification model based on the feature mapping result to obtain an initial first classification model.
Specifically, the preset first classification model comprises an encoding layer and a decoding layer. The core idea of the first classification model is to encode the vector of the input corpus with the encoding layer, turning it into a feature vector, and to use the decoding layer to map the feature vectors onto the feature labels stored in the decoding layer.
In the above embodiment, if training uses only the minority-class corpus, the encoding layer and decoding layer of the Transformer model learn only the feature-vector extraction and mapping applicable to those few classes; when the input corpus does not belong to these classes, the encoding layer cannot obtain feature vectors and no reasonable mapping can be constructed.
Further, iterating the preset first classification model based on the feature mapping result to obtain an initial first classification model, which specifically includes:
constructing a loss function of an initial first classification model to obtain a first loss function, wherein the first loss function comprises an adversarial factor;
calculating the error between the feature mapping result and a preset mapping result based on the first loss function to obtain a mapping error;
And iterating a preset first classification model based on the mapping error and a back propagation algorithm to obtain an initial first classification model.
The loss function of the initial first classification model is constructed to obtain a first loss function, and the first loss function comprises an adversarial factor. The purpose of adding the adversarial factor is that, when the initial first classification model is later updated by reverse iteration, the factor helps keep the mapping error above the preset error threshold, so that the model learns in the direction of excluding the majority-class training corpus; the result is an exclusive first classification model that cannot respond to the data types corresponding to the majority-class training corpus.
The first loss function has the following expression:
L_1 = L_ori + λ·D
where L_ori is the standard Transformer loss function, λ is a constant, and D is a classical adversarial loss. In the present application, the adversarial loss D is constructed according to the following formula:
D = E_{x~p_data(x)}[log D(x)] + E_{z~p_noise(z)}[log(1 - D(G(z)))]
where D is the discriminator in the adversarial network, G is the generator in the adversarial network (in the present application, G may be the above-mentioned Transformer model), x and z are input training data, x~p_data(x) and z~p_noise(z) mean that the training data x and z obey certain distributions, and E means taking the mean (expectation) of the adversarial network's output.
Specifically, the error between the feature mapping result and the preset mapping result is calculated based on the constructed first loss function to obtain a mapping error, and the mapping error is compared with a preset mapping-error threshold. If the mapping error is larger than the threshold, the initial first classification model is iteratively updated through the back propagation algorithm until the mapping error is smaller than or equal to the threshold, and the resulting initial first classification model is output.
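A hedged sketch of the first loss L_1 = L_ori + λ·D follows. How the discriminator outputs are produced is not specified in the application, so the wiring below is an assumption:

import torch
import torch.nn.functional as F

def first_loss(logits, targets, d_real, d_fake, lam=0.1):
    # L_ori: standard classification loss of the Transformer model.
    l_ori = F.cross_entropy(logits, targets)
    # D: classical GAN value E[log D(x)] + E[log(1 - D(G(z)))], where d_real
    # and d_fake are the discriminator's outputs on real and generated data.
    d_term = (torch.log(d_real + 1e-8).mean()
              + torch.log(1.0 - d_fake + 1e-8).mean())
    return l_ori + lam * d_term   # L_1 = L_ori + λ·D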
Further, the method comprises the steps of adjusting the loss function of the initial first classification model based on a preset adjustment rule, importing the second training corpus into the initial first classification model with the loss function adjusted, and iteratively updating the initial first classification model based on a back propagation algorithm to obtain a first classification model, wherein the method specifically comprises the following steps:
inverting the loss function of the initial first classification model based on a preset adjustment rule to obtain a second loss function;
the second training corpus is imported into an initial first classification model with the loss function reversed, and a classification result is obtained;
calculating the error between the classification result and a preset classification result based on the second loss function to obtain a classification error;
Comparing the classification error with a preset classification error threshold, and if the classification error is smaller than or equal to the preset classification error threshold, iteratively updating the initial first classification model after the loss function is inverted through a back propagation algorithm until the classification error is larger than the preset classification error threshold;
and outputting a first classification model with the classification error larger than a preset classification error threshold value.
Wherein adjusting the loss function of the initial first classification model based on the preset adjustment rule means inverting the loss function of the initial first classification model; for example, if the first loss function is L_1, then in a specific embodiment of the present application the inverted loss function of the initial first classification model is L_2 = 1 - L_1.
Specifically, inverting the loss function of the initial first classification model based on a preset adjustment rule to obtain a second loss function;
The second training corpus is imported into the initial first classification model after loss-function inversion, and a classification result is obtained. The error between the classification result and the preset classification result is calculated based on the second loss function to obtain a classification error, and the classification error is compared with a preset classification-error threshold. If the classification error is smaller than or equal to the threshold, the inverted-loss initial first classification model is iteratively updated through the back propagation algorithm until the classification error is larger than the threshold, and the first classification model whose classification error is larger than the preset classification-error threshold is output.
In the above embodiment, inverting the loss function of the initial first classification model makes it learn in the direction of excluding the majority-class training corpus, yielding an exclusive first classification model that cannot respond to the data types corresponding to the majority-class training corpus. The accuracy of the unbalanced-sample classification model is improved because the first classification model separates the data types of the minority-class corpus from those of the majority-class corpus.
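The inversion-training loop of this embodiment, sketched under the assumptions already noted; the classification_error() helper and the DataLoader over the majority-class corpus are hypothetical:

def inversion_training(model, first_loss_fn, majority_loader, optimizer,
                       err_threshold=0.9):
    # Iterate with the inverted loss L2 = 1 - L1 until the model's error on
    # the majority-class corpus exceeds the threshold, i.e. it has learned
    # to exclude majority-class data types.
    while True:
        for token_ids, targets in majority_loader:
            l2 = 1.0 - first_loss_fn(model(token_ids), targets)
            optimizer.zero_grad()
            l2.backward()
            optimizer.step()
        if classification_error(model, majority_loader) > err_threshold:
            return model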
Further, after the step of adjusting the loss function of the initial first classification model based on the preset adjustment rule, introducing the second training corpus into the initial first classification model with the loss function adjusted, and iteratively updating the initial first classification model based on the back propagation algorithm to obtain the first classification model, the method further comprises:
vectorizing the first training corpus to obtain a corpus vector of the first training corpus;
vector splicing is carried out on the language material vector and the feature vector, and a language material feature matrix is obtained;
importing the corpus feature matrix into a preset second classification model, and performing a convolution operation on the corpus feature matrix through the convolution kernels of the second classification model to obtain a convolution operation result;
And carrying out iterative updating on the second classification model based on the convolution operation result, and outputting the trained second classification model.
According to the high-dimensional manifold classification principle, minority-class training corpora that cannot be linearly separated under low-dimensional conditions may be linearly separable under high-dimensional conditions; that is, samples are easier to classify in high dimensions. Since the number of minority-class training corpora is usually small, it is difficult to form a data set sufficient for training a complete classifier; under high-dimensional conditions, however, taking these samples as initial cluster-center points makes the cluster centers of the minority-class corpora easier to find. In the above scheme, a simple classification model is trained on the incomplete data set through a dimension-raising operation, and locating the cluster center points is the model's adaptive behavior under the guidance of its loss function.
Specifically, after the first classification model is obtained, the minority-class training corpus can be further classified by a second classification model, which may be a CNN deep neural network model. When the second classification model is trained, the input comprises the corpus vector corresponding to the minority-class training corpus and the feature vector output by the encoding layer of the Transformer model. The feature vector is used because it provides features of higher dimension, and high-dimensional features benefit the classification of the minority-class training corpus.
In the above embodiment, the first training corpus is vectorized to obtain its corpus vector, the corpus vector and the feature vector are spliced to obtain a corpus feature matrix, the corpus feature matrix is imported into a preset second classification model, the convolution kernels of the second classification model perform a convolution operation on the matrix to obtain a convolution operation result, the second classification model is iteratively updated based on this result, and the trained second classification model, which can classify the minority-class training corpus, is output.
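A minimal sketch of the second classification model: the corpus vector and the feature vector are spliced into a corpus feature matrix and passed through a 1-D convolution. All dimensions are illustrative assumptions:

import torch
import torch.nn as nn

class SecondClassifier(nn.Module):
    def __init__(self, spliced_dim=256, n_labels=3):
        super().__init__()
        self.conv = nn.Conv1d(spliced_dim, 64, kernel_size=3, padding=1)
        self.head = nn.Linear(64, n_labels)

    def forward(self, corpus_vec, feature_vec):
        # Vector splicing: (batch, seq, d1) and (batch, seq, d2), with
        # d1 + d2 = spliced_dim, concatenate into the corpus feature matrix.
        mat = torch.cat([corpus_vec, feature_vec], dim=-1)
        # Convolution operation over the matrix via the model's convolution kernels.
        conv_out = torch.relu(self.conv(mat.transpose(1, 2)))
        return self.head(conv_out.mean(dim=-1))   # convolution operation result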
Further, the step of iteratively updating the second classification model based on the convolution operation result and outputting the trained second classification model specifically includes:
constructing a loss function of a second classification model based on a cross-entropy loss function and a Levenshtein distance function to obtain an initial third loss function, wherein the initial third loss function comprises a cross-entropy factor and a Levenshtein factor;
assigning the same initial weight value to the cross-entropy factor and the Levenshtein factor of the initial third loss function respectively;
adjusting the initial weight values of the cross-entropy factor and the Levenshtein factor based on the convolution operation result until the output of the initial third loss function reaches a minimum value, so as to obtain the third loss function;
And calculating the operation error of the convolution operation result through the third loss function, and carrying out iterative updating on the second classification model by adopting a back propagation algorithm based on the operation error to obtain a trained second classification model.
Specifically, the initial third loss function has the following form:
L_3 = α·L_cls + β·L_lev
where α and β are weight coefficients; in a specific embodiment of the present application, α + β = 1, and the initial values of α and β may both be set to 0.5. L_cls is a standard cross-entropy loss function and L_lev is a standard Levenshtein distance function; the cross-entropy loss guides the classification task, while the Levenshtein distance ensures clustering and forms the cluster centers. During training of the second classification model, the weight coefficients α and β are continually adjusted according to the convolution operation result until the output of the initial third loss function reaches its minimum; the function at that minimum is taken as the loss function of the second classification model, i.e. the third loss function. The operation error of the convolution operation result is then calculated through the third loss function, and the second classification model is iteratively updated with the back propagation algorithm based on this error, giving the trained second classification model.
In practical applications, samples fall into many classes, so calculating sample weight values is extremely complex and the weights are difficult to balance across classes. A classification model trained by weight setting also struggles to separate training samples with very similar features; that is, weight setting alone cannot meet the classification-accuracy requirements for biased data sets in many scenarios, so the existing unbalanced-sample classification based on per-class sample weighting has certain defects.
Aiming at the above technical problems, the application discloses a method, a device, equipment and a storage medium for classifying unbalanced samples, belonging to the technical field of artificial intelligence. The application inverts the loss function of the classification model through a preset adjustment rule and then trains a classification model with exclusivity: when performing a classification task, the model responds to and outputs only the data types corresponding to the minority-class training corpus and does not respond to the data types corresponding to the majority-class training corpus, improving the accuracy of the unbalanced-sample classification model.
It should be emphasized that, to further ensure the privacy and security of the corpus to be classified, the corpus to be classified may also be stored in a node of a blockchain.
The blockchain is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanism, encryption algorithm and the like. The Blockchain (Blockchain), which is essentially a decentralised database, is a string of data blocks that are generated by cryptographic means in association, each data block containing a batch of information of network transactions for verifying the validity of the information (anti-counterfeiting) and generating the next block. The blockchain may include a blockchain underlying platform, a platform product services layer, an application services layer, and the like.
Those skilled in the art will appreciate that implementing all or part of the processes of the methods of the embodiments described above may be accomplished by way of computer readable instructions, stored on a computer readable storage medium, which when executed may comprise processes of embodiments of the methods described above. The storage medium may be a nonvolatile storage medium such as a magnetic disk, an optical disk, a Read-Only Memory (ROM), or a random access Memory (Random Access Memory, RAM).
It should be understood that, although the steps in the flowcharts of the figures are shown in order as indicated by the arrows, these steps are not necessarily performed in order as indicated by the arrows. The steps are not strictly limited in order and may be performed in other orders, unless explicitly stated herein. Moreover, at least some of the steps in the flowcharts of the figures may include a plurality of sub-steps or stages that are not necessarily performed at the same time, but may be performed at different times, the order of their execution not necessarily being sequential, but may be performed in turn or alternately with other steps or at least a portion of the other steps or stages.
With further reference to fig. 3, as an implementation of the method shown in fig. 2, the present application provides an embodiment of a device for classifying unbalanced samples, where the embodiment of the device corresponds to the embodiment of the method shown in fig. 2, and the device is particularly applicable to various electronic devices.
As shown in fig. 3, the apparatus for classifying unbalanced samples according to the present embodiment includes:
the corpus acquisition module 301 is configured to acquire a training corpus from a preset corpus, where the training corpus includes a first training corpus and a second training corpus, the first training corpus is a minority class training corpus, and the second training corpus is a majority class training corpus;
The first model training module 302 is configured to train a preset first classification model through a first training corpus to obtain an initial first classification model;
the inversion training module 303 is configured to adjust a loss function of the initial first classification model based on a preset adjustment rule, import a second training corpus into the initial first classification model with the loss function adjusted, and iteratively update the initial first classification model based on a back propagation algorithm to obtain a first classification model;
the corpus classifying module 304 is configured to receive a corpus classifying instruction, obtain a corpus to be classified corresponding to the corpus classifying instruction, and classify the corpus to be classified through a first classification model.
Further, the language type of the training corpus is a first language, and the device for classifying the unbalanced sample further includes:
the corpus similarity calculation module is used for calculating the corpus similarity of the first training corpus and the second training corpus and comparing the corpus similarity with a preset similarity threshold value;
and the corpus translation module is used for translating the training corpus into a second language when the corpus similarity is greater than or equal to a preset similarity threshold, wherein the corpus similarity of the first training corpus and the second training corpus is smaller than the preset similarity threshold in the language environment of the second language.
Further, the preset first classification model includes an encoding layer and a decoding layer, and the first model training module 302 specifically includes:
the feature extraction unit is used for extracting corpus features of the first training corpus through the coding layer and carrying out vector coding on the corpus features to obtain feature vectors;
the feature mapping unit is used for performing feature mapping on the feature vector and the feature label pre-stored in the decoding layer to obtain a feature mapping result;
the first iteration unit is used for iterating a preset first classification model based on the feature mapping result to obtain an initial first classification model.
Further, the first iteration unit specifically includes:
a first loss function construction subunit, configured to construct a loss function of the initial first classification model, to obtain a first loss function, where the first loss function includes an countermeasure factor;
the mapping error calculation subunit is used for calculating errors of the characteristic mapping result and a preset mapping result based on the first loss function to obtain a mapping error;
and the first iteration subunit is used for iterating the preset first classification model based on the mapping error and the back propagation algorithm to obtain an initial first classification model.
Further, the inversion training module 303 specifically includes:
the function inversion unit is used for inverting the loss function of the initial first classification model based on a preset adjustment rule to obtain a second loss function;
the corpus classifying unit is used for importing the second training corpus into an initial first classifying model with the loss function inverted to obtain a classifying result;
the classification error calculation unit is used for calculating the error between the classification result and the preset classification result based on the second loss function to obtain a classification error;
the second iteration unit is used for comparing the classification error with a preset classification error threshold value, and if the classification error is smaller than or equal to the preset classification error threshold value, the initial first classification model with the inverted loss function is subjected to iterative updating through a back propagation algorithm until the classification error is larger than the preset classification error threshold value;
the model output unit is used for outputting a first classification model with the classification error larger than a preset classification error threshold value.
Further, the apparatus for classifying unbalanced samples further includes:
the vectorization module is used for vectorizing the first training corpus to obtain a corpus vector of the first training corpus;
the vector splicing module is used for carrying out vector splicing on the language material vector and the feature vector to obtain a language material feature matrix;
The convolution operation module is used for importing the corpus feature matrix into a preset second classification model and performing a convolution operation on the corpus feature matrix through the convolution kernels of the second classification model to obtain a convolution operation result;
and the model iteration module is used for carrying out iteration update on the second classification model based on the convolution operation result and outputting the trained second classification model.
Further, the model iteration module specifically includes:
the third loss function construction unit is used for constructing a loss function of the second classification model based on the cross-entropy loss function and the Levenshtein distance function to obtain an initial third loss function, wherein the initial third loss function comprises a cross-entropy factor and a Levenshtein factor;
the weight assignment unit is used for respectively assigning the same initial weight value to the cross-entropy factor and the Levenshtein factor of the initial third loss function;
the third loss function optimizing unit is used for adjusting the initial weight values of the cross-entropy factor and the Levenshtein factor based on the convolution operation result until the output of the initial third loss function reaches the minimum value, so as to obtain the third loss function;
and the third iteration unit is used for calculating the operation error of the convolution operation result through a third loss function, and carrying out iterative updating on the second classification model by adopting a back propagation algorithm based on the operation error to obtain a trained second classification model.
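A minimal sketch of such a third loss function follows. The Levenshtein factor is computed on decoded text and is therefore not differentiable, so gradients reach the model only through the cross-entropy factor, while the weights w_ce and w_lev are tuned as in the optimization unit above; the normalization of the edit distance and all names here are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming Levenshtein (edit) distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[-1] + 1,                # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def third_loss(logits, targets, pred_text, gold_text, w_ce=0.5, w_lev=0.5):
    """Weighted sum of a cross-entropy factor and a Levenshtein factor.

    Equal initial weights (0.5 / 0.5) mirror the weight assignment unit;
    w_ce and w_lev would then be adjusted until the loss output is minimal.
    """
    ce_factor = F.cross_entropy(logits, targets)
    max_len = max(len(pred_text), len(gold_text), 1)
    lev_factor = levenshtein(pred_text, gold_text) / max_len  # in [0, 1]
    return w_ce * ce_factor + w_lev * lev_factor
```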
The application discloses a classification device for unbalanced samples, belonging to the technical field of artificial intelligence. A training corpus is obtained that comprises a first training corpus and a second training corpus, where the first training corpus is the minority-class corpus and the second training corpus is the majority-class corpus. An initial first classification model is trained on the minority-class corpus; its loss function is then inverted according to a preset adjustment rule, the majority-class corpus is imported into the model with the inverted loss function, and the model is iterated to obtain the first classification model. Finally, a corpus to be classified is obtained and classified by the first classification model. By inverting the loss function of the classification model through the preset adjustment rule and then training, an exclusive classification model is obtained: during a classification task it responds and produces output only for the data types corresponding to the minority-class corpus and does not respond to the data types of the majority-class corpus, which improves the accuracy of the unbalanced-sample classification model.
To solve the above technical problems, an embodiment of the application further provides a computer device. Referring specifically to fig. 4, fig. 4 is a basic structural block diagram of the computer device according to this embodiment.
The computer device 4 comprises a memory 41, a processor 42, and a network interface 43 communicatively connected to each other via a system bus. It should be noted that only a computer device 4 having components 41-43 is shown in the figure, but it should be understood that not all of the illustrated components need be implemented, and more or fewer components may be implemented instead. As will be appreciated by those skilled in the art, the computer device here is a device capable of automatically performing numerical calculation and/or information processing according to preset or stored instructions, and its hardware includes, but is not limited to, microprocessors, application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), digital signal processors (DSPs), embedded devices, and the like.
The computer device may be a desktop computer, a notebook computer, a palmtop computer, a cloud server, or other computing device. The computer device can interact with a user through a keyboard, a mouse, a remote control, a touch pad, a voice control device, or the like.
The memory 41 includes at least one type of readable storage medium, including flash memory, hard disk, multimedia card, card-type memory (e.g., SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disk, optical disk, and the like. In some embodiments, the memory 41 may be an internal storage unit of the computer device 4, such as a hard disk or memory of the computer device 4. In other embodiments, the memory 41 may also be an external storage device of the computer device 4, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a Flash Card equipped on the computer device 4. Of course, the memory 41 may also comprise both an internal storage unit and an external storage device of the computer device 4. In this embodiment, the memory 41 is typically used for storing the operating system installed on the computer device 4 and various types of application software, such as computer readable instructions of the method for classifying unbalanced samples. Further, the memory 41 may be used to temporarily store various types of data that have been output or are to be output.
The processor 42 may, in some embodiments, be a central processing unit (CPU), controller, microcontroller, microprocessor, or other data processing chip. The processor 42 is typically used to control the overall operation of the computer device 4. In this embodiment, the processor 42 is configured to execute computer readable instructions stored in the memory 41 or to process data, for example to execute the computer readable instructions of the method for classifying unbalanced samples.
The network interface 43 may comprise a wireless network interface or a wired network interface, which network interface 43 is typically used for establishing a communication connection between the computer device 4 and other electronic devices.
The present application also provides another embodiment, namely a computer-readable storage medium storing computer-readable instructions executable by at least one processor, so as to cause the at least one processor to perform the steps of the method for classifying unbalanced samples as described above.
The application further discloses a storage medium for classifying unbalanced samples, belonging to the technical field of artificial intelligence. A training corpus is obtained that comprises a first training corpus (the minority-class corpus) and a second training corpus (the majority-class corpus). An initial first classification model is trained on the minority-class corpus; its loss function is inverted through a preset adjustment rule, the majority-class corpus is imported into the model with the inverted loss function, and the model is iterated to obtain the first classification model, which is then used to classify the corpus to be classified. By inverting the loss function through the preset adjustment rule and then training, an exclusive classification model is obtained that responds only to the data types of the minority-class corpus and not to those of the majority-class corpus, improving the accuracy of the unbalanced-sample classification model.
From the above description of the embodiments, it will be clear to those skilled in the art that the methods of the above embodiments may be implemented by software plus a necessary general-purpose hardware platform, or by hardware, although in many cases the former is the preferred implementation. Based on such understanding, the technical solution of the present application, in essence or in the part contributing to the prior art, may be embodied in the form of a software product stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) comprising instructions for causing a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, a network device, or the like) to perform the methods of the embodiments of the present application.
The above-described embodiments are only some, not all, of the embodiments of the present application, and the preferred embodiments shown in the drawings do not limit the scope of the claims. The application may be embodied in many different forms; these embodiments are provided so that the disclosure will be thorough and complete. Although the application has been described in detail with reference to the foregoing embodiments, those skilled in the art may still modify the technical solutions described in the foregoing embodiments or substitute equivalents for some of their features. All equivalent structures made using the content of the specification and drawings of the application, whether applied directly or indirectly in other related technical fields, likewise fall within the scope of protection of the application.

Claims (8)

1. A method for classifying an unbalanced sample, comprising:
obtaining training corpus from a preset corpus, wherein the training corpus comprises a first training corpus and a second training corpus, the first training corpus is a minority training corpus, and the second training corpus is a majority training corpus;
training a preset first classification model through the first training corpus to obtain an initial first classification model;
adjusting the loss function of the initial first classification model based on a preset adjustment rule, importing the second training corpus into the initial first classification model with the loss function adjusted, and iteratively updating the initial first classification model based on a back propagation algorithm to obtain a first classification model;
receiving a corpus classification instruction, obtaining a corpus to be classified corresponding to the corpus classification instruction, and classifying the corpus to be classified through the first classification model;
the language type of the training corpus is a first language, and after the step of obtaining the training corpus from the preset corpus, the method further comprises the following steps:
calculating the corpus similarity of the first training corpus and the second training corpus, and comparing the corpus similarity with a preset similarity threshold;
if the corpus similarity is greater than or equal to the preset similarity threshold, translating the training corpus into a second language, wherein in the language environment of the second language the corpus similarity of the first training corpus and the second training corpus is smaller than the preset similarity threshold;
the step of adjusting the loss function of the initial first classification model based on a preset adjustment rule, importing the second training corpus into the initial first classification model with the loss function adjusted, and iteratively updating the initial first classification model based on a back propagation algorithm to obtain a first classification model specifically comprises:
inverting the loss function of the initial first classification model based on a preset adjustment rule to obtain a second loss function;
importing the second training corpus into the initial first classification model with the loss function inverted, to obtain a classification result;
calculating the error between the classification result and a preset classification result based on the second loss function to obtain a classification error;
comparing the classification error with a preset classification error threshold, and if the classification error is smaller than or equal to the preset classification error threshold, iteratively updating the initial first classification model with the inverted loss function through a back propagation algorithm until the classification error is larger than the preset classification error threshold;
outputting the first classification model whose classification error is larger than the preset classification error threshold;
inverting the loss function of the initial first classification model based on a preset adjustment rule to obtain a second loss function, wherein the method specifically comprises the following steps:
subtracting the loss function of the initial first classification model from the value 1 to obtain the second loss function.
2. The method for classifying unbalanced samples of claim 1, wherein the predetermined first classification model includes an encoding layer and a decoding layer, and the step of training the predetermined first classification model by the first training corpus to obtain an initial first classification model specifically includes:
extracting corpus characteristics of the first training corpus through the coding layer, and carrying out vector coding on the corpus characteristics to obtain feature vectors;
performing feature mapping on the feature vector and feature labels pre-stored in the decoding layer to obtain a feature mapping result;
and iterating the preset first classification model based on the feature mapping result to obtain an initial first classification model.
3. The method for classifying an unbalanced sample of claim 2, wherein the step of iterating the preset first classification model based on the feature mapping result to obtain an initial first classification model specifically comprises:
constructing a loss function of the initial first classification model to obtain a first loss function, wherein the first loss function comprises a countermeasure (adversarial) factor;
calculating the error between the characteristic mapping result and a preset mapping result based on a first loss function to obtain a mapping error;
and iterating the preset first classification model based on the mapping error and a back propagation algorithm to obtain an initial first classification model.
4. The method for classifying an unbalanced sample of claim 2, wherein after the step of adjusting the loss function of the initial first classification model based on a preset adjustment rule, importing the second training corpus into the initial first classification model with the loss function adjusted, and iteratively updating the initial first classification model based on a back propagation algorithm to obtain a first classification model, the method further comprises:
vectorizing the first training corpus to obtain a corpus vector of the first training corpus;
vector stitching is carried out on the corpus vector and the feature vector, so that a corpus feature matrix is obtained;
importing the corpus feature matrix into a preset second classification model, and performing a convolution operation on the corpus feature matrix through the convolution kernel of the second classification model to obtain a convolution operation result;
and iteratively updating the second classification model based on the convolution operation result, and outputting the trained second classification model.
5. The method for classifying unbalanced samples of claim 4 wherein the step of iteratively updating the second classification model based on the convolution operation result and outputting the trained second classification model comprises the steps of:
constructing a loss function of the second classification model based on a cross entropy loss function and a Levenshtein distance function to obtain an initial third loss function, wherein the initial third loss function comprises a cross entropy factor and a Levenshtein factor;
assigning the same initial weight value to the cross entropy factor and the Levenshtein factor of the initial third loss function respectively;
adjusting the initial weight values of the cross entropy factor and the Levenshtein factor based on the convolution operation result until the output of the initial third loss function reaches its minimum value, so as to obtain the third loss function;
and calculating the operation error of the convolution operation result through the third loss function, and carrying out iterative updating on the second classification model by adopting a back propagation algorithm based on the operation error to obtain the trained second classification model.
6. A classification apparatus for unbalanced samples, wherein the classification apparatus for unbalanced samples implements the classification method for unbalanced samples according to any one of claims 1 to 5, the classification apparatus for unbalanced samples comprising:
the corpus acquisition module is used for acquiring training corpuses from a preset corpus, wherein the training corpuses comprise first training corpuses and second training corpuses, the first training corpuses are minority class training corpuses, and the second training corpuses are majority class training corpuses;
the first model training module is used for training a preset first classification model through the first training corpus to obtain an initial first classification model;
the inversion training module is used for adjusting the loss function of the initial first classification model based on a preset adjustment rule, importing the second training corpus into the initial first classification model with the loss function adjusted, and carrying out iterative update on the initial first classification model based on a back propagation algorithm to obtain a first classification model;
the corpus classifying module is used for receiving the corpus classifying instruction, acquiring the corpus to be classified corresponding to the corpus classifying instruction, and classifying the corpus to be classified through the first classifying model.
7. An apparatus, comprising a memory and a processor, the memory having stored therein computer readable instructions which, when executed by the processor, implement the steps of the method for classifying unbalanced samples as claimed in any one of claims 1 to 5.
8. A computer readable storage medium having stored thereon computer readable instructions which, when executed by a processor, implement the steps of the method for classifying unbalanced samples according to any one of claims 1 to 5.
CN202011617671.4A 2020-12-30 2020-12-30 Method, device, equipment and storage medium for classifying unbalanced samples Active CN112732913B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202011617671.4A CN112732913B (en) 2020-12-30 2020-12-30 Method, device, equipment and storage medium for classifying unbalanced samples
PCT/CN2021/090432 WO2022142010A1 (en) 2020-12-30 2021-04-28 Method and apparatus for classifying unbalanced samples, device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011617671.4A CN112732913B (en) 2020-12-30 2020-12-30 Method, device, equipment and storage medium for classifying unbalanced samples

Publications (2)

Publication Number Publication Date
CN112732913A CN112732913A (en) 2021-04-30
CN112732913B true CN112732913B (en) 2023-08-22

Family

ID=75610366

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011617671.4A Active CN112732913B (en) 2020-12-30 2020-12-30 Method, device, equipment and storage medium for classifying unbalanced samples

Country Status (2)

Country Link
CN (1) CN112732913B (en)
WO (1) WO2022142010A1 (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109784421A (en) * 2019-01-30 2019-05-21 北京朗镜科技有限责任公司 A kind of construction method and device of identification model
CN109815332A (en) * 2019-01-07 2019-05-28 平安科技(深圳)有限公司 Loss function optimization method, device, computer equipment and storage medium
CN110163261A (en) * 2019-04-28 2019-08-23 平安科技(深圳)有限公司 Unbalanced data disaggregated model training method, device, equipment and storage medium
CN110276075A (en) * 2019-06-21 2019-09-24 腾讯科技(深圳)有限公司 Model training method, name entity recognition method, device, equipment and medium
CN111191685A (en) * 2019-12-13 2020-05-22 山东众阳健康科技集团有限公司 Method for dynamically weighting loss function
CN111310814A (en) * 2020-02-07 2020-06-19 支付宝(杭州)信息技术有限公司 Method and device for training business prediction model by utilizing unbalanced positive and negative samples
CN111368903A (en) * 2020-02-28 2020-07-03 深圳前海微众银行股份有限公司 Model performance optimization method, device, equipment and storage medium
CN112054967A (en) * 2020-08-07 2020-12-08 北京邮电大学 Network traffic classification method and device, electronic equipment and storage medium

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106095928B (en) * 2016-06-12 2019-10-29 国家计算机网络与信息安全管理中心 A kind of event type recognition methods and device
US20190156204A1 (en) * 2017-11-20 2019-05-23 Koninklijke Philips N.V. Training a neural network model
US11144581B2 (en) * 2018-07-26 2021-10-12 International Business Machines Corporation Verifying and correcting training data for text classification
CN110569508A (en) * 2019-09-10 2019-12-13 重庆邮电大学 Method and system for classifying emotional tendencies by fusing part-of-speech and self-attention mechanism
CN111400443B (en) * 2020-03-04 2023-10-20 北京小米松果电子有限公司 Information processing method, device and storage medium


Also Published As

Publication number Publication date
WO2022142010A1 (en) 2022-07-07
CN112732913A (en) 2021-04-30

Similar Documents

Publication Publication Date Title
CN113722438B (en) Sentence vector generation method and device based on sentence vector model and computer equipment
CN112287069B (en) Information retrieval method and device based on voice semantics and computer equipment
WO2023045605A1 (en) Data processing method and apparatus, computer device, and storage medium
CN110263304B (en) Statement encoding method, statement decoding method, device, storage medium and equipment
CN112085091B (en) Short text matching method, device, equipment and storage medium based on artificial intelligence
CN112686022A (en) Method and device for detecting illegal corpus, computer equipment and storage medium
CN114676704A (en) Sentence emotion analysis method, device and equipment and storage medium
CN113505601A (en) Positive and negative sample pair construction method and device, computer equipment and storage medium
CN112084752A (en) Statement marking method, device, equipment and storage medium based on natural language
CN113947095A (en) Multilingual text translation method and device, computer equipment and storage medium
CN112528029A (en) Text classification model processing method and device, computer equipment and storage medium
CN115438149A (en) End-to-end model training method and device, computer equipment and storage medium
CN113723077B (en) Sentence vector generation method and device based on bidirectional characterization model and computer equipment
CN112598039B (en) Method for obtaining positive samples in NLP (non-linear liquid) classification field and related equipment
CN113626576A (en) Method and device for extracting relational characteristics in remote supervision, terminal and storage medium
CN113434636A (en) Semantic-based approximate text search method and device, computer equipment and medium
CN113420869B (en) Translation method based on omnidirectional attention and related equipment thereof
CN112732913B (en) Method, device, equipment and storage medium for classifying unbalanced samples
CN115730237A (en) Junk mail detection method and device, computer equipment and storage medium
CN113505595A (en) Text phrase extraction method and device, computer equipment and storage medium
CN114091451A (en) Text classification method, device, equipment and storage medium
CN112417886A (en) Intention entity information extraction method and device, computer equipment and storage medium
CN112199954A (en) Disease entity matching method and device based on voice semantics and computer equipment
CN116756596B (en) Text clustering model training method, text clustering device and related equipment
CN114238574B (en) Intention recognition method based on artificial intelligence and related equipment thereof

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant