CN112070162A - Multi-class processing task training sample construction method, device and medium - Google Patents

Info

Publication number
CN112070162A
Authority
CN
China
Prior art keywords
preset
category
sample data
loss
categories
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010936484.6A
Other languages
Chinese (zh)
Inventor
张超
吴海山
殷磊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
WeBank Co Ltd
Original Assignee
WeBank Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by WeBank Co Ltd
Priority to CN202010936484.6A
Publication of CN112070162A

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Software Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a method, a device and a medium for constructing training samples for multi-class processing tasks. The method comprises the following steps: acquiring the true probability distribution of sample data over a plurality of preset categories, and determining the predicted probability distribution of the sample data; determining a loss list of the sample data over the plurality of preset categories according to the true probability distribution and the predicted probability distribution; determining a mask list according to the loss list and the true probability distribution, and determining a positive category loss and a plurality of negative category losses of the sample data according to the mask list and the loss list; and determining, according to the positive category loss and the plurality of negative category losses, a positive category and a plurality of negative categories to which the sample data belongs among the plurality of preset categories, and constructing positive and negative samples of the plurality of preset categories according to the positive category and the plurality of negative categories. Because the positive and negative samples of the preset categories are constructed from the positive category loss and the negative category losses of multiple pieces of sample data, the sample data is better balanced among the preset categories.

Description

Multi-class processing task training sample construction method, device and medium
Technical Field
The invention relates to the technical field of financial technology (Fintech), in particular to a method, equipment and medium for constructing multi-class processing task training samples.
Background
With the continuous development of financial technology (Fintech), especially internet technology and finance, more and more technologies (such as artificial intelligence, big data and cloud storage) are applied in the financial field. The financial field in turn places higher requirements on these technologies; for example, the sample data used in artificial intelligence is required to be more balanced.
At present, for a multi-class classification task, the data collected for each class is used directly as the sample data of the task. However, the amount of data collected for different classes is usually hard to balance: little data can be collected for classes that involve privacy or are unpopular, while a large amount is collected for popular classes. Consequently, when a multi-class classification task is executed on such unbalanced samples, accuracy is low for the classes with few samples.
Therefore, how to construct balanced samples in a multi-class classification task, so as to ensure the accuracy of the task, is a technical problem to be solved at present.
Disclosure of Invention
The invention mainly aims to provide a method, equipment and a medium for constructing multi-class processing task training samples, and aims to solve the technical problem of how to construct balanced samples in multi-class classification tasks in the prior art.
In order to achieve the above object, the present invention provides a method for constructing multi-class processing task training samples, comprising the steps of:
acquiring real probability distribution of sample data on a plurality of preset categories, and determining the predicted probability distribution of the sample data on the plurality of preset categories;
determining a loss list of the sample data on a plurality of preset categories according to the real probability distribution and the prediction probability distribution;
determining a mask list according to the loss list and the real probability distribution, and determining a positive category loss and a plurality of negative category losses of the sample data according to the mask list and the loss list;
and according to the positive category loss and the negative category losses, determining a positive category and a plurality of negative categories to which the sample data belongs in a plurality of preset categories, respectively constructing positive and negative samples of the plurality of preset categories according to the positive category and the plurality of negative categories, and generating a multi-category classification model based on the positive and negative samples of the plurality of preset categories to classify the categories.
Optionally, after the step of constructing positive and negative samples of a plurality of preset categories according to the positive category and the negative categories, respectively, the method further includes:
training a preset multi-class model based on the positive and negative samples of the plurality of preset categories to generate a multi-class classification model;
and when data to be classified is received, performing class classification on the data to be classified based on the multi-class classification model, and determining the class to which the data to be classified belongs.
Optionally, the step of determining a mask list according to the loss list and the true probability distribution includes:
sorting all values in the loss list to obtain a probability sequence, and selecting from the probability sequence the target probabilities ranked within a preset number of leading positions;
determining the position of each target probability in the loss list, and updating the loss list according to these positions;
and performing summation operation on the updated loss list and the real probability distribution to generate a mask list.
Optionally, the step of determining a positive category loss and a plurality of negative category losses of the sample data according to the mask list and the loss list includes:
performing a product operation on the mask list and the loss list before updating to generate a product result list;
determining the positive category loss from the product result list according to the position of the true probability in the true probability distribution;
and determining the plurality of negative category losses from the product result list according to the positions of the target probabilities.
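The mask-list and loss-splitting steps above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the patent's definitive procedure: "updating the loss list" is read here as writing 1 at the selected positions and 0 elsewhere, the ranking is restricted to the non-true classes so that the true class is not selected twice, and the names `build_mask`, `split_losses` and `k` are hypothetical.

```python
import numpy as np

def build_mask(losses, true_dist, k):
    """Return (mask, top_k): mask is 1 at the true class and at the k
    largest-loss negative classes, 0 elsewhere (assumed interpretation)."""
    losses = np.asarray(losses, dtype=float)
    true_dist = np.asarray(true_dist, dtype=float)
    k = min(k, losses.size - 1)  # at most C - 1 negative classes exist
    # Assumption: rank only the negative classes, so hide the true class.
    ranked = np.where(true_dist == 1.0, -np.inf, losses)
    top_k = np.argsort(ranked)[::-1][:k]   # positions of the k largest losses
    updated = np.zeros_like(losses)        # "updated loss list" as an indicator
    updated[top_k] = 1.0
    mask = updated + true_dist             # summation with the true distribution
    return mask, top_k

def split_losses(losses, true_dist, k):
    """Return (positive category loss, array of negative category losses)."""
    losses = np.asarray(losses, dtype=float)
    true_dist = np.asarray(true_dist, dtype=float)
    mask, top_k = build_mask(losses, true_dist, k)
    product = mask * losses                # product with the un-updated loss list
    pos_idx = int(np.argmax(true_dist))    # position of the true probability
    return product[pos_idx], product[top_k]

losses = np.array([0.22, 0.69, 0.36, 0.05])
true_dist = np.array([0.0, 1.0, 0.0, 0.0])
mask, top_k = build_mask(losses, true_dist, k=2)
pos_loss, neg_losses = split_losses(losses, true_dist, k=2)
```

With these inputs the mask keeps the true class (position 1) and the two negative classes with the largest loss (positions 2 and 0), and the remaining class contributes nothing.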
Optionally, the step of determining a loss list of the sample data in a plurality of preset categories according to the true probability distribution and the predicted probability distribution comprises:
respectively determining a real probability value and a prediction probability value of the sample data on each preset category according to the real probability distribution and the prediction probability distribution;
calculating the real probability value and the prediction probability value of the sample data on each preset category based on a preset loss formula to obtain a loss value of the sample data on each preset category;
and arranging each loss value according to the arrangement sequence of each preset category in the plurality of preset categories to obtain the loss list.
Optionally, the step of determining a predictive probability distribution of the sample data over a plurality of preset categories comprises:
and obtaining output values corresponding to the sample data and a plurality of preset categories respectively, and mapping the output values respectively according to a preset function to obtain the predicted probability distribution of the sample data on the preset categories.
Optionally, before the step of obtaining the true probability distribution of the sample data on the plurality of preset categories, the method further includes:
obtaining sample data and a sample label corresponding to the sample data, and encoding the sample label based on a preset encoding mode to generate the true probability distribution of the sample data over the plurality of preset categories.
Optionally, the step of determining, according to the positive category loss and the multiple negative category losses, a positive category and multiple negative categories to which the sample data belongs in multiple preset categories includes:
training an initial model according to the positive category loss and the negative category losses, and acquiring a gradient corresponding to the initial model;
and when the gradient is smaller than a preset threshold value, finishing the training of the initial model, and determining the positive category and the negative categories according to the mapping attributes of the sample data in a plurality of preset categories when the training of the initial model is finished.
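The gradient-based stopping rule above can be illustrated with a small sketch. It assumes a linear softmax classifier as the "initial model" and uses plain cross-entropy as a stand-in for the positive and negative category losses; the function names, learning rate, threshold and step cap are all hypothetical choices made for this example.

```python
import numpy as np

def softmax_rows(z):
    """Row-wise softmax with max-subtraction for numerical stability."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def train_until_small_gradient(X, Y, lr=0.5, threshold=1e-2, max_steps=10_000):
    """Gradient descent that stops once the gradient norm drops below
    `threshold`, or after `max_steps` as a safety cap (an added assumption)."""
    n, d = X.shape
    W = np.zeros((d, Y.shape[1]))
    grad_norm = np.inf
    for step in range(1, max_steps + 1):
        P = softmax_rows(X @ W)
        grad = X.T @ (P - Y) / n          # gradient of mean cross-entropy w.r.t. W
        grad_norm = float(np.linalg.norm(grad))
        if grad_norm < threshold:         # training is considered finished
            break
        W -= lr * grad
    return W, grad_norm, step

# Tiny two-class toy data set with one-hot labels.
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.2], [0.2, 1.0]])
Y = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]])
W, grad_norm, steps = train_until_small_gradient(X, Y)
predicted = np.argmax(X @ W, axis=1)
```

The step cap is not part of the described method; it only guards the sketch against running indefinitely if the threshold is never reached.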
Further, to achieve the above object, the present invention further provides a multi-class processing task training sample constructing device, including:
an acquisition module, configured to acquire the true probability distribution of sample data over a plurality of preset categories and to determine the predicted probability distribution of the sample data over the plurality of preset categories;
a first determining module, configured to determine a loss list of the sample data in multiple preset categories according to the true probability distribution and the predicted probability distribution;
a second determining module, configured to determine a mask list according to the loss list and the true probability distribution, and determine a positive category loss and multiple negative category losses of the sample data according to the mask list and the loss list;
and the construction module is used for determining the positive category and the negative categories to which the sample data belongs in the plurality of preset categories according to the positive category loss and the plurality of negative category losses, respectively constructing positive and negative samples of the plurality of preset categories according to the positive category and the plurality of negative categories, and generating a multi-category classification model based on the positive and negative samples of the plurality of preset categories to classify the categories.
Further, in order to achieve the above object, the present invention further provides a multi-class processing task training sample construction device, which includes a memory, a processor, and a multi-class processing task training sample construction program stored in the memory and executable on the processor. When executed by the processor, the multi-class processing task training sample construction program implements the steps of the multi-class processing task training sample construction method described above.
Further, to achieve the above object, the present invention also provides a medium, in which a multi-class processing task training sample construction program is stored, and when being executed by a processor, the multi-class processing task training sample construction program implements the steps of the multi-class processing task training sample construction method as described above.
In the prior art, unbalanced sample data across classes makes the execution of multi-class classification tasks inaccurate. By contrast, the multi-class processing task training sample construction method, device and medium of the invention first acquire the true probability distribution of the sample data over a plurality of preset categories and determine the predicted probability distribution of the sample data over the plurality of preset categories; determine a loss list of the sample data over the plurality of preset categories according to the true probability distribution and the predicted probability distribution; determine a mask list according to the loss list and the true probability distribution; determine a positive category loss and a plurality of negative category losses of the sample data from the mask list and the loss list; determine, from the positive category loss and the plurality of negative category losses, a positive category and a plurality of negative categories to which the sample data belongs among the plurality of preset categories; and construct positive and negative samples of the plurality of preset categories according to the positive category and the plurality of negative categories, so that a multi-class classification model for classification can be generated from the positive and negative samples of the plurality of preset categories.
The mask list reflects the positive probability that the sample data belongs to a given one of the preset categories and the negative probability that it does not. Filtering the loss list with the mask list yields a positive category loss and a plurality of negative category losses, from which the positive category and the plurality of negative categories of the sample data among the preset categories are determined. Different sample data belongs to different positive and negative categories, so each preset category obtains its own positive samples and a plurality of negative samples. The positive and negative samples of the plurality of preset categories are thus constructed, each preset category includes both positive samples and negative samples, and the sample data is balanced among the preset categories. This overcomes the sample imbalance among classes in prior-art multi-class classification tasks and improves the accuracy of executing such tasks.
Drawings
FIG. 1 is a schematic structural diagram of an apparatus hardware operating environment according to an embodiment of the present invention;
FIG. 2 is a flowchart illustrating a multi-class processing task training sample construction method according to a first embodiment of the present invention;
FIG. 3 is a functional block diagram of an apparatus for constructing multi-class processing task training samples according to a preferred embodiment of the present invention.
The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The invention provides a multi-class processing task training sample construction device, which comprises a risk control device and at least one consumption device communicatively connected to the risk control device. Referring to FIG. 1, FIG. 1 is a schematic structural diagram of the device hardware operating environment involved in an embodiment of the multi-class processing task training sample construction device.
As shown in FIG. 1, the multi-class processing task training sample construction device may include: a processor 1001, such as a CPU, a communication bus 1002, a user interface 1003, a network interface 1004, and a memory 1005. The communication bus 1002 is used to enable connection and communication between these components. The user interface 1003 may include a display screen (Display) and an input unit such as a keyboard (Keyboard), and may optionally also include a standard wired interface and a wireless interface. The network interface 1004 may optionally include a standard wired interface and a wireless interface (e.g., a WI-FI interface). The memory 1005 may be a high-speed RAM memory or a non-volatile memory (e.g., a magnetic disk memory). The memory 1005 may alternatively be a storage device separate from the processor 1001 described above.
Those skilled in the art will appreciate that the hardware architecture of the multi-class processing task training sample construction apparatus illustrated in FIG. 1 does not constitute a limitation of the multi-class processing task training sample construction apparatus, and may include more or fewer components than those illustrated, or some components in combination, or a different arrangement of components.
As shown in FIG. 1, the memory 1005, which is a medium, may include an operating system, a network communication module, a user interface module, and a multi-class processing task training sample construction program. The operating system is a program that manages and controls the hardware and software resources of the multi-class processing task training sample construction device, and supports the operation of the network communication module, the user interface module, the multi-class processing task training sample construction program and other programs or software; the network communication module is used to manage and control the network interface 1004; the user interface module is used to manage and control the user interface 1003.
In the hardware structure of the multi-class processing task training sample construction device shown in FIG. 1, the network interface 1004 is mainly used for connecting to a background server and performing data communication with the background server; the user interface 1003 is mainly used for connecting to a client (user side) and performing data communication with the client; the processor 1001 may call the multi-class processing task training sample construction program stored in the memory 1005 and perform the following operations:
acquiring real probability distribution of sample data on a plurality of preset categories, and determining the predicted probability distribution of the sample data on the plurality of preset categories;
determining a loss list of the sample data on a plurality of preset categories according to the real probability distribution and the prediction probability distribution;
determining a mask list according to the loss list and the real probability distribution, and determining a positive category loss and a plurality of negative category losses of the sample data according to the mask list and the loss list;
and according to the positive category loss and the negative category losses, determining a positive category and a plurality of negative categories to which the sample data belongs in a plurality of preset categories, respectively constructing positive and negative samples of the plurality of preset categories according to the positive category and the plurality of negative categories, and generating a multi-category classification model based on the positive and negative samples of the plurality of preset categories to classify the categories.
Further, after the step of constructing the positive and negative samples of the preset categories according to the positive category and the negative categories, respectively, the processor 1001 may call a multi-category processing task training sample construction program stored in the memory 1005, and perform the following operations:
training a preset multi-class model based on the positive and negative samples of the plurality of preset categories to generate a multi-class classification model;
and when data to be classified is received, performing class classification on the data to be classified based on the multi-class classification model, and determining the class to which the data to be classified belongs.
Further, the step of determining a mask list according to the loss list and the true probability distribution includes:
sorting all values in the loss list to obtain a probability sequence, and selecting from the probability sequence the target probabilities ranked within a preset number of leading positions;
determining the position of each target probability in the loss list, and updating the loss list according to these positions;
and performing a summation operation on the updated loss list and the true probability distribution to generate the mask list.
Further, the step of determining a positive category loss and a plurality of negative category losses of the sample data according to the mask list and the loss list comprises:
performing a product operation on the mask list and the loss list before updating to generate a product result list;
determining the positive category loss from the product result list according to the position of the true probability in the true probability distribution;
and determining the plurality of negative category losses from the product result list according to the positions of the target probabilities.
Further, the step of determining a loss list of the sample data on a plurality of preset categories according to the true probability distribution and the predicted probability distribution comprises:
respectively determining a real probability value and a prediction probability value of the sample data on each preset category according to the real probability distribution and the prediction probability distribution;
calculating the real probability value and the prediction probability value of the sample data on each preset category based on a preset loss formula to obtain a loss value of the sample data on each preset category;
and arranging each loss value according to the arrangement sequence of each preset category in the plurality of preset categories to obtain the loss list.
Further, the step of determining a predictive probability distribution of the sample data over a plurality of preset classes comprises:
and obtaining output values corresponding to the sample data and a plurality of preset categories respectively, and mapping the output values respectively according to a preset function to obtain the predicted probability distribution of the sample data on the preset categories.
Further, before the step of obtaining the true probability distribution of the sample data on the multiple preset categories, the processor 1001 may call a multi-category processing task training sample construction program stored in the memory 1005, and perform the following operations:
obtaining sample data and a sample label corresponding to the sample data, and encoding the sample label based on a preset encoding mode to generate the true probability distribution of the sample data over the plurality of preset categories.
Further, the step of determining, according to the positive category loss and the negative category losses, a positive category and negative categories to which the sample data belongs in a plurality of preset categories includes:
training an initial model according to the positive category loss and the negative category losses, and acquiring a gradient corresponding to the initial model;
and when the gradient is smaller than a preset threshold value, finishing the training of the initial model, and determining the positive category and the negative categories according to the mapping attributes of the sample data in a plurality of preset categories when the training of the initial model is finished.
The specific implementation of the multi-class processing task training sample construction device of the present invention is substantially the same as the following embodiments of the multi-class processing task training sample construction method, and is not described herein again.
The invention also provides a multi-class processing task training sample construction method.
Referring to fig. 2, fig. 2 is a flowchart illustrating a method for constructing a multi-class processing task training sample according to a first embodiment of the present invention.
While a logical order is shown in the flow chart, in some cases, the steps shown or described may be performed in an order different than that shown. Specifically, the method for constructing the training samples of the multi-class processing tasks in this embodiment includes:
Step S10, acquiring the true probability distribution of the sample data on a plurality of preset categories, and determining the predicted probability distribution of the sample data on the plurality of preset categories;
In the method for constructing multi-class processing task training samples of this embodiment, positive and negative samples, comprising positive samples and negative samples, are constructed for each category of the multi-class classification task. In a multi-class classification task, each sample belongs to exactly one of a plurality of categories. Each category involved in the task is set in advance as a preset category, and each piece of sample data carries a preset label indicating its category. Across the preset categories, the category of the sample data is represented by a true probability distribution consisting of a plurality of probability values: the category indicated by the label takes one probability value and every other category takes another, and these values together form the true probability distribution. For example, if the sample data belongs to category i, and the category it belongs to is represented by probability value 1 while the other categories are represented by probability value 0, then for C categories [1, ..., i, ..., C] the true probability distribution is [0, ..., 1, ..., 0].
It can be understood that the sample data indicates its category in the form of a label, whereas the true probability distribution is numerical, so the label must be converted into numerical values to form the true probability distribution. Specifically, before the step of obtaining the true probability distribution of the sample data on the multiple preset categories, the method further includes:
step a, obtaining sample data and a sample label corresponding to the sample data, coding the sample label based on a preset coding mode, and generating the true probability distribution of the sample data on a plurality of preset categories.
Specifically, the sample data and the sample label it carries are obtained, and the sample label is encoded in a preset encoding mode over the plurality of preset categories to obtain the true probability distribution of the sample data over the plurality of preset categories. The preset encoding mode is preferably one-hot encoding; the true probability distribution vector obtained in this way can be called a one-hot vector. For C preset categories, the true probability distribution is a vector of length C in which only position i, corresponding to the true category i, has the value 1, and all other positions have the value 0.
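As a minimal sketch of the one-hot encoding described above, assuming labels are given as zero-based integer indices (the function name `one_hot` and the example values are illustrative, not taken from the patent):

```python
import numpy as np

def one_hot(label_index: int, num_classes: int) -> np.ndarray:
    """Return a length-C vector with 1.0 at the true class and 0.0 elsewhere."""
    if not 0 <= label_index < num_classes:
        raise ValueError("label_index must be in [0, num_classes)")
    vec = np.zeros(num_classes, dtype=float)
    vec[label_index] = 1.0
    return vec

# A sample whose label is the third of five preset categories.
true_dist = one_hot(label_index=2, num_classes=5)
```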
Further, the generated true probability distribution of the sample data on the plurality of preset categories is acquired, and the predicted probability distribution of the sample data on the plurality of preset categories is determined. The predicted probability distribution represents a prediction for the sample data, namely the likelihood that the sample data belongs to each preset category. Specifically, the step of determining the predicted probability distribution of the sample data on a plurality of preset categories includes:
Step b, acquiring output values corresponding to the sample data and the plurality of preset categories respectively, and mapping the output values respectively according to a preset function to obtain the predicted probability distribution of the sample data on the preset categories.
Furthermore, a model for prediction is preset; the model predicts the category of the sample data and produces an output value for each of the plurality of preset categories. A preset function for numerically mapping the output values is also preset. Each output value is acquired and mapped by the preset function to a real number between 0 and 1, and these real numbers form the predicted probability distribution of the sample data on the plurality of preset categories. The preset function is preferably the softmax function, which maps each output value to a real number between 0 and 1 and normalizes the mapped values so that they sum to 1; this ensures that the probabilities of the sample data over the categories sum to 1, reflecting that the sample data belongs to exactly one of the categories. For example, for C preset categories, the predicted probability distribution can be written as [y1, y2, ..., yi, ..., yC].
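The softmax mapping can be sketched as follows. The subtraction of the maximum is a common numerical-stability measure added here as an assumption; it does not change the result. The raw output values are made up for illustration.

```python
import numpy as np

def softmax(outputs: np.ndarray) -> np.ndarray:
    """Map raw model outputs to values in (0, 1) that sum to 1."""
    shifted = outputs - np.max(outputs)  # avoids overflow in exp
    exp = np.exp(shifted)
    return exp / exp.sum()

# Hypothetical raw outputs of a model for three preset categories.
pred_dist = softmax(np.array([2.0, 1.0, 0.1]))
```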
Step S20, determining a loss list of the sample data on a plurality of preset categories according to the real probability distribution and the prediction probability distribution;
further, after the true probability distribution and the predicted probability distribution of the sample data on each preset category are obtained, a loss list of the sample data on the plurality of preset categories can be determined according to the true probability distribution and the predicted probability distribution. And the loss list represents the predicted error of the sample data on a plurality of preset categories. Specifically, the step of determining a loss list of sample data in a plurality of preset categories according to the true probability distribution and the predicted probability distribution includes:
step S21, respectively determining a real probability value and a predicted probability value of the sample data on each preset category according to the real probability distribution and the predicted probability distribution;
step S22, calculating the real probability value and the prediction probability value of the sample data on each preset category based on a preset loss formula to obtain a loss value of the sample data on each preset category;
step S23, ranking each loss value according to the ranking order of each preset category in the preset categories, to obtain the loss list.
Understandably, the true probability distribution contains the true probability value of the sample data on each preset category, and the predicted probability distribution contains the predicted probability value of the sample data on each preset category. For each preset category, the true probability value is read from the true probability distribution and the predicted probability value is read from the predicted probability distribution. The two values are then evaluated with a preset loss formula, such as a cross-entropy formula, to obtain the loss value of the sample data on that category. For example, for the i-th of the C preset categories, if the value at the i-th position of the true probability distribution is y and the value at the i-th position of the predicted probability distribution is y', then the true probability value on the i-th category is y, the predicted probability value is y', and the preset loss formula is applied to y and y' to obtain the loss value of the sample data on the i-th category. The preset loss formula comprises a basic formula and modified forms of the basic formula, wherein the basic formula is shown as the following formula (1):
[Formula (1): image BDA0002672100320000111 in the original publication]
where L represents the loss value, y represents the true probability value, and y' represents the predicted probability value.
The above formula (1) is modified to obtain the following formulas (2), (3) and (4):
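[Formula (2): image BDA0002672100320000112 in the original publication]
[Formula (3): image BDA0002672100320000113 in the original publication]
CE(p_t) = -log(p_t) (4);
where the quantities defined by formulas (2) to (4), including p_t and CE(p_t), each represent a loss value, and p represents the predicted probability value.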
From the modified formulas above, the loss value of the sample data on the i-th of the preset categories is expressed as the following formula (5):
CE(p_{i,t}) = -log(p_{i,t}) (5);
Further, the loss values calculated for the preset categories are arranged in the order in which the preset categories are arranged, and the resulting sequence is the loss list of the sample data over the plurality of preset categories. The loss list is expressed as the following formula (6):
SoftmaxCrossEntropyLossList = [-log(p_{1,t}), ..., -log(p_{i,t}), ..., -log(p_{C,t})] (6);
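The loss list of formula (6) can be sketched as follows. Because formulas (1) to (3) appear only as images, this sketch assumes the common binary cross-entropy convention, in which p_t equals the predicted probability when the true value is 1 and one minus the predicted probability otherwise. The function name loss_list and the clamping constant eps are likewise assumptions.

```python
import math


def loss_list(true_dist, pred_dist, eps=1e-12):
    """Per-category losses -log(p_t), kept in the order of the preset categories."""
    losses = []
    for y, p in zip(true_dist, pred_dist):
        p_t = p if y == 1 else 1.0 - p           # assumed definition of p_t
        losses.append(-math.log(max(p_t, eps)))  # clamp to avoid log(0)
    return losses


# Hypothetical example with C = 3 categories; the true class is the 2nd.
losses = loss_list([0, 1, 0], [0.2, 0.7, 0.1])
print(losses)
```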
step S30, determining a mask list according to the loss list and the true probability distribution, and determining a positive category loss and a plurality of negative category losses of the sample data according to the mask list and the loss list;
Understandably, because the loss list characterizes the size of the prediction error, the larger the error on a preset category, the less likely the sample data is to belong to that category. Negative samples of the preset categories can therefore be determined from this likelihood: the sample data is formed into a negative sample of the preset categories to which it is unlikely to belong. After the loss list is generated, a mask list is generated by combining the loss list with the true probability distribution. The mask list contains values derived from the loss list, which characterize negative samples, and values from the true probability distribution, which characterize the positive sample. After the mask list is determined, the loss list is further processed with the mask list to generate one positive category loss and a plurality of negative category losses of the sample data. The positive category loss represents the true category to which the sample data belongs in the multi-class classification task, that is, the positive sample; the plurality of negative category losses represent categories to which the sample data is unlikely to belong, that is, negative samples.
Step S40, according to the positive category loss and the negative category losses, determining a positive category and a plurality of negative categories to which the sample data belongs in a plurality of preset categories, and according to the positive category and the negative categories, respectively constructing positive and negative samples of the preset categories, so as to generate a multi-category classification model based on the positive and negative samples of the preset categories for category classification.
Further, positive and negative samples of the plurality of preset categories are constructed according to the preset category that the positive category loss indicates the sample data belongs to and the preset categories that the negative category losses indicate it cannot belong to. The category to which the sample data belongs among the plurality of preset categories is determined according to the positive category loss; this category is the positive category, and the sample data is formed into a positive sample of that category. Meanwhile, the categories to which the sample data cannot belong are determined according to the negative category losses; these are the negative categories, and the sample data is formed into a negative sample of each of them. After a large amount of sample data has been processed in this way, the positive category of each piece of sample data is determined and the positive samples of every positive category are formed, and likewise the negative categories are determined and the negative samples of every negative category are formed. Each preset category thus contains positive samples and a plurality of negative samples, which ensures the balance of samples among the preset categories. The preset multi-class model is then trained with the constructed positive and negative samples of each preset category to obtain the multi-class classification model, which executes the multi-class classification task; because the samples are balanced among the preset categories, classification is accurate.
Further, the step of determining a positive category and a plurality of negative categories to which the sample data belongs in a plurality of preset categories according to the positive category loss and the plurality of negative category losses includes:
step c1, training an initial model according to the positive category loss and the negative category losses, and acquiring a gradient corresponding to the initial model;
step c2, finishing the training of the initial model when the gradient is smaller than a preset threshold value, and determining the positive category and the negative categories according to the mapping attributes of the sample data in a plurality of preset categories when the training of the initial model is finished.
Further, in this embodiment the construction of training samples for the multi-class processing task is carried out within the dynamic process of model training. An initial model preset for executing the multi-class classification task is trained according to the determined positive category loss and the plurality of negative category losses, and the gradient of the initial model after training is acquired. The gradient is compared with a preset threshold. If the gradient is smaller than the preset threshold, training of the initial model is finished; otherwise, training continues until the gradient is smaller than the preset threshold. When training is finished, the initial model exhibits convergence and performs well on the classification task. Therefore, for each training sample, the category to which it belongs and the categories to which it does not belong are taken as its mapping attributes, and the positive category and the plurality of negative categories are determined according to the mapping attributes: the category to which the sample data belongs when training of the initial model is finished is taken as the positive category, and the categories to which it does not belong are taken as negative categories. Positive samples and negative samples are then divided according to the positive categories and negative categories to construct the positive and negative samples of each preset category, which favors the balance of samples among the preset categories.
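The stopping rule described above, in which training continues until the gradient falls below a preset threshold, can be illustrated with a minimal Python sketch. The quadratic stand-in loss, the learning rate, the threshold value and the function name train_until_converged are all assumptions chosen for illustration; the text does not specify the model, the optimizer or the threshold.

```python
def train_until_converged(w=0.0, lr=0.1, threshold=1e-3, max_steps=10_000):
    """Gradient descent on a stand-in loss (w - 3)^2 until |gradient| < threshold."""
    for step in range(max_steps):
        grad = 2.0 * (w - 3.0)        # gradient of the stand-in loss
        if abs(grad) < threshold:     # gradient below threshold: training is finished
            return w, grad, step
        w -= lr * grad                # otherwise continue training
    raise RuntimeError("did not converge within max_steps")


w, grad, steps = train_until_converged()
print(abs(grad) < 1e-3)
```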
Further, after the step of constructing positive and negative samples of a plurality of preset categories according to the positive category and the plurality of negative categories, respectively, the method further includes:
step d1, training a preset multi-class model based on the positive and negative samples of the plurality of preset classes to generate a multi-class classification model;
step d2, when data to be classified is received, classifying the data to be classified based on the multi-class classification model, and determining the class to which the data to be classified belongs.
Further, a preset multi-class model for training is preset to generate a multi-class classification model by training the preset multi-class model. Specifically, the constructed positive and negative samples in each preset category are transmitted to a preset multi-category model, and the preset multi-category model is trained. The preset multi-class model comprises a training end condition, when the trained preset multi-class model reaches the training end condition, the training is ended, and a multi-class classification model is generated for class classification.
Furthermore, when the data to be classified is received and the requirement for executing the multi-class classification task is represented, a multi-class classification model is called to classify the data to be classified, the class to which the data to be classified belongs is determined, the data to be classified is classified into the class to which the data to be classified belongs, and the data to be classified is accurately classified.
In the prior art, imbalance of sample data among the classes makes execution of the multi-class classification task inaccurate. By contrast, the multi-class processing task training sample construction method of this embodiment first obtains the true probability distribution of the sample data over a plurality of preset categories and determines the predicted probability distribution of the sample data over the plurality of preset categories; determines a loss list of the sample data over the plurality of preset categories according to the true probability distribution and the predicted probability distribution; determines a mask list according to the loss list and the true probability distribution; determines one positive category loss and a plurality of negative category losses of the sample data from the mask list and the loss list; determines, from these losses, the positive category and the plurality of negative categories to which the sample data belongs among the plurality of preset categories; and constructs positive and negative samples of the plurality of preset categories according to the positive category and the plurality of negative categories, so that a multi-class classification model for category classification is generated from these samples.
The mask list reflects the positive probability that the sample data may belong to a certain category of a plurality of preset categories and the negative probability that the sample data may not belong to the certain category; the loss list is filtered to obtain a positive category loss and a plurality of negative category losses, a positive category and a plurality of negative categories to which the sample data belongs in a plurality of preset categories are determined, different sample data belong to different positive categories and negative categories, each preset category corresponds to a respective positive sample and a plurality of negative samples, the positive and negative samples in the plurality of preset categories are constructed, each preset category comprises the positive samples and the plurality of negative samples, and the balance of the sample data among the preset categories is realized. Therefore, the defect of sample data imbalance among all classes in the multi-class classification task in the prior art is overcome, and the accuracy of executing the multi-class classification task is improved.
Further, based on the first embodiment of the multi-class processing task training sample construction method of the present invention, a second embodiment of the multi-class processing task training sample construction method of the present invention is proposed.
The second embodiment of the method for constructing multi-class processing task training samples is different from the first embodiment of the method for constructing multi-class processing task training samples in that the step of determining a mask list according to the loss list and the true probability distribution comprises:
step S31, sequencing all numerical values in the loss list to obtain a probability sequence, and selecting a target probability arranged in a front preset position from the probability sequence;
step S32, determining the arrangement position of each target probability in the loss list, and updating the loss list according to the arrangement position;
and step S33, performing a sum operation on the updated loss list and the real probability distribution to generate a mask list.
This embodiment determines the mask list by combining the loss list with the true probability distribution. Specifically, a sorted operation is performed on the values in the loss list, arranging them in descending order to obtain a probability sequence. A topn operation is then performed on the probability sequence; the topn operation selects the first n values of the sequence, where n is a hyper-parameter set in advance as required, serves as the preset position count, and is usually set to a value less than or equal to 5. The values in the first n positions of the probability sequence are thus selected as the target probabilities. If n is set to 3, the first three values are selected as target probabilities and form the sublist [-log(p_{j,t}), -log(p_{k,t}), -log(p_{l,t})], where j, k and l denote the positions in the loss list from which the three values originate. The positions in the loss list from which the target probabilities originate are determined as the arrangement positions, and the loss list is updated according to them. The update sets the value at each arrangement position to one value, such as 1, and the value at every other position to another value, such as 0. For example, if the number of preset categories C is 10 and the arrangement positions j, k and l are the 3rd, 4th and 8th positions, meaning that the loss values at these three positions are the largest in the loss list, then the values at these three positions are set to 1 and the values at the other positions are set to 0, and the updated loss list is [0, 0, 1, 1, 0, 0, 0, 1, 0, 0].
In addition, the sorting and selecting operations performed on the loss list in this embodiment, i.e., sorted operation and topn operation, can be expressed by the following formula (7), where the formula (7) is:
SoftmaxCrossEntropyLossList=topn(sorted(SoftmaxCrossEntropyLossList)) (7);
Further, the updated loss list and the true probability distribution are added element by element to generate the mask list. In addition to the values of the updated loss list, the mask list therefore contains the true probability value from the true probability distribution. For the updated loss list [0, 0, 1, 1, 0, 0, 0, 1, 0, 0], if the 6th position of the true probability distribution holds the true probability, the true probability distribution [0, 0, 0, 0, 0, 1, 0, 0, 0, 0] and the updated loss list [0, 0, 1, 1, 0, 0, 0, 1, 0, 0] are added to obtain the mask list [0, 0, 1, 1, 0, 1, 0, 1, 0, 0].
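Steps S31 to S33 can be sketched in Python as follows, reproducing the worked example with C = 10 and n = 3. The function name build_mask, the 0-based indices and the concrete loss values are assumptions made for illustration. The sketch also does not treat the case in which the true category itself falls among the n largest losses, which the text does not discuss.

```python
def build_mask(losses, true_dist, n=3):
    """Return (mask list, positions of the n largest losses, 0-based)."""
    # S31: sort positions by loss in descending order and keep the first n.
    order = sorted(range(len(losses)), key=lambda i: losses[i], reverse=True)
    top_positions = sorted(order[:n])
    # S32: update the loss list to 1 at the selected positions and 0 elsewhere.
    updated = [1 if i in top_positions else 0 for i in range(len(losses))]
    # S33: add the updated loss list and the true distribution element by element.
    mask = [u + t for u, t in zip(updated, true_dist)]
    return mask, top_positions


# Hypothetical loss values whose three largest entries sit at the 3rd, 4th and 8th positions.
losses = [0.1, 0.2, 2.5, 1.9, 0.3, 0.4, 0.05, 1.7, 0.15, 0.25]
true_dist = [0, 0, 0, 0, 0, 1, 0, 0, 0, 0]
mask, top_positions = build_mask(losses, true_dist, n=3)
print(mask)           # [0, 0, 1, 1, 0, 1, 0, 1, 0, 0]
print(top_positions)  # [2, 3, 7]
```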
Further, the step of determining a positive-going class penalty and a plurality of negative-going class penalties for the sample data from the mask list and the penalty list comprises:
step S34, performing a product operation on the mask list and the loss list before updating to generate a product result list;
step S35, determining the forward category loss from the multiplication result list according to the position of the true probability in the true probability distribution;
step S36, determining a plurality of negative-going category losses from the product result according to the arrangement position.
Further, the mask list and the loss list before updating are multiplied, and the result is taken as the product result list. The multiplication is performed position by position; for example, the value at the first position of the mask list is multiplied by the value at the first position of the loss list before updating. Since the value at each position of the mask list is either 1 or 0, multiplication by 0 filters out the invalid negative samples in the loss list, while multiplication by 1 retains the positive sample and the valid negative samples. The product operation can be expressed by the following formula (8):
FinalLossList=MaskedLossList*SoftmaxCrossEntropyLossList (8);
wherein, FinalLossList represents the multiplication result list, and MaskedLossList represents the mask list.
Further, after the filtered product result list is generated, the positive category loss is determined from the product result list according to the position of the true probability in the true probability distribution. If the 6th position of the true probability distribution holds the true probability, the value at the 6th position of the product result list is taken as the positive category loss. At the same time, the plurality of negative category losses are determined from the product result list according to the arrangement positions of the target probabilities selected in the first n positions. If the arrangement positions of the target probabilities are the 3rd, 4th and 8th positions, the values at the 3rd, 4th and 8th positions of the product result list are taken as the negative category losses.
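Steps S34 to S36 can be sketched as follows, using the same worked example (true category at the 6th position, negative positions at the 3rd, 4th and 8th). The function name split_losses, the 0-based indices and the concrete loss values are assumptions made for illustration.

```python
def split_losses(mask, losses, true_dist, negative_positions):
    """Apply formula (8) and pick out the positive and negative category losses."""
    # S34: position-wise product of the mask list and the loss list before updating.
    final = [m * l for m, l in zip(mask, losses)]
    # S35: the positive category loss sits where the true distribution is 1.
    positive_loss = final[true_dist.index(1)]
    # S36: the negative category losses sit at the arrangement positions.
    negative_losses = [final[i] for i in negative_positions]
    return final, positive_loss, negative_losses


losses = [0.1, 0.2, 2.5, 1.9, 0.3, 0.4, 0.05, 1.7, 0.15, 0.25]
true_dist = [0, 0, 0, 0, 0, 1, 0, 0, 0, 0]
mask = [0, 0, 1, 1, 0, 1, 0, 1, 0, 0]
final, positive_loss, negative_losses = split_losses(mask, losses, true_dist, [2, 3, 7])
print(positive_loss)    # 0.4
print(negative_losses)  # [2.5, 1.9, 1.7]
```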
In this embodiment, the loss list is updated by sorting it and selecting from it, and the mask list is generated by combining the updated loss list with the true probability distribution; the loss list before updating is then filtered with the mask list to generate the product result list, from which the positive category loss and the plurality of negative category losses are determined. Positive and negative samples are thus constructed from the obtained positive category loss and the plurality of negative category losses; by processing a plurality of sample data in this way, each preset category comes to contain a plurality of positive and negative samples, which favors the balance of samples across the preset categories.
The invention also provides a multi-class processing task training sample construction device.
Referring to fig. 3, fig. 3 is a functional block diagram of a multi-class processing task training sample construction apparatus according to a first embodiment of the present invention. The multi-class processing task training sample construction device comprises:
the obtaining module 10 is configured to obtain true probability distribution of sample data in multiple preset categories, and determine predicted probability distribution of the sample data in the multiple preset categories;
a first determining module 20, configured to determine a loss list of the sample data in multiple preset categories according to the true probability distribution and the predicted probability distribution;
a second determining module 30, configured to determine a mask list according to the loss list and the true probability distribution, and determine a positive category loss and multiple negative category losses of the sample data according to the mask list and the loss list;
and the building module 40 is configured to determine, according to the positive category loss and the multiple negative category losses, a positive category and multiple negative categories to which the sample data belongs among multiple preset categories, and respectively build positive and negative samples of the multiple preset categories according to the positive category and the multiple negative categories, so as to generate a multi-category classification model based on the positive and negative samples of the multiple preset categories, and perform category classification.
Further, the multi-class processing task training sample constructing device comprises:
the training module is used for training a preset multi-class model based on positive and negative samples of a plurality of preset classes to generate a multi-class classification model;
and the classification module is used for carrying out class classification on the data to be classified based on the multi-class classification model when the data to be classified is received, and determining the class to which the data to be classified belongs.
Further, the second determining module 30 further includes:
the sequencing unit is used for sequencing all numerical values in the loss list to obtain a probability sequence, and selecting a target probability arranged in a front preset position from the probability sequence;
the updating unit is used for determining the arrangement position of each target probability in the loss list and updating the loss list according to the arrangement position;
and the addition operation unit is used for performing addition operation on the updated loss list and the real probability distribution to generate a mask list.
Further, the second determining module 30 further includes:
a product operation unit, configured to perform product operation on the mask list and the loss list before updating, and generate a product result list;
a first determining unit, configured to determine the forward category loss from the product result list according to a position of a true probability in the true probability distribution;
the first determining unit is further configured to determine a plurality of negative-going class losses from the product result according to the arrangement position.
Further, the first determining module 20 further includes:
the second determining unit is used for respectively determining a real probability value and a prediction probability value of the sample data on each preset category according to the real probability distribution and the prediction probability distribution;
the calculation unit is used for calculating the real probability value and the prediction probability value of the sample data on each preset category based on a preset loss formula to obtain a loss value of the sample data on each preset category;
and the arrangement unit is used for arranging each loss value according to the arrangement sequence of each preset category in the plurality of preset categories to obtain the loss list.
Further, the obtaining module 10 further includes:
and the obtaining unit is used for obtaining output values corresponding to the sample data and a plurality of preset categories respectively, mapping the output values respectively according to a preset function, and obtaining the predicted probability distribution of the sample data on the preset categories.
Further, the multi-class processing task training sample constructing device further includes:
the encoding module is used for acquiring sample data and a sample label corresponding to the sample data, encoding the sample label based on a preset encoding mode, and generating the real probability distribution of the sample data on a plurality of preset categories.
Further, the building module 40 further includes:
the training unit is used for training an initial model according to the positive-direction category loss and the negative-direction category losses and acquiring a gradient corresponding to the initial model;
and the third determining unit is used for finishing the training of the initial model when the gradient is smaller than a preset threshold value, and determining the positive category and the negative categories according to the mapping attributes of the sample data in a plurality of preset categories when the training of the initial model is finished.
The specific implementation of the multi-class processing task training sample construction device of the present invention is substantially the same as that of each embodiment of the multi-class processing task training sample construction method described above, and details are not repeated herein.
In addition, the embodiment of the invention also provides a medium.
The medium stores a multi-class processing task training sample construction program, and the multi-class processing task training sample construction program realizes the steps of the multi-class processing task training sample construction method when being executed by the processor.
The medium of the present invention may be a computer medium, and the specific implementation manner of the medium of the present invention is substantially the same as that of each embodiment of the above-mentioned multi-class processing task training sample construction method, and will not be described herein again.
The present invention is described in connection with the accompanying drawings, but the present invention is not limited to the above embodiments, which are only illustrative and not restrictive, and those skilled in the art can make various changes without departing from the spirit and scope of the invention as defined by the appended claims, and all changes that come within the meaning and range of equivalency of the specification and drawings that are obvious from the description and the attached claims are intended to be embraced therein.

Claims (10)

1. A multi-class processing task training sample construction method is characterized by comprising the following steps:
acquiring real probability distribution of sample data on a plurality of preset categories, and determining the predicted probability distribution of the sample data on the plurality of preset categories;
determining a loss list of the sample data on a plurality of preset categories according to the real probability distribution and the prediction probability distribution;
determining a mask list according to the loss list and the real probability distribution, and determining a positive category loss and a plurality of negative category losses of the sample data according to the mask list and the loss list;
and according to the positive category loss and the negative category losses, determining a positive category and a plurality of negative categories to which the sample data belongs in a plurality of preset categories, respectively constructing positive and negative samples of the plurality of preset categories according to the positive category and the plurality of negative categories, and generating a multi-category classification model based on the positive and negative samples of the plurality of preset categories to classify the categories.
2. The method of claim 1, wherein after the step of constructing positive and negative samples of a plurality of predetermined classes based on the positive and negative classes, respectively, the method further comprises:
training a preset multi-class model based on positive and negative samples of a plurality of preset classes to generate a multi-class classification model;
and when data to be classified is received, performing class classification on the data to be classified based on the multi-class classification model, and determining the class to which the data to be classified belongs.
3. The method of claim 1, wherein the step of determining a mask list based on the loss list and the true probability distribution comprises:
sequencing all numerical values in the loss list to obtain a probability sequence, and selecting a target probability arranged at a front preset position from the probability sequence;
determining the arrangement position of each target probability in the loss list, and updating the loss list according to the arrangement position;
and performing summation operation on the updated loss list and the real probability distribution to generate a mask list.
4. The method of claim 3, wherein the step of determining the positive-going class loss and the plurality of negative-going class losses of the sample data according to the mask list and the loss list comprises:
performing product operation on the mask list and the loss list before updating to generate a product result list;
determining the forward category loss from the product result list according to the position of the true probability in the true probability distribution;
and determining a plurality of negative category losses from the product result according to the arrangement position.
5. The method of claim 1, wherein the step of determining the loss list of the sample data on the plurality of preset classes according to the true probability distribution and the predicted probability distribution comprises:
respectively determining a true probability value and a predicted probability value of the sample data on each preset category according to the true probability distribution and the predicted probability distribution;
calculating, based on a preset loss formula, a loss value of the sample data on each preset category from the true probability value and the predicted probability value of the sample data on that preset category;
and arranging each loss value according to the arrangement sequence of each preset category in the plurality of preset categories to obtain the loss list.
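A per-category loss list as described in claim 5 can be sketched as follows. The claim only refers to "a preset loss formula"; binary cross-entropy per category is used here purely as an assumed example, and the name `compute_loss_list` and the clamping constant `eps` are hypothetical.

```python
import math


def compute_loss_list(true_dist, pred_dist, eps=1e-12):
    """Return one loss value per preset category, in category order.

    true_dist: true probability values (one-hot in the simplest case).
    pred_dist: predicted probability values, same length and order.
    """
    losses = []
    for y, p in zip(true_dist, pred_dist):
        # Clamp the prediction so that log() never receives 0.
        p = min(max(p, eps), 1.0 - eps)
        # Assumed loss formula: per-category binary cross-entropy.
        losses.append(-(y * math.log(p) + (1.0 - y) * math.log(1.0 - p)))
    return losses


losses = compute_loss_list([0, 1, 0], [0.2, 0.7, 0.1])
print(losses)
```

Because the losses are appended while iterating over the categories in their preset order, the resulting list is already arranged as the last step of the claim requires.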
6. The method of any of claims 1-5, wherein the step of determining the predicted probability distribution of the sample data on a plurality of preset categories comprises:
obtaining output values of the sample data corresponding respectively to the plurality of preset categories, and mapping each output value according to a preset function to obtain the predicted probability distribution of the sample data on the plurality of preset categories.
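The mapping of claim 6 can be illustrated with a short sketch. The claim does not name the "preset function"; softmax is assumed here because it turns raw outputs into a distribution that sums to 1, and a per-category sigmoid would be another plausible reading. The name `softmax` and the sample outputs are illustrative only.

```python
import math


def softmax(outputs):
    """Map raw per-category output values to a probability distribution."""
    # Subtracting the maximum keeps exp() numerically stable.
    m = max(outputs)
    exps = [math.exp(o - m) for o in outputs]
    total = sum(exps)
    return [e / total for e in exps]


pred_dist = softmax([2.0, 0.5, -1.0])
print(pred_dist)
```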
7. The method of any of claims 1-5, wherein before the step of obtaining the true probability distribution of the sample data on a plurality of preset categories, the method further comprises:
obtaining the sample data and a sample label corresponding to the sample data, and encoding the sample label based on a preset encoding mode to generate the true probability distribution of the sample data on the plurality of preset categories.
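The label encoding of claim 7 can be sketched as below. One-hot encoding is assumed as the "preset encoding mode", which the claim leaves open; the function name `one_hot` and the category names are hypothetical.

```python
def one_hot(label, categories):
    """Encode a sample label as a true probability distribution.

    categories: the preset categories, in their fixed order.
    """
    if label not in categories:
        raise ValueError(f"unknown label: {label!r}")
    # Probability 1 on the labelled category, 0 on every other category.
    return [1.0 if c == label else 0.0 for c in categories]


true_dist = one_hot("cat", ["dog", "cat", "bird"])
print(true_dist)  # [0.0, 1.0, 0.0]
```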
8. The method of any of claims 1-5, wherein the step of determining the positive category and the negative categories to which the sample data belongs among the plurality of preset categories according to the positive category loss and the plurality of negative category losses comprises:
training an initial model according to the positive category loss and the negative category losses, and acquiring a gradient corresponding to the initial model;
and when the gradient is smaller than a preset threshold value, finishing the training of the initial model, and determining the positive category and the negative categories according to the mapping attributes of the sample data in a plurality of preset categories when the training of the initial model is finished.
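The stopping rule of claim 8 (finish training once the gradient falls below a preset threshold) can be sketched generically. Plain gradient descent, the Euclidean gradient norm, and all names and default values below are assumptions; the claim specifies neither the optimiser nor how the gradient is measured.

```python
def train_until_small_gradient(params, grad_fn, lr=0.1, threshold=1e-6, max_steps=10_000):
    """Run gradient descent until the gradient norm drops below threshold.

    params: list of float parameters of the (hypothetical) initial model.
    grad_fn: callable returning the gradient list for given params.
    Returns the final params and the number of update steps taken.
    """
    for step in range(max_steps):
        grads = grad_fn(params)
        norm = sum(g * g for g in grads) ** 0.5
        if norm < threshold:
            # Gradient is smaller than the preset threshold: training ends.
            return params, step
        params = [p - lr * g for p, g in zip(params, grads)]
    return params, max_steps


# Toy stand-in for the combined positive and negative category losses:
# f(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
final_params, steps = train_until_small_gradient([0.0], lambda w: [2.0 * (w[0] - 3.0)])
print(final_params, steps)
```

In the claimed method the positive and negative categories would then be read from the trained model's mapping of the sample data onto the preset categories; that step is not modelled here.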
9. A multi-class processing task training sample construction device, characterized in that the multi-class processing task training sample construction device comprises a memory, a processor and a multi-class processing task training sample construction program stored on the memory and operable on the processor, which, when executed by the processor, implements the steps of the multi-class processing task training sample construction method according to any of claims 1-8.
10. A medium having stored thereon a multi-class processing task training sample construction program which, when executed by a processor, implements the steps of the multi-class processing task training sample construction method of any one of claims 1-8.
CN202010936484.6A 2020-09-08 2020-09-08 Multi-class processing task training sample construction method, device and medium Pending CN112070162A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010936484.6A CN112070162A (en) 2020-09-08 2020-09-08 Multi-class processing task training sample construction method, device and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010936484.6A CN112070162A (en) 2020-09-08 2020-09-08 Multi-class processing task training sample construction method, device and medium

Publications (1)

Publication Number Publication Date
CN112070162A (en) 2020-12-11

Family

ID=73664571

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010936484.6A Pending CN112070162A (en) 2020-09-08 2020-09-08 Multi-class processing task training sample construction method, device and medium

Country Status (1)

Country Link
CN (1) CN112070162A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114416943A (en) * 2021-12-29 2022-04-29 北京百度网讯科技有限公司 Training method and device for dialogue model, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN109902708B (en) Recommendation model training method and related device
CN108536784B (en) Comment information sentiment analysis method and device, computer storage medium and server
JP2019507398A (en) Collaborative filtering method, apparatus, server, and storage medium for fusing time factors
CN110245310B (en) Object behavior analysis method, device and storage medium
CN112686371A (en) Network structure search method, device, equipment, storage medium and program product
CN111460384A (en) Policy evaluation method, device and equipment
CN110992124A (en) House resource recommendation method and system
CN113902010A (en) Training method of classification model, image classification method, device, equipment and medium
CN114676279B (en) Image retrieval method, device, equipment and computer readable storage medium
CN112766402A (en) Algorithm selection method and device and electronic equipment
CN109460398A (en) Complementing method, device and the electronic equipment of time series data
CN111210022A (en) Backward model selection method, device and readable storage medium
CN112070162A (en) Multi-class processing task training sample construction method, device and medium
CN112860736A (en) Big data query optimization method and device and readable storage medium
CN113095508A (en) Regression model construction optimization method, device, medium, and computer program product
CN112700003A (en) Network structure search method, device, equipment, storage medium and program product
CN112541556A (en) Model construction optimization method, device, medium, and computer program product
CN111241746A (en) Forward model selection method, apparatus and readable storage medium
CN111898766A (en) Ether house fuel limitation prediction method and device based on automatic machine learning
CN110705889A (en) Enterprise screening method, device, equipment and storage medium
CN114491093B (en) Multimedia resource recommendation and object representation network generation method and device
CN115730152A (en) Big data processing method and big data processing system based on user portrait analysis
CN112052903A (en) Multi-label processing task training sample construction method, device and medium
CN111984637B (en) Missing value processing method and device in data modeling, equipment and storage medium
CN113591979A (en) Industry category identification method, equipment, medium and computer program product

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination