CN111079811A

CN111079811A - Sampling method for multi-label classified data imbalance problem

Info

Publication number: CN111079811A
Application number: CN201911245293.9A
Authority: CN
Inventors: 白夏颖; 翟得胜; 冯子豪
Original assignee: Xidian University
Current assignee: Xidian University
Priority date: 2019-12-06
Filing date: 2019-12-06
Publication date: 2020-04-28

Abstract

The invention discloses a sampling method for the data imbalance problem in multi-label classification. A is defined as a label matrix of size m × n, where m is the number of training samples and n is the number of attributes of each image. For example, the PA-100K data set contains 100000 pedestrian images to be trained, each with 26 classification attributes, so A is a binary matrix of size 100000 × 26. r is a weight vector of length m that represents the number of times each sample appears in the sub-dataset generated by applying DAI to the whole training data set. After re-weighting, labels with few positive samples become more balanced through resampling. The invention brings the samples of each label to a relatively balanced state, thereby improving the accuracy of attribute identification. By weighting the samples it combines oversampling and undersampling, which helps the model learn attributes that make up only a small part of the data set.

Description

Sampling method for multi-label classified data imbalance problem

Technical Field

The invention relates to the field of sampling methods, in particular to a sampling method for solving the problem of unbalanced multi-label classified data.

Background

Pedestrian attribute identification plays a crucial role in intelligent video surveillance. Its aim is to extract the attributes of a target pedestrian, and it is widely applied in pedestrian re-identification and face verification. Traditional pedestrian attribute identification relies on manually extracted features and a robust classifier, but this approach cannot extract high-order features of pedestrians, so its identification rate is low. With the development of deep learning, pedestrian features are extracted with multilayer nonlinear convolutional networks. As networks have grown deeper, the identification rate for pedestrian attributes has risen steadily, and fine-grained features such as the eyes and accessories can be identified well. Many intelligent security applications have begun to use this technology, and the following problems are common:

(1) Pedestrian attribute identification is a multi-label classification problem. Many rare attributes have a low identification rate because they have few samples, which greatly reduces the accuracy of pedestrian identification in security applications.

(2) The attributes are strongly correlated. Traditional sampling methods target a single label, whereas a sampling method for the multi-label classification problem must take the strong correlation between labels into account.

(3) Traditional methods for the data imbalance problem, such as cost-sensitive learning and sampling, cannot improve the balance and the identification of rare attributes in a targeted way.

Pedestrian attribute identification is an important multi-label classification problem. Although convolutional neural networks excel at learning discriminative features from images, data imbalance in fine-grained multi-label tasks remains an unsolved problem. A new sampling algorithm is therefore proposed: Data Augmentation Imbalance (DAI for short), which strengthens attribute identification by increasing the proportion of minority labels. In essence, oversampling and undersampling are applied simultaneously to a multi-label data set, so that the number of samples for attributes with many samples is moderately reduced and the proportion of attributes with few samples is increased. Because all attributes of a given pedestrian are still predicted together, the correlation between attributes is not destroyed.

Disclosure of Invention

The present invention provides a sampling method for solving the problem of unbalanced multi-label classification data, so as to solve the problems in the background art.

In order to achieve the purpose, the invention provides the following technical scheme: a sampling method for the multi-label classification data imbalance problem comprises the following steps:

S1: A is defined as a label matrix of size m × n, where m is the number of training samples and n is the number of attributes of each image. The PA-100K data set has 100000 pedestrian images to be trained and each image has 26 classification attributes, so A is a binary matrix of size 100000 × 26. r is a weight vector of length m that represents the number of times each sample appears in the sub-dataset produced by applying DAI to the entire training data set. In equation (1), $\mathbf{1}$ denotes a vector whose entries are all equal to 1 and $\circ$ denotes the Hadamard product. After re-weighting, the proportion of positive samples in each label can be expressed as equation (1):
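Equation (1) itself does not appear in this text. As a minimal sketch, assuming the re-weighted proportion of positives in a label is the weighted count of its positive samples divided by the total weight, the quantity p could be computed as follows. The function name `positive_proportion` and the toy label matrix are illustrative and do not come from the patent.

```python
import numpy as np

def positive_proportion(A, r):
    """Weighted share of positive samples per label (assumed form of equation (1))."""
    A = np.asarray(A, dtype=float)   # m x n binary label matrix
    r = np.asarray(r, dtype=float)   # length-m sample weights
    # A * r[:, None] is the Hadamard product of A with r broadcast across columns.
    weighted_positives = (A * r[:, None]).sum(axis=0)
    return weighted_positives / r.sum()

# Toy example: 4 samples, 2 labels; the second label is rare.
A = np.array([[1, 0],
              [1, 0],
              [1, 1],
              [1, 0]])
uniform = positive_proportion(A, np.ones(4))
reweighted = positive_proportion(A, np.array([1, 1, 3, 1]))
print(uniform, reweighted)
```

In this toy case, tripling the weight of the third sample raises the second label's positive proportion from 0.25 to 0.5 while the first label stays at 1.0.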

S2: Labels with few positive samples are made more balanced by resampling. The objective function to be minimized is equation (2):

$f(r) = \max\left(0,\ (p_{ideal} - p)^{3}\right) \quad \text{for } r \ge 0 \qquad (2)$

$p_{ideal}$ represents the ideal proportion of positive samples in each label and is set to 0.6. r is constrained to be greater than or equal to 0, because a negative value makes no sense.
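A minimal sketch of how objective (2) might be evaluated is given below. Two points are assumptions rather than statements from the text: p is taken to be the weighted share of positives per label, and the per-label terms are summed into a single scalar, since the text does not say how the vector is reduced. The name `objective` is illustrative.

```python
import numpy as np

P_IDEAL = 0.6  # ideal positive proportion stated in the text

def objective(A, r, p_ideal=P_IDEAL):
    """Equation (2), summed over labels (the summation is an assumption)."""
    A = np.asarray(A, dtype=float)
    r = np.asarray(r, dtype=float)
    if np.any(r < 0):
        raise ValueError("weights r must be non-negative")
    p = A.T @ r / r.sum()                            # assumed form of equation (1)
    shortfall = np.maximum(0.0, (p_ideal - p) ** 3)  # zero once p >= p_ideal
    return shortfall.sum()

A = np.array([[1, 0], [1, 0], [1, 1], [1, 0]])       # toy labels; second label is rare
before = objective(A, np.ones(4))
after = objective(A, np.array([1, 1, 3, 1]))
print(before, after)
```

Labels whose positive proportion already exceeds 0.6 contribute nothing, so only under-represented labels drive the objective. Up-weighting the one sample that carries the rare label lowers the value.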

S3: To minimize function (2), it is converted into the unconstrained form of equation (3), which is then minimized by gradient descent:

$f(r) = \max\left(0,\ (p_{ideal} - p)^{3}\right) - \lambda r, \quad \lambda > 0 \qquad (3)$

$\lambda$ is the regularization parameter. After r is calculated, it is rescaled by multiplying it by a constant, taken as 5 or 10, and then converted to integers; each integer represents the number of times the corresponding sample appears in the sub-dataset.
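The three steps can be sketched end to end as below. This is a hedged illustration, not the patented implementation, and it rests on several assumptions: p is the weighted share of positives per label, the per-label objective terms are summed, the penalty term is read as $-\lambda \sum_i \min(0, r_i)$ (a penalty on negative weights), and the integer conversion rounds to the nearest integer. The names `dai_weights` and `dai_counts`, the learning rate and the step count are all hypothetical.

```python
import numpy as np

def dai_weights(A, p_ideal=0.6, lam=1.0, lr=1.0, steps=500):
    """Gradient descent on the penalised objective; returns sample weights r."""
    A = np.asarray(A, dtype=float)
    r = np.ones(A.shape[0])                    # start from uniform weights
    for _ in range(steps):
        s = r.sum()
        p = A.T @ r / s                        # assumed form of equation (1)
        g = np.maximum(0.0, p_ideal - p) ** 2  # squared shortfall of each label
        grad = -(3.0 / s) * ((A - p) @ g)      # gradient of sum_j max(0, p_ideal - p_j)^3
        grad -= lam * (r < 0)                  # gradient of -lam * sum_i min(0, r_i)
        r -= lr * grad
    return np.maximum(r, 0.0)                  # negative weights are meaningless

def dai_counts(r, scale=5):
    """Rescale weights by a constant (5 or 10 in the text) and convert to integers."""
    return np.rint(scale * r).astype(int)      # rounding to nearest is an assumption

A = np.array([[1, 0], [1, 0], [1, 1], [1, 0]])  # toy labels; second label is rare
counts = dai_counts(dai_weights(A))
indices = np.repeat(np.arange(len(A)), counts)  # sample i appears counts[i] times
sub_A = A[indices]                              # label matrix of the sub-dataset
print(counts, sub_A.mean(axis=0))
```

Because one integer count is attached to each whole sample, all labels of that sample are duplicated or dropped together, which is how co-occurrence between attributes is preserved in the sub-dataset.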

Preferably, pedestrian attribute identification with DAI added is carried out as follows: the original data set is passed through the DAI algorithm to generate a sub-dataset, the sub-dataset is fed to a prediction network, and the prediction network is tested.

Preferably, the prediction network is trained alternately on the original data set and the generated sub-datasets; the prediction network is a classification network whose output dimension equals the number of labels.
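The text does not specify how the alternation is scheduled. A minimal sketch, assuming the data source is switched every epoch and the generated sub-datasets are used in rotation, might look like this. `alternating_schedule` is an illustrative name, and `train_one_epoch` is a hypothetical stand-in for one training pass of the prediction network.

```python
from itertools import cycle

def alternating_schedule(original, sub_datasets, epochs):
    """Yield (epoch, source, dataset), switching source every epoch (assumed policy)."""
    subs = cycle(sub_datasets)               # rotate through the generated sub-datasets
    for epoch in range(epochs):
        if epoch % 2 == 0:
            yield epoch, "original", original
        else:
            yield epoch, "sub", next(subs)

def train_one_epoch(dataset):
    """Hypothetical placeholder: returns the number of samples it would process."""
    return len(dataset)

original = ["img0", "img1", "img2", "img3"]
sub_datasets = [["img2", "img2", "img0"], ["img2", "img1", "img2"]]
log = [(source, train_one_epoch(data))
       for _, source, data in alternating_schedule(original, sub_datasets, 4)]
print(log)
```

Other schedules, such as switching every few iterations instead of every epoch, would equally satisfy the wording of the text.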

The invention has the technical effects and advantages that:

1. The present invention can convert an imbalanced multi-label data set into a more balanced sub-dataset. The algorithm effectively addresses the data imbalance problem in multi-label settings, particularly in the field of pedestrian attribute identification;

2. The invention combines oversampling and undersampling by weighting the samples, which helps the model learn attributes that make up only a small part of the data set. Compared with recently published techniques that use attention models, directly increasing the number of samples for rare attributes is useful for pedestrian attribute identification. In addition to providing convincing attribute predictions, DAI can also help determine whether pedestrians seen by different cameras are the same person, for tracking purposes.

Drawings

FIG. 1 is a diagram showing the proportions of sample attributes before and after the DAI algorithm of the present invention is applied.

FIG. 2 is a schematic diagram of the pedestrian attribute identification process after DAI is added according to the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

The invention provides a sampling method for the unbalanced problem of multi-label classified data, which is shown in figures 1-2 and comprises the following steps:

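$f(r) = \max\left(0,\ (p_{ideal} - p)^{3}\right) \quad \text{for } r \ge 0 \qquad (2)$

$f(r) = \max\left(0,\ (p_{ideal} - p)^{3}\right) - \lambda r, \quad \lambda > 0 \qquad (3)$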

As shown in FIG. 2, pedestrian attribute identification with DAI added is carried out as follows: the original data set is passed through the DAI algorithm to generate sub-datasets, the sub-datasets are fed to a prediction network, and the prediction network is tested. The prediction network is trained alternately on the original data set and the generated sub-datasets; it is a classification network whose output dimension equals the number of labels.

Finally, it should be noted that: although the present invention has been described in detail with reference to the foregoing embodiments, it will be apparent to those skilled in the art that modifications may be made to the embodiments or portions thereof without departing from the spirit and scope of the invention.

Claims

1. A sampling method for solving the problem of imbalance of multi-label classified data is characterized by comprising the following steps:

$f(r) = \max\left(0,\ (p_{ideal} - p)^{3}\right) \quad \text{for } r \ge 0 \qquad (2)$

$p_{ideal}$ represents the ideal proportion of positive samples in each label and is set to 0.6; r is constrained to be greater than or equal to 0, because a negative value makes no sense.

$f(r) = \max\left(0,\ (p_{ideal} - p)^{3}\right) - \lambda r, \quad \lambda > 0 \qquad (3)$

2. The sampling method for the multi-label classification data imbalance problem according to claim 1, wherein pedestrian attribute identification with DAI added is carried out as follows: the original data set is passed through the DAI algorithm to generate a sub-dataset, the sub-dataset is fed to a prediction network, and the prediction network is tested.

3. The sampling method for the multi-label classification data imbalance problem according to claim 2, wherein the prediction network is trained alternately on the original data set and the generated sub-datasets, the prediction network is a classification network, and its output dimension equals the number of labels.