CN111079811A - Sampling method for multi-label classified data imbalance problem - Google Patents

Info

Publication number
CN111079811A
Authority
CN
China
Prior art keywords
label
samples
data set
ideal
attributes
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911245293.9A
Other languages
Chinese (zh)
Inventor
白夏颖
翟得胜
冯子豪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xidian University
Original Assignee
Xidian University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2019-12-06
Filing date
2019-12-06
Publication date
2020-04-28
Application filed by Xidian University
Priority to CN201911245293.9A
Publication of CN111079811A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a sampling method for the data imbalance problem in multi-label classification. A is defined as a label matrix of size m × n, where m is the number of training samples and n is the number of attributes of each image. For example, the PA-100K dataset contains 100000 pedestrian images for training, each with 26 classification attributes, so A is a binary matrix of size 100000 × 26. r is a weight vector of length m that represents the number of times each sample appears in the sub-dataset generated by applying DAI to the whole training dataset. After re-weighting, resampling makes labels with few positive samples more balanced. The invention brings the samples of each label to a relatively balanced state, thereby improving the accuracy of attribute identification. It integrates oversampling and undersampling by weighting the samples, which helps the model learn attributes that account for only a small part of the dataset.

Description

Sampling method for multi-label classified data imbalance problem
Technical Field
The invention relates to the field of sampling methods, in particular to a sampling method for solving the problem of unbalanced multi-label classified data.
Background
Pedestrian attribute identification plays a crucial role in intelligent video surveillance. Its aim is to mine the attributes of a target pedestrian, and it is widely applied in pedestrian re-identification and face verification. Traditional pedestrian attribute identification relies on manually extracted features and a robust classifier, but it cannot extract high-order features of pedestrians, so its recognition rate is low. With the development of deep learning, pedestrian features are extracted with multilayer nonlinear convolutional networks. As networks grow deeper, the recognition rate of pedestrian attributes keeps rising, and fine-grained features such as eyes and accessories can be identified well. Many intelligent security applications have begun to use this technology. The common problems are as follows:
(1) Pedestrian attribute identification is a multi-label classification problem. Many rare attributes have a low recognition rate because they have few samples, which greatly reduces the accuracy of pedestrian identification in the security field.
(2) The attributes are strongly correlated. Traditional sampling methods target a single label, whereas a sampling method for multi-label classification must take the strong correlation between labels into account.
(3) Traditional methods for the data imbalance problem, such as cost-sensitive learning and sampling, cannot improve the balance and the recognition of rare attributes in a targeted way.
Pedestrian attribute identification is an important multi-label classification problem. Although convolutional neural networks excel at learning discriminative features from images, data imbalance for fine-grained tasks in the multi-label setting remains an unsolved problem. A new sampling algorithm, Data Augmentation Imbalance (DAI), is therefore proposed to strengthen attribute recognition by increasing the proportion of labels with few samples. In essence, oversampling and undersampling are applied to the multi-label dataset simultaneously, which moderately reduces the number of samples of frequent attributes and increases the proportion of rare attributes. Because all attributes of a pedestrian are predicted simultaneously, the correlation between attributes is not damaged.
Disclosure of Invention
The present invention provides a sampling method for solving the problem of unbalanced multi-label classification data, so as to solve the problems in the background art.
In order to achieve the purpose, the invention provides the following technical scheme: a sampling method for the multi-label classification data imbalance problem comprises the following steps:
S1: A is defined as a label matrix of size m × n, where m is the number of training samples and n is the number of attributes of each image. The PA-100K dataset has 100000 pedestrian images for training, each with 26 classification attributes, so A is a binary matrix of size 100000 × 26. r is a weight vector of length m that represents the number of times each sample appears in the sub-dataset generated by applying DAI to the entire training dataset. In equation (1), 1 denotes a vector whose entries all equal 1 and $\circ$ denotes the Hadamard product. After re-weighting, the proportion of positive samples in each label can be expressed as equation (1):
[Equation (1): image BDA0002307365500000021 in the original publication, not reproduced in this text]
S2: Resampling makes labels with few positive samples more balanced. The objective function to be minimized is equation (2):
$f(r)=\max\left(0,\ (p_{ideal}-p)^{3}\right)$ for $r \geq 0$ (2)
$p_{ideal}$ denotes the ideal proportion of positive samples in each label and is set to 0.6. r is constrained to be greater than or equal to 0 because a negative weight is meaningless.
S3: To minimize function (2), it is converted into the unconstrained form of equation (3), which is minimized by gradient descent:
$f(r)=\max\left(0,\ (p_{ideal}-p)^{3}\right)-\lambda r$, $\lambda>0$ (3)
λ is the regularization parameter. After r is computed, it is rescaled by multiplying it by a constant (5 or 10) and then converted to integers, each of which is the number of times the corresponding sample appears in the sub-dataset.
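Steps S1 to S3 can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the patented implementation: equation (1) is not legible in this text, so the sketch assumes p = Aᵀr / (1ᵀr); it also assumes that the max term of equation (3) is summed over labels and the λ term over samples to obtain a scalar, and that "converted to integers" means rounding after clipping at zero. The function names, learning rate, and step count are hypothetical.

```python
import numpy as np

def positive_ratio(A, r):
    # ASSUMPTION: stands in for equation (1), which is an image in the source.
    # Weighted share of positive samples per label: A^T r / (1^T r).
    return (A.T @ r) / r.sum()

def objective(A, r, p_ideal=0.6, lam=0.01):
    # Equation (3) taken literally. ASSUMPTION: summed over labels / samples.
    p = positive_ratio(A, r)
    return np.maximum(0.0, (p_ideal - p) ** 3).sum() - lam * r.sum()

def gradient(A, r, p_ideal=0.6, lam=0.01):
    # Analytic gradient of objective() with respect to r.
    p = positive_ratio(A, r)
    d = -3.0 * np.maximum(0.0, p_ideal - p) ** 2   # derivative w.r.t. each p_j
    return (A @ d - p @ d) / r.sum() - lam

def dai_weights(A, p_ideal=0.6, lam=0.01, lr=1.0, steps=200, scale=5):
    A = np.asarray(A, dtype=float)
    r = np.ones(A.shape[0])                        # start from uniform weights
    for _ in range(steps):                         # plain gradient descent
        r = r - lr * gradient(A, r, p_ideal, lam)
    # ASSUMPTION: clip negatives, rescale by the constant (5 or 10), round.
    counts = np.rint(scale * np.maximum(r, 0.0)).astype(int)
    return r, counts

# Toy label matrix: label 0 is frequent, label 1 is rare.
A = np.array([[1, 0]] * 5 + [[0, 1]])
r, counts = dai_weights(A)
print("ratio before:", positive_ratio(A, np.ones(len(A))))
print("ratio after: ", positive_ratio(A, r))
print("counts:", counts)
```

On the toy matrix the weight of the single sample carrying the rare label grows relative to the others, so its weighted positive ratio rises above the initial 1/6. For a dataset of the size of PA-100K the learning rate would need retuning, because the gradient scales with 1 / (1ᵀr).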
Preferably, pedestrian attribute identification with DAI is performed as follows: the original dataset is passed through the DAI algorithm to generate a sub-dataset, the sub-dataset is fed to a prediction network, and the prediction network is then tested.
Preferably, the prediction network is trained alternately on the original dataset and the generated sub-dataset; the prediction network is a classification network whose output dimension equals the number of labels.
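How the integer counts become a sub-dataset, and how the two datasets might alternate during training, can be illustrated as follows. The text does not state the granularity of the alternation, so alternating once per epoch and starting with the original dataset are assumptions, as are the helper names `build_subdataset_indices` and `alternating_schedule`.

```python
import numpy as np

def build_subdataset_indices(counts, seed=0):
    """Expand per-sample counts into a shuffled list of sample indices.

    A count of 0 drops the sample (undersampling); a count above 1
    duplicates it (oversampling).
    """
    counts = np.asarray(counts, dtype=int)
    indices = np.repeat(np.arange(len(counts)), counts)
    rng = np.random.default_rng(seed)
    rng.shuffle(indices)
    return indices

def alternating_schedule(num_epochs):
    # ASSUMPTION: one dataset per epoch, original first, then the sub-dataset.
    return ["original" if e % 2 == 0 else "sub" for e in range(num_epochs)]

counts = [3, 0, 1, 2]                 # hypothetical output of the DAI step
sub_indices = build_subdataset_indices(counts)
for epoch, source in enumerate(alternating_schedule(4)):
    n = len(counts) if source == "original" else len(sub_indices)
    print(f"epoch {epoch}: train on {source} dataset ({n} samples)")
```

Keeping the sub-dataset as an index array rather than copying images means the same image files can back both datasets.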
The invention has the following technical effects and advantages:
1. The invention converts an imbalanced multi-label dataset into a more balanced sub-dataset. The algorithm effectively addresses dataset imbalance in multi-label scenarios, particularly in the field of pedestrian attribute identification;
2. The invention integrates oversampling and undersampling by weighting the samples, which helps the model learn attributes that account for only a small part of the dataset. Compared with recently published techniques based on attention models, directly increasing the number of samples of rare attributes is useful for pedestrian attribute identification. Besides providing convincing attribute predictions, DAI can also help determine whether pedestrians seen by different cameras are the same person, for tracking purposes.
Drawings
FIG. 1 is a diagram of the change in the proportion of sample attributes before and after applying the DAI algorithm of the present invention.
FIG. 2 is a schematic diagram of the pedestrian attribute identification process after DAI is added according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The invention provides a sampling method for the unbalanced problem of multi-label classified data, which is shown in figures 1-2 and comprises the following steps:
S1: A is defined as a label matrix of size m × n, where m is the number of training samples and n is the number of attributes of each image. The PA-100K dataset has 100000 pedestrian images for training, each with 26 classification attributes, so A is a binary matrix of size 100000 × 26. r is a weight vector of length m that represents the number of times each sample appears in the sub-dataset generated by applying DAI to the entire training dataset. In equation (1), 1 denotes a vector whose entries all equal 1 and $\circ$ denotes the Hadamard product. After re-weighting, the proportion of positive samples in each label can be expressed as equation (1):
[Equation (1): image BDA0002307365500000041 in the original publication, not reproduced in this text]
S2: Resampling makes labels with few positive samples more balanced. The objective function to be minimized is equation (2):
$f(r)=\max\left(0,\ (p_{ideal}-p)^{3}\right)$ for $r \geq 0$ (2)
$p_{ideal}$ denotes the ideal proportion of positive samples in each label and is set to 0.6. r is constrained to be greater than or equal to 0 because a negative weight is meaningless.
S3: To minimize function (2), it is converted into the unconstrained form of equation (3), which is minimized by gradient descent:
$f(r)=\max\left(0,\ (p_{ideal}-p)^{3}\right)-\lambda r$, $\lambda>0$ (3)
λ is the regularization parameter. After r is computed, it is rescaled by multiplying it by a constant (5 or 10) and then converted to integers, each of which is the number of times the corresponding sample appears in the sub-dataset.
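The gradient that the descent in S3 would follow can be written out explicitly under two assumptions that are not confirmed by the text: that equation (1), which is not legible here, has the form p = Aᵀr / (1ᵀr), and that f(r) sums the max term over the n labels and the λ term over the m samples.

```latex
% ASSUMPTIONS: p = A^T r / (1^T r); f sums over labels j and samples i.
\begin{aligned}
p_j &= \frac{\sum_{i=1}^{m} A_{ij}\, r_i}{\sum_{k=1}^{m} r_k},
\qquad
\frac{\partial p_j}{\partial r_i} = \frac{A_{ij} - p_j}{\sum_{k=1}^{m} r_k},\\[4pt]
f(r) &= \sum_{j=1}^{n} \max\!\left(0,\ (p_{ideal} - p_j)^{3}\right) - \lambda \sum_{i=1}^{m} r_i,\\[4pt]
\frac{\partial f}{\partial r_i} &= -3 \sum_{j=1}^{n} \max\!\left(0,\ p_{ideal} - p_j\right)^{2}
\frac{A_{ij} - p_j}{\sum_{k=1}^{m} r_k} - \lambda .
\end{aligned}
```

Under these assumptions, a sample that carries a label whose ratio is below $p_{ideal}$ receives a negative contribution to its gradient, so a descent step raises its weight, while labels already at or above $p_{ideal}$ contribute nothing.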
As shown in FIG. 2, pedestrian attribute identification with DAI is performed as follows: the original dataset is passed through the DAI algorithm to generate a sub-dataset, the sub-dataset is fed to a prediction network, and the prediction network is then tested. The prediction network is trained alternately on the original dataset and the generated sub-dataset; it is a classification network whose output dimension equals the number of labels.
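The statement that the output dimension equals the number of labels can be made concrete with a minimal output layer. The text does not describe the network architecture, so the single linear layer, the per-label sigmoid, the 0.5 threshold, and the feature size of 8 are all assumptions chosen for illustration.

```python
import numpy as np

NUM_LABELS = 26      # attribute count given for PA-100K in the text
FEATURE_DIM = 8      # hypothetical size of the backbone's feature vector

def predict_attributes(features, weights, bias, threshold=0.5):
    """Multi-label head: one independent sigmoid output per attribute."""
    logits = features @ weights + bias            # shape (batch, NUM_LABELS)
    probs = 1.0 / (1.0 + np.exp(-logits))
    return (probs >= threshold).astype(int)       # 1 = attribute present

rng = np.random.default_rng(0)
features = rng.normal(size=(4, FEATURE_DIM))      # stand-in for image features
weights = rng.normal(size=(FEATURE_DIM, NUM_LABELS))
bias = np.zeros(NUM_LABELS)
predictions = predict_attributes(features, weights, bias)
print(predictions.shape)
```

All attributes of one pedestrian are predicted in a single forward pass, which matches the text's point that resampling whole samples leaves the correlation between attributes intact.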
Finally, it should be noted that: although the present invention has been described in detail with reference to the foregoing embodiments, it will be apparent to those skilled in the art that modifications may be made to the embodiments or portions thereof without departing from the spirit and scope of the invention.

Claims (3)

1. A sampling method for solving the problem of imbalance of multi-label classified data is characterized by comprising the following steps:
S1: A is defined as a label matrix of size m × n, where m is the number of training samples and n is the number of attributes of each image. The PA-100K dataset has 100000 pedestrian images for training, each with 26 classification attributes, so A is a binary matrix of size 100000 × 26. r is a weight vector of length m that represents the number of times each sample appears in the sub-dataset generated by applying DAI to the entire training dataset. In equation (1), 1 denotes a vector whose entries all equal 1 and $\circ$ denotes the Hadamard product. After re-weighting, the proportion of positive samples in each label can be expressed as equation (1):
[Equation (1): image FDA0002307365490000011 in the original publication, not reproduced in this text]
S2: Resampling makes labels with few positive samples more balanced. The objective function to be minimized is equation (2):
$f(r)=\max\left(0,\ (p_{ideal}-p)^{3}\right)$ for $r \geq 0$ (2)
$p_{ideal}$ denotes the ideal proportion of positive samples in each label and is set to 0.6. r is constrained to be greater than or equal to 0 because a negative weight is meaningless.
S3: To minimize function (2), it is converted into the unconstrained form of equation (3), which is minimized by gradient descent:
$f(r)=\max\left(0,\ (p_{ideal}-p)^{3}\right)-\lambda r$, $\lambda>0$ (3)
λ is the regularization parameter. After r is computed, it is rescaled by multiplying it by a constant (5 or 10) and then converted to integers, each of which is the number of times the corresponding sample appears in the sub-dataset.
2. The sampling method for the multi-label classification data imbalance problem according to claim 1, wherein pedestrian attribute identification with DAI is performed as follows: the original dataset is passed through the DAI algorithm to generate a sub-dataset, the sub-dataset is fed to a prediction network, and the prediction network is then tested.
3. The sampling method for the multi-label classification data imbalance problem according to claim 2, wherein the prediction network is trained alternately on the original dataset and the generated sub-dataset, the prediction network is a classification network, and the output dimension is the number of labels.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911245293.9A CN111079811A (en) 2019-12-06 2019-12-06 Sampling method for multi-label classified data imbalance problem

Publications (1)

Publication Number Publication Date
CN111079811A true CN111079811A (en) 2020-04-28

Family

ID=70313215

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911245293.9A Pending CN111079811A (en) 2019-12-06 2019-12-06 Sampling method for multi-label classified data imbalance problem

Country Status (1)

Country Link
CN (1) CN111079811A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112085080A (en) * 2020-08-31 2020-12-15 北京百度网讯科技有限公司 Sample equalization method, device, equipment and storage medium
CN112085080B (en) * 2020-08-31 2024-03-08 北京百度网讯科技有限公司 Sample equalization method, device, equipment and storage medium
WO2022166325A1 (en) * 2021-02-05 2022-08-11 华为技术有限公司 Multi-label class equalization method and device

Similar Documents

Publication Publication Date Title
CN111126258B (en) Image recognition method and related device
CN108921051B (en) Pedestrian attribute identification network and technology based on cyclic neural network attention model
CN109255289B (en) Cross-aging face recognition method based on unified generation model
CN108509833B (en) Face recognition method, device and equipment based on structured analysis dictionary
CN111582095B (en) Light-weight rapid detection method for abnormal behaviors of pedestrians
CN111179244B (en) Automatic crack detection method based on cavity convolution
CN107301376B (en) Pedestrian detection method based on deep learning multi-layer stimulation
CN109902662B (en) Pedestrian re-identification method, system, device and storage medium
CN111738054B (en) Behavior anomaly detection method based on space-time self-encoder network and space-time CNN
CN114724222B (en) AI digital human emotion analysis method based on multiple modes
CN115131627B (en) Construction and training method of lightweight plant disease and pest target detection model
Javad Shafiee et al. Embedded motion detection via neural response mixture background modeling
CN109670457A (en) A kind of driver status recognition methods and device
CN111079811A (en) Sampling method for multi-label classified data imbalance problem
Liu et al. Development of face recognition system based on PCA and LBP for intelligent anti-theft doors
CN113033665A (en) Sample expansion method, training method and system, and sample learning system
CN113920472A (en) Unsupervised target re-identification method and system based on attention mechanism
CN111008570B (en) Video understanding method based on compression-excitation pseudo-three-dimensional network
CN111126155A (en) Pedestrian re-identification method for generating confrontation network based on semantic constraint
CN114937298A (en) Micro-expression recognition method based on feature decoupling
CN112085540A (en) Intelligent advertisement pushing system and method based on artificial intelligence technology
CN111721770A (en) Automatic crack detection method based on frequency division convolution
CN116168274A (en) Object detection method and object detection model training method
CN116721458A (en) Cross-modal time sequence contrast learning-based self-supervision action recognition method
CN115661539A (en) Less-sample image identification method embedded with uncertainty information

Legal Events

Date Code Title Description
PB01 Publication
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20200428