CN113705439B - Pedestrian attribute identification method based on weak supervision and metric learning - Google Patents


Info

Publication number
CN113705439B
Authority
CN
China
Prior art keywords
attribute
network
feature
pedestrian
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110994829.8A
Other languages
Chinese (zh)
Other versions
CN113705439A (en)
Inventor
谢晓华
彭其阳
杨凌霄
赖剑煌
冯展祥
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sun Yat Sen University
Original Assignee
Sun Yat Sen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sun Yat Sen University filed Critical Sun Yat Sen University
Priority to CN202110994829.8A priority Critical patent/CN113705439B/en
Publication of CN113705439A publication Critical patent/CN113705439A/en
Application granted granted Critical
Publication of CN113705439B publication Critical patent/CN113705439B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)
  • Traffic Control Systems (AREA)

Abstract

The application discloses a pedestrian attribute identification method based on weak supervision and metric learning, comprising the following steps: acquiring an original data set; training an attribute region-of-interest positioning network based on the attribute label information in the original data set to obtain a trained positioning network; using the parameters of the trained positioning network as pre-training parameters for a pedestrian attribute identification network, and training that network on the original data set to obtain a trained attribute identification network; and inputting an image to be detected and performing attribute identification with the trained network to obtain an attribute identification result. The application achieves strong performance in pedestrian attribute identification and can be widely applied in the field of image attribute identification.

Description

Pedestrian attribute identification method based on weak supervision and metric learning
Technical Field
The application relates to the field of image attribute identification, in particular to a pedestrian attribute identification method based on weak supervision and metric learning.
Background
Pedestrian attribute recognition is an important task in image attribute recognition. In recent years, with the rapid development of video surveillance, making effective use of surveillance video in the public security field has become an important research topic. Pedestrian attributes, as semantic features of pedestrians (generally including age, gender, clothing attributes and the like), can establish a connection between the low-level features of pedestrian images and high-level semantics, and can thus assist a wide range of applications such as pedestrian re-identification and pedestrian retrieval. Accurate positioning of attribute regions of interest has become one of the important factors restricting pedestrian attribute recognition performance. Current approaches often roughly divide the image into region blocks according to the spatial position of each attribute and then feed the blocks into the feature extraction network corresponding to that attribute, but such coarse blocks have difficulty accurately locating the attribute region of interest. Some methods use existing spatial transformation networks, but these require adding extra network structure and learning affine transformation parameters.
Disclosure of Invention
In order to solve the technical problems, the application aims to provide a pedestrian attribute identification method based on weak supervision and metric learning, which has better performance in pedestrian attribute identification.
The technical scheme adopted by the application is as follows: the pedestrian attribute identification method based on weak supervision and metric learning comprises the following steps:
acquiring an original data set;
training an attribute interested region positioning network based on the attribute tag information in the original data set to obtain a trained attribute interested region positioning network;
taking the parameters of the trained attribute region of interest positioning network as pre-training parameters of the pedestrian attribute identification network, and training the pedestrian attribute identification network based on the original data set to obtain a trained attribute identification network;
and inputting the image to be detected, and carrying out attribute identification based on the attribute identification network after training is completed, so as to obtain an attribute identification result.
Further, the attribute region of interest positioning network is of a multi-level network structure, each level comprises a residual error module, an attribute prediction module and a pooling layer, the network structure of the residual error module adopts a residual error structure of a residual error network resnet, and the attribute prediction module comprises a convolution layer and a batch normalization layer.
Further, the step of training the attribute interested area positioning network based on the attribute tag information in the original data set to obtain a trained attribute interested area positioning network specifically includes:
inputting pedestrian images in the original data set into an attribute region of interest positioning network;
carrying out attribute feature extraction and attribute prediction through a residual error module and an attribute prediction module of each layer of the attribute region of interest positioning network, and calculating a maximum response value of a corresponding feature map through a pooling layer;
positioning the spatial position of the region of interest according to the maximum response value of the feature map;
and supervising with the attribute label information in the original data set, completing training of the attribute region-of-interest positioning network by minimizing a cross entropy loss function, to obtain the trained attribute region-of-interest positioning network.
Further, the attribute identification network comprises a feature extractor and a classifier, wherein the network structure of the feature extractor is consistent with that of the attribute region-of-interest positioning network, and the classifier adopts a classification neural network.
Further, the step of training the pedestrian attribute recognition network based on the original data set by taking the parameters of the trained attribute region of interest positioning network as the pre-training parameters of the pedestrian attribute recognition network to obtain the trained attribute recognition network specifically comprises the following steps:
taking the parameters of the attribute interest region positioning network after training as pre-training parameters of a feature extractor in the pedestrian attribute identification network;
inputting the pedestrian image in the original data set to a feature extractor;
performing feature extraction with the residual module of the feature extractor to obtain feature X_l, where l denotes the level;
inputting feature X_l into the attribute prediction module of the feature extractor to obtain predicted attribute feature A_l;
the dimension of the predicted attribute feature A_l is n × w × h, where n is the number of attributes and w and h are respectively the width and height of the feature map;
inputting predicted attribute feature A_l into the pooling layer of the feature extractor to obtain the maximum response position (x_n^l, y_n^l) of attribute n on the feature map;
sampling feature X_l at the maximum response position (x_n^l, y_n^l) to obtain feature f_n^l, the feature expression corresponding to the n-th attribute at the l-th level;
performing an average pooling operation on the output of the residual module at the final level of the feature extractor to obtain the pedestrian global feature f_global;
concatenating the features f_n^l sampled at every level with the pedestrian global feature f_global to obtain the fused feature f_n;
inputting the fused feature f_n into the classifier to obtain attribute prediction scores;
and constraining the classifier based on the attribute prediction score, the real labels in the original data set and the contrast loss function to obtain the attribute identification network after training.
Further, the formula for obtaining the maximum response position (x_n^l, y_n^l) of attribute n on the feature map is as follows:
(x_n^l, y_n^l) = argmax_{1≤x≤w, 1≤y≤h} A_l(n, x, y)
In the above formula, w and h are the width and height of the feature map, x is the abscissa of the maximum response position, and y is the ordinate of the maximum response position.
Further, the formula of the contrast loss function is as follows:
L_contrast = Σ_n [ y_n · E(f_n, center_n)² + (1 − y_n) · max(0, margin − E(f_n, center_n))² ]
In the above formula, E represents the Euclidean distance, y_n represents the label of the n-th attribute, center_n represents the attribute center feature expression, and margin is a preset interval threshold representing the expected minimum Euclidean distance between a negative sample feature and the attribute center feature.
The beneficial effects of the method and system are as follows: the application can accurately position attribute regions of interest without relying on a trained pedestrian pose estimation model or pedestrian pose label information. Unlike traditional attribute region-of-interest positioning methods, it converts the traditional image-level region-block positioning problem into a feature-map-level point positioning problem, thereby ensuring the accuracy of attribute region-of-interest positioning while avoiding the introduction of additional network parameters.
Drawings
FIG. 1 is a schematic step diagram of the pedestrian attribute identification method based on weak supervision and metric learning of the present application;
FIG. 2 is a flow chart of a pedestrian attribute identification method based on weak supervision and metric learning of the present application.
Detailed Description
The application will now be described in further detail with reference to the drawings and to specific examples. The step numbers in the following embodiments are set for convenience of illustration only, and the order between the steps is not limited in any way, and the execution order of the steps in the embodiments may be adaptively adjusted according to the understanding of those skilled in the art.
Referring to fig. 1 and 2, the application provides a pedestrian attribute identification method based on weak supervision and metric learning, comprising the following steps:
acquiring an original data set;
training an attribute interested region positioning network based on the attribute tag information in the original data set to obtain a trained attribute interested region positioning network;
taking the parameters of the trained attribute region of interest positioning network as pre-training parameters of the pedestrian attribute identification network, and training the pedestrian attribute identification network based on the original data set to obtain a trained attribute identification network;
and inputting the image to be detected, and carrying out attribute identification based on the attribute identification network after training is completed, so as to obtain an attribute identification result.
Specifically, the method consists of two training stages. In the first stage, the attribute region-of-interest positioning network is trained under attribute label supervision. In the second stage, the parameters of the first-stage network are used to initialize the second-stage network, the extracted pedestrian attribute features are fused with the pedestrian global features, and the attribute identification network is trained under the supervision of a contrast loss and a traditional classification cross entropy loss function to complete the final attribute prediction.
Further as a preferred embodiment of the method, the attribute region of interest positioning network is a multi-level network structure, each level includes a residual module, an attribute prediction module and a pooling layer, the network structure of the residual module adopts a residual structure of a residual network resnet, and the attribute prediction module includes a convolution layer and a batch normalization layer.
In particular, in order to achieve accurate attribute region-of-interest positioning without introducing additional labels or spatial transformation network structures, the traditional classification neural network is modified: the fully connected layer used for classification is replaced by a convolution layer and a max pooling layer, where the number of convolution kernels equals the number of attributes to be identified and each convolution kernel is responsible for extracting the features of one attribute. The attribute prediction score is then obtained through the max pooling operation. Since the predicted score comes only from the maximum response point in the feature map, that response point is regarded as the spatial mapping point of the attribute on the feature map. Different levels of a neural network are intended to encode different levels of information: high-level features attend more to semantic information and less to detail, while low-level features attend more to detail. The attributes to be predicted carry semantic information at various levels (for example, gender is a high-level semantic feature, while clothing texture is low-level semantic information), so in order to extract attribute features at different levels, the attribute region-of-interest positioning network adopts a multi-level network structure.
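The modified prediction head described above (per-attribute convolution kernels followed by global max pooling) can be sketched in plain Python. This is a minimal illustration using nested lists as toy tensors; the function name, the 1×1-convolution simplification, and the shapes are assumptions for exposition, not the patent's actual implementation:

```python
def predict_attributes(feature_map, weights):
    """Replace the FC classifier with per-attribute 1x1 convolutions:
    each attribute's response map is a weighted sum over channels, and
    its prediction score is the max response over spatial positions.
    feature_map: nested list [channel][y][x]; weights: one vector per attribute."""
    c = len(feature_map)                          # channel count
    h, w = len(feature_map[0]), len(feature_map[0][0])
    scores, locations = [], []
    for wt in weights:                            # one weight vector per attribute
        # 1x1 convolution: response[y][x] = sum over channels of wt[ch] * X[ch][y][x]
        response = [[sum(wt[ch] * feature_map[ch][y][x] for ch in range(c))
                     for x in range(w)] for y in range(h)]
        # global max pooling yields the score; its position locates the ROI point
        best = max((response[y][x], x, y) for y in range(h) for x in range(w))
        scores.append(best[0])
        locations.append((best[1], best[2]))      # (x_n^l, y_n^l)
    return scores, locations
```

Because the score is exactly the maximum response, the argmax position falls out of the same pass, which is what lets the network locate the region of interest without learning any extra positioning parameters.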
Further as a preferred embodiment of the method, the step of training the attribute interested area positioning network based on the attribute tag information in the original dataset to obtain a trained attribute interested area positioning network specifically includes:
inputting pedestrian images in the original data set into an attribute region of interest positioning network;
carrying out attribute feature extraction and attribute prediction through a residual error module and an attribute prediction module of each layer of the attribute region of interest positioning network, and calculating a maximum response value of a corresponding feature map through a pooling layer;
positioning the spatial position of the region of interest according to the maximum response value of the feature map;
and supervising with the attribute label information in the original data set, completing training of the attribute region-of-interest positioning network by minimizing a cross entropy loss function, to obtain the trained attribute region-of-interest positioning network.
Further as a preferred embodiment of the method, the attribute identification network comprises a feature extractor and a classifier, the network structure of the feature extractor is consistent with the network structure of the attribute region of interest location network, and the classifier adopts a classification neural network.
Specifically, the classifier adopts a traditional classification neural network, that is, it consists of a max pooling layer, a fully connected layer and a batch normalization layer; the attribute prediction score is obtained through the classifier.
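As a toy illustration of that classifier head, a fully connected layer followed by the inference-time affine form of batch normalization can be sketched as follows (the function name, the scalar output, and folding the running statistics into gamma/beta are illustrative assumptions):

```python
def classifier_score(fused_feature, fc_weights, fc_bias, bn_gamma=1.0, bn_beta=0.0):
    """Toy stand-in for the classifier head: a fully connected layer
    followed by batch normalization in its inference-time affine form
    (running mean/variance assumed folded into bn_gamma and bn_beta)."""
    z = sum(w * f for w, f in zip(fc_weights, fused_feature)) + fc_bias  # FC layer
    return bn_gamma * z + bn_beta                                        # BN affine
```

At inference time batch normalization reduces to a fixed scale-and-shift, which is why it can be written as a single affine step here.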
Further as a preferred embodiment of the method, the step of training the pedestrian attribute recognition network based on the original dataset by using the parameters of the trained attribute interest area positioning network as the pre-training parameters of the pedestrian attribute recognition network to obtain the trained attribute recognition network specifically includes:
taking the parameters of the attribute interest region positioning network after training as pre-training parameters of a feature extractor in the pedestrian attribute identification network;
inputting the pedestrian image in the original data set to a feature extractor;
performing feature extraction with the residual module of the feature extractor to obtain feature X_l, where l denotes the level;
inputting feature X_l into the attribute prediction module of the feature extractor to obtain predicted attribute feature A_l;
the dimension of the predicted attribute feature A_l is n × w × h, where n is the number of attributes and w and h are respectively the width and height of the feature map; inputting predicted attribute feature A_l into the pooling layer of the feature extractor to obtain the maximum response position (x_n^l, y_n^l) of attribute n on the feature map;
sampling feature X_l at the maximum response position (x_n^l, y_n^l) to obtain feature f_n^l, the feature expression corresponding to the n-th attribute at the l-th level;
performing an average pooling operation on the output of the residual module at the final level of the feature extractor to obtain the pedestrian global feature f_global;
concatenating the features f_n^l sampled at every level with the pedestrian global feature f_global to obtain the fused feature f_n;
Specifically, the splice formula is as follows:
f_n = [f_n^1; f_n^2; …; f_n^L; f_global]
where [·;·] denotes channel-wise concatenation and L is the number of levels;
inputting the fused feature f_n into the classifier to obtain attribute prediction scores;
and constraining the classifier based on the attribute prediction score, the real labels in the original data set and the contrast loss function to obtain the attribute identification network after training.
Specifically, to prevent subsequent training from destroying the attribute region-of-interest positioning function already learned, the learning rate of the feature extractor is set to 0.0001 while that of the classifier is 0.01. The trained positioning network parameters serve as pre-training parameters for the second-stage attribute identification network's feature extractor: the spatial position of each attribute region of interest is located via the maximum response of the attribute prediction module's output feature map, and the residual module output features of each level are sampled according to this positioning information, yielding the features corresponding to each attribute.
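The per-level sampling and fusion steps above can be sketched in plain Python. This is a toy version with nested-list features; the function name and argument layout are illustrative assumptions, not the patent's implementation:

```python
def sample_and_fuse(level_features, locations, global_feature):
    """For each attribute n, gather the channel vector f_n^l at its
    maximum-response position (x, y) from every level l, then concatenate
    all of them with the global feature: f_n = [f_n^1; ...; f_n^L; f_global].
    level_features: per level, a nested list X_l indexed [channel][y][x].
    locations: per level, per attribute, an (x, y) maximum-response position."""
    n_attrs = len(locations[0])
    fused = []
    for n in range(n_attrs):
        parts = []
        for X_l, locs_l in zip(level_features, locations):
            x, y = locs_l[n]
            # sample the channel vector at the located point: f_n^l
            parts.extend(X_l[ch][y][x] for ch in range(len(X_l)))
        parts.extend(global_feature)      # append f_global
        fused.append(parts)               # fused feature f_n for attribute n
    return fused
```

Concatenation rather than addition is used so the classifier can weight per-level attribute evidence and global context independently.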
Further as a preferred embodiment of the method, the formula for obtaining the maximum response position (x_n^l, y_n^l) of attribute n on the feature map is as follows:
(x_n^l, y_n^l) = argmax_{1≤x≤w, 1≤y≤h} A_l(n, x, y)
In the above formula, w and h are the width and height of the feature map, x is the abscissa of the maximum response position, and y is the ordinate of the maximum response position.
Further as a preferred embodiment of the method, the formula of the contrast loss function is as follows:
L_contrast = Σ_n [ y_n · E(f_n, center_n)² + (1 − y_n) · max(0, margin − E(f_n, center_n))² ]
In the above formula, E represents the Euclidean distance, y_n represents the label of the n-th attribute, center_n represents the attribute center feature expression, and margin is a preset interval threshold representing the expected minimum Euclidean distance between a negative sample feature and the attribute center feature.
Finally, under the dual constraint of the contrast loss and the traditional cross entropy loss function, the proposed pedestrian attribute identification network achieves higher recognition performance.
While the preferred embodiment of the present application has been described in detail, the application is not limited to the embodiment, and various equivalent modifications and substitutions can be made by those skilled in the art without departing from the spirit of the application, and these equivalent modifications and substitutions are intended to be included in the scope of the present application as defined in the appended claims.

Claims (5)

1. The pedestrian attribute identification method based on weak supervision and metric learning is characterized by comprising the following steps of:
acquiring an original data set;
training an attribute interested region positioning network based on the attribute tag information in the original data set to obtain a trained attribute interested region positioning network;
taking the parameters of the trained attribute region of interest positioning network as pre-training parameters of the pedestrian attribute identification network, and training the pedestrian attribute identification network based on the original data set to obtain a trained attribute identification network;
inputting an image to be detected, and carrying out attribute identification based on the attribute identification network after training to obtain an attribute identification result;
the attribute interested area positioning network is of a multi-level network structure, each level comprises a residual error module, an attribute prediction module and a pooling layer, the network structure of the residual error module adopts a residual error structure of a residual error network resnet, and the attribute prediction module comprises a convolution layer and a batch normalization layer;
the step of training the pedestrian attribute recognition network based on the original data set to obtain the trained attribute recognition network specifically comprises the following steps:
taking the parameters of the attribute interest region positioning network after training as pre-training parameters of a feature extractor in the pedestrian attribute identification network;
inputting the pedestrian image in the original data set to a feature extractor;
performing feature extraction with the residual module of the feature extractor to obtain feature X_l, where l denotes the level;
inputting feature X_l into the attribute prediction module of the feature extractor to obtain predicted attribute feature A_l;
the dimension of the predicted attribute feature A_l is n × w × h, where n is the number of attributes and w and h are respectively the width and height of the feature map;
inputting predicted attribute feature A_l into the pooling layer of the feature extractor to obtain the maximum response position (x_n^l, y_n^l) of attribute n on the feature map;
sampling feature X_l at the maximum response position (x_n^l, y_n^l) to obtain feature f_n^l, the feature expression corresponding to the n-th attribute at the l-th level;
performing an average pooling operation on the output of the residual module at the final level of the feature extractor to obtain the pedestrian global feature f_global;
concatenating the features f_n^l sampled at every level with the pedestrian global feature f_global to obtain the fused feature f_n;
inputting the fused feature f_n into the classifier to obtain attribute prediction scores;
and constraining the classifier based on the attribute prediction score, the real labels in the original data set and the contrast loss function to obtain the attribute identification network after training.
2. The pedestrian attribute identification method based on weak supervision and metric learning according to claim 1, wherein the step of training the attribute region of interest positioning network based on the attribute tag information in the original dataset to obtain a trained attribute region of interest positioning network specifically comprises the following steps:
inputting pedestrian images in the original data set into an attribute region of interest positioning network;
carrying out attribute feature extraction and attribute prediction through a residual error module and an attribute prediction module of each layer of the attribute region of interest positioning network, and calculating a maximum response value of a corresponding feature map through a pooling layer;
positioning the spatial position of the region of interest according to the maximum response value of the feature map;
and supervising with the attribute label information in the original data set, completing training of the attribute region-of-interest positioning network by minimizing a cross entropy loss function, to obtain the trained attribute region-of-interest positioning network.
3. The pedestrian attribute identification method based on weak supervision and metric learning according to claim 2, wherein the attribute identification network comprises a feature extractor and a classifier, the network structure of the feature extractor is consistent with the network structure of the attribute region-of-interest location network, and the classifier adopts a classification neural network.
4. A pedestrian attribute recognition method based on weak supervision and metric learning as defined in claim 3, wherein the formula for obtaining the maximum response position (x_n^l, y_n^l) of attribute n on the feature map is as follows:
(x_n^l, y_n^l) = argmax_{1≤x≤w, 1≤y≤h} A_l(n, x, y)
In the above formula, w and h are the width and height of the feature map, x is the abscissa of the maximum response position, and y is the ordinate of the maximum response position.
5. The pedestrian attribute identification method based on weak supervision and metric learning of claim 4, wherein the formula of the contrast loss function is as follows:
L_contrast = Σ_n [ y_n · E(f_n, center_n)² + (1 − y_n) · max(0, margin − E(f_n, center_n))² ]
In the above formula, E represents the Euclidean distance, y_n represents the label of the n-th attribute, center_n represents the attribute center feature expression, and margin is a preset interval threshold representing the expected minimum Euclidean distance between a negative sample feature and the attribute center feature.
CN202110994829.8A 2021-08-27 2021-08-27 Pedestrian attribute identification method based on weak supervision and metric learning Active CN113705439B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110994829.8A CN113705439B (en) 2021-08-27 2021-08-27 Pedestrian attribute identification method based on weak supervision and metric learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110994829.8A CN113705439B (en) 2021-08-27 2021-08-27 Pedestrian attribute identification method based on weak supervision and metric learning

Publications (2)

Publication Number Publication Date
CN113705439A CN113705439A (en) 2021-11-26
CN113705439B true CN113705439B (en) 2023-09-08

Family

ID=78655855

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110994829.8A Active CN113705439B (en) 2021-08-27 2021-08-27 Pedestrian attribute identification method based on weak supervision and metric learning

Country Status (1)

Country Link
CN (1) CN113705439B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107766850A (en) * 2017-11-30 2018-03-06 电子科技大学 Based on the face identification method for combining face character information
CN110598543A (en) * 2019-08-05 2019-12-20 华中科技大学 Model training method based on attribute mining and reasoning and pedestrian re-identification method
CN110728216A (en) * 2019-09-27 2020-01-24 西北工业大学 Unsupervised pedestrian re-identification method based on pedestrian attribute adaptive learning
CN111881714A (en) * 2020-05-22 2020-11-03 北京交通大学 Unsupervised cross-domain pedestrian re-identification method

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107766850A (en) * 2017-11-30 2018-03-06 电子科技大学 Based on the face identification method for combining face character information
CN110598543A (en) * 2019-08-05 2019-12-20 华中科技大学 Model training method based on attribute mining and reasoning and pedestrian re-identification method
CN110728216A (en) * 2019-09-27 2020-01-24 西北工业大学 Unsupervised pedestrian re-identification method based on pedestrian attribute adaptive learning
CN111881714A (en) * 2020-05-22 2020-11-03 北京交通大学 Unsupervised cross-domain pedestrian re-identification method

Also Published As

Publication number Publication date
CN113705439A (en) 2021-11-26

Similar Documents

Publication Publication Date Title
CN110414462B (en) Unsupervised cross-domain pedestrian re-identification method and system
CN110807434B (en) Pedestrian re-recognition system and method based on human body analysis coarse-fine granularity combination
CN108288051B (en) Pedestrian re-recognition model training method and device, electronic equipment and storage medium
CN111476284A (en) Image recognition model training method, image recognition model training device, image recognition method, image recognition device and electronic equipment
CN110909820A (en) Image classification method and system based on self-supervision learning
JP2016134175A (en) Method and system for performing text-to-image queries with wildcards
CN112069940A (en) Cross-domain pedestrian re-identification method based on staged feature learning
CN110188827B (en) Scene recognition method based on convolutional neural network and recursive automatic encoder model
CN109753897B (en) Behavior recognition method based on memory cell reinforcement-time sequence dynamic learning
CN109583375B (en) Multi-feature fusion face image illumination identification method and system
CN112836675B (en) Unsupervised pedestrian re-identification method and system for generating pseudo tags based on clusters
CN115641613A (en) Unsupervised cross-domain pedestrian re-identification method based on clustering and multi-scale learning
CN111291705B (en) Pedestrian re-identification method crossing multiple target domains
CN112836702A (en) Text recognition method based on multi-scale feature extraction
CN113989577B (en) Image classification method and device
CN110647897B (en) Zero sample image classification and identification method based on multi-part attention mechanism
CN113221814A (en) Road traffic sign identification method, equipment and storage medium
CN117437426A (en) Semi-supervised semantic segmentation method for high-density representative prototype guidance
CN116630753A (en) Multi-scale small sample target detection method based on contrast learning
CN113705439B (en) Pedestrian attribute identification method based on weak supervision and metric learning
CN116776885A (en) Three-stage-based small sample nested named entity identification method and system
Turtinen et al. Contextual analysis of textured scene images.
CN114120074B (en) Training method and training device for image recognition model based on semantic enhancement
Mars et al. Combination of DE-GAN with CNN-LSTM for Arabic OCR on Images with Colorful Backgrounds
Mao et al. An image authentication technology based on depth residual network

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant