CN110705474A - Pedestrian attribute identification method and device

Pedestrian attribute identification method and device

Info

Publication number
CN110705474A
CN110705474A
Authority
CN
China
Prior art keywords
layer
attribute
composite
pedestrian
feature
Prior art date
Legal status
Granted
Application number
CN201910943815.6A
Other languages
Chinese (zh)
Other versions
CN110705474B (en)
Inventor
胡晓林
唐楚峰
Current Assignee
Tsinghua University
Original Assignee
Tsinghua University
Priority date
Filing date
Publication date
Application filed by Tsinghua University filed Critical Tsinghua University
Priority to CN201910943815.6A priority Critical patent/CN110705474B/en
Publication of CN110705474A publication Critical patent/CN110705474A/en
Application granted granted Critical
Publication of CN110705474B publication Critical patent/CN110705474B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks

Abstract

The disclosure relates to a pedestrian attribute identification method and device. The method comprises: performing feature extraction on a sample image according to a convolutional neural network to obtain a plurality of initial feature layers, wherein the sample image is an image in a preset pedestrian attribute sample set and the images in the set have a plurality of pedestrian attributes; performing layer-by-layer feature fusion on the plurality of initial feature layers from top to bottom to obtain a plurality of composite feature layers; and determining, according to a spatial transformation network and the plurality of composite feature layers, an attribute positioning and identifying module that positions and identifies each pedestrian attribute on each composite feature layer. The accuracy and efficiency of pedestrian attribute identification can thereby be improved.

Description

Pedestrian attribute identification method and device
Technical Field
The present disclosure relates to the field of computer vision technologies, and in particular, to a method and an apparatus for identifying a pedestrian attribute.
Background
Pedestrian attribute recognition refers to using a computer to predict and analyze various types of attribute information about pedestrians in an image. Common pedestrian attributes include macroscopic attributes such as gender, skin color, age and posture, as well as specific appearance attributes such as backpack type, clothing type and color, trousers type and color, and current action. In recent years, pedestrian attribute recognition has attracted much attention, and research on it is of high value in academic research, industrial application and other fields. Existing deep-learning-based pedestrian attribute recognition algorithms fall mainly into two categories: global and local. Global algorithms treat pedestrian attribute recognition as a multi-label classification problem: a convolutional neural network (CNN) extracts the required features from the whole input picture, a fully connected layer at the top of the network performs attribute prediction, and all attributes share the same features. Local algorithms focus on the local regions of the input image that matter for classification prediction: a human pose estimation model is trained in advance to predict the human key points of the input image, local regions such as the head, upper body and lower body are roughly located from those key points, and the located regions then guide attribute classification.
Disclosure of Invention
In view of this, the present disclosure provides a method and an apparatus for identifying a pedestrian attribute, so as to effectively improve the accuracy and efficiency of identifying the pedestrian attribute.
According to a first aspect of the present disclosure, there is provided a pedestrian attribute identification method, including: performing feature extraction on a sample image according to a convolutional neural network to obtain a plurality of initial feature layers, wherein the sample image is an image in a preset pedestrian attribute sample set, and the images in the preset pedestrian attribute sample set have a plurality of pedestrian attributes; performing layer-by-layer feature fusion on the plurality of initial feature layers from top to bottom to obtain a plurality of composite feature layers; and determining, according to a spatial transformation network and the plurality of composite feature layers, an attribute positioning and identifying module that positions and identifies each pedestrian attribute on each composite feature layer.
In a possible implementation manner, performing layer-by-layer feature fusion on the plurality of initial feature layers from top to bottom to obtain a plurality of composite feature layers includes: for the highest-level initial feature layer, directly determining it as the corresponding composite feature layer; and for each non-highest-level initial feature layer, performing feature fusion between that layer and the composite feature layer corresponding to the initial feature layer one level above it, to obtain the composite feature layer corresponding to that non-highest-level initial feature layer.
In one possible implementation, the non-highest-level initial feature layer is φ_i, the initial feature layer one level above φ_i is φ_{i+1}, and the composite feature layer corresponding to φ_{i+1} is X_{i+1}. For the non-highest-level initial feature layer, performing feature fusion between it and the composite feature layer corresponding to the initial feature layer one level above it to obtain its corresponding composite feature layer includes: up-sampling the composite feature layer X_{i+1} to the same resolution as the initial feature layer φ_i, obtaining the up-sampled composite feature layer X_{i+1}; and concatenating the up-sampled composite feature layer X_{i+1} with the initial feature layer φ_i along the channel dimension to obtain the composite feature layer X_i corresponding to φ_i, where the number of channels of X_i is the sum of the numbers of channels of the composite feature layer X_{i+1} and the initial feature layer φ_i.
In a possible implementation manner, determining, according to the spatial transformation network and the plurality of composite feature layers, an attribute positioning and identifying module that positions and identifies each pedestrian attribute on each composite feature layer includes: for any pedestrian attribute, determining the positioning and identification result of the pedestrian attribute on each composite feature layer according to the attribute positioning and identifying module that positions and identifies the pedestrian attribute on that layer; for any pedestrian attribute, determining the global identification result of the pedestrian attribute according to the highest-level initial feature layer; and training the attribute positioning and identifying modules that position and identify each pedestrian attribute on each composite feature layer according to the positioning and identification result of each pedestrian attribute on each composite feature layer, the global identification result of each pedestrian attribute, and the real attribute label of each pedestrian attribute in the sample image, to obtain the attribute positioning and identifying module that positions and identifies each pedestrian attribute on each composite feature layer.
In one possible implementation, for any pedestrian attribute, determining the positioning and identification result of the pedestrian attribute on each composite feature layer according to the attribute positioning and identifying module that positions and identifies the pedestrian attribute on that layer includes: for any composite feature layer, the attribute positioning and identifying module that positions and identifies the pedestrian attribute on the composite feature layer determines the positioning and identification result of the pedestrian attribute on the composite feature layer through the following steps: passing the composite feature layer through a first fully connected layer to obtain transformation parameters s_x, s_y, t_x and t_y, where s_x is the scaling transformation parameter in the horizontal direction, s_y is the scaling transformation parameter in the vertical direction, t_x is the translation transformation parameter in the horizontal direction, and t_y is the translation transformation parameter in the vertical direction; determining, according to the transformation parameters s_x, s_y, t_x and t_y, the local feature corresponding to the pedestrian attribute in the composite feature layer; and passing the local feature corresponding to the pedestrian attribute through a second fully connected layer to obtain the positioning and identification result of the pedestrian attribute on the composite feature layer.
In one possible implementation, the method further includes: performing the following feature calibration process on the composite feature layer before passing the composite feature layer through the first fully connected layer: sequentially passing the composite feature layer through a global average pooling layer, a 1 × 1 convolution layer, a ReLU activation layer, a 1 × 1 convolution layer and a Sigmoid activation layer to obtain a first calibration vector; multiplying the composite feature layer and the first calibration vector channel by channel to obtain a second calibration vector; and adding the composite feature layer and the second calibration vector element by element to obtain a calibrated composite feature layer.
In a possible implementation, determining the local feature corresponding to the pedestrian attribute in the composite feature layer according to the transformation parameters s_x, s_y, t_x and t_y includes: determining a rectangular bounding box in the composite feature layer according to the transformation parameters s_x, s_y, t_x and t_y; and extracting the features inside the rectangular bounding box from the composite feature layer and determining them as the local feature corresponding to the pedestrian attribute.
In a possible implementation manner, training an attribute positioning and identifying module that positions and identifies each pedestrian attribute on each composite feature layer, according to the positioning and identification result of each pedestrian attribute on each composite feature layer, the global identification result of each pedestrian attribute, and the real attribute label of each pedestrian attribute in the sample image, to obtain the attribute positioning and identifying module that positions and identifies each pedestrian attribute on each composite feature layer, includes: training the attribute positioning and identifying module through the following cross-entropy loss function:

$$\mathcal{L}_i = -\frac{1}{M}\sum_{m=1}^{M}\gamma_m\left[y_m\log\hat{y}_m^{i} + (1-y_m)\log\left(1-\hat{y}_m^{i}\right)\right]$$

where $\mathcal{L}_i$ is the training loss of the i-th feature layer, M is the number of the plurality of pedestrian attributes, $y_m$ is the real attribute label of the m-th pedestrian attribute, $\hat{y}_m^{i}$ is the identification result of the m-th pedestrian attribute on the i-th feature layer, $\gamma_m$ is the weight of the m-th pedestrian attribute, σ is a preset parameter used in computing the weights, and the i-th feature layer is a composite feature layer or an initial feature layer.
In one possible implementation, the method further includes: for any pedestrian attribute, determining a positioning identification result of the pedestrian attribute in the test image on each composite feature layer according to an attribute positioning identification module which positions and identifies the pedestrian attribute on each composite feature layer; determining a global identification result of the pedestrian attribute in the test image according to the highest level initial feature layer; and determining the attribute recognition result of the pedestrian attribute in the test image according to the positioning recognition result of the pedestrian attribute in the test image on each composite feature layer and the global recognition result of the pedestrian attribute in the test image.
According to a second aspect of the present disclosure, there is provided a pedestrian attribute identification device, including: a feature extraction module, configured to perform feature extraction on a sample image according to a convolutional neural network to obtain a plurality of initial feature layers, wherein the sample image is an image in a preset pedestrian attribute sample set, and the images in the preset pedestrian attribute sample set have a plurality of pedestrian attributes; a feature fusion module, configured to perform layer-by-layer feature fusion on the plurality of initial feature layers from top to bottom to obtain a plurality of composite feature layers; and an attribute positioning and identifying module, configured to determine, according to a spatial transformation network and the plurality of composite feature layers, an attribute positioning and identifying module that positions and identifies each pedestrian attribute on each composite feature layer.
The method performs feature extraction on the sample image according to the convolutional neural network to obtain a plurality of initial feature layers, where the sample image is an image in a preset pedestrian attribute sample set and the images in the set have a plurality of pedestrian attributes; performs layer-by-layer feature fusion on the plurality of initial feature layers from top to bottom to obtain a plurality of composite feature layers; and determines, according to the spatial transformation network and the plurality of composite feature layers, an attribute positioning and identifying module that positions and identifies each pedestrian attribute on each composite feature layer. The method and the device can locate the local region corresponding to each pedestrian attribute in the image and then identify the pedestrian attribute based on the local feature, thereby improving the accuracy and efficiency of pedestrian attribute identification.
Other features and aspects of the present disclosure will become apparent from the following detailed description of exemplary embodiments, which proceeds with reference to the accompanying drawings.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate exemplary embodiments, features, and aspects of the disclosure and, together with the description, serve to explain the principles of the disclosure.
Fig. 1 shows a flow diagram of a pedestrian attribute identification method according to an embodiment of the disclosure;
FIG. 2 illustrates a schematic diagram of a pedestrian attribute identification system of an embodiment of the present disclosure;
FIG. 3 shows a schematic diagram of an attribute location module of an embodiment of the present disclosure;
FIG. 4 is a schematic diagram illustrating a pedestrian attribute zone location result according to an embodiment of the present disclosure;
fig. 5 is a schematic structural diagram of a pedestrian attribute identification device according to an embodiment of the present disclosure.
Detailed Description
Various exemplary embodiments, features and aspects of the present disclosure will be described in detail below with reference to the accompanying drawings. In the drawings, like reference numbers can indicate functionally identical or similar elements. While the various aspects of the embodiments are presented in drawings, the drawings are not necessarily drawn to scale unless specifically indicated.
The word "exemplary" is used exclusively herein to mean "serving as an example, embodiment, or illustration." Any embodiment described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments.
Furthermore, in the following detailed description, numerous specific details are set forth in order to provide a better understanding of the present disclosure. It will be understood by those skilled in the art that the present disclosure may be practiced without some of these specific details. In some instances, methods, means, elements and circuits that are well known to those skilled in the art have not been described in detail so as not to obscure the present disclosure.
Fig. 1 shows a flowchart of a pedestrian attribute identification method according to an embodiment of the present disclosure. As shown in fig. 1, the method may include:
and step S11, extracting the characteristics of the sample image according to the convolutional neural network to obtain a plurality of initial characteristic layers, wherein the sample image is an image in a preset pedestrian attribute sample set, and the preset pedestrian attribute sample set comprises a plurality of pedestrian attributes.
And step S12, performing layer-by-layer feature fusion on the plurality of initial feature layers from top to bottom to obtain a plurality of composite feature layers.
And step S13, determining, according to the spatial transformation network and the plurality of composite feature layers, an attribute positioning and identifying module that positions and identifies each pedestrian attribute on each composite feature layer.
The method and the device can locate the local region corresponding to each pedestrian attribute in the image and then identify the pedestrian attribute based on the local feature, which not only improves the accuracy and efficiency of pedestrian attribute identification, but also explicitly locates the semantic region of each pedestrian attribute in the image, improving the interpretability of the pedestrian attribute identification algorithm.
Fig. 2 shows a schematic diagram of a pedestrian attribute identification system of an embodiment of the present disclosure. As shown in fig. 2, the preprocessed sample images in the preset pedestrian attribute sample set are input into the pedestrian attribute identification system, where each sample image contains only one pedestrian and has a preset resolution (e.g., 256 × 128). The pedestrian attribute identification system may output identification results for a plurality of pedestrian attributes, where the number of pedestrian attributes is set by the preset pedestrian attribute sample set. For example, a common pedestrian attribute sample set includes 51 pedestrian attributes. For any pedestrian attribute, the identification result is whether the attribute exists in the original image. The preset pedestrian attribute sample set includes the real attribute label of each pedestrian attribute in each sample image, i.e., for the sample images, the real identification result of each pedestrian attribute is known in advance.
The identification process of the pedestrian attribute identification system shown in fig. 2 will be described in detail below.
Feature extraction is performed on the input sample image according to the convolutional neural network to obtain a plurality of initial feature layers. Taking fig. 2 as an example again, as shown in fig. 2, feature extraction is performed at three different positions of the basic backbone network to obtain an initial feature layer φ_1 with a resolution of 32 × 16, an initial feature layer φ_2 with a resolution of 16 × 8, and an initial feature layer φ_3 with a resolution of 8 × 4.
In a deep convolutional neural network, higher-layer features generally carry stronger semantic information but have lower resolution and therefore lack detail; lower-layer features typically contain rich detail at sufficiently high resolution, but their level of abstraction is low and their semantic information is relatively weak. Therefore, top-down feature fusion is performed on the multiple initial feature layers.
In one possible implementation, performing layer-by-layer feature fusion on a plurality of initial feature layers from top to bottom to obtain a plurality of composite feature layers includes: directly determining the highest-level initial feature layer as a corresponding composite feature layer aiming at the highest-level initial feature layer; and performing feature fusion on the non-highest level initial feature layer and the composite feature layer corresponding to the previous level initial feature layer to obtain the composite feature layer corresponding to the non-highest level initial feature layer.
In one possible implementation, the non-highest-level initial feature layer is φ_i, the initial feature layer one level above φ_i is φ_{i+1}, and the composite feature layer corresponding to φ_{i+1} is X_{i+1}. For the non-highest-level initial feature layer, performing feature fusion between it and the composite feature layer corresponding to the initial feature layer one level above it to obtain its corresponding composite feature layer includes: up-sampling the composite feature layer X_{i+1} to the same resolution as the initial feature layer φ_i, obtaining the up-sampled composite feature layer X_{i+1}; and concatenating the up-sampled composite feature layer X_{i+1} with the initial feature layer φ_i along the channel dimension to obtain the composite feature layer X_i corresponding to φ_i, where the number of channels of X_i is the sum of the numbers of channels of the composite feature layer X_{i+1} and the initial feature layer φ_i.
Taking fig. 2 as an example again: since the initial feature layer φ_3 is the highest-level initial feature layer, it is directly determined as the corresponding composite feature layer X_3 without being fused with other feature layers. For the initial feature layer φ_2, the composite feature layer X_3 (the initial feature layer φ_3) is first up-sampled to the same resolution as φ_2 (16 × 8), and then φ_2 and the up-sampled X_3 are concatenated along the channel dimension to obtain the composite feature layer X_2 corresponding to φ_2. For the initial feature layer φ_1, the composite feature layer X_2 is first up-sampled to the same resolution as φ_1 (32 × 16), and then φ_1 and the up-sampled X_2 are concatenated along the channel dimension to obtain the composite feature layer X_1 corresponding to φ_1. Assuming the initial feature layers φ_1, φ_2 and φ_3 each have 256 channels, the composite feature layer X_1 has 768 channels, X_2 has 512 channels, and X_3 has 256 channels.
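This upsample-and-concatenate step can be sketched in a few lines of PyTorch (an illustrative sketch only: the bilinear up-sampling mode and all function and variable names are assumptions; the structure and the channel counts follow the text):

```python
import torch
import torch.nn.functional as F

def fuse_top_down(initial_layers):
    """Top-down feature fusion: the highest-level initial layer becomes its
    own composite layer; each lower layer is concatenated, along the channel
    dimension, with the up-sampled composite layer from the level above."""
    composites = [initial_layers[-1]]               # X3 = phi3
    for phi in reversed(initial_layers[:-1]):       # phi2, then phi1
        up = F.interpolate(composites[0], size=phi.shape[-2:],
                           mode="bilinear", align_corners=False)
        composites.insert(0, torch.cat([phi, up], dim=1))
    return composites                               # [X1, X2, X3]

# Shapes matching the example in the text:
phi1 = torch.randn(1, 256, 32, 16)
phi2 = torch.randn(1, 256, 16, 8)
phi3 = torch.randn(1, 256, 8, 4)
X1, X2, X3 = fuse_top_down([phi1, phi2, phi3])
assert (X1.shape[1], X2.shape[1], X3.shape[1]) == (768, 512, 256)
```

Concatenation, rather than element-wise addition, is what makes the channel counts accumulate (768/512/256 in the example above).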
In one possible implementation, determining, according to a spatial transformation network (STN) and the plurality of composite feature layers, an attribute positioning and identifying module that positions and identifies each pedestrian attribute on each composite feature layer includes: for any pedestrian attribute, determining the positioning and identification result of the pedestrian attribute on each composite feature layer according to the attribute positioning and identifying module that positions and identifies the pedestrian attribute on that layer; for any pedestrian attribute, determining the global identification result of the pedestrian attribute according to the highest-level initial feature layer; and training the attribute positioning and identifying modules that position and identify each pedestrian attribute on each composite feature layer according to the positioning and identification result of each pedestrian attribute on each composite feature layer, the global identification result of each pedestrian attribute, and the real attribute label of each pedestrian attribute in the sample image, to obtain the attribute positioning and identifying module that positions and identifies each pedestrian attribute on each composite feature layer.
In a possible implementation manner, for any pedestrian attribute, determining the positioning and identification result of the pedestrian attribute on each composite feature layer according to the attribute positioning and identifying module that positions and identifies the pedestrian attribute on that layer includes: for any composite feature layer, the attribute positioning and identifying module that positions and identifies the pedestrian attribute on the composite feature layer determines the positioning and identification result of the pedestrian attribute on the composite feature layer through the following steps: passing the composite feature layer through a first fully connected layer to obtain transformation parameters s_x, s_y, t_x and t_y, where s_x is the scaling transformation parameter in the horizontal direction, s_y is the scaling transformation parameter in the vertical direction, t_x is the translation transformation parameter in the horizontal direction, and t_y is the translation transformation parameter in the vertical direction; determining, according to the transformation parameters s_x, s_y, t_x and t_y, the local feature corresponding to the pedestrian attribute in the composite feature layer; and passing the local feature corresponding to the pedestrian attribute through a second fully connected layer to obtain the positioning and identification result of the pedestrian attribute on the composite feature layer.
In one possible implementation, the method further includes: before passing the composite feature layer through the first fully connected layer, performing the following feature calibration process on the composite feature layer: sequentially passing the composite feature layer through a global average pooling layer, a 1 × 1 convolution layer, a ReLU activation layer, a 1 × 1 convolution layer and a Sigmoid activation layer to obtain a first calibration vector; multiplying the composite feature layer and the first calibration vector channel by channel to obtain a second calibration vector; and adding the composite feature layer and the second calibration vector element by element to obtain a calibrated composite feature layer.
In one possible implementation, determining the local feature corresponding to the pedestrian attribute in the composite feature layer according to the transformation parameters s_x, s_y, t_x and t_y includes: determining a rectangular bounding box in the composite feature layer according to the transformation parameters s_x, s_y, t_x and t_y; and extracting the features inside the rectangular bounding box from the composite feature layer and determining them as the local feature corresponding to the pedestrian attribute.
Still taking the above fig. 2 as an example, as shown in fig. 2, each composite feature layer corresponds to M attribute location identification modules (ALMs), where M is the number of multiple pedestrian attributes. The pedestrian attribute identification system shown in fig. 2 includes a total of 3M attribute locator modules, each of which acts on only one composite feature layer and one pedestrian attribute.
The specific positioning identification process of any attribute positioning module in any composite feature layer is described in detail below.
FIG. 3 shows a schematic diagram of an attribute positioning module of an embodiment of the present disclosure. As shown in fig. 3, the input composite feature layer X_i first has its resolution reduced to 1 × 1 by a global average pooling layer, is then processed sequentially by a 1 × 1 convolution layer, a ReLU activation layer and another 1 × 1 convolution layer, and then passes through a Sigmoid activation layer (the commonly used Sigmoid activation function) to obtain a first calibration vector. The input composite feature layer X_i is then multiplied channel by channel with the first calibration vector to obtain a second calibration vector. Finally, the input composite feature layer X_i is added element by element to the second calibration vector to obtain the calibrated composite feature layer.
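This calibration has the shape of a squeeze-and-excitation block with a residual connection. A minimal PyTorch sketch follows (the class name and the channel-reduction ratio of the two 1 × 1 convolutions are assumptions, since the text does not specify the intermediate channel count):

```python
import torch.nn as nn

class FeatureCalibration(nn.Module):
    """GAP -> 1x1 conv -> ReLU -> 1x1 conv -> Sigmoid, then a channel-wise
    multiplication and an element-wise residual addition, as in fig. 3."""
    def __init__(self, channels, reduction=16):  # reduction ratio is assumed
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                       # resolution -> 1x1
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        v1 = self.gate(x)   # first calibration vector, shape (N, C, 1, 1)
        v2 = x * v1         # channel-by-channel multiplication
        return x + v2       # element-by-element addition
```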
The calibrated composite feature layer X_i first passes through a first fully connected layer (FC) to obtain four transformation parameters s_x, s_y, t_x and t_y, where s_x is the scaling transformation parameter in the horizontal direction, s_y is the scaling transformation parameter in the vertical direction, t_x is the translation transformation parameter in the horizontal direction, and t_y is the translation transformation parameter in the vertical direction. According to the four transformation parameters s_x, s_y, t_x and t_y, a rectangular bounding box R is determined on the calibrated composite feature layer X_i. Finally, the features inside the rectangular bounding box R are extracted from the calibrated composite feature layer X_i and determined as the local feature corresponding to the pedestrian attribute, and that local feature passes through a second fully connected layer (FC) to obtain the positioning and identification result of the attribute positioning and identifying module.
The rectangular bounding box R can be used to explicitly locate the semantic region of the pedestrian attribute in the image, improving the interpretability of the pedestrian attribute identification algorithm. The attribute positioning and identifying module can locate the local region corresponding to the pedestrian attribute in the image and then identify the pedestrian attribute based on the local feature; compared with global identification, this reduces the amount of computation and thereby improves the accuracy and efficiency of pedestrian attribute identification.
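A single attribute positioning and identifying module along these lines might look as follows, using the differentiable grid sampler commonly used to implement spatial transformation networks (a sketch under assumptions: the crop size, the class and layer names, and the use of F.affine_grid/F.grid_sample are not from the patent; only the restriction of the transform to s_x, s_y, t_x and t_y follows the text):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttributeLocalizationModule(nn.Module):
    """One module per (composite feature layer, pedestrian attribute) pair:
    predict (sx, sy, tx, ty), crop the corresponding region R with an
    STN-style sampler, and classify the attribute on the local feature."""
    def __init__(self, channels, crop_size=(8, 4)):   # crop size is assumed
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc_theta = nn.Linear(channels, 4)        # first FC: sx, sy, tx, ty
        self.fc_cls = nn.Linear(channels * crop_size[0] * crop_size[1], 1)
        self.crop_size = crop_size

    def forward(self, x):
        n, c = x.shape[:2]
        sx, sy, tx, ty = self.fc_theta(self.pool(x).flatten(1)).unbind(dim=1)
        zeros = torch.zeros_like(sx)
        # Affine matrix restricted to scaling and translation (no rotation/shear).
        theta = torch.stack([torch.stack([sx, zeros, tx], dim=1),
                             torch.stack([zeros, sy, ty], dim=1)], dim=1)
        grid = F.affine_grid(theta, (n, c, *self.crop_size), align_corners=False)
        local = F.grid_sample(x, grid, align_corners=False)  # features in box R
        return self.fc_cls(local.flatten(1))          # second FC: attribute logit
```

Since each module serves one layer/attribute pair, the 3M modules of fig. 2 could be held in a nested nn.ModuleList over the three composite layers' channel counts (768, 512, 256) and the M attributes.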
Fig. 4 is a schematic diagram illustrating a positioning result of a pedestrian attribute region according to an embodiment of the disclosure. As shown in fig. 4, the region positioning result of the pedestrian attribute (plastic bag) can be displayed on each composite feature layer.
Taking fig. 2 as an example again: based on the composite feature layer X_1, the positioning and identification results of M pedestrian attributes are obtained; based on the composite feature layer X_2, the positioning and identification results of M pedestrian attributes are obtained; and based on the composite feature layer X_3, the positioning and identification results of M pedestrian attributes are obtained. Based on the initial feature layers φ_1, φ_2 and φ_3, the global identification results of the M pedestrian attributes are obtained.
In the training process, in order to train the attribute positioning modules on different feature layers more fully, the recognition results of all the attribute positioning modules are used to participate in the training together.
In a possible implementation manner, according to the positioning recognition result of each pedestrian attribute on each composite feature layer, the global recognition result of each pedestrian attribute, and the real attribute label of each pedestrian attribute in the sample image, the attribute positioning recognition module for positioning and recognizing each pedestrian attribute on each composite feature layer is trained to obtain the attribute positioning recognition module for positioning and recognizing each pedestrian attribute on each composite feature layer, which includes: and training an attribute positioning and identifying module for positioning and identifying the attribute of each pedestrian on each composite characteristic layer through the following cross entropy loss function:
$$\mathcal{L}_i = -\frac{1}{M}\sum_{m=1}^{M}\gamma_m\left[y_m\log\hat{y}_m^{i} + (1-y_m)\log\left(1-\hat{y}_m^{i}\right)\right]$$

where $\mathcal{L}_i$ is the training loss of the i-th feature layer, M is the number of the plurality of pedestrian attributes, $y_m$ is the real attribute label of the m-th pedestrian attribute, $\hat{y}_m^{i}$ is the identification result of the m-th pedestrian attribute on the i-th feature layer, $\gamma_m$ is the weight of the m-th pedestrian attribute, σ is a preset parameter used in computing the weights, and the i-th feature layer is a composite feature layer or an initial feature layer.
Taking fig. 2 as an example again: based on the positioning and identification results of the M pedestrian attributes obtained from the composite feature layer X_1 and the real attribute labels of the M pedestrian attributes, the training loss L_1 of the composite feature layer X_1 is obtained from the cross-entropy function; based on the positioning and identification results of the M pedestrian attributes obtained from the composite feature layer X_2 and the real attribute labels, the training loss L_2 of X_2 is obtained from the cross-entropy function; based on the positioning and identification results of the M pedestrian attributes obtained from the composite feature layer X_3 and the real attribute labels, the training loss L_3 of X_3 is obtained from the cross-entropy function; and based on the global identification results of the M pedestrian attributes obtained from the initial feature layers φ_1, φ_2 and φ_3 and the real attribute labels, the training loss L_4 is obtained from the cross-entropy function. The total loss L of the attribute positioning and identification system is then the sum of the training losses of the feature layers, i.e. L = L_1 + L_2 + L_3 + L_4.
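The per-layer loss and the summed total loss can be sketched as follows (illustrative: it assumes the identification results are gathered as logits and that the weights γ_m are precomputed per attribute; the exact dependence of γ_m on σ is not spelled out in the text and is not reproduced here):

```python
import torch
import torch.nn.functional as F

def layer_loss(logits, targets, gamma):
    """Weighted binary cross-entropy over M attributes for one feature layer.
    logits, targets: (N, M); gamma: (M,) per-attribute weights."""
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    return (gamma * bce).mean()

def total_loss(per_layer_logits, targets, gamma):
    """L = L1 + L2 + L3 + L4: the three composite layers' losses plus the
    global branch's loss."""
    return sum(layer_loss(lg, targets, gamma) for lg in per_layer_logits)
```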
In one possible implementation, the method further includes: aiming at any pedestrian attribute, determining a positioning identification result of each composite characteristic layer on the pedestrian attribute in the test image according to an attribute positioning identification module which is used for positioning and identifying the pedestrian attribute on each composite characteristic layer; determining a global identification result of the pedestrian attribute in the test image according to the highest level initial feature layer; and determining the attribute recognition result of the pedestrian attribute in the test image according to the positioning recognition result of the pedestrian attribute in the test image on each composite characteristic layer and the global recognition result of the pedestrian attribute in the test image.
For a test image, feature extraction is performed according to the convolutional neural network at the same positions as for the sample image, obtaining a plurality of initial feature layers. Layer-by-layer feature fusion is then performed on the plurality of initial feature layers from top to bottom to obtain a plurality of composite feature layers.
For any composite feature layer, attribute positioning and identification can be carried out based on the attribute positioning and identifying module corresponding to each pedestrian attribute on that layer, obtaining the positioning and identification result of each pedestrian attribute in the test image on that composite feature layer. Based on the highest-level initial feature layer, the global identification result of each pedestrian attribute in the test image can be determined; the attribute identification result of each pedestrian attribute in the test image is then determined by taking the element-wise maximum of its positioning and identification results and its global identification result.
For example, for a pedestrian attribute A: the positioning and identification result obtained on the composite feature layer X_1 is 60% (i.e., the probability that attribute A exists in the original image is 60%), the result obtained on the composite feature layer X_2 is 65%, the result obtained on the composite feature layer X_3 is 55%, and the global identification result obtained on the initial feature layer φ_3 is 48%. The final attribute identification result for attribute A is the maximum of these values, 65%. If an attribute identification result greater than 50% means the original image contains the attribute, then according to the final attribute identification result the original image contains pedestrian attribute A.
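At test time the element-wise maximum fusion amounts to the following (a minimal sketch; stacking the per-layer probabilities into one tensor is an assumption of convenience):

```python
import torch

def fuse_predictions(layer_probs):
    """layer_probs: (K, M) probabilities for M attributes from K results
    (the composite layers plus the global branch)."""
    fused, _ = layer_probs.max(dim=0)   # element-wise maximum over the K results
    return fused, fused > 0.5           # present if the score exceeds 50%

# Attribute A from the example above:
probs = torch.tensor([[0.60], [0.65], [0.55], [0.48]])
score, present = fuse_predictions(probs)  # score = 0.65, present = True
```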
A comparison of identification metrics between the pedestrian attribute identification method provided by the present disclosure and four existing pedestrian attribute identification methods (DeepMar, GRL, VeSPA, PGDM) is shown in Table 1:
TABLE 1
(Table 1 is reproduced as an image in the original publication; it compares the five methods on average accuracy, F1 value, model size and recognition speed.)
The average accuracy and the F1 value are common metrics for evaluating pedestrian attribute identification algorithms; the higher they are, the higher the precision. The model size and recognition speed reflect the efficiency of the algorithm; the lower they are, the better. It can be seen that, compared with existing pedestrian attribute identification algorithms, the method of the present disclosure improves both accuracy and efficiency.
Feature extraction is performed on the sample image according to the convolutional neural network to obtain a plurality of initial feature layers, where the sample image is an image in a preset pedestrian attribute sample set and the images in the set have a plurality of pedestrian attributes; layer-by-layer feature fusion is performed on the plurality of initial feature layers from top to bottom to obtain a plurality of composite feature layers; and an attribute positioning and identifying module that positions and identifies each pedestrian attribute on each composite feature layer is determined according to the spatial transformation network and the plurality of composite feature layers. The method and the device can locate the local region corresponding to each pedestrian attribute in the image and then identify the pedestrian attribute based on the local feature, thereby improving the accuracy and efficiency of pedestrian attribute identification.
Fig. 5 is a schematic structural diagram of a pedestrian attribute identification device according to an embodiment of the present disclosure. As shown in fig. 5, the apparatus 50 includes:
the feature extraction module 51 is configured to perform feature extraction on the sample image according to the convolutional neural network to obtain a plurality of initial feature layers, where the sample image is an image in a preset pedestrian attribute sample set, and the image included in the preset pedestrian attribute sample set has a plurality of pedestrian attributes;
the feature fusion module 52 is configured to perform feature fusion layer by layer from top to bottom on the plurality of initial feature layers to obtain a plurality of composite feature layers;
and the attribute positioning and identifying module 53 is configured to determine, according to the spatial transform network and the multiple composite feature layers, an attribute positioning and identifying module that performs positioning and identifying on each pedestrian attribute on each composite feature layer.
In one possible implementation, the feature fusion module 52 includes:
the first determining submodule is used for directly determining the highest-level initial characteristic layer as a corresponding composite characteristic layer aiming at the highest-level initial characteristic layer;
and the feature fusion submodule is used for performing feature fusion on the non-highest-level initial feature layer and the composite feature layer corresponding to the previous-level initial feature layer aiming at the non-highest-level initial feature layer to obtain the composite feature layer corresponding to the non-highest-level initial feature layer.
In one possible implementation, the non-highest-level initial feature layer is φ_i, the initial feature layer one level above φ_i is φ_{i+1}, and the composite feature layer corresponding to φ_{i+1} is X_{i+1}.

The feature fusion submodule is specifically configured to:

up-sample the composite feature layer X_{i+1} to the same resolution as the initial feature layer φ_i, obtaining the up-sampled composite feature layer X_{i+1}; and

concatenate the up-sampled composite feature layer X_{i+1} with the initial feature layer φ_i along the channel dimension to obtain the composite feature layer X_i corresponding to φ_i, where the number of channels of X_i is the sum of the numbers of channels of the composite feature layer X_{i+1} and the initial feature layer φ_i.
In one possible implementation, the attribute positioning and identifying module 53 includes:
the second determining submodule is used for determining a positioning and identifying result of the pedestrian attribute on each composite characteristic layer according to the attribute positioning and identifying module which positions and identifies the pedestrian attribute on each composite characteristic layer aiming at any pedestrian attribute;
the third determining submodule is used for determining a global recognition result of the pedestrian attribute according to the highest level initial feature layer aiming at any pedestrian attribute;
and the fourth determining submodule is used for training the attribute positioning and identifying module for positioning and identifying the attributes of each pedestrian on each composite characteristic layer according to the positioning and identifying result of each pedestrian attribute on each composite characteristic layer, the global identifying result of each pedestrian attribute and the real attribute label of each pedestrian attribute in the sample image, so as to obtain the attribute positioning and identifying module for positioning and identifying the attributes of each pedestrian on each composite characteristic layer.
In a possible implementation manner, the second determining submodule is specifically configured to:

for any composite feature layer, determine, by the attribute positioning and identifying module that positions and identifies the pedestrian attribute on the composite feature layer, the positioning and identification result of the pedestrian attribute on the composite feature layer through the following steps:

passing the composite feature layer through a first fully connected layer to obtain transformation parameters s_x, s_y, t_x and t_y, where s_x is the scaling transformation parameter in the horizontal direction, s_y is the scaling transformation parameter in the vertical direction, t_x is the translation transformation parameter in the horizontal direction, and t_y is the translation transformation parameter in the vertical direction;

determining, according to the transformation parameters s_x, s_y, t_x and t_y, the local feature corresponding to the pedestrian attribute in the composite feature layer;

and passing the local feature corresponding to the pedestrian attribute through a second fully connected layer to obtain the positioning and identification result of the pedestrian attribute on the composite feature layer.
In one possible implementation, the apparatus 50 further includes: a feature calibration submodule;

the feature calibration submodule is specifically configured to: before the composite feature layer passes through the first fully connected layer, perform the following feature calibration process on the composite feature layer:

sequentially passing the composite feature layer through a global average pooling layer, a 1 × 1 convolution layer, a ReLU activation layer, a 1 × 1 convolution layer and a Sigmoid activation layer to obtain a first calibration vector;

multiplying the composite feature layer and the first calibration vector channel by channel to obtain a second calibration vector;

and adding the composite feature layer and the second calibration vector element by element to obtain a calibrated composite feature layer.
In a possible implementation manner, the second determining submodule is specifically configured to:

determine a rectangular bounding box in the composite feature layer according to the transformation parameters s_x, s_y, t_x and t_y;

and extract the features inside the rectangular bounding box from the composite feature layer and determine them as the local feature corresponding to the pedestrian attribute.
In a possible implementation manner, the fourth determining submodule is specifically configured to:
and training an attribute positioning and identifying module for positioning and identifying the attribute of each pedestrian on each composite characteristic layer through the following cross entropy loss function:
$$\mathcal{L}_i = -\frac{1}{M}\sum_{m=1}^{M}\gamma_m\left[y_m\log\hat{y}_m^{i} + (1-y_m)\log\left(1-\hat{y}_m^{i}\right)\right]$$

where $\mathcal{L}_i$ is the training loss of the i-th feature layer, M is the number of the plurality of pedestrian attributes, $y_m$ is the real attribute label of the m-th pedestrian attribute, $\hat{y}_m^{i}$ is the identification result of the m-th pedestrian attribute on the i-th feature layer, $\gamma_m$ is the weight of the m-th pedestrian attribute, σ is a preset parameter used in computing the weights, and the i-th feature layer is a composite feature layer or an initial feature layer.
In one possible implementation, the apparatus 50 further includes: the test module is specifically configured to:
aiming at any pedestrian attribute, determining a positioning identification result of each composite characteristic layer on the pedestrian attribute in the test image according to an attribute positioning identification module which is used for positioning and identifying the pedestrian attribute on each composite characteristic layer;
determining a global identification result of the pedestrian attribute in the test image according to the highest level initial feature layer;
and determining the attribute recognition result of the pedestrian attribute in the test image according to the positioning recognition result of the pedestrian attribute in the test image on each composite characteristic layer and the global recognition result of the pedestrian attribute in the test image.
The apparatus 50 provided in the present disclosure can implement each step in the method embodiments shown in fig. 1 to fig. 3, and implement the same technical effect, and for avoiding repetition, details are not described here again.
Having described embodiments of the present disclosure, the foregoing description is intended to be exemplary, not exhaustive, and not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terms used herein were chosen in order to best explain the principles of the embodiments, the practical application, or technical improvements to the techniques in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (10)

1. A pedestrian attribute identification method is characterized by comprising the following steps:
performing feature extraction on a sample image according to a convolutional neural network to obtain a plurality of initial feature layers, wherein the sample image is an image in a preset pedestrian attribute sample set, and the image in the preset pedestrian attribute sample set has a plurality of pedestrian attributes;
performing layer-by-layer feature fusion on the plurality of initial feature layers from top to bottom to obtain a plurality of composite feature layers;
and determining an attribute positioning and identifying module for positioning and identifying each pedestrian attribute on each composite feature layer according to a spatial transformation network and the plurality of composite feature layers.
2. The method of claim 1, wherein performing top-down layer-by-layer feature fusion on the plurality of initial feature layers to obtain a plurality of composite feature layers comprises:
directly determining the highest-level initial feature layer as a corresponding composite feature layer aiming at the highest-level initial feature layer;
and performing feature fusion on the non-highest level initial feature layer and a composite feature layer corresponding to a previous level initial feature layer aiming at the non-highest level initial feature layer to obtain the composite feature layer corresponding to the non-highest level initial feature layer.
3. The method of claim 2, wherein the non-highest-level initial feature layer is φ_i, the initial feature layer one level above φ_i is φ_{i+1}, and the composite feature layer corresponding to φ_{i+1} is X_{i+1};

performing feature fusion, for a non-highest-level initial feature layer, between that layer and the composite feature layer corresponding to the initial feature layer one level above it to obtain the composite feature layer corresponding to that layer comprises:

up-sampling the composite feature layer X_{i+1} to the same resolution as the initial feature layer φ_i, obtaining the up-sampled composite feature layer X_{i+1}; and

concatenating the up-sampled composite feature layer X_{i+1} with the initial feature layer φ_i along the channel dimension to obtain the composite feature layer X_i corresponding to φ_i, wherein the number of channels of the composite feature layer X_i is the sum of the numbers of channels of the composite feature layer X_{i+1} and the initial feature layer φ_i.
4. The method of claim 1, wherein determining an attribute location identification module for location identification of each pedestrian attribute on each composite feature layer according to the spatial transformation network and the plurality of composite feature layers comprises:
for any pedestrian attribute, determining a positioning identification result of the pedestrian attribute on each composite characteristic layer according to an attribute positioning identification module which carries out positioning identification on the pedestrian attribute on each composite characteristic layer;
aiming at any pedestrian attribute, determining a global recognition result of the pedestrian attribute according to the highest level initial feature layer;
according to the positioning recognition result of each pedestrian attribute on each composite feature layer, the global recognition result of each pedestrian attribute and the real attribute mark of each pedestrian attribute in the sample image, training an attribute positioning recognition module for positioning and recognizing each pedestrian attribute on each composite feature layer to obtain an attribute positioning recognition module for positioning and recognizing each pedestrian attribute on each composite feature layer.
5. The method according to claim 4, wherein, for any pedestrian attribute, determining a positioning and identification result of the pedestrian attribute on each composite feature layer according to the attribute positioning and identification module that positions and identifies the pedestrian attribute on that composite feature layer comprises:
for any composite feature layer, the attribute positioning and identification module that positions and identifies the pedestrian attribute on the composite feature layer determines the positioning and identification result of the pedestrian attribute on the composite feature layer through the following steps:
passing the composite feature layer through a first fully connected layer to obtain transformation parameters s_x, s_y, t_x and t_y, wherein the transformation parameter s_x is a scaling parameter in the horizontal direction, the transformation parameter s_y is a scaling parameter in the vertical direction, the transformation parameter t_x is a translation parameter in the horizontal direction, and the transformation parameter t_y is a translation parameter in the vertical direction;
determining, according to the transformation parameters s_x, s_y, t_x and t_y, local features corresponding to the pedestrian attribute in the composite feature layer;
and passing the local features corresponding to the pedestrian attribute through a second fully connected layer to obtain the positioning and identification result of the pedestrian attribute on the composite feature layer.
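A sketch of the localization steps of claim 5, reading (s_x, s_y, t_x, t_y) as the parameters of an axis-aligned affine sampler in the style of a spatial transformer network; the fixed input resolution, the crop resolution, and the lack of any squashing on the predicted parameters are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttributeLocalizationHead(nn.Module):
    """Per-attribute positioning and identification on one composite feature
    layer: a first FC layer predicts (s_x, s_y, t_x, t_y), the parameters
    drive a spatial-transformer-style crop, and a second FC layer scores the
    attribute on the cropped local features."""

    def __init__(self, in_channels: int, feat_hw: int = 7, crop_hw: int = 3):
        super().__init__()
        self.crop_hw = crop_hw
        # First fully connected layer: composite layer -> 4 transformation
        # parameters. A fixed feat_hw x feat_hw input is assumed so the
        # flattened dimension is known.
        self.fc_theta = nn.Linear(in_channels * feat_hw * feat_hw, 4)
        # Second fully connected layer: local features -> one attribute logit.
        self.fc_score = nn.Linear(in_channels * crop_hw * crop_hw, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, c, h, w = x.shape
        s_x, s_y, t_x, t_y = self.fc_theta(x.flatten(1)).unbind(dim=1)
        # Axis-aligned affine matrix: scaling on the diagonal, translation in
        # the last column, no rotation or shear. Real implementations often
        # squash these values (e.g. with tanh); omitted here for brevity.
        zeros = torch.zeros_like(s_x)
        theta = torch.stack([
            torch.stack([s_x, zeros, t_x], dim=1),
            torch.stack([zeros, s_y, t_y], dim=1),
        ], dim=1)                                             # (N, 2, 3)
        grid = F.affine_grid(theta, (n, c, self.crop_hw, self.crop_hw),
                             align_corners=False)
        local = F.grid_sample(x, grid, align_corners=False)   # local features
        return self.fc_score(local.flatten(1))                # (N, 1) logit
```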
6. The method of claim 5, further comprising:
performing the following feature calibration process on the composite feature layer before passing the composite feature layer through the first fully connected layer:
passing the composite feature layer sequentially through a global average pooling layer, a 1×1 convolution layer, a ReLU activation layer, a 1×1 convolution layer and a Sigmoid activation layer to obtain a first calibration vector;
multiplying the composite feature layer by the first calibration vector channel by channel to obtain a second calibration vector;
and adding the composite feature layer and the second calibration vector element by element to obtain a calibrated composite feature layer.
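The calibration of claim 6 follows a squeeze-and-excitation pattern with a residual addition. A sketch under the assumption of a channel-reduction ratio r between the two 1×1 convolutions (the claim does not state the intermediate width):

```python
import torch
import torch.nn as nn

class FeatureCalibration(nn.Module):
    """Claim 6 calibration: GAP -> 1x1 conv -> ReLU -> 1x1 conv -> Sigmoid
    gives a per-channel calibration vector; the composite layer is rescaled
    channel by channel and then added back element by element."""

    def __init__(self, channels: int, r: int = 16):
        super().__init__()
        self.squeeze = nn.AdaptiveAvgPool2d(1)
        self.excite = nn.Sequential(
            nn.Conv2d(channels, channels // r, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // r, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        first = self.excite(self.squeeze(x))   # first calibration vector, (N, C, 1, 1)
        second = x * first                     # channel-by-channel product
        return x + second                      # element-by-element addition
```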
7. The method according to claim 5, wherein determining, according to the transformation parameters s_x, s_y, t_x and t_y, local features corresponding to the pedestrian attribute in the composite feature layer comprises:
determining a rectangular bounding box in the composite feature layer according to the transformation parameters s_x, s_y, t_x and t_y;
and extracting the features within the rectangular bounding box from the composite feature layer and determining the extracted features as the local features corresponding to the pedestrian attribute.
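Claim 7's rectangular bounding box can be read off the same four parameters. The sketch below assumes the normalized-coordinate convention of the sampler above, with center (t_x, t_y) and half-sizes (s_x, s_y) in [-1, 1]; the patent does not state this convention explicitly:

```python
def bounding_box(s_x: float, s_y: float, t_x: float, t_y: float,
                 height: int, width: int):
    """Map (s_x, s_y, t_x, t_y) to a rectangular bounding box in pixel
    coordinates on a height x width feature map (claim 7)."""
    x0 = (t_x - s_x + 1.0) / 2.0 * (width - 1)   # left edge
    x1 = (t_x + s_x + 1.0) / 2.0 * (width - 1)   # right edge
    y0 = (t_y - s_y + 1.0) / 2.0 * (height - 1)  # top edge
    y1 = (t_y + s_y + 1.0) / 2.0 * (height - 1)  # bottom edge
    return x0, y0, x1, y1
```

Rounding these edges to integer indices (or applying RoI-style pooling) would then cut the local features out of the composite feature layer.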
8. The method according to claim 4, wherein training the attribute positioning and identification module for positioning and identifying each pedestrian attribute on each composite feature layer according to the positioning and identification result of each pedestrian attribute on each composite feature layer, the global identification result of each pedestrian attribute, and the real attribute label of each pedestrian attribute in the sample image, to obtain a trained attribute positioning and identification module for positioning and identifying each pedestrian attribute on each composite feature layer, comprises:
training the attribute positioning and identification module for positioning and identifying each pedestrian attribute on each composite feature layer with the following cross-entropy loss function:
$$\mathcal{L}_i = -\frac{1}{M}\sum_{m=1}^{M}\gamma_m\left(y_m\log\hat{y}_{i,m}+\left(1-y_m\right)\log\left(1-\hat{y}_{i,m}\right)\right)$$

wherein $\mathcal{L}_i$ is the training loss of the i-th feature layer, M is the number of the plurality of pedestrian attributes, $y_m$ is the real attribute label of the m-th pedestrian attribute, $\hat{y}_{i,m}$ is the identification result of the m-th pedestrian attribute on the i-th feature layer, $\gamma_m$ is the weight of the m-th pedestrian attribute, σ is a preset parameter, and the i-th feature layer is a composite feature layer or an initial feature layer.
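A direct transcription of the per-layer loss above into PyTorch, assuming the identification results are probabilities in (0, 1); the claim does not spell out how the weight γ_m is derived from the preset parameter σ, so the weights are taken as an input here:

```python
import torch

def layer_loss(y_hat: torch.Tensor, y: torch.Tensor,
               gamma: torch.Tensor) -> torch.Tensor:
    """Weighted cross-entropy for one feature layer (claim 8).

    y_hat: (N, M) predicted probabilities on the i-th feature layer
    y:     (N, M) real attribute labels in {0, 1}
    gamma: (M,)   per-attribute weights
    """
    eps = 1e-7  # numerical floor to keep the logarithms finite
    bce = (y * torch.log(y_hat.clamp(min=eps))
           + (1 - y) * torch.log((1 - y_hat).clamp(min=eps)))
    # Average over attributes (the 1/M factor) and over the batch.
    return -(gamma * bce).mean()
```

The total training loss would then sum layer_loss over every composite feature layer and over the highest-level initial feature layer that feeds the global branch.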
9. The method of claim 1, further comprising:
for any pedestrian attribute, determining a positioning and identification result of the pedestrian attribute in a test image on each composite feature layer according to the attribute positioning and identification module that positions and identifies the pedestrian attribute on that composite feature layer;
determining a global identification result of the pedestrian attribute in the test image according to the highest-level initial feature layer;
and determining an attribute identification result of the pedestrian attribute in the test image according to the positioning and identification result of the pedestrian attribute in the test image on each composite feature layer and the global identification result of the pedestrian attribute in the test image.
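Claim 9 combines the per-layer localized results with the global result but does not fix the combination rule; averaging the sigmoid probabilities of all branches is one plausible choice, shown here purely as an assumption:

```python
from typing import List

import torch

def fuse_predictions(per_layer_logits: List[torch.Tensor],
                     global_logits: torch.Tensor) -> torch.Tensor:
    """Fuse localized per-layer logits with the global logits into one
    (N, M) attribute score matrix by averaging probabilities."""
    probs = [torch.sigmoid(t) for t in per_layer_logits + [global_logits]]
    return torch.stack(probs, dim=0).mean(dim=0)
```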
10. A pedestrian attribute identification device, characterized by comprising:
a feature extraction module, configured to perform feature extraction on a sample image according to a convolutional neural network to obtain a plurality of initial feature layers, wherein the sample image is an image in a preset pedestrian attribute sample set, and the images in the preset pedestrian attribute sample set have a plurality of pedestrian attributes;
a feature fusion module, configured to perform layer-by-layer feature fusion on the plurality of initial feature layers from top to bottom to obtain a plurality of composite feature layers;
and an attribute positioning and identification module, configured to determine, according to the spatial transformation network and the plurality of composite feature layers, an attribute positioning and identification module for positioning and identifying each pedestrian attribute on each composite feature layer.
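Putting the device's three modules together, a high-level sketch; the tiny stand-in backbone, the treatment of the highest-level initial layer as its own composite layer, and all layer widths are assumptions made for illustration only:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureExtraction(nn.Module):
    """Feature extraction module: three convolutional stages whose outputs
    serve as the initial feature layers (a real system would use a deeper
    backbone such as a ResNet)."""

    def __init__(self):
        super().__init__()
        self.stages = nn.ModuleList([
            nn.Sequential(nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU()),
            nn.Sequential(nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU()),
            nn.Sequential(nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU()),
        ])

    def forward(self, images: torch.Tensor):
        layers, x = [], images
        for stage in self.stages:
            x = stage(x)
            layers.append(x)
        return layers  # initial feature layers, lowest to highest

def feature_fusion(initial_layers):
    """Feature fusion module: top-down, layer-by-layer fusion."""
    composites = [initial_layers[-1]]  # highest layer taken as its own composite
    for phi in reversed(initial_layers[:-1]):
        up = F.interpolate(composites[-1], size=phi.shape[-2:],
                           mode="bilinear", align_corners=False)
        composites.append(torch.cat([phi, up], dim=1))
    return composites[::-1]  # composite feature layers, lowest to highest

# The attribute positioning and identification module (the per-attribute
# heads sketched earlier) would then run on each composite feature layer.
images = torch.randn(2, 3, 224, 224)
composites = feature_fusion(FeatureExtraction()(images))
print([tuple(c.shape) for c in composites])
```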
CN201910943815.6A 2019-09-30 2019-09-30 Pedestrian attribute identification method and device Active CN110705474B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910943815.6A CN110705474B (en) 2019-09-30 2019-09-30 Pedestrian attribute identification method and device

Publications (2)

Publication Number Publication Date
CN110705474A (en) 2020-01-17
CN110705474B CN110705474B (en) 2022-05-03

Family

ID=69197731

Patent Citations (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170293763A1 (en) * 2013-03-15 2017-10-12 Advanced Elemental Technologies, Inc. Methods and systems for secure and reliable identity-based computing
US20160042252A1 (en) * 2014-08-05 2016-02-11 Sri International Multi-Dimensional Realization of Visual Content of an Image Collection
US20190042862A1 (en) * 2017-08-01 2019-02-07 Denso Corporation Vehicle safety determination apparatus, method, and computer-readable storage medium
CN108052894A (en) * 2017-12-11 2018-05-18 Multi-attribute recognition method, device, medium and neural network for a target object
CN107977656A (en) * 2017-12-26 2018-05-01 Pedestrian re-identification method and system
CN109034044A (en) * 2018-06-14 2018-12-18 Pedestrian re-identification method based on fused convolutional neural networks
CN108921051A (en) * 2018-06-15 2018-11-30 Pedestrian attribute recognition network and method based on a recurrent neural network attention model
CN108921054A (en) * 2018-06-15 2018-11-30 Pedestrian multi-attribute recognition method based on semantic segmentation
CN109902548A (en) * 2018-07-20 2019-06-18 Object attribute recognition method, apparatus, computing device and system
CN109598186A (en) * 2018-10-12 2019-04-09 Pedestrian attribute recognition method based on multi-task deep learning
CN109214366A (en) * 2018-10-24 2019-01-15 Local target re-identification method, apparatus and system
CN110188596A (en) * 2019-01-04 2019-08-30 Deep learning-based real-time pedestrian detection, attribute recognition and tracking method and system for surveillance video
CN110046550A (en) * 2019-03-14 2019-07-23 Pedestrian attribute recognition system and method based on multi-layer feature learning
CN109948709A (en) * 2019-03-21 2019-06-28 Multi-task attribute recognition system for a target object
CN110046553A (en) * 2019-03-21 2019-07-23 Pedestrian re-identification model, method and system fusing attribute features

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
DANGWEI LI ET AL.: "Pose Guided Deep Model for Pedestrian Attribute Recognition in Surveillance Scenarios", 2018 IEEE International Conference on Multimedia and Expo *
YIQIANG CHEN ET AL.: "Pedestrian attribute recognition with part-based CNN and combined feature representations", HAL *
ZHONG JI ET AL.: "Image-attribute reciprocally guided attention network for pedestrian attribute recognition", Pattern Recognition Letters *
GUO ZHIYING: "Pedestrian attribute recognition in outdoor surveillance scenes based on deep learning", China Masters' Theses Full-text Database, Information Science and Technology *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111401343A (en) * 2020-06-04 2020-07-10 Beijing Kingsoft Cloud Network Technology Co., Ltd. Method for identifying attributes of persons in an image, and training method and device for the identification model
CN111401343B (en) * 2020-06-04 2021-04-30 Beijing Kingsoft Cloud Network Technology Co., Ltd. Method for identifying attributes of persons in an image, and training method and device for the identification model
WO2023082196A1 (en) * 2021-11-12 2023-05-19 京东方科技集团股份有限公司 Pedestrian attribute recognition system and training method therefor, and pedestrian attribute recognition method

Similar Documents

Publication Publication Date Title
CN110348319B (en) Face anti-counterfeiting method based on face depth information and edge image fusion
CN109471945B (en) Deep learning-based medical text classification method and device and storage medium
CN110287960A (en) Detection and recognition method for curved text in natural scene images
CN109934293A (en) Image recognition method, device, medium and blur-aware convolutional neural network
CN109102024B (en) Hierarchical semantic embedding model for fine-grained object recognition and implementation method thereof
CN110782420A (en) Small target feature representation enhancement method based on deep learning
CN111563502A (en) Image text recognition method and device, electronic equipment and computer storage medium
CN102629279B (en) Method for searching and reordering images or videos
CN109637664A (en) BMI evaluation method, device and computer-readable storage medium
CN113034495B (en) Spine image segmentation method, medium and electronic device
CN114332544B (en) Image block scoring-based fine-grained image classification method and device
CN110705474B (en) Pedestrian attribute identification method and device
CN113239820B (en) Pedestrian attribute identification method and system based on attribute positioning and association
CN113936195B (en) Sensitive image recognition model training method and device and electronic equipment
CN111027481A (en) Behavior analysis method and device based on human body key point detection
CN113657409A (en) Vehicle loss detection method, device, electronic device and storage medium
CN111984772A (en) Medical image question-answering method and system based on deep learning
CN112115879B (en) Occlusion-sensitive self-supervised pedestrian re-identification method and system
CN111985532B (en) Scene-level context-aware emotion recognition deep network method
CN110674685A (en) Human parsing and segmentation model and method based on edge information enhancement
CN111222530A (en) Fine-grained image classification method, system, device and storage medium
CN107291774A (en) Error sample identification method and device
CN105844605B (en) Face portrait synthesis method based on adaptive representation
CN105740903B (en) Multi-attribute recognition method and device
CN113762257A (en) Identification method and device for marks in makeup brand images

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant