CN110516512A - Training method for a pedestrian attribute analysis model, pedestrian attribute recognition method, and devices


Info

Publication number
CN110516512A
CN110516512A (application CN201810488759.7A); granted as CN110516512B
Authority
CN
China
Prior art keywords: pedestrian, attribute, training, image, task
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810488759.7A
Other languages
Chinese (zh)
Other versions
CN110516512B (en)
Inventor
王睿
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Keaosen Data Technology Co Ltd
Original Assignee
Beijing Keaosen Data Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Keaosen Data Technology Co Ltd filed Critical Beijing Keaosen Data Technology Co Ltd
Priority to CN201810488759.7A priority Critical patent/CN110516512B/en
Publication of CN110516512A publication Critical patent/CN110516512A/en
Application granted granted Critical
Publication of CN110516512B publication Critical patent/CN110516512B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; body parts, e.g. hands
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Abstract

The embodiment of the present invention discloses a training method and device for a pedestrian attribute analysis model, comprising: inputting a pedestrian image, and a probability map corresponding to the pedestrian image, into a convolutional neural network to obtain a predicted attribute, where the probability map characterizes, for a pedestrian image divided into at least one pedestrian component region, the set of probability values that each pixel node belongs to a pedestrian component region; calculating a training loss using the real attribute corresponding to the pedestrian image and the predicted attribute; and, if the training loss converges, determining the current model parameters of the convolutional neural network as the model parameters of the pedestrian attribute analysis model, thereby obtaining the pedestrian attribute analysis model. When the pedestrian attribute analysis model obtained by this training method is used to identify a pedestrian image of unknown attributes, pedestrian attributes can be identified relatively accurately even from pedestrian images with large differences in application scene, pedestrian posture, and camera angle.

Description

Training method for a pedestrian attribute analysis model, and pedestrian attribute identification method and device
Technical Field
The invention relates to the technical field of image processing, in particular to a training method and a training system for a pedestrian attribute analysis model, and a pedestrian attribute identification method and identification device.
Background
Pedestrian attribute recognition (Pedestrian attribute recognition) refers to a technology of processing and analyzing pictures to identify attributes of pedestrians, where the attributes of pedestrians include physical features (such as height, weight, etc.), wearing features (such as jacket, type and color of trousers, backpack, etc.), and human face features (such as age, gender, race, etc.).
At present, pedestrian attribute recognition is mainly based on neural network models and generally comprises two stages: model training and model recognition. In the model training stage, labeled images are used as input data to train the neural network model, after which model parameters meeting the requirements can be obtained. The neural network model with these parameters determined is the trained model. In the recognition stage, the picture to be recognized is used as input data of the trained neural network model to obtain output data, i.e., the attributes of the pedestrian recognized from the picture.
However, pedestrian attribute identification methods based on neural network models have some problems in practical application. One is that the diversity of pedestrian pictures prevents the pedestrian attributes in all pictures from being accurately identified. Specifically, in practical application scenes, the collected pedestrian pictures are quite diverse owing to different monitoring scenes, camera angles, pedestrian clothing, postures, and the like. A neural-network-based method may recognize some pictures well, yet for pictures with great changes in application scene, camera angle, and so on, it cannot accurately recognize the pedestrian attributes.
Disclosure of Invention
In order to solve the above technical problems, the present application provides a neural network model training method for identifying attributes of pedestrians, and a method for identifying attributes of pedestrians using the trained neural network model, so as to accurately identify the attributes of pedestrians from diverse pictures.
In a first aspect, a training method for a pedestrian attribute analysis model is provided, which includes:
inputting a pedestrian image and a probability map corresponding to the pedestrian image into a convolutional neural network to obtain a prediction attribute; the probability map characterizes a set of probability values that each pixel node belongs to a pedestrian component region in the pedestrian image divided into at least one pedestrian component region;
calculating a training loss by using a real attribute corresponding to the pedestrian image and the prediction attribute;
and if the training loss is converged, determining the current model parameters of the convolutional neural network as the model parameters of the pedestrian attribute analysis model to obtain the pedestrian attribute analysis model.
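The loop implied by this aspect (predict, compute the loss, stop when it converges) can be sketched as follows. This is a minimal illustration only: `forward`, `loss_fn`, and `update` below are hypothetical stand-ins, and a scalar toy model replaces the convolutional neural network.

```python
import numpy as np

def train_until_converged(model_params, forward, loss_fn, update, data,
                          tol=1e-4, max_iter=1000):
    """Generic sketch of the claimed loop: predict, compute the training loss,
    and stop (freezing the current parameters) once the loss has converged,
    here approximated as the change between iterations falling below tol."""
    prev_loss = np.inf
    for _ in range(max_iter):
        preds = forward(model_params, data["images"])     # predicted attributes
        loss = loss_fn(preds, data["labels"])             # loss vs. real attributes
        if abs(prev_loss - loss) < tol:                   # convergence test
            return model_params, loss                     # keep current parameters
        model_params = update(model_params, preds, data)  # otherwise keep training
        prev_loss = loss
    return model_params, prev_loss

# Toy usage: fit a single scalar "parameter" by gradient steps on squared error.
data = {"images": np.array([1.0, 2.0, 3.0]), "labels": np.array([2.0, 4.0, 6.0])}
forward = lambda w, x: w * x
loss_fn = lambda p, y: float(np.mean((p - y) ** 2))
update = lambda w, p, d: w - 0.05 * float(np.mean(2 * (p - d["labels"]) * d["images"]))
w, final_loss = train_until_converged(1.0, forward, loss_fn, update, data)
```

The toy problem has optimum w = 2, so the loop stops near that value once the loss stabilizes.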
With reference to the first aspect, in a first possible implementation manner of the first aspect, the calculating of the probability map includes the following steps:
inputting a pedestrian image into a pedestrian analysis model to obtain a probability map corresponding to the pedestrian image; the pedestrian analysis model is a fully-convolutional neural network trained by adopting a training image with a real probability map label.
With reference to the first implementation manner of the first aspect, in a second possible implementation manner of the first aspect, the convolutional neural network includes a first sub-network and a second sub-network;
the step of inputting the pedestrian image and the probability map corresponding to the pedestrian image into a convolutional neural network to obtain a prediction attribute comprises:
extracting pedestrian features from the pedestrian image by using a first sub-network;
updating the probability map;
obtaining fusion characteristics according to the updated probability map and the pedestrian characteristics;
and inputting the fusion characteristics into a second sub-network to obtain the prediction attributes.
With reference to the first aspect and the foregoing possible implementation manners, in a third possible implementation manner of the first aspect, the step of updating the probability map specifically includes updating each probability value according to

p̃_c^s(x_i) = 1, if c = argmax_{c' ∈ {1, …, C'}} p_{c'}^s(x_i); p̃_c^s(x_i) = p_c^s(x_i), otherwise;

wherein x_i is the ith pedestrian image; p_c^s(x_i) represents the probability value that the sth pixel node in the ith pedestrian image belongs to the cth pedestrian component region; p̃_c^s(x_i) represents the updated probability value; and the first case indicates that, for the sth pixel node, its probability value attains the maximum over the C' pedestrian component regions at class c;
and/or,
the step of obtaining fusion features according to the updated probability map and the pedestrian features specifically includes:
carrying out convolution fusion on the updated probability map and the pedestrian feature to obtain a first feature;
φ(x_i) = [φ(x_i)_1, φ(x_i)_2, …, φ(x_i)_C']

wherein φ(x_i)_c represents the first feature of the cth channel of the ith pedestrian image; p̃_c(x_i) represents the updated set of probability values of the cth channel; the updated probability value set of the cth channel is copied so that its number of channels is the same as that of the pedestrian feature f_b(x_i); ⊙ denotes pixel-wise multiplication at corresponding positions; f_b(x_i) represents the pedestrian feature of the ith pedestrian image; and φ(x_i) represents the first feature of the ith pedestrian image;
and obtaining a fusion feature by using the first feature and the pedestrian feature.
With reference to the first aspect and the foregoing possible implementation manners, in a fourth possible implementation manner of the first aspect, the step of calculating a training loss by using a real attribute corresponding to the pedestrian image and the predicted attribute specifically includes:
J(θ) = ∑_j λ_j J(θ_j)

wherein J(θ) is the total training loss; J(θ_j) is the training loss of the jth task; λ_j represents the task weight of the jth task; σ_j² represents the variance of the prediction uncertainty of the jth task; M represents the total number of pedestrian images used for training in the jth task; K_j represents the number of options for the value of the jth attribute; y_i^j represents the real attribute of the ith pedestrian image in the jth task; and P_i^j represents the predicted attribute of the ith pedestrian image in the jth task.
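As an illustration of the weighted total loss J(θ) = ∑_j λ_j J(θ_j), the sketch below uses a per-task cross-entropy as J(θ_j) and fixed example weights. Cross-entropy and the toy numbers are assumptions for illustration only; the claim does not reproduce the per-task loss or the weight formula here.

```python
import numpy as np

def cross_entropy(pred_probs, labels):
    """Per-task loss over M images: mean negative log-probability assigned to
    the real attribute (an illustrative stand-in for J(theta_j))."""
    m = len(labels)
    return float(-np.mean(np.log(pred_probs[np.arange(m), labels])))

def total_loss(task_losses, task_weights):
    """J(theta) = sum_j lambda_j * J(theta_j): weighted sum over attribute tasks."""
    return float(sum(w, ) if False else sum(w * l for w, l in zip(task_weights, task_losses)))

# Two toy tasks: hair colour (K_1 = 3 options) and gender (K_2 = 2), M = 2 images.
hair_probs = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]])   # P_i^j
hair_labels = np.array([0, 1])                               # y_i^j
gender_probs = np.array([[0.9, 0.1], [0.6, 0.4]])
gender_labels = np.array([0, 0])
losses = [cross_entropy(hair_probs, hair_labels),
          cross_entropy(gender_probs, gender_labels)]
J = total_loss(losses, task_weights=[1.0, 0.5])
```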
In a second aspect, a training method for a pedestrian attribute analysis model is provided, which includes:
inputting the pedestrian image into a convolutional neural network to obtain a prediction attribute;
calculating a training loss using the real attributes corresponding to the pedestrian images and the predicted attributes:
J(θ) = ∑_j λ_j J(θ_j)

wherein J(θ) is the total training loss; J(θ_j) is the training loss of the jth task; λ_j represents the task weight of the jth task; σ_j² represents the variance of the prediction uncertainty of the jth task; M represents the total number of pedestrian images used for training in the jth task; K_j represents the number of options for the value of the jth attribute; y_i^j represents the real attribute of the ith pedestrian image in the jth task; and P_i^j represents the predicted attribute of the ith pedestrian image in the jth task;
and if the training loss is converged, determining the current model parameters of the convolutional neural network as the model parameters of the pedestrian attribute analysis model to obtain the pedestrian attribute analysis model.
In a third aspect, a pedestrian attribute identification method is provided, which includes the following steps:
inputting the pedestrian image to be recognized into the pedestrian attribute analysis model obtained by the training method of any one of the first aspect or the second aspect, to obtain the recognized pedestrian attributes.
In a fourth aspect, a pedestrian attribute analysis model training system is provided, including:
the first training unit is used for inputting the pedestrian image and the probability map corresponding to the pedestrian image into a convolutional neural network to obtain a prediction attribute; calculating a training loss by using a real attribute corresponding to the pedestrian image and the prediction attribute; determining the current model parameters of the convolutional neural network as the model parameters of a pedestrian attribute analysis model under the condition of the convergence of the training loss to obtain the pedestrian attribute analysis model;
wherein the probability map characterizes a set of probability values that each pixel node belongs to a pedestrian component region in the pedestrian image demarcated into at least one pedestrian component region.
In a fifth aspect, a pedestrian attribute analysis model training system is provided, including:
the second training unit is used for inputting the pedestrian image into the convolutional neural network to obtain a prediction attribute; calculating a training loss by using a real attribute corresponding to the pedestrian image and the prediction attribute; under the condition that the training loss is converged, determining the current model parameters of the convolutional neural network as the model parameters of the pedestrian attribute analysis model to obtain the pedestrian attribute analysis model;
the second training unit comprises:
the second weight self-updating unit is used for adjusting the task weight corresponding to the task according to the following formula:
wherein λ_j represents the task weight of the jth task; σ_j² represents the variance of the prediction uncertainty of the jth task; M represents the total number of pedestrian images used for training in the jth task; K_j represents the number of options for the value of the jth attribute; y_i^j represents the real attribute of the ith pedestrian image in the jth task; and P_i^j represents the predicted attribute of the ith pedestrian image in the jth task;
the second training unit is further configured to calculate a total training loss using the task weights:
J(θ) = ∑_j λ_j J(θ_j)

wherein J(θ) is the total training loss; J(θ_j) is the training loss of the jth task; and λ_j represents the task weight of the jth task.
In a sixth aspect, there is provided a pedestrian attribute identification device including:
and the prediction unit is used for inputting the pedestrian image to be recognized into the neural network model trained by the training system of any one of the fourth aspect and the fifth aspect and outputting the recognized pedestrian attribute.
In the training method of the pedestrian attribute analysis model in the first aspect of the present invention, first, a pedestrian image and a probability map corresponding to the pedestrian image are input to a convolutional neural network to obtain a prediction attribute, where the probability map represents a set of probability values of each pixel node belonging to a pedestrian component region in the pedestrian image divided into at least one pedestrian component region. Then, a training loss is calculated using the real attribute corresponding to the pedestrian image and the predicted attribute. And if the training loss is converged, determining the current model parameters of the convolutional neural network as the model parameters of the pedestrian attribute analysis model to obtain the pedestrian attribute analysis model. By the method, the pedestrian component region is divided on the pixel level, the indication information of the pedestrian component region is given, and the pedestrian attribute analysis network is guided to learn the characteristics with pertinence and robustness, so that the influence of diversity such as pedestrian postures and camera angles can be resisted to a certain extent. When the analysis model obtained by training by the training method is adopted to identify the pedestrian image with unknown attribute, even the pedestrian image with large difference of application scene, pedestrian posture and camera angle can be identified more accurately.
In addition, in view of the fact that the difficulty degree and the convergence rate of each attribute analysis task learning are not the same, the technical solution of the second aspect provides a training method for a pedestrian attribute analysis model, in which task weights corresponding to tasks are automatically updated according to different tasks, that is, each task weight is adjusted according to the training condition of each attribute analysis task during each training, so as to increase the contribution degree of a simple task in model training, prevent the model from being dominated by a difficult task, enable a plurality of tasks to be trained in a coordinated manner, and help feature learning and information exchange of each task. The analysis model trained by the training method can accurately identify a plurality of pedestrian attributes from the pedestrian image.
Drawings
In order to explain the technical solution of the present application more clearly, the drawings needed in the embodiments are briefly described below; obviously, those skilled in the art can obtain other drawings from these drawings without creative effort.
FIG. 1 is a flow chart of one implementation of a first embodiment of a training method of a pedestrian attribute analysis model according to the present application;
FIG. 2 is a flowchart of a second implementation manner of the first embodiment of the training method for a pedestrian attribute analysis model according to the present application;
FIG. 3 is a flowchart of one implementation of the step S100 in the first embodiment of the training method for a pedestrian attribute analysis model according to the present application;
FIG. 4 is a flowchart of a third implementation manner of the first embodiment of the training method of the pedestrian attribute analysis model according to the present application;
FIG. 5 is a flow chart of one implementation of a second embodiment of a training method of a pedestrian attribute analysis model according to the present application;
FIG. 6 is a flowchart of a second implementation manner of the second embodiment of the training method of the pedestrian attribute analysis model according to the present application;
FIG. 7 is a schematic diagram of a pedestrian attribute analysis model training system according to one embodiment of the present application;
FIG. 8 is a schematic diagram of a second embodiment of a pedestrian attribute analysis model training system according to the present application;
FIG. 9 is a schematic diagram of a third embodiment of a pedestrian attribute analysis model training system according to the present application;
fig. 10 is a schematic structural diagram of a fourth embodiment of the training system for pedestrian attribute analysis model according to the present application.
Detailed Description
The following provides a detailed description of the embodiments of the present application.
A Convolutional Neural Network (CNN) is a multi-layer neural network model that is well suited to machine learning problems involving images, especially large images.
In order to solve the problem that the diversity of pedestrian pictures reduces the accuracy of pedestrian attribute identification, referring to fig. 1, a first embodiment of the present application provides a training method for a pedestrian attribute analysis model, which includes:
s100: and inputting the pedestrian image and the probability map corresponding to the pedestrian image into a convolutional neural network to obtain a prediction attribute.
The pedestrian image refers to an image including a pedestrian; it can be an original picture, such as a picture from a monitoring video, or a picture that has been preprocessed. The pedestrian images in this step belong to a first training set, which is the set of training samples for training the pedestrian attribute analysis model; each pedestrian image in the first training set is a training sample. Each pedestrian image is provided with a corresponding real attribute label marking the real attributes of the pedestrian image. The real attribute labels here may be manually annotated.
In an implementation manner, the pedestrian image is obtained by preprocessing an original picture, and specifically includes the following steps:
detecting whether the original picture contains a pedestrian or not;
if the pedestrian is contained, acquiring the position of the pedestrian in the original picture;
and cutting out a pedestrian image from the original picture according to the position of the pedestrian.
Alternatively, the size of the cut-out pedestrian image may be preset here, for example, the pixel size of the cut-out pedestrian image is preset to 224 × 224.
And if the original picture does not contain the pedestrian, abandoning the original picture and carrying out the preprocessing step on the next original picture. If multiple pedestrians are detected in one original picture, multiple pedestrian images can be obtained through cutting.
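A minimal sketch of this preprocessing, with a hypothetical detector output standing in for the pedestrian-detection step (in practice the boxes would come from an actual pedestrian detector):

```python
import numpy as np

def crop_pedestrians(original, boxes, out_size=(224, 224)):
    """Cut one pedestrian image per detected box; return [] (discard the
    picture) when no pedestrian is detected. `boxes` stands in for a
    hypothetical detector's output: (top, left, height, width) per pedestrian."""
    if not boxes:
        return []  # no pedestrian: discard this original picture
    crops = []
    for top, left, h, w in boxes:
        patch = original[top:top + h, left:left + w]
        # Nearest-neighbour resize to the preset size (e.g. 224 x 224 pixels).
        rows = np.arange(out_size[0]) * patch.shape[0] // out_size[0]
        cols = np.arange(out_size[1]) * patch.shape[1] // out_size[1]
        crops.append(patch[np.ix_(rows, cols)])
    return crops

picture = np.random.rand(480, 640, 3)   # a toy "original picture"
crops = crop_pedestrians(picture, [(50, 100, 300, 120), (10, 400, 200, 80)])
```

Two detected pedestrians yield two fixed-size pedestrian images, matching the note that one original picture may produce several crops.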
A pedestrian image may include S' pixel nodes; for example, with a crop size of 224 × 224, S' = 50176. The preset pedestrian component regions may include hair, face, upper body, lower body, etc. The probability map represents a set of probability values that each pixel node belongs to a preset pedestrian component region in the pedestrian image divided into at least one pedestrian component region. The probability map may therefore be represented as a matrix in which each probability value is the probability that a pixel node belongs to a certain pedestrian component region.
Assume the preset pedestrian component regions comprise C' classes in total, and the ith pedestrian image includes S' pixel nodes. s (lower case) denotes the serial number of a pixel node, i.e., the sth pixel node. c (lower case) denotes the serial number of a pedestrian component region, i.e., the cth pedestrian component region. p_c^s(x_i) represents the probability value that the sth pixel node in the ith pedestrian image belongs to the cth pedestrian component region, and the set of all such values represents the probability map of the ith pedestrian image. It should be noted that in the present application, the "C" in C' is a capital letter, indicating the total number of divided pedestrian component regions in the ith pedestrian image, while lower-case c is the pedestrian component region number, representing the cth of the C' pedestrian component regions. Likewise, the "S" in S' is a capital letter, indicating the total number of pixel nodes in the ith pedestrian image, while lower-case s is the pixel node number, representing the sth of the S' pixel nodes.
For example, for a certain pedestrian image whose divided pedestrian component regions comprise the 3 classes hair, face, and upper body, and which includes 224 × 224 = 50176 pixel nodes, the probability map can be represented as a 3 × 224 × 224 three-dimensional matrix. Expressing the probability map two-dimensionally, i.e., numbering the pixel nodes in order, yields the correspondence shown in Table 1.
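The shape of such a probability map can be illustrated with a toy numpy example (random values, normalized so that each pixel node's probabilities over the 3 regions sum to 1):

```python
import numpy as np

C, H, W = 3, 224, 224     # 3 component regions; 224 x 224 = 50176 pixel nodes
rng = np.random.default_rng(0)
raw = rng.random((C, H, W))
# Normalise over the region axis so each pixel node carries a distribution.
prob_map = raw / raw.sum(axis=0, keepdims=True)
# prob_map[c, y, x] is the probability that pixel node (y, x) belongs to region c.
```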
Table 1 Matrix value-meaning correspondence
wherein p_1^1(x_i) represents the probability value that the 1st pixel node belongs to hair, p_2^1(x_i) represents the probability value that the 1st pixel node belongs to the face, and the meanings of the other values are similar; the specific correspondence is shown in Table 1.
Referring to fig. 2, the probability map corresponding to the pedestrian image can be obtained by the following steps:
s400: and inputting the pedestrian image into a pedestrian analysis model to obtain a probability map corresponding to the pedestrian image. The pedestrian analysis model is a fully-convolutional neural network trained by adopting a training image with a real probability map label.
In this step, the pedestrian analysis model has already been trained, i.e., it is a Fully Convolutional Network (FCN) whose model parameters have been determined.
Here, the training image is also a pedestrian image, and belongs to a second training set, which is a set of training samples for training a pedestrian analysis model. Each training image is provided with a real probability map label corresponding to the training image and is used for marking the real probability map of the training image.
For a certain training image, the real probability map label carried by the training image represents the actual pedestrian component regions into which the pedestrian image is divided and the probability value that each pixel node in the pedestrian image belongs to a certain actual pedestrian component region. For a certain pixel node, the probability value that it belongs to a certain actual pedestrian component region is usually 1 or 0, where 1 indicates that the pixel node belongs to that actual pedestrian component region and 0 indicates that it does not.
The main process of training the pedestrian analysis model is as follows. The training images in the second training set are input into a fully convolutional network, which first divides each training image into at least one predicted pedestrian component region and then outputs the predicted probability value of each pixel node belonging to each predicted pedestrian component region, yielding a prediction probability map. The training loss of the fully convolutional network is calculated from the prediction probability map and the real probability map of the training image. If the training loss converges, the current model parameters of the fully convolutional network are determined as the model parameters of the pedestrian analysis model. If the training loss does not converge, the model parameters of the fully convolutional network are updated and the foregoing training steps are repeated until the computed training loss converges, at which point the latest model parameters of the fully convolutional network are determined as the parameters of the pedestrian analysis model.
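A hedged sketch of the per-image loss in this training process, using pixel-wise cross-entropy between the predicted probability map and the 0/1 real probability map label (a common choice for FCN training; the exact loss is not specified here):

```python
import numpy as np

def pixel_parsing_loss(pred_map, true_map, eps=1e-9):
    """Mean per-pixel cross-entropy between a predicted probability map of
    shape (C', H, W) and a 0/1 real probability map label of the same shape."""
    return float(-np.mean(np.sum(true_map * np.log(pred_map + eps), axis=0)))

# Toy 2-region, 2x2 example: the label says every pixel belongs to region 0.
true_map = np.zeros((2, 2, 2))
true_map[0] = 1.0
good = np.stack([np.full((2, 2), 0.9), np.full((2, 2), 0.1)])  # confident, correct
bad = np.stack([np.full((2, 2), 0.5), np.full((2, 2), 0.5)])   # uninformative
```

A prediction concentrated on the labeled region gets a lower loss, which is what drives the parameter updates toward convergence.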
Pedestrian attributes may include physical characteristics (e.g., height, weight, etc.), wear characteristics (e.g., jacket, type and color of pants, backpack, etc.), facial characteristics (e.g., age, gender, race), etc.
When training the pedestrian attribute analysis model, the pedestrian images input into the convolutional neural network come from the first training set, and for each pedestrian image a corresponding predicted pedestrian attribute, i.e., a prediction attribute, is obtained. When only one pedestrian attribute needs to be predicted, the training may be called single-task attribute analysis training. When multiple pedestrian attributes need to be predicted, it is called multi-task attribute analysis training.
Taking single-task attribute analysis training as an example: for a pedestrian image, suppose the only predicted pedestrian attribute is hair color, with 6 possible colors. Table 2 shows, after the pedestrian image is input into the convolutional neural network, the predicted probability that the hair is each corresponding color.
Table 2 Prediction attribute example

|             | Yellow | Brown | White | Red  | Green | Black |
| Hair colour | 0.5    | 0.3   | 0.01  | 0.05 | 0.04  | 0.1   |
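The row of Table 2 is a probability distribution over the K_j = 6 colour options; reading off the recognized attribute value amounts to taking the most probable option:

```python
# The single-task output of Table 2: a distribution over the 6 colour options.
colors = ["yellow", "brown", "white", "red", "green", "black"]
probs = [0.5, 0.3, 0.01, 0.05, 0.04, 0.1]

# The recognized attribute value is the option with the highest probability.
predicted = colors[max(range(len(probs)), key=probs.__getitem__)]
```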
In particular, in one implementation, when introducing the probability map into the method of training the pedestrian property analysis model, an attention mechanism may also be introduced simultaneously. Specifically, referring to fig. 3, the convolutional neural network includes a first sub-network and a second sub-network; the step of S100 may include:
s110: extracting pedestrian features from the pedestrian image by using a first sub-network;
s120: updating the probability map;
s130: obtaining fusion characteristics according to the updated probability map and the pedestrian characteristics;
s140: and inputting the fusion characteristics into a second sub-network to obtain the prediction attributes.
In step S110, the pedestrian feature may be represented as f_b(x_i). Extracting pedestrian features from the pedestrian image using the first sub-network can be implemented with existing methods. More specifically, the first sub-network may comprise several convolution groups, each comprising one convolution layer and one pooling layer. The pedestrian image is input into the first sub-network and, after feature extraction and down-sampling by the convolution groups, the pedestrian feature f_b(x_i) of the pedestrian image is finally obtained.
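One such convolution group (one convolution layer followed by one pooling layer) can be sketched as below. The naive single-channel convolution and the averaging kernel are illustrative stand-ins only, since the actual layer configuration and learned weights are not specified here.

```python
import numpy as np

def conv3x3_same(x, kernel):
    """Naive 3x3 'same' convolution of a single-channel map (illustrative only)."""
    padded = np.pad(x, 1)
    h, w = x.shape
    out = np.zeros_like(x)
    for dy in range(3):
        for dx in range(3):
            out += kernel[dy, dx] * padded[dy:dy + h, dx:dx + w]
    return out

def max_pool2(x):
    """2x2 max pooling with stride 2: halves height and width."""
    h, w = x.shape
    return x[:h - h % 2, :w - w % 2].reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

x = np.random.rand(224, 224)             # one channel of a 224 x 224 image
kernel = np.full((3, 3), 1.0 / 9.0)      # stand-in weights (learned in practice)
y = max_pool2(conv3x3_same(x, kernel))   # one convolution group: conv then pool
```

Stacking several such groups yields the repeated feature extraction and down-sampling described above.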
In step S120, updating the probability values in the probability map specifically includes computing

p̃_c^s(x_i) = 1, if c = argmax_{c' ∈ {1, …, C'}} p_{c'}^s(x_i); p̃_c^s(x_i) = p_c^s(x_i), otherwise;

wherein x_i is the ith pedestrian image, and i is the index number of the pedestrian image; p_c^s(x_i) represents the probability value that the sth pixel node in the ith pedestrian image belongs to the cth pedestrian component region; and p̃_c^s(x_i) represents the updated probability value. That is, when the sth pixel node attains its maximum probability value over the C' pedestrian component regions at class c, p̃_c^s(x_i) is updated to 1; otherwise, p̃_c^s(x_i) keeps its original value p_c^s(x_i).
For example, take the sth pixel node in Table 1: its probability values before the update are shown in Table 3, and its probability values after the update are shown in Table 4.
Table 3 Probability values of the sth pixel node before the update

|                | Hair (c = 1) | Face (c = 2) | Upper body (c = 3) |
| sth pixel node | 0.7          | 0.2          | 0.1                |

Table 4 Probability values of the sth pixel node after the update

|                | Hair (c = 1) | Face (c = 2) | Upper body (c = 3) |
| sth pixel node | 1            | 0.2          | 0.1                |
If the probability value that the sth pixel node belongs to the cth pedestrian component region is the maximum, the cth pedestrian component region is considered the predicted category of the sth pixel node, and the other (C' − 1) categories are non-predicted categories of that node. By updating the probability value in the probability map, the probability of the predicted category is set to 1, so that the information of the pixel node can be preserved as much as possible in the subsequent feature fusion step. Meanwhile, the probabilities of the non-predicted categories are also retained, to prevent information loss in case the prediction for the sth pixel node is wrong.
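The update rule described above (set the probability of the predicted category to 1, keep the non-predicted categories) can be sketched directly:

```python
import numpy as np

def update_probability_map(prob_map):
    """For each pixel node of a (C', H, W) probability map, set the probability
    of its argmax region to 1 and keep the non-predicted probabilities as-is."""
    updated = prob_map.copy()
    winners = prob_map.argmax(axis=0)          # predicted region per pixel node
    h_idx, w_idx = np.indices(winners.shape)
    updated[winners, h_idx, w_idx] = 1.0
    return updated

# The sth pixel node of Tables 3/4: (hair, face, upper body) = (0.7, 0.2, 0.1).
pm = np.array([0.7, 0.2, 0.1]).reshape(3, 1, 1)   # C' = 3, a single pixel node
upd = update_probability_map(pm)
```

Running it on the Table 3 values reproduces the Table 4 values.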
In the step S130, obtaining a fusion feature according to the updated probability map and the pedestrian feature, which may specifically include:
s131: carrying out convolution fusion on the updated probability map and the pedestrian feature to obtain a first feature;
s132: abstracting the first feature to obtain a second feature;
s133: abstracting the pedestrian feature to obtain a third feature;
s134: and adding and fusing the second characteristic and the third characteristic to obtain a fused characteristic.
In one implementation, the step of convolution fusion of S131 may include:
φ(x_i)_c = \tilde{M}_c(x_i) ⊙ f_b(x_i)

φ(x_i) = [φ(x_i)_1, φ(x_i)_2, …, φ(x_i)_{C'}]

wherein,

φ(x_i)_c represents the first feature of the c-th channel of the i-th pedestrian image;

\hat{M}_c(x_i) represents the updated set of probability values of the c-th channel, that is, the set of (updated) probability values that each pixel node in the pedestrian image belongs to the c-th pedestrian component region;

\tilde{M}_c(x_i) represents the updated probability value set of the c-th channel after replication, so that its number of channels matches that of the pedestrian feature f_b(x_i);

⊙ represents pixel-wise multiplication at corresponding positions;

f_b(x_i) represents the pedestrian feature of the i-th pedestrian image;

φ(x_i) represents the first feature of the i-th pedestrian image.
For example, assume the input pedestrian feature f_b(x_i) has size (128, 56, 56), where 128 is the number of channels of the pedestrian feature and the two 56s are the height and width; and assume the updated probability map has size (9, 56, 56), where 9 is the number of channels of the probability map, i.e. the total number of pedestrian component regions, and the two 56s are again height and width. The c-th channel of the updated probability map is copied 128 times so that its channel count matches that of the pedestrian feature f_b(x_i), i.e. \tilde{M}_c(x_i) has size (128, 56, 56). \tilde{M}_c(x_i) and f_b(x_i) are then multiplied element-wise and convolution-fused to obtain the first feature φ(x_i)_c of the fused c-th channel.
Since the convolution fusion operation is performed separately for each channel of the probability map, i.e. for each pedestrian component region, the resulting first feature is φ(x_i) = [φ(x_i)_1, φ(x_i)_2, …, φ(x_i)_{C'}], which can be expressed as a matrix of size ((9 × 128) × 56 × 56).
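A minimal numpy sketch of this per-channel replication and element-wise multiplication follows (the learned convolution that comes afterwards is omitted; the shapes are those of the example above):

```python
import numpy as np

def fuse(prob_map, feat):
    """Per-region masking: broadcast each probability channel across the
    feature channels, multiply element-wise, and stack the results.
    A sketch of the described fusion, not the patented implementation.
    """
    c_regions = prob_map.shape[0]
    # prob_map[c][None] broadcasts (H, W) against feat's (128, H, W).
    parts = [prob_map[c][None, :, :] * feat for c in range(c_regions)]
    return np.concatenate(parts, axis=0)    # (C' * 128, H, W)

feat = np.random.rand(128, 56, 56)          # pedestrian feature f_b(x_i)
prob = np.random.rand(9, 56, 56)            # updated probability map, C' = 9
phi = fuse(prob, feat)
print(phi.shape)                            # (1152, 56, 56)
```

The stacked output of (9 × 128) = 1152 channels corresponds to the ((9 × 128) × 56 × 56) matrix mentioned in the text.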
After the updated probability map and the pedestrian feature are convolution-fused, the features of each semantic region in the pedestrian image are separated out individually. This lets the convolution layers of the second sub-network, in the subsequent step, learn which semantic regions to emphasize according to the specific values in the first feature, and how to combine the features of those semantic regions.
In steps S132 to S134, the first feature is abstracted by several convolution layers to obtain a second feature whose number of channels equals that of the pedestrian feature. The pedestrian feature is likewise abstracted by several convolution layers to obtain a third feature with the same number of channels. Finally, the second and third features are added and fused to obtain a more comprehensive fusion feature, so that the convolutional neural network can be trained more accurately and the prediction accuracy improved.
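For shape bookkeeping only, the two abstraction branches of S132/S133 and the additive fusion of S134 might be sketched as below, with random 1×1-convolution weights standing in for the learned convolution layers (all names and sizes are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1x1(x, w):
    """A 1x1 convolution as a channel-mixing contraction:
    w (C_out, C_in) applied to x (C_in, H, W) -> (C_out, H, W)."""
    return np.tensordot(w, x, axes=([1], [0]))

first = rng.random((1152, 8, 8))        # first feature, C' * 128 channels
pedestrian = rng.random((128, 8, 8))    # pedestrian feature f_b(x_i)

# S132 / S133: abstract each branch back to 128 channels
# (random weights stand in for the trained convolution layers).
second = conv1x1(first, rng.random((128, 1152)))
third = conv1x1(pedestrian, rng.random((128, 128)))

fused = second + third                  # S134: additive fusion
print(fused.shape)                      # (128, 8, 8)
```

The point of the sketch is that both branches end with the same channel count, which is what makes the element-wise addition of S134 well defined.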
In step S140, the second sub-network and the first sub-network can be regarded as a part of a convolutional neural network, and the two sub-networks together form a convolutional neural network. The second sub-network may comprise a plurality of convolution groups and a plurality of fully connected layers, each convolution group comprising a convolution layer and a pooling layer, the plurality of convolution groups being connected in sequence, the last convolution group being connected to the fully connected layer. And inputting the fusion features into a second sub-network, performing feature extraction and down-sampling of a plurality of convolution groups, and performing prediction through a full-connection layer to obtain prediction attributes.
For example, Table 5 illustrates the predicted attribute labels of one pedestrian image in the first training set, wherein each value represents the probability that the attribute takes the corresponding option. That is, 0.1 indicates that the probability of the pedestrian image's hair color being yellow is 0.1, the probability of it being brown is 0.6, and so on for the remaining options. The option with the highest probability is selected as the finally output predicted value of the attribute.
Table 5 Prediction attribute example

Hair colour: Yellow 0.1, Brown 0.6, White 0.05, Red 0.15, Green 0.05, Black 0.05
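Selecting the finally output predicted value from such a row of option probabilities can be sketched as follows (illustrative only):

```python
# Probabilities of the "hair colour" attribute from Table 5.
probs = {"yellow": 0.1, "brown": 0.6, "white": 0.05,
         "red": 0.15, "green": 0.05, "black": 0.05}

# The finally output predicted value is the highest-probability option.
predicted = max(probs, key=probs.get)
print(predicted)   # brown
```

The same rule is applied per attribute when there are multiple prediction tasks.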
S200: and calculating training loss by using the real attribute corresponding to the pedestrian image and the prediction attribute.
As mentioned above, each pedestrian image in the first training set has a corresponding real attribute label for labeling the real attribute of the pedestrian image. The real property tags here may be manually labeled.
For example, table 6 illustrates the true attribute labels of one pedestrian image in the first training set shown in table 5. Wherein, 1 represents that the attribute is a corresponding value, and 0 represents that the attribute is not the value. That is, the color of the hair of the pedestrian image is brown, not the other five colors.
Table 6 Real attribute example

Hair colour: Yellow 0, Brown 1, White 0, Red 0, Green 0, Black 0
The training loss can be calculated by using an existing loss function, such as a square error loss function, an SVM loss function, a softmax loss function, and the like.
As mentioned above, when only one pedestrian attribute needs to be predicted, such training may be referred to as single-task attribute analysis training; when multiple pedestrian attributes need to be predicted, it is called multi-task attribute analysis training. If T attributes are to be predicted in total, there are correspondingly T training tasks, the task of predicting the j-th attribute being the j-th task. The total training loss is then:
J(θ)=∑jλjJ(θj)
wherein J (θ) is the total loss of training;
J(θj) A training loss for the jth task;
λ_j represents the task weight of the j-th task; in general, λ_j may be set to a fixed value.
The training loss J(θ_j) of the j-th task can be obtained by existing calculation methods; for example, J(θ_j) may optionally be calculated with a softmax loss function, as follows:

J(θ_j) = −(1/m) Σ_{i=1}^{m} Σ_{k=1}^{K_j} w_k^j · δ(y_i^j = k) · log P_{i,k}^j
wherein m represents the total number of pedestrian images used for training in the jth task, namely the total number of training samples of the jth task, and i is the index number of the ith pedestrian image in the m pedestrian images.
K_j represents the number of options for the value of the j-th attribute; k represents the index number of the k-th of the K_j options. For example, for the pedestrian attribute "gender", the value may have the two options "male" and "female", in which case K_j equals 2.
y_i^j represents the real attribute of the i-th pedestrian image in the j-th task.
δ(·) represents a Dirac function; δ(y_i^j = k) = 1 if and only if y_i^j takes the value of the k-th option, and δ(y_i^j = k) = 0 otherwise.
w_k^j represents a penalty coefficient for resisting unbalanced data, wherein a_k^j represents the proportion of the training samples of the j-th task whose j-th attribute takes the k-th option, relative to the total number of training samples of the j-th task. In multi-task attribute analysis training, the number of training samples for each value of each attribute is often unbalanced; for example, the training samples may contain far fewer pedestrians wearing sunglasses than pedestrians not wearing them. To make the training more effective, the penalty coefficient w_k^j for resisting unbalanced data is introduced: as the proportion a_k^j of samples taking the k-th option of the j-th attribute becomes larger, w_k^j becomes smaller, thereby penalizing the over-represented option.
P_{i,k}^j represents the probability value of predicting the j-th attribute of the i-th pedestrian image as the k-th option.
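A numpy sketch of this per-task weighted softmax loss follows; the concrete penalty w_k^j = exp(−a_k^j) is an assumed choice, since the text only requires that the penalty decrease as the sample proportion a_k^j grows:

```python
import numpy as np

def weighted_softmax_loss(logits, labels, ratios):
    """Weighted cross-entropy over m samples and K_j options, in the form
    described for J(theta_j). The penalty w_k = exp(-a_k) is an assumed
    concrete realization of the anti-imbalance coefficient.
    """
    m = logits.shape[0]
    # Stable softmax -> P_{i,k}: probability that sample i takes option k.
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    p = e / e.sum(axis=1, keepdims=True)
    w = np.exp(-ratios)                  # one penalty per option, shrinks as a_k grows
    return -np.mean(w[labels] * np.log(p[np.arange(m), labels]))

logits = np.array([[2.0, 0.5], [0.1, 1.5], [1.2, 0.3]])
labels = np.array([0, 1, 0])             # true option index y_i^j per sample
ratios = np.array([2 / 3, 1 / 3])        # a_k^j: sample share of each option
loss = weighted_softmax_loss(logits, labels, ratios)
print(float(loss) > 0)
```

With all ratios set to zero the weights become 1 and the expression reduces to the plain softmax cross-entropy, which is one quick sanity check of the weighting.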
s300: and if the training loss is converged, determining the current model parameters of the convolutional neural network as the parameters of the pedestrian attribute analysis model to obtain the pedestrian attribute analysis model.
Referring to fig. 4, if the training loss does not converge, S301 is executed: updating model parameters in the convolutional neural network. And then repeating the steps from S100 to S200 for training until the calculated training loss is converged, and determining the model parameters in the latest convolutional neural network as the parameters of the pedestrian attribute analysis model. Here, the parameters of the convolutional neural network may be updated by using a conventional algorithm such as a Stochastic Gradient Descent (SGD) method.
Besides the aforementioned diversity of pedestrian pictures affecting the accuracy of pedestrian attribute identification, neural-network-based pedestrian attribute identification has another problem in practical application: multi-task attribute analysis training is poorly coordinated. Specifically, in practice it is often necessary to identify several pedestrian attributes from one picture. However, when training the neural network model, different pedestrian attributes differ in learning difficulty and task convergence: for example, identifying a person's age is harder than identifying the color of clothes. Conventional model training methods often set the task weights of the different tasks to fixed values, ignoring these differences in difficulty and convergence, so it is hard to form coordinated training for a model that must identify multiple pedestrian attributes. For the same reason, when the trained analysis model is used to identify multiple pedestrian attributes in a pedestrian image, it is difficult to identify all of them accurately.
To this end, the second embodiment of the present application proposes another training method for a pedestrian attribute analysis model, in which the task weight λ_j may be adjusted according to the training situation.
Specifically, referring to fig. 5, a method for training a pedestrian attribute analysis model is provided, which includes steps S500 to S700.
S500: and inputting the pedestrian image into a convolutional neural network to obtain a prediction attribute.
The convolutional neural network may take the pedestrian image as input data, and the resulting output data is the prediction attribute. In one implementation, the convolutional neural network may include a first sub-network and a second sub-network. The first sub-network is used to extract the pedestrian feature f_b(x_i) from the pedestrian image. In particular, the first sub-network may comprise several convolution groups, each comprising one convolution layer and one pooling layer; the pedestrian image is input into the first sub-network and, after feature extraction and down-sampling by the convolution groups, the pedestrian feature f_b(x_i) of the pedestrian image is obtained. The second sub-network takes the pedestrian feature f_b(x_i) as input data and produces the prediction attribute. In particular, the second sub-network may comprise several convolution groups, each consisting of a convolution layer and a pooling layer, together with several fully-connected layers; the convolution groups are connected in sequence, and the last convolution group is connected to the fully-connected layers. The pedestrian feature is input into the second sub-network, passes through feature extraction and down-sampling of the convolution groups, and prediction is performed by the fully-connected layers to obtain the prediction attribute.
S600: calculating a training loss using the real attributes corresponding to the pedestrian images and the predicted attributes:
J(θ) = Σ_j λ_j J(θ_j)    (5)

λ_j = 1/(2σ_j²) = m·K_j / (2 Σ_{i=1}^{m} ||y_i^j − P_i^j||²)

wherein J(θ) is the total loss of training;

J(θ_j) is the training loss of the j-th task;

λ_j represents the task weight of the j-th task;

σ_j² represents the variance of the prediction uncertainty in the j-th task;

m represents the total number of pedestrian images used for training in the j-th task;

K_j represents the number of options for the value of the j-th attribute;

y_i^j represents the real attribute of the i-th pedestrian image in the j-th task;

P_i^j represents the predicted attribute of the i-th pedestrian image in the j-th task.
The derivation of the task weight λ_j is as follows:
For the problem of identifying and classifying pedestrian attributes, the classification problem may first be regarded as a regression problem. That is, for the i-th pedestrian image, suppose its real attribute in the j-th task is y_i^j, and convert the label into vector (one-hot) form: y_i^j = [δ(y_i^j = 1), δ(y_i^j = 2), …, δ(y_i^j = K_j)], wherein K_j represents the number of options for the value of the j-th attribute and k is the index number of the k-th of the K_j options; δ(·) represents a Dirac function, with δ(y_i^j = k) = 1 if and only if y_i^j takes the value of the k-th option, and 0 otherwise. For example, taking the real attribute of Table 6 as an example, K_j is 6, i.e. there are 6 possible values for the hair-color attribute. For the i-th pedestrian image, whose j-th attribute is hair color with real value brown, the label is expressed as y_i^j = [0, 1, 0, 0, 0, 0].
After the classification problem is considered as a fitting-regression problem, its training objective remains consistent, i.e. making the prediction attribute P_i^j approach the real attribute y_i^j. Considering homoscedastic uncertainty and a Gaussian distribution, the prediction process for the i-th pedestrian image in the j-th task can be modeled as:

p(y_i^j | P_i^j) = N(y_i^j; P_i^j, σ_j²)    (1)
wherein σ_j² is the variance of the uncertainty in the j-th task. Assuming the j-th task has m training samples and there are T prediction tasks in total, the whole process can be modeled as:

p(y | P) = ∏_{j=1}^{T} ∏_{i=1}^{m} N(y_i^j; P_i^j, σ_j²)    (2)
the negative log-likelihood function of equation (2) is written as:
in a common fitting regression problem,is its training objective function. And in the above-mentioned formula (3),is a fitting regression problem with multi-attribute analysis task weight adaptive learning, anIs the task weight of the jth task. And the latter term K in the formula (3)j logσjIs a regular term, limits σjNot too large nor too small. Considering the classification problem and regression problem described above is essentially a problem, and the goal agreement is such that the predicted property P isi jApproximating true attributesTherefore, the task weight of each task can be estimated and then applied to the classification problem.
In the above equation (3), the uncertainty σ_j² can be estimated by the maximum-likelihood method: setting ∂(−log p)/∂σ_j = 0 yields σ_j² = (1/(m·K_j)) Σ_{i=1}^{m} ||y_i^j − P_i^j||². Thus, the task weight of the j-th task may be set to:

λ_j = 1/(2σ_j²) = m·K_j / (2 Σ_{i=1}^{m} ||y_i^j − P_i^j||²)    (4)
as can be seen from the above equation (4), in the process of automatically updating the task weight, the weight λ of each taskjAs model training increases, but the speed of simple tasks increases relatively quickly (simple tasks)Fast descent and corresponding smaller losses) while the speed-up of the difficult task is relatively slow. The contribution degree of the simple task in model training can be increased to a certain extent, and the model is prevented from being difficultly trainedThe task is dominant, so that the training of the multitask attribute analysis is more harmonious.
S700: and if the training loss is converged, determining the current model parameters of the convolutional neural network as the model parameters of the pedestrian attribute analysis model.
And if the training loss is not converged, updating the model parameters in the convolutional neural network, repeating the steps from S500 to S600 for training until the calculated training loss is converged, and determining the latest model parameters in the convolutional neural network as the parameters of the pedestrian attribute analysis model.
It should be noted that, the method for training the analysis model by introducing the probability map for feature fusion in the first embodiment of the present application and the method for automatically updating the task weight in the second embodiment may be combined with each other.
Therefore, alternatively, referring to fig. 6, the step of S500 in the second embodiment may be replaced with the step of S800, i.e.
S800: and inputting the pedestrian image and the probability map corresponding to the pedestrian image into a convolutional neural network to obtain a prediction attribute. Wherein the probability map characterizes a set of probability values that each pixel node belongs to a pedestrian component region in the pedestrian image demarcated into at least one pedestrian component region.
The step S800 may refer to the description related to the first embodiment S100, and adopt the same specific implementation manner, which is not described herein again.
In a third embodiment of the present application, there is provided a pedestrian attribute identification method including the steps of:
and (3) training the obtained pedestrian attribute analysis model by adopting the training method in the first embodiment or the second embodiment, and inputting the pedestrian image to be recognized into the pedestrian attribute analysis model to obtain the recognized pedestrian attribute.
Here, the pedestrian image to be recognized is likewise an image containing a pedestrian, except that the pedestrian's attributes are unknown. The pedestrian image to be identified may be an original picture, such as a picture from a surveillance video, or a picture that has already been preprocessed.
In an implementation manner, the step of preprocessing the image of the pedestrian to be recognized from the original picture may specifically include the following steps:
detecting whether the original picture contains a pedestrian or not;
if the pedestrian is contained, acquiring the position of the pedestrian in the original picture;
and cutting out the pedestrian image to be identified from the original picture according to the position of the pedestrian.
Alternatively, the size of the cut-out image of the pedestrian to be recognized may be preset here, for example, the pixel size is 224 × 224. In general, the size of the pedestrian image to be recognized may coincide with the size of the pedestrian image employed in training the pedestrian property analysis model.
And if the original picture does not contain the pedestrian, abandoning the original picture and carrying out the preprocessing step on the next original picture. If multiple pedestrians are detected in one original picture, multiple images of the pedestrians to be recognized can be obtained through cutting.
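The preprocessing steps above might be sketched as follows; `detect_pedestrians` is a hypothetical stand-in for any real pedestrian detector returning (x, y, w, h) boxes, and the final resize to the preset size (e.g. 224 × 224) is omitted as library-dependent:

```python
import numpy as np

def preprocess(original, detect_pedestrians):
    """Detect pedestrians in an original picture and crop one image per
    detection; an empty result means the picture is discarded.
    (Sketch only; the detector interface is an assumption.)
    """
    crops = []
    for (x, y, w, h) in detect_pedestrians(original):
        crops.append(original[y:y + h, x:x + w])
        # A real pipeline would resize each crop here, e.g. to 224 x 224.
    return crops

picture = np.zeros((480, 640, 3), dtype=np.uint8)       # H x W x 3 picture
fake_detector = lambda img: [(100, 50, 64, 128)]        # one hypothetical box
crops = preprocess(picture, fake_detector)
print(len(crops), crops[0].shape)   # 1 (128, 64, 3)
```

A picture with several detections yields several pedestrian images to be recognized, matching the multi-pedestrian case described above.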
The pedestrian attributes directly output by the pedestrian attribute analysis model may also be represented as a matrix, similar to the way prediction attributes are represented during training. For example, the probability of the hair color being yellow or any of the other 5 colors may be 0.01 each, while the probability of it being brown is 0.95. When the result is output to the user, the option with the highest probability among the attribute's options is taken as the final predicted value of the attribute. Continuing the example, the predicted hair color in the pedestrian image to be recognized, as output to the user, is "brown".
In a fourth embodiment of the present application, referring to fig. 7, corresponding to the training method in the first embodiment, there is provided a training system of a pedestrian attribute analysis model, including:
the first training unit 1 is used for inputting a pedestrian image and a probability map corresponding to the pedestrian image into a convolutional neural network to obtain a prediction attribute; calculating a training loss by using a real attribute corresponding to the pedestrian image and the prediction attribute; and under the condition that the training loss is converged, determining the current model parameters of the convolutional neural network as the model parameters of the pedestrian attribute analysis model to obtain the pedestrian attribute analysis model. Wherein the probability map characterizes a set of probability values that each pixel node belongs to a pedestrian component region in the pedestrian image demarcated into at least one pedestrian component region.
Optionally, referring to fig. 8, the training system may further include:
and the pedestrian analysis unit 2 is used for inputting the pedestrian image into a pedestrian analysis model to obtain a probability map corresponding to the pedestrian image. The pedestrian analysis model is a fully-convolutional neural network trained by adopting a training image with a real probability map label.
Optionally, the first training unit 1 may further include:
the first pedestrian analysis assisting module 11 is configured to update the probability map; and obtaining fusion characteristics according to the updated probability map and the pedestrian characteristics.
The first training unit 1 is further configured to extract pedestrian features from the pedestrian image by using a first sub-network; and inputting the fusion characteristics into a second sub-network to obtain the prediction attributes.
Optionally, the step of updating the probability map by the first pedestrian analysis assisting module 11 specifically includes:

\hat{M}_c^s(x_i) = 1, if M_c^s(x_i) = max_{c'' ∈ {1, …, C'}} M_{c''}^s(x_i); \hat{M}_c^s(x_i) = M_c^s(x_i), otherwise

wherein x_i is the i-th pedestrian image;

M_c^s(x_i) represents the probability value that the s-th pixel node in the i-th pedestrian image belongs to the c-th pedestrian component region;

\hat{M}_c^s(x_i) represents the updated probability value, which takes the value 1 when, for the s-th pixel node, the probability of belonging to class c is the maximum among the C' pedestrian component regions.
Optionally, the step of obtaining the fusion feature according to the updated probability map and the pedestrian feature by the first pedestrian analysis assisting module 11 specifically includes:

performing convolution fusion on the updated probability map and the pedestrian feature to obtain a first feature:

φ(x_i)_c = \tilde{M}_c(x_i) ⊙ f_b(x_i)

φ(x_i) = [φ(x_i)_1, φ(x_i)_2, …, φ(x_i)_{C'}]

wherein φ(x_i)_c represents the first feature of the c-th channel of the i-th pedestrian image;

\hat{M}_c(x_i) represents the updated set of probability values of the c-th channel;

\tilde{M}_c(x_i) represents the updated probability value set of the c-th channel after replication, so that its number of channels matches that of the pedestrian feature f_b(x_i);

⊙ represents pixel-wise multiplication at corresponding positions;

f_b(x_i) represents the pedestrian feature of the i-th pedestrian image;

φ(x_i) represents the first feature of the i-th pedestrian image;

and obtaining the fusion feature by using the first feature and the pedestrian feature.
Optionally, the first training unit 1 further comprises:
a weight self-updating unit 12, configured to adjust the task weight corresponding to each task according to the following formula:

λ_j = 1/(2σ_j²) = m·K_j / (2 Σ_{i=1}^{m} ||y_i^j − P_i^j||²)

wherein λ_j represents the task weight of the j-th task;

σ_j² represents the variance of the prediction uncertainty in the j-th task;

m represents the total number of pedestrian images used for training in the j-th task;

K_j represents the number of options for the value of the j-th attribute;

y_i^j represents the real attribute of the i-th pedestrian image in the j-th task;

P_i^j represents the predicted attribute of the i-th pedestrian image in the j-th task.
The first training unit 1 is further configured to calculate a total training loss using the task weights:
J(θ)=∑jλjJ(θj)
wherein J (θ) is the total loss of training;
J(θj) A training loss for the jth task;
λ_j represents the task weight of the j-th task.
Optionally, the training system further comprises:
the preprocessing unit 3 is used for detecting whether the original picture contains a pedestrian or not; if the pedestrian is contained, acquiring the position of the pedestrian in the original picture; and cutting out a pedestrian image from the original image according to the pedestrian position.
In a fifth embodiment of the present application, referring to fig. 9, corresponding to the training method in the second embodiment, there is provided a training system for a pedestrian attribute analysis model, comprising:
the second training unit 4 is used for inputting the pedestrian image into the convolutional neural network to obtain a prediction attribute; calculating a training loss by using a real attribute corresponding to the pedestrian image and the prediction attribute; determining the current model parameters of the convolutional neural network as the model parameters of a pedestrian attribute analysis model under the condition of the convergence of the training loss to obtain the pedestrian attribute analysis model;
the second training unit 4 comprises:
a second weight self-updating unit 42, configured to adjust the task weight corresponding to each task according to the following formula:

λ_j = 1/(2σ_j²) = m·K_j / (2 Σ_{i=1}^{m} ||y_i^j − P_i^j||²)

wherein λ_j represents the task weight of the j-th task;

σ_j² represents the variance of the prediction uncertainty in the j-th task;

m represents the total number of pedestrian images used for training in the j-th task;

K_j represents the number of options for the value of the j-th attribute;

y_i^j represents the real attribute of the i-th pedestrian image in the j-th task;

P_i^j represents the predicted attribute of the i-th pedestrian image in the j-th task.
The second training unit 4 is further configured to calculate a total training loss using the task weights:
J(θ)=∑jλjJ(θj)
wherein J (θ) is the total loss of training;
J(θj) A training loss for the jth task;
λ_j represents the task weight of the j-th task.
Optionally, the second training unit 4 is further configured to input the pedestrian image and the probability map corresponding to the pedestrian image into a convolutional neural network, so as to obtain a prediction attribute; wherein the probability map characterizes a set of probability values that each pixel node belongs to a pedestrian component region in the pedestrian image demarcated into at least one pedestrian component region.
Optionally, referring to fig. 10, the training system may further include:
and the pedestrian analysis unit 2 is used for inputting the pedestrian image into a pedestrian analysis model to obtain a probability map corresponding to the pedestrian image. The pedestrian analysis model is a fully-convolutional neural network trained by adopting a training image with a real probability map label.
Optionally, the second training unit 4 may further include:
a second pedestrian analysis assisting module 41 for updating the probability map; and obtaining fusion characteristics according to the updated probability map and the pedestrian characteristics.
The second training unit 4 is further configured to extract pedestrian features from the pedestrian image by using a first sub-network; and inputting the fusion characteristics into a second sub-network to obtain the prediction attributes.
Optionally, the step of updating the probability map by the second pedestrian analysis assisting module 41 specifically includes:

\hat{M}_c^s(x_i) = 1, if M_c^s(x_i) = max_{c'' ∈ {1, …, C'}} M_{c''}^s(x_i); \hat{M}_c^s(x_i) = M_c^s(x_i), otherwise

wherein x_i is the i-th pedestrian image;

M_c^s(x_i) represents the probability value that the s-th pixel node in the i-th pedestrian image belongs to the c-th pedestrian component region;

\hat{M}_c^s(x_i) represents the updated probability value, which takes the value 1 when, for the s-th pixel node, the probability of belonging to class c is the maximum among the C' pedestrian component regions.
Optionally, the step of obtaining the fusion feature according to the updated probability map and the pedestrian feature by the second pedestrian analysis assisting module 41 specifically includes:

performing convolution fusion on the updated probability map and the pedestrian feature to obtain a first feature:

φ(x_i)_c = \tilde{M}_c(x_i) ⊙ f_b(x_i)

φ(x_i) = [φ(x_i)_1, φ(x_i)_2, …, φ(x_i)_{C'}]

wherein φ(x_i)_c represents the first feature of the c-th channel of the i-th pedestrian image;

\hat{M}_c(x_i) represents the updated set of probability values of the c-th channel;

\tilde{M}_c(x_i) represents the updated probability value set of the c-th channel after replication, so that its number of channels matches that of the pedestrian feature f_b(x_i);

⊙ represents pixel-wise multiplication at corresponding positions;

f_b(x_i) represents the pedestrian feature of the i-th pedestrian image;

φ(x_i) represents the first feature of the i-th pedestrian image;

and obtaining the fusion feature by using the first feature and the pedestrian feature.
Optionally, the training system further comprises:
the preprocessing unit 3 is used for detecting whether the original picture contains a pedestrian or not; if the pedestrian is contained, acquiring the position of the pedestrian in the original picture; and cutting out a pedestrian image from the original image according to the pedestrian position.
In a sixth embodiment of the present application, there is also provided a pedestrian attribute identification device corresponding to the identification method in the third embodiment, including:
and the prediction unit is used for inputting the image of the pedestrian to be recognized into the neural network model trained by the training system in the fourth or fifth embodiment and outputting the recognized pedestrian attribute.
The same and similar parts in the various embodiments in this specification may be referred to each other. The above-described embodiments of the present invention should not be construed as limiting the scope of the present invention.

Claims (10)

1. A training method of a pedestrian attribute analysis model is characterized by comprising the following steps:
inputting a pedestrian image and a probability map corresponding to the pedestrian image into a convolutional neural network to obtain a prediction attribute; the probability map characterizes a set of probability values that each pixel node belongs to a pedestrian feature region in the pedestrian image divided into at least one pedestrian feature region;
calculating a training loss by using a real attribute corresponding to the pedestrian image and the prediction attribute;
and if the training loss is converged, determining the current model parameters of the convolutional neural network as the model parameters of the pedestrian attribute analysis model to obtain the pedestrian attribute analysis model.
2. Training method according to claim 1, characterized in that the calculation of the probability map comprises the following steps:
inputting a pedestrian image into a pedestrian analysis model to obtain a probability map corresponding to the pedestrian image; the pedestrian analysis model is a fully-convolutional neural network trained by adopting a training image with a real probability map label.
3. The training method of claim 1, wherein the convolutional neural network comprises a first sub-network and a second sub-network;
the step of inputting the pedestrian image and the probability map corresponding to the pedestrian image into a convolutional neural network to obtain a prediction attribute comprises:
extracting pedestrian features from the pedestrian image by using a first sub-network;
updating the probability map;
obtaining fusion characteristics according to the updated probability map and the pedestrian characteristics;
and inputting the fusion characteristics into a second sub-network to obtain the prediction attributes.
4. The training method according to claim 3, wherein the step of updating the probability map specifically comprises:
wherein x_i is the i-th pedestrian image;
the probability value that the s-th pixel node in the i-th pedestrian image belongs to the c-th pedestrian component region;
the updated probability value;
for the s-th pixel node, the class c among the C' pedestrian feature regions at which the probability value attains its maximum;
and/or,
the step of obtaining fusion features according to the updated probability map and the pedestrian features specifically includes:
carrying out convolution fusion on the updated probability map and the pedestrian feature to obtain a first feature:
φ(x_i) = [φ(x_i)_1, φ(x_i)_2, …, φ(x_i)_{C'}]
wherein φ(x_i)_c represents the first feature of the c-th channel of the i-th pedestrian image;
the set of updated probability values of the c-th channel;
the set of updated probability values of the c-th channel after replication, so that its number of channels matches that of the pedestrian feature f_b(x_i);
pixel-wise multiplication at corresponding positions;
f_b(x_i) represents the pedestrian feature of the i-th pedestrian image;
φ(x_i) represents the first feature of the i-th pedestrian image;
and obtaining a fusion feature by using the first feature and the pedestrian feature.
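The update and fusion steps of claim 4 can be sketched in NumPy. The claimed update formula itself did not survive extraction; the hard-argmax reading below (keep, per pixel, only the probability of the region that attains the maximum over the C' channels) and all function names are assumptions, and element-wise gating plus concatenation stands in for the claimed convolution fusion.

```python
import numpy as np

def update_probability_map(M):
    """Keep, for each pixel, only the probability of the region attaining
    the maximum over the C' channels (assumed reading of claim 4)."""
    best = np.argmax(M, axis=0)                          # (H, W) winning region per pixel
    mask = np.arange(M.shape[0])[:, None, None] == best  # one-hot over channels
    return np.where(mask, M, 0.0)

def fuse(features, M):
    """Gate the pedestrian feature f_b(x_i) with each updated channel and
    stack the results: phi(x_i) = [phi(x_i)_1, ..., phi(x_i)_C']."""
    M_hat = update_probability_map(M)                    # (C', H, W)
    # replicate each channel of M_hat to match the feature channel count,
    # then multiply pixel-wise at corresponding positions
    first = [features * np.broadcast_to(M_hat[c], features.shape)
             for c in range(M_hat.shape[0])]
    phi = np.stack(first)                                # (C', D, H, W) first features
    # fusion feature: combine the first feature with the pedestrian feature
    return np.concatenate([phi.reshape(-1, *features.shape[1:]), features], axis=0)
```

With C' regions and D feature channels, the fusion feature has (C' + 1) · D channels in this sketch; the patent leaves the exact combination of first feature and pedestrian feature to the second sub-network.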
5. The training method according to any one of claims 1 to 4, wherein the step of calculating a training loss using the real attributes corresponding to the pedestrian images and the predicted attributes specifically comprises:
J(θ) = Σ_j λ_j J(θ_j)
wherein J(θ) is the total training loss;
J(θ_j) is the training loss of the j-th task;
λ_j denotes the task weight of the j-th task;
the variance representing the prediction uncertainty of the j-th task;
m denotes the total number of pedestrian images used for training in the j-th task;
K_j denotes the number of options for the value of the j-th attribute;
the real attribute of the i-th pedestrian image in the j-th task;
P_i^j denotes the predicted attribute of the i-th pedestrian image in the j-th task.
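Claim 5's weighted multi-task objective, J(θ) = Σ_j λ_j J(θ_j), can be sketched as follows. The per-task cross-entropy over m images and K_j attribute options is an assumed form, since the per-task loss formula image did not survive extraction; the function names are likewise illustrative.

```python
import math

def task_loss(y_true, y_pred):
    """Assumed per-task cross-entropy J(theta_j): y_true holds the real
    attribute index for each of the m images, y_pred a probability
    distribution over the K_j attribute options."""
    m = len(y_true)
    return -sum(math.log(p[y]) for y, p in zip(y_true, y_pred)) / m

def total_loss(task_losses, task_weights):
    """J(theta) = sum_j lambda_j * J(theta_j)."""
    return sum(w * l for w, l in zip(task_weights, task_losses))
```

With two tasks weighted 0.5 each, for instance, the total loss reduces to the average of the two per-task losses.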
6. A training method of a pedestrian attribute analysis model is characterized by comprising the following steps:
inputting the pedestrian image into a convolutional neural network to obtain a prediction attribute;
calculating a training loss using the real attributes corresponding to the pedestrian images and the predicted attributes:
J(θ) = Σ_j λ_j J(θ_j)
wherein J(θ) is the total training loss;
J(θ_j) is the training loss of the j-th task;
λ_j denotes the task weight of the j-th task;
the variance representing the prediction uncertainty of the j-th task;
m denotes the total number of pedestrian images used for training in the j-th task;
K_j denotes the number of options for the value of the j-th attribute;
the real attribute of the i-th pedestrian image in the j-th task;
P_i^j denotes the predicted attribute of the i-th pedestrian image in the j-th task;
and if the training loss is converged, determining the current model parameters of the convolutional neural network as the model parameters of the pedestrian attribute analysis model to obtain the pedestrian attribute analysis model.
7. A pedestrian attribute identification method is characterized by comprising the following steps:
inputting the image of the pedestrian to be recognized into the pedestrian attribute analysis model trained by the training method of any one of claims 1 to 5 or claim 6, and obtaining the recognized pedestrian attribute.
8. A pedestrian attribute analysis model training system, comprising:
the first training unit is used for inputting the pedestrian image and the probability map corresponding to the pedestrian image into a convolutional neural network to obtain a prediction attribute; calculating a training loss by using a real attribute corresponding to the pedestrian image and the prediction attribute; determining the current model parameters of the convolutional neural network as the model parameters of a pedestrian attribute analysis model under the condition of the convergence of the training loss to obtain the pedestrian attribute analysis model;
wherein the probability map characterizes a set of probability values that each pixel node belongs to a pedestrian component region in the pedestrian image demarcated into at least one pedestrian component region.
9. A pedestrian attribute analysis model training system, comprising:
the second training unit is used for inputting the pedestrian image into the convolutional neural network to obtain a prediction attribute; calculating a training loss by using a real attribute corresponding to the pedestrian image and the prediction attribute; under the condition that the training loss is converged, determining the current model parameters of the convolutional neural network as the model parameters of the pedestrian attribute analysis model to obtain the pedestrian attribute analysis model;
the second training unit comprises:
the second weight self-updating unit is used for adjusting the task weight corresponding to the task according to the following formula:
wherein λ_j denotes the task weight of the j-th task;
the variance representing the prediction uncertainty of the j-th task;
m denotes the total number of pedestrian images used for training in the j-th task;
K_j denotes the number of options for the value of the j-th attribute;
the real attribute of the i-th pedestrian image in the j-th task;
P_i^j denotes the predicted attribute of the i-th pedestrian image in the j-th task;
the second training unit is further configured to calculate the total training loss using the task weights:
J(θ) = Σ_j λ_j J(θ_j)
wherein J(θ) is the total training loss;
J(θ_j) is the training loss of the j-th task;
λ_j denotes the task weight of the j-th task.
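Claim 9's weight self-updating formula did not survive extraction. One common uncertainty-based choice — λ_j proportional to 1/(2σ_j²), where σ_j² is the prediction-uncertainty variance the claim names — is sketched below purely as an assumption about what the missing formula may express, not as the claimed formula itself.

```python
def uncertainty_weight(sigma_sq):
    """Assumed uncertainty-based task weight: lambda_j = 1 / (2 * sigma_j^2).
    A task whose predictions are more uncertain (larger variance) receives
    a smaller weight in the total loss."""
    return 1.0 / (2.0 * sigma_sq)

def weighted_total_loss(task_losses, sigmas_sq):
    # J(theta) = sum_j lambda_j * J(theta_j), with the weights derived
    # from the per-task prediction-uncertainty variances
    return sum(uncertainty_weight(s) * l for l, s in zip(task_losses, sigmas_sq))
```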
10. A pedestrian property identification device characterized by comprising:
a prediction unit, configured to input an image of a pedestrian to be recognized into the neural network model trained by the training system according to claim 8 or 9, and output a recognized pedestrian attribute.
CN201810488759.7A 2018-05-21 2018-05-21 Training method of pedestrian attribute analysis model, pedestrian attribute identification method and device Active CN110516512B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810488759.7A CN110516512B (en) 2018-05-21 2018-05-21 Training method of pedestrian attribute analysis model, pedestrian attribute identification method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810488759.7A CN110516512B (en) 2018-05-21 2018-05-21 Training method of pedestrian attribute analysis model, pedestrian attribute identification method and device

Publications (2)

Publication Number Publication Date
CN110516512A true CN110516512A (en) 2019-11-29
CN110516512B CN110516512B (en) 2023-08-25

Family

ID=68621912

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810488759.7A Active CN110516512B (en) 2018-05-21 2018-05-21 Training method of pedestrian attribute analysis model, pedestrian attribute identification method and device

Country Status (1)

Country Link
CN (1) CN110516512B (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111178252A (en) * 2019-12-27 2020-05-19 东北大学 Multi-feature fusion identity recognition method
CN111291632A (en) * 2020-01-17 2020-06-16 厦门中控智慧信息技术有限公司 Pedestrian state detection method, device and equipment
CN113159144A (en) * 2021-04-06 2021-07-23 新疆爱华盈通信息技术有限公司 Pedestrian attribute classification method and device, electronic equipment and storage medium
CN113469932A (en) * 2020-03-31 2021-10-01 日本电气株式会社 Information processing method, electronic device, and medium
CN114239754A (en) * 2022-02-24 2022-03-25 中国科学院自动化研究所 Pedestrian attribute identification method and system based on attribute feature learning decoupling

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106529402A (en) * 2016-09-27 2017-03-22 中国科学院自动化研究所 Multi-task learning convolutional neural network-based face attribute analysis method
CN106778682A (en) * 2017-01-11 2017-05-31 厦门中控生物识别信息技术有限公司 A kind of training method and its equipment of convolutional neural networks model
CN106815566A (en) * 2016-12-29 2017-06-09 天津中科智能识别产业技术研究院有限公司 A kind of face retrieval method based on multitask convolutional neural networks
CN107330396A (en) * 2017-06-28 2017-11-07 华中科技大学 A kind of pedestrian's recognition methods again based on many attributes and many strategy fusion study
WO2017190574A1 (en) * 2016-05-04 2017-11-09 北京大学深圳研究生院 Fast pedestrian detection method based on aggregation channel features
CN107563279A (en) * 2017-07-22 2018-01-09 复旦大学 The model training method adjusted for the adaptive weighting of human body attributive classification
CN107862300A (en) * 2017-11-29 2018-03-30 东华大学 A kind of descending humanized recognition methods of monitoring scene based on convolutional neural networks



Also Published As

Publication number Publication date
CN110516512B (en) 2023-08-25

Similar Documents

Publication Publication Date Title
CN110516512A (en) Training method, pedestrian's attribute recognition approach and the device of pedestrian's attributive analysis model
CN110111340B (en) Weak supervision example segmentation method based on multi-path segmentation
CN108596277B (en) Vehicle identity recognition method and device and storage medium
CN109711281B (en) Pedestrian re-recognition and feature recognition fusion method based on deep learning
CN111401344B (en) Face recognition method and device and training method and device of face recognition system
CN111523621B (en) Image recognition method and device, computer equipment and storage medium
CN109801256B (en) Image aesthetic quality assessment method based on region of interest and global features
CN108520226B (en) Pedestrian re-identification method based on body decomposition and significance detection
CN110163236B (en) Model training method and device, storage medium and electronic device
CN106709449B (en) Pedestrian re-identification method and system based on deep learning and reinforcement learning
CN108288051B (en) Pedestrian re-recognition model training method and device, electronic equipment and storage medium
CN107203775B (en) Image classification method, device and equipment
CN107256017B (en) Route planning method and system
CN110033007B (en) Pedestrian clothing attribute identification method based on depth attitude estimation and multi-feature fusion
CN110555481A (en) Portrait style identification method and device and computer readable storage medium
CN106844614A (en) A kind of floor plan functional area system for rapidly identifying
CN110516537B (en) Face age estimation method based on self-learning
CN108090472A (en) Pedestrian based on multichannel uniformity feature recognition methods and its system again
CN112115993B (en) Zero sample and small sample evidence photo anomaly detection method based on meta-learning
CN114582030B (en) Behavior recognition method based on service robot
CN111950372A (en) Unsupervised pedestrian re-identification method based on graph convolution network
CN115439887A (en) Pedestrian re-identification method and system based on pseudo label optimization and storage medium
CN111967930A (en) Clothing style recognition recommendation method based on multi-network fusion
CN111178403B (en) Method, device, electronic equipment and storage medium for training attribute identification model
CN113297936A (en) Volleyball group behavior identification method based on local graph convolution network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant