CN109101915B - Face, pedestrian and attribute recognition network structure design method based on deep learning - Google Patents

Face, pedestrian and attribute recognition network structure design method based on deep learning Download PDF

Info

Publication number
CN109101915B
CN109101915B CN201810864964.9A CN201810864964A
Authority
CN
China
Prior art keywords
pedestrian
face
key point
ith
frame
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810864964.9A
Other languages
Chinese (zh)
Other versions
CN109101915A (en)
Inventor
章东平
陈思瑶
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Jiliang University
Original Assignee
China Jiliang University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Jiliang University filed Critical China Jiliang University
Priority to CN201810864964.9A priority Critical patent/CN109101915B/en
Publication of CN109101915A publication Critical patent/CN109101915A/en
Application granted granted Critical
Publication of CN109101915B publication Critical patent/CN109101915B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161Detection; Localisation; Normalisation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/20Analysis of motion
    • G06T7/246Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/103Static body considered as a whole, e.g. static pedestrian or occupant recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168Feature extraction; Face representation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/172Classification, e.g. identification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10016Video; Image sequence
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/30Subject of image; Context of image processing
    • G06T2207/30196Human being; Person
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/30Subject of image; Context of image processing
    • G06T2207/30196Human being; Person
    • G06T2207/30201Face
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/30Subject of image; Context of image processing
    • G06T2207/30241Trajectory

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses a method for designing a face, pedestrian and attribute recognition network structure based on deep learning. Pedestrian key point trajectory features are computed and fused with the output of the fully connected layer of the pedestrian feature extraction sub-network to obtain fusion feature S1. Key point detection is performed on multiple face images of the same person to obtain face key points, from which face key point trajectory features are computed; these are fused with the fully connected layer of the face multi-task recognition sub-network to obtain fusion feature S2, and fusion feature S2 is used to recognize the face and its attributes. Fusion features S1 and S2 are then fused to obtain fusion feature S3, which is used to recognize the pedestrian and its attributes.

Description

Face, pedestrian and attribute recognition network structure design method based on deep learning
Technical Field
The invention relates to the field of deep learning for face and attribute recognition and for pedestrian and attribute recognition, and in particular to the construction of a network structure.
Background
At present, face recognition has made great progress in academic research, but low reliability remains a problem when it is applied in real life. Most current face recognition systems can only be used in restricted environments, for example: 1. the subject must actively cooperate; 2. the face image must have a relatively high resolution; 3. lighting conditions must be good. In natural scenes, interference factors such as pose, illumination and expression are common, and this interference must be overcome for face recognition technology to be developed and popularized.
Pedestrian re-identification uses computer vision to judge whether a specific pedestrian is present in an image or video sequence. Faced with massive amounts of surveillance video, there is a growing need to re-identify pedestrians in surveillance videos with a computer. Pedestrian re-identification has developed rapidly in recent years through the continuous efforts of researchers, but a large gap remains between its performance and the demands of practical applications. First, in typical surveillance video the resolution of pedestrians in the image is low and face information is blurred, which is very unfavorable for image analysis, feature extraction, segmentation and so on. Second, pedestrians may be occluded by other pedestrians or objects, which greatly affects their appearance representation. Finally, differences in monitoring environments, camera parameters and illumination cause large changes in the appearance of the same person, making matching difficult. How to overcome the impact of these factors on the pedestrian matching task and find an effective solution is an important research direction for pedestrian re-identification.
Disclosure of Invention
The invention overcomes the defects of the prior art and provides a face, pedestrian and attribute recognition network structure based on deep learning. Its purpose is to recognize faces and their attributes and pedestrians and their attributes with a multi-task network based on a convolutional neural network, and to add pedestrian key point trajectory features and face key point trajectory features to improve the accuracy of face, pedestrian and attribute recognition.
In order to achieve the above purpose, the invention adopts the following technical solution:
a method for designing a network structure for recognizing human faces, pedestrians and attributes thereof based on deep learning comprises the following steps:
Step (1): input n consecutive frames of video images captured by a surveillance camera into the pedestrian detection and tracking module; when the i-th pedestrian appears in the video images, output the sequence of n consecutive pedestrian images of the i-th pedestrian, {P_i^1, P_i^2, …, P_i^n}. Pedestrian detection adopts the open-source Faster R-CNN algorithm, which comprises three basic components: the first is a region proposal network (RPN) that generates candidate regions for each surveillance video image, the second is a convolutional neural network that extracts pedestrian features from the candidate regions, and the third is a binary Softmax classifier that judges whether a candidate region contains a pedestrian; pedestrian tracking adopts the optical flow tracking function of OpenCV;
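A minimal sketch of step (1), assuming torchvision's off-the-shelf Faster R-CNN as the detector and OpenCV's pyramidal Lucas-Kanade optical flow for the tracking step; the patent does not name torchvision, and the helper names below are illustrative:

```python
# Hedged sketch: open-source Faster R-CNN (torchvision) for pedestrian detection
# and OpenCV Lucas-Kanade optical flow for frame-to-frame tracking.
import cv2
import numpy as np
import torch
import torchvision

detector = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()

def detect_pedestrians(frame_bgr, score_thr=0.8):
    """Return pedestrian boxes [x1, y1, x2, y2] for one surveillance frame."""
    rgb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)
    tensor = torch.from_numpy(rgb).permute(2, 0, 1).float() / 255.0
    with torch.no_grad():
        out = detector([tensor])[0]
    keep = (out["labels"] == 1) & (out["scores"] > score_thr)  # COCO class 1 = person
    return out["boxes"][keep].cpu().numpy()

def track_points(prev_gray, next_gray, prev_pts):
    """Track points between consecutive frames with pyramidal LK optical flow."""
    pts = np.asarray(prev_pts, dtype=np.float32).reshape(-1, 1, 2)
    next_pts, status, _err = cv2.calcOpticalFlowPyrLK(prev_gray, next_gray, pts, None)
    return next_pts.reshape(-1, 2), status.ravel().astype(bool)
```

Running the detector on each of the n frames and associating the boxes through the tracked points yields the per-pedestrian image sequence {P_i^1, …, P_i^n} used by the later steps.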
Step (2): input the sequence of n consecutive pedestrian images of the i-th pedestrian obtained in step (1), {P_i^1, P_i^2, …, P_i^n}, into a pedestrian feature extraction sub-network based on a convolutional neural network; the network contains two kinds of layers, convolutional layers and max-pooling layers, where two convolutional layers followed by a max-pooling layer form a substructure, and the pedestrian feature extraction sub-network comprises N such substructures connected in series;
Step (3): perform pedestrian key point detection on the sequence of n consecutive pedestrian images of the i-th person obtained in step (1), {P_i^1, P_i^2, …, P_i^n}, to obtain the corresponding m pedestrian key points; compute the pedestrian key point trajectories from the position changes of the m pedestrian key points using formula (1), normalize each of the m resulting pedestrian key point trajectory vectors using formula (2), and concatenate the m normalized vectors as the pedestrian key point trajectory feature; fuse the pedestrian key point trajectory feature with the output of the fully connected layer Pc connected to the pedestrian feature extraction sub-network to obtain fusion feature S1. The feature fusion adopts a concat layer of the deep learning framework Caffe, with the pedestrian key point trajectory feature and the fully connected layer Pc connected to the pedestrian feature extraction sub-network as the inputs of the concat layer, where the dimension of the pedestrian key point trajectory feature is m×(n-1)×2, the dimension of the fully connected layer Pc connected to the pedestrian feature extraction sub-network is D, and the final output of the concat layer is the fusion feature S1;
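A minimal numpy sketch of formulas (1) and (2) and the concat-style fusion of step (3); `keypoints` and `fc_output` are illustrative names standing for the detected key point coordinates and the output of the fully connected layer Pc:

```python
# Hedged sketch of the trajectory feature (formulas (1)-(2)) and the concat fusion.
import numpy as np

def keypoint_trajectory_feature(keypoints):
    """keypoints: (n, m, 2) array -> trajectory feature of dimension m*(n-1)*2."""
    traj = keypoints[1:] - keypoints[:-1]          # formula (1): per-frame displacements
    feats = []
    for k in range(traj.shape[1]):                 # one (n-1) x 2 trajectory per key point
        t_k = traj[:, k, :]
        length = np.linalg.norm(t_k, axis=1).sum() # total trajectory length of key point k
        feats.append((t_k / max(length, 1e-8)).ravel())   # formula (2): normalization
    return np.concatenate(feats)

def fuse(trajectory_feature, fc_output):
    """Concat-layer style fusion producing S1 of dimension m*(n-1)*2 + D."""
    return np.concatenate([trajectory_feature, fc_output])

# Check against the embodiment's values: n = 15 frames, m = 18 key points, D = 512.
kp = np.random.rand(15, 18, 2)
s1 = fuse(keypoint_trajectory_feature(kp), np.random.rand(512))
assert s1.shape == (18 * 14 * 2 + 512,)            # 504 + 512 = 1016
```

The same two functions cover step (7) by passing the s face key points (s = 5 and D = 512 in the embodiment) to obtain S2.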
Step (4): input the sequence of n consecutive pedestrian images of the i-th pedestrian obtained in step (1), {P_i^1, P_i^2, …, P_i^n}, into the face detection module for face detection to obtain the sequence of n consecutive face images of the i-th person, {F_i^1, F_i^2, …, F_i^n}. The face detection module adopts the face detection module of the open-source face recognition engine SeetaFace, which uses a funnel-structured cascade (FuSt): the top of the FuSt cascade consists of several fast LAB cascade classifiers for different poses, followed by several multilayer perceptron (MLP) cascades based on SURF features, and finally a unified MLP cascade processes the candidate windows of all poses, retaining the correct face windows to obtain the face images;
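A minimal sketch of the data flow in step (4); OpenCV's Haar cascade is used only as a stand-in face detector, since SeetaFace's FuSt cascade is not reproduced here:

```python
# Hedged sketch: crop one face from a pedestrian image with a stand-in detector.
import cv2

face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def detect_face(pedestrian_bgr):
    """Return the first detected face crop from a pedestrian image, or None."""
    gray = cv2.cvtColor(pedestrian_bgr, cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None
    x, y, w, h = faces[0]
    return pedestrian_bgr[y:y + h, x:x + w]
```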
Step (5): judge the resolution of the sequence of n consecutive face images of the i-th pedestrian obtained in step (4), {F_i^1, F_i^2, …, F_i^n}; face images with resolution greater than A×B are not subjected to super-resolution processing, while face images with resolution smaller than A×B are subjected to super-resolution processing, finally yielding the sequence of n consecutive higher-resolution face images of the i-th pedestrian;
Step (6): input the face image sequence obtained in step (5) into a face feature extraction sub-network based on a convolutional neural network; the network consists of M convolutional layers;
Step (7): perform face key point detection on the face image sequence obtained in step (5) to obtain the corresponding s face key points; compute the face key point trajectories from the position changes of the s face key points using formula (1), normalize each of the s resulting face key point trajectory vectors using formula (2), and concatenate the s normalized vectors as the face key point trajectory feature; fuse the face key point trajectory feature with the output of the fully connected layer Fc connected to the face feature extraction sub-network to obtain fusion feature S2, where the dimension of the face key point trajectory feature is s×(n-1)×2 and the dimension of the fully connected layer Fc connected to the face feature extraction sub-network is D;
Step (8): use the fusion feature S2 obtained in step (7) as the input of a face identity feature layer, a face attribute 1 feature layer, a face attribute 2 feature layer, …, and a face attribute v feature layer; the face identity feature layer serves as the input of the identity classification layer, the face attribute 1 feature layer as the input of the face attribute 1 classification layer, the face attribute 2 feature layer as the input of the face attribute 2 classification layer, …, and the face attribute v feature layer as the input of the face attribute v classification layer;
Step (9): fuse the fusion feature S1 obtained in step (3) with the fusion feature S2 obtained in step (7) to obtain fusion feature S3, where the dimension of fusion feature S1 is m×(n-1)×2+D and the dimension of fusion feature S2 is s×(n-1)×2+D;
Step (10): use the fusion feature S3 obtained in step (9) as the input of a pedestrian identity feature layer, a pedestrian attribute 1 feature layer, a pedestrian attribute 2 feature layer, …, and a pedestrian attribute v feature layer; the pedestrian identity feature layer serves as the input of the pedestrian identity classification layer, the pedestrian attribute 1 feature layer as the input of the pedestrian attribute 1 classification layer, the pedestrian attribute 2 feature layer as the input of the pedestrian attribute 2 classification layer, …, and the pedestrian attribute v feature layer as the input of the pedestrian attribute v classification layer;
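A hedged PyTorch sketch of the multi-task heads in steps (8)-(10): one feature layer plus one classification layer per task, branching from S2 for the face tasks and from S3 for the pedestrian tasks. The hidden size and class counts are illustrative placeholders; the input dimensions 652 and 1016 + 652 follow the embodiment's values.

```python
# Hedged sketch of per-task feature layers and classification layers.
import torch
import torch.nn as nn

class MultiTaskHead(nn.Module):
    def __init__(self, in_dim, feat_dim, num_classes_per_task):
        super().__init__()
        # One feature layer and one classifier per task (identity, attribute 1, ...).
        self.feature_layers = nn.ModuleList(
            [nn.Linear(in_dim, feat_dim) for _ in num_classes_per_task])
        self.classifiers = nn.ModuleList(
            [nn.Linear(feat_dim, c) for c in num_classes_per_task])

    def forward(self, fused):
        return [cls(torch.relu(feat(fused)))
                for feat, cls in zip(self.feature_layers, self.classifiers)]

# Face branch on S2 (652-dim): identity, gender, expression, age (placeholder class counts).
face_head = MultiTaskHead(652, 256, [1000, 2, 7, 8])
# Pedestrian branch on S3 (1016 + 652 = 1668-dim): identity, gender, hair style, clothes type.
pedestrian_head = MultiTaskHead(1016 + 652, 256, [1000, 2, 5, 10])
face_logits = face_head(torch.randn(4, 652))        # list of per-task logits
```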
The key point trajectory calculation formula (1) is:

T_{i,j}^{k,t} = P_{i,j+1}^{k,t} - P_{i,j}^{k,t} = ( x_{i,j+1}^{k,t} - x_{i,j}^{k,t} , y_{i,j+1}^{k,t} - y_{i,j}^{k,t} )        (1)

wherein, when t = 0, formula (1) computes the pedestrian key point trajectories of the i-th pedestrian: k denotes the k-th pedestrian key point of the i-th pedestrian, k ∈ [1, m]; j denotes the j-th frame pedestrian image of the i-th pedestrian, j ∈ [1, n-1]; T_{i,j}^{k,0} denotes the trajectory of the k-th pedestrian key point from the j-th frame to the (j+1)-th frame pedestrian image of the i-th pedestrian; P_{i,j+1}^{k,0} = (x_{i,j+1}^{k,0}, y_{i,j+1}^{k,0}) denotes the coordinates of the k-th pedestrian key point in the (j+1)-th frame pedestrian image of the i-th pedestrian, with x_{i,j+1}^{k,0} and y_{i,j+1}^{k,0} its x-axis and y-axis coordinates; and P_{i,j}^{k,0} = (x_{i,j}^{k,0}, y_{i,j}^{k,0}) denotes the coordinates of the k-th pedestrian key point in the j-th frame pedestrian image of the i-th pedestrian, with x_{i,j}^{k,0} and y_{i,j}^{k,0} its x-axis and y-axis coordinates;

when t = 1, formula (1) computes the face key point trajectories of the i-th pedestrian: k denotes the k-th face key point of the i-th pedestrian, k ∈ [1, s]; j denotes the j-th frame face image of the i-th pedestrian, j ∈ [1, n-1]; T_{i,j}^{k,1} denotes the trajectory of the k-th face key point from the j-th frame to the (j+1)-th frame face image of the i-th pedestrian; P_{i,j+1}^{k,1} = (x_{i,j+1}^{k,1}, y_{i,j+1}^{k,1}) denotes the coordinates of the k-th face key point in the (j+1)-th frame face image of the i-th pedestrian; and P_{i,j}^{k,1} = (x_{i,j}^{k,1}, y_{i,j}^{k,1}) denotes the coordinates of the k-th face key point in the j-th frame face image of the i-th pedestrian, with x and y the corresponding axis coordinates.
The key point trajectory normalization formula (2) is:

T̄_i^{k,t} = T_i^{k,t} / Σ_{j=1}^{n-1} ‖T_{i,j}^{k,t}‖        (2)

wherein, when t = 0, formula (2) normalizes the pedestrian key point trajectory vector of the i-th pedestrian: k denotes the k-th pedestrian key point of the i-th pedestrian, k ∈ [1, m]; j denotes the j-th frame pedestrian image of the i-th pedestrian, j ∈ [1, n-1]; T̄_i^{k,0} denotes the k-th pedestrian key point trajectory feature of the n consecutive pedestrian images of the i-th pedestrian, a vector of (n-1)×2 dimensions; T_i^{k,0} = (T_{i,1}^{k,0}, T_{i,2}^{k,0}, …, T_{i,n-1}^{k,0}) denotes the k-th pedestrian key point trajectory over the n consecutive pedestrian images of the i-th pedestrian; and ‖T_{i,j}^{k,0}‖ denotes the length of the k-th pedestrian key point trajectory from the j-th frame to the (j+1)-th frame pedestrian image of the i-th pedestrian;

when t = 1, formula (2) normalizes the face key point trajectory vector of the i-th pedestrian: k denotes the k-th face key point of the i-th pedestrian, k ∈ [1, s]; j denotes the j-th frame face image of the i-th pedestrian, j ∈ [1, n-1]; T̄_i^{k,1} denotes the k-th face key point trajectory feature of the n consecutive face images of the i-th pedestrian, a vector of (n-1)×2 dimensions; T_i^{k,1} denotes the k-th face key point trajectory over the n consecutive face images of the i-th pedestrian; and ‖T_{i,j}^{k,1}‖ denotes the length of the k-th face key point trajectory from the j-th frame to the (j+1)-th frame face image of the i-th pedestrian.
Compared with the prior art, the invention has the following beneficial effects:
The invention provides a method for designing a face, pedestrian and attribute recognition network structure based on deep learning. Surveillance video images are input into a pedestrian detection and tracking module for pedestrian detection and tracking to obtain multiple pedestrian images of the same person. Pedestrian key point detection is performed on the obtained pedestrian images of the same person, pedestrian key point trajectory features are obtained by calculation, and the obtained pedestrian key point trajectory features are fused with the fully connected layer of the pedestrian feature extraction sub-network to obtain fusion feature S1. The obtained pedestrian images of the same person are input into the face detection module for face detection to obtain multiple face images of the same person. The resolutions of the face images of the same person are judged; higher-resolution face images are input directly into the face multi-task recognition sub-network, while lower-resolution face images are input after super-resolution processing. Key point detection is performed on the face images of the same person to obtain face key points, from which face key point trajectory features are computed; these are fused with the fully connected layer of the face multi-task recognition sub-network to obtain fusion feature S2, and fusion feature S2 is used to recognize the face and its attributes. Fusion features S1 and S2 are fused to obtain fusion feature S3, and fusion feature S3 is used to recognize the pedestrian and its attributes. This network structure improves the accuracy of face and pedestrian recognition and attribute recognition.
Drawings
Fig. 1 is a schematic diagram of a network structure for recognizing human faces and pedestrians and their attributes based on deep learning.
Fig. 2 is a schematic diagram of a pedestrian feature extraction sub-network structure.
Fig. 3 is a schematic diagram of a face feature extraction sub-network structure.
Detailed Description
In this embodiment, as shown in Fig. 1, the deep-learning-based face, pedestrian and attribute recognition network structure is built through the following steps:
Step (1): input 15 consecutive frames of video images captured by a surveillance camera into the pedestrian detection and tracking module; when the i-th pedestrian appears in the video images, output the sequence of 15 consecutive pedestrian images of the i-th pedestrian, {P_i^1, P_i^2, …, P_i^15}. Pedestrian detection adopts the open-source Faster R-CNN algorithm, which comprises three basic components: the first is a region proposal network (RPN) that generates candidate regions for each surveillance video image, the second is a convolutional neural network that extracts pedestrian features from the candidate regions, and the third is a binary Softmax classifier that judges whether a candidate region contains a pedestrian; pedestrian tracking adopts the optical flow tracking function of OpenCV;
Step (2): input the sequence of 15 consecutive pedestrian images of the i-th pedestrian obtained in step (1), {P_i^1, P_i^2, …, P_i^15}, into a pedestrian feature extraction sub-network based on a convolutional neural network; the network contains two kinds of layers, convolutional layers and max-pooling layers, where two convolutional layers followed by a max-pooling layer form a substructure, and the pedestrian feature extraction sub-network comprises 10 such substructures connected in series;
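A hedged PyTorch sketch of this sub-network: two convolutional layers followed by a max-pooling layer form one substructure, ten substructures are stacked in series, and a 512-dimensional fully connected layer plays the role of Pc. Channel widths, kernel sizes and input resolution are not given in the patent, so the values below are illustrative:

```python
# Hedged sketch of the pedestrian feature extraction sub-network (10 substructures, Pc = 512-d).
import torch
import torch.nn as nn

def substructure(in_ch, out_ch):
    """Two convolutional layers followed by one max-pooling layer."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True),
        nn.MaxPool2d(kernel_size=2, stride=2, ceil_mode=True))

class PedestrianFeatureNet(nn.Module):
    def __init__(self, num_substructures=10, fc_dim=512):
        super().__init__()
        chans = [3] + [min(32 * 2 ** i, 256) for i in range(num_substructures)]
        self.backbone = nn.Sequential(
            *[substructure(chans[i], chans[i + 1]) for i in range(num_substructures)])
        self.pool = nn.AdaptiveAvgPool2d(1)      # keeps Pc independent of input resolution
        self.pc = nn.Linear(chans[-1], fc_dim)   # fully connected layer Pc

    def forward(self, x):
        x = self.pool(self.backbone(x)).flatten(1)
        return self.pc(x)                        # 512-dimensional output fused into S1

features = PedestrianFeatureNet()(torch.randn(1, 3, 256, 128))  # -> shape (1, 512)
```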
Step (3): perform pedestrian key point detection on the sequence of 15 consecutive pedestrian images of the i-th person obtained in step (1), {P_i^1, P_i^2, …, P_i^15}, to obtain the corresponding 18 pedestrian key points; compute the pedestrian key point trajectories from the position changes of the 18 pedestrian key points using the key point trajectory calculation formula, normalize each of the 18 resulting pedestrian key point trajectory vectors using the key point trajectory normalization formula, and concatenate the 18 normalized vectors as the pedestrian key point trajectory feature; fuse the pedestrian key point trajectory feature with the output of the fully connected layer Pc connected to the pedestrian feature extraction sub-network to obtain fusion feature S1. The feature fusion adopts a concat layer of the deep learning framework Caffe, with the pedestrian key point trajectory feature and the fully connected layer Pc connected to the pedestrian feature extraction sub-network as the inputs of the concat layer, where the dimension of the pedestrian key point trajectory feature is 504, the dimension of the fully connected layer Pc connected to the pedestrian feature extraction sub-network is 512, and the final output of the concat layer is the fusion feature S1;
Step (4): input the sequence of 15 consecutive pedestrian images of the i-th pedestrian obtained in step (1), {P_i^1, P_i^2, …, P_i^15}, into the face detection module for face detection to obtain the sequence of 15 consecutive face images of the i-th person, {F_i^1, F_i^2, …, F_i^15}. The face detection module adopts the face detection module of the open-source face recognition engine SeetaFace, which uses a funnel-structured cascade (FuSt): the top of the FuSt cascade consists of several fast LAB cascade classifiers for different poses, followed by several multilayer perceptron (MLP) cascades based on SURF features, and finally a unified MLP cascade processes the candidate windows of all poses, retaining the correct face windows to obtain the face images;
Step (5): judge the resolution of the sequence of 15 consecutive face images of the i-th pedestrian obtained in step (4), {F_i^1, F_i^2, …, F_i^15}; face images with resolution greater than 112×112 are not subjected to super-resolution processing, while face images with resolution smaller than 112×112 are subjected to super-resolution processing, finally yielding the sequence of 15 consecutive higher-resolution face images of the i-th pedestrian;
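A minimal sketch of the resolution gate in step (5); plain bicubic upscaling stands in for the super-resolution model, which the patent does not specify:

```python
# Hedged sketch: pass through faces at or above 112x112, upscale smaller ones.
import cv2

def ensure_resolution(face_bgr, min_size=(112, 112)):
    h, w = face_bgr.shape[:2]
    if w >= min_size[0] and h >= min_size[1]:
        return face_bgr                          # no super-resolution needed
    return cv2.resize(face_bgr, min_size, interpolation=cv2.INTER_CUBIC)
```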
Step (6): input the face image sequence obtained in step (5) into a face feature extraction sub-network based on a convolutional neural network; the network consists of 20 convolutional layers;
Step (7): perform face key point detection on the face image sequence obtained in step (5) to obtain the corresponding 5 face key points; compute the face key point trajectories from the position changes of the 5 face key points using the key point trajectory calculation formula, normalize each of the 5 resulting face key point trajectory vectors using the key point trajectory normalization formula, and concatenate the 5 normalized vectors as the face key point trajectory feature; fuse the face key point trajectory feature with the output of the fully connected layer Fc connected to the face feature extraction sub-network to obtain fusion feature S2, where the dimension of the face key point trajectory feature is 140 and the dimension of the fully connected layer Fc connected to the face feature extraction sub-network is 512;
Step (8): use the fusion feature S2 obtained in step (7) as the input of a face identity feature layer, a gender feature layer, an expression feature layer and an age feature layer; the face identity feature layer serves as the input of the identity classification layer, the gender feature layer as the input of the gender classification layer, the expression feature layer as the input of the expression classification layer, and the age feature layer as the input of the age classification layer;
Step (9): fuse the fusion feature S1 obtained in step (3) with the fusion feature S2 obtained in step (7) to obtain fusion feature S3, where the dimension of fusion feature S1 is 1016 and the dimension of fusion feature S2 is 652;
Step (10): use the fusion feature S3 obtained in step (9) as the input of a pedestrian identity feature layer, a gender feature layer, a hair style feature layer and a clothes type feature layer; the pedestrian identity feature layer serves as the input of the pedestrian identity classification layer, the gender feature layer as the input of the gender classification layer, the hair style feature layer as the input of the hair style classification layer, and the clothes type feature layer as the input of the clothes type classification layer;
the formula for calculating the track of the key points is as follows:
T_{i,j}^{k,t} = P_{i,j+1}^{k,t} - P_{i,j}^{k,t} = ( x_{i,j+1}^{k,t} - x_{i,j}^{k,t} , y_{i,j+1}^{k,t} - y_{i,j}^{k,t} )

wherein, when t = 0, the formula computes the pedestrian key point trajectories of the i-th pedestrian: k denotes the k-th pedestrian key point of the i-th pedestrian, k ∈ [1, m]; j denotes the j-th frame pedestrian image of the i-th pedestrian, j ∈ [1, n-1]; T_{i,j}^{k,0} denotes the trajectory of the k-th pedestrian key point from the j-th frame to the (j+1)-th frame pedestrian image of the i-th pedestrian; P_{i,j+1}^{k,0} = (x_{i,j+1}^{k,0}, y_{i,j+1}^{k,0}) and P_{i,j}^{k,0} = (x_{i,j}^{k,0}, y_{i,j}^{k,0}) denote the coordinates of the k-th pedestrian key point in the (j+1)-th and j-th frame pedestrian images of the i-th pedestrian, respectively, with x and y the corresponding axis coordinates. When j = 1 and k = 1, the trajectory of the 1st pedestrian key point from the 1st frame to the 2nd frame pedestrian image of the i-th person is:

T_{i,1}^{1,0} = ( x_{i,2}^{1,0} - x_{i,1}^{1,0} , y_{i,2}^{1,0} - y_{i,1}^{1,0} )

when t = 1, the formula computes the face key point trajectories of the i-th pedestrian: k denotes the k-th face key point of the i-th pedestrian, k ∈ [1, s]; j denotes the j-th frame face image of the i-th pedestrian, j ∈ [1, n-1]; T_{i,j}^{k,1} denotes the trajectory of the k-th face key point from the j-th frame to the (j+1)-th frame face image of the i-th pedestrian; P_{i,j+1}^{k,1} = (x_{i,j+1}^{k,1}, y_{i,j+1}^{k,1}) and P_{i,j}^{k,1} = (x_{i,j}^{k,1}, y_{i,j}^{k,1}) denote the coordinates of the k-th face key point in the (j+1)-th and j-th frame face images of the i-th pedestrian, respectively. When j = 1 and k = 1, the trajectory of the 1st face key point from the 1st frame to the 2nd frame face image of the i-th person is:

T_{i,1}^{1,1} = ( x_{i,2}^{1,1} - x_{i,1}^{1,1} , y_{i,2}^{1,1} - y_{i,1}^{1,1} )
the key point track normalization formula is as follows:
T̄_i^{k,t} = T_i^{k,t} / Σ_{j=1}^{n-1} ‖T_{i,j}^{k,t}‖

wherein, when t = 0, the formula normalizes the pedestrian key point trajectory vector of the i-th pedestrian: k denotes the k-th pedestrian key point of the i-th pedestrian, k ∈ [1, m]; j denotes the j-th frame pedestrian image of the i-th pedestrian, j ∈ [1, n-1]; T̄_i^{k,0} denotes the k-th pedestrian key point trajectory feature of the n consecutive pedestrian images of the i-th pedestrian, a vector of (n-1)×2 dimensions; T_i^{k,0} = (T_{i,1}^{k,0}, …, T_{i,n-1}^{k,0}) denotes the k-th pedestrian key point trajectory over the n consecutive pedestrian images of the i-th pedestrian; and ‖T_{i,j}^{k,0}‖ denotes the length of the k-th pedestrian key point trajectory from the j-th frame to the (j+1)-th frame pedestrian image of the i-th pedestrian. When n = 15 and k = 1, the normalized trajectory vector of the 1st pedestrian key point over the 15 consecutive pedestrian images of the i-th pedestrian is:

T̄_i^{1,0} = T_i^{1,0} / Σ_{j=1}^{14} ‖T_{i,j}^{1,0}‖

when t = 1, the formula normalizes the face key point trajectory vector of the i-th pedestrian: k denotes the k-th face key point of the i-th pedestrian, k ∈ [1, s]; j denotes the j-th frame face image of the i-th pedestrian, j ∈ [1, n-1]; T̄_i^{k,1} denotes the k-th face key point trajectory feature of the n consecutive face images of the i-th pedestrian, a vector of (n-1)×2 dimensions; T_i^{k,1} denotes the k-th face key point trajectory over the n consecutive face images of the i-th pedestrian; and ‖T_{i,j}^{k,1}‖ denotes the length of the k-th face key point trajectory from the j-th frame to the (j+1)-th frame face image of the i-th pedestrian. When n = 15 and k = 1, the normalized trajectory vector of the 1st face key point over the 15 consecutive face images of the i-th pedestrian is:

T̄_i^{1,1} = T_i^{1,1} / Σ_{j=1}^{14} ‖T_{i,j}^{1,1}‖

Claims (3)

1. A method for designing a network structure for recognizing human faces, pedestrians and attributes thereof based on deep learning, comprising the following steps:
Step (1): input n consecutive frames of video images captured by the surveillance camera into the pedestrian detection and tracking module; when the i-th pedestrian appears in the video images, output the sequence of n consecutive pedestrian images of the i-th pedestrian, {P_i^1, P_i^2, …, P_i^n}; pedestrian detection adopts the open-source Faster R-CNN algorithm, which comprises three basic components: the first is a region proposal network (RPN) structure that generates candidate regions for each surveillance video image, the second is a convolutional neural network that extracts pedestrian features from the candidate regions, and the third is a binary Softmax classifier that judges whether a candidate region contains a pedestrian; pedestrian tracking adopts the optical flow tracking function of OpenCV;
Step (2): input the sequence of n consecutive pedestrian images of the i-th pedestrian obtained in step (1), {P_i^1, P_i^2, …, P_i^n}, into a pedestrian feature extraction sub-network based on a convolutional neural network; the network contains two kinds of layers, convolutional layers and max-pooling layers, where two convolutional layers followed by a max-pooling layer form a substructure, and the pedestrian feature extraction sub-network comprises N such substructures connected in series;
Step (3): perform pedestrian key point detection on the sequence of n consecutive pedestrian images of the i-th person obtained in step (1), {P_i^1, P_i^2, …, P_i^n}, to obtain the corresponding m pedestrian key points; compute the pedestrian key point trajectories from the position changes of the m pedestrian key points, normalize each of the m resulting pedestrian key point trajectory vectors, and concatenate the m normalized vectors as the pedestrian key point trajectory feature; fuse the pedestrian key point trajectory feature with the output of the fully connected layer Pc connected to the pedestrian feature extraction sub-network to obtain fusion feature S1; the feature fusion adopts a concat layer of the deep learning framework Caffe, with the pedestrian key point trajectory feature and the fully connected layer Pc connected to the pedestrian feature extraction sub-network as the inputs of the concat layer, where the dimension of the pedestrian key point trajectory feature is m×(n-1)×2, the dimension of the fully connected layer Pc connected to the pedestrian feature extraction sub-network is D, and the final output of the concat layer is the fusion feature S1;
Step (4): input the sequence of n consecutive pedestrian images of the i-th pedestrian obtained in step (1), {P_i^1, P_i^2, …, P_i^n}, into the face detection module for face detection to obtain the sequence of n consecutive face images of the i-th person, {F_i^1, F_i^2, …, F_i^n}; the face detection module adopts the face detection module of the open-source face recognition engine SeetaFace, which uses a funnel-structured cascade FuSt: the top of the FuSt cascade consists of several fast LAB cascade classifiers for different poses, followed by several multilayer perceptron MLP cascades based on SURF features, and finally a unified MLP cascade processes the candidate windows of all poses, retaining the correct face windows to obtain the face images;
Step (5): judge the resolution of the sequence of n consecutive face images of the i-th pedestrian obtained in step (4), {F_i^1, F_i^2, …, F_i^n}; face images with resolution greater than A×B are not subjected to super-resolution processing, while face images with resolution smaller than A×B are subjected to super-resolution processing, finally yielding the sequence of n consecutive higher-resolution face images of the i-th pedestrian;
Step (6): input the face image sequence obtained in step (5) into a face feature extraction sub-network based on a convolutional neural network; the network consists of M convolutional layers;
Step (7): perform face key point detection on the face image sequence obtained in step (5) to obtain the corresponding s face key points; compute the face key point trajectories from the position changes of the s face key points, normalize each of the s resulting face key point trajectory vectors, and concatenate the s normalized vectors as the face key point trajectory feature; fuse the face key point trajectory feature with the output of the fully connected layer Fc connected to the face feature extraction sub-network to obtain fusion feature S2, where the dimension of the face key point trajectory feature is s×(n-1)×2 and the dimension of the fully connected layer Fc connected to the face feature extraction sub-network is D;
Step (8): use the fusion feature S2 obtained in step (7) as the input of a face identity feature layer, a face attribute 1 feature layer, a face attribute 2 feature layer, …, and a face attribute v feature layer; the face identity feature layer serves as the input of the identity classification layer, the face attribute 1 feature layer as the input of the face attribute 1 classification layer, the face attribute 2 feature layer as the input of the face attribute 2 classification layer, …, and the face attribute v feature layer as the input of the face attribute v classification layer;
Step (9): fuse the fusion feature S1 obtained in step (3) with the fusion feature S2 obtained in step (7) to obtain fusion feature S3, where the dimension of fusion feature S1 is m×(n-1)×2+D and the dimension of fusion feature S2 is s×(n-1)×2+D;
Step (10): use the fusion feature S3 obtained in step (9) as the input of a pedestrian identity feature layer, a pedestrian attribute 1 feature layer, a pedestrian attribute 2 feature layer, …, and a pedestrian attribute v feature layer; the pedestrian identity feature layer serves as the input of the pedestrian identity classification layer, the pedestrian attribute 1 feature layer as the input of the pedestrian attribute 1 classification layer, the pedestrian attribute 2 feature layer as the input of the pedestrian attribute 2 classification layer, …, and the pedestrian attribute v feature layer as the input of the pedestrian attribute v classification layer.
2. The method for designing a face, pedestrian and attribute recognition network structure based on deep learning as claimed in claim 1, wherein the pedestrian key point trajectory in step (3) and the face key point trajectory in step (7) are calculated by the following formula:

T_{i,j}^{k,t} = P_{i,j+1}^{k,t} - P_{i,j}^{k,t} = ( x_{i,j+1}^{k,t} - x_{i,j}^{k,t} , y_{i,j+1}^{k,t} - y_{i,j}^{k,t} )

wherein, when t = 0, the formula computes the pedestrian key point trajectories of the i-th pedestrian: k denotes the k-th pedestrian key point of the i-th pedestrian, k ∈ [1, m]; j denotes the j-th frame pedestrian image of the i-th pedestrian, j ∈ [1, n-1]; T_{i,j}^{k,0} denotes the trajectory of the k-th pedestrian key point from the j-th frame to the (j+1)-th frame pedestrian image of the i-th pedestrian; P_{i,j+1}^{k,0} = (x_{i,j+1}^{k,0}, y_{i,j+1}^{k,0}) and P_{i,j}^{k,0} = (x_{i,j}^{k,0}, y_{i,j}^{k,0}) denote the coordinates of the k-th pedestrian key point in the (j+1)-th and j-th frame pedestrian images of the i-th pedestrian, respectively, with x and y the corresponding axis coordinates;

when t = 1, the formula computes the face key point trajectories of the i-th pedestrian: k denotes the k-th face key point of the i-th pedestrian, k ∈ [1, s]; j denotes the j-th frame face image of the i-th pedestrian, j ∈ [1, n-1]; T_{i,j}^{k,1} denotes the trajectory of the k-th face key point from the j-th frame to the (j+1)-th frame face image of the i-th pedestrian; P_{i,j+1}^{k,1} = (x_{i,j+1}^{k,1}, y_{i,j+1}^{k,1}) and P_{i,j}^{k,1} = (x_{i,j}^{k,1}, y_{i,j}^{k,1}) denote the coordinates of the k-th face key point in the (j+1)-th and j-th frame face images of the i-th pedestrian, respectively, with x and y the corresponding axis coordinates.
3. The method for designing a face, pedestrian and attribute recognition network structure based on deep learning as claimed in claim 1, wherein the pedestrian key point trajectory vectors in step (3) and the face key point trajectory vectors in step (7) are normalized by the following formula:

T̄_i^{k,t} = T_i^{k,t} / Σ_{j=1}^{n-1} ‖T_{i,j}^{k,t}‖

wherein, when t = 0, the formula normalizes the pedestrian key point trajectory vector of the i-th pedestrian: k denotes the k-th pedestrian key point of the i-th pedestrian, k ∈ [1, m]; j denotes the j-th frame pedestrian image of the i-th pedestrian, j ∈ [1, n-1]; T̄_i^{k,0} denotes the k-th pedestrian key point trajectory feature of the n consecutive pedestrian images of the i-th pedestrian, a vector of (n-1)×2 dimensions; T_i^{k,0} = (T_{i,1}^{k,0}, …, T_{i,n-1}^{k,0}) denotes the k-th pedestrian key point trajectory over the n consecutive pedestrian images of the i-th pedestrian; and ‖T_{i,j}^{k,0}‖ denotes the length of the k-th pedestrian key point trajectory from the j-th frame to the (j+1)-th frame pedestrian image of the i-th pedestrian;

when t = 1, the formula normalizes the face key point trajectory vector of the i-th pedestrian: k denotes the k-th face key point of the i-th pedestrian, k ∈ [1, s]; j denotes the j-th frame face image of the i-th pedestrian, j ∈ [1, n-1]; T̄_i^{k,1} denotes the k-th face key point trajectory feature of the n consecutive face images of the i-th pedestrian, a vector of (n-1)×2 dimensions; T_i^{k,1} denotes the k-th face key point trajectory over the n consecutive face images of the i-th pedestrian; and ‖T_{i,j}^{k,1}‖ denotes the length of the k-th face key point trajectory from the j-th frame to the (j+1)-th frame face image of the i-th pedestrian.
CN201810864964.9A 2018-08-01 2018-08-01 Face, pedestrian and attribute recognition network structure design method based on deep learning Active CN109101915B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810864964.9A CN109101915B (en) 2018-08-01 2018-08-01 Face, pedestrian and attribute recognition network structure design method based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810864964.9A CN109101915B (en) 2018-08-01 2018-08-01 Face, pedestrian and attribute recognition network structure design method based on deep learning

Publications (2)

Publication Number Publication Date
CN109101915A CN109101915A (en) 2018-12-28
CN109101915B true CN109101915B (en) 2021-04-27

Family

ID=64848324

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810864964.9A Active CN109101915B (en) 2018-08-01 2018-08-01 Face, pedestrian and attribute recognition network structure design method based on deep learning

Country Status (1)

Country Link
CN (1) CN109101915B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109858402B (en) * 2019-01-16 2021-08-31 腾讯科技(深圳)有限公司 Image detection method, device, terminal and storage medium
CN109886154A (en) * 2019-01-30 2019-06-14 电子科技大学 Most pedestrian's appearance attribute recognition methods according to collection joint training based on Inception V3
CN109829436B (en) * 2019-02-02 2022-05-13 福州大学 Multi-face tracking method based on depth appearance characteristics and self-adaptive aggregation network
CN110084216B (en) * 2019-05-06 2021-11-09 苏州科达科技股份有限公司 Face recognition model training and face recognition method, system, device and medium
CN110298278B (en) * 2019-06-19 2021-06-04 中国计量大学 Underground parking garage pedestrian and vehicle monitoring method based on artificial intelligence
CN110263756A (en) * 2019-06-28 2019-09-20 东北大学 A kind of human face super-resolution reconstructing system based on joint multi-task learning
CN111553231B (en) * 2020-04-21 2023-04-28 上海锘科智能科技有限公司 Face snapshot and deduplication system, method, terminal and medium based on information fusion
CN112818833B (en) * 2021-01-29 2024-04-12 中能国际建筑投资集团有限公司 Face multitasking detection method, system, device and medium based on deep learning

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9437009B2 (en) * 2011-06-20 2016-09-06 University Of Southern California Visual tracking in video images in unconstrained environments by exploiting on-the-fly context using supporters and distracters
CN103116756B (en) * 2013-01-23 2016-07-27 北京工商大学 A kind of persona face detection method and device
CN104077804B (en) * 2014-06-09 2017-03-01 广州嘉崎智能科技有限公司 A kind of method based on multi-frame video picture construction three-dimensional face model
AU2015224526B2 (en) * 2014-09-11 2020-04-30 Iomniscient Pty Ltd An image management system
CN105518744B (en) * 2015-06-29 2018-09-07 北京旷视科技有限公司 Pedestrian recognition methods and equipment again
CN108038409B (en) * 2017-10-27 2021-12-28 江西高创保安服务技术有限公司 Pedestrian detection method

Also Published As

Publication number Publication date
CN109101915A (en) 2018-12-28

Similar Documents

Publication Publication Date Title
CN109101915B (en) Face, pedestrian and attribute recognition network structure design method based on deep learning
Konstantinidis et al. Sign language recognition based on hand and body skeletal data
Zhan et al. Face detection using representation learning
CN111414862B (en) Expression recognition method based on neural network fusion key point angle change
Pu et al. Facial expression recognition from image sequences using twofold random forest classifier
Shirsat et al. Proposed system for criminal detection and recognition on CCTV data using cloud and machine learning
Xia et al. Face occlusion detection using deep convolutional neural networks
Liu et al. Facial attractiveness computation by label distribution learning with deep CNN and geometric features
Yang et al. Face recognition based on MTCNN and integrated application of FaceNet and LBP method
CN117541994A (en) Abnormal behavior detection model and detection method in dense multi-person scene
Archana et al. Real time face detection and optimal face mapping for online classes
Chen et al. A multi-scale fusion convolutional neural network for face detection
Hsiao et al. EfficientNet based iris biometric recognition methods with pupil positioning by U-net
Silwal et al. A novel deep learning system for facial feature extraction by fusing CNN and MB-LBP and using enhanced loss function
Myvizhi et al. Extensive analysis of deep learning-based deepfake video detection
Sajid et al. Facial asymmetry-based feature extraction for different applications: a review complemented by new advances
Liu et al. Lip event detection using oriented histograms of regional optical flow and low rank affinity pursuit
Yang et al. Heterogeneous face detection based on multi‐task cascaded convolutional neural network
Liu et al. Robust saliency-aware distillation for few-shot fine-grained visual recognition
Martinez-Gonzalez et al. Real time face detection using neural networks
Nguyen et al. A method for hand detection based on Internal Haar-like features and Cascaded AdaBoost Classifier
Sadeq et al. Comparison Between Face and Gait Human Recognition Using Enhanced Convolutional Neural Network
Rondón et al. Machine learning models in people detection and identification: a literature review
Ismail et al. A review on Arabic sign language recognition
Papadimitriou et al. Fingerspelled alphabet sign recognition in upper-body videos

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant