CN109101915B - Face, pedestrian and attribute recognition network structure design method based on deep learning - Google Patents

Face, pedestrian and attribute recognition network structure design method based on deep learning Download PDF

Info

Publication number
CN109101915B
CN109101915B CN201810864964.9A CN201810864964A
Authority
CN
China
Prior art keywords
pedestrian
face
key point
ith
frame
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810864964.9A
Other languages
Chinese (zh)
Other versions
CN109101915A (en)
Inventor
章东平
陈思瑶
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Jiliang University
Original Assignee
China Jiliang University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Jiliang University filed Critical China Jiliang University
Priority to CN201810864964.9A priority Critical patent/CN109101915B/en
Publication of CN109101915A publication Critical patent/CN109101915A/en
Application granted granted Critical
Publication of CN109101915B publication Critical patent/CN109101915B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161Detection; Localisation; Normalisation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/20Analysis of motion
    • G06T7/246Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/103Static body considered as a whole, e.g. static pedestrian or occupant recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168Feature extraction; Face representation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/172Classification, e.g. identification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10016Video; Image sequence
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/30Subject of image; Context of image processing
    • G06T2207/30196Human being; Person
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/30Subject of image; Context of image processing
    • G06T2207/30196Human being; Person
    • G06T2207/30201Face
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/30Subject of image; Context of image processing
    • G06T2207/30241Trajectory

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses a method for designing a face, pedestrian and attribute recognition network structure based on deep learning. Pedestrian key point trajectory features are computed and fused with the output of the fully connected layer of the pedestrian feature extraction sub-network to obtain fusion feature S1. Key point detection is performed on multiple face images of the same person to obtain face key points, from which face key point trajectory features are computed; these are fused with the fully connected layer of the face multi-task recognition sub-network to obtain fusion feature S2, and fusion feature S2 is used to recognize the face and its attributes. Fusion features S1 and S2 are then fused to obtain fusion feature S3, which is used to recognize the pedestrian and its attributes.

Description

Face, pedestrian and attribute recognition network structure design method based on deep learning
Technical Field
The invention relates to the field of deep learning for face and attribute recognition and for pedestrian and attribute recognition, and in particular to the construction of a network structure.
Background
At present, face recognition has made great progress in academic research, but low reliability remains a problem when it is applied in real life. Most current face recognition systems can only be used in restricted environments, for example: 1. the subject must actively cooperate; 2. the face image must have a relatively high resolution; 3. lighting conditions must be good. In natural scenes, interference factors such as pose, illumination and expression are common, and this interference must be overcome for face recognition technology to be developed and popularized.
Pedestrian re-identification uses computer vision to judge whether a specific pedestrian is present in an image or video sequence. Faced with massive amounts of surveillance video, there is a growing need to re-identify pedestrians in surveillance videos with a computer. Pedestrian re-identification has developed rapidly in recent years through the continuous efforts of researchers, but a large gap remains between its performance and the demands of practical applications. First, in typical surveillance video the resolution of pedestrians in the image is low and face information is blurred, which is very unfavorable for image analysis, feature extraction, segmentation and so on. Second, pedestrians may be occluded by other pedestrians or objects, which greatly affects their appearance representation. Finally, differences in monitoring environments, camera parameters and illumination cause large changes in the appearance of the same person, making matching difficult. How to overcome the impact of these factors on the pedestrian matching task and find an effective solution is an important research direction for pedestrian re-identification.
Disclosure of Invention
The invention overcomes the defects of the prior art and provides a face, pedestrian and attribute recognition network structure based on deep learning. Its purpose is to recognize faces and their attributes and pedestrians and their attributes with a multi-task network based on a convolutional neural network, and to add pedestrian key point trajectory features and face key point trajectory features to improve the accuracy of face, pedestrian and attribute recognition.
In order to achieve the above purpose, the invention adopts the following technical solution:
a method for designing a network structure for recognizing human faces, pedestrians and attributes thereof based on deep learning comprises the following steps:
Step (1): input n consecutive frames of video images captured by a surveillance camera into the pedestrian detection and tracking module; when the i-th pedestrian appears in the video images, output the sequence of n consecutive pedestrian images of the i-th pedestrian, {P_i^1, P_i^2, …, P_i^n}. Pedestrian detection adopts the open-source Faster R-CNN algorithm, which comprises three basic components: the first is a region proposal network (RPN) that generates candidate regions for each surveillance video image, the second is a convolutional neural network that extracts pedestrian features from the candidate regions, and the third is a binary Softmax classifier that judges whether a candidate region contains a pedestrian; pedestrian tracking adopts the optical flow tracking function of OpenCV;
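A minimal sketch of step (1), assuming torchvision's off-the-shelf Faster R-CNN as the detector and OpenCV's pyramidal Lucas-Kanade optical flow for the tracking step; the patent does not name torchvision, and the helper names below are illustrative:

```python
# Hedged sketch: open-source Faster R-CNN (torchvision) for pedestrian detection
# and OpenCV Lucas-Kanade optical flow for frame-to-frame tracking.
import cv2
import numpy as np
import torch
import torchvision

detector = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()

def detect_pedestrians(frame_bgr, score_thr=0.8):
    """Return pedestrian boxes [x1, y1, x2, y2] for one surveillance frame."""
    rgb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)
    tensor = torch.from_numpy(rgb).permute(2, 0, 1).float() / 255.0
    with torch.no_grad():
        out = detector([tensor])[0]
    keep = (out["labels"] == 1) & (out["scores"] > score_thr)  # COCO class 1 = person
    return out["boxes"][keep].cpu().numpy()

def track_points(prev_gray, next_gray, prev_pts):
    """Track points between consecutive frames with pyramidal LK optical flow."""
    pts = np.asarray(prev_pts, dtype=np.float32).reshape(-1, 1, 2)
    next_pts, status, _err = cv2.calcOpticalFlowPyrLK(prev_gray, next_gray, pts, None)
    return next_pts.reshape(-1, 2), status.ravel().astype(bool)
```

Running the detector on each of the n frames and associating the boxes through the tracked points yields the per-pedestrian image sequence {P_i^1, …, P_i^n} used by the later steps.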
Step (2): input the sequence of n consecutive pedestrian images of the i-th pedestrian obtained in step (1), {P_i^1, P_i^2, …, P_i^n}, into a pedestrian feature extraction sub-network based on a convolutional neural network; the network contains two kinds of layers, convolutional layers and max-pooling layers, where two convolutional layers followed by a max-pooling layer form a substructure, and the pedestrian feature extraction sub-network comprises N such substructures connected in series;
Step (3): perform pedestrian key point detection on the sequence of n consecutive pedestrian images of the i-th person obtained in step (1), {P_i^1, P_i^2, …, P_i^n}, to obtain the corresponding m pedestrian key points; compute the pedestrian key point trajectories from the position changes of the m pedestrian key points using formula (1), normalize each of the m resulting pedestrian key point trajectory vectors using formula (2), and concatenate the m normalized vectors as the pedestrian key point trajectory feature; fuse the pedestrian key point trajectory feature with the output of the fully connected layer Pc connected to the pedestrian feature extraction sub-network to obtain fusion feature S1. The feature fusion adopts a concat layer of the deep learning framework Caffe, with the pedestrian key point trajectory feature and the fully connected layer Pc connected to the pedestrian feature extraction sub-network as the inputs of the concat layer, where the dimension of the pedestrian key point trajectory feature is m×(n-1)×2, the dimension of the fully connected layer Pc connected to the pedestrian feature extraction sub-network is D, and the final output of the concat layer is the fusion feature S1;
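A minimal numpy sketch of formulas (1) and (2) and the concat-style fusion of step (3); `keypoints` and `fc_output` are illustrative names standing for the detected key point coordinates and the output of the fully connected layer Pc:

```python
# Hedged sketch of the trajectory feature (formulas (1)-(2)) and the concat fusion.
import numpy as np

def keypoint_trajectory_feature(keypoints):
    """keypoints: (n, m, 2) array -> trajectory feature of dimension m*(n-1)*2."""
    traj = keypoints[1:] - keypoints[:-1]          # formula (1): per-frame displacements
    feats = []
    for k in range(traj.shape[1]):                 # one (n-1) x 2 trajectory per key point
        t_k = traj[:, k, :]
        length = np.linalg.norm(t_k, axis=1).sum() # total trajectory length of key point k
        feats.append((t_k / max(length, 1e-8)).ravel())   # formula (2): normalization
    return np.concatenate(feats)

def fuse(trajectory_feature, fc_output):
    """Concat-layer style fusion producing S1 of dimension m*(n-1)*2 + D."""
    return np.concatenate([trajectory_feature, fc_output])

# Check against the embodiment's values: n = 15 frames, m = 18 key points, D = 512.
kp = np.random.rand(15, 18, 2)
s1 = fuse(keypoint_trajectory_feature(kp), np.random.rand(512))
assert s1.shape == (18 * 14 * 2 + 512,)            # 504 + 512 = 1016
```

The same two functions cover step (7) by passing the s face key points (s = 5 and D = 512 in the embodiment) to obtain S2.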
Step (4): input the sequence of n consecutive pedestrian images of the i-th pedestrian obtained in step (1), {P_i^1, P_i^2, …, P_i^n}, into the face detection module for face detection to obtain the sequence of n consecutive face images of the i-th person, {F_i^1, F_i^2, …, F_i^n}. The face detection module adopts the face detection module of the open-source face recognition engine SeetaFace, which uses a funnel-structured cascade (FuSt): the top of the FuSt cascade consists of several fast LAB cascade classifiers for different poses, followed by several multilayer perceptron (MLP) cascades based on SURF features, and finally a unified MLP cascade processes the candidate windows of all poses, retaining the correct face windows to obtain the face images;
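A minimal sketch of the data flow in step (4); OpenCV's Haar cascade is used only as a stand-in face detector, since SeetaFace's FuSt cascade is not reproduced here:

```python
# Hedged sketch: crop one face from a pedestrian image with a stand-in detector.
import cv2

face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def detect_face(pedestrian_bgr):
    """Return the first detected face crop from a pedestrian image, or None."""
    gray = cv2.cvtColor(pedestrian_bgr, cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None
    x, y, w, h = faces[0]
    return pedestrian_bgr[y:y + h, x:x + w]
```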
Step (5): judge the resolution of the sequence of n consecutive face images of the i-th pedestrian obtained in step (4), {F_i^1, F_i^2, …, F_i^n}; face images with resolution greater than A×B are not subjected to super-resolution processing, while face images with resolution smaller than A×B are subjected to super-resolution processing, finally yielding the sequence of n consecutive higher-resolution face images of the i-th pedestrian;
Step (6): input the face image sequence obtained in step (5) into a face feature extraction sub-network based on a convolutional neural network; the network consists of M convolutional layers;
Step (7): perform face key point detection on the face image sequence obtained in step (5) to obtain the corresponding s face key points; compute the face key point trajectories from the position changes of the s face key points using formula (1), normalize each of the s resulting face key point trajectory vectors using formula (2), and concatenate the s normalized vectors as the face key point trajectory feature; fuse the face key point trajectory feature with the output of the fully connected layer Fc connected to the face feature extraction sub-network to obtain fusion feature S2, where the dimension of the face key point trajectory feature is s×(n-1)×2 and the dimension of the fully connected layer Fc connected to the face feature extraction sub-network is D;
Step (8): use the fusion feature S2 obtained in step (7) as the input of a face identity feature layer, a face attribute 1 feature layer, a face attribute 2 feature layer, …, and a face attribute v feature layer; the face identity feature layer serves as the input of the identity classification layer, the face attribute 1 feature layer as the input of the face attribute 1 classification layer, the face attribute 2 feature layer as the input of the face attribute 2 classification layer, …, and the face attribute v feature layer as the input of the face attribute v classification layer;
Step (9): fuse the fusion feature S1 obtained in step (3) with the fusion feature S2 obtained in step (7) to obtain fusion feature S3, where the dimension of fusion feature S1 is m×(n-1)×2+D and the dimension of fusion feature S2 is s×(n-1)×2+D;
Step (10): use the fusion feature S3 obtained in step (9) as the input of a pedestrian identity feature layer, a pedestrian attribute 1 feature layer, a pedestrian attribute 2 feature layer, …, and a pedestrian attribute v feature layer; the pedestrian identity feature layer serves as the input of the pedestrian identity classification layer, the pedestrian attribute 1 feature layer as the input of the pedestrian attribute 1 classification layer, the pedestrian attribute 2 feature layer as the input of the pedestrian attribute 2 classification layer, …, and the pedestrian attribute v feature layer as the input of the pedestrian attribute v classification layer;
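A hedged PyTorch sketch of the multi-task heads in steps (8)-(10): one feature layer plus one classification layer per task, branching from S2 for the face tasks and from S3 for the pedestrian tasks. The hidden size and class counts are illustrative placeholders; the input dimensions 652 and 1016 + 652 follow the embodiment's values.

```python
# Hedged sketch of per-task feature layers and classification layers.
import torch
import torch.nn as nn

class MultiTaskHead(nn.Module):
    def __init__(self, in_dim, feat_dim, num_classes_per_task):
        super().__init__()
        # One feature layer and one classifier per task (identity, attribute 1, ...).
        self.feature_layers = nn.ModuleList(
            [nn.Linear(in_dim, feat_dim) for _ in num_classes_per_task])
        self.classifiers = nn.ModuleList(
            [nn.Linear(feat_dim, c) for c in num_classes_per_task])

    def forward(self, fused):
        return [cls(torch.relu(feat(fused)))
                for feat, cls in zip(self.feature_layers, self.classifiers)]

# Face branch on S2 (652-dim): identity, gender, expression, age (placeholder class counts).
face_head = MultiTaskHead(652, 256, [1000, 2, 7, 8])
# Pedestrian branch on S3 (1016 + 652 = 1668-dim): identity, gender, hair style, clothes type.
pedestrian_head = MultiTaskHead(1016 + 652, 256, [1000, 2, 5, 10])
face_logits = face_head(torch.randn(4, 652))        # list of per-task logits
```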
The key point trajectory calculation formula (1) is:

T_{i,j}^{k,t} = P_{i,j+1}^{k,t} - P_{i,j}^{k,t} = ( x_{i,j+1}^{k,t} - x_{i,j}^{k,t} , y_{i,j+1}^{k,t} - y_{i,j}^{k,t} )        (1)

wherein, when t = 0, formula (1) computes the pedestrian key point trajectories of the i-th pedestrian: k denotes the k-th pedestrian key point of the i-th pedestrian, k ∈ [1, m]; j denotes the j-th frame pedestrian image of the i-th pedestrian, j ∈ [1, n-1]; T_{i,j}^{k,0} denotes the trajectory of the k-th pedestrian key point from the j-th frame to the (j+1)-th frame pedestrian image of the i-th pedestrian; P_{i,j+1}^{k,0} = (x_{i,j+1}^{k,0}, y_{i,j+1}^{k,0}) denotes the coordinates of the k-th pedestrian key point in the (j+1)-th frame pedestrian image of the i-th pedestrian, with x_{i,j+1}^{k,0} and y_{i,j+1}^{k,0} its x-axis and y-axis coordinates; and P_{i,j}^{k,0} = (x_{i,j}^{k,0}, y_{i,j}^{k,0}) denotes the coordinates of the k-th pedestrian key point in the j-th frame pedestrian image of the i-th pedestrian, with x_{i,j}^{k,0} and y_{i,j}^{k,0} its x-axis and y-axis coordinates;

when t = 1, formula (1) computes the face key point trajectories of the i-th pedestrian: k denotes the k-th face key point of the i-th pedestrian, k ∈ [1, s]; j denotes the j-th frame face image of the i-th pedestrian, j ∈ [1, n-1]; T_{i,j}^{k,1} denotes the trajectory of the k-th face key point from the j-th frame to the (j+1)-th frame face image of the i-th pedestrian; P_{i,j+1}^{k,1} = (x_{i,j+1}^{k,1}, y_{i,j+1}^{k,1}) denotes the coordinates of the k-th face key point in the (j+1)-th frame face image of the i-th pedestrian; and P_{i,j}^{k,1} = (x_{i,j}^{k,1}, y_{i,j}^{k,1}) denotes the coordinates of the k-th face key point in the j-th frame face image of the i-th pedestrian, with x and y the corresponding axis coordinates.
The key point trajectory normalization formula (2) is:

T̄_i^{k,t} = T_i^{k,t} / Σ_{j=1}^{n-1} ‖T_{i,j}^{k,t}‖        (2)

wherein, when t = 0, formula (2) normalizes the pedestrian key point trajectory vector of the i-th pedestrian: k denotes the k-th pedestrian key point of the i-th pedestrian, k ∈ [1, m]; j denotes the j-th frame pedestrian image of the i-th pedestrian, j ∈ [1, n-1]; T̄_i^{k,0} denotes the k-th pedestrian key point trajectory feature of the n consecutive pedestrian images of the i-th pedestrian, a vector of (n-1)×2 dimensions; T_i^{k,0} = (T_{i,1}^{k,0}, T_{i,2}^{k,0}, …, T_{i,n-1}^{k,0}) denotes the k-th pedestrian key point trajectory over the n consecutive pedestrian images of the i-th pedestrian; and ‖T_{i,j}^{k,0}‖ denotes the length of the k-th pedestrian key point trajectory from the j-th frame to the (j+1)-th frame pedestrian image of the i-th pedestrian;

when t = 1, formula (2) normalizes the face key point trajectory vector of the i-th pedestrian: k denotes the k-th face key point of the i-th pedestrian, k ∈ [1, s]; j denotes the j-th frame face image of the i-th pedestrian, j ∈ [1, n-1]; T̄_i^{k,1} denotes the k-th face key point trajectory feature of the n consecutive face images of the i-th pedestrian, a vector of (n-1)×2 dimensions; T_i^{k,1} denotes the k-th face key point trajectory over the n consecutive face images of the i-th pedestrian; and ‖T_{i,j}^{k,1}‖ denotes the length of the k-th face key point trajectory from the j-th frame to the (j+1)-th frame face image of the i-th pedestrian.
Compared with the prior art, the invention has the following beneficial effects:
The invention provides a method for designing a face, pedestrian and attribute recognition network structure based on deep learning. Surveillance video images are input into a pedestrian detection and tracking module for pedestrian detection and tracking to obtain multiple pedestrian images of the same person. Pedestrian key point detection is performed on the obtained pedestrian images of the same person, pedestrian key point trajectory features are obtained by calculation, and the obtained pedestrian key point trajectory features are fused with the fully connected layer of the pedestrian feature extraction sub-network to obtain fusion feature S1. The obtained pedestrian images of the same person are input into the face detection module for face detection to obtain multiple face images of the same person. The resolutions of the face images of the same person are judged; higher-resolution face images are input directly into the face multi-task recognition sub-network, while lower-resolution face images are input after super-resolution processing. Key point detection is performed on the face images of the same person to obtain face key points, from which face key point trajectory features are computed; these are fused with the fully connected layer of the face multi-task recognition sub-network to obtain fusion feature S2, and fusion feature S2 is used to recognize the face and its attributes. Fusion features S1 and S2 are fused to obtain fusion feature S3, and fusion feature S3 is used to recognize the pedestrian and its attributes. This network structure improves the accuracy of face and pedestrian recognition and attribute recognition.
Drawings
Fig. 1 is a schematic diagram of a network structure for recognizing human faces and pedestrians and their attributes based on deep learning.
Fig. 2 is a schematic diagram of a pedestrian feature extraction sub-network structure.
Fig. 3 is a schematic diagram of a face feature extraction sub-network structure.
Detailed Description
In this embodiment, as shown in Fig. 1, the deep-learning-based face, pedestrian and attribute recognition network structure is built through the following steps:
Step (1): input 15 consecutive frames of video images captured by a surveillance camera into the pedestrian detection and tracking module; when the i-th pedestrian appears in the video images, output the sequence of 15 consecutive pedestrian images of the i-th pedestrian, {P_i^1, P_i^2, …, P_i^15}. Pedestrian detection adopts the open-source Faster R-CNN algorithm, which comprises three basic components: the first is a region proposal network (RPN) that generates candidate regions for each surveillance video image, the second is a convolutional neural network that extracts pedestrian features from the candidate regions, and the third is a binary Softmax classifier that judges whether a candidate region contains a pedestrian; pedestrian tracking adopts the optical flow tracking function of OpenCV;
Step (2): input the sequence of 15 consecutive pedestrian images of the i-th pedestrian obtained in step (1), {P_i^1, P_i^2, …, P_i^15}, into a pedestrian feature extraction sub-network based on a convolutional neural network; the network contains two kinds of layers, convolutional layers and max-pooling layers, where two convolutional layers followed by a max-pooling layer form a substructure, and the pedestrian feature extraction sub-network comprises 10 such substructures connected in series;
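A hedged PyTorch sketch of this sub-network: two convolutional layers followed by a max-pooling layer form one substructure, ten substructures are stacked in series, and a 512-dimensional fully connected layer plays the role of Pc. Channel widths, kernel sizes and input resolution are not given in the patent, so the values below are illustrative:

```python
# Hedged sketch of the pedestrian feature extraction sub-network (10 substructures, Pc = 512-d).
import torch
import torch.nn as nn

def substructure(in_ch, out_ch):
    """Two convolutional layers followed by one max-pooling layer."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True),
        nn.MaxPool2d(kernel_size=2, stride=2, ceil_mode=True))

class PedestrianFeatureNet(nn.Module):
    def __init__(self, num_substructures=10, fc_dim=512):
        super().__init__()
        chans = [3] + [min(32 * 2 ** i, 256) for i in range(num_substructures)]
        self.backbone = nn.Sequential(
            *[substructure(chans[i], chans[i + 1]) for i in range(num_substructures)])
        self.pool = nn.AdaptiveAvgPool2d(1)      # keeps Pc independent of input resolution
        self.pc = nn.Linear(chans[-1], fc_dim)   # fully connected layer Pc

    def forward(self, x):
        x = self.pool(self.backbone(x)).flatten(1)
        return self.pc(x)                        # 512-dimensional output fused into S1

features = PedestrianFeatureNet()(torch.randn(1, 3, 256, 128))  # -> shape (1, 512)
```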
Step (3): perform pedestrian key point detection on the sequence of 15 consecutive pedestrian images of the i-th person obtained in step (1), {P_i^1, P_i^2, …, P_i^15}, to obtain the corresponding 18 pedestrian key points; compute the pedestrian key point trajectories from the position changes of the 18 pedestrian key points using the key point trajectory calculation formula, normalize each of the 18 resulting pedestrian key point trajectory vectors using the key point trajectory normalization formula, and concatenate the 18 normalized vectors as the pedestrian key point trajectory feature; fuse the pedestrian key point trajectory feature with the output of the fully connected layer Pc connected to the pedestrian feature extraction sub-network to obtain fusion feature S1. The feature fusion adopts a concat layer of the deep learning framework Caffe, with the pedestrian key point trajectory feature and the fully connected layer Pc connected to the pedestrian feature extraction sub-network as the inputs of the concat layer, where the dimension of the pedestrian key point trajectory feature is 504, the dimension of the fully connected layer Pc connected to the pedestrian feature extraction sub-network is 512, and the final output of the concat layer is the fusion feature S1;
Step (4): input the sequence of 15 consecutive pedestrian images of the i-th pedestrian obtained in step (1), {P_i^1, P_i^2, …, P_i^15}, into the face detection module for face detection to obtain the sequence of 15 consecutive face images of the i-th person, {F_i^1, F_i^2, …, F_i^15}. The face detection module adopts the face detection module of the open-source face recognition engine SeetaFace, which uses a funnel-structured cascade (FuSt): the top of the FuSt cascade consists of several fast LAB cascade classifiers for different poses, followed by several multilayer perceptron (MLP) cascades based on SURF features, and finally a unified MLP cascade processes the candidate windows of all poses, retaining the correct face windows to obtain the face images;
Step (5): judge the resolution of the sequence of 15 consecutive face images of the i-th pedestrian obtained in step (4), {F_i^1, F_i^2, …, F_i^15}; face images with resolution greater than 112×112 are not subjected to super-resolution processing, while face images with resolution smaller than 112×112 are subjected to super-resolution processing, finally yielding the sequence of 15 consecutive higher-resolution face images of the i-th pedestrian;
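A minimal sketch of the resolution gate in step (5); plain bicubic upscaling stands in for the super-resolution model, which the patent does not specify:

```python
# Hedged sketch: pass through faces at or above 112x112, upscale smaller ones.
import cv2

def ensure_resolution(face_bgr, min_size=(112, 112)):
    h, w = face_bgr.shape[:2]
    if w >= min_size[0] and h >= min_size[1]:
        return face_bgr                          # no super-resolution needed
    return cv2.resize(face_bgr, min_size, interpolation=cv2.INTER_CUBIC)
```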
Step (6): input the face image sequence obtained in step (5) into a face feature extraction sub-network based on a convolutional neural network; the network consists of 20 convolutional layers;
Step (7): perform face key point detection on the face image sequence obtained in step (5) to obtain the corresponding 5 face key points; compute the face key point trajectories from the position changes of the 5 face key points using the key point trajectory calculation formula, normalize each of the 5 resulting face key point trajectory vectors using the key point trajectory normalization formula, and concatenate the 5 normalized vectors as the face key point trajectory feature; fuse the face key point trajectory feature with the output of the fully connected layer Fc connected to the face feature extraction sub-network to obtain fusion feature S2, where the dimension of the face key point trajectory feature is 140 and the dimension of the fully connected layer Fc connected to the face feature extraction sub-network is 512;
Step (8): use the fusion feature S2 obtained in step (7) as the input of a face identity feature layer, a gender feature layer, an expression feature layer and an age feature layer; the face identity feature layer serves as the input of the identity classification layer, the gender feature layer as the input of the gender classification layer, the expression feature layer as the input of the expression classification layer, and the age feature layer as the input of the age classification layer;
Step (9): fuse the fusion feature S1 obtained in step (3) with the fusion feature S2 obtained in step (7) to obtain fusion feature S3, where the dimension of fusion feature S1 is 1016 and the dimension of fusion feature S2 is 652;
Step (10): use the fusion feature S3 obtained in step (9) as the input of a pedestrian identity feature layer, a gender feature layer, a hair style feature layer and a clothes type feature layer; the pedestrian identity feature layer serves as the input of the pedestrian identity classification layer, the gender feature layer as the input of the gender classification layer, the hair style feature layer as the input of the hair style classification layer, and the clothes type feature layer as the input of the clothes type classification layer;
the formula for calculating the track of the key points is as follows:
T_{i,j}^{k,t} = P_{i,j+1}^{k,t} - P_{i,j}^{k,t} = ( x_{i,j+1}^{k,t} - x_{i,j}^{k,t} , y_{i,j+1}^{k,t} - y_{i,j}^{k,t} )

wherein, when t = 0, the formula computes the pedestrian key point trajectories of the i-th pedestrian: k denotes the k-th pedestrian key point of the i-th pedestrian, k ∈ [1, m]; j denotes the j-th frame pedestrian image of the i-th pedestrian, j ∈ [1, n-1]; T_{i,j}^{k,0} denotes the trajectory of the k-th pedestrian key point from the j-th frame to the (j+1)-th frame pedestrian image of the i-th pedestrian; P_{i,j+1}^{k,0} = (x_{i,j+1}^{k,0}, y_{i,j+1}^{k,0}) and P_{i,j}^{k,0} = (x_{i,j}^{k,0}, y_{i,j}^{k,0}) denote the coordinates of the k-th pedestrian key point in the (j+1)-th and j-th frame pedestrian images of the i-th pedestrian, respectively, with x and y the corresponding axis coordinates. When j = 1 and k = 1, the trajectory of the 1st pedestrian key point from the 1st frame to the 2nd frame pedestrian image of the i-th person is:

T_{i,1}^{1,0} = ( x_{i,2}^{1,0} - x_{i,1}^{1,0} , y_{i,2}^{1,0} - y_{i,1}^{1,0} )

when t = 1, the formula computes the face key point trajectories of the i-th pedestrian: k denotes the k-th face key point of the i-th pedestrian, k ∈ [1, s]; j denotes the j-th frame face image of the i-th pedestrian, j ∈ [1, n-1]; T_{i,j}^{k,1} denotes the trajectory of the k-th face key point from the j-th frame to the (j+1)-th frame face image of the i-th pedestrian; P_{i,j+1}^{k,1} = (x_{i,j+1}^{k,1}, y_{i,j+1}^{k,1}) and P_{i,j}^{k,1} = (x_{i,j}^{k,1}, y_{i,j}^{k,1}) denote the coordinates of the k-th face key point in the (j+1)-th and j-th frame face images of the i-th pedestrian, respectively. When j = 1 and k = 1, the trajectory of the 1st face key point from the 1st frame to the 2nd frame face image of the i-th person is:

T_{i,1}^{1,1} = ( x_{i,2}^{1,1} - x_{i,1}^{1,1} , y_{i,2}^{1,1} - y_{i,1}^{1,1} )
the key point track normalization formula is as follows:
T̄_i^{k,t} = T_i^{k,t} / Σ_{j=1}^{n-1} ‖T_{i,j}^{k,t}‖

wherein, when t = 0, the formula normalizes the pedestrian key point trajectory vector of the i-th pedestrian: k denotes the k-th pedestrian key point of the i-th pedestrian, k ∈ [1, m]; j denotes the j-th frame pedestrian image of the i-th pedestrian, j ∈ [1, n-1]; T̄_i^{k,0} denotes the k-th pedestrian key point trajectory feature of the n consecutive pedestrian images of the i-th pedestrian, a vector of (n-1)×2 dimensions; T_i^{k,0} = (T_{i,1}^{k,0}, …, T_{i,n-1}^{k,0}) denotes the k-th pedestrian key point trajectory over the n consecutive pedestrian images of the i-th pedestrian; and ‖T_{i,j}^{k,0}‖ denotes the length of the k-th pedestrian key point trajectory from the j-th frame to the (j+1)-th frame pedestrian image of the i-th pedestrian. When n = 15 and k = 1, the normalized trajectory vector of the 1st pedestrian key point over the 15 consecutive pedestrian images of the i-th pedestrian is:

T̄_i^{1,0} = T_i^{1,0} / Σ_{j=1}^{14} ‖T_{i,j}^{1,0}‖

when t = 1, the formula normalizes the face key point trajectory vector of the i-th pedestrian: k denotes the k-th face key point of the i-th pedestrian, k ∈ [1, s]; j denotes the j-th frame face image of the i-th pedestrian, j ∈ [1, n-1]; T̄_i^{k,1} denotes the k-th face key point trajectory feature of the n consecutive face images of the i-th pedestrian, a vector of (n-1)×2 dimensions; T_i^{k,1} denotes the k-th face key point trajectory over the n consecutive face images of the i-th pedestrian; and ‖T_{i,j}^{k,1}‖ denotes the length of the k-th face key point trajectory from the j-th frame to the (j+1)-th frame face image of the i-th pedestrian. When n = 15 and k = 1, the normalized trajectory vector of the 1st face key point over the 15 consecutive face images of the i-th pedestrian is:

T̄_i^{1,1} = T_i^{1,1} / Σ_{j=1}^{14} ‖T_{i,j}^{1,1}‖

Claims (3)

1. A method for designing a network structure for recognizing human faces, pedestrians and attributes thereof based on deep learning, comprising the following steps:
Step (1): input n consecutive frames of video images captured by the surveillance camera into the pedestrian detection and tracking module; when the i-th pedestrian appears in the video images, output the sequence of n consecutive pedestrian images of the i-th pedestrian, {P_i^1, P_i^2, …, P_i^n}; pedestrian detection adopts the open-source Faster R-CNN algorithm, which comprises three basic components: the first is a region proposal network (RPN) structure that generates candidate regions for each surveillance video image, the second is a convolutional neural network that extracts pedestrian features from the candidate regions, and the third is a binary Softmax classifier that judges whether a candidate region contains a pedestrian; pedestrian tracking adopts the optical flow tracking function of OpenCV;
Step (2): input the sequence of n consecutive pedestrian images of the i-th pedestrian obtained in step (1), {P_i^1, P_i^2, …, P_i^n}, into a pedestrian feature extraction sub-network based on a convolutional neural network; the network contains two kinds of layers, convolutional layers and max-pooling layers, where two convolutional layers followed by a max-pooling layer form a substructure, and the pedestrian feature extraction sub-network comprises N such substructures connected in series;
Step (3): perform pedestrian key point detection on the sequence of n consecutive pedestrian images of the i-th person obtained in step (1), {P_i^1, P_i^2, …, P_i^n}, to obtain the corresponding m pedestrian key points; compute the pedestrian key point trajectories from the position changes of the m pedestrian key points, normalize each of the m resulting pedestrian key point trajectory vectors, and concatenate the m normalized vectors as the pedestrian key point trajectory feature; fuse the pedestrian key point trajectory feature with the output of the fully connected layer Pc connected to the pedestrian feature extraction sub-network to obtain fusion feature S1; the feature fusion adopts a concat layer of the deep learning framework Caffe, with the pedestrian key point trajectory feature and the fully connected layer Pc connected to the pedestrian feature extraction sub-network as the inputs of the concat layer, where the dimension of the pedestrian key point trajectory feature is m×(n-1)×2, the dimension of the fully connected layer Pc connected to the pedestrian feature extraction sub-network is D, and the final output of the concat layer is the fusion feature S1;
Step (4): input the sequence of n consecutive pedestrian images of the i-th pedestrian obtained in step (1), {P_i^1, P_i^2, …, P_i^n}, into the face detection module for face detection to obtain the sequence of n consecutive face images of the i-th person, {F_i^1, F_i^2, …, F_i^n}; the face detection module adopts the face detection module of the open-source face recognition engine SeetaFace, which uses a funnel-structured cascade FuSt: the top of the FuSt cascade consists of several fast LAB cascade classifiers for different poses, followed by several multilayer perceptron MLP cascades based on SURF features, and finally a unified MLP cascade processes the candidate windows of all poses, retaining the correct face windows to obtain the face images;
Step (5): judge the resolution of the sequence of n consecutive face images of the i-th pedestrian obtained in step (4), {F_i^1, F_i^2, …, F_i^n}; face images with resolution greater than A×B are not subjected to super-resolution processing, while face images with resolution smaller than A×B are subjected to super-resolution processing, finally yielding the sequence of n consecutive higher-resolution face images of the i-th pedestrian;
Step (6): input the face image sequence obtained in step (5) into a face feature extraction sub-network based on a convolutional neural network; the network consists of M convolutional layers;
Step (7): perform face key point detection on the face image sequence obtained in step (5) to obtain the corresponding s face key points; compute the face key point trajectories from the position changes of the s face key points, normalize each of the s resulting face key point trajectory vectors, and concatenate the s normalized vectors as the face key point trajectory feature; fuse the face key point trajectory feature with the output of the fully connected layer Fc connected to the face feature extraction sub-network to obtain fusion feature S2, where the dimension of the face key point trajectory feature is s×(n-1)×2 and the dimension of the fully connected layer Fc connected to the face feature extraction sub-network is D;
Step (8): use the fusion feature S2 obtained in step (7) as the input of a face identity feature layer, a face attribute 1 feature layer, a face attribute 2 feature layer, …, and a face attribute v feature layer; the face identity feature layer serves as the input of the identity classification layer, the face attribute 1 feature layer as the input of the face attribute 1 classification layer, the face attribute 2 feature layer as the input of the face attribute 2 classification layer, …, and the face attribute v feature layer as the input of the face attribute v classification layer;
Step (9): fuse the fusion feature S1 obtained in step (3) with the fusion feature S2 obtained in step (7) to obtain fusion feature S3, where the dimension of fusion feature S1 is m×(n-1)×2+D and the dimension of fusion feature S2 is s×(n-1)×2+D;
Step (10): use the fusion feature S3 obtained in step (9) as the input of a pedestrian identity feature layer, a pedestrian attribute 1 feature layer, a pedestrian attribute 2 feature layer, …, and a pedestrian attribute v feature layer; the pedestrian identity feature layer serves as the input of the pedestrian identity classification layer, the pedestrian attribute 1 feature layer as the input of the pedestrian attribute 1 classification layer, the pedestrian attribute 2 feature layer as the input of the pedestrian attribute 2 classification layer, …, and the pedestrian attribute v feature layer as the input of the pedestrian attribute v classification layer.
2. The method for designing a face, pedestrian and attribute recognition network structure based on deep learning as claimed in claim 1, wherein the pedestrian key point trajectory in step (3) and the face key point trajectory in step (7) are calculated by the following formula:

T_{i,j}^{k,t} = P_{i,j+1}^{k,t} - P_{i,j}^{k,t} = ( x_{i,j+1}^{k,t} - x_{i,j}^{k,t} , y_{i,j+1}^{k,t} - y_{i,j}^{k,t} )

wherein, when t = 0, the formula computes the pedestrian key point trajectories of the i-th pedestrian: k denotes the k-th pedestrian key point of the i-th pedestrian, k ∈ [1, m]; j denotes the j-th frame pedestrian image of the i-th pedestrian, j ∈ [1, n-1]; T_{i,j}^{k,0} denotes the trajectory of the k-th pedestrian key point from the j-th frame to the (j+1)-th frame pedestrian image of the i-th pedestrian; P_{i,j+1}^{k,0} = (x_{i,j+1}^{k,0}, y_{i,j+1}^{k,0}) and P_{i,j}^{k,0} = (x_{i,j}^{k,0}, y_{i,j}^{k,0}) denote the coordinates of the k-th pedestrian key point in the (j+1)-th and j-th frame pedestrian images of the i-th pedestrian, respectively, with x and y the corresponding axis coordinates;

when t = 1, the formula computes the face key point trajectories of the i-th pedestrian: k denotes the k-th face key point of the i-th pedestrian, k ∈ [1, s]; j denotes the j-th frame face image of the i-th pedestrian, j ∈ [1, n-1]; T_{i,j}^{k,1} denotes the trajectory of the k-th face key point from the j-th frame to the (j+1)-th frame face image of the i-th pedestrian; P_{i,j+1}^{k,1} = (x_{i,j+1}^{k,1}, y_{i,j+1}^{k,1}) and P_{i,j}^{k,1} = (x_{i,j}^{k,1}, y_{i,j}^{k,1}) denote the coordinates of the k-th face key point in the (j+1)-th and j-th frame face images of the i-th pedestrian, respectively, with x and y the corresponding axis coordinates.
3. The method for designing a face, pedestrian and attribute recognition network structure based on deep learning as claimed in claim 1, wherein the pedestrian key point trajectory vectors in step (3) and the face key point trajectory vectors in step (7) are normalized by the following formula:

T̄_i^{k,t} = T_i^{k,t} / Σ_{j=1}^{n-1} ‖T_{i,j}^{k,t}‖

wherein, when t = 0, the formula normalizes the pedestrian key point trajectory vector of the i-th pedestrian: k denotes the k-th pedestrian key point of the i-th pedestrian, k ∈ [1, m]; j denotes the j-th frame pedestrian image of the i-th pedestrian, j ∈ [1, n-1]; T̄_i^{k,0} denotes the k-th pedestrian key point trajectory feature of the n consecutive pedestrian images of the i-th pedestrian, a vector of (n-1)×2 dimensions; T_i^{k,0} = (T_{i,1}^{k,0}, …, T_{i,n-1}^{k,0}) denotes the k-th pedestrian key point trajectory over the n consecutive pedestrian images of the i-th pedestrian; and ‖T_{i,j}^{k,0}‖ denotes the length of the k-th pedestrian key point trajectory from the j-th frame to the (j+1)-th frame pedestrian image of the i-th pedestrian;

when t = 1, the formula normalizes the face key point trajectory vector of the i-th pedestrian: k denotes the k-th face key point of the i-th pedestrian, k ∈ [1, s]; j denotes the j-th frame face image of the i-th pedestrian, j ∈ [1, n-1]; T̄_i^{k,1} denotes the k-th face key point trajectory feature of the n consecutive face images of the i-th pedestrian, a vector of (n-1)×2 dimensions; T_i^{k,1} denotes the k-th face key point trajectory over the n consecutive face images of the i-th pedestrian; and ‖T_{i,j}^{k,1}‖ denotes the length of the k-th face key point trajectory from the j-th frame to the (j+1)-th frame face image of the i-th pedestrian.
CN201810864964.9A 2018-08-01 2018-08-01 Face, pedestrian and attribute recognition network structure design method based on deep learning Active CN109101915B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810864964.9A CN109101915B (en) 2018-08-01 2018-08-01 Face, pedestrian and attribute recognition network structure design method based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810864964.9A CN109101915B (en) 2018-08-01 2018-08-01 Face, pedestrian and attribute recognition network structure design method based on deep learning

Publications (2)

Publication Number Publication Date
CN109101915A CN109101915A (en) 2018-12-28
CN109101915B true CN109101915B (en) 2021-04-27

Family

ID=64848324

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810864964.9A Active CN109101915B (en) 2018-08-01 2018-08-01 Face, pedestrian and attribute recognition network structure design method based on deep learning

Country Status (1)

Country Link
CN (1) CN109101915B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109858402B (en) * 2019-01-16 2021-08-31 腾讯科技(深圳)有限公司 Image detection method, device, terminal and storage medium
CN109886154A (en) * 2019-01-30 2019-06-14 电子科技大学 Most pedestrian's appearance attribute recognition methods according to collection joint training based on Inception V3
CN109829436B (en) * 2019-02-02 2022-05-13 福州大学 Multi-face tracking method based on depth appearance characteristics and self-adaptive aggregation network
CN110084216B (en) * 2019-05-06 2021-11-09 苏州科达科技股份有限公司 Face recognition model training and face recognition method, system, device and medium
CN110298278B (en) * 2019-06-19 2021-06-04 中国计量大学 Underground parking garage pedestrian and vehicle monitoring method based on artificial intelligence
CN110263756A (en) * 2019-06-28 2019-09-20 东北大学 A kind of human face super-resolution reconstructing system based on joint multi-task learning
CN111553231B (en) * 2020-04-21 2023-04-28 上海锘科智能科技有限公司 Face snapshot and deduplication system, method, terminal and medium based on information fusion
CN112818833B (en) * 2021-01-29 2024-04-12 中能国际建筑投资集团有限公司 Face multitasking detection method, system, device and medium based on deep learning

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9437009B2 (en) * 2011-06-20 2016-09-06 University Of Southern California Visual tracking in video images in unconstrained environments by exploiting on-the-fly context using supporters and distracters
CN103116756B (en) * 2013-01-23 2016-07-27 北京工商大学 A kind of persona face detection method and device
CN104077804B (en) * 2014-06-09 2017-03-01 广州嘉崎智能科技有限公司 A kind of method based on multi-frame video picture construction three-dimensional face model
AU2015224526B2 (en) * 2014-09-11 2020-04-30 Iomniscient Pty Ltd An image management system
CN105518744B (en) * 2015-06-29 2018-09-07 北京旷视科技有限公司 Pedestrian recognition methods and equipment again
CN108038409B (en) * 2017-10-27 2021-12-28 江西高创保安服务技术有限公司 Pedestrian detection method

Also Published As

Publication number Publication date
CN109101915A (en) 2018-12-28

Similar Documents

Publication Publication Date Title
CN109101915B (en) Face, pedestrian and attribute recognition network structure design method based on deep learning
Konstantinidis et al. Sign language recognition based on hand and body skeletal data
Zhan et al. Face detection using representation learning
CN111414862B (en) Expression recognition method based on neural network fusion key point angle change
Pu et al. Facial expression recognition from image sequences using twofold random forest classifier
Shirsat et al. Proposed system for criminal detection and recognition on CCTV data using cloud and machine learning
Xia et al. Face occlusion detection using deep convolutional neural networks
Liu et al. Facial attractiveness computation by label distribution learning with deep CNN and geometric features
Yang et al. Face recognition based on MTCNN and integrated application of FaceNet and LBP method
CN117541994A (en) Abnormal behavior detection model and detection method in dense multi-person scene
Archana et al. Real time face detection and optimal face mapping for online classes
Chen et al. A multi-scale fusion convolutional neural network for face detection
Hsiao et al. EfficientNet based iris biometric recognition methods with pupil positioning by U-net
Silwal et al. A novel deep learning system for facial feature extraction by fusing CNN and MB-LBP and using enhanced loss function
Myvizhi et al. Extensive analysis of deep learning-based deepfake video detection
Sajid et al. Facial asymmetry-based feature extraction for different applications: a review complemented by new advances
Liu et al. Lip event detection using oriented histograms of regional optical flow and low rank affinity pursuit
Yang et al. Heterogeneous face detection based on multi‐task cascaded convolutional neural network
Liu et al. Robust saliency-aware distillation for few-shot fine-grained visual recognition
Martinez-Gonzalez et al. Real time face detection using neural networks
Nguyen et al. A method for hand detection based on Internal Haar-like features and Cascaded AdaBoost Classifier
Sadeq et al. Comparison Between Face and Gait Human Recognition Using Enhanced Convolutional Neural Network
Rondón et al. Machine learning models in people detection and identification: a literature review
Ismail et al. A review on Arabic sign language recognition
Papadimitriou et al. Fingerspelled alphabet sign recognition in upper-body videos

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant