CN109829436B

CN109829436B - Multi-face tracking method based on depth appearance characteristics and self-adaptive aggregation network

Info

Publication number: CN109829436B
Application number: CN201910106309.1A
Authority: CN
Inventors: 柯逍; 郑毅腾; 朱敏琛
Original assignee: Fuzhou University
Current assignee: Fuzhou University
Priority date: 2019-02-02
Filing date: 2019-02-02
Publication date: 2022-05-13
Anticipated expiration: 2039-02-02
Also published as: WO2020155873A1; CN109829436A

Abstract

The invention relates to a multi-face tracking method based on depth appearance characteristics and a self-adaptive aggregation network, which comprises the steps of firstly adopting a face recognition data set to train the self-adaptive aggregation network; then, acquiring the position of a human face by using a human face detection method based on a convolutional neural network, initializing a human face target to be tracked, and extracting human face characteristics; then, predicting the position of each face tracking target in the next frame by adopting a Kalman filter, positioning the position of the face in the next frame again, and extracting the characteristics of the detected face; and finally, using a self-adaptive aggregation network to aggregate the face feature set in each tracked face target tracking track, dynamically generating a face depth apparent feature fused with multi-frame information, combining the predicted position and the fused feature, performing similarity calculation and matching with the face position and the feature thereof obtained by detection in the current frame, and updating the tracking state. The invention can improve the performance of face tracking.

Description

Multi-face tracking method based on depth appearance characteristics and self-adaptive aggregation network

Technical Field

The invention relates to the field of pattern recognition and computer vision, in particular to a multi-face tracking method based on depth appearance characteristics and a self-adaptive aggregation network.

Background

In recent years, with social progress and continuous development of science and technology, the problem of video face recognition gradually becomes a popular research field, the research interests of numerous experts and scholars at home and abroad are attracted, the video face recognition is used as an entrance and a basis for video face recognition, the face detection and tracking technology is rapidly developed, and the video face recognition and tracking method is widely applied to the fields of intelligent monitoring, virtual reality perception interfaces, video conferences and the like.

In order to analyze a face, a face must be captured first, which can be realized by a face detection technology and a face tracking technology, and only if a face target is accurately positioned and tracked in a video, the face can be analyzed more carefully, such as face recognition, pose estimation and the like. The target tracking technology is undoubtedly one of the most important technologies in intelligent security, the face tracking technology is a specific application of the current tracking technology, a tracking algorithm is used for processing a moving face in a video sequence, and the face area is kept locked to complete tracking, so that the technology has good application prospects in scenes such as intelligent security, video monitoring and the like.

Face tracking plays an important role in video monitoring, but currently, in a real scene, due to the large change of the face pose and the overlapping and shielding between tracking targets, practical application is difficult.

Disclosure of Invention

In view of this, the present invention provides a multi-face tracking method based on a deep appearance feature and an adaptive aggregation network, which can improve the face tracking performance.

The invention is realized by adopting the following scheme: a multi-face tracking method based on depth appearance characteristics and a self-adaptive aggregation network specifically comprises the following steps:

step S1: training a self-adaptive aggregation network by adopting a face recognition data set;

step S2: acquiring the position of a human face by adopting a convolutional neural network according to an initial input video frame, initializing a human face target to be tracked, extracting human face characteristics and storing;

step S3: predicting the position of each face target in the next frame by adopting a Kalman filter, positioning the position of the face in the next frame again, and extracting characteristics of the detected face;

step S4: and (4) using the self-adaptive aggregation network trained in the step (S1) to aggregate the face feature set in each tracked face target tracking track, dynamically generating a face depth apparent feature fused with multi-frame information, combining the predicted position and the fused feature, performing similarity calculation and matching with the face position and the feature thereof obtained through detection in the current frame, and updating the tracking state.

Further, step S1 specifically includes the following steps:

step S11: collecting a public face recognition data set to obtain pictures and names of related persons;

step S12: and integrating the pictures of the common characters in the plurality of data sets by adopting a fusion strategy, carrying out face detection and face key point positioning by using a pre-trained MTCNN model, carrying out face alignment by applying similarity transformation, and simultaneously subtracting the mean value of each channel on the training set from all the images in the training set to finish data preprocessing and train the self-adaptive aggregation network.

Furthermore, the self-adaptive aggregation network is formed by connecting a depth feature extraction module and a self-adaptive feature aggregation module in series, receives one or more face images of the same person as input, and outputs aggregated features, wherein the depth feature extraction module adopts 34 layers of ResNet as a backbone network, and the self-adaptive feature aggregation module comprises a feature aggregation layer; let B denote the number of samples input, { z_tAnd B represents an input sample number, and the calculation mode of a feature aggregation layer is as follows:

a＝∑_to_tz_t；

wherein q represents a feature vector z_tThe weights of the individual components are parameters that can be learned by using the face recognition signal as a supervisory signal and learning by means of back propagation and gradient descent, v_tIs the output of the sigmoid function and,representing each feature vector z_tIn the range between 0 and 1, o_tNormalized output for L1, such that ∑_to_tAnd a is a feature vector after the aggregation of the B feature vectors.

Further, step S2 specifically includes the following steps:

step S21: let i denote the number of the ith frame of the input video, initially i equals 1, and the pre-trained MTCNN model is used to simultaneously detect the positions D of all facesⁱAnd the position C of its corresponding facial key pointⁱWherein

J is the number of the jth detected face, JⁱThe number of faces detected for the first frame,

wherein

The position of the jth face in the ith frame is shown, x, y, w and h respectively show the coordinates of the upper left corner of the face region and the width and height of the face region,

wherein

Representing a keypoint of the jth face in the ith frame, c₁,c₂,c₃,c₄,c₅Coordinates of a left eye, a right eye, a nose, a left mouth corner and a right mouth corner of the human face are respectively represented;

step S22: position D for each face_j ⁱAnd coordinates of key points of the face

To which a unique identity ID is assigned_k,k＝1,2,...,KⁱWhere K denotes the number of the kth tracking target, KⁱIndicating the number of persons tracking the target at frame i and initiatingChange its corresponding tracker T_k＝{ID_k,P_k,L_k,E_k,A_kTherein ID of_kRepresenting a unique identity, P, of the kth tracked object_kIndicating the face position coordinates assigned to the kth target, L_kCoordinates of key points of the face representing the k-th object, E_kList of face features representing the kth target, A_kDenotes the life cycle of the kth target, initializing Kⁱ＝Jⁱ，

A_k＝1；

Step S23: for T_kPosition P of each face in (1)_kCutting the image to obtain a corresponding face image, and using the corresponding key point position L of the face_kCarrying out face alignment by applying similarity transformation to obtain an aligned face image;

step S24: inputting the aligned face image into the self-adaptive aggregation network to obtain corresponding face depth apparent characteristics, and adding the face depth apparent characteristics into the tracker T_kFeature list E of_k。

Further, step S3 specifically includes the following steps:

step S31: representing the target state of each tracked face as follows:

in the formula, m represents the tracked face target state, u and v represent the central coordinates of the tracked face region, s is the area of the face frame, r is the aspect ratio of the face frame,

respectively representing the velocities of (u, v, s, r) in the image coordinate space;

step S32: each tracker T_kFace position P in_kConversion to (x, y, w, h)

In the form of (1), wherein

Representing the form of the transformed face position of the kth tracking target in the ith frame;

step S33: will be provided with

As a direct observation result of the kth tracking target in the ith frame, the kth tracking target is detected by human faces, and the state of the kth tracking target in the (i + 1) th frame is detected by adopting a Kalman filter based on a linear uniform motion model

Carrying out prediction;

step S34: in the (i + 1) th frame, adopting MTCNN model to make face detection and face key point positioning again to obtain face position Dⁱ⁺¹And face key point Cⁱ⁺¹；

Step S35: for each face position

Based on its facial key points

Completing face alignment by applying similarity transformation, inputting the face alignment into a self-adaptive aggregation network to extract features, and obtaining a feature set Fⁱ⁺¹In which F isⁱ⁺¹And representing the feature sets of all the faces in the (i + 1) th frame.

Further, step S4 specifically includes the following steps:

step S41: tracker T for each face_kA set E of all the characteristics in the historical motion trail_kInputting the self-adaptive aggregation network to obtain an aggregation characteristic f_kWherein f is_kRepresenting an aggregation characteristic output after all characteristic vectors in the history motion trail of the kth tracking target are fused;

step (ii) ofS42: the position state of the kth target predicted by the Kalman filter in the ith frame is in the next frame

Is converted into

In the form of (a);

step S43: bonding of

And the characteristic f after polymerization of the object k_kAnd face position D in the i +1 th frame obtained by face detectionⁱ⁺¹And its feature set Fⁱ⁺¹The following correlation matrix is calculated:

G＝[g_jk],j＝1,2,...,Jⁱ⁺¹,k＝1,2,...,Kⁱ；

in the formula, Jⁱ⁺¹For the number of faces detected in the i +1 th frame, KⁱFor the number of tracked objects in the ith frame,

for the jth personal face detection frame in the (i + 1) th frame and the position state of the kth target predicted by the Kalman filter in the (i + 1) th frame

The degree of coincidence between them,

for the j individual face feature in the (i + 1) th frame

With the k-th target in the i-th frame_kCosine similarity between them, λ is a hyper-parameter, used to balance the weights of the two metrics;

step S44: using the incidence matrix G as the costMatrix, calculating by using Hungarian algorithm to obtain matching result, and detecting the face in the (i + 1) th frame

Associating to the kth tracking target;

step S45: corresponding the subscript in the matching result to the item in the incidence matrix G, and filtering all the subscripts smaller than T_similarityItem g of_jkIt is deleted from the matching result, where T_similarityThe minimum similarity threshold value is a set hyper-parameter and represents the successful matching;

step S46: in the matching result, if the frame is detected

If the association with the kth tracking target is successful, the corresponding tracker T is updated_kPosition state of

Human face key point position

Life cycle a_k＝A_k+1, and corresponding face features

Add to feature list E_kIf the detection frame is not correct

If the association fails, a new tracker is created;

step S47: for each tracker T_kIf it is in life cycle A_k＞T_ageThen delete the tracker, where T_ageThe set hyper-parameter represents the maximum time a tracked object can survive.

Compared with the prior art, the invention has the following beneficial effects:

1. the multi-face tracking method based on the depth appearance characteristics and the self-adaptive aggregation network can effectively track the face in the video, improves the face tracking accuracy and reduces the target switching times.

2. The invention can track the human face in the video on line while ensuring the tracking effect.

3. The invention provides a method for utilizing apparent features of human face depth, which improves the performance of human face tracking by combining information between a spatial position and depth features.

4. Aiming at the problem that all features in the same target tracking track are difficult to be effectively utilized and a plurality of feature sets are effectively compared in the face tracking process, the invention provides a self-adaptive aggregation network, and the importance degree of each feature in the feature sets is adaptively learned and effectively fused through a feature aggregation module, so that the face tracking effect is improved.

Drawings

FIG. 1 is a schematic flow chart of an embodiment of the present invention.

Detailed Description

The invention is further explained below with reference to the drawings and the embodiments.

It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs.

It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of example embodiments according to the present application. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, and it should be understood that when the terms "comprises" and/or "comprising" are used in this specification, they specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof, unless the context clearly indicates otherwise.

As shown in fig. 1, the present embodiment provides a multi-face tracking method based on depth appearance features and an adaptive aggregation network, which specifically includes the following steps:

step S2: acquiring the position of a human face by using a human face detection method based on a convolutional neural network according to an initial input video frame, initializing a human face target to be tracked, extracting human face characteristics and storing the human face characteristics;

step S3: predicting the position of each face target in the next frame by adopting a Kalman filter, positioning the position of the face in the next frame by using a face detection method again, and extracting features of the detected face;

In this embodiment, step S1 specifically includes the following steps:

In this embodiment, the adaptive aggregation network is formed by connecting a depth feature extraction module and an adaptive feature aggregation module in series, and accepts one or more face images of the same person as input and outputs aggregated features, wherein the depth feature extraction module adopts 34 layers of ResNet as a backbone network, and the adaptive feature aggregation module performs adaptive feature aggregationThe module comprises a characteristic polymerization layer; let B denote the number of samples input, { z_tAnd B represents an input sample number, and the calculation mode of a feature aggregation layer is as follows:

a＝∑_to_tz_t；

wherein q represents a feature vector z_tThe weights of the individual components are parameters that can be learned by using the face recognition signal as a supervisory signal and learning by means of back propagation and gradient descent, v_tRepresenting each feature vector z as the output of the sigmoid function_tIn the range between 0 and 1, o_tNormalized output for L1, such that ∑_to_tAnd a is a feature vector after the aggregation of the B feature vectors.

In this embodiment, step S2 specifically includes the following steps:

wherein

The position of the jth face in the ith frame is shown, and x, y, w and h respectively represent peopleThe coordinates of the upper left corner of the face area and its width and height,

wherein

To which a unique identity ID is assigned_k,k＝1,2,...,KⁱWhere K denotes the number of the kth tracking target, KⁱIndicating the number of persons tracking the target at frame i and initializing the tracker T corresponding thereto_k＝{ID_k,P_k,L_k,E_k,A_kTherein ID of_kRepresenting a unique identity, P, of the kth tracked object_kIndicating the face position coordinates assigned to the kth target, L_kCoordinates of key points of the face representing the k-th object, E_kList of face features representing the kth target, A_kRepresenting the life cycle of the kth target, initializing Kⁱ＝Jⁱ，

A_k＝1；

step S24: inputting the aligned face images into a self-adaptive aggregation network to obtain corresponding face depth apparent characteristics, and adding the face depth apparent characteristics into a tracker T_kIs characterized byList E_k。

In this embodiment, step S3 specifically includes the following steps:

step S31: the state of each tracked face target is represented in the form:

step S32: each tracker T_kFace position P in_kConversion to (x, y, w, h)

In the form of (1), wherein

step S33: will be provided with

Carrying out prediction;

Step S35: for each personFace position

Based on its facial key points

In this embodiment, step S4 specifically includes the following steps:

step S42: the position state of the kth target predicted by the Kalman filter in the ith frame is in the next frame

Is converted into

In the form of (a);

step S43: bonding of

G＝[g_jk],j＝1,2,...,Jⁱ⁺¹,k＝1,2,...,Kⁱ；

The degree of coincidence between them,

for the j individual face feature in the (i + 1) th frame

step S44: taking the incidence matrix G as a cost matrix, calculating by using a Hungarian algorithm to obtain a matching result, and detecting the face in the (i + 1) th frame

Associating to the kth tracking target;

step S46: in the matching result, if the frame is detected

Human face key point position

Life cycle a_k＝A_k+1, and corresponding face features

Add to feature list E_kIf the detection frame is not correct

If the association fails, a new tracker is created;

As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

The foregoing is directed to preferred embodiments of the present invention, other and further embodiments of the invention may be devised without departing from the basic scope thereof, and the scope thereof is determined by the claims that follow. However, any simple modification, equivalent change and modification of the above embodiments according to the technical essence of the present invention are within the protection scope of the technical solution of the present invention.

Claims

1. A multi-face tracking method based on depth appearance characteristics and an adaptive aggregation network is characterized in that: the method comprises the following steps:

step S4: using the self-adaptive aggregation network trained in the step S1 to aggregate the face feature set in each tracked face target tracking track, dynamically generating a face depth apparent feature fused with multi-frame information, combining the predicted position and the fused feature, performing similarity calculation and matching with the face position and the feature thereof obtained through detection in the current frame, and updating the tracking state;

step S1 specifically includes the following steps:

step S12: integrating images of people in a plurality of data sets by adopting a fusion strategy, carrying out face detection and face key point positioning by using a pre-trained MTCNN model, carrying out face alignment by applying similarity transformation, simultaneously subtracting the mean value of each channel on a training set from all the images in the training set, completing data preprocessing, and training a self-adaptive aggregation network;

step S2 specifically includes the following steps:

J is the number of the jth detected face, JⁱThe number of faces detected for the ith frame,

wherein

wherein

step S22: for each face position

And coordinates of key points of the face

To which a unique identity ID is assigned_k,k＝1,2,...,KⁱWhere K denotes the number of the kth tracking target, KⁱIndicating the number of persons tracking the target at frame i and initializing the tracker T corresponding thereto_k＝{ID_k,P_k,L_k,E_k,A_kTherein ID of_kRepresenting a unique identity, P, of the kth tracked object_kIndicating the face position coordinates assigned to the kth target, L_kCoordinates of key points of the face representing the k-th object, E_kList of face features representing the kth target, A_kDenotes the life cycle of the kth target, initializing Kⁱ＝Jⁱ，

A_k＝1；

step S24: inputting the aligned face image into the self-adaptive aggregation network to obtain corresponding face depth apparent characteristics, and adding the face depth apparent characteristics into the tracker T_kFeature list E of_k；

Step S3 specifically includes the following steps:

step S31: representing the target state of each tracked face as follows:

step S32: each tracker T_kFace position P in_kConversion to (x, y, w, h)

In the form of (1), wherein

step S33: will be provided with

Carrying out prediction;

Step S35: for each face position

Based on its facial key points

Completing face alignment by applying similarity transformation, inputting the face alignment into a self-adaptive aggregation network to extract features, and obtaining a feature set Fⁱ⁺¹In which F isⁱ⁺¹Representing the feature set of all the human faces in the (i + 1) th frame;

step S4 specifically includes the following steps:

Is converted into

In the form of (a);

step S43: bonding of

G＝[g_jk],j＝1,2,...,Jⁱ⁺¹,k＝1,2,...,Kⁱ；

The degree of coincidence between them,

for the j individual face feature in the (i + 1) th frame

Associating to the kth tracking target;

step S45: corresponding the subscript in the matching result to the item in the incidence matrix G, and filtering all the subscripts smaller than T_similarityItem g of (a)_jkIt is deleted from the matching result, where T_similarityThe minimum similarity threshold value is a set hyper-parameter and represents the successful matching;

step S46: in the matching result, if the frame is detected

Human face key point position

Life cycle a_k＝A_k+1, and corresponding face features

Add to feature list E_kIf the detection frame is not correct

If the association fails, a new tracker is created;

2. The method for tracking multiple faces based on the depth appearance characteristics and the adaptive aggregation network as claimed in claim 1, wherein: the self-adaptive aggregation network is formed by connecting a depth feature extraction module and a self-adaptive feature aggregation module in series, receives one or more face images of the same person as input and outputs aggregated features, wherein the depth feature extraction module adopts 34 layers of ResNet as a backbone network, and the self-adaptive feature aggregation module comprises a feature aggregation layer; let B denote the number of samples input, { z_tAnd B represents an input sample number, and the calculation mode of a feature aggregation layer is as follows:

a＝∑_to_tz_t；

wherein q represents a feature vector z_tThe weight of each component is a parameter which can be learned by using a face recognition signal as a supervision signal and learning by using a back propagation and gradient descent method, v_tRepresenting each feature vector z as the output of the sigmoid function_tIn the range between 0 and 1, o_tNormalized output for L1, such that ∑_to_tAnd a is a feature vector after the aggregation of the B feature vectors.