WO2020155873A1 - Deep apparent features and adaptive aggregation network-based multi-face tracking method - Google Patents

Deep apparent features and adaptive aggregation network-based multi-face tracking method

Info

Publication number
WO2020155873A1
WO2020155873A1 (PCT application PCT/CN2019/124966)
Authority
WO
WIPO (PCT)
Prior art keywords
face
frame
feature
target
tracking
Prior art date
Application number
PCT/CN2019/124966
Other languages
French (fr)
Chinese (zh)
Inventor
柯逍
郑毅腾
朱敏琛
Original Assignee
福州大学
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 福州大学 (Fuzhou University)
Publication of WO2020155873A1 publication Critical patent/WO2020155873A1/en

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • G06T7/11Region-based segmentation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/20Analysis of motion
    • G06T7/246Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/50Depth or shape recovery

Definitions

  • the invention relates to the field of pattern recognition and computer vision, in particular to a multi-face tracking method based on deep appearance features and an adaptive aggregation network.
  • Face tracking is a specific application of object tracking: a tracking algorithm processes the moving faces in a video sequence and keeps each face region locked to complete the tracking. The technology has good application prospects in scenarios such as smart security and video surveillance.
  • Face tracking plays an important role in video surveillance, but in real scenes the large changes in face pose and the overlap and occlusion between tracked targets still make practical application difficult.
  • the purpose of the present invention is to propose a multi-face tracking method based on deep appearance features and an adaptive aggregation network, which can improve the performance of face tracking.
  • the present invention adopts the following scheme to realize: a multi-face tracking method based on deep appearance features and an adaptive aggregation network, which specifically includes the following steps:
  • Step S1: Use a face recognition data set to train the adaptive aggregation network;
  • Step S2: From the initial input video frame, use a convolutional neural network to obtain the face positions, initialize the face targets to be tracked, and extract and save their face features;
  • Step S3: Use a Kalman filter to predict the position of each face target in the next frame, locate the faces again in the next frame, and extract features from the detected faces;
  • Step S4: Use the adaptive aggregation network trained in step S1 to aggregate the face feature set in each tracked target's trajectory, dynamically generating a deep apparent face feature that fuses multi-frame information; combine the predicted position and the fused feature with the face positions and features detected in the current frame, compute similarities, perform matching, and update the tracking state.
  • step S1 specifically includes the following steps:
  • Step S11: Collect public face recognition data sets to obtain pictures of the relevant persons and their names;
  • Step S12: Use a fusion strategy to merge the pictures of persons shared across the data sets, use the pre-trained MTCNN model for face detection and facial key point localization, apply a similarity transformation for face alignment, and subtract from every training image the per-channel mean computed on the training set, completing the data preprocessing used to train the adaptive aggregation network.
  • The adaptive aggregation network consists of a deep feature extraction module and an adaptive feature aggregation module connected in series. It accepts one or more face images of the same person as input and outputs an aggregated feature; the deep feature extraction module uses a 34-layer ResNet as its backbone and the adaptive feature aggregation module contains a feature aggregation layer.
  • In the feature aggregation layer, q is a learnable weight vector applied to the components of each feature vector z_t; it is trained by back-propagation and gradient descent with the face recognition signal as supervision.
  • v_t, the output of the sigmoid function, is the score of feature vector z_t and lies between 0 and 1.
  • step S2 specifically includes the following steps:
  • Step S24: Input the aligned face image into the adaptive aggregation network to obtain the corresponding deep apparent face feature, and add it to the feature list E_k of tracker T_k.
  • step S3 specifically includes the following steps:
  • Step S31: Represent the state of each tracked face target as m = (u, v, s, r, u̇, v̇, ṡ, ṙ), where u and v are the centre coordinates of the tracked face region, s is the area of the face box, r is its aspect ratio, and the dotted terms are the corresponding velocities in image coordinates.
  • Step S33: Take the converted face position of the k-th target in frame i as its direct observation (obtained from face detection), and use a Kalman filter with a linear constant-velocity motion model to predict the target's state in frame i+1;
  • Step S34: In frame i+1, run the MTCNN model again for face detection and facial key point localization, obtaining the face positions D_{i+1} and facial key points C_{i+1};
  • Step S35: For each detected face position, apply a similarity transformation based on its facial key points to complete face alignment, feed the aligned face into the adaptive aggregation network to extract features, and obtain the feature set F_{i+1} of all faces in frame i+1.
  • step S4 specifically includes the following steps:
  • Step S41: For each face tracker T_k, input the set E_k of all features along its historical trajectory into the adaptive aggregation network to obtain the aggregated feature f_k, a single feature produced by fusing all feature vectors of the k-th target's trajectory;
  • Step S42: Convert the position state of the k-th target predicted by the Kalman filter for the next frame into face-box form;
  • Step S43: Combine the predicted box and the aggregated feature f_k of target k with the face positions D_{i+1} and feature set F_{i+1} detected in frame i+1 to compute the association matrix G = [g_jk], j = 1, ..., J_{i+1}, k = 1, ..., K_i;
  • J_{i+1} is the number of faces detected in frame i+1 and K_i is the number of tracked targets in frame i;
  • each entry g_jk combines the degree of overlap between the j-th detection box in frame i+1 and the predicted position of target k with the cosine similarity between the j-th face feature in frame i+1 and the aggregated feature f_k of the k-th target, where λ is a hyperparameter that balances the weights of the two metrics;
  • Step S44: Using the association matrix G as the cost matrix, run the Hungarian algorithm to compute the matching result, associating face detection boxes in frame i+1 with tracked targets;
  • Step S45: Map the subscripts in the matching result to entries of G, filter out all entries g_jk smaller than T_similarity and delete them from the matching result, where T_similarity is a preset hyperparameter giving the minimum similarity for a successful match;
  • Step S47: For each tracker T_k, if its life cycle A_k > T_age, delete the tracker, where T_age is a preset hyperparameter giving the longest time a tracked target may survive.
  • the present invention has the following beneficial effects:
  • The multi-face tracking method based on deep apparent features and an adaptive aggregation network constructed by the invention can effectively track faces in video, improving tracking accuracy and reducing the number of identity switches.
  • the present invention can track the face in the video online while ensuring the tracking effect.
  • During tracking the predicted face position is uncertain, and faces may undergo large pose changes and occlusion; the invention therefore uses deep apparent face features and combines spatial position information with deep feature information, improving face tracking performance.
  • Because it is difficult to exploit all features along one target's trajectory and to compare multiple feature sets effectively, the invention proposes an adaptive aggregation network whose feature aggregation module adaptively learns the importance of each feature in a feature set and fuses them effectively, improving the face tracking result.
  • Fig. 1 is a schematic flowchart of an embodiment of the present invention.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

A multi-face tracking method based on deep apparent features and an adaptive aggregation network: first, an adaptive aggregation network is trained using a face recognition data set; next, a convolutional-neural-network-based face detection method obtains the face positions, the face targets to be tracked are initialized, and face features are extracted; then, a Kalman filter predicts the position of each tracked face target in the next frame, faces are located again in the next frame, and features are extracted from the detected faces; finally, the adaptive aggregation network aggregates the face feature set in each tracked target's trajectory, dynamically generating a deep apparent face feature that fuses multi-frame information, which, together with the predicted position, is compared by similarity with the face positions and features detected in the current frame for matching, and the tracking state is updated. The described method can improve the performance of face tracking.

Description

A multi-face tracking method based on deep apparent features and an adaptive aggregation network

Technical field

The invention relates to the field of pattern recognition and computer vision, and in particular to a multi-face tracking method based on deep apparent features and an adaptive aggregation network.
Background

In recent years, with social progress and the continuous development of technology, video face recognition has gradually become a popular research field and has attracted the interest of many experts and scholars at home and abroad. As the entry point and foundation of video face recognition, face detection and tracking technology has developed rapidly and is widely used in intelligent surveillance, virtual-reality perception interfaces, video conferencing and other fields. Because real video backgrounds are complex and changeable, and the face, as a non-rigid target, may undergo large changes of pose or expression in a video sequence, implementing a robust face tracking algorithm in real scenes remains a major challenge.

To analyse a face we must first capture it, which is achieved by face detection and face tracking; only when face targets are accurately located and tracked in the video can finer analysis, such as face recognition and pose estimation, be carried out. Object tracking is undoubtedly one of the most important technologies in intelligent security, and face tracking is a concrete application of it: a tracking algorithm processes the moving faces in a video sequence and keeps each face region locked to complete the tracking. The technology has good application prospects in scenarios such as smart security and video surveillance.
Technical problem

Face tracking plays an important role in video surveillance, but in real scenes the large changes in face pose and the overlap and occlusion between tracked targets still make practical application difficult.
Technical solution

In view of this, the purpose of the invention is to propose a multi-face tracking method based on deep apparent features and an adaptive aggregation network that can improve the performance of face tracking.
The invention is realised by the following scheme, a multi-face tracking method based on deep apparent features and an adaptive aggregation network that comprises the following steps:

Step S1: Use a face recognition data set to train an adaptive aggregation network.

Step S2: From the initial input video frame, use a convolutional neural network to obtain the face positions, initialize the face targets to be tracked, and extract and save their face features.

Step S3: Use a Kalman filter to predict the position of each face target in the next frame, locate the faces again in the next frame, and extract features from the detected faces.

Step S4: Use the adaptive aggregation network trained in step S1 to aggregate the face feature set in each tracked target's trajectory, dynamically generating a deep apparent face feature that fuses multi-frame information; combine the predicted position and the fused feature with the face positions and features detected in the current frame, compute similarities, perform matching, and update the tracking state.
Further, step S1 specifically includes the following steps:

Step S11: Collect public face recognition data sets to obtain pictures of the relevant persons and their names.

Step S12: Use a fusion strategy to merge the pictures of persons shared across the data sets, use the pre-trained MTCNN model for face detection and facial key point localization, and apply a similarity transformation for face alignment; at the same time subtract from every image in the training set the per-channel mean computed on the training set, completing the data preprocessing, and train the adaptive aggregation network.
Further, the adaptive aggregation network consists of a deep feature extraction module and an adaptive feature aggregation module connected in series. It accepts one or more face images of the same person as input and outputs an aggregated feature. The deep feature extraction module uses a 34-layer ResNet as its backbone network, and the adaptive feature aggregation module contains a feature aggregation layer. Let B denote the number of input samples and {z_t}, t = 1, 2, ..., B, the set of features output by the deep feature extraction module; the feature aggregation layer is computed as

v_t = sigmoid(qᵀ z_t);

o_t = v_t / Σ_{t'} v_{t'};

a = Σ_t o_t z_t;

where q is a learnable weight vector applied to the components of each feature vector z_t, trained by back-propagation and gradient descent with the face recognition signal as the supervisory signal; v_t, the output of the sigmoid function, is the score of feature vector z_t and lies between 0 and 1; o_t is the L1-normalised output, so that Σ_t o_t = 1; and a is the single feature vector obtained by aggregating the B feature vectors.
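A minimal Python sketch of this aggregation layer is given below for illustration only. It assumes the score v_t is the sigmoid of the inner product qᵀz_t (the patent's original formula images are not reproduced in this text), and it uses random vectors in place of real ResNet features; the function and variable names are chosen here for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def aggregate_features(Z, q):
    """Fuse B feature vectors {z_t} into one aggregated feature a.

    Z : array of shape (B, d), outputs of the deep feature extraction module
        for one person.
    q : array of shape (d,), the learnable weight vector of the
        feature aggregation layer.
    """
    v = sigmoid(Z @ q)      # v_t: score of each feature vector, in (0, 1)
    o = v / v.sum()         # o_t: L1 normalisation so that the scores sum to 1
    return o @ Z            # a = sum_t o_t * z_t

# Example with three 128-dimensional features of the same person
Z = np.random.randn(3, 128)
q = np.random.randn(128)
a = aggregate_features(Z, q)   # a has shape (128,)
```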
Further, step S2 specifically includes the following steps:

Step S21: Let i denote the index of the i-th frame of the input video, with i = 1 initially. Use the pre-trained MTCNN model to simultaneously detect the positions D_i of all faces and the positions C_i of their corresponding facial key points, where j denotes the index of the j-th detected face and J_i is the number of faces detected in frame i. The position of the j-th face in frame i is given as (x, y, w, h), i.e. the top-left corner coordinates of the face region together with its width and height, and the key points of the j-th face in frame i are given as (c_1, c_2, c_3, c_4, c_5), the coordinates of the left eye, right eye, nose, left mouth corner and right mouth corner of the face, respectively.

Step S22: Assign to each face position and its facial key point coordinates a unique identity ID_k, k = 1, 2, ..., K_i, where k is the index of the k-th tracked target and K_i is the number of tracked targets in frame i, and initialize the corresponding tracker T_k = {ID_k, P_k, L_k, E_k, A_k}, where ID_k is the unique identity of the k-th tracked target, P_k the face position coordinates assigned to it, L_k its facial key point coordinates, E_k its list of face features and A_k its life cycle. Initialization sets K_i = J_i, assigns P_k and L_k from the corresponding detected face position and key points, and sets A_k = 1.

Step S23: For the position P_k of each face in T_k, crop the image to obtain the corresponding face image, and use the corresponding facial key point positions L_k to apply a similarity transformation for face alignment, obtaining an aligned face image.

Step S24: Input the aligned face image into the adaptive aggregation network to obtain the corresponding deep apparent face feature, and add it to the feature list E_k of tracker T_k.
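The tracker record T_k = {ID_k, P_k, L_k, E_k, A_k} of step S22 can be pictured as a small data structure, for example as sketched below. This is only an illustration of the bookkeeping; the field names and the helper function are hypothetical and not taken from the patent.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class FaceTracker:
    """One tracked face target T_k = {ID_k, P_k, L_k, E_k, A_k}."""
    track_id: int                                   # ID_k: unique identity
    box: Tuple[float, float, float, float]          # P_k = (x, y, w, h)
    landmarks: List[Tuple[float, float]]            # L_k: five facial key points
    features: List = field(default_factory=list)    # E_k: deep apparent features
    age: int = 1                                    # A_k: life cycle

def init_trackers(boxes, landmarks, features):
    """Steps S22-S24: one tracker per face detected in the first frame."""
    trackers = []
    for k, (box, lm, feat) in enumerate(zip(boxes, landmarks, features), start=1):
        tracker = FaceTracker(track_id=k, box=box, landmarks=lm)
        tracker.features.append(feat)   # feature of the aligned face crop
        trackers.append(tracker)
    return trackers
```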
Further, step S3 specifically includes the following steps:

Step S31: Represent the state of each tracked face target as m = (u, v, s, r, u̇, v̇, ṡ, ṙ), where m is the state of the tracked face target, u and v are the centre coordinates of the tracked face region, s is the area of the face box, r is the aspect ratio of the face box, and u̇, v̇, ṡ, ṙ are the velocities of (u, v, s, r) in image coordinate space.

Step S32: Convert the face position P_k = (x, y, w, h) in each tracker T_k into the (u, v, s, r) observation form for the k-th tracked target in frame i.

Step S33: Take this converted observation as the direct measurement of the k-th tracked target in frame i, obtained from face detection, and use a Kalman filter based on a linear constant-velocity motion model to predict the state of the k-th tracked target in frame i+1.

Step S34: In frame i+1, run the MTCNN model again for face detection and facial key point localization, obtaining the face positions D_{i+1} and facial key points C_{i+1}.

Step S35: For each face position, apply a similarity transformation based on its facial key points to complete face alignment, input the aligned face into the adaptive aggregation network to extract features, and obtain the feature set F_{i+1}, where F_{i+1} denotes the feature set of all faces in frame i+1.
Further, step S4 specifically includes the following steps:

Step S41: For each face tracker T_k, input the set E_k of all features along its historical motion trajectory into the adaptive aggregation network to obtain the aggregated feature f_k, a single feature output after fusing all feature vectors in the k-th target's historical trajectory.

Step S42: Convert the position state of the k-th target in the next frame, as predicted by the Kalman filter in frame i, into the (x, y, w, h) face-box form.

Step S43: Combine the predicted box and the aggregated feature f_k of target k with the face positions D_{i+1} and the feature set F_{i+1} obtained by face detection in frame i+1, and compute the association matrix

G = [g_jk], j = 1, 2, ..., J_{i+1}, k = 1, 2, ..., K_i;

where J_{i+1} is the number of faces detected in frame i+1 and K_i is the number of tracked targets in frame i. Each entry g_jk combines the degree of overlap between the j-th face detection box in frame i+1 and the position predicted for the k-th target in frame i+1 by the Kalman filter with the cosine similarity between the j-th face feature in frame i+1 and the aggregated feature f_k of the k-th target; λ is a hyperparameter used to balance the weights of the two metrics.

Step S44: Using the association matrix G as the cost matrix, run the Hungarian algorithm to compute the matching result, associating face detection boxes in frame i+1 with tracked targets.

Step S45: Map the subscripts in the matching result to entries of the association matrix G, filter out all entries g_jk smaller than T_similarity and delete them from the matching result, where T_similarity is a preset hyperparameter giving the minimum similarity threshold for a successful match.

Step S46: In the matching result, if a detection box is successfully associated with the k-th tracked target, update the position state and the facial key point positions in the corresponding tracker T_k, set the life cycle A_k = A_k + 1, and add the corresponding face feature to the feature list E_k; if a detection box fails to be associated, create a new tracker for it.

Step S47: For each tracker T_k, if its life cycle A_k > T_age, delete the tracker, where T_age is a preset hyperparameter representing the longest time a tracked target may survive.
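A sketch of the association and matching of steps S43-S45 is given below. The patent's exact formula for g_jk is contained in an image that is not reproduced here, so a λ-weighted sum of box overlap (IoU) and cosine similarity is assumed; the values of lam and t_similarity are illustrative, not the patent's.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment   # Hungarian algorithm

def iou(a, b):
    """Overlap of two boxes given as (x, y, w, h)."""
    ax2, ay2 = a[0] + a[2], a[1] + a[3]
    bx2, by2 = b[0] + b[2], b[1] + b[3]
    iw = max(0.0, min(ax2, bx2) - max(a[0], b[0]))
    ih = max(0.0, min(ay2, by2) - max(a[1], b[1]))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def associate(pred_boxes, agg_feats, det_boxes, det_feats, lam=0.5, t_similarity=0.3):
    """Steps S43-S45: build G = [g_jk] and match detections to trackers."""
    J, K = len(det_boxes), len(pred_boxes)
    G = np.zeros((J, K))
    for j in range(J):
        for k in range(K):
            # assumed combination: lam * overlap + (1 - lam) * cosine similarity
            G[j, k] = lam * iou(det_boxes[j], pred_boxes[k]) \
                      + (1.0 - lam) * cosine(det_feats[j], agg_feats[k])
    rows, cols = linear_sum_assignment(-G)          # maximise total similarity
    # S45: keep only pairs whose similarity reaches the threshold T_similarity
    return [(j, k) for j, k in zip(rows, cols) if G[j, k] >= t_similarity]
```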
Beneficial effects

Compared with the prior art, the invention has the following beneficial effects:

1. The multi-face tracking method based on deep apparent features and an adaptive aggregation network constructed by the invention can effectively track faces in video, improving tracking accuracy and reducing the number of identity switches.

2. The invention can track faces in a video online while maintaining the tracking quality.

3. During face tracking the predicted face position is uncertain, and faces may undergo large pose changes and occlusion; the invention therefore proposes to use deep apparent face features, and by combining spatial position information with deep feature information it improves the performance of face tracking.

4. During face tracking it is difficult to exploit all features along the trajectory of one target and to compare multiple feature sets effectively; the invention therefore proposes an adaptive aggregation network whose feature aggregation module adaptively learns the importance of each feature in a feature set and fuses them effectively, improving the face tracking result.
Description of the drawings

Fig. 1 is a schematic flowchart of an embodiment of the invention.
Embodiments of the invention

The invention is further described below in conjunction with the drawings and embodiments.

It should be pointed out that the following detailed descriptions are all exemplary and are intended to provide further explanation of the application. Unless otherwise indicated, all technical and scientific terms used herein have the same meaning as commonly understood by a person of ordinary skill in the technical field to which the application belongs.

It should be noted that the terms used here are only for describing specific embodiments and are not intended to limit the exemplary embodiments of the application. As used herein, unless the context clearly indicates otherwise, the singular forms are also intended to include the plural forms; in addition, it should be understood that when the terms "comprising" and/or "including" are used in this specification, they indicate the presence of features, steps, operations, devices, components and/or combinations thereof.
As shown in Fig. 1, this embodiment provides a multi-face tracking method based on deep apparent features and an adaptive aggregation network, which specifically includes the following steps:

Step S1: Use a face recognition data set to train an adaptive aggregation network.

Step S2: From the initial input video frame, use a convolutional-neural-network-based face detection method to obtain the face positions, initialize the face targets to be tracked, and extract and save their face features.

Step S3: Use a Kalman filter to predict the position of each face target in the next frame, use the face detection method again to locate the faces in the next frame, and extract features from the detected faces.

Step S4: Use the adaptive aggregation network trained in step S1 to aggregate the face feature set in each tracked target's trajectory, dynamically generating a deep apparent face feature that fuses multi-frame information; combine the predicted position and the fused feature with the face positions and features detected in the current frame, compute similarities, perform matching, and update the tracking state.
In this embodiment, step S1 specifically includes the following steps:

Step S11: Collect public face recognition data sets to obtain pictures of the relevant persons and their names.

Step S12: Use a fusion strategy to merge the pictures of persons shared across the data sets, use the pre-trained MTCNN model for face detection and facial key point localization, and apply a similarity transformation for face alignment; at the same time subtract from every image in the training set the per-channel mean computed on the training set, completing the data preprocessing, and train the adaptive aggregation network.
In this embodiment, the adaptive aggregation network consists of a deep feature extraction module and an adaptive feature aggregation module connected in series. It accepts one or more face images of the same person as input and outputs an aggregated feature. The deep feature extraction module uses a 34-layer ResNet as its backbone network, and the adaptive feature aggregation module contains a feature aggregation layer. Let B denote the number of input samples and {z_t}, t = 1, 2, ..., B, the set of features output by the deep feature extraction module; the feature aggregation layer is computed as

v_t = sigmoid(qᵀ z_t);

o_t = v_t / Σ_{t'} v_{t'};

a = Σ_t o_t z_t;

where q is a learnable weight vector applied to the components of each feature vector z_t, trained by back-propagation and gradient descent with the face recognition signal as the supervisory signal; v_t, the output of the sigmoid function, is the score of feature vector z_t and lies between 0 and 1; o_t is the L1-normalised output, so that Σ_t o_t = 1; and a is the single feature vector obtained by aggregating the B feature vectors.
In this embodiment, step S2 specifically includes the following steps:

Step S21: Let i denote the index of the i-th frame of the input video, with i = 1 initially. Use the pre-trained MTCNN model to simultaneously detect the positions D_i of all faces and the positions C_i of their corresponding facial key points, where j denotes the index of the j-th detected face and J_i is the number of faces detected in frame i. The position of the j-th face in frame i is given as (x, y, w, h), i.e. the top-left corner coordinates of the face region together with its width and height, and the key points of the j-th face in frame i are given as (c_1, c_2, c_3, c_4, c_5), the coordinates of the left eye, right eye, nose, left mouth corner and right mouth corner of the face, respectively.

Step S22: Assign to each face position and its facial key point coordinates a unique identity ID_k, k = 1, 2, ..., K_i, where k is the index of the k-th tracked target and K_i is the number of tracked targets in frame i, and initialize the corresponding tracker T_k = {ID_k, P_k, L_k, E_k, A_k}, where ID_k is the unique identity of the k-th tracked target, P_k the face position coordinates assigned to it, L_k its facial key point coordinates, E_k its list of face features and A_k its life cycle. Initialization sets K_i = J_i, assigns P_k and L_k from the corresponding detected face position and key points, and sets A_k = 1.

Step S23: For the position P_k of each face in T_k, crop the image to obtain the corresponding face image, and use the corresponding facial key point positions L_k to apply a similarity transformation for face alignment, obtaining an aligned face image.

Step S24: Input the aligned face image into the adaptive aggregation network to obtain the corresponding deep apparent face feature, and add it to the feature list E_k of tracker T_k.
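Face alignment by similarity transformation from the five key points, as used in steps S12, S23 and S35, can be sketched as follows. The patent does not specify a target template or any libraries; the 112x112 five-point template below is a commonly used example, and skimage/OpenCV are used purely for illustration.

```python
import numpy as np
import cv2
from skimage import transform as trans

# Illustrative 5-point reference template for a 112x112 aligned face
# (left eye, right eye, nose, left mouth corner, right mouth corner);
# the patent does not specify these coordinates.
REFERENCE = np.array([[38.2946, 51.6963],
                      [73.5318, 51.5014],
                      [56.0252, 71.7366],
                      [41.5493, 92.3655],
                      [70.7299, 92.2041]], dtype=np.float32)

def align_face(image, landmarks, size=112):
    """Estimate the similarity transform mapping the detected key points to the
    reference template and warp the image to an aligned face crop."""
    tform = trans.SimilarityTransform()
    tform.estimate(np.asarray(landmarks, dtype=np.float32), REFERENCE)
    matrix = tform.params[:2, :]      # 2x3 affine part of the similarity transform
    return cv2.warpAffine(image, matrix, (size, size))
```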
In this embodiment, step S3 specifically includes the following steps:

Step S31: Represent the state of each tracked face target as m = (u, v, s, r, u̇, v̇, ṡ, ṙ), where m is the state of the tracked face target, u and v are the centre coordinates of the tracked face region, s is the area of the face box, r is the aspect ratio of the face box, and u̇, v̇, ṡ, ṙ are the velocities of (u, v, s, r) in image coordinate space.

Step S32: Convert the face position P_k = (x, y, w, h) in each tracker T_k into the (u, v, s, r) observation form for the k-th tracked target in frame i.

Step S33: Take this converted observation as the direct measurement of the k-th tracked target in frame i, obtained from face detection, and use a Kalman filter based on a linear constant-velocity motion model to predict the state of the k-th tracked target in frame i+1.

Step S34: In frame i+1, run the MTCNN model again for face detection and facial key point localization, obtaining the face positions D_{i+1} and facial key points C_{i+1}.

Step S35: For each face position, apply a similarity transformation based on its facial key points to complete face alignment, input the aligned face into the adaptive aggregation network to extract features, and obtain the feature set F_{i+1}, where F_{i+1} denotes the feature set of all faces in frame i+1.
In this embodiment, step S4 specifically includes the following steps:

Step S41: For each face tracker T_k, input the set E_k of all features along its historical motion trajectory into the adaptive aggregation network to obtain the aggregated feature f_k, a single feature output after fusing all feature vectors in the k-th target's historical trajectory.

Step S42: Convert the position state of the k-th target in the next frame, as predicted by the Kalman filter in frame i, into the (x, y, w, h) face-box form.

Step S43: Combine the predicted box and the aggregated feature f_k of target k with the face positions D_{i+1} and the feature set F_{i+1} obtained by face detection in frame i+1, and compute the association matrix

G = [g_jk], j = 1, 2, ..., J_{i+1}, k = 1, 2, ..., K_i;

where J_{i+1} is the number of faces detected in frame i+1 and K_i is the number of tracked targets in frame i. Each entry g_jk combines the degree of overlap between the j-th face detection box in frame i+1 and the position predicted for the k-th target in frame i+1 by the Kalman filter with the cosine similarity between the j-th face feature in frame i+1 and the aggregated feature f_k of the k-th target; λ is a hyperparameter used to balance the weights of the two metrics.

Step S44: Using the association matrix G as the cost matrix, run the Hungarian algorithm to compute the matching result, associating face detection boxes in frame i+1 with tracked targets.

Step S45: Map the subscripts in the matching result to entries of the association matrix G, filter out all entries g_jk smaller than T_similarity and delete them from the matching result, where T_similarity is a preset hyperparameter giving the minimum similarity threshold for a successful match.

Step S46: In the matching result, if a detection box is successfully associated with the k-th tracked target, update the position state and the facial key point positions in the corresponding tracker T_k, set the life cycle A_k = A_k + 1, and add the corresponding face feature to the feature list E_k; if a detection box fails to be associated, create a new tracker for it.

Step S47: For each tracker T_k, if its life cycle A_k > T_age, delete the tracker, where T_age is a preset hyperparameter representing the longest time a tracked target may survive.
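The bookkeeping of steps S46 and S47 can be summarised by a small update routine such as the one below. It is a sketch only: trackers are represented as plain dictionaries, the field names and the default t_age value are illustrative, and matches is assumed to be the list of (detection index, tracker index) pairs surviving the threshold of step S45.

```python
def update_tracks(trackers, matches, det_boxes, det_landmarks, det_feats, t_age=30):
    """Steps S46-S47 on a list of tracker dicts with keys
    'id', 'box', 'landmarks', 'features' and 'age'."""
    matched_dets = {j for j, _ in matches}

    # S46: matched pairs refresh position, key points, life cycle and feature list
    for j, k in matches:
        trackers[k]['box'] = det_boxes[j]
        trackers[k]['landmarks'] = det_landmarks[j]
        trackers[k]['age'] += 1                # A_k = A_k + 1, as stated in the patent
        trackers[k]['features'].append(det_feats[j])

    # S46: each unmatched detection box starts a new tracker
    next_id = max((t['id'] for t in trackers), default=0) + 1
    for j in range(len(det_boxes)):
        if j not in matched_dets:
            trackers.append({'id': next_id, 'box': det_boxes[j],
                             'landmarks': det_landmarks[j],
                             'features': [det_feats[j]], 'age': 1})
            next_id += 1

    # S47: delete trackers whose life cycle exceeds T_age
    return [t for t in trackers if t['age'] <= t_age]
```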
Those skilled in the art should understand that embodiments of the application may be provided as a method, a system or a computer program product. Therefore, the application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Moreover, the application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM and optical storage) containing computer-usable program code.

The application is described with reference to flowcharts and/or block diagrams of the method, device (system) and computer program product according to embodiments of the application. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor or other programmable data processing device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing device produce a means for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing device to work in a specific manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction means that implements the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.

These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operation steps are executed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
The above are only preferred embodiments of the invention and do not limit the invention in other forms. Any person familiar with the field may use the technical content disclosed above to make changes or modifications into equivalent embodiments; however, any simple modification, equivalent change or adaptation made to the above embodiments according to the technical essence of the invention, without departing from the content of the technical solution of the invention, still falls within the protection scope of the technical solution of the invention.

Claims (6)

  1. 一种基于深度表观特征和自适应聚合网络的多人脸跟踪方法,其特征在于:包括以下步骤:A multi-face tracking method based on deep appearance features and an adaptive aggregation network is characterized in that it includes the following steps:
    步骤S1:采用人脸识别数据集训练自适应聚合网络;Step S1: Use the face recognition data set to train an adaptive aggregation network;
    步骤S2:根据初始的输入视频帧,采用卷积神经网络获取人脸的位置,初始化待跟踪的人脸目标,提取人脸特征并保存;Step S2: According to the initial input video frame, use the convolutional neural network to obtain the position of the face, initialize the face target to be tracked, extract the face features and save;
    步骤S3:采用卡尔曼滤波器预测每个人脸目标在下一帧的位置,并在下一帧中再次定位人脸所在位置,并对检测出的人脸提取特征;Step S3: Use the Kalman filter to predict the position of each face target in the next frame, and locate the position of the face again in the next frame, and extract features from the detected face;
    Step S4: use the adaptive aggregation network trained in step S1 to aggregate the set of face features in the tracking trajectory of each tracked face target, dynamically generating a deep apparent face feature that fuses multi-frame information; combine the predicted positions and the aggregated features with the face positions and features obtained by detection in the current frame, perform similarity computation and matching, and update the tracking states.
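For orientation only, the following is a minimal Python sketch of the per-frame loop recited in steps S2–S4 of claim 1. The callables detector, embedder, aggregator, predictor and matcher are hypothetical stand-ins for the MTCNN detector, the deep feature extraction module, the adaptive aggregation network, the Kalman predictor and the Hungarian matching step; detections are assumed to be (box, keypoints) pairs. None of these names come from the patent.

    def track_video(frames, detector, embedder, aggregator, predictor, matcher):
        """Illustrative per-frame loop of steps S2-S4 (non-normative sketch)."""
        tracks = {}                      # track id -> {"box", "features", "age"}
        next_id = 1
        for frame in frames:
            detections = detector(frame)             # list of (box, keypoints) pairs
            feats = [embedder(frame, det) for det in detections]
            # Step S3: motion prediction; step S4: per-track feature aggregation.
            predicted = {tid: predictor(t) for tid, t in tracks.items()}
            agg = {tid: aggregator(t["features"]) for tid, t in tracks.items()}
            matches, unmatched = matcher(predicted, agg, detections, feats)
            for tid, j in matches:                    # update matched tracks
                tracks[tid].update(box=detections[j][0], age=tracks[tid]["age"] + 1)
                tracks[tid]["features"].append(feats[j])
            for j in unmatched:                       # start a new track per unmatched detection
                tracks[next_id] = {"box": detections[j][0], "features": [feats[j]], "age": 1}
                next_id += 1
            yield {tid: t["box"] for tid, t in tracks.items()}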
  2. The multi-face tracking method based on deep apparent features and an adaptive aggregation network according to claim 1, characterized in that step S1 specifically comprises the following steps:
    Step S11: collect public face recognition data sets and obtain the pictures and names of the relevant persons;
    Step S12: use a fusion strategy to integrate the pictures of persons shared across the multiple data sets; use a pre-trained MTCNN model for face detection and facial key point localization, and apply a similarity transformation for face alignment; subtract from every image in the training set the per-channel mean computed on the training set to complete the data preprocessing; and train the adaptive aggregation network.
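A hedged sketch of the preprocessing in step S12, assuming OpenCV for the similarity-transform alignment; the five-point reference template and the 112x112 crop size are our assumptions, not values fixed by the claim.

    import cv2
    import numpy as np

    # Assumed canonical 112x112 five-point template (left eye, right eye, nose,
    # left mouth corner, right mouth corner); the patent does not fix these values.
    REFERENCE_5PTS = np.float32([[38.3, 51.7], [73.5, 51.5], [56.0, 71.7],
                                 [41.5, 92.4], [70.7, 92.2]])

    def align_face(image, keypoints, size=(112, 112)):
        """Similarity-transform alignment of one face using its 5 detected landmarks."""
        M, _ = cv2.estimateAffinePartial2D(np.float32(keypoints), REFERENCE_5PTS)
        return cv2.warpAffine(image, M, size)

    def channel_means(images):
        """Per-channel mean over the training set (step S12)."""
        return np.stack(images).astype(np.float64).mean(axis=(0, 1, 2))

    def preprocess(image, mean_rgb):
        """Subtract the training-set per-channel mean from an aligned crop."""
        return image.astype(np.float32) - np.asarray(mean_rgb, dtype=np.float32)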
  3. The multi-face tracking method based on deep apparent features and an adaptive aggregation network according to claim 2, characterized in that the adaptive aggregation network is formed by connecting a deep feature extraction module and an adaptive feature aggregation module in series; it accepts one or more face images of the same person as input and outputs an aggregated feature; the deep feature extraction module uses a 34-layer ResNet as the backbone network, and the adaptive feature aggregation module contains a feature aggregation layer; let B denote the number of input samples and {z_t} the set of output features of the deep feature extraction module, where t = 1, 2, ..., B is the index of the input sample; the feature aggregation layer is computed as:
    v_t = σ(q · z_t);
    o_t = v_t / ∑_t v_t;
    a = ∑_t o_t z_t;
    where q denotes the weights on the components of the feature vectors z_t, a learnable parameter that is learned by back-propagation and gradient descent with the face recognition signal as the supervisory signal; v_t is the output of the sigmoid function, representing the score of each feature vector z_t and ranging between 0 and 1; o_t is the L1-normalized output, such that ∑_t o_t = 1; and a is the single feature vector obtained by aggregating the B feature vectors.
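A minimal PyTorch sketch of the feature aggregation layer of claim 3: a learnable weight vector q scores each embedding through a sigmoid, the scores are L1-normalized, and the embeddings are summed with those weights. The module and variable names are illustrative.

    import torch
    import torch.nn as nn

    class AdaptiveAggregation(nn.Module):
        """Aggregates B per-image embeddings z_t into one vector a (claim 3)."""
        def __init__(self, dim):
            super().__init__()
            self.q = nn.Parameter(torch.zeros(dim))    # learnable component weights

        def forward(self, z):                          # z: (B, dim)
            v = torch.sigmoid(z @ self.q)              # v_t in (0, 1), shape (B,)
            o = v / v.sum()                            # L1 normalization, sum_t o_t = 1
            return (o.unsqueeze(1) * z).sum(dim=0)     # a = sum_t o_t z_t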
  4. The multi-face tracking method based on deep apparent features and an adaptive aggregation network according to claim 1, characterized in that step S2 specifically comprises the following steps:
    Step S21: let i denote the index of the current frame of the input video, with i = 1 initially; use the pre-trained MTCNN model to simultaneously detect the positions D_i of all faces and the positions C_i of their corresponding facial key points, where D_i = {d_i^j}, j = 1, 2, ..., J_i, j being the index of the j-th detected face and J_i the number of faces detected in the i-th frame; d_i^j = (x, y, w, h) denotes the position of the j-th face in the i-th frame, with x, y, w, h the coordinates of the upper-left corner of the face region and its width and height; C_i = {c_i^j}, where c_i^j = (c_1, c_2, c_3, c_4, c_5) denotes the key points of the j-th face in the i-th frame, with c_1, c_2, c_3, c_4, c_5 the coordinates of the left eye, right eye, nose, left mouth corner and right mouth corner of the face, respectively;
    Step S22: for the position d_i^j of each face and its facial key point coordinates c_i^j, assign a unique identity ID_k, k = 1, 2, ..., K_i, where k is the index of the k-th tracking target and K_i is the number of tracked targets at the i-th frame, and initialize the corresponding tracker T_k = {ID_k, P_k, L_k, E_k, A_k}, where ID_k is the unique identity of the k-th tracking target, P_k is the face position coordinates assigned to the k-th target, L_k is the facial key point coordinates of the k-th target, E_k is the face feature list of the k-th target, and A_k is the life cycle of the k-th target; initialize K_i = J_i, P_k = d_i^k, L_k = c_i^k, and A_k = 1;
    Step S23: for the face position P_k of each tracker T_k, crop the image to obtain the corresponding face image, and, using the corresponding facial key point positions L_k, apply a similarity transformation for face alignment to obtain the aligned face image;
    Step S24: input the aligned face image into the adaptive aggregation network to obtain the corresponding deep apparent face feature, and add it to the feature list E_k of the tracker T_k.
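An illustrative Python sketch of the tracker structure T_k = {ID_k, P_k, L_k, E_k, A_k} and its initialization in step S22; the field and function names are ours, not the patent's.

    from dataclasses import dataclass, field
    from typing import List, Tuple
    import numpy as np

    @dataclass
    class Tracker:
        """Per-target state T_k = {ID_k, P_k, L_k, E_k, A_k} from step S22."""
        track_id: int                                   # ID_k
        box: Tuple[float, float, float, float]          # P_k = (x, y, w, h)
        keypoints: np.ndarray                           # L_k, five facial landmarks
        features: List[np.ndarray] = field(default_factory=list)  # E_k
        age: int = 1                                    # A_k

    def init_trackers(boxes, keypoints_list, features):
        """Step S22 sketch: one tracker per detection in the first frame."""
        return [Tracker(k, box, kps, [feat])
                for k, (box, kps, feat) in enumerate(zip(boxes, keypoints_list, features), start=1)]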
  5. The multi-face tracking method based on deep apparent features and an adaptive aggregation network according to claim 1, characterized in that step S3 specifically comprises the following steps:
    Step S31: represent the state of each tracked face target as m = (u, v, s, r, u̇, v̇, ṡ, ṙ), where m denotes the state of the tracked face target, u and v are the center coordinates of the tracked face region, s is the area of the face bounding box, r is the aspect ratio of the face bounding box, and u̇, v̇, ṡ, ṙ denote the respective velocities of (u, v, s, r) in image coordinate space;
    Step S32: convert the face position P_k = (x, y, w, h) in each tracker T_k into the form m_i^k = (u, v, s, r), where m_i^k denotes the converted face position of the k-th tracking target in the i-th frame;
    Step S33: take m_i^k, which is derived from face detection, as the direct observation of the k-th tracking target in the i-th frame, and use a Kalman filter based on a linear constant-velocity motion model to predict the state m̂_{i+1}^k of the k-th tracking target in the (i+1)-th frame;
    Step S34: in the (i+1)-th frame, use the MTCNN model again to perform face detection and facial key point localization, obtaining the face positions D_{i+1} and the facial key points C_{i+1};
    Step S35: for each face position d_{i+1}^j, apply a similarity transformation based on its facial key points c_{i+1}^j to complete face alignment, and input the aligned face into the adaptive aggregation network to extract features, obtaining the feature set F_{i+1}, where F_{i+1} denotes the feature set of all faces in the (i+1)-th frame.
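A small NumPy sketch of the state handling in steps S31–S33: conversion between the detection box (x, y, w, h) and the observation (u, v, s, r), and one mean-propagation step of the linear constant-velocity model. The full Kalman covariance update is omitted for brevity, so this is a simplified illustration rather than the claimed filter.

    import numpy as np

    def box_to_state(box):
        """(x, y, w, h) -> observation (u, v, s, r) as in steps S31/S32."""
        x, y, w, h = box
        return np.array([x + w / 2.0, y + h / 2.0, w * h, w / float(h)])

    def state_to_box(state):
        """(u, v, s, r, ...) -> (x, y, w, h), inverse of box_to_state."""
        u, v, s, r = state[:4]
        w = np.sqrt(s * r)
        h = s / w
        return np.array([u - w / 2.0, v - h / 2.0, w, h])

    # Constant-velocity transition for the 8-dim state (u, v, s, r, du, dv, ds, dr):
    F = np.eye(8)
    F[:4, 4:] = np.eye(4)    # position components advance by their velocities

    def predict(state_8d):
        """One prediction step under the linear constant-velocity model (mean only)."""
        return F @ state_8d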
  6. The multi-face tracking method based on deep apparent features and an adaptive aggregation network according to claim 1, characterized in that step S4 specifically comprises the following steps:
    Step S41: for the tracker T_k of each face, input the set E_k of all features in its historical motion trajectory into the adaptive aggregation network to obtain the aggregated feature f_k, where f_k denotes the single aggregated feature output after fusing all the feature vectors in the historical motion trajectory of the k-th tracking target;
    Step S42: convert the position state m̂_{i+1}^k of the k-th target in the next frame, as predicted by the Kalman filter in the i-th frame, into the bounding-box form p̂_{i+1}^k = (x, y, w, h);
    Step S43: combining p̂_{i+1}^k and the aggregated feature f_k of target k with the face positions D_{i+1} obtained by face detection in the (i+1)-th frame and their feature set F_{i+1}, compute the following association matrix:
    G = [g_jk], j = 1, 2, ..., J_{i+1}, k = 1, 2, ..., K_i;
    where J_{i+1} is the number of faces detected in the (i+1)-th frame; K_i is the number of tracking targets in the i-th frame; the first term measures the degree of overlap between the j-th face detection box in the (i+1)-th frame and the position state p̂_{i+1}^k in the (i+1)-th frame of the k-th target predicted by the Kalman filter in the i-th frame; the second term is the cosine similarity between the j-th face feature f_{i+1}^j in the (i+1)-th frame and the aggregated feature f_k of the k-th target in the i-th frame; and λ is a hyperparameter used to balance the weights of the two metrics;
    Step S44: taking the association matrix G as the cost matrix, use the Hungarian algorithm to compute the matching result, associating the face detection box d_{i+1}^j in the (i+1)-th frame with the k-th tracking target;
    Step S45: map the indices in the matching result to the entries of the association matrix G, filter out all entries g_jk smaller than T_similarity and remove them from the matching result, where T_similarity is a preset hyperparameter denoting the minimum similarity threshold for a successful match;
    Step S46: in the matching result, if the detection box d_{i+1}^j is successfully associated with the k-th tracking target, update the position state P_k = d_{i+1}^j and the facial key point positions L_k = c_{i+1}^j in the corresponding tracker T_k, set the life cycle A_k = A_k + 1, and add the corresponding face feature f_{i+1}^j to the feature list E_k; if the detection box d_{i+1}^j fails to be associated, create a new tracker;
    Step S47: for each tracker T_k, if its life cycle A_k > T_age, delete the tracker, where T_age is a preset hyperparameter denoting the longest time a tracking target can survive.
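A hedged Python sketch of the association in steps S43–S45, assuming the overlap and cosine-similarity terms are blended as a convex combination weighted by λ (the claim only states that λ balances the two metrics) and using SciPy's Hungarian solver for the matching step; names and thresholds are illustrative.

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def iou(box_a, box_b):
        """Overlap between two (x, y, w, h) boxes."""
        ax, ay, aw, ah = box_a
        bx, by, bw, bh = box_b
        ix = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
        iy = max(0.0, min(ay + ah, by + bh) - max(ay, by))
        inter = ix * iy
        return inter / (aw * ah + bw * bh - inter + 1e-9)

    def associate(det_boxes, det_feats, pred_boxes, track_feats, lam=0.5, t_sim=0.3):
        """Build the J x K association matrix G (step S43), solve it with the
        Hungarian algorithm (step S44), and drop pairs below T_similarity (step S45)."""
        J, K = len(det_boxes), len(pred_boxes)
        G = np.zeros((J, K))
        for j in range(J):
            for k in range(K):
                overlap = iou(det_boxes[j], pred_boxes[k])
                cos = float(np.dot(det_feats[j], track_feats[k]) /
                            (np.linalg.norm(det_feats[j]) * np.linalg.norm(track_feats[k]) + 1e-9))
                G[j, k] = lam * overlap + (1.0 - lam) * cos   # assumed blending of the two metrics
        rows, cols = linear_sum_assignment(-G)                # maximize total similarity
        return [(j, k) for j, k in zip(rows, cols) if G[j, k] >= t_sim]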
PCT/CN2019/124966 2019-02-02 2019-12-13 Deep apparent features and adaptive aggregation network-based multi-face tracking method WO2020155873A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910106309.1A CN109829436B (en) 2019-02-02 2019-02-02 Multi-face tracking method based on depth appearance characteristics and self-adaptive aggregation network
CN201910106309.1 2019-02-02

Publications (1)

Publication Number Publication Date
WO2020155873A1 true WO2020155873A1 (en) 2020-08-06

Family

ID=66863393

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/124966 WO2020155873A1 (en) 2019-02-02 2019-12-13 Deep apparent features and adaptive aggregation network-based multi-face tracking method

Country Status (2)

Country Link
CN (1) CN109829436B (en)
WO (1) WO2020155873A1 (en)

Cited By (71)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111784746A (en) * 2020-08-10 2020-10-16 上海高重信息科技有限公司 Multi-target pedestrian tracking method and device under fisheye lens and computer system
CN111899284A (en) * 2020-08-14 2020-11-06 北京交通大学 Plane target tracking method based on parameterized ESM network
CN111932588A (en) * 2020-08-07 2020-11-13 浙江大学 Tracking method of airborne unmanned aerial vehicle multi-target tracking system based on deep learning
CN111932661A (en) * 2020-08-19 2020-11-13 上海交通大学 Facial expression editing system and method and terminal
CN112016440A (en) * 2020-08-26 2020-12-01 杭州云栖智慧视通科技有限公司 Target pushing method based on multi-target tracking
CN112036271A (en) * 2020-08-18 2020-12-04 汇纳科技股份有限公司 Pedestrian re-identification method, system, medium and terminal based on Kalman filtering
CN112053386A (en) * 2020-08-31 2020-12-08 西安电子科技大学 Target tracking method based on depth convolution characteristic self-adaptive integration
CN112085767A (en) * 2020-08-28 2020-12-15 安徽清新互联信息科技有限公司 Passenger flow statistical method and system based on deep optical flow tracking
CN112215155A (en) * 2020-10-13 2021-01-12 北京中电兴发科技有限公司 Face tracking method and system based on multi-feature fusion
CN112287877A (en) * 2020-11-18 2021-01-29 上海泗科智能科技有限公司 Multi-role close-up shot tracking method
CN112288773A (en) * 2020-10-19 2021-01-29 慧视江山科技(北京)有限公司 Multi-scale human body tracking method and device based on Soft-NMS
CN112541418A (en) * 2020-12-04 2021-03-23 北京百度网讯科技有限公司 Method, apparatus, device, medium, and program product for image processing
CN112560669A (en) * 2020-12-14 2021-03-26 杭州趣链科技有限公司 Face posture estimation method and device and electronic equipment
CN112560874A (en) * 2020-12-25 2021-03-26 北京百度网讯科技有限公司 Training method, device, equipment and medium for image recognition model
CN112597944A (en) * 2020-12-29 2021-04-02 北京市商汤科技开发有限公司 Key point detection method and device, electronic equipment and storage medium
CN112651994A (en) * 2020-12-18 2021-04-13 零八一电子集团有限公司 Ground multi-target tracking method
CN112669345A (en) * 2020-12-30 2021-04-16 中山大学 Cloud deployment-oriented multi-target track tracking method and system
CN112668432A (en) * 2020-12-22 2021-04-16 上海幻维数码创意科技股份有限公司 Human body detection tracking method in ground interactive projection system based on YoloV5 and Deepsort
CN112686175A (en) * 2020-12-31 2021-04-20 北京澎思科技有限公司 Face snapshot method, system and computer readable storage medium
CN113033439A (en) * 2021-03-31 2021-06-25 北京百度网讯科技有限公司 Method and device for data processing and electronic equipment
CN113076808A (en) * 2021-03-10 2021-07-06 青岛海纳云科技控股有限公司 Method for accurately acquiring bidirectional pedestrian flow through image algorithm
CN113096156A (en) * 2021-04-23 2021-07-09 中国科学技术大学 End-to-end real-time three-dimensional multi-target tracking method and device for automatic driving
CN113158788A (en) * 2021-03-12 2021-07-23 中国平安人寿保险股份有限公司 Facial expression recognition method and device, terminal equipment and storage medium
CN113158909A (en) * 2021-04-25 2021-07-23 中国科学院自动化研究所 Behavior identification lightweight method, system and equipment based on multi-target tracking
CN113158853A (en) * 2021-04-08 2021-07-23 浙江工业大学 Pedestrian's identification system that makes a dash across red light that combines people's face and human gesture
CN113192105A (en) * 2021-04-16 2021-07-30 嘉联支付有限公司 Method and device for tracking multiple persons and estimating postures indoors
CN113269098A (en) * 2021-05-27 2021-08-17 中国人民解放军军事科学院国防科技创新研究院 Multi-target tracking positioning and motion state estimation method based on unmanned aerial vehicle
CN113313201A (en) * 2021-06-21 2021-08-27 南京挥戈智能科技有限公司 Multi-target detection and distance measurement method based on Swin transducer and ZED camera
CN113377192A (en) * 2021-05-20 2021-09-10 广州紫为云科技有限公司 Motion sensing game tracking method and device based on deep learning
CN113379795A (en) * 2021-05-21 2021-09-10 浙江工业大学 Multi-target tracking and segmenting method based on conditional convolution and optical flow characteristics
CN113408348A (en) * 2021-05-14 2021-09-17 桂林电子科技大学 Video-based face recognition method and device and storage medium
CN113487653A (en) * 2021-06-24 2021-10-08 之江实验室 Adaptive graph tracking method based on track prediction
CN113486771A (en) * 2021-06-30 2021-10-08 福州大学 Video motion uniformity evaluation method and system based on key point detection
CN113658223A (en) * 2021-08-11 2021-11-16 山东建筑大学 Multi-pedestrian detection and tracking method and system based on deep learning
CN113688740A (en) * 2021-08-26 2021-11-23 燕山大学 Indoor posture detection method based on multi-sensor fusion vision
CN113723279A (en) * 2021-08-30 2021-11-30 东南大学 Multi-target tracking acceleration method based on time-space optimization in edge computing environment
CN113724291A (en) * 2021-07-29 2021-11-30 西安交通大学 Multi-panda tracking method, system, terminal equipment and readable storage medium
CN113723361A (en) * 2021-09-18 2021-11-30 西安邮电大学 Video monitoring method and device based on deep learning
CN113762013A (en) * 2020-12-02 2021-12-07 北京沃东天骏信息技术有限公司 Method and device for face recognition
CN113807187A (en) * 2021-08-20 2021-12-17 北京工业大学 Unmanned aerial vehicle video multi-target tracking method based on attention feature fusion
CN113808170A (en) * 2021-09-24 2021-12-17 电子科技大学长三角研究院(湖州) Anti-unmanned aerial vehicle tracking method based on deep learning
CN113850843A (en) * 2021-09-27 2021-12-28 联想(北京)有限公司 Target tracking method and device, electronic equipment and storage medium
CN113920457A (en) * 2021-09-16 2022-01-11 中国农业科学院农业资源与农业区划研究所 Fruit yield estimation method and system based on space and ground information acquisition cooperative processing
CN113936312A (en) * 2021-10-12 2022-01-14 南京视察者智能科技有限公司 Face recognition base screening method based on deep learning graph convolution network
CN114022509A (en) * 2021-09-24 2022-02-08 北京邮电大学 Target tracking method based on monitoring videos of multiple animals and related equipment
CN114120188A (en) * 2021-11-19 2022-03-01 武汉大学 Multi-pedestrian tracking method based on joint global and local features
CN114332909A (en) * 2021-11-16 2022-04-12 南京行者易智能交通科技有限公司 Binocular pedestrian identification method and device under monitoring scene
CN114339398A (en) * 2021-12-24 2022-04-12 天翼视讯传媒有限公司 Method for real-time special effect processing in large-scale video live broadcast
CN114419151A (en) * 2021-12-31 2022-04-29 福州大学 Multi-target tracking method based on contrast learning
CN114529577A (en) * 2022-01-10 2022-05-24 燕山大学 Multi-target tracking method for road side visual angles
CN114627339A (en) * 2021-11-09 2022-06-14 昆明物理研究所 Intelligent recognition and tracking method for border crossing personnel in dense jungle area and storage medium
CN114639129A (en) * 2020-11-30 2022-06-17 北京君正集成电路股份有限公司 Paper medium living body detection method for access control system
CN114663796A (en) * 2022-01-04 2022-06-24 北京航空航天大学 Target person continuous tracking method, device and system
CN114783043A (en) * 2022-06-24 2022-07-22 杭州安果儿智能科技有限公司 Child behavior track positioning method and system
CN114821702A (en) * 2022-03-15 2022-07-29 电子科技大学 Thermal infrared face recognition method based on face shielding
CN114863539A (en) * 2022-06-09 2022-08-05 福州大学 Portrait key point detection method and system based on feature fusion
CN114898458A (en) * 2022-04-15 2022-08-12 中国兵器装备集团自动化研究所有限公司 Factory floor number monitoring method, system, terminal and medium based on image processing
CN114943924A (en) * 2022-06-21 2022-08-26 深圳大学 Pain assessment method, system, device and medium based on facial expression video
CN114972426A (en) * 2022-05-18 2022-08-30 北京理工大学 Single-target tracking method based on attention and convolution
CN115272404A (en) * 2022-06-17 2022-11-01 江南大学 Multi-target tracking method based on nuclear space and implicit space feature alignment
CN115690545A (en) * 2021-12-03 2023-02-03 北京百度网讯科技有限公司 Training target tracking model and target tracking method and device
CN115994929A (en) * 2023-03-24 2023-04-21 中国兵器科学研究院 Multi-target tracking method integrating space motion and apparent feature learning
CN116596958A (en) * 2023-07-18 2023-08-15 四川迪晟新达类脑智能技术有限公司 Target tracking method and device based on online sample augmentation
CN117011335A (en) * 2023-07-26 2023-11-07 山东大学 Multi-target tracking method and system based on self-adaptive double decoders
CN117455955A (en) * 2023-12-14 2024-01-26 武汉纺织大学 Pedestrian multi-target tracking method based on unmanned aerial vehicle visual angle
CN117576166A (en) * 2024-01-15 2024-02-20 浙江华是科技股份有限公司 Target tracking method and system based on camera and low-frame-rate laser radar
CN117809054A (en) * 2024-02-29 2024-04-02 南京邮电大学 Multi-target tracking method based on feature decoupling fusion network
CN118072000A (en) * 2024-04-17 2024-05-24 中国科学院合肥物质科学研究院 Fish detection method based on novel target recognition algorithm
CN118379608A (en) * 2024-06-26 2024-07-23 浙江大学 High-robustness deep forgery detection method based on self-adaptive learning
CN118522058A (en) * 2024-07-22 2024-08-20 中电桑达电子设备(江苏)有限公司 Object tracking method, system and medium based on face recognition
CN114863539B (en) * 2022-06-09 2024-09-24 福州大学 Portrait key point detection method and system based on feature fusion

Families Citing this family (26)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109829436B (en) * 2019-02-02 2022-05-13 福州大学 Multi-face tracking method based on depth appearance characteristics and self-adaptive aggregation network
TWI727337B (en) * 2019-06-06 2021-05-11 大陸商鴻富錦精密工業(武漢)有限公司 Electronic device and face recognition method
CN110490901A (en) * 2019-07-15 2019-11-22 武汉大学 The pedestrian detection tracking of anti-attitudes vibration
CN110414443A (en) * 2019-07-31 2019-11-05 苏州市科远软件技术开发有限公司 A kind of method for tracking target, device and rifle ball link tracking
CN110705478A (en) * 2019-09-30 2020-01-17 腾讯科技(深圳)有限公司 Face tracking method, device, equipment and storage medium
CN111078295B (en) * 2019-11-28 2021-11-12 核芯互联科技(青岛)有限公司 Mixed branch prediction device and method for out-of-order high-performance core
CN111160202B (en) * 2019-12-20 2023-09-05 万翼科技有限公司 Identity verification method, device, equipment and storage medium based on AR equipment
CN111079718A (en) * 2020-01-15 2020-04-28 中云智慧(北京)科技有限公司 Quick face comparison method
CN111275741B (en) * 2020-01-19 2023-09-08 北京迈格威科技有限公司 Target tracking method, device, computer equipment and storage medium
CN111325279B (en) * 2020-02-26 2022-06-10 福州大学 Pedestrian and personal sensitive article tracking method fusing visual relationship
CN111476826A (en) * 2020-04-10 2020-07-31 电子科技大学 Multi-target vehicle tracking method based on SSD target detection
CN111770299B (en) * 2020-04-20 2022-04-19 厦门亿联网络技术股份有限公司 Method and system for real-time face abstract service of intelligent video conference terminal
CN111553234B (en) * 2020-04-22 2023-06-06 上海锘科智能科技有限公司 Pedestrian tracking method and device integrating facial features and Re-ID feature ordering
CN111914613B (en) * 2020-05-21 2024-03-01 淮阴工学院 Multi-target tracking and facial feature information recognition method
CN112001225B (en) * 2020-07-06 2023-06-23 西安电子科技大学 Online multi-target tracking method, system and application
CN112215873A (en) * 2020-08-27 2021-01-12 国网浙江省电力有限公司电力科学研究院 Method for tracking and positioning multiple targets in transformer substation
CN112257502A (en) * 2020-09-16 2021-01-22 深圳微步信息股份有限公司 Pedestrian identification and tracking method and device for surveillance video and storage medium
CN112149557B (en) * 2020-09-22 2022-08-09 福州大学 Person identity tracking method and system based on face recognition
CN112307234A (en) * 2020-11-03 2021-02-02 厦门兆慧网络科技有限公司 Face bottom library synthesis method, system, device and storage medium
CN112597901B (en) * 2020-12-23 2023-12-29 艾体威尔电子技术(北京)有限公司 Device and method for effectively recognizing human face in multiple human face scenes based on three-dimensional ranging
CN112653844A (en) * 2020-12-28 2021-04-13 珠海亿智电子科技有限公司 Camera holder steering self-adaptive tracking adjustment method
CN112581506A (en) * 2020-12-31 2021-03-30 北京澎思科技有限公司 Face tracking method, system and computer readable storage medium
CN112784725B (en) * 2021-01-15 2024-06-07 北京航天自动控制研究所 Pedestrian anti-collision early warning method, device, storage medium and stacker
CN113822211B (en) * 2021-09-27 2023-04-11 山东睿思奥图智能科技有限公司 Interactive person information acquisition method
CN115214430B (en) * 2022-03-23 2023-11-17 广州汽车集团股份有限公司 Vehicle seat adjusting method and vehicle
WO2023184197A1 (en) * 2022-03-30 2023-10-05 京东方科技集团股份有限公司 Target tracking method and apparatus, system, and storage medium

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101216885A (en) * 2008-01-04 2008-07-09 中山大学 Passerby face detection and tracing algorithm based on video
CN101777116B (en) * 2009-12-23 2012-07-25 中国科学院自动化研究所 Method for analyzing facial expressions on basis of motion tracking
US10902243B2 (en) * 2016-10-25 2021-01-26 Deep North, Inc. Vision based target tracking that distinguishes facial feature targets
CN106845385A (en) * 2017-01-17 2017-06-13 腾讯科技(上海)有限公司 The method and apparatus of video frequency object tracking
CN107292911B (en) * 2017-05-23 2021-03-30 南京邮电大学 Multi-target tracking method based on multi-model fusion and data association
CN107492116A (en) * 2017-09-01 2017-12-19 深圳市唯特视科技有限公司 A kind of method that face tracking is carried out based on more display models
CN107609512A (en) * 2017-09-12 2018-01-19 上海敏识网络科技有限公司 A kind of video human face method for catching based on neutral net
CN108509859B (en) * 2018-03-09 2022-08-26 南京邮电大学 Non-overlapping area pedestrian tracking method based on deep neural network

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110051999A1 (en) * 2007-08-31 2011-03-03 Lockheed Martin Corporation Device and method for detecting targets in images based on user-defined classifiers
CN108363997A (en) * 2018-03-20 2018-08-03 南京云思创智信息科技有限公司 It is a kind of in video to the method for real time tracking of particular person
CN109101915A (en) * 2018-08-01 2018-12-28 中国计量大学 Face and pedestrian and Attribute Recognition network structure design method based on deep learning
CN109086724A (en) * 2018-08-09 2018-12-25 北京华捷艾米科技有限公司 A kind of method for detecting human face and storage medium of acceleration
CN109829436A (en) * 2019-02-02 2019-05-31 福州大学 Multi-face tracking method based on depth appearance characteristics and self-adaptive aggregation network

Cited By (112)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111932588A (en) * 2020-08-07 2020-11-13 浙江大学 Tracking method of airborne unmanned aerial vehicle multi-target tracking system based on deep learning
CN111932588B (en) * 2020-08-07 2024-01-30 浙江大学 Tracking method of airborne unmanned aerial vehicle multi-target tracking system based on deep learning
CN111784746A (en) * 2020-08-10 2020-10-16 上海高重信息科技有限公司 Multi-target pedestrian tracking method and device under fisheye lens and computer system
CN111784746B (en) * 2020-08-10 2024-05-03 青岛高重信息科技有限公司 Multi-target pedestrian tracking method and device under fish-eye lens and computer system
CN111899284A (en) * 2020-08-14 2020-11-06 北京交通大学 Plane target tracking method based on parameterized ESM network
CN111899284B (en) * 2020-08-14 2024-04-09 北京交通大学 Planar target tracking method based on parameterized ESM network
CN112036271A (en) * 2020-08-18 2020-12-04 汇纳科技股份有限公司 Pedestrian re-identification method, system, medium and terminal based on Kalman filtering
CN112036271B (en) * 2020-08-18 2023-10-10 汇纳科技股份有限公司 Pedestrian re-identification method, system, medium and terminal based on Kalman filtering
CN111932661A (en) * 2020-08-19 2020-11-13 上海交通大学 Facial expression editing system and method and terminal
CN111932661B (en) * 2020-08-19 2023-10-24 上海艾麒信息科技股份有限公司 Facial expression editing system and method and terminal
CN112016440A (en) * 2020-08-26 2020-12-01 杭州云栖智慧视通科技有限公司 Target pushing method based on multi-target tracking
CN112016440B (en) * 2020-08-26 2024-02-20 杭州云栖智慧视通科技有限公司 Target pushing method based on multi-target tracking
CN112085767A (en) * 2020-08-28 2020-12-15 安徽清新互联信息科技有限公司 Passenger flow statistical method and system based on deep optical flow tracking
CN112053386B (en) * 2020-08-31 2023-04-18 西安电子科技大学 Target tracking method based on depth convolution characteristic self-adaptive integration
CN112053386A (en) * 2020-08-31 2020-12-08 西安电子科技大学 Target tracking method based on depth convolution characteristic self-adaptive integration
CN112215155B (en) * 2020-10-13 2022-10-14 北京中电兴发科技有限公司 Face tracking method and system based on multi-feature fusion
CN112215155A (en) * 2020-10-13 2021-01-12 北京中电兴发科技有限公司 Face tracking method and system based on multi-feature fusion
CN112288773A (en) * 2020-10-19 2021-01-29 慧视江山科技(北京)有限公司 Multi-scale human body tracking method and device based on Soft-NMS
CN112287877A (en) * 2020-11-18 2021-01-29 上海泗科智能科技有限公司 Multi-role close-up shot tracking method
CN114639129A (en) * 2020-11-30 2022-06-17 北京君正集成电路股份有限公司 Paper medium living body detection method for access control system
CN114639129B (en) * 2020-11-30 2024-05-03 北京君正集成电路股份有限公司 Paper medium living body detection method for access control system
CN113762013A (en) * 2020-12-02 2021-12-07 北京沃东天骏信息技术有限公司 Method and device for face recognition
CN112541418B (en) * 2020-12-04 2024-05-28 北京百度网讯科技有限公司 Method, apparatus, device, medium and program product for image processing
CN112541418A (en) * 2020-12-04 2021-03-23 北京百度网讯科技有限公司 Method, apparatus, device, medium, and program product for image processing
CN112560669A (en) * 2020-12-14 2021-03-26 杭州趣链科技有限公司 Face posture estimation method and device and electronic equipment
CN112651994A (en) * 2020-12-18 2021-04-13 零八一电子集团有限公司 Ground multi-target tracking method
CN112668432A (en) * 2020-12-22 2021-04-16 上海幻维数码创意科技股份有限公司 Human body detection tracking method in ground interactive projection system based on YoloV5 and Deepsort
CN112560874A (en) * 2020-12-25 2021-03-26 北京百度网讯科技有限公司 Training method, device, equipment and medium for image recognition model
CN112560874B (en) * 2020-12-25 2024-04-16 北京百度网讯科技有限公司 Training method, device, equipment and medium for image recognition model
CN112597944A (en) * 2020-12-29 2021-04-02 北京市商汤科技开发有限公司 Key point detection method and device, electronic equipment and storage medium
CN112597944B (en) * 2020-12-29 2024-06-11 北京市商汤科技开发有限公司 Key point detection method and device, electronic equipment and storage medium
CN112669345A (en) * 2020-12-30 2021-04-16 中山大学 Cloud deployment-oriented multi-target track tracking method and system
CN112669345B (en) * 2020-12-30 2023-10-20 中山大学 Cloud deployment-oriented multi-target track tracking method and system
CN112686175A (en) * 2020-12-31 2021-04-20 北京澎思科技有限公司 Face snapshot method, system and computer readable storage medium
CN113076808B (en) * 2021-03-10 2023-05-26 海纳云物联科技有限公司 Method for accurately acquiring bidirectional traffic flow through image algorithm
CN113076808A (en) * 2021-03-10 2021-07-06 青岛海纳云科技控股有限公司 Method for accurately acquiring bidirectional pedestrian flow through image algorithm
CN113158788A (en) * 2021-03-12 2021-07-23 中国平安人寿保险股份有限公司 Facial expression recognition method and device, terminal equipment and storage medium
CN113158788B (en) * 2021-03-12 2024-03-08 中国平安人寿保险股份有限公司 Facial expression recognition method and device, terminal equipment and storage medium
CN113033439B (en) * 2021-03-31 2023-10-20 北京百度网讯科技有限公司 Method and device for data processing and electronic equipment
CN113033439A (en) * 2021-03-31 2021-06-25 北京百度网讯科技有限公司 Method and device for data processing and electronic equipment
CN113158853A (en) * 2021-04-08 2021-07-23 浙江工业大学 Pedestrian's identification system that makes a dash across red light that combines people's face and human gesture
CN113192105B (en) * 2021-04-16 2023-10-17 嘉联支付有限公司 Method and device for indoor multi-person tracking and attitude measurement
CN113192105A (en) * 2021-04-16 2021-07-30 嘉联支付有限公司 Method and device for tracking multiple persons and estimating postures indoors
CN113096156A (en) * 2021-04-23 2021-07-09 中国科学技术大学 End-to-end real-time three-dimensional multi-target tracking method and device for automatic driving
CN113096156B (en) * 2021-04-23 2024-05-24 中国科学技术大学 Automatic driving-oriented end-to-end real-time three-dimensional multi-target tracking method and device
CN113158909A (en) * 2021-04-25 2021-07-23 中国科学院自动化研究所 Behavior identification lightweight method, system and equipment based on multi-target tracking
CN113408348A (en) * 2021-05-14 2021-09-17 桂林电子科技大学 Video-based face recognition method and device and storage medium
CN113408348B (en) * 2021-05-14 2022-08-19 桂林电子科技大学 Video-based face recognition method and device and storage medium
CN113377192B (en) * 2021-05-20 2023-06-20 广州紫为云科技有限公司 Somatosensory game tracking method and device based on deep learning
CN113377192A (en) * 2021-05-20 2021-09-10 广州紫为云科技有限公司 Motion sensing game tracking method and device based on deep learning
CN113379795A (en) * 2021-05-21 2021-09-10 浙江工业大学 Multi-target tracking and segmenting method based on conditional convolution and optical flow characteristics
CN113379795B (en) * 2021-05-21 2024-03-22 浙江工业大学 Multi-target tracking and segmentation method based on conditional convolution and optical flow characteristics
CN113269098B (en) * 2021-05-27 2023-06-16 中国人民解放军军事科学院国防科技创新研究院 Multi-target tracking positioning and motion state estimation method based on unmanned aerial vehicle
CN113269098A (en) * 2021-05-27 2021-08-17 中国人民解放军军事科学院国防科技创新研究院 Multi-target tracking positioning and motion state estimation method based on unmanned aerial vehicle
CN113313201A (en) * 2021-06-21 2021-08-27 南京挥戈智能科技有限公司 Multi-target detection and distance measurement method based on Swin transducer and ZED camera
CN113487653B (en) * 2021-06-24 2024-03-26 之江实验室 Self-adaptive graph tracking method based on track prediction
CN113487653A (en) * 2021-06-24 2021-10-08 之江实验室 Adaptive graph tracking method based on track prediction
CN113486771A (en) * 2021-06-30 2021-10-08 福州大学 Video motion uniformity evaluation method and system based on key point detection
CN113486771B (en) * 2021-06-30 2023-07-07 福州大学 Video action uniformity evaluation method and system based on key point detection
CN113724291A (en) * 2021-07-29 2021-11-30 西安交通大学 Multi-panda tracking method, system, terminal equipment and readable storage medium
CN113724291B (en) * 2021-07-29 2024-04-02 西安交通大学 Multi-panda tracking method, system, terminal device and readable storage medium
CN113658223A (en) * 2021-08-11 2021-11-16 山东建筑大学 Multi-pedestrian detection and tracking method and system based on deep learning
CN113658223B (en) * 2021-08-11 2023-08-04 山东建筑大学 Multi-row person detection and tracking method and system based on deep learning
CN113807187B (en) * 2021-08-20 2024-04-02 北京工业大学 Unmanned aerial vehicle video multi-target tracking method based on attention feature fusion
CN113807187A (en) * 2021-08-20 2021-12-17 北京工业大学 Unmanned aerial vehicle video multi-target tracking method based on attention feature fusion
CN113688740B (en) * 2021-08-26 2024-02-27 燕山大学 Indoor gesture detection method based on multi-sensor fusion vision
CN113688740A (en) * 2021-08-26 2021-11-23 燕山大学 Indoor posture detection method based on multi-sensor fusion vision
CN113723279B (en) * 2021-08-30 2022-11-01 东南大学 Multi-target tracking acceleration method based on time-space optimization in edge computing environment
CN113723279A (en) * 2021-08-30 2021-11-30 东南大学 Multi-target tracking acceleration method based on time-space optimization in edge computing environment
CN113920457A (en) * 2021-09-16 2022-01-11 中国农业科学院农业资源与农业区划研究所 Fruit yield estimation method and system based on space and ground information acquisition cooperative processing
CN113723361A (en) * 2021-09-18 2021-11-30 西安邮电大学 Video monitoring method and device based on deep learning
CN113808170B (en) * 2021-09-24 2023-06-27 电子科技大学长三角研究院(湖州) Anti-unmanned aerial vehicle tracking method based on deep learning
CN113808170A (en) * 2021-09-24 2021-12-17 电子科技大学长三角研究院(湖州) Anti-unmanned aerial vehicle tracking method based on deep learning
CN114022509A (en) * 2021-09-24 2022-02-08 北京邮电大学 Target tracking method based on monitoring videos of multiple animals and related equipment
CN113850843A (en) * 2021-09-27 2021-12-28 联想(北京)有限公司 Target tracking method and device, electronic equipment and storage medium
CN113936312A (en) * 2021-10-12 2022-01-14 南京视察者智能科技有限公司 Face recognition base screening method based on deep learning graph convolution network
CN113936312B (en) * 2021-10-12 2024-06-07 南京视察者智能科技有限公司 Face recognition base screening method based on deep learning graph convolution network
CN114627339A (en) * 2021-11-09 2022-06-14 昆明物理研究所 Intelligent recognition and tracking method for border crossing personnel in dense jungle area and storage medium
CN114627339B (en) * 2021-11-09 2024-03-29 昆明物理研究所 Intelligent recognition tracking method and storage medium for cross border personnel in dense jungle area
CN114332909A (en) * 2021-11-16 2022-04-12 南京行者易智能交通科技有限公司 Binocular pedestrian identification method and device under monitoring scene
CN114120188B (en) * 2021-11-19 2024-04-05 武汉大学 Multi-row person tracking method based on joint global and local features
CN114120188A (en) * 2021-11-19 2022-03-01 武汉大学 Multi-pedestrian tracking method based on joint global and local features
CN115690545A (en) * 2021-12-03 2023-02-03 北京百度网讯科技有限公司 Training target tracking model and target tracking method and device
CN115690545B (en) * 2021-12-03 2024-06-11 北京百度网讯科技有限公司 Method and device for training target tracking model and target tracking
CN114339398A (en) * 2021-12-24 2022-04-12 天翼视讯传媒有限公司 Method for real-time special effect processing in large-scale video live broadcast
CN114419151A (en) * 2021-12-31 2022-04-29 福州大学 Multi-target tracking method based on contrast learning
CN114663796A (en) * 2022-01-04 2022-06-24 北京航空航天大学 Target person continuous tracking method, device and system
CN114529577A (en) * 2022-01-10 2022-05-24 燕山大学 Multi-target tracking method for road side visual angles
CN114821702A (en) * 2022-03-15 2022-07-29 电子科技大学 Thermal infrared face recognition method based on face shielding
CN114898458A (en) * 2022-04-15 2022-08-12 中国兵器装备集团自动化研究所有限公司 Factory floor number monitoring method, system, terminal and medium based on image processing
CN114972426A (en) * 2022-05-18 2022-08-30 北京理工大学 Single-target tracking method based on attention and convolution
CN114863539B (en) * 2022-06-09 2024-09-24 福州大学 Portrait key point detection method and system based on feature fusion
CN114863539A (en) * 2022-06-09 2022-08-05 福州大学 Portrait key point detection method and system based on feature fusion
CN115272404A (en) * 2022-06-17 2022-11-01 江南大学 Multi-target tracking method based on nuclear space and implicit space feature alignment
CN114943924A (en) * 2022-06-21 2022-08-26 深圳大学 Pain assessment method, system, device and medium based on facial expression video
CN114943924B (en) * 2022-06-21 2024-05-14 深圳大学 Pain assessment method, system, equipment and medium based on facial expression video
CN114783043A (en) * 2022-06-24 2022-07-22 杭州安果儿智能科技有限公司 Child behavior track positioning method and system
CN115994929A (en) * 2023-03-24 2023-04-21 中国兵器科学研究院 Multi-target tracking method integrating space motion and apparent feature learning
CN116596958A (en) * 2023-07-18 2023-08-15 四川迪晟新达类脑智能技术有限公司 Target tracking method and device based on online sample augmentation
CN116596958B (en) * 2023-07-18 2023-10-10 四川迪晟新达类脑智能技术有限公司 Target tracking method and device based on online sample augmentation
CN117011335B (en) * 2023-07-26 2024-04-09 山东大学 Multi-target tracking method and system based on self-adaptive double decoders
CN117011335A (en) * 2023-07-26 2023-11-07 山东大学 Multi-target tracking method and system based on self-adaptive double decoders
CN117455955B (en) * 2023-12-14 2024-03-08 武汉纺织大学 Pedestrian multi-target tracking method based on unmanned aerial vehicle visual angle
CN117455955A (en) * 2023-12-14 2024-01-26 武汉纺织大学 Pedestrian multi-target tracking method based on unmanned aerial vehicle visual angle
CN117576166B (en) * 2024-01-15 2024-04-30 浙江华是科技股份有限公司 Target tracking method and system based on camera and low-frame-rate laser radar
CN117576166A (en) * 2024-01-15 2024-02-20 浙江华是科技股份有限公司 Target tracking method and system based on camera and low-frame-rate laser radar
CN117809054B (en) * 2024-02-29 2024-05-10 南京邮电大学 Multi-target tracking method based on feature decoupling fusion network
CN117809054A (en) * 2024-02-29 2024-04-02 南京邮电大学 Multi-target tracking method based on feature decoupling fusion network
CN118072000A (en) * 2024-04-17 2024-05-24 中国科学院合肥物质科学研究院 Fish detection method based on novel target recognition algorithm
CN118379608A (en) * 2024-06-26 2024-07-23 浙江大学 High-robustness deep forgery detection method based on self-adaptive learning
CN118522058A (en) * 2024-07-22 2024-08-20 中电桑达电子设备(江苏)有限公司 Object tracking method, system and medium based on face recognition
CN118522058B (en) * 2024-07-22 2024-09-17 中电桑达电子设备(江苏)有限公司 Object tracking method, system and medium based on face recognition

Also Published As

Publication number Publication date
CN109829436A (en) 2019-05-31
CN109829436B (en) 2022-05-13

Similar Documents

Publication Publication Date Title
WO2020155873A1 (en) Deep apparent features and adaptive aggregation network-based multi-face tracking method
CN110472554B (en) Table tennis action recognition method and system based on attitude segmentation and key point features
Kudo et al. Unsupervised adversarial learning of 3d human pose from 2d joint locations
Liu et al. Human pose estimation in video via structured space learning and halfway temporal evaluation
Arif et al. Automated body parts estimation and detection using salient maps and Gaussian matrix model
CN105574510A (en) Gait identification method and device
CN110135249A (en) Human bodys' response method based on time attention mechanism and LSTM
CN109325440A (en) Human motion recognition method and system
CN112149557B (en) Person identity tracking method and system based on face recognition
Shah et al. Multi-view action recognition using contrastive learning
CN111931654A (en) Intelligent monitoring method, system and device for personnel tracking
CN113111857A (en) Human body posture estimation method based on multi-mode information fusion
Abobakr et al. Body joints regression using deep convolutional neural networks
Mu et al. Resgait: The real-scene gait dataset
CN112906520A (en) Gesture coding-based action recognition method and device
Batool et al. Telemonitoring of daily activities based on multi-sensors data fusion
Raychaudhuri et al. Prior-guided source-free domain adaptation for human pose estimation
Pang et al. Analysis of computer vision applied in martial arts
Yaseen et al. A novel approach based on multi-level bottleneck attention modules using self-guided dropblock for person re-identification
Yu Deep learning methods for human action recognition
CN112487926A (en) Scenic spot feeding behavior identification method based on space-time diagram convolutional network
Li et al. Real-time human action recognition using depth motion maps and convolutional neural networks
Wang et al. Thermal infrared object tracking based on adaptive feature fusion
Su et al. Dynamic facial expression recognition using autoregressive models
Caetano et al. Magnitude-Orientation Stream network and depth information applied to activity recognition

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19913037

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19913037

Country of ref document: EP

Kind code of ref document: A1