CN109299669A - Video human face critical point detection method and device based on double intelligent bodies - Google Patents
- Publication number: CN109299669A (application number CN201811007365.1A)
- Authority
- CN
- China
- Prior art keywords
- detection
- agent
- key point
- tracking
- point detection
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/16—Human faces, e.g. facial parts, sketches or expressions
- G06V40/168—Feature extraction; Face representation
- G06V40/171—Local features and components; Facial parts ; Occluding parts, e.g. glasses; Geometrical relationships
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/41—Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
Abstract
The invention discloses a dual-agent-based video face key point detection method and device. The method includes: establishing a tracking agent and a key point detection agent, respectively, and connecting them through communication channels; outputting the marginal distribution probabilities of the tracking agent and the key point detection agent according to a Bayesian model, and obtaining the conditional probability distributions from the communication information between the two agents; establishing a Markov decision model according to the marginal distribution probabilities and the conditional probability distributions, wherein the tracking agent and the key point detection agent act through variable-length action sequences, simultaneously updating the positions of the detection box and the key points while interactively transmitting information, to obtain a detection result; and optimizing the detection result by establishing a supervised learning training function and a reinforcement learning training function to obtain the final result. The method outputs the face box detection result and the key point detection result in an interactive manner, with the advantages of improving the performance of the detection system and optimizing the detection result.
Description
Technical Field
The invention relates to the technical field of computer vision, and in particular to a dual-agent-based video face key point detection method and device.
Background
In the prior art, with the rapid development of face key point detection in the image domain, video face key point detection has received wide attention in the field of computer vision. In practical applications, the video setting better matches real requirements: a video not only provides more face frames than a single image, but also carries time-dimension information, which helps key point localization as well as subsequent face recognition and liveness detection. The goal of video face key point detection is: given a face video, detect a series of key points, such as facial parts and the face contour, in every frame of the video. In practice, because the captured face video comes from an unconstrained environment, in addition to the large poses, expression changes, and severe occlusion found in static images, illumination changes and motion blur in the video make face key point detection even more difficult.
Over the past decades, many methods have been studied for video face key point detection. Since it is very difficult to detect key points directly from a whole video frame without any prior, most methods adopt a frame-by-frame detection strategy that handles the video face key point problem in a serial fashion: they first generate a high-confidence face detection box for each frame, and then detect key points within the face region framed by that box. Although this strategy reduces the difficulty of key point detection by introducing the detection box as a prior, the resulting key point accuracy depends largely on the generated detection box. Fig. 1 illustrates the effect of the face detection box on key point detection: even a slight deviation of the detection box greatly affects the key point accuracy. This occurs because the detection box is generated without considering any pose or expression information of the face; especially when the face is in an extreme condition, the face region covered by the box often fails to contain all face key points, which ultimately limits the key point detection performance. Therefore, video face key point detection needs to make full use of the interactive information between the face detection box and the key points to guarantee accuracy. Because face key points effectively represent face motion across poses, they can provide additional useful information for generating an accurate detection box. However, most existing video face key point detection methods ignore this mutual information, resulting in low accuracy under extreme conditions.
Disclosure of Invention
The present invention is directed to solving, at least to some extent, one of the technical problems in the related art.
Therefore, one objective of the present invention is to provide a dual-agent-based video face key point detection method, which has the advantages of improving the performance of the detection system and optimizing the detection result.
Another objective of the invention is to provide a dual-agent-based video face key point detection device.
In order to achieve the above object, an embodiment of the present invention provides a dual-agent-based video face key point detection method, including the following steps: establishing a tracking agent and a key point detection agent, respectively, and connecting them through communication channels; outputting the marginal distribution probabilities of the tracking agent and the key point detection agent according to a Bayesian model, and obtaining the conditional probability distributions from the communication information between the tracking agent and the key point detection agent; establishing a Markov decision model according to the marginal distribution probabilities and the conditional probability distributions, wherein the tracking agent and the key point detection agent simultaneously update the positions of the detection box and the key points through variable-length action sequences and interactively transmit information to obtain a detection result; and optimizing the detection result by establishing a supervised learning training function and a reinforcement learning training function to obtain the final result.
According to the dual-agent-based video face key point detection method of the embodiment of the invention, the tracking agent and the key point detection agent are established respectively, the video face key point detection problem is analyzed in a probabilistic manner according to a Bayesian model, the positions of the detection box and the key points are updated simultaneously through the Markov decision model, and the detection results of the face box and the key points are output in an interactive manner, thereby improving the performance of the detection system and optimizing the detection result.
In addition, the video face key point detection method based on the double agents according to the above embodiment of the present invention may further have the following additional technical features:
Further, in an embodiment of the present invention, establishing the tracking agent and the key point detection agent respectively and connecting them through a communication channel further includes: establishing the tracking agent based on a VGG-M model followed by a single-layer Q-network; establishing the key point detection agent as the combination of a cascaded hourglass network and a confidence network; and connecting the two agents through two communication channels encoded by a deconvolution layer and a long short-term memory unit, respectively.
Further, in an embodiment of the present invention, establishing the Markov decision model according to the marginal distribution probabilities and the conditional probability distributions, wherein the tracking agent and the key point detection agent perform variable-length action sequences, update the positions of the detection box and the key points, and interactively transmit information to obtain a detection result, further includes: the tracking agent changes the currently observed region through movement actions, where the movement actions include moving left, right, up, and down, zooming in, and zooming out; and the key point detection agent decides whether the iteration stops by generating a stop or continue action.
Further, in an embodiment of the present invention, the normalized coordinates of the detected key points are used as a representation of the three-dimensional pose information, and a long short-term memory (LSTM) unit is used to memorize pose changes along the time dimension.
Further, in an embodiment of the present invention, the supervised learning training function and the reinforcement learning training function are given by formulas that appear as images in the original publication and are not reproduced in this text.
In order to achieve the above object, an embodiment of the present invention provides a dual-agent-based video face key point detection device, including: an establishing module, used to establish a tracking agent and a key point detection agent, respectively, and connect them through communication channels; a probability distribution acquisition module, used to output the marginal distribution probabilities of the tracking agent and the key point detection agent according to a Bayesian model and to obtain the conditional probability distributions from the communication information between the tracking agent and the key point detection agent; a detection interaction module, used to establish a Markov decision model according to the marginal distribution probabilities and the conditional probability distributions, wherein the tracking agent and the key point detection agent perform variable-length action sequences, simultaneously update the positions of the detection box and the key points, and interactively transmit information to obtain a detection result; and an optimization module, used to optimize the detection result by establishing a supervised learning training function and a reinforcement learning training function to obtain the final result.
According to the dual-agent-based video face key point detection device of the embodiment of the invention, the tracking agent and the key point detection agent are established respectively, video face key point detection is analyzed in a probabilistic manner according to the Bayesian model, the positions of the detection box and the key points are updated simultaneously through the Markov decision model, and the detection results of the face box and the key points are output in an interactive manner, thereby improving the performance of the detection system and optimizing the detection result.
In addition, the video face key point detection device based on the dual agents according to the above embodiment of the present invention may further have the following additional technical features:
Further, in an embodiment of the present invention, the establishing module further includes: establishing the tracking agent based on a VGG-M model followed by a single-layer Q-network; establishing the key point detection agent as the combination of a cascaded hourglass network and a confidence network; and connecting the two agents through two communication channels encoded by a deconvolution layer and a long short-term memory unit, respectively.
Further, in an embodiment of the present invention, the detection interaction module further includes: the tracking agent changes the currently observed region through movement actions, where the movement actions include moving left, right, up, and down, zooming in, and zooming out; and the key point detection agent decides whether the iteration stops by generating a stop or continue action.
Further, in an embodiment of the present invention, the normalized coordinates of the detected key points are used as a representation of the three-dimensional pose information, and a long short-term memory (LSTM) unit is used to memorize pose changes along the time dimension.
Further, in an embodiment of the present invention, the supervised learning training function and the reinforcement learning training function are given by formulas that appear as images in the original publication and are not reproduced in this text.
additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Drawings
The foregoing and/or additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
FIG. 1 is a schematic diagram illustrating the effect of a face detection box on keypoint detection;
FIG. 2 is a flow chart of a method for detecting key points of a video face based on dual agents according to an embodiment of the invention;
fig. 3 is a schematic diagram illustrating the interactive output of the detection of the face frame and the detection of the key points in the video face key point detection method based on the dual agents according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of partial results on the challenge subset of the public face database 300VW for the dual-agent-based video face keypoint detection method according to one embodiment of the present invention; and
fig. 5 is a schematic structural diagram of a video face key point detection apparatus based on dual agents according to an embodiment of the present invention.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are illustrative and intended to be illustrative of the invention and are not to be construed as limiting the invention.
The following describes a method and an apparatus for detecting video face key points based on dual agents according to an embodiment of the present invention with reference to the accompanying drawings, and first, a method for detecting video face key points based on dual agents according to an embodiment of the present invention will be described with reference to the accompanying drawings.
Fig. 2 is a flowchart of a method for detecting key points of a video face based on dual agents according to an embodiment of the present invention.
As shown in fig. 2, the method for detecting video face key points based on dual agents includes the following steps:
in step S101, a tracking agent and a key point detection agent are respectively established and connected via a communication channel.
Specifically, the tracking agent is established based on a VGG-M model followed by a single-layer Q-network, the key point detection agent is established as the combination of a cascaded hourglass network and a confidence network, and the two agents are connected through two communication channels encoded by a deconvolution layer and a long short-term memory unit, respectively.
In one embodiment of the invention, the network structure comprises three parts: the tracking agent structure, the key point detection agent structure, and the communication channels. The tracking agent structure is based on a VGG-M model followed by a single-layer Q-network, and the key point detection agent structure is designed as the combination of a cascaded hourglass network and a confidence network. The two communication channels are encoded by a deconvolution layer and a long short-term memory (LSTM) unit, respectively.
The communication information explicitly encodes the synergy between the two agents. The information passed from the tracking agent to the key point detection agent is intended to provide additional texture information as a prior, improving the robustness of key point detection. The feature map of the tracking agent's third convolutional layer, conv3, is chosen as the message and concatenated in the depth dimension with the first-stage network of the hourglass network used for key point detection. Because the selected feature map of the tracking agent and the feature map of the first-stage network of the key point detection agent differ in size, their scales must be unified before they can be fed into the subsequent network together.
This embodiment adopts a deconvolution operation to enlarge the feature map to the required scale; moreover, since the deconvolution layer contains learnable parameters, training allows the transmitted information to be encoded more appropriately.
The information passed from the key point detection agent to the tracking agent provides additional three-dimensional pose information for detection box tracking, aiming to supply prior knowledge of the face pose for accurate box tracking. To this end, this embodiment uses the normalized coordinates of the detected key points as a representation of the three-dimensional pose information, and uses a long short-term memory (LSTM) unit to memorize pose changes along the time dimension. For stable training, the internal state of the LSTM is updated when a Markov decision process terminates.
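The two channels described above can be sketched numerically. The following is a minimal NumPy illustration with toy sizes and weights (all assumptions; the real model uses a learnable multi-channel deconvolution layer and feeds the normalized coordinates into an LSTM): a stride-2 transposed convolution doubles the spatial scale of the tracking agent's conv3 feature map so it can be stacked with the first-stage hourglass map, and key point coordinates are normalized against the detection box before being passed back to the tracking agent.

```python
import numpy as np

def deconv2x(feat, kernel):
    """Stride-2 transposed convolution on a single-channel (H, W) map:
    produces a (2H, 2W) map so it can be concatenated, in the depth
    dimension, with the larger first-stage hourglass feature map."""
    h, w = feat.shape
    out = np.zeros((2 * h, 2 * w))
    for i in range(h):
        for j in range(w):
            out[2 * i:2 * i + 2, 2 * j:2 * j + 2] += feat[i, j] * kernel
    return out

def normalize_keypoints(kps, box):
    """Map absolute key point coordinates into the unit square of the
    detection box (x, y, w, h) before feeding them to the LSTM channel."""
    x, y, w, h = box
    return (kps - np.array([x, y])) / np.array([w, h])

conv3 = np.ones((7, 7))                      # toy-size conv3 map of the tracking agent
up = deconv2x(conv3, np.full((2, 2), 0.25))  # 2x2 kernel is learnable in the real model
assert up.shape == (14, 14)                  # now matches a 14x14 first-stage map

kps = np.array([[120.0, 80.0], [160.0, 80.0]])
assert np.allclose(normalize_keypoints(kps, (100, 60, 80, 80)),
                   [[0.25, 0.25], [0.75, 0.25]])
```

All feature-map sizes and the kernel value here are illustrative only; the point is the shape arithmetic that makes depth-wise concatenation possible.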
In step S102, the marginal distribution probabilities of the tracking agent and the key point detection agent are output according to the Bayesian model, and the conditional probability distributions are obtained from the communication information between the tracking agent and the key point detection agent.
In one embodiment of the invention, the video face keypoint detection problem is analyzed in a probabilistic manner according to a Bayesian model. The final output of the video face key points can be regarded as a joint probability distribution, as follows:
p(B_k, V_k | I_k, B_{k-1}, V_{k-1}) = p(B_k | I_k, B_{k-1}, V_{k-1}) p(V_k | B_k, I_k, B_{k-1}, V_{k-1}),
wherein B_k denotes the face detection box of frame k, V_k denotes the face key point coordinates of frame k, and I_k denotes the image of frame k.
according to bayes' theorem, the joint probability distribution can be expressed in two ways, namely:
p(Bk|Ik)p(Vk|Bk,Ik)=p(Vk|Ik)p(Bk|Vk,Ik)
wherein,
in this embodiment, two agents, namely, the tracking agent and the key point detecting agent, are defined, and the marginal probability distribution in the above formula is output respectively, and the other two conditional probability distributions are represented by communication information between the two agents, so that the above equation constraint is ensured in an explicit manner through interaction between the two agents.
In step S103, a Markov decision model is established according to the marginal distribution probabilities and the conditional probability distributions, wherein the tracking agent and the key point detection agent act through variable-length action sequences, simultaneously update the positions of the detection box and the key points, and interactively transmit information to obtain a detection result.
In particular, the tracking agent changes the currently observed region by a movement action, wherein the movement action comprises a left, right, up, down, zoom in and zoom out; the key point detection agent decides whether the iteration stops by generating a stop or continue action.
In one embodiment of the invention, the video face keypoint detection problem is modeled as a Markov decision process, and the following explains the key definitions in the Markov decision process:
The state: the face image patch framed by the detection box, obtained by cropping:
s_t = φ(B, I),
wherein φ(B, I) denotes cropping, from the image I, the face region framed by the detection box B.
The actions: the tracking agent generates movement actions to change the currently observed region; specifically, the movement actions are defined as moving left, right, up, and down, zooming in, and zooming out.
The key point detection agent generates a stop/continue action to decide whether the iteration should stop.
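As an illustration of these definitions, the sketch below applies a movement action to a detection box and crops the corresponding state patch. The step ratio ALPHA and the (x, y, w, h) box convention are assumptions for illustration; the patent text does not specify them.

```python
import numpy as np

ALPHA = 0.1  # assumed step ratio; not given in the patent text

def apply_action(box, action):
    """One tracking-agent movement action on a detection box (x, y, w, h)."""
    x, y, w, h = box
    moves = {
        "left":     (x - ALPHA * w, y, w, h),
        "right":    (x + ALPHA * w, y, w, h),
        "up":       (x, y - ALPHA * h, w, h),
        "down":     (x, y + ALPHA * h, w, h),
        "zoom_in":  (x + ALPHA * w / 2, y + ALPHA * h / 2,
                     (1 - ALPHA) * w, (1 - ALPHA) * h),
        "zoom_out": (x - ALPHA * w / 2, y - ALPHA * h / 2,
                     (1 + ALPHA) * w, (1 + ALPHA) * h),
    }
    return moves[action]

def crop(image, box):
    """phi(B, I): the state is the face patch framed by the detection box."""
    x, y, w, h = (int(round(v)) for v in box)
    return image[y:y + h, x:x + w]

image = np.arange(10000).reshape(100, 100)
box = (20, 20, 40, 40)
box = apply_action(box, "right")          # box shifts right by ALPHA * w
assert box == (24.0, 20, 40, 40)
assert crop(image, box).shape == (40, 40)
# The key point detection agent would keep emitting "continue" until its
# confidence network judges the estimate stable, then "stop" ends the episode.
```

The zoom actions keep the box centered while scaling it, which is one plausible convention; the episode length is variable because the stop/continue decision, not a fixed step count, terminates the sequence.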
The reward: the tracking agent and the key point detection agent each have a reward function, given by formulas that appear as images in the original publication and are not reproduced in this text.
In step S104, the detection result is optimized by establishing a supervised learning training function and a reinforcement learning training function to obtain the final result.
In one embodiment of the invention, a two-stage training method is adopted: supervised training followed by reinforcement learning training. The supervised learning training objective function and the reinforcement learning training objective function are given by formulas that appear as images in the original publication and are not reproduced in this text.
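Since the patent's exact objective formulas are not reproduced in the text, the following is only a generic illustration of the second, reinforcement-learning stage: a minimal REINFORCE sketch on a toy two-action problem with an assumed reward, showing how an action that earns reward becomes more probable under a softmax policy.

```python
import numpy as np

rng = np.random.default_rng(1)
theta = np.zeros(2)  # logits of a softmax policy over two toy actions

def policy(theta):
    e = np.exp(theta - theta.max())
    return e / e.sum()

# Assumed toy reward: action 0 is the "good" move (e.g. it improves the box).
rewards = np.array([1.0, 0.0])
lr = 0.1
for _ in range(500):
    p = policy(theta)
    a = rng.choice(2, p=p)             # sample an action from the policy
    grad = -p
    grad[a] += 1.0                     # gradient of log pi(a | theta)
    theta += lr * rewards[a] * grad    # REINFORCE ascent step

assert policy(theta)[0] > 0.8          # the rewarded action now dominates
```

In the actual method the policy networks are the two agents' Q/confidence heads and the rewards are the patent's (unreproduced) formulas; this sketch only conveys the policy-gradient mechanics of the second training stage.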
As shown in fig. 4, good performance can be observed from the partial results on the challenge subset of the public face database 300VW.
According to the dual-agent-based video face key point detection method of the embodiment of the invention, the tracking agent and the key point detection agent are established respectively, the video face key point detection problem is analyzed in a probabilistic manner according to a Bayesian model, the positions of the detection box and the key points are updated simultaneously through the Markov decision model, and the detection results of the face box and the key points are output in an interactive manner, thereby improving the performance of the detection system and optimizing the detection result.
Fig. 5 is a schematic structural diagram of a dual-agent-based video face keypoint detection apparatus according to an embodiment of the present invention.
As shown in fig. 5, the dual agent-based video face keypoint detection apparatus 10 includes: the system comprises a building module 100, a probability distribution obtaining module 200, a detection interaction module 300 and an optimization module 400.
The establishing module 100 is used to establish the tracking agent and the key point detection agent, respectively, and connect them through communication channels. The probability distribution acquisition module 200 is used to output the marginal distribution probabilities of the tracking agent and the key point detection agent according to a Bayesian model, and to obtain the conditional probability distributions from the communication information between the two agents. The detection interaction module 300 is used to establish a Markov decision model according to the marginal distribution probabilities and the conditional probability distributions, wherein the tracking agent and the key point detection agent perform variable-length action sequences, simultaneously update the positions of the detection box and the key points, and interactively transmit information to obtain a detection result. The optimization module 400 is used to optimize the detection result by establishing a supervised learning training function and a reinforcement learning training function to obtain the final result. The dual-agent-based video face key point detection device has the advantages of improving the performance of the detection system and optimizing the detection result.
Further, in an embodiment of the present invention, the establishing module 100 further includes: establishing the tracking agent based on a VGG-M model followed by a single-layer Q-network; establishing the key point detection agent as the combination of a cascaded hourglass network and a confidence network; and connecting the two agents through two communication channels encoded by a deconvolution layer and a long short-term memory unit, respectively.
Further, in an embodiment of the present invention, the detection interaction module 300 further includes: the tracking agent changes the currently observed region through movement actions, where the movement actions include moving left, right, up, and down, zooming in, and zooming out; and the key point detection agent decides whether the iteration stops by generating a stop or continue action.
Further, in an embodiment of the present invention, the normalized coordinates of the detected key points are used as a representation of the three-dimensional pose information, and a long short-term memory (LSTM) unit is used to memorize pose changes along the time dimension.
Further, in an embodiment of the present invention, the supervised learning training function and the reinforcement learning training function are given by formulas that appear as images in the original publication and are not reproduced in this text.
it should be noted that the explanation of the embodiment of the method for detecting key points of a video face based on dual agents is also applicable to the apparatus for detecting key points of a video face based on dual agents in this embodiment, and is not repeated here.
According to the dual-agent-based video face key point detection device of the embodiment of the invention, the tracking agent and the key point detection agent are established respectively, video face key point detection is analyzed in a probabilistic manner according to the Bayesian model, the positions of the detection box and the key points are updated simultaneously through the Markov decision model, and the detection results of the face box and the key points are output in an interactive manner, thereby improving the performance of the detection system and optimizing the detection result.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "a plurality" means at least two, e.g., two, three, etc., unless specifically limited otherwise.
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above are not necessarily intended to refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined and combined by one skilled in the art without contradiction.
Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present invention, and that variations, modifications, substitutions and alterations can be made to the above embodiments by those of ordinary skill in the art within the scope of the present invention.
Claims (10)
1. A dual-agent-based video face keypoint detection method, characterized by comprising the following steps:
establishing a tracking agent and a keypoint detection agent, respectively, and connecting the tracking agent and the keypoint detection agent through communication channels;
outputting marginal probability distributions of the tracking agent and the keypoint detection agent, respectively, according to a Bayesian model, and obtaining conditional probability distributions according to communication information exchanged between the tracking agent and the keypoint detection agent;
establishing a Markov decision model according to the marginal probability distributions and the conditional probability distributions, wherein the tracking agent and the keypoint detection agent simultaneously update the positions of a detection box and keypoints through variable-length action sequences and interactively transmit information to obtain a detection result; and
optimizing the detection result by establishing a supervised learning training function and a reinforcement learning training function to obtain a final result.
2. The dual-agent-based video face keypoint detection method of claim 1, wherein establishing the tracking agent and the keypoint detection agent, respectively, and connecting them through communication channels further comprises:
the tracking agent is built on a VGG-M model followed by a single-layer Q-network; the keypoint detection agent is built by combining a cascaded hourglass network with a confidence network; and the two agents are connected through two communication channels based on a deconvolution layer and long short-term memory (LSTM) encodings.
3. The method of claim 1, wherein establishing the Markov decision model according to the marginal probability distributions and the conditional probability distributions, with the tracking agent and the keypoint detection agent simultaneously updating the positions of the detection box and the keypoints through variable-length action sequences and interactively transmitting information to obtain the detection result, further comprises:
the tracking agent changes the currently observed region through movement actions, wherein the movement actions comprise left, right, up, down, zoom-in, and zoom-out; and
the keypoint detection agent decides whether the iteration stops by generating a stop or continue action.
4. The dual-agent-based video face keypoint detection method of claim 1, wherein normalized detections are used to obtain keypoint coordinates as a representation of three-dimensional pose information, and a long short-term memory (LSTM) unit is used to memorize pose changes in the time dimension.
5. The dual-agent-based video face keypoint detection method of claim 1, wherein the supervised learning training function and the reinforcement learning training function further comprise:
the supervised learning training function is:
the reinforcement learning training function is as follows:
6. A dual-agent-based video face keypoint detection apparatus, characterized by comprising:
an establishing module, configured to establish a tracking agent and a keypoint detection agent, respectively, and to connect them through communication channels;
a probability distribution acquisition module, configured to output marginal probability distributions of the tracking agent and the keypoint detection agent, respectively, according to a Bayesian model, and to obtain conditional probability distributions according to communication information exchanged between the two agents;
a detection interaction module, configured to establish a Markov decision model according to the marginal probability distributions and the conditional probability distributions, wherein the tracking agent and the keypoint detection agent simultaneously update the positions of a detection box and keypoints through variable-length action sequences and interactively transmit information to obtain a detection result; and
an optimization module, configured to optimize the detection result by establishing a supervised learning training function and a reinforcement learning training function to obtain a final result.
7. The dual-agent-based video face keypoint detection apparatus of claim 6, wherein the establishing module further comprises:
the tracking agent is built on a VGG-M model followed by a single-layer Q-network; the keypoint detection agent is built by combining a cascaded hourglass network with a confidence network; and the two agents are connected through two communication channels based on a deconvolution layer and long short-term memory (LSTM) encodings.
8. The dual-agent-based video face keypoint detection apparatus of claim 6, wherein the detection interaction module further comprises:
the tracking agent changes the currently observed region through movement actions, wherein the movement actions comprise left, right, up, down, zoom-in, and zoom-out; and
the keypoint detection agent decides whether the iteration stops by generating a stop or continue action.
9. The dual-agent-based video face keypoint detection apparatus of claim 6, wherein normalized detections are used to obtain keypoint coordinates as a representation of three-dimensional pose information, and a long short-term memory (LSTM) unit is used to memorize pose changes in the time dimension.
10. The dual-agent-based video face keypoint detection apparatus of claim 6, wherein the supervised learning training function and the reinforcement learning training function further comprise:
the supervised learning training function is:
the reinforcement learning training function is as follows:
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811007365.1A CN109299669B (en) | 2018-08-30 | 2018-08-30 | Video face key point detection method and device based on double intelligent agents |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109299669A true CN109299669A (en) | 2019-02-01 |
CN109299669B CN109299669B (en) | 2020-11-13 |
Family
ID=65166024
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811007365.1A Active CN109299669B (en) | 2018-08-30 | 2018-08-30 | Video face key point detection method and device based on double intelligent agents |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109299669B (en) |
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20130150160A1 (en) * | 2007-02-08 | 2013-06-13 | Edge 3 Technologies, Inc. | Method and Apparatus for Tracking of a Plurality of Subjects in a Video Game |
CN101763636A (en) * | 2009-09-23 | 2010-06-30 | 中国科学院自动化研究所 | Method for tracing position and pose of 3D human face in video sequence |
CN106407958A (en) * | 2016-10-28 | 2017-02-15 | 南京理工大学 | Double-layer-cascade-based facial feature detection method |
CN107748858A (en) * | 2017-06-15 | 2018-03-02 | 华南理工大学 | A kind of multi-pose eye locating method based on concatenated convolutional neutral net |
CN107423707A (en) * | 2017-07-25 | 2017-12-01 | 深圳帕罗人工智能科技有限公司 | A kind of face Emotion identification method based under complex environment |
CN107784284A (en) * | 2017-10-24 | 2018-03-09 | 哈尔滨工业大学深圳研究生院 | Face identification method and system |
Non-Patent Citations (2)
Title |
---|
Stefan Duffner and Jean-Marc Odobez: "Track Creation and Deletion Framework for", IEEE Transactions on Image Processing * |
Wang Xiaoxiao: "Research on Face Image Feature Extraction and Recognition Based on Topological Structure", China Master's Theses Full-text Database, Information Science and Technology Series * |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110188769A (en) * | 2019-05-14 | 2019-08-30 | 广州虎牙信息科技有限公司 | Checking method, device, equipment and the storage medium of key point mark |
CN110188769B (en) * | 2019-05-14 | 2023-09-05 | 广州虎牙信息科技有限公司 | Method, device, equipment and storage medium for auditing key point labels |
CN110569724A (en) * | 2019-08-05 | 2019-12-13 | 湖北工业大学 | Face alignment method based on residual hourglass network |
CN110569724B (en) * | 2019-08-05 | 2021-06-04 | 湖北工业大学 | Face alignment method based on residual hourglass network |
CN111625098A (en) * | 2020-06-01 | 2020-09-04 | 广州市大湾区虚拟现实研究院 | Intelligent virtual avatar interaction method and device based on multi-channel information fusion |
CN112926475A (en) * | 2021-03-08 | 2021-06-08 | 电子科技大学 | Human body three-dimensional key point extraction method |
CN112926475B (en) * | 2021-03-08 | 2022-10-21 | 电子科技大学 | Human body three-dimensional key point extraction method |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109299669B (en) | Video face key point detection method and device based on double intelligent agents | |
Dockstader et al. | Multiple camera fusion for multi-object tracking | |
CN111445476B (en) | Monocular depth estimation method based on multi-mode unsupervised image content decoupling | |
Pillai et al. | Towards visual ego-motion learning in robots | |
Chowdhury et al. | 3D face reconstruction from video using a generic model | |
CN103003846B (en) | Articulation region display device, joint area detecting device, joint area degree of membership calculation element, pass nodular region affiliation degree calculation element and joint area display packing | |
US20210233244A1 (en) | System and method for image segmentation using a joint deep learning model | |
WO2022206020A1 (en) | Method and apparatus for estimating depth of field of image, and terminal device and storage medium | |
CN110770758A (en) | Determining the position of a mobile device | |
CN111902826A (en) | Positioning, mapping and network training | |
CN113158861B (en) | Motion analysis method based on prototype comparison learning | |
CN114663496A (en) | Monocular vision odometer method based on Kalman pose estimation network | |
CN116381753B (en) | Neural network assisted navigation method of GNSS/INS integrated navigation system during GNSS interruption | |
Goldenstein et al. | Statistical cue integration in dag deformable models | |
Setiyadi et al. | Human Activity Detection Employing Full-Type 2D Blazepose Estimation with LSTM | |
EP1071021B1 (en) | Method for inferring target paths from related cue paths | |
CN112989952B (en) | Crowd density estimation method and device based on mask guidance | |
Zhang et al. | Human trajectory forecasting using a flow-based generative model | |
Porta et al. | Appearance-based concurrent map building and localization | |
CN110647917B (en) | Model multiplexing method and system | |
CN112348854A (en) | Visual inertial mileage detection method based on deep learning | |
CN117058235A (en) | Visual positioning method crossing various indoor scenes | |
Liu et al. | Joint estimation of pose, depth, and optical flow with a competition–cooperation transformer network | |
Xing et al. | Simultaneous localization and mapping algorithm based on the asynchronous fusion of laser and vision sensors | |
Xu et al. | Real-time robust and precise kernel learning for indoor localization under the internet of things |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |