CN108256479B - Face tracking method and device

Info

Publication number: CN108256479B (grant of application CN201810045567.9A)
Authority: CN (China)
Prior art keywords: image frame, target face, search area, position information, current image
Legal status: Active
Application number: CN201810045567.9A
Other languages: Chinese (zh)
Other versions: CN108256479A (en)
Inventor: 何涛
Current Assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to: CN201810045567.9A
Publication of CN108256479A (application publication)
Application granted; publication of CN108256479B (grant publication)

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10: Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16: Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161: Detection; Localisation; Normalisation
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10: Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16: Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168: Feature extraction; Face representation
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

Embodiments of the present application disclose a face tracking method and a face tracking apparatus. One embodiment of the method comprises the following steps: acquiring the position information of a target face object in the previous image frame of the current image frame in a video sequence, and extracting the image information of the region indicated by that position information in the previous image frame to form a target face image corresponding to the previous image frame; determining a search area for the target face object in the current image frame according to the position information of the target face object in the previous image frame, and extracting the image information of the search area to form a search area image corresponding to the current image frame; and inputting the target face image corresponding to the previous image frame and the search area image corresponding to the current image frame into a face region positioning model, which outputs the position information of the target face object in the search area. This embodiment improves the accuracy of face tracking.

Description

Face tracking method and device
Technical Field
Embodiments of the present application relate to the field of computer technology, in particular to the field of image processing, and more particularly to a face tracking method and apparatus.
Background
With the development of computer technology, image processing technology is being applied in more and more fields. Face tracking based on an image sequence is a technique that detects a face in a video and then captures information such as the position and size of that face in subsequent frames.
In face tracking, conventional methods include model-based tracking and tracking based on motion information. Model-based tracking builds a model from prior knowledge of the target, for example skin color and the relative positions of facial features, and matches it against each image frame with a sliding window. Tracking based on motion information predicts the face region by exploiting the continuity of target motion between consecutive frames, usually with methods such as motion segmentation, optical flow or stereoscopic vision, and tracking with a Kalman filter. These methods track different face objects in the same image frame with the same template or the same prediction method and therefore have difficulty distinguishing between different faces.
Disclosure of Invention
The embodiment of the application provides a face tracking method and device.
In a first aspect, an embodiment of the present application provides a face tracking method, including: acquiring the position information of a target face object in the previous image frame of the current image frame in a video sequence, and extracting the image information of the region indicated by that position information in the previous image frame to form a target face image corresponding to the previous image frame; determining a search area for the target face object in the current image frame according to the position information of the target face object in the previous image frame, and extracting the image information of the search area to form a search area image corresponding to the current image frame; and inputting the target face image corresponding to the previous image frame and the search area image corresponding to the current image frame into a face region positioning model, which outputs the position information of the target face object in the search area; the face region positioning model is trained on a sample video sequence in which the position information of the face object in each image frame has been marked.
In some embodiments, the above method further comprises: mapping the position information of the target face object in the search area to the current image frame to obtain the position information of the target face object in the current image frame.
In some embodiments, determining the search area of the target face object in the current image frame according to the position information of the target face object in the previous image frame includes: taking the center of the region indicated by the position information of the target face object in the previous image frame as the center of the search area, and determining in the current image frame a search area that at least covers the image region indicated by that position information.
In some embodiments, the face region positioning model includes a first convolutional neural network, a second convolutional neural network and a fully connected layer; inputting the target face image corresponding to the previous image frame and the search area image corresponding to the current image frame into the face region positioning model and outputting the position information of the target face object in the search area then includes: inputting the target face image corresponding to the previous image frame into the first convolutional neural network for feature extraction to obtain a first feature; inputting the search area image corresponding to the current image frame into the second convolutional neural network for feature extraction to obtain a second feature; and concatenating the first feature and the second feature to generate a connection feature, inputting the connection feature into the fully connected layer of the face region positioning model, and outputting the position information of the target face object in the search area.
In some embodiments, the above method further comprises: extracting, from the current image frame, the image information of the region indicated by the position information of the target face object in the search area to form a target face image corresponding to the current image frame; comparing whether the target face image corresponding to the previous image frame is consistent with the target face image corresponding to the current image frame; and determining, according to the comparison result, whether the position information of the target face object in the search area is accurate.
In some embodiments, the comparing whether the target face image corresponding to the previous image frame is consistent with the target face image corresponding to the current image frame includes: inputting a target face image corresponding to the previous image frame and a target face image corresponding to the current image frame into a face image consistency comparison model to obtain a comparison result; the face image consistency comparison model is constructed based on a convolutional neural network and is obtained through training based on a sample face image pair of a marked consistency comparison result.
In a second aspect, an embodiment of the present application provides a face tracking apparatus, including: a first extraction unit, configured to acquire the position information of the target face object in the previous image frame of the current image frame in a video sequence and to extract the image information of the region indicated by that position information in the previous image frame, forming a target face image corresponding to the previous image frame; a second extraction unit, configured to determine a search area for the target face object in the current image frame according to the position information of the target face object in the previous image frame and to extract the image information of the search area, forming a search area image corresponding to the current image frame; and a positioning unit, configured to input the target face image corresponding to the previous image frame and the search area image corresponding to the current image frame into a face region positioning model, which outputs the position information of the target face object in the search area; the face region positioning model is trained on a sample video sequence in which the position information of the face object in each image frame has been marked.
In some embodiments, the apparatus further comprises: a mapping unit, configured to map the position information of the target face object in the search area to the current image frame to obtain the position information of the target face object in the current image frame.
In some embodiments, the second extraction unit is configured to determine the search area of the target face object in the current image frame as follows: taking the center of the region indicated by the position information of the target face object in the previous image frame as the center of the search area, and determining in the current image frame a search area that at least covers the image region indicated by that position information.
In some embodiments, the face region positioning model includes a first convolutional neural network, a second convolutional neural network and a fully connected layer; the positioning unit is then configured to determine the position information of the target face object in the search area as follows: inputting the target face image corresponding to the previous image frame into the first convolutional neural network for feature extraction to obtain a first feature; inputting the search area image corresponding to the current image frame into the second convolutional neural network for feature extraction to obtain a second feature; and concatenating the first feature and the second feature to generate a connection feature, inputting the connection feature into the fully connected layer of the face region positioning model, and outputting the position information of the target face object in the search area.
In some embodiments, the apparatus further comprises: a third extraction unit, configured to extract, from the current image frame, the image information of the region indicated by the position information of the target face object in the search area, forming a target face image corresponding to the current image frame; a comparison unit, configured to compare whether the target face image corresponding to the previous image frame is consistent with the target face image corresponding to the current image frame; and a judging unit, configured to determine, according to the comparison result, whether the position information of the target face object in the search area is accurate.
In some embodiments, the comparing unit is configured to compare whether the target face image corresponding to the previous image frame and the target face image corresponding to the current image frame are consistent in the following manner: inputting a target face image corresponding to the previous image frame and a target face image corresponding to the current image frame into a face image consistency comparison model to obtain a comparison result; the face image consistency comparison model is constructed based on a convolutional neural network and is obtained through training based on a sample face image pair of a marked consistency comparison result.
In a third aspect, an embodiment of the present application provides an electronic device, including: one or more processors; and a storage means for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the face tracking method as provided in the first aspect.
In a fourth aspect, embodiments of the present application provide a computer readable storage medium having a computer program stored thereon, where the program when executed by a processor implements the face tracking method provided in the first aspect.
According to the face tracking method and apparatus of the embodiments of the present application, the position information of the target face object in the previous image frame of the current image frame in a video sequence is acquired, and the image information of the region indicated by that position information is extracted from the previous image frame to form a target face image corresponding to the previous image frame. A search area for the target face object is then determined in the current image frame according to the position information of the target face object in the previous image frame, and the image information of the search area is extracted to form a search area image corresponding to the current image frame. The target face image corresponding to the previous image frame and the search area image corresponding to the current image frame are then input into a face region positioning model, which outputs the position information of the target face object in the search area. Since the face region positioning model is trained on a sample video sequence in which the position information of the face objects in each image frame has been marked, the accuracy of face tracking is improved, and tracking errors caused by different tracked objects approaching or overlapping each other in the images can be effectively avoided.
Drawings
Other features, objects and advantages of the present application will become more apparent upon reading of the detailed description of non-limiting embodiments, made with reference to the following drawings, in which:
FIG. 1 is an exemplary system architecture diagram in which the present application may be applied;
FIG. 2 is a flow chart of one embodiment of a face tracking method according to the present application;
FIG. 3 is a schematic diagram of one embodiment of a face tracking method according to the present application;
FIG. 4 is a flow chart of another embodiment of a face tracking method according to the present application;
FIG. 5 is a schematic diagram of a face tracking device according to an embodiment of the present application;
FIG. 6 is a schematic diagram of a computer system suitable for use in implementing embodiments of the present application.
Detailed Description
The present application is described in further detail below with reference to the drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting of the invention. It should be noted that, for convenience of description, only the portions related to the present invention are shown in the drawings.
It should be noted that, in the case of no conflict, the embodiments and features in the embodiments may be combined with each other. The present application will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
Fig. 1 illustrates an exemplary system architecture 100 in which the face tracking method or face tracking apparatus of the present application may be applied.
As shown in fig. 1, a system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 is used as a medium to provide communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.
The user 110 may interact with the server 105 via the network 104 using the terminal devices 101, 102, 103 to receive or send messages or the like. Various communication client applications, such as a photographic application, an image processing application, a search application, and the like, can be installed on the terminal devices 101, 102, 103.
The terminal devices 101, 102, 103 may be electronic devices with image capturing functions, such as monitoring cameras, access control devices including cameras, and various electronic devices with display screens, cameras and supporting network data access, including but not limited to smartphones, tablet computers, laptop and desktop computers, and the like.
The server 105 may be a server providing various services, such as a video server processing video uploaded by the terminal devices 101, 102, 103. The video processing server may perform analysis and other processing on the received video sequence, and feed back the processing result (for example, the tracked position information of the face in each image frame of the video sequence) to the terminal device.
It should be noted that, the face tracking method provided in the embodiment of the present application may be executed by the server 105, and accordingly, the face tracking device may be disposed in the server 105.
It should be noted that the video sequence may also be stored locally on the server 105, in which case the server 105 can read the local video sequence directly for face recognition; the exemplary system architecture 100 then need not include the terminal devices 101, 102, 103 and the network 104.
It should also be noted that the video processing type application may be installed in the terminal devices 101, 102, 103, and the terminal devices 101, 102, 103 may perform face detection on the video sequence based on the video processing type application, and in this case, the face tracking method may also be performed by the terminal devices 101, 102, 103, and accordingly, the face tracking apparatus may also be provided in the terminal devices 101, 102, 103. At this point, the exemplary system architecture 100 may not have the server 105 and network 104 present.
It should be understood that the number of terminal devices, networks and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation. For example, the server may be a clustered server, including multiple servers with different processes deployed.
With continued reference to fig. 2, a flow 200 of one embodiment of a face tracking method according to the present application is shown. The face tracking method comprises the following steps:
step 201, acquiring position information of a target face object in a previous image frame of a current image frame in a video sequence, extracting image information of an area where the target face object indicated by the position information is located in the previous image frame of the current image frame, and forming a target face image corresponding to the previous image frame.
In this embodiment, the electronic device (for example, the terminal device or the server shown in fig. 1) on which the face tracking method operates may first acquire the video sequence. The video sequence here may be a video sequence of a scene (e.g. a monitoring scene, a video teleconferencing scene) comprising a plurality of image frames acquired consecutively.
The video sequence may be stored on the electronic device on which the face tracking method runs, in which case the electronic device can read it from memory. The video sequence may also be stored on another device; in that case the electronic device may establish a communication connection with the device storing the video sequence through a wired or wireless connection, receive the video sequence that device sends in response to a video search request, or actively send an acquisition request to a terminal device that stores or captures the video sequence and receive the video sequence transmitted in response to that request.
After the video sequence is acquired, whether each image frame in the video sequence contains the target face object may be sequentially detected, and specifically, whether each image frame contains the target face object may be detected from the image frame acquired first according to the acquisition time of the image frame in the video sequence. The target face object is a tracking object, which can be a target face object with a specified identity, or a face object detected in a video sequence. In an actual scenario, the first face object detected in the video sequence may be taken as a target face object, or a certain face object in the video sequence may be designated as a target face object.
A face detection method may be employed to determine the first image frame in the video sequence that contains the target face object. When detecting the target face object, a pre-constructed face model may be matched against the image frame. The pre-constructed face model may contain data and combinations of data such as the relative positions and relative sizes of facial features and the length-width ratio of the head, or it may be built from feature points extracted from face images by a feature extraction algorithm. Face detection may also be performed with machine learning models based on architectures such as decision trees, logistic regression, hidden Markov models or neural networks, where the machine learning model may be trained on sample images in which face objects have been marked.
It should be noted that, the face detection method in this embodiment is not limited to the above-mentioned method based on the face model and the machine learning, and the face detection may be performed by using an algorithm such as wavelet transform and Adaboost based on features such as skin color.
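By way of illustration, the following Python sketch shows one conventional way to obtain an initial face detection, using the AdaBoost-based Haar cascade detector shipped with OpenCV; the patent does not prescribe a specific detector, so the cascade file and parameters here are assumptions.

```python
import cv2

def detect_initial_face(frame_bgr):
    """Return one face box (x1, y1, x2, y2) in frame coordinates, or None."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    boxes = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(boxes) == 0:
        return None                       # no face found in this frame
    x, y, w, h = boxes[0]                 # e.g. take the first detection as the target face
    return (x, y, x + w, y + h)
```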
When a target face object is detected to be contained in one image frame, the target face object can be detected in the next image frame, and thus all image frames containing the face object can be detected in turn.
Specifically, when the position of the target face object in the image frame immediately preceding the current image frame is determined, the detection of the target face object may be performed for the current image frame. At this time, the position information of the target face object in the image frame immediately preceding the current image frame may be acquired. The position information is the position information of the area covered by the target face object in the whole image frame, and may be, for example, coordinates of four vertices of the minimum circumscribed rectangle of the image area covered by the target face object in the previous image frame. Then, the image information of the area where the target face object is located can be extracted from the previous image frame according to the position information, and a target face image corresponding to the previous image frame is formed. For example, the image information in the minimum circumscribed rectangle can be extracted to form a target face image corresponding to the last image frame.
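A minimal sketch of this extraction step, assuming the position information is stored as the corner coordinates (x1, y1, x2, y2) of the minimum circumscribed rectangle and frames are NumPy image arrays:

```python
import numpy as np

def crop_target_face(prev_frame: np.ndarray, box) -> np.ndarray:
    """Extract the image information of the region indicated by the position
    information, forming the target face image for the previous frame."""
    x1, y1, x2, y2 = [int(round(v)) for v in box]
    return prev_frame[y1:y2, x1:x2].copy()
```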
Step 202, determining a search area of the target face object from the current image frame according to the position information of the target face object in the previous image frame of the current image frame, extracting the image information of the search area, and forming a search area image corresponding to the current image frame.
A video sequence consists of consecutively acquired image frames, and target motion between consecutive frames is continuous; the displacement of the same object between two adjacent frames is usually small, that is, the position of the same target face object changes only within a limited range between adjacent frames. In this embodiment, the search range of the target face object in the current image frame can therefore be determined from the position information of the target face object in the previous image frame. Specifically, a region that covers the range indicated by the position information of the target face object in the previous image frame, and whose area is larger than the region covered by the target face object in that frame, may be defined in the current image frame as the search area. The image information of the search area in the current image frame is then extracted to form the search area image corresponding to the current image frame.
In some optional implementations of this embodiment, determining the search area of the target face object in the current image frame according to the position information of the target face object in the previous image frame may be implemented as follows: taking the center of the region indicated by the position information of the target face object in the previous image frame as the center of the search area, and determining in the current image frame a search area that at least covers the image region indicated by that position information.
Specifically, in the above implementation, the coordinates of the center and the edges of the region indicated by the position information of the target face object in the previous image frame may be mapped into the current image frame; the center coordinates are taken as the center of the search area, and the edges are then extended outward by a certain amount to obtain the edges of the search area. For example, if the position information of the target face object in the previous image frame consists of the four vertex coordinates of a rectangular region, each boundary of that rectangle may be expanded outward in the current image frame around the rectangle's center point, so that the expanded rectangle covers the original rectangle; the expanded rectangle can then be used as the search area.
The image of the search area portion may be separated from the current image frame to form a search area image corresponding to the current image frame. The search area image is an image of a search area of a target face object in the current image frame.
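A sketch of this search-area construction under one simple assumption: the search area keeps the center of the previous-frame box and enlarges each side by a fixed factor (2.0 here is an illustrative choice; the patent only requires that the search area cover the previous-frame region and be larger than it).

```python
def compute_search_area(prev_box, frame_w, frame_h, scale=2.0):
    """Expand the previous-frame box around its own center and clip to the frame."""
    x1, y1, x2, y2 = prev_box
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0            # keep the same center
    half_w, half_h = (x2 - x1) * scale / 2.0, (y2 - y1) * scale / 2.0
    sx1, sy1 = max(0, int(cx - half_w)), max(0, int(cy - half_h))
    sx2, sy2 = min(frame_w, int(cx + half_w)), min(frame_h, int(cy + half_h))
    return (sx1, sy1, sx2, sy2)

def crop_search_area(cur_frame, search_box):
    """Extract the image information of the search area from the current frame."""
    sx1, sy1, sx2, sy2 = search_box
    return cur_frame[sy1:sy2, sx1:sx2].copy()
```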
Step 203, inputting the target face image corresponding to the previous image frame and the search area image corresponding to the current image frame into a face area positioning model, and outputting the position information of the target face object in the search area.
Here, the face region positioning model is trained on a sample video sequence in which the position information of the face object in each image frame has been marked.
The electronic device may input the target face image corresponding to the previous image frame extracted in step 201 and the search area image corresponding to the current image frame extracted in step 202 into a face region positioning model trained by a machine learning method. The face region positioning model locates, within the input search area image, the image region that is consistent with the face object contained in the target face image, and the position information of the target face object in the search area can thus be obtained.
The face region positioning model may be obtained by training with a supervised learning method. Specifically, a sample video sequence may be constructed, and the position information of at least one face object may be marked in each image frame of the sample video sequence, for example by drawing a rectangular box around the face object in each frame. Optionally, the identities of these face objects may also be marked in order to distinguish different face objects. Then, for two adjacent image frames of the sample video sequence, the image information of a designated face object is extracted from the earlier frame to form a sample face image, and a region that contains the image information of the designated face object and whose area is larger than that of the sample face image is extracted from the later frame to form a sample search area image. The sample face image and the sample search area image can then be input into the face region positioning model for prediction: the model extracts the features of the sample face image, matches the extracted features within the sample search area image, and outputs a prediction of the position information of the designated face object in the sample search area image. This prediction is compared with the marked position information of the face object in the later of the two adjacent frames, and the parameters of the face region positioning model are adjusted according to the difference indicated by the comparison. The steps of predicting the position information, comparing the prediction with the marked result and adjusting the parameters are repeated until the difference between the prediction and the marking satisfies a preset convergence condition, at which point the parameters stop being updated and a trained face region positioning model is obtained.
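A highly simplified training sketch for the procedure just described, assuming that (sample face image, sample search area image, marked box) triples have already been built from adjacent frame pairs of the labelled sample videos and that the model regresses box coordinates directly; the loss function and optimiser are illustrative assumptions, not choices stated by the patent.

```python
import torch
import torch.nn as nn

def train_locator(model, loader, epochs=10, lr=1e-4, device="cpu"):
    model.to(device).train()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.SmoothL1Loss()                  # difference between prediction and marking
    for _ in range(epochs):
        for sample_face, sample_search, marked_box in loader:
            pred_box = model(sample_face.to(device), sample_search.to(device))
            loss = criterion(pred_box, marked_box.to(device))
            optimizer.zero_grad()
            loss.backward()                        # adjust the model parameters
            optimizer.step()
    return model
```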
After the position information of the target face object in the search area corresponding to the current image frame is obtained, the next image frame can be used as the current image frame, the position information of the target face object in the next image frame is determined by adopting the method, and then the position information of the target face object in each image frame in the video sequence is determined, so that face tracking is realized.
According to the face tracking method of this embodiment, the position information of the target face object in the previous image frame of the current image frame in the video sequence is acquired, and the image information of the region indicated by that position information is extracted from the previous image frame to form a target face image corresponding to the previous image frame. A search area for the target face object is then determined in the current image frame according to the position information of the target face object in the previous image frame, and the image information of the search area is extracted to form a search area image corresponding to the current image frame. The target face image corresponding to the previous image frame and the search area image corresponding to the current image frame are then input into a face region positioning model, which outputs the position information of the target face object in the search area. Since the face region positioning model is trained on a sample video sequence in which the position information of the face objects in each image frame has been marked, face tracking combining positional continuity with face features is realized, the accuracy of face tracking is improved, and tracking errors caused by different tracked objects approaching or overlapping each other in the images can be effectively avoided.
In some embodiments, the face region positioning model is constructed based on convolutional neural networks and may include a first convolutional neural network, a second convolutional neural network and a fully connected layer. The first and second convolutional neural networks are used to extract features from an input image; each of them may comprise a plurality of convolutional layers, a plurality of pooling layers and a fully connected layer. An image input into the first or second convolutional neural network undergoes convolution, pooling and normalization to produce a feature vector of the image.
In this case, inputting the target face image corresponding to the previous image frame and the search area image corresponding to the current image frame into the face region positioning model and outputting the position information of the target face object in the search area in step 203 may be performed as follows: inputting the target face image corresponding to the previous image frame into the first convolutional neural network for feature extraction to obtain a first feature; inputting the search area image corresponding to the current image frame into the second convolutional neural network for feature extraction to obtain a second feature; and concatenating the first feature and the second feature to generate a connection feature, inputting the connection feature into the fully connected layer of the face region positioning model, and outputting the position information of the target face object in the search area.
Specifically, the target face image corresponding to the previous image frame extracted in step 201 may be input into the first convolutional neural network, which extracts a first feature of that image; the first feature characterizes the target face object and may be, for example, a 128-dimensional vector. The search area image corresponding to the current image frame extracted in step 202 is input into the second convolutional neural network, which extracts a second feature characterizing the search area image; this may likewise be a 128-dimensional vector. The first feature is then concatenated with the second feature to generate a connection feature, for example by splicing the two 128-dimensional vectors into a 256-dimensional feature. The connection feature is input into the fully connected layer of the face region positioning model, which can estimate the probability of the position information of the face region (for example, the probability that the vertex coordinates of the rectangular region lie at each pixel position in the search area image) and determine the position information of the target face object in the search area image according to these probabilities.
By adopting a face region positioning model comprising the first convolutional neural network, the second convolutional neural network and the fully connected layer, the face features of the target face image corresponding to the previous image frame and the features of the search area image corresponding to the current image frame can be extracted accurately, and the positions in the current image frame whose features resemble the face features of the target face image can be found by comparison, so the position information of the target face object can be located more accurately.
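A minimal sketch of this two-branch architecture in PyTorch: each branch produces a 128-dimensional feature, the two features are concatenated into a 256-dimensional connection feature, and fully connected layers regress the box (x1, y1, x2, y2) within the search-area image. Layer sizes assume 64x64 RGB inputs and are illustrative only.

```python
import torch
import torch.nn as nn

class Branch(nn.Module):
    """Convolutional branch that maps an image to a 128-dimensional feature vector."""
    def __init__(self, out_dim=128):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # 64 -> 32
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 32 -> 16
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1))
        self.fc = nn.Linear(128, out_dim)

    def forward(self, x):
        return self.fc(self.features(x).flatten(1))        # (N, 128)

class FaceRegionLocator(nn.Module):
    def __init__(self):
        super().__init__()
        self.cnn1 = Branch()   # first CNN: target face image from the previous frame
        self.cnn2 = Branch()   # second CNN: search-area image from the current frame
        self.head = nn.Sequential(nn.Linear(256, 128), nn.ReLU(), nn.Linear(128, 4))

    def forward(self, target_face, search_area):
        f1 = self.cnn1(target_face)                        # first feature
        f2 = self.cnn2(search_area)                        # second feature
        return self.head(torch.cat([f1, f2], dim=1))       # box in search-area coordinates
```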
In some embodiments, the face tracking method may further include: and mapping the position information of the target face object in the search area to the current image frame to obtain the position information of the target face object in the current image frame. After outputting the position information of the target face object in the search area, the position information of the target face object in the current image frame may be determined according to the position of the search area in the current image frame, and the position information of the target face object in the search area image, where the search area is located when the search area of the target face object is determined in step 202. The vertex coordinates of the rectangle for indicating the position of the target face object in the search area image may be converted into the coordinates of the current image frame by coordinate conversion, for example, to obtain the vertex coordinates of the rectangle for indicating the position of the target face object in the search area image in the current image frame.
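A sketch of this coordinate mapping, assuming boxes are (x1, y1, x2, y2) tuples and the search area's top-left corner in the current frame is known from the earlier step:

```python
def map_box_to_frame(box_in_search, search_box):
    """Translate a box from search-area coordinates into current-frame coordinates."""
    sx1, sy1, _, _ = search_box                   # search area position in the frame
    x1, y1, x2, y2 = box_in_search                # target face position in the search area
    return (x1 + sx1, y1 + sy1, x2 + sx1, y2 + sy1)
```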
Referring to fig. 3, a schematic diagram of one embodiment of a face tracking method according to the present application is shown. As shown in fig. 3, the position of the target face object in the previous image frame F1 of the current image frame F2 has been determined as the rectangular area A; the rectangular area A may be mapped into the current image frame F2 and expanded outwards to obtain the search area R.
Then, the rectangular region A may be extracted from the previous image frame F1 to form the target face image, and the search region R may be extracted from the current image frame F2 to form the search area image. The target face image and the search area image are input into the face region positioning model: specifically, the target face image is input into the first convolutional neural network CNN1 of the model for feature extraction, yielding a first feature f1 of the target face object, and the search area image is input into the second convolutional neural network CNN2 of the model for feature extraction, yielding a second feature f2 of the search area image. The first feature f1 and the second feature f2 are then spliced and input into the fully connected layer FC of the model, which outputs the position information of the target face object in the search area image, shown as the rectangular region TA within the search region R. The position information TA of the target face object in the search area can then be mapped to the current image frame F2 to obtain the position of the target face object in the current image frame F2.
After the position of the target face object in the current image frame F2 is obtained, the current image frame F2 may be taken as the previous image frame of the next image frame F3 adjacent to the image frame F2, and the above-mentioned face tracking method is continuously executed with the image frame F3 being taken as the current image frame, thereby realizing the position tracking of the target face object in the continuous image frames.
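The overall frame-by-frame loop could then look like the following sketch; the helper names (detect_initial_face, crop_target_face, compute_search_area, crop_search_area, map_box_to_frame) refer to the illustrative functions above and are not names used by the patent, and `locate` stands for a trained face region positioning model.

```python
def track_face(frames, locate):
    """frames: iterable of images; locate(target_face_img, search_img) -> box in search coords."""
    boxes = []
    prev_frame, prev_box = None, None
    for frame in frames:
        if prev_box is None:                              # no target yet: run face detection
            cur_box = detect_initial_face(frame)
        else:
            h, w = frame.shape[:2]
            search_box = compute_search_area(prev_box, w, h)
            target_face = crop_target_face(prev_frame, prev_box)
            search_img = crop_search_area(frame, search_box)
            box_in_search = locate(target_face, search_img)
            cur_box = map_box_to_frame(box_in_search, search_box)
        boxes.append(cur_box)                             # position of the target face in this frame
        prev_frame, prev_box = frame, cur_box
    return boxes
```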
Fig. 4 shows a flow chart of another embodiment of a face tracking method according to the present application. As shown in fig. 4, the flow 400 of the face tracking method may include the following steps:
step 401, acquiring position information of a target face object in a previous image frame of a current image frame in a video sequence, extracting image information of an area where the target face object indicated by the position information in the previous image frame of the current image frame is located, and forming a target face image corresponding to the previous image frame.
In this embodiment, the electronic device on which the face tracking method operates may first obtain a video sequence from a local or other device and detect a target face object in the video sequence. The target face object may be a specified face object or the first face object detected in the video sequence. Specifically, a plurality of face detection methods can be adopted to detect the target face object.
After detecting a target face object in one image frame, tracking of the position of the target face object in the next image frame may continue. Specifically, the position information of the target face object in the image frame previous to the current image frame may be acquired, for example, the coordinates of the region where the target face object is located in the image frame previous to the current image frame may be acquired, and then the image information of the region may be extracted as the target face image.
Step 402, determining a search area of the target face object from the current image frame according to the position information of the target face object in the previous image frame of the current image frame, extracting the image information of the search area, and forming a search area image corresponding to the current image frame.
A region that covers the range indicated by the position information of the target face object in the previous image frame, and whose area is larger than the region covered by the target face object in that frame, may be defined in the current image frame as the search area. The image information of the search area in the current image frame is then extracted to form the search area image corresponding to the current image frame.
Step 403, inputting the target face image corresponding to the previous image frame and the search area image corresponding to the current image frame into a face area positioning model, and outputting the position information of the target face object in the search area.
Here, the face region positioning model is trained on a sample video sequence in which the position information of the face object in each image frame has been marked. The target face image corresponding to the previous image frame extracted in step 401 and the search area image corresponding to the current image frame extracted in step 402 may be input into a face region positioning model trained by a machine learning method. The model locates, within the input search area image, the image region that is consistent with the face object contained in the target face image, and the position information of the target face object in the search area can thus be obtained.
The steps 401, 402, and 403 correspond to the steps 201, 202, and 203 in the foregoing embodiments, respectively, and the descriptions of the steps 201, 202, and 203 are also applicable to the steps 401, 402, and 403, which are not repeated herein.
Step 404, extracting, from the current image frame, the image information of the region indicated by the position information of the target face object in the search area, and forming the target face image corresponding to the current image frame.
After the position information of the target face object in the search area image corresponding to the current image frame is determined, the image information of the area where the target face object is located in the search area image can be extracted to form the target face image corresponding to the current image frame.
Specifically, according to the position information of the target face object in the search area image determined in step 403, the area corresponding to the position information in the search area image is segmented from the search area image, so as to obtain the target face image corresponding to the current image frame.
Step 405, comparing whether the target face image corresponding to the previous image frame is consistent with the target face image corresponding to the current image frame.
In this embodiment, whether the target face image corresponding to the previous image frame is consistent with the target face image corresponding to the current image frame may be compared. The purpose of the comparison is to determine whether the target face object located in the previous image frame and the target face object located in the current image frame are the same face object. The comparison may be feature-based: feature extraction methods such as the Scale-Invariant Feature Transform (SIFT), Speeded-Up Robust Features (SURF) or Gaussian-filter-based methods may be used to extract the feature points, textures, contours and so on of the two target face images. The similarity of the two images is then computed from the extracted features, and if the similarity is greater than a preset threshold, the target face image corresponding to the previous image frame is determined to be consistent with the target face image corresponding to the current image frame.
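One possible feature-based comparison, sketched with OpenCV's SIFT implementation and a ratio test; the similarity measure and the 0.5 threshold are illustrative assumptions rather than values given by the patent.

```python
import cv2

def faces_consistent(face_a, face_b, threshold=0.5):
    """Return True if the two face crops are judged to show the same face."""
    sift = cv2.SIFT_create()
    kp_a, des_a = sift.detectAndCompute(cv2.cvtColor(face_a, cv2.COLOR_BGR2GRAY), None)
    kp_b, des_b = sift.detectAndCompute(cv2.cvtColor(face_b, cv2.COLOR_BGR2GRAY), None)
    if des_a is None or des_b is None:
        return False                                      # no usable features in one of the crops
    pairs = cv2.BFMatcher().knnMatch(des_a, des_b, k=2)
    good = [p for p in pairs if len(p) == 2 and p[0].distance < 0.75 * p[1].distance]
    similarity = len(good) / max(len(kp_a), 1)            # fraction of well-matched keypoints
    return similarity > threshold
```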
In some optional implementations of this embodiment, it may be compared whether the target face image corresponding to the previous image frame and the target face image corresponding to the current image frame are consistent in the following manner: and inputting the target face image corresponding to the previous image frame and the target face image corresponding to the current image frame into a face image consistency comparison model to obtain a comparison result. The face image consistency comparison model is constructed based on a convolutional neural network and is obtained through training based on a sample face image pair of a marked consistency comparison result.
Specifically, the two target face images can be input into a face image consistency comparison model. The model can be constructed based on a convolutional neural network; it extracts feature maps of the two input images and compares the consistency of the image features they represent. The face image consistency comparison model can be trained as follows: obtain sample face image pairs marked with consistency comparison results; input a sample face image pair into the model to obtain a predicted consistency comparison result; construct a loss function from the model's output and the marked consistency comparison result of the sample pair; compute the value of the loss function; and update the parameters of the model with a back-propagation algorithm until the value of the loss function is smaller than a preset value. At that point a trained face image consistency comparison model is obtained.
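A simplified training sketch for the consistency comparison model, assuming a convolutional network `compare_net(img_a, img_b)` that outputs a single logit and sample face-image pairs labelled 1 (same face) or 0 (different faces); the loss, optimiser and stopping threshold are illustrative choices.

```python
import torch
import torch.nn as nn

def train_consistency_model(compare_net, loader, epochs=10, lr=1e-4, eps=1e-3, device="cpu"):
    compare_net.to(device).train()
    optimizer = torch.optim.Adam(compare_net.parameters(), lr=lr)
    criterion = nn.BCEWithLogitsLoss()
    for _ in range(epochs):
        for img_a, img_b, label in loader:
            logit = compare_net(img_a.to(device), img_b.to(device)).squeeze(1)
            loss = criterion(logit, label.float().to(device))
            optimizer.zero_grad()
            loss.backward()                    # back-propagation updates the parameters
            optimizer.step()
            if loss.item() < eps:              # stop once the loss falls below a preset value
                return compare_net
    return compare_net
```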
Step 406, determining whether the position information of the target face object in the search area is accurate according to the comparison result.
If the comparison result indicates that the target face image corresponding to the previous image frame is consistent with the target face image corresponding to the current image frame, the position information of the target face object in the search area can be determined to be accurate. If the comparison result indicates that the two target face images are inconsistent, the position information of the target face object in the search area can be determined to be inaccurate. In that case, the process may return to steps 402 and 403 (corresponding to steps 202 and 203 above) to redetermine the search area and relocate the target face object in it, or the target face object may be considered absent from the current image frame and face tracking ends.
As can be seen from fig. 4, the face tracking method of this embodiment can further ensure that the target face objects extracted in the adjacent image frames are the same face object by adding the step of comparing the consistency of the target face image corresponding to the previous image frame and the target face image corresponding to the current image frame, so as to avoid following the wrong object in the face tracking process, and further improve the accuracy of face tracking.
With continued reference to fig. 5, as an implementation of the method shown in the foregoing figures, the present application provides an embodiment of a face tracking apparatus, where an embodiment of the apparatus corresponds to the method embodiment shown in fig. 2 and fig. 4, and the apparatus may be specifically applied to various electronic devices.
As shown in fig. 5, the face tracking apparatus 500 of the present embodiment includes: a first extraction unit 501, a second extraction unit 502, and a positioning unit 503. The first extraction unit 501 may be configured to obtain location information of a target face object in a previous image frame of a current image frame in a video sequence, and extract image information of an area where the target face object indicated by the location information is located in the previous image frame of the current image frame, so as to form a target face image corresponding to the previous image frame; the second extraction unit 502 may be configured to determine a search area of the target face object from the current image frame according to the position information of the target face object in the previous image frame of the current image frame, extract image information of the search area, and form a search area image corresponding to the current image frame; the positioning unit 503 may be configured to input a target face image corresponding to a previous image frame and a search area image corresponding to a current image frame into a face area positioning model, and output position information of a target face object in the search area; the face region positioning model is trained based on a sample video sequence marked with the position information of the face object in each image frame.
In this embodiment, the first extraction unit 501 may first acquire a video sequence, and then detect whether an image frame in the video sequence contains a target face object. Specifically, it is possible to detect whether each image frame contains a target face object according to the acquisition time of the image frame in the video sequence, starting from the image frame acquired first. The target face object can be a target face object with a specified identity, or can be any face object detected in a video sequence. In an actual scenario, the first face object detected in the video sequence may be taken as a target face object, or a certain face object in the video sequence may be designated as a target face object.
Positional information of the detected target face object in each image frame can then be determined. When tracking the target face object in the current image frame, the first extraction unit 501 may acquire the position information of the target face object in the previous image frame and then segment out the region indicated by that position information in the previous image frame to form the target face image corresponding to the previous image frame.
The second extraction unit 502 may determine the search range of the target face object in the current image frame according to the position information of the target face object in the previous image frame. Specifically, it may define in the current image frame, as the search area, a region that covers the range indicated by the position information of the target face object in the previous image frame and whose area is larger than that range, and then extract the image information of the search area in the current image frame to form the search area image corresponding to the current image frame.
The positioning unit 503 may use the face region positioning model to perform feature extraction on the target face image corresponding to the previous image frame and on the search area image corresponding to the current image frame, and compare the extracted features so as to locate the target face object in the search area image.
In some embodiments, the apparatus 500 may further include a mapping unit configured to map the position information of the target face object in the search area to the current image frame, so as to obtain the position information of the target face object in the current image frame.
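Because the model predicts coordinates relative to the search-area image, mapping back to the full frame is just a translation by the search area's top-left corner. The sketch below uses the same assumed (x, y, w, h) convention and assumes the search-area image is not resized before being fed to the model; if it were resized, the box would additionally have to be rescaled.

```python
def map_to_frame(box_in_search, search_box):
    """Translate a box predicted inside the search-area image back into
    whole-frame coordinates."""
    sx, sy, _, _ = search_box       # top-left corner of the search area in the frame
    x, y, w, h = box_in_search
    return x + sx, y + sy, w, h
```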
In some embodiments, the second extraction unit 502 may be configured to determine the search area of the target face object from the current image frame as follows: taking the center of the region indicated by the position information of the target face object in the previous image frame of the current image frame as the center of the search area, determine in the current image frame a search area that at least covers the image area indicated by that position information.
In some embodiments, the face region positioning model includes a first convolutional neural network, a second convolutional neural network, and a fully connected layer. In this case, the positioning unit 503 may determine the position information of the target face object in the search area as follows: input the target face image corresponding to the previous image frame into the first convolutional neural network for feature extraction to obtain a first feature; input the search area image corresponding to the current image frame into the second convolutional neural network for feature extraction to obtain a second feature; concatenate the first feature and the second feature to generate a connection feature, input the connection feature into the fully connected layer of the face region positioning model, and output the position information of the target face object in the search area.
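The following PyTorch sketch mirrors this structure: two convolutional branches, feature concatenation, and fully connected layers that regress the box of the target inside the search area. The layer sizes, pooling resolution, and the four-value (x, y, w, h) output are illustrative assumptions; the embodiment only fixes the overall two-branch-plus-fully-connected topology.

```python
import torch
import torch.nn as nn

class FaceRegionLocator(nn.Module):
    """Two-branch CNN + fully connected layers, as described above (sketch)."""

    def __init__(self):
        super().__init__()
        self.target_cnn = self._branch()   # first convolutional neural network
        self.search_cnn = self._branch()   # second convolutional neural network
        # 1024 features per branch -> 2048 after concatenation -> (x, y, w, h)
        self.fc = nn.Sequential(nn.Linear(2048, 256), nn.ReLU(), nn.Linear(256, 4))

    @staticmethod
    def _branch():
        return nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten())          # 64 * 4 * 4 = 1024

    def forward(self, target_img, search_img):
        f1 = self.target_cnn(target_img)            # first feature
        f2 = self.search_cnn(search_img)            # second feature
        joint = torch.cat([f1, f2], dim=1)          # connection feature
        return self.fc(joint)                       # position in the search area
```

Training such a model would then minimize a box-regression loss (for example, smooth L1) between the predicted position and the position annotated in each frame of the sample video sequences.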
In some embodiments, the apparatus 500 may further include: a third extraction unit, configured to extract the image information of the area in the current image frame indicated by the position information of the target face object in the search area, so as to form a target face image corresponding to the current image frame; a comparison unit, configured to compare whether the target face image corresponding to the previous image frame is consistent with the target face image corresponding to the current image frame; and a judging unit, configured to determine, according to the comparison result, whether the position information of the target face object in the search area is accurate.
In a further embodiment, the comparison unit may compare whether the target face image corresponding to the previous image frame and the target face image corresponding to the current image frame are consistent in the following manner: input the target face image corresponding to the previous image frame and the target face image corresponding to the current image frame into a face image consistency comparison model to obtain a comparison result. The face image consistency comparison model is constructed based on a convolutional neural network and is trained on sample face image pairs labeled with consistency comparison results.
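A sketch of such a consistency comparison model is given below. It embeds the two face crops with a shared convolutional branch and classifies the pair as consistent or not; weight sharing, the embedding size, and the sigmoid output are assumptions, since the embodiment only states that the model is built on a convolutional neural network and trained on labeled face image pairs.

```python
import torch
import torch.nn as nn

class FaceConsistencyChecker(nn.Module):
    """Pairwise consistency model: shared CNN embedding + small classifier head."""

    def __init__(self):
        super().__init__()
        self.embed = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())            # 64-d embedding
        self.head = nn.Sequential(
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, 1), nn.Sigmoid())                   # probability of "same face"

    def forward(self, prev_face, curr_face):
        e1 = self.embed(prev_face)    # target face image from the previous frame
        e2 = self.embed(curr_face)    # target face image from the current frame
        return self.head(torch.cat([e1, e2], dim=1))
```

If the output falls below a chosen threshold, the located position can be treated as inaccurate, and the search area is re-determined and the target face object re-localized, consistent with the handling described in the claims.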
The elements recited in apparatus 500 correspond to the various steps in the methods described with reference to fig. 2 and 4. Thus, the operations and features described above with respect to the method are equally applicable to the apparatus 500 and the units contained therein, and are not described in detail herein.
In the face tracking apparatus 500 of this embodiment, the first extraction unit obtains the position information of the target face object in the previous image frame of the current image frame in the video sequence, and extracts the image information of the region indicated by that position information to form the target face image corresponding to the previous image frame. The second extraction unit then determines the search area of the target face object in the current image frame according to the position information of the target face object in the previous image frame, and extracts the image information of the search area to form a search area image corresponding to the current image frame. Finally, the positioning unit inputs the target face image corresponding to the previous image frame and the search area image corresponding to the current image frame into the face region positioning model and outputs the position information of the target face object in the search area. Because the face region positioning model is trained on a sample video sequence in which the position information of the face objects is marked in each image frame, the apparatus realizes face tracking that combines positional continuity with facial features, improves tracking accuracy, and effectively avoids tracking errors caused by different tracked objects being close to or overlapping each other in the images.
Referring now to FIG. 6, a schematic diagram of a computer system 600 suitable for use in implementing an electronic device of an embodiment of the present application is shown. The electronic device shown in fig. 6 is only an example and should not impose any limitation on the functionality and scope of use of the embodiments of the present application.
As shown in fig. 6, the computer system 600 includes a Central Processing Unit (CPU) 601, which can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 602 or a program loaded from a storage section 608 into a Random Access Memory (RAM) 603. In the RAM 603, various programs and data required for the operation of the system 600 are also stored. The CPU 601, ROM 602, and RAM 603 are connected to each other through a bus 604. An input/output (I/O) interface 605 is also connected to bus 604.
The following components are connected to the I/O interface 605: an input portion 606 including a keyboard, a mouse, and the like; an output portion 607 including a Cathode Ray Tube (CRT) or Liquid Crystal Display (LCD), a speaker, and the like; a storage section 608 including a hard disk and the like; and a communication section 609 including a network interface card such as a LAN card, a modem, or the like. The communication section 609 performs communication processing via a network such as the Internet. A drive 610 is also connected to the I/O interface 605 as needed. A removable medium 611, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 610 as needed, so that a computer program read therefrom is installed into the storage section 608 as needed.
In particular, according to embodiments of the present disclosure, the processes described above with reference to flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method shown in the flowcharts. In such an embodiment, the computer program may be downloaded and installed from a network through the communication portion 609, and/or installed from the removable medium 611. The above-described functions defined in the method of the present application are performed when the computer program is executed by a Central Processing Unit (CPU) 601.

It should be noted that the computer readable medium of the present application may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.

In the present application, however, a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electromagnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations of the present application may be written in one or more programming languages, including object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units involved in the embodiments of the present application may be implemented by software or by hardware. The described units may also be provided in a processor, for example described as: a processor including a first extraction unit, a second extraction unit, and a positioning unit. The names of these units do not in all cases limit the units themselves; for example, the first extraction unit may also be described as "a unit that obtains position information of a target face object in the image frame immediately preceding the current image frame in a video sequence, extracts the image information of the area where the target face object indicated by the position information is located in that preceding image frame, and forms a target face image corresponding to the preceding image frame".
As another aspect, the present application also provides a computer-readable medium that may be contained in the apparatus described in the above embodiments; or may be present alone without being fitted into the device. The computer readable medium carries one or more programs which, when executed by the apparatus, cause the apparatus to: acquiring the position information of a target face object in a previous image frame of a current image frame in a video sequence, extracting the image information of an area where the target face object indicated by the position information in the previous image frame of the current image frame is located, and forming a target face image corresponding to the previous image frame; according to the position information of the target face object in the last image frame of the current image frame, determining a search area of the target face object from the current image frame, extracting image information of the search area, and forming a search area image corresponding to the current image frame; inputting a target face image corresponding to the previous image frame and a search area image corresponding to the current image frame into a face area positioning model, and outputting the position information of a target face object in the search area; the face region positioning model is trained based on a sample video sequence marked with the position information of the face object in each image frame.
The foregoing description covers only the preferred embodiments of the present application and illustrates the technical principles employed. It will be appreciated by persons skilled in the art that the scope of the invention referred to in this application is not limited to the specific combinations of features described above, and is also intended to cover other technical solutions formed by any combination of the above features or their equivalents without departing from the inventive concept, for example, technical solutions formed by replacing the above features with technical features having similar functions disclosed in the present application (but not limited thereto).

Claims (10)

1. A face tracking method, comprising:
acquiring the position information of a target face object in a previous image frame of a current image frame in a video sequence, extracting the image information of an area where the target face object indicated by the position information is located in the previous image frame of the current image frame, and forming a target face image corresponding to the previous image frame;
determining a search area of a target face object from the current image frame according to the position information of the target face object in the previous image frame of the current image frame, extracting the image information of the search area, and forming a search area image corresponding to the current image frame;
inputting a target face image corresponding to the previous image frame and a search area image corresponding to the current image frame into a face region positioning model, and outputting position information of a target face object in the search area;
the face region positioning model is trained based on a sample video sequence marked with the position information of a face object in each image frame;
the method further comprises: extracting image information of the area in the current image frame indicated by the position information of the target face object in the search area, and forming a target face image corresponding to the current image frame; comparing, based on face features of the target face image, whether the target face image corresponding to the previous image frame is consistent with the target face image corresponding to the current image frame, comprising: inputting the target face image corresponding to the previous image frame and the target face image corresponding to the current image frame into a face image consistency comparison model to obtain a comparison result, wherein the face image consistency comparison model is constructed based on a convolutional neural network and is trained based on sample face image pairs labeled with consistency comparison results; and determining whether the position information of the target face object in the search area is accurate according to the comparison result;
wherein determining whether the position information of the target face object in the search area is accurate according to the comparison result comprises: if the comparison result indicates that the target face image corresponding to the previous image frame is inconsistent with the target face image corresponding to the current image frame, determining that the position information of the target face object in the search area is inaccurate, re-determining the search area of the target face object from the current image frame, and positioning the target face object in the search area.
2. The method of claim 1, wherein the method further comprises:
and mapping the position information of the target face object in the search area to the current image frame to obtain the position information of the target face object in the current image frame.
3. The method according to claim 1, wherein the determining the search area of the target face object from the current image frame according to the position information of the target face object in the image frame previous to the current image frame comprises:
determining, in the current image frame, a search area that at least covers the image area indicated by the position information of the target face object in the previous image frame of the current image frame, with the center of the position indicated by that position information taken as the center of the search area.
4. The method of claim 1, wherein the face region localization model comprises a first convolutional neural network and a second convolutional neural network, and a fully-connected layer;
the step of inputting the target face image corresponding to the previous image frame and the search area image corresponding to the current image frame into a face area positioning model, and outputting the position information of the target face object in the search area comprises the following steps:
inputting the target face image corresponding to the previous image frame into the first convolutional neural network for feature extraction to obtain a first feature;
inputting the search area image corresponding to the current image frame into the second convolutional neural network for feature extraction to obtain a second feature;
and connecting the first feature and the second feature to generate a connection feature, inputting the connection feature into a full connection layer of the face region positioning model, and outputting the position information of the target face object in the search region.
5. A face tracking apparatus comprising:
a first extraction unit, configured to obtain location information of a target face object in a previous image frame of a current image frame in a video sequence, and extract image information of an area where the target face object indicated by the location information is located in the previous image frame of the current image frame, so as to form a target face image corresponding to the previous image frame;
a second extraction unit, configured to determine a search area of a target face object from the current image frame according to position information of the target face object in the image frame previous to the current image frame, and extract image information of the search area to form a search area image corresponding to the current image frame;
a positioning unit, configured to input the target face image corresponding to the previous image frame and the search area image corresponding to the current image frame into a face region positioning model, and to output the position information of the target face object in the search area;
the face region positioning model is trained based on a sample video sequence marked with the position information of a face object in each image frame;
the apparatus further comprises: a third extraction unit, configured to extract image information of the area in the current image frame indicated by the position information of the target face object in the search area, so as to form a target face image corresponding to the current image frame; a comparison unit, configured to compare, based on face features of the target face image, whether the target face image corresponding to the previous image frame is consistent with the target face image corresponding to the current image frame, including: inputting the target face image corresponding to the previous image frame and the target face image corresponding to the current image frame into a face image consistency comparison model to obtain a comparison result, wherein the face image consistency comparison model is constructed based on a convolutional neural network and is trained based on sample face image pairs labeled with consistency comparison results; and a judging unit, configured to determine whether the position information of the target face object in the search area is accurate according to the comparison result;
wherein the judging unit is further configured to: if the comparison result indicates that the target face image corresponding to the previous image frame is inconsistent with the target face image corresponding to the current image frame, determine that the position information of the target face object in the search area is inaccurate, re-determine the search area of the target face object from the current image frame, and position the target face object in the search area.
6. The apparatus of claim 5, wherein the apparatus further comprises:
and the mapping unit is used for mapping the position information of the target face object in the search area to the current image frame to obtain the position information of the target face object in the current image frame.
7. The apparatus of claim 5, wherein the second extraction unit is configured to determine the search area of the target face object from the current image frame in the following manner:
determining, in the current image frame, a search area that at least covers the image area indicated by the position information of the target face object in the previous image frame of the current image frame, with the center of the position indicated by that position information taken as the center of the search area.
8. The apparatus of claim 5, wherein the face region localization model comprises a first convolutional neural network and a second convolutional neural network, and a fully-connected layer;
the positioning unit is used for determining the position information of the target face object in the search area according to the following mode:
inputting the target face image corresponding to the previous image frame into the first convolutional neural network for feature extraction to obtain a first feature;
inputting the search area image corresponding to the current image frame into the second convolutional neural network for feature extraction to obtain a second feature;
and connecting the first feature and the second feature to generate a connection feature, inputting the connection feature into a full connection layer of the face region positioning model, and outputting the position information of the target face object in the search region.
9. An electronic device, comprising:
one or more processors;
storage means for storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-4.
10. A computer readable storage medium having stored thereon a computer program, wherein the program when executed by a processor implements the method of any of claims 1-4.
CN201810045567.9A 2018-01-17 2018-01-17 Face tracking method and device Active CN108256479B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810045567.9A CN108256479B (en) 2018-01-17 2018-01-17 Face tracking method and device

Publications (2)

Publication Number Publication Date
CN108256479A CN108256479A (en) 2018-07-06
CN108256479B (en) 2023-08-01

Family

ID=62740926

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810045567.9A Active CN108256479B (en) 2018-01-17 2018-01-17 Face tracking method and device

Country Status (1)

Country Link
CN (1) CN108256479B (en)

Families Citing this family (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108810620B (en) * 2018-07-18 2021-08-17 腾讯科技(深圳)有限公司 Method, device, equipment and storage medium for identifying key time points in video
CN110866428B (en) * 2018-08-28 2023-12-15 杭州海康威视数字技术股份有限公司 Target tracking method, device, electronic equipment and storage medium
CN110874547B (en) * 2018-08-30 2023-09-12 富士通株式会社 Method and apparatus for identifying objects from video
CN109344742B (en) 2018-09-14 2021-03-16 腾讯科技(深圳)有限公司 Feature point positioning method and device, storage medium and computer equipment
CN111241891B (en) * 2018-11-29 2024-04-30 中科视语(北京)科技有限公司 Face image cutting method and device and computer readable storage medium
CN111368593B (en) * 2018-12-25 2023-11-28 北京右划网络科技有限公司 Mosaic processing method and device, electronic equipment and storage medium
CN109829435B (en) * 2019-01-31 2023-04-25 深圳市商汤科技有限公司 Video image processing method, device and computer readable medium
CN110210285A (en) * 2019-04-16 2019-09-06 浙江大华技术股份有限公司 Face tracking method, face tracking device and computer storage medium
CN110334600A (en) * 2019-06-03 2019-10-15 武汉工程大学 A kind of multiple features fusion driver exception expression recognition method
CN110580444B (en) * 2019-06-28 2023-09-08 时进制(上海)技术有限公司 Human body detection method and device
CN110717918B (en) * 2019-10-11 2022-06-07 北京百度网讯科技有限公司 Pedestrian detection method and device
CN111161314B (en) * 2019-12-17 2024-03-12 中国科学院上海微系统与信息技术研究所 Target object position area determination method and device, electronic equipment and storage medium
CN111311634B (en) * 2020-01-23 2024-02-27 支付宝实验室(新加坡)有限公司 Face image detection method, device and equipment
CN111400687B (en) * 2020-03-09 2024-02-09 京东科技控股股份有限公司 Authentication method, authentication device and robot
CN111723769B (en) * 2020-06-30 2023-10-27 北京百度网讯科技有限公司 Method, apparatus, device and storage medium for processing image
CN116935305A (en) * 2023-06-20 2023-10-24 联城科技(河北)股份有限公司 Intelligent security monitoring method, system, electronic equipment and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102067127A (en) * 2008-06-25 2011-05-18 索尼爱立信移动通讯有限公司 Method and apparatus for tagging images and providing notifications when images are tagged
CN102725756A (en) * 2010-01-25 2012-10-10 松下电器产业株式会社 Image sorting device, method, program, and integrated circuit and storage medium storing said program
CN103310204A (en) * 2013-06-28 2013-09-18 中国科学院自动化研究所 Feature and model mutual matching face tracking method based on increment principal component analysis
CN104463117A (en) * 2014-12-02 2015-03-25 苏州科达科技股份有限公司 Sample collection method and system used for face recognition and based on video
CN105138954A (en) * 2015-07-12 2015-12-09 上海微桥电子科技有限公司 Image automatic screening, query and identification system
CN105512627A (en) * 2015-12-03 2016-04-20 腾讯科技(深圳)有限公司 Key point positioning method and terminal

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8085982B1 (en) * 2008-06-20 2011-12-27 Google Inc. Object tracking in video with visual constraints
JP5106271B2 (en) * 2008-06-27 2012-12-26 キヤノン株式会社 Image processing apparatus, image processing method, and computer program
CN102722698B (en) * 2012-05-17 2014-03-12 上海中原电子技术工程有限公司 Method and system for detecting and tracking multi-pose face
CN103259962B (en) * 2013-04-17 2016-02-17 深圳市捷顺科技实业股份有限公司 A kind of target tracking method and relevant apparatus
CN103517042B (en) * 2013-10-17 2016-06-29 吉林大学 A kind of nursing house old man's hazardous act monitoring method
CN106709932B (en) * 2015-11-12 2020-12-04 创新先进技术有限公司 Face position tracking method and device and electronic equipment
CN105654512B (en) * 2015-12-29 2018-12-07 深圳微服机器人科技有限公司 A kind of method for tracking target and device
CN107578423B (en) * 2017-09-15 2020-04-28 杭州电子科技大学 Multi-feature hierarchical fusion related filtering robust tracking method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Application of face recognition technology in open entrances and exits; Wang Ziheng; Mu Sen; Chang Qing; Zhao Qian; 电子世界 (Electronic World), Issue 14, pp. 89-90 *

Also Published As

Publication number Publication date
CN108256479A (en) 2018-07-06

Similar Documents

Publication Publication Date Title
CN108256479B (en) Face tracking method and device
CN109961009B (en) Pedestrian detection method, system, device and storage medium based on deep learning
US10885365B2 (en) Method and apparatus for detecting object keypoint, and electronic device
US10762376B2 (en) Method and apparatus for detecting text
CN109584276B (en) Key point detection method, device, equipment and readable medium
US11244435B2 (en) Method and apparatus for generating vehicle damage information
US10936911B2 (en) Logo detection
CN114902294B (en) Fine-grained visual recognition in mobile augmented reality
CN108985259B (en) Human body action recognition method and device
CN111325796B (en) Method and apparatus for determining pose of vision equipment
CN108898086B (en) Video image processing method and device, computer readable medium and electronic equipment
WO2022213879A1 (en) Target object detection method and apparatus, and computer device and storage medium
US10204423B2 (en) Visual odometry using object priors
WO2019001481A1 (en) Vehicle appearance feature identification and vehicle search method and apparatus, storage medium, and electronic device
CN109858333B (en) Image processing method, image processing device, electronic equipment and computer readable medium
CN108734185B (en) Image verification method and device
CN108229418B (en) Human body key point detection method and apparatus, electronic device, storage medium, and program
CN110222641B (en) Method and apparatus for recognizing image
CN111444744A (en) Living body detection method, living body detection device, and storage medium
CN108229494B (en) Network training method, processing method, device, storage medium and electronic equipment
CN111767750A (en) Image processing method and device
CN110660102A (en) Speaker recognition method, device and system based on artificial intelligence
CN110598771A (en) Visual target identification method and device based on deep semantic segmentation network
CN108875506B (en) Face shape point tracking method, device and system and storage medium
CN113781493A (en) Image processing method, image processing apparatus, electronic device, medium, and computer program product

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant