CN106919918B - Face tracking method and device - Google Patents


Info

Publication number
CN106919918B
Authority
CN
China
Prior art keywords
face
network model
network
current frame
coordinates
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710108748.7A
Other languages
Chinese (zh)
Other versions
CN106919918A (en)
Inventor
赵凌
李季檩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shanghai Co Ltd
Original Assignee
Tencent Technology Shanghai Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shanghai Co Ltd
Priority to CN201710108748.7A
Publication of CN106919918A
Priority to PCT/CN2018/076238 (published as WO2018153294A1)
Application granted
Publication of CN106919918B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G06V 20/48 Matching video sequences
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/161 Detection; Localisation; Normalisation

Abstract

Embodiments of the invention disclose a face tracking method and a face tracking apparatus. When face tracking needs to be performed on a video stream, a corresponding deep learning network model is acquired, memory resources are allocated to the network model so that all layers of the network model share the same storage space, and the video stream is then processed based on the allocated memory resources and the network model to achieve real-time face tracking. Because all layers of the network model share the same storage space, there is no need to allocate an independent storage space to every layer of the network model, which saves memory, improves computational efficiency, reduces memory fragmentation and improves application performance.

Description

Face tracking method and device
Technical Field
The invention relates to the technical field of communication, in particular to a face tracking method and device.
Background
In recent years, face tracking technology has developed rapidly, and in many fields, such as video conferencing and remote teaching, a specific face needs to be tracked and analyzed.
In the prior art, there are various face tracking techniques, and deep learning forward prediction is one of them. In deep learning forward prediction, different network models need to be established for different application fields, and the depth of a network model varies with the complexity of the problem to be solved; for example, a problem of higher complexity generally requires a network model with more layers, and so on. On a personal computer (PC), each layer of a network model needs an exclusive storage area, which may be set through a configuration file: when allocating storage resources, the configuration file is read, the size of the storage space of the current layer is calculated, and the storage space is allocated to that layer, and so on. The storage areas of the layers are allocated independently, and no memory is shared between them.
In the research and practice of the prior art, the inventors found that in the existing scheme each layer of a network model monopolizes a section of storage, so the total memory required is large; on a platform with limited storage this reduces computing performance and may even prevent the algorithm from running. Furthermore, since the allocation operation occurs many times, memory fragmentation is more likely to form, resulting in a decrease in application performance.
Disclosure of Invention
The embodiment of the invention provides a face tracking method and a face tracking device, which can not only save the occupation of a memory and improve the calculation efficiency, but also reduce the storage fragments and improve the performance of an application program.
The embodiment of the invention provides a face tracking method, which comprises the following steps:
acquiring a video stream needing face tracking and a network model for deep learning;
allocating memory resources for the network model so that all layers of the network model share the same storage space;
and tracking the face in the video stream based on the allocated memory resources and the network model.
Correspondingly, an embodiment of the present invention further provides a face tracking apparatus, including:
an acquisition unit, configured to acquire a video stream requiring face tracking and a deep learning network model;
the allocation unit is used for allocating memory resources to the network model so that all layers of the network model share the same storage space;
and the tracking unit is used for tracking the face in the video stream based on the allocated memory resources and the network model.
When face tracking needs to be performed on a video stream through deep learning, embodiments of the invention can obtain a corresponding deep learning network model, allocate memory resources to the network model so that all layers of the network model share the same storage space, and then process the video stream based on the allocated memory resources and the network model to achieve real-time face tracking. Because all layers of the network model share the same storage space, there is no need to allocate an independent storage space to each layer, which greatly reduces memory usage and improves computational efficiency; in addition, since allocation is performed only once, the number of allocation operations is greatly reduced, memory fragmentation is reduced, and application performance is improved.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed to be used in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
Fig. 1a is a schematic view of a scene of a face tracking method according to an embodiment of the present invention;
FIG. 1b is a flowchart of a face tracking method according to an embodiment of the present invention;
fig. 1c is a schematic diagram of memory allocation in the face tracking method according to the embodiment of the present invention;
fig. 1d is a schematic diagram illustrating the use of a memory space in the face tracking method according to the embodiment of the present invention;
FIG. 2a is another flowchart of a face tracking method according to an embodiment of the present invention;
fig. 2b is a schematic diagram of each layer of a network model in the face tracking method according to the embodiment of the present invention;
FIG. 3 is a schematic structural diagram of a face tracking apparatus according to an embodiment of the present invention;
fig. 4 is a schematic structural diagram of a terminal according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The embodiment of the invention provides a face tracking method and device.
For example, referring to fig. 1a, when the mobile terminal needs to perform face tracking on a video stream, the mobile terminal may obtain a corresponding deep learning network model, and allocate memory resources to the network model at one time, so that all layers of the network model share the same storage space, for example, a storage space required by each layer of the network model may be calculated, and a maximum value of the storage space is selected as a size of a pre-allocated storage space, and accordingly, memory resources are allocated to the network model, and so on.
The following are detailed descriptions. The numbers in the following examples are not intended to limit the order of preference of the examples.
Embodiment I
The embodiment will be described from the perspective of a face tracking device, where the face tracking device may be specifically integrated in a mobile terminal and the like, and the mobile terminal may include a mobile phone, a tablet computer, or an intelligent wearable device.
A face tracking method, comprising: the method comprises the steps of obtaining a video stream needing face tracking and a network model for deep learning, allocating memory resources for the network model, enabling all layers of the network model to share the same storage space, and tracking the face in the video stream based on the allocated memory resources and the network model.
As shown in fig. 1b, the specific flow of the face tracking method may be as follows:
101. and acquiring a video stream needing face tracking and a network model for deep learning.
For example, video streams, deep-learning network models, and the like may be obtained from local or other storage devices.
The network model may be set according to the requirements of the actual application, and is not described herein again.
102. Allocating memory resources to the network model, so that all layers of the network model share the same storage space, for example, the following may be specifically used:
(1) And calculating the storage space required by each layer of the network model.
For example, a configuration file of a network model may be obtained, and the storage space required by each layer of the network model may be calculated according to the configuration file. For example, the following may be specifically mentioned:
firstly, the configuration file of the network model is read; secondly, the number of parameters of each network layer is calculated according to the configuration file, and the sizes of the input (Bottom) Blob, the output (Top) Blob and the Blob that the layer needs to create temporarily (the temporary Blob) are obtained for each layer; finally, the storage space required by the layer is calculated from the Bottom Blob, the Top Blob and the temporary Blob, i.e., the size of the A + B + C area shown in Fig. 1c, where area A serves as the input area of the layer and the output area of the previous or next layer, area B is the temporary area of the layer, and area C serves as the output area of the layer and the input area of the previous or next layer.
It should be noted that, for convenience of description, in the embodiments of the invention the Bottom Blob is referred to as the input area, the Top Blob as the output area, and the temporary Blob as the temporary area. A Blob is the storage unit of the deep network model: a four-dimensional matrix together with the size of each of its dimensions.
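For illustration only, the per-layer calculation described above can be sketched in Python as follows. The configuration structure, field names and Blob shapes below are assumptions made for the example; the patent does not prescribe a concrete configuration format.

```python
# Illustrative sketch (not the patented implementation) of computing the
# storage one layer needs: input (Bottom) Blob + temporary Blob + output
# (Top) Blob, i.e. the A + B + C area described above.

def blob_size(shape, bytes_per_element=4):
    """Bytes occupied by a 4-D Blob given as (num, channels, height, width)."""
    n, c, h, w = shape
    return n * c * h * w * bytes_per_element

def layer_storage(layer_cfg):
    """Storage (bytes) required by one layer: Bottom + temporary + Top Blobs."""
    bottom = sum(blob_size(s) for s in layer_cfg["bottom_shapes"])        # area A
    temp = sum(blob_size(s) for s in layer_cfg.get("temp_shapes", []))    # area B
    top = sum(blob_size(s) for s in layer_cfg["top_shapes"])              # area C
    return bottom + temp + top

# Hypothetical parsed configuration of a small three-layer network.
config = [
    {"name": "conv1", "bottom_shapes": [(1, 3, 64, 64)], "top_shapes": [(1, 16, 62, 62)]},
    {"name": "pool1", "bottom_shapes": [(1, 16, 62, 62)], "top_shapes": [(1, 16, 31, 31)]},
    {"name": "conv2", "bottom_shapes": [(1, 16, 31, 31)], "top_shapes": [(1, 32, 29, 29)],
     "temp_shapes": [(1, 1, 16 * 3 * 3, 29 * 29)]},       # e.g. an im2col-style buffer
]

per_layer = {cfg["name"]: layer_storage(cfg) for cfg in config}
```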
(2) And taking the maximum value of the storage space required by each layer as the size of the pre-allocated storage space.
For example, taking a network model with six layers as an example, if the storage space required by the fifth layer is the largest, the storage space required by the fifth layer is taken as the size of the pre-allocated storage space, and so on.
(3) And allocating memory resources for the network model according to the size of the pre-allocated storage space.
That is, only the memory resource with the size of the pre-allocated storage space needs to be allocated to the network model at one time, and other spaces do not need to be allocated during forward calculation.
The memory allocation process during forward calculation may be as shown in Fig. 1d. Assume that area A currently stores the input-area (Bottom Blob) data of the nth layer, area B stores the temporary data required by that layer, and area C stores the output-area (Top Blob) data obtained by calculation. After the nth layer has computed its output, the pointer of the output area (Top Blob) is assigned to the input area (Bottom Blob) of the (n+1)th layer, and the pointer of area A is assigned to the output area (Top Blob) of the (n+1)th layer to store the output of the (n+1)th layer, while area B continues to store the temporary data of the (n+1)th layer. Repeating this process completes the calculation of the whole forward network. No other data copying or transfer occurs in this process, and the pointer assignments can even be completed in the preprocessing stage.
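Continuing the illustrative sketch above, the one-time allocation and the pointer swapping between consecutive layers might look as follows; the fixed split of the buffer and the per-layer forward call are simplifying assumptions, not the patented implementation.

```python
import numpy as np

# One-time allocation: the shared buffer is sized for the most demanding layer.
total = max(per_layer.values())            # size of the A + B + C region
buffer = np.empty(total, dtype=np.uint8)   # allocated once, reused by every layer

def run_forward(layers, buffer, a_size, b_size):
    """Run every layer inside the single pre-allocated buffer.

    area_a = input (Bottom Blob), area_b = temporary data, area_c = output
    (Top Blob).  After a layer finishes, the A and C views are swapped rather
    than copied, so layer n's output becomes layer n+1's input in place.
    The fixed a_size/b_size split is a simplification for illustration.
    """
    area_a = buffer[:a_size]
    area_b = buffer[a_size:a_size + b_size]
    area_c = buffer[a_size + b_size:]
    for layer in layers:
        layer.forward(area_a, area_b, area_c)   # hypothetical per-layer computation
        area_a, area_c = area_c, area_a         # pointer swap, no data copy
    return area_a                               # holds the final network output
```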
103. Tracking the face in the video stream based on the allocated memory resources and the network model; for example, the following may be specifically mentioned:
(1) And determining the current image to be processed according to the video stream to obtain the current frame.
(2) And acquiring the coordinates and the confidence coefficient of the key points of the face of the previous frame of image of the current frame.
The face key points refer to information that can reflect face features, such as the eyes, eyebrows, nose, mouth and outer contour of the face. The face key point coordinates are the coordinates of these face key points, and can be represented by an array (x1, y1, x2, y2, …, xn, yn), where (xi, yi) denotes the coordinates of the ith point.
(3) And predicting the coordinates and the confidence coefficient of the face key point of the current frame based on the allocated memory resources, the network model and the coordinates and the confidence coefficient of the face key point of the previous frame of image, and returning to the step of determining the current image to be processed according to the video stream until the images in the video stream are processed completely.
The method for predicting the face keypoint coordinates and the confidence coefficient of the current frame based on the allocated memory resources, the network model, and the face keypoint coordinates and the confidence coefficient of the previous frame of image may be various, and for example, the method may specifically include the following steps:
and when the confidence coefficient of the previous frame of image is determined to be larger than a preset threshold value, calculating the coordinates of the key points of the face of the previous frame of image through the network model by using the allocated memory resources to obtain a calculation result, predicting the coordinates of the key points of the face of the current frame according to the calculation result, and calculating the confidence coefficient of the current frame.
The preset threshold may be set according to the requirements of practical applications, and is not described herein again.
For example, taking the case that the network model includes a public network portion, a key point prediction branch, and a confidence prediction branch as an example, the steps "calculating the coordinates of the key points of the face of the previous frame image through the network model to obtain a calculation result, predicting the coordinates of the key points of the face of the current frame according to the calculation result, and calculating the confidence of the current frame" may specifically be as follows:
calculating the coordinates of the face key points of the previous frame of image through the public network part to obtain a calculation result; processing the calculation result through the key point prediction branch to obtain the face key point coordinates of the current frame; and processing the calculation result through the confidence prediction branch to obtain the confidence of the current frame, and so on.
It should be noted that if the confidence (i.e., the confidence of the face key point coordinates of the previous frame of image) is not higher than the preset threshold (including equal to it), the reference value of the previous frame's face key point coordinates is low, and the face key point coordinates of the current frame should instead be obtained by detection. Similarly, if the face key point coordinates and confidence of the previous frame cannot be obtained, for example because the current frame is the first frame of the video stream, the face key point coordinates of the current frame are also obtained by detection. That is, optionally, after the step "allocating memory resources to the network model", the face tracking method may further include:
and when the coordinates and the confidence coefficient of the key points of the face of the previous frame of image of the current frame cannot be obtained, or the confidence coefficient of the previous frame of image is determined to be less than or equal to a preset threshold value, detecting the face of the current frame by a face detection algorithm based on the allocated memory resources so as to determine the coordinates and the confidence coefficient of the key points of the face of the current frame.
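The switching logic between tracking and detection described above can be summarised by the following illustrative sketch; detect_face, track_face and the threshold value are placeholders standing in for the face detection algorithm and the network-model prediction, not APIs defined by the patent.

```python
CONFIDENCE_THRESHOLD = 0.5  # illustrative value; the patent leaves the threshold application-defined

def process_stream(frames, detect_face, track_face):
    """Per-frame loop: track from the previous result when it is trustworthy,
    otherwise fall back to detection (including for the very first frame)."""
    prev_keypoints, prev_confidence = None, None
    results = []
    for frame in frames:
        if prev_keypoints is None or prev_confidence <= CONFIDENCE_THRESHOLD:
            # No usable previous result: detect the face region and locate
            # the key points directly in the current frame.
            keypoints, confidence = detect_face(frame)
        else:
            # Previous result is reliable: predict this frame's key points
            # from the previous frame's key points with the network model.
            keypoints, confidence = track_face(frame, prev_keypoints)
        results.append((keypoints, confidence))
        prev_keypoints, prev_confidence = keypoints, confidence
    return results
```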
The detection mode may be various, for example, the following modes may be adopted:
A. based on the allocated memory resources, the face region of the current frame is determined through a face detection algorithm, which may be, for example, as follows:
based on the allocated memory resources, face features in the current frame are obtained by calculating an image integral graph, a strong classifier for distinguishing faces from non-faces is constructed according to the face features, and the current frame is then processed with the strong classifier to obtain the face region of the current frame.
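As background for the image integral graph mentioned above, the following short sketch shows how an integral image makes the sum over any rectangle, the building block of Haar-like face features, available in constant time; this is standard material provided for illustration, not code from the patent.

```python
import numpy as np

def integral_image(gray):
    """Integral image with a zero border: ii[y, x] = sum of gray[0:y, 0:x]."""
    ii = np.zeros((gray.shape[0] + 1, gray.shape[1] + 1), dtype=np.int64)
    ii[1:, 1:] = np.cumsum(np.cumsum(gray.astype(np.int64), axis=0), axis=1)
    return ii

def rect_sum(ii, y, x, h, w):
    """Sum of the pixels in the h x w rectangle whose top-left corner is (y, x)."""
    return ii[y + h, x + w] - ii[y, x + w] - ii[y + h, x] + ii[y, x]
```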
In order to improve the accuracy of face detection, the Adaboost algorithm can be used to construct the strong classifiers for distinguishing faces from non-faces, and multiple strong classifiers can be cascaded into the same system.
Adaboost is an iterative algorithm whose core idea is to train different classifiers (weak classifiers) on the same training set and then combine these weak classifiers into a stronger final classifier (a strong classifier).
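A compact illustration of that idea, reweighting the training set so that each new weak classifier focuses on the previous mistakes and then combining the weak classifiers with weighted votes, is given below using decision stumps; this is generic AdaBoost for reference, not the patent's detector.

```python
import numpy as np

def train_adaboost(X, y, n_rounds=20):
    """Generic AdaBoost with decision stumps.
    X: (n, d) feature matrix, y: labels in {-1, +1} (e.g. face / non-face)."""
    n = len(y)
    weights = np.full(n, 1.0 / n)
    stumps = []
    for _ in range(n_rounds):
        best = None
        # Pick the stump (feature, threshold, polarity) with the lowest weighted error.
        for f in range(X.shape[1]):
            for thr in np.unique(X[:, f]):
                for polarity in (1, -1):
                    pred = np.where(polarity * (X[:, f] - thr) >= 0, 1, -1)
                    err = np.sum(weights[pred != y])
                    if best is None or err < best[0]:
                        best = (err, f, thr, polarity, pred)
        err, f, thr, polarity, pred = best
        err = max(err, 1e-10)
        alpha = 0.5 * np.log((1.0 - err) / err)   # weight of this weak classifier's vote
        weights *= np.exp(-alpha * y * pred)      # emphasise misclassified samples
        weights /= weights.sum()
        stumps.append((alpha, f, thr, polarity))
    return stumps

def predict_adaboost(stumps, X):
    """Strong classifier: sign of the weighted sum of weak-classifier votes."""
    score = np.zeros(len(X))
    for alpha, f, thr, polarity in stumps:
        score += alpha * np.where(polarity * (X[:, f] - thr) >= 0, 1, -1)
    return np.sign(score)
```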
B. And predicting the positions of facial features in the facial region through the network model to obtain the coordinates and confidence of the facial key points of the current frame.
As can be seen from the above, when the face tracking needs to be performed on the video stream, the embodiment may obtain a corresponding deep learning network model, allocate memory resources to the network model, so that all layers of the network model share the same storage space, and then process the video stream based on the allocated memory resources and the network model to implement the real-time face tracking; in the scheme, all layers of the network model can share the same storage space, so that an independent storage space does not need to be allocated to each layer of the network model, the occupation of a memory can be greatly saved, the calculation efficiency is improved, and in addition, only one-time allocation is needed, so that the frequency of allocation operation can be greatly reduced, the storage fragments are reduced, and the application program performance is favorably improved.
Embodiment II
The method described in the first embodiment is further illustrated by way of example.
In this embodiment, an example that the face tracking apparatus can be specifically integrated in a mobile terminal will be described.
As shown in fig. 2a, a specific process of a face tracking method may be as follows:
201. the mobile terminal obtains a video stream.
For example, the mobile terminal may specifically receive a video stream sent by another device, or obtain the video stream from a local storage space, and so on.
202. The mobile terminal obtains a deep learning network model.
The network model may be set according to the requirements of the actual application. For example, the network model may include three parts: a public network part, followed by two branches derived from it, namely a key point prediction branch and a confidence prediction branch. The layers of each part may be determined as required; for example, referring to Fig. 2b, the layers of each part may be as follows:
The public network part may comprise 6 convolution layers, such as convolutional layer 1 to convolutional layer 6, each followed by a rectified linear unit (ReLU) activation function (referred to as a nonlinear activation function for short) and possibly by a pooling layer, which is a layer used for aggregation, as shown in Fig. 2b.
The key point prediction branch may comprise 1 convolutional layer and 3 inner product layers; for example, referring to Fig. 2b, it may comprise convolutional layer 7, inner product layer 1, inner product layer 2 and inner product layer 3, each followed by a nonlinear activation function.
The confidence prediction branch may include 1 convolutional layer (convolutional layer 8), 5 inner product layers (inner product layers 4 to 8) and 1 softmax layer, where the softmax layer outputs two values, the face probability and the non-face probability, whose sum is 1.0. In addition, each convolutional layer and every two inner product layers may be followed by a nonlinear activation function.
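For a rough illustration of this three-part structure (shared public network, key point branch, confidence branch), a PyTorch-style definition is sketched below; the channel counts, kernel sizes, layer widths and the 64x64 input size are invented for the example, since the patent specifies the layer types and counts but not these hyper-parameters.

```python
import torch
import torch.nn as nn

class FaceTrackNet(nn.Module):
    """Sketch of the described structure: a shared trunk of 6 conv layers,
    a key point branch (1 conv + 3 inner product layers) and a confidence
    branch (1 conv + 5 inner product layers + softmax).  All sizes are
    illustrative assumptions."""
    def __init__(self, num_keypoints=5):
        super().__init__()
        chans = [3, 16, 24, 32, 48, 64, 80]
        trunk = []
        for i in range(6):                       # convolutional layers 1..6, each with ReLU
            trunk += [nn.Conv2d(chans[i], chans[i + 1], 3, padding=1), nn.ReLU()]
            if i < 3:                            # a few pooling layers for aggregation
                trunk.append(nn.MaxPool2d(2))
        self.trunk = nn.Sequential(*trunk)       # 64x64 input -> 80 x 8 x 8 features

        self.kpt_branch = nn.Sequential(         # convolutional layer 7 + inner product layers 1..3
            nn.Conv2d(80, 80, 3, padding=1), nn.ReLU(), nn.Flatten(),
            nn.Linear(80 * 8 * 8, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, 2 * num_keypoints),   # (x, y) per key point
        )
        self.conf_branch = nn.Sequential(        # convolutional layer 8 + inner product layers 4..8 + softmax
            nn.Conv2d(80, 40, 3, padding=1), nn.ReLU(), nn.Flatten(),
            nn.Linear(40 * 8 * 8, 64), nn.ReLU(),
            nn.Linear(64, 64),
            nn.Linear(64, 32), nn.ReLU(),
            nn.Linear(32, 32),
            nn.Linear(32, 2),                    # face / non-face probabilities
            nn.Softmax(dim=1),
        )

    def forward(self, x):                        # x: cropped face image, e.g. (N, 3, 64, 64)
        feat = self.trunk(x)
        return self.kpt_branch(feat), self.conf_branch(feat)
```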
203. The mobile terminal calculates the storage space required by each layer of the network model.
For example, the configuration file of the network model may be read, the number of parameters of each network layer is calculated according to the configuration file, the sizes of the input area, the output area, and the temporary area of each layer network are obtained, and then, the size of the storage space required by the layer, that is, the size of the A + B + C area shown in Fig. 1c, may be calculated according to the sizes of the input area, the output area, and the temporary area.
204. And the mobile terminal takes the maximum value of the storage space required by each layer as the size of the pre-allocated storage space and allocates memory resources for the network model according to the size of the pre-allocated storage space.
The memory allocation process during forward calculation may be as shown in Fig. 1d. If area A currently stores the input-area data of the nth layer, area B stores the temporary data required by the current layer, and area C stores the computed output-area data, then after the nth layer has computed its output, the pointer of the output area can be assigned to the input area of the (n+1)th layer, and the pointer of area A can be assigned to the output area of the (n+1)th layer to store the output of the (n+1)th layer, while area B continues to store the temporary data of the (n+1)th layer. After the (n+1)th layer finishes processing, n is updated to n+1 and the process is repeated, so that the calculation of the whole forward network can be completed. For example, taking an initial value of n equal to 1, the process may be as follows:
After layer 1 has computed its output, the pointer of the output area of layer 1 (i.e., area C of layer 1) is assigned to the input area of layer 2, and the pointer of area A is assigned to the output area of layer 2 to store the output of layer 2, while area B also stores the temporary data of layer 2. Similarly, after layer 2 has computed its output, the pointer of the output area of layer 2 (area C of layer 2, which is also area A of layer 1) is assigned to the input area of layer 3, and the pointer of the input area of layer 2 (which is area C of layer 1) is assigned to the output area of layer 3 to store the output of layer 3, while area B also stores the temporary data of layer 3, and so on.
No other data copying or transfer occurs in this process, and the pointer assignments can even be completed in the preprocessing stage.
This calculation exploits a characteristic of deep learning: the calculation of the (n+1)th layer only needs the input area of the (n+1)th layer (i.e., the output area of the nth layer) and the output area of the (n+1)th layer, and no longer needs the input area of the nth layer, so the memory occupied by the input area of the nth layer can be recycled. In other words, the operations of all layers are performed within the pre-allocated "A + B + C" memory area, so no matter how deep the network is, the required memory depends only on the single most demanding layer. The occupation of memory resources is therefore reduced, making it possible to run a complex deep network on a mobile terminal platform. In addition, since the computation only involves pointer assignments in memory, it is very fast and efficient.
205. And the mobile terminal determines the current image to be processed according to the video stream to obtain the current frame.
206. The mobile terminal obtains the coordinates and confidence of the key points of the face of the previous frame of image of the current frame, and then executes step 207.
The face key points refer to information that can reflect face features, such as eyes, eyebrows, nose, mouth, and outer contour of the face. The face key point coordinates refer to coordinates of these face key points.
It should be noted that, if the face keypoint coordinates and the confidence of the previous frame image of the current frame are not obtained, for example, the current frame is the first frame of the video stream, the face keypoint coordinates and the confidence of the current frame may be obtained through detection, that is, step 208 is executed.
207. The mobile terminal determines whether the confidence of the face key point coordinates of the previous frame of image is higher than a preset threshold; if so, the face key point tracking is successful and step 209 is executed; otherwise, the face key point tracking fails and step 208 is executed.
The preset threshold may be set according to the requirements of practical applications, and is not described herein again.
208. The mobile terminal detects the face in the current frame through a face detection algorithm based on the allocated memory resources to determine the coordinates and confidence of the key points of the face in the current frame, and then executes step 210.
The detection mode may be various, for example, the following modes may be adopted:
(1) The mobile terminal determines the face region of the current frame through a face detection algorithm based on the allocated memory resources, for example, the following may be used:
the mobile terminal obtains the face features in the current frame by calculating an image integral graph based on the allocated memory resources, constructs a strong classifier for distinguishing faces from non-faces according to the face features, and then processes the current frame with the strong classifier to obtain the face region of the current frame.
In order to improve the accuracy of face detection, the Adaboost algorithm can be used to construct the strong classifiers for distinguishing faces from non-faces, and multiple strong classifiers can be cascaded into the same system.
(2) And the mobile terminal predicts the positions of the facial features in the facial region through the network model to obtain the coordinates and confidence of the facial key points of the current frame.
In order to reduce the calculation time and save the calculation resources, the calculation of the face key point coordinates and the confidence may be synchronous.
209. The mobile terminal calculates the coordinates of the key points of the face of the previous frame of image through the network model by using the allocated memory resources to obtain a calculation result, predicts the coordinates of the key points of the face of the current frame according to the calculation result, calculates the confidence of the current frame, and then executes step 210.
For example, the mobile terminal may calculate the coordinates of key points of the face of the previous frame of image through the public network portion of the network model to obtain a calculation result, then process the calculation result through the key point prediction branch to obtain the coordinates of key points of the face of the current frame, and process the calculation result through the confidence prediction branch to obtain the confidence of the current frame, and so on.
For example, an envelope frame of coordinates of key points of a face of the previous frame of image may be calculated by a public network portion of the network model, and then, on one hand, the coordinates of key points of the face of the current frame are obtained by calculating positions of key points of the face in the current frame according to the envelope frame through the key point prediction branch, and on the other hand, accuracy of face recognition, such as whether the image in the envelope frame is a face, is analyzed through the confidence prediction branch, and then, the confidence of the current frame is calculated according to the analysis result.
In order to reduce the computation time and save the computation resources, the computation of the face key point coordinates and the confidence level may be synchronized, that is, the processing of the key point prediction branch and the confidence level prediction branch may be parallel.
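A hypothetical tracking step built on the network sketch above could proceed as follows: compute the envelope frame of the previous frame's key points, crop the current frame around it, run the public network part once, and read the key point and confidence outputs from the same shared features. The crop margin, resizing and coordinate mapping below are assumptions made for the example.

```python
import numpy as np
import torch

def track_step(model, frame, prev_keypoints, margin=0.2, input_size=64):
    """One illustrative tracking step.  frame: HxWx3 uint8 image,
    prev_keypoints: (n, 2) array of the previous frame's key point coordinates."""
    # Envelope frame (bounding box) of the previous key points, enlarged by a margin.
    x0, y0 = prev_keypoints.min(axis=0)
    x1, y1 = prev_keypoints.max(axis=0)
    w, h = x1 - x0, y1 - y0
    x0, y0 = max(0, int(x0 - margin * w)), max(0, int(y0 - margin * h))
    x1, y1 = min(frame.shape[1], int(x1 + margin * w)), min(frame.shape[0], int(y1 + margin * h))

    crop = frame[y0:y1, x0:x1]
    # Resize to the network input size (nearest-neighbour, for brevity).
    ys = np.linspace(0, crop.shape[0] - 1, input_size).astype(int)
    xs = np.linspace(0, crop.shape[1] - 1, input_size).astype(int)
    patch = crop[ys][:, xs].astype(np.float32) / 255.0

    x = torch.from_numpy(patch).permute(2, 0, 1).unsqueeze(0)  # (1, 3, H, W)
    with torch.no_grad():
        kpts, conf = model(x)        # the trunk runs once; both branches share its result
    kpts = kpts.view(-1, 2).numpy()
    # Map the branch output back to pixel coordinates of the full frame
    # (assuming the branch predicts coordinates relative to the crop, in [0, 1]).
    kpts[:, 0] = x0 + kpts[:, 0] * (x1 - x0)
    kpts[:, 1] = y0 + kpts[:, 1] * (y1 - y0)
    confidence = conf[0, 0].item()   # assumed: first softmax output = face probability
    return kpts, confidence
```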
210. The mobile terminal determines whether all the images in the video stream are completely identified, if yes, the process is ended, otherwise, the process returns to execute step 205.
Namely, the coordinates and confidence of the key points of the face of the current frame are used as a reference for face tracking of the next frame image, and the process is circulated until all the images in the video stream are recognized.
As can be seen from the above, when the face tracking needs to be performed on the video stream, the embodiment may obtain a corresponding deep learning network model, allocate memory resources to the network model, so that all layers of the network model share the same storage space, and then process the video stream based on the allocated memory resources and the network model, so as to complete the real-time face tracking in the mobile terminal. On one hand, in the scheme, all layers of the network model can share the same storage space, so that an independent storage space does not need to be allocated to each layer of the network model, the occupation of a memory can be greatly saved, the calculation efficiency is improved, and in addition, the allocation operation times can be greatly reduced, the storage fragments are reduced, and the application program performance is favorably improved as the allocation is only needed once; on the other hand, when the face tracking is abnormal, for example, the confidence coefficient is less than or equal to the threshold value or the face key point coordinate and the confidence coefficient of the previous frame cannot be obtained, the tracking reset can be automatically performed (that is, the face key point coordinate and the confidence coefficient are obtained again in a detection mode), so that the continuity of the face tracking can be enhanced.
In addition, the scheme has less memory requirement and higher calculation efficiency, so the requirement on the performance of the equipment is lower, and the method is suitable for equipment such as a mobile terminal, and can track the face more efficiently and flexibly compared with the scheme of placing the deep learning forward algorithm at the server end, thereby being beneficial to improving the user experience.
Embodiment III
In order to better implement the above method, an embodiment of the present invention further provides a face tracking apparatus, as shown in fig. 3, the face tracking apparatus includes an obtaining unit 301, an allocating unit 302, and a tracking unit 303, as follows:
(1) An acquisition unit 301;
an obtaining unit 301, configured to obtain a video stream that needs to be face tracked and a deep learning network model.
For example, video streams, deep-learning network models, and the like may be obtained from local or other storage devices.
For example, the network model may include a public network portion, a key point prediction branch, a confidence prediction branch, and the like, which may be referred to in the foregoing method embodiments and will not be described herein again.
(2) A distribution unit 302;
an allocating unit 302 is configured to allocate memory resources to the network model, so that all layers of the network model share the same storage space.
For example, the allocation unit 302 may include a calculation subunit and an allocation subunit, as follows:
and the calculating subunit can be used for calculating the storage space required by each layer of the network model.
For example, the calculating subunit may be specifically configured to obtain a configuration file of a network model, and calculate, according to the configuration file, a storage space required by each layer network in the network model, for example, as follows:
the calculating subunit reads the configuration file of the network model, calculates the number of parameters of each network layer according to the configuration file, obtains the sizes of the input area, the output area, and the temporary area of each layer network, and then calculates the storage space required by the layer according to the sizes of the input area, the output area, and the temporary area, that is, the size of the a + B + C area shown in fig. 1C.
And the allocation subunit is configured to use a maximum value in the storage spaces required by the layers as a size of the pre-allocated storage space, and allocate the memory resource to the network model according to the size of the pre-allocated storage space.
(3) A tracking unit 303;
a tracking unit 303, configured to track a face in the video stream based on the allocated memory resource and the network model.
For example, the tracking unit 303 may include a determination subunit, a parameter acquisition subunit, and a prediction subunit, as follows:
the determining subunit is configured to determine, according to the video stream, an image that needs to be processed currently, to obtain a current frame;
and the parameter acquisition subunit is used for acquiring the coordinates and the confidence of the key points of the face of the previous frame of image of the current frame.
The face key points refer to information capable of reflecting features of the face, such as eyes, eyebrows, a nose, a mouth, an outer contour of the face, and the like. The face key point coordinates refer to coordinates of these face key points.
And the prediction subunit is used for predicting the coordinates and the confidence coefficient of the face key point of the current frame based on the allocated memory resources, the network model, the coordinates and the confidence coefficient of the face key point of the previous frame of image, and triggering the determination subunit to execute the operation of determining the current image to be processed according to the video stream until the images in the video stream are processed completely.
The method for predicting the face keypoint coordinates and the confidence coefficient of the current frame based on the allocated memory resources, the network model, and the face keypoint coordinates and the confidence coefficient of the previous frame of image may be various, and for example, the method may specifically include the following steps:
the predicting subunit may be specifically configured to, when it is determined that the confidence of the previous frame of image is greater than a preset threshold, calculate, by using the allocated memory resource, the face key point coordinates of the previous frame of image through the network model to obtain a calculation result, predict, according to the calculation result, the face key point coordinates of the current frame, and calculate the confidence of the current frame.
For example, taking the example that the network model includes a public network portion, a keypoint prediction branch, and a confidence prediction branch, the prediction subunit may be specifically configured to:
calculating the coordinates of the key points of the face of the previous frame of image through the public network part to obtain a calculation result; and processing the calculation result through the key point prediction branch to obtain the coordinates of the key point of the face of the current frame, and processing the calculation result through the confidence coefficient prediction branch to obtain the confidence coefficient of the current frame, and the like.
The preset threshold may be set according to the requirements of practical applications, and is not described herein again.
It should be noted that, if the confidence of the previous frame of the current frame is not higher than the preset threshold, it indicates that the reference value of the face key point coordinates of the previous frame is low, and therefore, the face key point coordinates in the current frame can be obtained by using a detection method at this time; similarly, if the face key point coordinates and confidence of the previous frame image of the current frame cannot be obtained, for example, the current frame is the first frame of the video stream, the face key point coordinates in the current frame may also be obtained in a detection manner, that is, optionally, the tracking unit 303 may further include a detection subunit, as follows:
the detection subunit may be configured to, when the coordinates and the confidence level of the face key point of the previous frame image of the current frame are not obtained, or when the confidence level of the previous frame image is determined to be less than or equal to a preset threshold, detect the face in the current frame by using a face detection algorithm based on the allocated memory resources, so as to determine the coordinates and the confidence level of the face key point of the current frame.
For example, the detection subunit may be specifically configured to determine a face region of the current frame by using a face detection algorithm based on allocated memory resources, and predict positions of facial features in the face region by using the network model to obtain coordinates and a confidence of a facial key point of the current frame.
In specific implementation, the above units may be implemented as independent entities respectively, or may be implemented as one or several entities by arbitrary combination, and specific implementations of the above units may refer to the foregoing method embodiments, which are not described herein again.
The face tracking device can be specifically integrated in a mobile terminal and other devices, and the mobile terminal can comprise a mobile phone, a tablet computer or intelligent wearable equipment and the like.
As can be seen from the above, in this embodiment, when a video stream needs to be subjected to face tracking, the obtaining unit 301 may obtain a corresponding deep learning network model, and the allocating unit 302 allocates memory resources to the network model, so that all layers of the network model share the same storage space, and then the tracking unit 303 processes the video stream based on the allocated memory resources and the network model, so as to implement real-time face tracking; in the scheme, all layers of the network model can share the same storage space, so that an independent storage space does not need to be allocated to each layer of the network model, the occupation of a memory can be greatly saved, the calculation efficiency is improved, and in addition, only one-time allocation is needed, so that the frequency of allocation operation can be greatly reduced, the storage fragments are reduced, and the application program performance is favorably improved.
Embodiment IV
Accordingly, as shown in fig. 4, the mobile terminal according to an embodiment of the present invention may include a Radio Frequency (RF) circuit 401, a memory 402 including one or more computer-readable storage media, an input unit 403, a display unit 404, a sensor 405, an audio circuit 406, a Wireless Fidelity (WiFi) module 407, a processor 408 including one or more processing cores, and a power supply 409. Those skilled in the art will appreciate that the mobile terminal architecture shown in fig. 4 is not intended to be limiting of mobile terminals and may include more or fewer components than shown, or a combination of certain components, or a different arrangement of components. Wherein:
the RF circuit 401 may be used for receiving and transmitting signals during a message transmission or communication process, and in particular, for receiving downlink information of a base station and then sending the received downlink information to the one or more processors 408 for processing; in addition, data relating to uplink is transmitted to the base station. In general, the RF circuitry 401 includes, but is not limited to, an antenna, at least one Amplifier, a tuner, one or more oscillators, a Subscriber Identity Module (SIM) card, a transceiver, a coupler, a Low Noise Amplifier (LNA), a duplexer, and the like. In addition, the RF circuitry 401 may also communicate with networks and other devices via wireless communications. The wireless communication may use any communication standard or protocol, including but not limited to Global System for Mobile communications (GSM), general Packet Radio Service (GPRS), code Division Multiple Access (CDMA), wideband Code Division Multiple Access (WCDMA), long Term Evolution (LTE), email, short Message Service (SMS), and the like.
The memory 402 may be used to store software programs and modules, and the processor 408 executes various functional applications and data processing by operating the software programs and modules stored in the memory 402. The memory 402 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like; the storage data area may store data (such as audio data, a phonebook, etc.) created according to the use of the mobile terminal, and the like. Further, the memory 402 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid state storage device. Accordingly, the memory 402 may also include a memory controller to provide the processor 408 and the input unit 403 access to the memory 402.
The input unit 403 may be used to receive input numeric or character information and generate keyboard, mouse, joystick, optical or trackball signal inputs related to user settings and function control. In particular, in a particular embodiment, the input unit 403 may include a touch-sensitive surface as well as other input devices. The touch-sensitive surface, also referred to as a touch display screen or a touch pad, may collect touch operations by a user (such as operations by the user on or near the touch-sensitive surface using a finger, a stylus, or any other suitable object or attachment) thereon or nearby, and drive the corresponding connection device according to a predetermined program. Alternatively, the touch sensitive surface may comprise two parts, a touch detection means and a touch controller. The touch detection device detects the touch direction of a user, detects a signal brought by touch operation and transmits the signal to the touch controller; the touch controller receives touch information from the touch sensing device, converts it to touch point coordinates, and sends the touch point coordinates to the processor 408, and can receive and execute commands sent from the processor 408. In addition, the touch sensitive surface can be implemented in various types, such as resistive, capacitive, infrared, and surface acoustic wave. The input unit 403 may include other input devices in addition to the touch-sensitive surface. In particular, other input devices may include, but are not limited to, one or more of a physical keyboard, function keys (such as volume control keys, switch keys, etc.), a trackball, a mouse, a joystick, and the like.
The display unit 404 may be used to display information input by or provided to the user and various graphical user interfaces of the mobile terminal, which may be made up of graphics, text, icons, video, and any combination thereof. The Display unit 404 may include a Display panel, and optionally, the Display panel may be configured in the form of a Liquid Crystal Display (LCD), an Organic Light-Emitting Diode (OLED), or the like. Further, the touch-sensitive surface may overlay the display panel, and when a touch operation is detected on or near the touch-sensitive surface, the touch operation may be transmitted to the processor 408 to determine the type of touch event, and the processor 408 may then provide a corresponding visual output on the display panel based on the type of touch event. Although in FIG. 4 the touch-sensitive surface and the display panel are shown as two separate components to implement input and output functions, in some embodiments the touch-sensitive surface may be integrated with the display panel to implement input and output functions.
The mobile terminal may also include at least one sensor 405, such as a light sensor, motion sensor, and other sensors. Specifically, the light sensor may include an ambient light sensor that may adjust the brightness of the display panel according to the brightness of ambient light, and a proximity sensor that may turn off the display panel and/or the backlight when the mobile terminal is moved to the ear. As one of the motion sensors, the gravity acceleration sensor can detect the magnitude of acceleration in each direction (generally three axes), can detect the magnitude and direction of gravity when stationary, and can be used for applications of identifying the gesture of a mobile phone (such as horizontal and vertical screen switching, related games, magnetometer gesture calibration), vibration identification related functions (such as pedometer and tapping), and the like; other sensors such as a gyroscope, a barometer, a hygrometer, a thermometer, and an infrared sensor, which can be configured on the mobile terminal, are not described herein again.
Audio circuitry 406, a speaker, and a microphone may provide an audio interface between the user and the mobile terminal. The audio circuit 406 may transmit the electrical signal converted from the received audio data to a speaker, and convert the electrical signal into a sound signal for output; on the other hand, the microphone converts the collected sound signal into an electrical signal, which is received by the audio circuit 406 and converted into audio data, which is then processed by the audio data output processor 408, either by the RF circuit 401 for transmission to, for example, another mobile terminal, or by outputting the audio data to the memory 402 for further processing. The audio circuitry 406 may also include an earbud jack to provide communication of a peripheral headset with the mobile terminal.
WiFi belongs to short distance wireless transmission technology, and the mobile terminal can help the user to send and receive e-mail, browse web page and access streaming media etc. through WiFi module 407, it provides wireless broadband internet access for the user. Although fig. 4 shows the WiFi module 407, it is understood that it does not belong to the essential constitution of the mobile terminal, and can be omitted entirely as needed within the scope not changing the essence of the invention.
The processor 408 is a control center of the mobile terminal, connects various parts of the entire handset using various interfaces and lines, performs various functions of the mobile terminal and processes data by operating or executing software programs and/or modules stored in the memory 402 and calling data stored in the memory 402. Optionally, processor 408 may include one or more processing cores; preferably, the processor 408 may integrate an application processor, which handles primarily the operating system, user interface, applications, etc., and a modem processor, which handles primarily wireless communications. It will be appreciated that the modem processor described above may not be integrated into the processor 408.
The mobile terminal also includes a power supply 409 (e.g., a battery) for powering the various components, which may preferably be logically coupled to the processor 408 via a power management system that may be configured to manage charging, discharging, and power consumption. The power supply 409 may also include any component of one or more dc or ac power sources, recharging systems, power failure detection circuitry, power converters or inverters, power status indicators, and the like.
Although not shown, the mobile terminal may further include a camera, a bluetooth module, and the like, which will not be described herein. Specifically, in this embodiment, the processor 408 in the mobile terminal loads the executable file corresponding to the process of one or more application programs into the memory 402 according to the following instructions, and the processor 408 runs the application program stored in the memory 402, thereby implementing various functions:
the method comprises the steps of obtaining a video stream needing face tracking and a deep learning network model, allocating memory resources for the network model to enable all layers of the network model to share the same storage space, and tracking the face in the video stream based on the allocated memory resources and the network model.
For example, the storage space required by each layer of the network model may be specifically calculated, for example, a configuration file of the network model is obtained, the storage space required by each layer of the network model is calculated according to the configuration file, then, the maximum value of the storage spaces required by each layer is used as the size of the pre-allocated storage space, and the memory resource is allocated to the network model according to the size of the pre-allocated storage space, and so on.
The structure of the network model may be set according to the requirements of the actual application, for example, the network model may include a public network portion, a key point prediction branch, a confidence prediction branch, and the like. In addition, the levels of the public network portion, the key point prediction branch and the confidence degree prediction branch may also be determined according to the requirements of the practical application, which may specifically refer to the foregoing method embodiments, and are not described herein again.
There are various ways to track the face in the video stream based on the allocated memory resources and the network model, for example, the coordinates and confidence of the key point of the face in the previous frame of image in the current frame may be obtained, and then the coordinates and confidence of the key point of the face in the current frame are predicted based on the allocated memory resources, the network model and the coordinates and confidence of the key point of the face in the previous frame of image, and so on, that is, the application program stored in the memory 402 may further implement the following functions:
determining the current image to be processed according to the video stream to obtain a current frame; acquiring the coordinates and confidence of key points of the face of the previous frame of image of the current frame; and predicting the coordinates and the confidence coefficient of the face key point of the current frame based on the allocated memory resources, the network model and the coordinates and the confidence coefficient of the face key point of the previous frame of image, and returning to the step of determining the current image to be processed according to the video stream until the images in the video stream are processed completely.
For example, when it is determined that the confidence of the previous frame of image is greater than the preset threshold, the face key point coordinates of the previous frame of image are calculated through the network model by using the allocated memory resources to obtain a calculation result, then, the face key point coordinates of the current frame are predicted according to the calculation result, and the confidence of the current frame is calculated.
The preset threshold may be set according to the requirements of practical applications, and is not described herein again.
It should be noted that if the confidence of the previous frame is not higher than the preset threshold, it indicates that the reference value of the face key point coordinates of the previous frame is low, so at this time, the face key point coordinates in the current frame can be obtained by adopting a detection mode; similarly, if the face key point coordinates and confidence of the previous frame image of the current frame cannot be obtained, for example, the current frame is the first frame of the video stream, the face key point coordinates in the current frame may also be obtained in a detection manner, that is, the application program stored in the memory 402 may further implement the following functions:
when the face key point coordinates and confidence of the frame preceding the current frame cannot be obtained, or the confidence of the previous frame is determined to be less than or equal to the preset threshold, detecting the face in the current frame by a face detection algorithm based on the allocated memory resources, so as to determine the face key point coordinates and confidence of the current frame.
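A hedged sketch of this "track or re-detect" decision follows; the threshold value and the helper names run_tracker and run_detector are illustrative assumptions.

```python
CONF_THRESHOLD = 0.5  # assumed value; the embodiment leaves it application-defined

def keypoints_for_frame(frame, prev_coords, prev_conf, run_tracker, run_detector):
    """Reuse the previous frame's key points only when they are trustworthy;
    otherwise (first frame, lost history, or low confidence) fall back to a
    full face detection on the current frame."""
    history_usable = prev_coords is not None and prev_conf is not None
    if history_usable and prev_conf > CONF_THRESHOLD:
        # Tracking path: forward the previous key points through the network
        # to predict the current frame's key points and confidence.
        return run_tracker(frame, prev_coords)
    # Detection path: locate the face region first, then predict the
    # facial-feature positions inside it.
    return run_detector(frame)
```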
Specific implementations of the above operations may be found in the foregoing embodiments and are not described in detail herein again.
As can be seen from the above, when the mobile terminal of this embodiment needs to run deep learning on a video stream for face tracking, it can obtain the corresponding deep learning network model and allocate memory resources to it so that all layers of the network model share the same storage space, and then process the video stream based on the allocated memory resources and the network model to achieve real-time face tracking. Because all layers share the same storage space, there is no need to allocate an independent storage space to each layer, which greatly reduces memory usage and improves computational efficiency; moreover, only a single allocation is needed, which greatly reduces the number of allocation operations, reduces memory fragmentation, and helps improve application performance.
In addition, because the scheme requires less memory and computes more efficiently, it places lower demands on device performance and is suitable for devices such as mobile terminals; compared with placing the deep learning forward algorithm on the server side, it can track faces more efficiently and flexibly, which helps improve the user experience.
Those skilled in the art will appreciate that all or part of the steps in the methods of the above embodiments may be performed by relevant hardware under the instruction of a program, and the program may be stored in a computer-readable storage medium; the storage medium may include: Read-Only Memory (ROM), Random Access Memory (RAM), magnetic disks, optical disks, and the like.
The face tracking method and device provided by the embodiments of the present invention are described in detail above. Specific examples are used herein to explain the principles and implementation of the present invention, and the description of the above embodiments is only intended to help understand the method and its core idea; meanwhile, those skilled in the art may make changes to the specific embodiments and the application scope according to the idea of the present invention. In summary, the content of this specification should not be construed as limiting the present invention.

Claims (15)

1. A face tracking method, comprising:
acquiring a video stream needing face tracking and a network model for deep learning, wherein the network model is a forward network;
reading a configuration file of the network model, and calculating a parameter format of each network layer in the network model according to the configuration file to obtain the sizes of an input area, an output area and a temporary area of each layer network in the network model;
calculating the storage space required by each layer in the network model according to the sizes of the input area, the output area and the temporary area of each layer network;
taking the maximum value of the storage space required by each layer as the size of the pre-allocated storage space;
allocating memory resources for the network model according to the size of the pre-allocated storage space;
tracking the face in the video stream based on the allocated memory resources and the network model, wherein, in the forward calculation process of face tracking, an input area, an output area and a temporary area of the current network level of the network model are used to store the processing data of the video stream; when the video stream is processed by the next network level after the current network level, the input area, the output area and the temporary area are reused, through a pointer assignment operation, to store the processing data of the video stream by that next network level, so that all layers of the network model share the same storage space.
2. The method of claim 1, wherein tracking the face in the video stream based on the allocated memory resources and the network model comprises:
determining the current image to be processed according to the video stream to obtain a current frame;
acquiring coordinates and confidence coefficients of key points of the face of a previous frame of image of a current frame;
and predicting the coordinates and the confidence coefficient of the key point of the face of the current frame based on the allocated memory resources, the network model and the coordinates and the confidence coefficient of the key point of the face of the previous frame of image, and returning to the step of determining the image which needs to be processed currently according to the video stream until all the images in the video stream are processed.
3. The method of claim 2, wherein predicting the face keypoint coordinates and confidence of the current frame based on the allocated memory resources, the network model, the face keypoint coordinates and confidence of the previous frame of image comprises:
when the confidence coefficient of the previous frame of image is determined to be larger than a preset threshold value, calculating the coordinates of the key points of the face of the previous frame of image through the network model by using allocated memory resources to obtain a calculation result;
and predicting the coordinates of the key points of the face of the current frame according to the calculation result, and calculating the confidence coefficient of the current frame.
4. The method according to claim 3, wherein the network model includes a public network portion, a key point prediction branch and a confidence degree prediction branch, and the calculating the face key point coordinates of the previous frame of image by the network model to obtain the calculation result includes:
calculating the coordinates of the key points of the face of the previous frame of image through the public network part to obtain a calculation result;
the predicting the coordinates of the key points of the face of the current frame according to the calculation result and calculating the confidence coefficient of the current frame comprise: and processing the calculation result through the key point prediction branch to obtain the face key point coordinates of the current frame, and processing the calculation result through the confidence degree prediction branch to obtain the confidence degree of the current frame.
5. The method of claim 2, further comprising:
and when the coordinates and the confidence coefficient of the key points of the face of the previous frame of image of the current frame cannot be obtained, or the confidence coefficient of the previous frame of image is determined to be less than or equal to a preset threshold value, detecting the face of the current frame by a face detection algorithm based on the allocated memory resources so as to determine the coordinates and the confidence coefficient of the key points of the face of the current frame.
6. The method of claim 5, wherein the detecting the face in the current frame by a face detection algorithm based on the allocated memory resources to determine the face keypoint coordinates and the confidence of the current frame comprises:
determining the face region of the current frame through a face detection algorithm based on the allocated memory resources;
and predicting the positions of facial features in the facial region through the network model to obtain the coordinates and confidence of the facial key points of the current frame.
7. The method of claim 6, wherein the determining the face region of the current frame by a face detection algorithm based on the allocated memory resources comprises:
acquiring the face features in the current frame by calculating an integral image based on the allocated memory resources;
constructing strong classifiers for faces and non-faces according to the face features, wherein the strong classifiers are cascaded into one system;
and processing the current frame according to the strong classifier to obtain the face region of the current frame.
8. A face tracking method, comprising:
acquiring a video stream needing face tracking, a network model for deep learning, and memory resources allocated for the network model, wherein the allocated memory resources are generated according to the sizes of the input area, output area and temporary area of each network layer in the network model;
in the forward calculation process, using an input area, an output area and a temporary area to store the processing data of the video stream by the current network level of the network model;
when the video stream is processed by the next network level after the current network level, reusing the input area, the output area and the temporary area, through a pointer assignment operation, to store the processing data of the video stream by that next network level, so that all layers of the network model share the same storage space.
9. A method of storing data, comprising:
acquiring data to be processed and a deep learning network model, wherein the network model is a forward network;
reading a configuration file of the network model, and calculating a parameter format of each network layer in the network model according to the configuration file to obtain the sizes of an input area, an output area and a temporary area of each layer network in the network model;
calculating the storage space required by each layer in the network model according to the sizes of the input area, the output area and the temporary area of each layer network;
taking the maximum value in the storage space required by each layer as the size of the pre-allocated storage space;
allocating memory resources for the network model according to the size of the pre-allocated storage space;
performing data processing on the data to be processed based on the allocated memory resources and the network model, wherein, in the forward calculation process of the data processing, an input area, an output area and a temporary area are used to store the processing data of the current network level of the network model; when the data is processed by the next network level after the current network level, the input area, the output area and the temporary area are reused, through a pointer assignment operation, to store the processing data of that next network level, so that all layers of the network model share the same storage space.
10. A face tracking device, comprising:
an acquisition unit, used for acquiring a video stream needing face tracking and a deep learning network model, wherein the network model is a forward network;
the reading unit is used for reading the configuration file of the network model, calculating the parameter format of each network layer in the network model according to the configuration file, and obtaining the sizes of an input area, an output area and a temporary area of each layer network in the network model;
the calculation unit is used for calculating the storage space required by each layer in the network model according to the sizes of the input area, the output area and the temporary area of each layer network; taking the maximum value in the storage space required by each layer as the size of the pre-allocated storage space;
the allocation unit is used for allocating memory resources for the network model according to the size of the pre-allocated storage space;
a tracking unit, configured to track a face in the video stream based on the allocated memory resources and the network model, wherein, in the forward calculation process of face tracking, an input area, an output area and a temporary area of the current network level of the network model are used to store the processing data of the video stream; when the video stream is processed by the next network level after the current network level, the input area, the output area and the temporary area are reused, through a pointer assignment operation, to store the processing data of the video stream by that next network level, so that all layers of the network model share the same storage space.
11. The apparatus of claim 10, wherein the tracking unit comprises a determining subunit, a parameter obtaining subunit, and a predicting subunit;
a determining subunit, configured to determine, according to the video stream, an image that needs to be processed currently, to obtain a current frame;
the parameter acquisition subunit is used for acquiring the coordinates and the confidence coefficient of the key points of the face of the previous frame of image of the current frame;
and the prediction subunit is used for predicting the coordinates and the confidence coefficient of the key point of the face of the current frame based on the allocated memory resources, the network model and the coordinates and the confidence coefficient of the key point of the face of the previous frame of image, and triggering the determination subunit to execute the operation of determining the image which needs to be processed currently according to the video stream until all the images in the video stream are processed.
12. The apparatus according to claim 11, wherein the predictor unit is specifically configured to:
when the confidence coefficient of the previous frame of image is determined to be larger than a preset threshold value, calculating the coordinates of the key points of the face of the previous frame of image through the network model by using allocated memory resources to obtain a calculation result;
and predicting the coordinates of the key points of the face of the current frame according to the calculation result, and calculating the confidence coefficient of the current frame.
13. The apparatus of claim 11, wherein the tracking unit further comprises a detection subunit;
the detection subunit is configured to, when the coordinates and the confidence level of the face key point of the previous frame image of the current frame cannot be obtained, or when the confidence level of the previous frame image is determined to be less than or equal to a preset threshold, detect the face in the current frame by using a face detection algorithm based on the allocated memory resources, so as to determine the coordinates and the confidence level of the face key point of the current frame.
14. The apparatus of claim 13,
the detection subunit is specifically configured to determine, based on the allocated memory resources, a face region of the current frame by using a face detection algorithm, and predict, by using the network model, positions of facial features in the face region to obtain coordinates and confidence of a facial key point of the current frame.
15. A face tracking device, comprising:
a data acquisition unit, used for acquiring a video stream needing face tracking, a deep learning network model, and memory resources allocated for the network model, wherein the allocated memory resources are generated according to the sizes of the input area, output area and temporary area of each network layer in the network model;
a storage unit, used for storing, in the forward calculation process, the processing data of the video stream by the current network level of the network model in an input area, an output area and a temporary area;
and a pointer assignment operation unit, used for reusing, when the video stream is processed by the next network level after the current network level, the input area, the output area and the temporary area through a pointer assignment operation to store the processing data of the video stream by that next network level, so that all layers of the network model share the same storage space.
CN201710108748.7A 2017-02-27 2017-02-27 Face tracking method and device Active CN106919918B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201710108748.7A CN106919918B (en) 2017-02-27 2017-02-27 Face tracking method and device
PCT/CN2018/076238 WO2018153294A1 (en) 2017-02-27 2018-02-11 Face tracking method, storage medium, and terminal device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710108748.7A CN106919918B (en) 2017-02-27 2017-02-27 Face tracking method and device

Publications (2)

Publication Number Publication Date
CN106919918A CN106919918A (en) 2017-07-04
CN106919918B true CN106919918B (en) 2022-11-29

Family

ID=59453864

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710108748.7A Active CN106919918B (en) 2017-02-27 2017-02-27 Face tracking method and device

Country Status (2)

Country Link
CN (1) CN106919918B (en)
WO (1) WO2018153294A1 (en)

Families Citing this family (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106919918B (en) * 2017-02-27 2022-11-29 腾讯科技(上海)有限公司 Face tracking method and device
CN109508575A (en) * 2017-09-14 2019-03-22 深圳超多维科技有限公司 Face tracking method and device, electronic equipment and computer readable storage medium
CN108388879B (en) * 2018-03-15 2022-04-15 斑马网络技术有限公司 Target detection method, device and storage medium
CN109285119A (en) * 2018-10-23 2019-01-29 百度在线网络技术(北京)有限公司 Super resolution image generation method and device
CN109447253B (en) * 2018-10-26 2021-04-27 杭州比智科技有限公司 Video memory allocation method and device, computing equipment and computer storage medium
CN109460077B (en) * 2018-11-19 2022-05-17 深圳博为教育科技有限公司 Automatic tracking method, automatic tracking equipment and automatic tracking system
CN111914598A (en) * 2019-05-09 2020-11-10 北京四维图新科技股份有限公司 Method, device and equipment for detecting key points of continuous frame human face and storage medium
CN110516620B (en) * 2019-08-29 2023-07-28 腾讯科技(深圳)有限公司 Target tracking method and device, storage medium and electronic equipment
CN113409354A (en) * 2020-03-16 2021-09-17 深圳云天励飞技术有限公司 Face tracking method and device and terminal equipment
CN111666150B (en) * 2020-05-09 2022-01-11 深圳云天励飞技术股份有限公司 Storage space allocation method and device, terminal and computer readable storage medium
CN111881838B (en) * 2020-07-29 2023-09-26 清华大学 Dyskinesia assessment video analysis method and equipment with privacy protection function
CN112101106A (en) * 2020-08-07 2020-12-18 深圳数联天下智能科技有限公司 Face key point determination method and device and storage medium
CN112417985A (en) * 2020-10-30 2021-02-26 杭州魔点科技有限公司 Face feature point tracking method, system, electronic equipment and storage medium
CN112286694B (en) * 2020-12-24 2021-04-02 瀚博半导体(上海)有限公司 Hardware accelerator memory allocation method and system based on deep learning computing network
CN113221630A (en) * 2021-03-22 2021-08-06 刘鸿 Estimation method of human eye watching lens and application of estimation method in intelligent awakening
CN113723214B (en) * 2021-08-06 2023-10-13 武汉光庭信息技术股份有限公司 Face key point labeling method, system, electronic equipment and storage medium
CN113792633B (en) * 2021-09-06 2023-12-22 北京工商大学 Face tracking system and method based on neural network and optical flow method

Family Cites Families (9)

Publication number Priority date Publication date Assignee Title
US7184602B2 (en) * 2003-05-02 2007-02-27 Microsoft Corp. System and method for low bandwidth video streaming for face-to-face teleconferencing
CN100440246C (en) * 2006-04-13 2008-12-03 北京中星微电子有限公司 Positioning method for human face characteristic point
CN103699905B (en) * 2013-12-27 2017-04-12 深圳市捷顺科技实业股份有限公司 Method and device for positioning license plate
US20160180214A1 (en) * 2014-12-19 2016-06-23 Google Inc. Sharp discrepancy learning
CN106056529B (en) * 2015-04-03 2020-06-02 阿里巴巴集团控股有限公司 Method and equipment for training convolutional neural network for picture recognition
US10331675B2 (en) * 2015-08-06 2019-06-25 Clarifai, Inc. Systems and methods for learning new trained concepts used to retrieve content relevant to the concepts learned
CN105787448A (en) * 2016-02-28 2016-07-20 南京信息工程大学 Facial shape tracking method based on space-time cascade shape regression
CN106295707B (en) * 2016-08-17 2019-07-02 北京小米移动软件有限公司 Image-recognizing method and device
CN106919918B (en) * 2017-02-27 2022-11-29 腾讯科技(上海)有限公司 Face tracking method and device

Patent Citations (3)

Publication number Priority date Publication date Assignee Title
CN104036240A (en) * 2014-05-29 2014-09-10 小米科技有限责任公司 Face feature point positioning method and device
CN106203333A (en) * 2016-07-08 2016-12-07 乐视控股(北京)有限公司 Face identification method and system
CN106295567A (en) * 2016-08-10 2017-01-04 腾讯科技(深圳)有限公司 The localization method of a kind of key point and terminal

Non-Patent Citations (2)

Title
Training Deep Nets with Sublinear Memory Cost; Tianqi Chen et al.; Machine Learning; 2016-03-22; pp. 1-10 *
Research on Improved Adaboost Face Detection Algorithm and Its FPGA Implementation; Luan Zhong; China Master's Theses Full-text Database; 2013-02-28; pp. 16-32 *

Also Published As

Publication number Publication date
WO2018153294A1 (en) 2018-08-30
CN106919918A (en) 2017-07-04

Similar Documents

Publication Publication Date Title
CN106919918B (en) Face tracking method and device
US10943091B2 (en) Facial feature point tracking method, apparatus, storage medium, and device
CN110798718B (en) Video recommendation method and device
US10599913B2 (en) Face model matrix training method and apparatus, and storage medium
CN111209423B (en) Image management method and device based on electronic album and storage medium
CN110781881A (en) Method, device, equipment and storage medium for identifying match scores in video
CN109062464B (en) Touch operation method and device, storage medium and electronic equipment
CN107885448B (en) Control method for application touch operation, mobile terminal and readable storage medium
CN113723378B (en) Model training method and device, computer equipment and storage medium
CN112084959B (en) Crowd image processing method and device
CN110784672B (en) Video data transmission method, device, equipment and storage medium
CN105513098B (en) Image processing method and device
CN109508300B (en) Disk fragment sorting method and device and computer readable storage medium
CN110717486B (en) Text detection method and device, electronic equipment and storage medium
CN108920086B (en) Split screen quitting method and device, storage medium and electronic equipment
CN110503189B (en) Data processing method and device
CN108829600B (en) Method and device for testing algorithm library, storage medium and electronic equipment
CN115841575A (en) Key point detection method, device, electronic apparatus, storage medium, and program product
CN113469923A (en) Image processing method and device, electronic equipment and storage medium
CN112367428A (en) Electric quantity display method and system, storage medium and mobile terminal
CN111723783A (en) Content identification method and related device
CN110047076B (en) Image information processing method and device and storage medium
CN109561481B (en) Data sending method, terminal and storage medium
CN111488123B (en) Storage space management method and device, storage medium and mobile terminal
CN114140864B (en) Trajectory tracking method and device, storage medium and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant