CN116228808A - Expression tracking method, device, equipment and storage medium - Google Patents
Expression tracking method, device, equipment and storage medium
- Publication number
- CN116228808A CN116228808A CN202111671707.1A CN202111671707A CN116228808A CN 116228808 A CN116228808 A CN 116228808A CN 202111671707 A CN202111671707 A CN 202111671707A CN 116228808 A CN116228808 A CN 116228808A
- Authority
- CN
- China
- Prior art keywords
- images
- face
- image
- model
- current
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/20—Analysis of motion
- G06T7/246—Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
- G06T7/248—Analysis of motion using feature-based methods, e.g. the tracking of corners or segments involving reference images or patches
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/70—Determining position or orientation of objects or cameras
- G06T7/73—Determining position or orientation of objects or cameras using feature-based methods
- G06T7/75—Determining position or orientation of objects or cameras using feature-based methods involving models
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/10—Image acquisition modality
- G06T2207/10016—Video; Image sequence
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/30—Subject of image; Context of image processing
- G06T2207/30196—Human being; Person
- G06T2207/30201—Face
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Landscapes
- Engineering & Computer Science (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Multimedia (AREA)
- Image Analysis (AREA)
- Collating Specific Patterns (AREA)
Abstract
The application provides an expression tracking method, an expression tracking device, expression tracking equipment and a storage medium, and relates to the field of computer vision in artificial intelligence. The method comprises the following steps: acquiring a neutral face model of a target person and M expression bases (BS) corresponding to the neutral face model, where M > 0; capturing the face of the target person in a plurality of view angle directions to obtain a plurality of current images respectively corresponding to the plurality of view angle directions; calculating, using an expression tracking algorithm, M BS coefficients respectively corresponding to the M BSs and used for tracking facial expressions in the plurality of current images; and constructing a tracking image of the plurality of current images based on the neutral face model, the M BSs, and the M coefficients. The expression tracking method provided by the application can improve the expression tracking effect.
Description
The present application claims priority to the Chinese patent application filed on December 6, 2021 under application number 202111478236.2 and entitled "Expression tracking method, apparatus, device, and storage medium", the entire contents of which are incorporated herein by reference.
Technical Field
Embodiments of the present application relate to the field of computer vision for artificial intelligence, and more particularly, to expression tracking methods, apparatus, devices, and storage media.
Background
Three-dimensional (3D) facial expression tracking refers to tracking the expression of a character in a video and driving different 3D face models so that the 3D face models show the expressions of the characters in the video. In general, the 3D face model may be implemented as a three-dimensional deformable face model (3D Morphable Face Model, 3DMM). The 3DMM is a general parameterized three-dimensional face model that represents a face with a fixed number of points. The core idea of the 3DMM is that faces can be put into one-to-one correspondence in three-dimensional space, and a face model can be obtained by a weighted linear combination of the orthogonal bases of a plurality of faces in a database, so as to realize expression tracking.
Specifically, based on the idea of the 3DMM, a plurality of faces, namely expression bases (BS) corresponding to an average face model, can be constructed from the average face model. When expression tracking is needed, a face model with an expression can be constructed with the 3DMM based on the initial coefficients of the expression bases (BS) corresponding to the average face model, and the constructed face model is then projected onto the image to be tracked to obtain a projection image. Next, according to the difference between the face key points of the projection image and the face key points of the image to be tracked, the initial coefficients of the BSs corresponding to the average face model can be adjusted until the result converges, so that the face model finally constructed by the 3DMM better fits the facial expression in the image to be tracked.
However, because different face images differ, for example, people of different age groups have different face shapes, using the average face model is likely to make the contour of the average face model differ from that of the person in the video, which affects how well the projection result fits and reduces the expression tracking effect. In addition, the expressive capability of the BSs also affects the expression tracking effect; when the BSs corresponding to the average face model are used for tracking, their expressive capability is poor, which further reduces the expression tracking effect.
Disclosure of Invention
The embodiment of the application provides an expression tracking method, an expression tracking device, expression tracking equipment and a storage medium, which can improve an expression tracking effect.
In one aspect, the present application provides an expression tracking method, including:
acquiring a neutral face model of a target person and M expression bases (BS) corresponding to the neutral face model, where M > 0;
capturing the face of the target person in a plurality of view angle directions to obtain a plurality of current images respectively corresponding to the plurality of view angle directions;
calculating, using an expression tracking algorithm, M BS coefficients respectively corresponding to the M BSs and used for tracking facial expressions in the plurality of current images;
constructing a tracking image of the plurality of current images based on the neutral face model, the M BSs, and the M coefficients.
In another aspect, the application provides an expression tracking apparatus, including:
an acquisition unit, configured to acquire a neutral face model of a target person and M expression bases (BS) corresponding to the neutral face model, where M > 0;
a capture unit, configured to capture the face of the target person in a plurality of view angle directions to obtain a plurality of current images respectively corresponding to the plurality of view angle directions;
a calculating unit, configured to calculate, using an expression tracking algorithm, M BS coefficients respectively corresponding to the M BSs and used for tracking facial expressions in the plurality of current images;
and a construction unit, configured to construct a tracking image of the plurality of current images based on the neutral face model, the M BSs, and the M coefficients.
In another aspect, the present application provides an electronic device, including:
a processor, adapted to implement computer instructions; and
a computer-readable storage medium storing computer instructions adapted to be loaded by the processor to perform the method of the first aspect described above.
In another aspect, embodiments of the present application provide a computer-readable storage medium storing computer instructions that, when read and executed by a processor of a computer device, cause the computer device to perform the method of the first aspect.
In another aspect, embodiments of the present application provide a computer program product or computer program comprising computer instructions stored in a computer-readable storage medium. The processor of the computer device reads the computer instructions from the computer readable storage medium, and the processor executes the computer instructions, so that the computer device performs the method of the first aspect described above.
Based on the above technical solution, the application introduces a neutral face model of the target person and M expression bases (BS) corresponding to the neutral face model. Compared with a scheme that performs expression tracking based on an average face model, the tracking image constructed based on the neutral face model of the target person and its M corresponding expression bases not only improves the expressive capability of the BSs but also makes the facial expression in the tracking image conform better to the contour of the target person, thereby improving the expression tracking effect.
In addition, based on the idea of multiple view angles, the plurality of current images are designed as images obtained by capturing the face of the target person in a plurality of view angle directions, one image per view angle direction, so that expression tracking is performed on the plurality of current images. This enriches the reference information available to the expression tracking algorithm; accordingly, the accuracy of the M BS coefficients can be improved, further improving the expression tracking effect.
Furthermore, while improving the expression tracking effect, the scheme also makes the expression tracking algorithm easier to operate when it is deployed in actual products.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings required for the description of the embodiments are briefly introduced below. It is apparent that the drawings in the following description show only some embodiments of the present application, and other drawings may be obtained from these drawings by a person skilled in the art without inventive effort.
Fig. 1 is a schematic diagram of a face reference point according to an embodiment of the present application.
Fig. 2 is a schematic diagram of an application scenario according to an embodiment of the present application.
Fig. 3 is a schematic flowchart of an expression tracking method provided in an embodiment of the present application.
Fig. 4 is a schematic diagram of a BS provided in an embodiment of the present application.
Fig. 5 is a schematic diagram of constructing a tracking image provided in an embodiment of the present application.
Fig. 6 is a schematic diagram of a plurality of current images provided in an embodiment of the present application.
Fig. 7 is a schematic diagram of a neutral face model constructed based on multiple videos acquired in multiple view directions according to an embodiment of the present application.
Fig. 8 is a schematic diagram of image-based neutral face model construction according to an embodiment of the present application.
Fig. 9 is a schematic block diagram of an expression tracking apparatus provided in an embodiment of the present application.
Fig. 10 is a schematic block diagram of an electronic device provided in an embodiment of the present application.
Detailed Description
The following describes the embodiments of the present application clearly and completely with reference to the accompanying drawings. It is evident that the described embodiments are only some, but not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments herein without inventive effort shall fall within the scope of protection of the present application.
The solution provided by the present application may relate to artificial intelligence technology.
Artificial intelligence (AI) is the theory, method, technique, and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a way similar to human intelligence. Artificial intelligence is the study of the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning, and decision-making.
It should be appreciated that artificial intelligence technology is a comprehensive discipline involving a wide range of fields, covering both hardware-level and software-level techniques. Basic artificial intelligence technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, and mechatronics. Artificial intelligence software technologies mainly include directions such as computer vision, speech processing, natural language processing, and machine learning/deep learning.
With research and advances in artificial intelligence technology, it has been studied and applied in many fields, such as smart homes, smart wearable devices, virtual assistants, smart speakers, smart marketing, unmanned driving, autonomous driving, unmanned aerial vehicles, robots, smart healthcare, and smart customer service. It is believed that with the development of technology, artificial intelligence technology will be applied in more fields and show increasing value.
The embodiments of the present application may relate to computer vision (CV) technology within artificial intelligence. CV is a science that studies how to make machines "see"; more specifically, it refers to using cameras and computers instead of human eyes to recognize, track, and measure targets, and performing further graphics processing so that images are processed into a form more suitable for human eyes to observe or for transmission to instruments for detection. As a scientific discipline, computer vision studies related theories and technologies and attempts to build artificial intelligence systems that can acquire information from images or multidimensional data. Computer vision technologies generally include technologies such as image processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content recognition, three-dimensional object reconstruction, 3D technology, virtual reality, augmented reality, and simultaneous localization and mapping, as well as common biometric recognition technologies such as face recognition and fingerprint recognition.
Embodiments of the present application may also relate to machine learning (ML) within artificial intelligence. ML is a multi-field interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory, and other disciplines. It specializes in studying how computers simulate or implement human learning behavior to acquire new knowledge or skills, and reorganize existing knowledge structures to continuously improve their own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and it is applied in all areas of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from demonstration.
The scheme provided by the embodiments of the present application also relates to video processing technology in the field of network media. Unlike conventional audio and video equipment, network media relies on techniques and equipment provided by information technology (IT) device developers to transmit, store, and process audio and video signals. The conventional Serial Digital Interface (SDI) transmission mode lacks true network switching characteristics; much work is required to create, with SDI, even part of the network functionality that Ethernet and the Internet Protocol (IP) provide. Network media technology has therefore grown in the video industry. Further, the video processing technology of network media may include the transmission, storage, and processing of audio and video signals.
In addition, the scheme provided by the embodiment of the application can also relate to a tracking technology of facial expressions.
In order to facilitate understanding of the technical solution provided in the present application, the following description is related to facial expression tracking.
2D face key point detection: a set of predefined face reference points (e.g., eye corner points, mouth corner points) is automatically located.
Fig. 1 is a schematic diagram of a face reference point according to an embodiment of the present application.
As shown in fig. 1, the face reference points may be marked around the outline, the eye corner positions, and the mouth corner positions of the face to realize detection of the face or the facial expression.
3D facial expression tracking: performing expression tracking on the characters in a video and driving different 3D face models so that the 3D face models show the expressions corresponding to the characters in the video. It should be understood that the various intermediate models, neutral face models, and expression bases (BS) referred to in this application are all 3D face models.
Three-dimensional deformable face model (3D Morphable Model, 3DMM): the 3DMM is a general parameterized three-dimensional face model that represents a face with a fixed number of points. The core idea of the 3DMM is that faces can be put into one-to-one correspondence in three-dimensional space, and a face model can be obtained by a weighted linear combination of the orthogonal bases of a plurality of faces in a database, so as to realize expression tracking.
Each three-dimensional face can be represented in a basis vector space formed by all faces in a database, so solving the model of any three-dimensional face is in fact equivalent to solving the coefficients of each basis vector.
Basic attributes of faces include shape and texture, and each face may be represented as a linear superposition of shape vectors and texture vectors.
Shape vector: S = (X1, Y1, Z1, X2, Y2, Z2, ..., Xn, Yn, Zn)
Texture vector: T = (R1, G1, B1, R2, G2, B2, ..., Rn, Gn, Bn)
where n is the number of face samples in the data set, (Xi, Yi, Zi) are the coordinates of the shape vector of the i-th face sample in the data set, and (Ri, Gi, Bi) are the coordinates of the texture vector of the i-th face sample in the data set.
Any face model can be obtained by weighting m face models in the data set as follows:
S_model = S_avg + Σ_{i=1}^{m} a_i · S_i
T_model = T_avg + Σ_{i=1}^{m} b_i · T_i
where S_model is the three-dimensional face shape model, a_i is the target value of the i-th face shape parameter, i = 1 ... m, m is the number of face samples in the data set, S_i is the shape vector of the i-th face sample in the data set, and S_avg is the mean of the shape vectors of all face samples in the data set; T_model is the three-dimensional face texture model, b_i is the target value of the i-th face texture parameter, T_i is the texture vector of the i-th face sample in the data set, and T_avg is the mean of the texture vectors of all face samples in the data set.
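As an informal illustration only (not part of the claimed method), the weighted linear combination above can be sketched in a few lines of NumPy; the array shapes and variable names here are hypothetical.

```python
import numpy as np

def combine_3dmm(shape_mean, shape_bases, a, texture_mean, texture_bases, b):
    """Weighted linear combination of 3DMM bases (illustrative sketch).

    shape_mean:    (3n,)   mean shape vector S_avg
    shape_bases:   (m, 3n) shape vectors S_1..S_m
    a:             (m,)    shape coefficients a_1..a_m
    texture_mean:  (3n,)   mean texture vector T_avg
    texture_bases: (m, 3n) texture vectors T_1..T_m
    b:             (m,)    texture coefficients b_1..b_m
    """
    s_model = shape_mean + a @ shape_bases      # S_model = S_avg + sum_i a_i * S_i
    t_model = texture_mean + b @ texture_bases  # T_model = T_avg + sum_i b_i * T_i
    return s_model, t_model
```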
Constraint: referring to finding an element given a function, the element can minimize or maximize a certain index. Constraints may also be referred to as mathematical programming (e.g., linear programming). Wherein the function may be referred to as an objective function or a cost function. A feasible solution that minimizes or maximizes an objective function of a certain index is called an optimal solution. For the purposes of this application, the expression tracking algorithm referred to in this application may be used to: and solving an optimal solution under the constructed constraints, and taking the solved optimal solution as M BS coefficients which are used for tracking facial expressions in a plurality of current images and respectively correspond to M BSs.
Average face: can be a synthetic appearance of a population obtained by computer technology. For example, facial features may be extracted from a number of common faces, averaged from the measured data, and a composite face obtained using computer technology to obtain an attractive face. That is, a group of people with a long average face has a similar appearance to a great extent, so that the chance of face collision is greatly increased.
Blend shape: a technique for deforming a single mesh into combinations of any number of predefined shapes; in Maya/3ds Max this is referred to as morph targets. For example, the single mesh may be a basic, default shape such as an expressionless face, i.e., the neutral face model referred to in this application. The other shapes, used for blending/morphing, are different expressions (smiling, frowning, closed eyelids); these other shapes are collectively called blend shapes or morph targets, i.e., the M BSs corresponding to the neutral face model referred to in this application.
Fig. 2 is a schematic diagram of an application scenario according to an embodiment of the present application.
As shown in fig. 2, the scenario includes an acquisition device 101, a computing device 102, and a display device 103. The acquisition device 101 is configured to acquire N face images of a user, where the N face images may be face images captured in N view angle directions. The computing device 102 is configured to reconstruct a three-dimensional face model from the N face images acquired by the acquisition device 101 using the expression tracking method provided by the embodiments of the present application. The display device 103 is configured to display the three-dimensional face model reconstructed by the computing device 102.
By way of example, the acquisition device 101 may be a user device such as a mobile phone, a tablet computer, a notebook computer, a palmtop computer, a mobile internet device (MID), or another terminal device with a browser installed.
Illustratively, the computing device 102 may be a server, and there may be one or more servers. Where there are multiple servers, at least two servers provide different services and/or at least two servers provide the same service, for example in a load-balancing manner; the embodiments of the present application are not limited in this respect. A reconstruction model may be provided in a server, and the server provides support for training and application of the reconstruction model. The server may be an independent physical server, a server cluster or distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN (Content Delivery Network), big data, and artificial intelligence platforms. The server may also serve as a node of a blockchain.
For example, when computing device 102 has display functionality, display device 103 may be a display in computing device 102.
Illustratively, the display device 103 is a device different from the computing device 102, and the display device 103 is connected to the computing device 102 via a network. The network may be a wireless or wired network, such as an intranet, the Internet, a Global System for Mobile communications (GSM) network, Wideband Code Division Multiple Access (WCDMA), a 4G network, a 5G network, Bluetooth, Wi-Fi, or a telephony network.
It should be noted that, the application scenario of the present application includes, but is not limited to, facial expression tracking or online facial expression tracking of a person in a 3D game/3D movie work. For example, the scheme provided by the application can be suitable for manufacturing a 3D virtual person, the 3D virtual person can be driven by using the captured facial expression, and the 3D virtual person can be a game character or an animation character.
Illustratively, expression tracking may be accomplished in the steps of:
step 1: the existing characteristic point detection (landmark detection) scheme is utilized to detect characteristic points (landmark) of a human face in an image to be tracked.
Step 2: and labeling the key points of the face of the image to be tracked based on the characteristic points of the face in the image to be tracked.
Step 3: the tracking result of a certain image to be tracked in the video is solved by the following steps:
step 3-1: and reconstructing the face model with the expression by using the initial coefficient of the BS corresponding to the average face model.
Step 3-2: projecting the face model reconstructed in the step 3-1 onto the image to be tracked according to pose information of the average face model to obtain a projection image; based on the above, the BS coefficients of the BS corresponding to the average face model may be updated by calculating an error between the face key point of the image to be tracked and the face key point of the projection head portrait, and using the error as the supervision information.
Step 3-3: and repeating the step 3-1 and the step 3-2 until convergence (namely, the error between the face key points of the image to be tracked and the face key points of the projection head portrait is minimized), obtaining the BS coefficient of the BS corresponding to the average face model under the image to be tracked, and finally constructing the tracking image based on the BS coefficient of the BS corresponding to the average face model.
In other words, based on the idea of the 3DMM, a plurality of faces, namely expression bases (BS) corresponding to the average face model, may be constructed from the average face model. When expression tracking is needed, a face model with an expression can be constructed with the 3DMM based on the initial coefficients of the expression bases (BS) corresponding to the average face model, and the constructed face model is then projected onto the image to be tracked to obtain a projection image. Next, according to the difference between the face key points of the projection image and the face key points of the image to be tracked, the initial coefficients of the BSs corresponding to the average face model can be adjusted until the result converges, so that the face model finally constructed by the 3DMM better fits the facial expression in the image to be tracked.
However, because different face images differ, for example, people of different age groups have different face shapes, using the average face model is likely to make the contour of the average face model differ from that of the person in the video, which affects how well the projection result fits and reduces the expression tracking effect. In addition, the expressive capability of the BSs also affects the expression tracking effect; when the BSs corresponding to the average face model are used for tracking, their expressive capability is poor, which further reduces the expression tracking effect.
In view of this, the embodiments of the present application provide an expression tracking method, apparatus, device, and storage medium, which can improve the expression tracking effect.
Specifically, on the one hand, the application introduces a neutral face model of the target person and M expression bases (BS) corresponding to the neutral face model. Compared with a scheme that performs expression tracking based on an average face model, a tracking image constructed based on the neutral face model of the target person and its M corresponding expression bases makes the facial expression in the tracking image conform better to the contour of the target person, improves how well the projection result fits, and improves the expressive capability of the BSs, thereby improving the expression tracking effect. On the other hand, based on the idea of multiple view angles, the plurality of current images are designed as images obtained by capturing the face of the target person in a plurality of view angle directions, one image per view angle direction, so that expression tracking is performed on the plurality of current images. This enriches the reference information available to the expression tracking algorithm; accordingly, the accuracy of the M BS coefficients can be improved, further improving the expression tracking effect.
Fig. 3 shows a schematic flowchart of an expression tracking method 200 according to an embodiment of the present application. The expression tracking method 200 may be performed by any electronic device having data processing capability. For example, the electronic device may be implemented as a server. The server may be an independent physical server, a server cluster or distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, big data, and artificial intelligence platforms; the server may be connected directly or indirectly through wired or wireless communication. For convenience of description, the expression tracking method provided in this application is described below with an expression tracking apparatus as the execution body.
As shown in fig. 3, the expression tracking method 200 may include:
S210, acquiring a neutral face model of a target person and M expression bases (BS) corresponding to the neutral face model, where M > 0;
S220, capturing the face of the target person in a plurality of view angle directions to obtain a plurality of current images respectively corresponding to the plurality of view angle directions;
S230, calculating, using an expression tracking algorithm, M BS coefficients respectively corresponding to the M BSs and used for tracking facial expressions in the plurality of current images;
S240, constructing a tracking image of the plurality of current images based on the neutral face model, the M BSs, and the M coefficients.
In this embodiment, a neutral face model of the target person and M expression bases (BS) corresponding to the neutral face model are introduced. Compared with a scheme that performs expression tracking based on an average face model, the tracking image constructed based on the neutral face model of the target person and its M corresponding expression bases not only improves the expressive capability of the BSs but also makes the facial expression in the tracking image conform better to the contour of the target person, thereby improving the expression tracking effect.
In addition, based on the idea of multiple view angles, the plurality of current images are designed as images obtained by capturing the face of the target person in a plurality of view angle directions, one image per view angle direction, so that expression tracking is performed on the plurality of current images. This enriches the reference information available to the expression tracking algorithm; accordingly, the accuracy of the M BS coefficients can be improved, further improving the expression tracking effect.
Furthermore, while improving the expression tracking effect, the scheme also makes the expression tracking algorithm easier to operate when it is deployed in actual products.
Fig. 4 is a schematic diagram of a BS provided in an embodiment of the present application.
As shown in fig. 4, the 9 BSs on the left are BSs corresponding to the average face model, and the 9 BSs on the right are BSs corresponding to the neutral face model. It can be seen that the BSs corresponding to the neutral face model conform better to the contour of the target person; accordingly, when the tracking image is constructed based on the neutral face model and its corresponding BSs, the expressive capability of the BSs can be improved and the facial expression in the tracking image can conform better to the contour of the target person, thereby improving the expression tracking effect.
The value of M is not particularly limited in this application. Illustratively, M = 185. Of course, in other alternative embodiments, M may take other positive integer values.
Fig. 5 is a schematic diagram of constructing a tracking image provided in an embodiment of the present application.
As shown in fig. 5, after the neutral face model and the M BSs are acquired, the tracking image of the plurality of current images may be constructed based on the M coefficients respectively corresponding to the M BSs. Specifically, the tracking image can be constructed according to the following formula:
B = B_0 + Σ_{i=1}^{M} α_i · (B_i − B_0)
where α_i represents the i-th coefficient of the M coefficients, B_i represents the i-th BS of the M BSs, and B_0 represents the neutral face model.
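Purely as an illustration of this blendshape combination (assuming the BSs are stored as full meshes with the same vertex layout as the neutral face model; the names and shapes below are hypothetical):

```python
import numpy as np

def build_tracking_mesh(neutral, bs_list, coeffs):
    """neutral: (V, 3) neutral face model B_0
       bs_list: (M, V, 3) expression bases B_1..B_M
       coeffs:  (M,) BS coefficients alpha_1..alpha_M
    """
    deltas = bs_list - neutral[None, :, :]                  # offset of each BS from B_0
    return neutral + np.tensordot(coeffs, deltas, axes=1)   # B_0 + sum_i alpha_i * (B_i - B_0)
```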
It should be noted that the implementation manner of acquiring the plurality of current images is not particularly limited in this application. For example, in one implementation, images may be captured in one view angle direction and then transformed into multiple view angle directions to obtain the plurality of current images, or images may be captured in multiple view angle directions simultaneously to obtain the plurality of current images. In addition, the plurality of current images may be images in Red-Green-Blue (RGB) format or images in other formats, which is not specifically limited in this application.
In the scheme provided by the application, expression tracking of the target person can be performed based on a plurality of current images respectively corresponding to a plurality of view angle directions. Correspondingly, the application also provides a supervision manner for the expression tracking algorithm based on multiple view angles. On the one hand, the geometric difference between the face key points of the projection images of the constructed face model with an expression and the face key points of the current images can be constrained in the multiple view angle directions, so as to enlarge the matching area of the face key points and improve the constraint precision. In addition, the application also provides a manner of imposing error constraints in the time domain and stacking constraints, so as to improve the accuracy with which the expression tracking algorithm solves the BS coefficients under different expressions and further improve the tracking effect. Implementations of the various constraints are illustrated below.
In some embodiments, the S230 may include:
constructing an intermediate image with an expression using the M BSs and M initial coefficients respectively corresponding to the M BSs; projecting the intermediate image onto the plurality of current images respectively based on the pose information of the neutral face model to obtain a plurality of projection images; calculating at least one loss of the plurality of projection images; and adjusting the M initial coefficients based on the at least one loss to obtain the M BS coefficients.
Illustratively, after the at least one loss is obtained, the at least one loss may be used to supervise the expression tracking effect, so as to adjust the M initial coefficients and obtain the BS coefficients finally used to construct the tracking image of the plurality of current images. Alternatively, the M initial coefficients may be adjusted until the at least one loss satisfies the constraints constructed for expression tracking, and the adjusted coefficients that satisfy the constraints are determined as the BS coefficients finally used to construct the tracking image of the plurality of current images. Accordingly, after the M BS coefficients are determined, the tracking image of the current images can be constructed based on the M BS coefficients.
Illustratively, the intermediate image may be projected onto the plurality of current images respectively using the intrinsic parameters of the camera to obtain the plurality of projection images. Of course, the projection may also be performed in other ways, which is not specifically limited in this application.
In this embodiment, the plurality of projection images are designed as images obtained by projecting the intermediate image onto the plurality of current images respectively, based on the pose information of the neutral face model, so that the facial expressions in the plurality of projection images conform better to the contour of the target person. Accordingly, the accuracy of the at least one loss can be improved, and the accuracy of the M BS coefficients and the expression tracking effect can be further improved.
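To make this flow concrete, the sketch below shows one way such a coefficient-fitting loop could look; gradient descent with a finite-difference gradient stands in for whatever solver is actually used, and every helper it calls (landmark detection, projection) is assumed to be supplied from elsewhere. The structure is purely illustrative and not the claimed algorithm.

```python
import numpy as np

def fit_bs_coefficients(neutral, bs_deltas, current_images, cameras, pose,
                        detect_landmarks, project, steps=200, lr=1e-2):
    """Build an intermediate mesh, project it into every view, compare it with the
    detected face key points, and nudge the BS coefficients (illustrative only)."""
    M = len(bs_deltas)
    alpha = np.zeros(M)                                    # M initial coefficients
    targets = [detect_landmarks(img) for img in current_images]
    for _ in range(steps):
        mesh = neutral + np.tensordot(alpha, bs_deltas, axes=1)   # intermediate model
        grad = np.zeros(M)
        for cam, lm2d in zip(cameras, targets):
            proj = project(mesh, pose, cam)                # (L, 2) projected key points
            resid = proj - lm2d                            # geometric error in this view
            for j in range(M):                             # finite-difference gradient sketch
                proj_j = project(mesh + 1e-3 * bs_deltas[j], pose, cam)
                grad[j] += np.sum(resid * (proj_j - proj)) / 1e-3
        alpha = np.clip(alpha - lr * grad, 0.0, 1.0)       # keep coefficients in a plausible range
    return alpha
```

A real implementation would typically use analytic Jacobians or a non-linear least-squares solver instead of the finite differences shown here.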
Fig. 6 is a schematic diagram of a plurality of current images provided in an embodiment of the present application.
As shown in fig. 6, three current images may be captured in three view angle directions: a first current image in the middle view angle direction, a second current image in the left view angle direction, and a third current image in the right view angle direction.
In the scheme of the application, after an intermediate image with an expression is constructed, the intermediate image is projected, based on the pose information of the neutral face model, onto the first current image to obtain a first projection image, onto the second current image to obtain a second projection image, and onto the third current image to obtain a third projection image. The at least one loss may then be calculated based on at least one of: the difference between the first current image and the first projection image, the difference between the second current image and the second projection image, and the difference between the third current image and the third projection image. The at least one loss is used to supervise the expression tracking effect. That is, the expression tracking effect may be supervised based on at least one of: the difference between the first current image and the first projection image, the difference between the second current image and the second projection image, and the difference between the third current image and the third projection image.
In some embodiments, the at least one loss includes a first loss, and the first loss characterizes the difference between the face key points of the first current image and the face key points of the first projection image. The first current image is an image obtained by capturing the face of the target person in the middle view angle direction among the plurality of view angle directions, and the first projection image is a projection image obtained by projecting the intermediate image onto the first current image.
In this embodiment, the constraint of the expression tracking algorithm is designed as a geometric constraint in the middle view angle direction: the intermediate image (namely, the 3D model with the expression) is projected onto the first current image to obtain the first projection image; on this basis, the objective function or loss function is designed as the difference between the face key points of the first current image and the face key points of the first projection image, and the expression tracking effect is finally supervised based on the calculated difference.
Illustratively, the first loss may be calculated by the following formula:
ε_lm = Σ_{i=1}^{86} w_i · || P(V_i^0 + Σ_{j=1}^{M} α_j · ΔV_i^j) − L_i ||²
where ε_lm represents the first loss, 86 represents the number of face key points, w_i represents the weight of the i-th face key point, P(·) represents the reprojection function used to project the intermediate image onto the first current image, V_i^0 represents the i-th face key point of the neutral face model, α_j represents the j-th initial coefficient of the M initial coefficients, ΔV_i^j represents the i-th face key point of the j-th BS of the M BSs, and L_i represents the i-th face key point of the first current image.
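Under the same assumptions (a reprojection function P and per-point weights are available), this first loss could be sketched as follows; the 86-point layout comes from the description above, everything else is illustrative.

```python
import numpy as np

def landmark_loss(alpha, V0, dV, landmarks, weights, project):
    """eps_lm = sum_i w_i * || P(V0_i + sum_j alpha_j * dV_ij) - L_i ||^2

    V0:        (86, 3)    key points of the neutral face model
    dV:        (M, 86, 3) key-point offsets of the M BSs
    landmarks: (86, 2)    detected key points of the first current image
    weights:   (86,)      per-key-point weights
    project:   function mapping (86, 3) model points to (86, 2) image points
    """
    pts3d = V0 + np.tensordot(alpha, dV, axes=1)   # key points of the intermediate model
    resid = project(pts3d) - landmarks             # reprojection error per key point
    return float(np.sum(weights * np.sum(resid ** 2, axis=1)))
```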
In some embodiments, the at least one loss includes a second loss, and the second loss characterizes at least one of: the error between the face key points of the left region of the second current image and the face key points of the left region of the second projection image, and the difference between the face key points of the right region of the third current image and the face key points of the right region of the third projection image. The second current image is an image obtained by capturing the face of the target person in the left view angle direction among the plurality of view angle directions, and the second projection image is a projection image obtained by projecting the intermediate image onto the second current image; the third current image is an image obtained by capturing the face of the target person in the right view angle direction among the plurality of view angle directions, and the third projection image is a projection image obtained by projecting the intermediate image onto the third current image.
In this embodiment, the constraint of the expression tracking algorithm is designed as a geometric constraint in the left view angle direction and a geometric constraint in the right view angle direction: the intermediate image (namely, the 3D model with the expression) is projected onto the second current image to obtain the second projection image and onto the third current image to obtain the third projection image; on this basis, the objective function or loss function is designed as the difference between the face key points of the left region of the second current image and those of the left region of the second projection image, together with the difference between the face key points of the right region of the third current image and those of the right region of the third projection image, and the expression tracking effect is finally supervised based on the calculated differences.
The second current image is an image obtained by transforming the first current image based on the transformation matrix from the middle view angle direction to the left view angle direction; similarly, the third current image is an image obtained by transforming the first current image based on the transformation matrix from the middle view angle direction to the right view angle direction.
Illustratively, the second loss may be determined by the following formula:
ε_geo = Σ_{j=1}^{J} w_j · ( || P(ΔT_mid→left · T_mid · V_j) − L_j^left ||² + || P(ΔT_mid→right · T_mid · V_j) − L_j^right ||² )
where ε_geo represents the second loss, w_j represents the weight of the j-th face key point, P(·) represents the reprojection function used to project the intermediate image onto the second current image and the third current image, V_j represents the j-th face key point of the intermediate image, T_mid represents the pose corresponding to the first current image, ΔT_mid→left represents the transformation matrix from the middle view angle direction to the left view angle direction, L_j^left represents the j-th face key point of the second current image, ΔT_mid→right represents the transformation matrix from the middle view angle direction to the right view angle direction, L_j^right represents the j-th face key point of the third current image, and J represents the number of face key points of the left region in the second current image and of the right region in the third current image.
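A rough sketch of this side-view constraint is given below, assuming the view-to-view transforms are available as 4×4 homogeneous matrices and the side-view key points as 2D arrays; all names and shapes are hypothetical.

```python
import numpy as np

def side_view_loss(points_mid, T_mid, dT_left, dT_right,
                   lm_left, lm_right, weights, project):
    """Geometric constraint in the left and right view angle directions.

    points_mid: (J, 3) key points of the intermediate model in the middle view
    T_mid, dT_left, dT_right: 4x4 homogeneous transforms
    lm_left, lm_right: (J, 2) key points of the left/right regions in the side images
    weights: (J,) per-key-point weights
    """
    homo = np.hstack([points_mid, np.ones((len(points_mid), 1))])  # to homogeneous coordinates
    left = (homo @ (dT_left @ T_mid).T)[:, :3]                     # move points into the left view
    right = (homo @ (dT_right @ T_mid).T)[:, :3]                   # move points into the right view
    r_l = project(left) - lm_left
    r_r = project(right) - lm_right
    return float(np.sum(weights * (np.sum(r_l ** 2, axis=1) + np.sum(r_r ** 2, axis=1))))
```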
In some embodiments, the at least one loss includes a third loss, and the third loss characterizes the optical flow errors between the plurality of projection images and a plurality of reference images, respectively. The plurality of reference images are projection images obtained by projecting the intermediate image onto a plurality of adjacent images respectively, and the plurality of adjacent images are images that were captured in the plurality of view angle directions before the plurality of current images, are used for tracking facial expressions, and are adjacent to the plurality of current images.
In this embodiment, the constraint of the expression tracking algorithm is designed as a time-domain constraint between preceding and following images, that is, an optical flow constraint between preceding and following images, where optical flow refers to the displacement from a pixel point in frame i−1 (for example, a point on a moving car) to the corresponding pixel point in frame i. Specifically, the intermediate image (namely, the 3D model with the expression) is projected onto the plurality of current images to obtain the plurality of projection images, and the intermediate image is projected onto the plurality of adjacent images, which were captured in the plurality of view angle directions before the plurality of current images, are used for tracking facial expressions, and are adjacent to the plurality of current images, to obtain the plurality of reference images. On this basis, the objective function or loss function is designed as the optical flow difference between the plurality of projection images and the plurality of reference images, and the expression tracking effect is finally supervised based on the calculated optical flow difference.
Illustratively, the third loss characterizes the optical flow errors between the face regions of the plurality of projection images and the face regions of the plurality of reference images, respectively. For example, the face regions of the plurality of projection images and the face regions of the plurality of reference images may be obtained with a face segmentation algorithm, and the optical flow errors between the face regions of the plurality of projection images and the face regions of the plurality of reference images may then be calculated to finally obtain the third loss.
Illustratively, the third loss may be calculated by the following formula:
ε_flow = Σ_{g=1}^{G} w_g · || ( P(T_i, α_i)_g − P(T_{i−1}, α_{i−1})_g ) − F^{i−1→i}(g) ||²
where ε_flow represents the third loss, G represents the number of effective optical flow points, w_g represents the weight of the g-th effective optical flow point, P(·) represents the reprojection function used to project the intermediate image onto the plurality of current images and the plurality of adjacent images, T_i represents the head pose of the i-th image and α_i represents the BS coefficients adopted for the i-th image (the i-th image may be one of the plurality of current images), T_{i−1} represents the head pose of the (i−1)-th image and α_{i−1} represents the BS coefficients adopted for the (i−1)-th image (the (i−1)-th image may be one of the plurality of reference images), and F^{i−1→i} represents the optical flow between the (i−1)-th image and the i-th image.
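The time-domain term can be sketched in the same spirit, assuming the optical flow between the previous and current frame has already been computed and sampled at G effective points; everything here is illustrative.

```python
import numpy as np

def flow_loss(proj_prev, proj_curr, flow_prev_to_curr, weights):
    """eps_flow: the projected motion of each effective point should agree with the
    observed optical flow between the previous image and the current image.

    proj_prev:         (G, 2) points projected with (T_{i-1}, alpha_{i-1})
    proj_curr:         (G, 2) points projected with (T_i, alpha_i)
    flow_prev_to_curr: (G, 2) optical flow sampled at the same points
    weights:           (G,)   per-point weights
    """
    resid = (proj_curr - proj_prev) - flow_prev_to_curr
    return float(np.sum(weights * np.sum(resid ** 2, axis=1)))
```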
In some embodiments, the at least one loss includes a fourth loss, and the fourth loss characterizes at least one of: the sum of the M initial coefficients, the difference between the M initial coefficients and M reference coefficients, and the difference between the pose information of the plurality of current images and the pose information of a plurality of adjacent images, respectively. The plurality of adjacent images are images that were captured in the plurality of view angle directions before the plurality of current images, are used for tracking facial expressions, and are adjacent to the plurality of current images, and the M reference coefficients are the BS coefficients used for constructing the tracking image of the plurality of adjacent images.
In this embodiment, the constraint of the expression tracking algorithm is designed as a constraint on the sum of the M initial coefficients, a smoothness constraint on the pose information in the time domain, and a smoothness constraint on the BS coefficients in the time domain. That is, the objective function or loss function is designed as at least one of: the sum of the M initial coefficients, the difference between the M initial coefficients and the M reference coefficients, and the difference between the pose information of the plurality of current images and the pose information of the plurality of adjacent images, respectively, and the expression tracking effect is supervised based on the calculated differences.
Illustratively, the fourth loss may be combined with the first to third losses into the following overall objective:
ε = w_lm · ε_lm + w_geo · ε_geo + w_flow · ε_flow + w_reg · Σ_{j=1}^{M} α_j + w_pos · ||T − T'||² + w_exp · Σ_{j=1}^{M} ||α_j − α'_j||²
where w_lm represents the weight of the first loss ε_lm, w_geo represents the weight of the second loss ε_geo, w_flow represents the weight of the third loss ε_flow, w_reg represents the weight of the sum of the M initial coefficients, α_j represents the j-th initial coefficient of the M initial coefficients, w_pos represents the weight of the difference between the pose information of the plurality of current images and the pose information of the plurality of adjacent images, T' represents an image of the plurality of adjacent images and T represents an image of the plurality of current images, w_exp represents the weight of the difference between the M initial coefficients and the M reference coefficients, and α'_j represents the j-th reference coefficient of the M reference coefficients.
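Putting the terms together, a weighted objective of the kind described above might be assembled like the sketch below; the weights and helper losses are the hypothetical ones from the previous sketches, not the exact formulation of the embodiments.

```python
import numpy as np

def total_loss(eps_lm, eps_geo, eps_flow, alpha, alpha_ref, pose, pose_prev, w):
    """Weighted sum of the tracking losses plus the regularization terms.

    alpha:           (M,) current BS coefficients
    alpha_ref:       (M,) BS coefficients of the adjacent (previous) images
    pose, pose_prev: pose parameters of the current and adjacent images (arrays)
    w: dict with keys 'lm', 'geo', 'flow', 'reg', 'pos', 'exp'
    """
    return (w['lm'] * eps_lm
            + w['geo'] * eps_geo
            + w['flow'] * eps_flow
            + w['reg'] * float(np.sum(alpha))                      # keep coefficients small
            + w['pos'] * float(np.sum((pose - pose_prev) ** 2))    # pose smoothness in time
            + w['exp'] * float(np.sum((alpha - alpha_ref) ** 2)))  # coefficient smoothness in time
```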
In some embodiments, the S240 may include:
if the M BSs respectively correspond to M regions of the face and M is an even number, then when a first coefficient corresponding to a first region among the M coefficients differs from a second coefficient corresponding to a second region and the first region is symmetric with the second region, determining the average of the first coefficient and the second coefficient; determining the average as the coefficient corresponding to the first region and the coefficient corresponding to the second region; and constructing the tracking image of the plurality of current images based on the neutral face model, the M BSs, the coefficients corresponding to the regions other than the first region and the second region among the M regions, the coefficient corresponding to the first region, and the coefficient corresponding to the second region.
In this embodiment, the M BSs are designed as BSs corresponding to symmetric regions, and the coefficients of left-right symmetric BSs are averaged, which ensures that the expression on the left face and the expression on the right face are symmetric, so that the expression in the tracking image conforms better to the expression habits of a human face, and the expression tracking effect can be further improved.
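A minimal sketch of this left/right symmetrization step, assuming the symmetric regions are given as index pairs (names hypothetical):

```python
import numpy as np

def symmetrize_coefficients(coeffs, symmetric_pairs):
    """coeffs: (M,) BS coefficients, one per face region.
    symmetric_pairs: list of (left_idx, right_idx) pairs of mutually symmetric regions.
    For each pair whose coefficients differ, both are replaced by their average."""
    out = np.asarray(coeffs, dtype=float).copy()
    for left, right in symmetric_pairs:
        if out[left] != out[right]:
            mean = 0.5 * (out[left] + out[right])
            out[left] = out[right] = mean
    return out
```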
In some embodiments, the S210 may include:
acquiring a plurality of videos captured in the plurality of view angle directions;
performing three-dimensional reconstruction of the face of the target person using the plurality of videos to obtain the neutral face model;
acquiring the BSs corresponding to an average face model;
migrating the BSs corresponding to the average face model onto the neutral face model to obtain the M BSs.
For example, the BSs corresponding to the average face model may be manually created BSs, and the BSs corresponding to the average face model may be migrated onto the neutral face model by an expression transfer technique to obtain the M BSs. For example, the BSs corresponding to the average face model may be migrated onto the neutral face model using the EBR technique to obtain the M BSs.
In some embodiments, when the face of the target person is reconstructed in three dimensions using the plurality of videos, frames may first be extracted from the plurality of videos respectively to obtain N images; K images are then selected from the N images, where N ≥ K > 0; finally, the face of the target person is reconstructed in three dimensions based on the K images to obtain the neutral face model.
For example, frames may be extracted from the plurality of videos respectively at preset time intervals to obtain the N images; K images of better quality are then selected from the N images; finally, the face of the target person is reconstructed in three dimensions based on the K images to obtain the neutral face model. For example, K images suitable for building the neutral face model may be selected, and the face of the target person is reconstructed in three dimensions from them to obtain the neutral face model.
In some embodiments, when K images are selected from the N images, at least one of the following images among the N images may be deleted to obtain the K images: images in which the difference between the face key points and the face key points of a neutral face image exceeds a preset range, images whose variance is less than or equal to a first preset threshold, and the second image of two adjacent images, where the variance of the difference between the second image and the first image of the two adjacent images is less than or equal to a second preset threshold.
For example, when deleting the images in which the difference between the face key points and the face key points of the neutral face image exceeds the preset range, the neutral face image in the neutral expression state may first be determined among the N images; H distances between face key points in the neutral face image are then calculated, where the H distances are used to characterize the mouth shape and the eye shape of the target person in the neutral face image, and H > 0; the distances between the face key points in each of the N images are calculated; and the images in the N images whose distances between face key points differ from the H distances by more than the preset range are deleted. Alternatively, the first image extracted from the video of the middle view angle among the plurality of view angle directions may be determined as the neutral face image.
For example, when deleting the images whose variance is less than or equal to the first preset threshold among the N images, the variance of each of the N images may first be calculated, and the images whose variance is less than or equal to the first preset threshold are then deleted. For example, an image may be convolved with the Laplacian operator and the variance of the response may then be calculated; the Laplacian operator may use a 3×3 matrix or a matrix of another form.
Illustratively, when deleting the second image of two adjacent images among the N images, the variance of the difference between the pixel values of the second image and the first image of the two adjacent images may first be calculated; if the variance is less than or equal to the second preset threshold, the second image is deleted; otherwise, the second image is retained.
In this embodiment, when K images are selected from the N images, deleting the images in which the difference between the face key points and the face key points of the neutral face image exceeds the preset range is equivalent to deleting the images whose expression obviously differs from that of the neutral face image; deleting the images whose variance is less than or equal to the first preset threshold is equivalent to deleting the blurred images among the N images; and deleting the second image of two adjacent images is equivalent to deleting deformed images among the N images, for example, images captured while the camera was moving quickly. On this basis, the quality of the constructed neutral face model can be improved.
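The blur and near-duplicate filters described above could be prototyped with OpenCV roughly as follows; the thresholds are placeholders, the comparison is made against the previously kept frame, and the key-point-based expression filter is omitted for brevity, so the exact criteria in the embodiments may differ.

```python
import cv2
import numpy as np

def filter_frames(frames, blur_threshold=100.0, duplicate_threshold=25.0):
    """Keep frames that are sharp enough and differ enough from the previously kept frame."""
    kept, prev = [], None
    for frame in frames:
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        # blur check: variance of the Laplacian response (3x3 operator by default)
        if cv2.Laplacian(gray, cv2.CV_64F).var() <= blur_threshold:
            continue
        # near-duplicate check: variance of the pixel-value difference to the previous kept frame
        if prev is not None:
            diff_var = float(np.var(gray.astype(np.float64) - prev.astype(np.float64)))
            if diff_var <= duplicate_threshold:
                continue
        kept.append(frame)
        prev = gray
    return kept
```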
In some embodiments, three-dimensional construction is performed on the face of the target person based on the K images to obtain a pre-constructed model; based on pose information of the pre-built model, respectively projecting the pre-built model to the K images to obtain K projection images; respectively calculating the number of face key points matched between the K images and the K projection images; selecting images with the number of the matched face key points being greater than or equal to a third preset threshold value from the K images to obtain T images; k is more than or equal to T > 1; and carrying out three-dimensional construction on the face of the target person based on the T images to obtain the neutral face model.
By means of the results of a structure-from-motion (Structure From Motion, SFM) algorithm, such as the re-projection error and the number of matching points, the T images that matter most for constructing the neutral face model can be selected from the K images, for example the T images whose number of matched face key points is greater than or equal to the third preset threshold; the three-dimensional construction of the face of the target person is then performed again based on the selected T images to obtain the final neutral face model.
In this embodiment, the relatively important T images can be effectively selected by means of the pre-constructed model, and the face of the target person is then three-dimensionally constructed again based on these T images. Images that contribute little to building the neutral face model are thereby removed, which effectively improves the model quality of the neutral face model.
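One possible, assumed form of this selection step is sketched below: detected and projected key points are taken as given per image, a key point counts as matched when its reprojection error is below a pixel tolerance, and the tolerance and threshold values are illustrative only:

```python
# Sketch under assumptions: detected and projected key points are given as
# (num_points, 2) arrays per image; a key point is "matched" when its
# reprojection error is below a pixel tolerance. Thresholds are placeholders.
import numpy as np

def select_important_images(detected_kps, projected_kps,
                            pixel_tolerance=3.0, third_preset_threshold=50):
    selected = []
    for idx, (det, proj) in enumerate(zip(detected_kps, projected_kps)):
        errors = np.linalg.norm(det - proj, axis=1)       # per-key-point reprojection error
        matched = int(np.sum(errors < pixel_tolerance))   # number of matched key points
        if matched >= third_preset_threshold:
            selected.append(idx)                          # one of the T images
    return selected
```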
Fig. 7 is a schematic diagram of a neutral face model constructed based on multiple videos acquired in multiple view directions according to an embodiment of the present application.
As shown in fig. 7, after acquiring the plurality of videos collected in the plurality of view angle directions, frames may be extracted from the plurality of videos to obtain N images; after the N images are acquired, the images obviously inconsistent with the neutral facial expression, the blurred images and the deformed images are deleted to obtain K images; next, three-dimensional construction may be performed on the face of the target person based on the K images to obtain a pre-constructed model; after the pre-constructed model is obtained, T images whose number of matched face key points is greater than or equal to the third preset threshold may be selected from the K images based on the pre-constructed model; and finally, three-dimensional construction is performed on the face of the target person based on the T images to obtain the neutral face model.
In some embodiments, three-dimensional construction is performed on the face of the target person based on the T images to obtain a first intermediate model; the vertex coordinates of the first intermediate model are smoothed to obtain a second intermediate model; the deformed regions in the second intermediate model are repaired to obtain a third intermediate model; and the third intermediate model is scanned using a non-rigid iterative closest point (Non-rigid Iterative Closest Point, NICP) algorithm to obtain the neutral face model.
Fig. 8 is a schematic diagram of image-based neutral face model construction according to an embodiment of the present application.
As shown in fig. 8, the first intermediate model has low smoothness; a third intermediate model of better quality can be obtained through the smoothing and repairing of the first intermediate model. Further, the template model may be used to scan the third intermediate model based on the NICP algorithm, which amounts to obtaining, through the NICP algorithm, a clean 3D model whose topology is consistent with the template model and whose shape is consistent with the third intermediate model, i.e., the neutral face model, so that the model quality of the neutral face model is further improved.
The model obtained by three-dimensionally constructing the face of the target person based on the T images is computed from video frames, so its quality is limited and its noise is high; it can be regarded as a raw high-polygon (high-poly) model. In this embodiment, smoothing the first intermediate model, repairing the deformed regions of the second intermediate model, and scanning the third intermediate model with the NICP algorithm amount to denoising the first intermediate model three times, so the model quality of the finally obtained neutral face model can be ensured.
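As an assumed illustration of the smoothing step (the application does not specify the smoothing method, so simple Laplacian smoothing over the mesh adjacency is used here only as an example):

```python
# Sketch of one way to smooth vertex coordinates (plain Laplacian smoothing);
# the smoothing actually used by the application is not specified, so the
# method, iteration count and step size are assumptions.
import numpy as np

def laplacian_smooth(vertices, neighbors, iterations=3, lam=0.5):
    """vertices: (V, 3) array; neighbors: list of neighbour index lists per vertex."""
    v = vertices.astype(np.float64).copy()
    for _ in range(iterations):
        new_v = v.copy()
        for i, nbrs in enumerate(neighbors):
            if nbrs:
                centroid = v[nbrs].mean(axis=0)
                new_v[i] = v[i] + lam * (centroid - v[i])   # pull each vertex toward its neighbours
        v = new_v
    return v
```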
It should be appreciated that, for rigid bodies, the deformation modes generally include rotation and translation, in which case an iterative closest point (Iterative Closest Point, ICP) algorithm can solve the rigid registration problem. For objects such as human faces, hands and bodies, the deformation includes both rigid deformation (local rotations and translations caused by different poses) and non-rigid deformation (for example, differences in fatness, thinness and height caused by different shapes). The NICP algorithm allows non-rigid deformation within the source point set when seeking a matching relationship between the two point sets (the source point set and the target point set).
The NICP algorithm may use the template model to scan the scanned model to obtain an output model, which is a clean 3D model whose topology is consistent with the template model and whose shape is consistent with the scanned model.
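The sketch below is not the NICP algorithm itself; it only illustrates, under assumptions, the closest-point correspondence search that both ICP and NICP build on:

```python
# Not NICP itself: only the nearest-neighbour correspondence step that ICP-style
# registration (rigid or non-rigid) is built on, using SciPy's k-d tree.
import numpy as np
from scipy.spatial import cKDTree

def closest_point_correspondences(source_points, target_points):
    """Return, for every source point, its nearest target point and the distance."""
    tree = cKDTree(target_points)
    distances, indices = tree.query(source_points)
    return target_points[indices], distances
```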
The preferred embodiments of the present application have been described in detail above with reference to the accompanying drawings, but the present application is not limited to the specific details of the above embodiments, and various simple modifications may be made to the technical solutions of the present application within the scope of the technical concept of the present application, and all the simple modifications belong to the protection scope of the present application. For example, the specific features described in the above embodiments may be combined in any suitable manner, and in order to avoid unnecessary repetition, various possible combinations are not described in detail. As another example, any combination of the various embodiments of the present application may be made without departing from the spirit of the present application, which should also be considered as disclosed herein.
It should be further understood that, in the various method embodiments of the present application, the sequence numbers of the foregoing processes do not mean the order of execution, and the order of execution of the processes should be determined by the functions and internal logic of the processes, and should not constitute any limitation on the implementation process of the embodiments of the present application.
The method provided by the embodiment of the application is described above, and the device provided by the embodiment of the application is described below.
Fig. 9 is a schematic block diagram of an expression tracking apparatus 300 provided in an embodiment of the present application.
As shown in fig. 9, the expression tracking apparatus 300 may include:
an obtaining unit 310, configured to obtain a neutral face model of a target person and M expression bases BS corresponding to the neutral face model, where M is greater than 0;
the acquisition unit 320 is configured to acquire faces of the target person in multiple view angles, so as to obtain multiple current images corresponding to the multiple view angles respectively;
a calculating unit 330, configured to calculate M BS coefficients corresponding to the M BSs, respectively, for tracking facial expressions in the plurality of current images using an expression tracking algorithm;
a construction unit 340, configured to construct tracking images of a plurality of current images based on the neutral face model, the M BSs, and the M coefficients.
In some embodiments, the computing unit 330 is specifically configured to:
constructing an intermediate image with expression by using the M BSs and M initial coefficients respectively corresponding to the M BSs;
based on the pose information of the neutral face model, respectively projecting the intermediate image to the plurality of current images to obtain a plurality of projection images;
calculating at least one loss of the plurality of projection images;
and adjusting the M initial coefficients based on the at least one loss to obtain M BS coefficients.
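A minimal sketch of this coefficient-solving loop is given below; the orthographic projection, the single key-point loss and the L-BFGS-B solver are simplifying assumptions and do not represent the expression tracking algorithm of this application:

```python
# Assumed sketch of coefficient solving: the intermediate shape is the neutral
# model plus a weighted sum of blend-shape offsets, an orthographic projection
# stands in for the real camera model, and only a key-point loss plus a small
# regularizer is minimized. None of these simplifications come from this application.
import numpy as np
from scipy.optimize import minimize

def solve_bs_coefficients(neutral_kps3d, bs_offsets, observed_kps2d, init=None):
    """neutral_kps3d: (P, 3); bs_offsets: (M, P, 3); observed_kps2d: (P, 2)."""
    M = bs_offsets.shape[0]
    x0 = np.zeros(M) if init is None else np.asarray(init, dtype=np.float64)

    def loss(coeffs):
        shape = neutral_kps3d + np.tensordot(coeffs, bs_offsets, axes=1)  # blended shape
        projected = shape[:, :2]                       # assumed orthographic projection
        landmark_term = np.sum((projected - observed_kps2d) ** 2)
        regular_term = 1e-3 * np.sum(coeffs ** 2)      # keep coefficients small
        return landmark_term + regular_term

    res = minimize(loss, x0, method="L-BFGS-B", bounds=[(0.0, 1.0)] * M)
    return res.x                                        # the M BS coefficients
```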
In some embodiments, the at least one loss comprises a first loss; the first loss is characterized by a difference between a face key point of the first current image and a face key point of the first projection image;
the first current image is an image obtained by collecting a face of the target person in a middle view angle direction of the view angle directions, and the first projection image is a projection image obtained by projecting the middle image to the first current image.
In some embodiments, the at least one loss comprises a second loss; the second loss is characterized by at least one of: an error between a face key point of a left region in the second current image and a face key point of a left region in the second projection image, and a difference between a face key point of a right region in the third current image and a face key point of a right region in the third projection image;
The second current image is an image obtained by collecting the face of the target person in the left side view angle direction in the multiple view angle directions, and the second projection image is a projection image obtained by projecting the intermediate image to the second current image;
the third current image is an image obtained by collecting the face of the target person in the right side view angle direction in the view angle directions, and the third projection image is a projection image obtained by projecting the intermediate image to the third current image.
In some embodiments, the at least one loss includes a third loss characterized by optical flow errors between the plurality of projection images and a plurality of reference images, respectively;
the plurality of reference images are projection images obtained by respectively projecting the intermediate image to a plurality of adjacent images, and the plurality of adjacent images are images which are acquired in the plurality of view angles before the plurality of current images are acquired and are used for tracking facial expressions and are adjacent to the plurality of current images.
In some embodiments, the at least one loss includes a fourth loss characterized by at least one of:
The sum of the M initial coefficients, the difference between the M initial coefficients and the M reference coefficients, and the difference between the pose information of the plurality of current images and the pose information of the plurality of adjacent images, respectively;
wherein the plurality of neighboring images are images acquired in the plurality of viewing directions before the plurality of current images are acquired, used for tracking facial expressions, and neighboring the plurality of current images, and the M reference coefficients are BS coefficients used for constructing tracking images of the plurality of neighboring images.
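The fourth loss could, purely as an assumed illustration, be combined as follows; the weights and the pose representation (one pose vector per view) are placeholders:

```python
# Sketch of the regularization-style fourth loss described above; the weights
# and the pose representation are assumptions made for illustration.
import numpy as np

def fourth_loss(coeffs, reference_coeffs, current_poses, neighbor_poses,
                w_sum=1e-3, w_ref=1e-2, w_pose=1e-2):
    sparsity_term = np.sum(coeffs)                                   # sum of the M initial coefficients
    temporal_term = np.sum((coeffs - reference_coeffs) ** 2)         # vs. the M reference coefficients
    pose_term = sum(np.sum((c - n) ** 2)
                    for c, n in zip(current_poses, neighbor_poses))  # per-view pose difference
    return w_sum * sparsity_term + w_ref * temporal_term + w_pose * pose_term
```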
In some embodiments, the construction unit 340 is specifically configured to:
if the M BSs respectively correspond to M regions of the face and M is an even number, when a first coefficient corresponding to a first region among the M coefficients is different from a second coefficient corresponding to a second region, and the first region and the second region are symmetrical, determining the average value of the first coefficient and the second coefficient;
determining the average value as a coefficient corresponding to the first region and a coefficient corresponding to the second region;
and constructing tracking images of the plurality of current images based on the neutral face model, the M BSs, coefficients corresponding to regions except the first region and the second region in the M regions, coefficients corresponding to the first region and coefficients corresponding to the second region.
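An assumed sketch of this symmetric-region post-processing is given below; the pairing of left and right regions is taken as an input rather than from this application:

```python
# Sketch of the symmetric-region post-processing: for each assumed pair of
# left/right regions, differing coefficients are replaced by their average.
import numpy as np

def symmetrize_coefficients(coeffs, symmetric_pairs):
    """coeffs: length-M array; symmetric_pairs: list of (left_idx, right_idx) tuples."""
    out = np.asarray(coeffs, dtype=np.float64).copy()
    for left, right in symmetric_pairs:
        if not np.isclose(out[left], out[right]):
            mean = 0.5 * (out[left] + out[right])
            out[left] = out[right] = mean               # both regions share the average value
    return out
```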
In some embodiments, the obtaining unit 310 is specifically configured to:
acquiring a plurality of videos acquired in the plurality of view angle directions;
three-dimensional construction is carried out on the face of the target person by utilizing the videos, so that the neutral face model is obtained;
acquiring a BS corresponding to the average face model;
and migrating the BS corresponding to the average face model to the neutral face model to obtain the M BSs.
In some embodiments, the obtaining unit 310 is specifically configured to:
respectively extracting frames from the plurality of videos to obtain N images;
selecting K images from the N images; n is more than or equal to K > 0;
and carrying out three-dimensional construction on the face of the target person based on the K images to obtain the neutral face model.
In some embodiments, the obtaining unit 310 is specifically configured to:
deleting at least one of the following images in the N images to obtain the K images:
the method comprises the steps of (1) an image with a difference between a face key point and a face key point in a neutral face image exceeding a preset range, an image with a variance smaller than or equal to a first preset threshold value, and a second image in two adjacent images;
wherein the variance of the difference between the second image and the first of the two images is less than or equal to a second preset threshold.
In some embodiments, the obtaining unit 310 is specifically configured to:
carrying out three-dimensional construction on the face of the target person based on the K images to obtain a pre-constructed model;
based on pose information of the pre-built model, respectively projecting the pre-built model to the K images to obtain K projection images;
respectively calculating the number of face key points matched between the K images and the K projection images;
selecting images with the number of the matched face key points being greater than or equal to a third preset threshold value from the K images to obtain T images; k is more than or equal to T > 1;
and carrying out three-dimensional construction on the face of the target person based on the T images to obtain the neutral face model.
In some embodiments, the obtaining unit 310 is specifically configured to:
three-dimensional construction is carried out on the face of the target person based on the T images, so that a first intermediate model is obtained;
smoothing the vertex coordinates of the first intermediate model to obtain a second intermediate model;
repairing the deformed region in the second intermediate model to obtain a third intermediate model;
and scanning the third intermediate model by using a non-rigid iterative closest point NICP algorithm to obtain the neutral face model.
It should be understood that the apparatus embodiments and the method embodiments may correspond to each other, and similar descriptions may refer to the method embodiments. To avoid repetition, no further description is provided here. Specifically, the expression tracking apparatus 300 may correspond to the corresponding subject executing the method 200 in the embodiments of the present application, and each unit in the expression tracking apparatus 300 is configured to implement the corresponding flow in the method 200; for brevity, details are not described here again.
It should also be understood that the units in the expression tracking apparatus 300 according to the embodiments of the present application may be separately or jointly combined into one or several other units, or some unit(s) thereof may be further split into a plurality of functionally smaller units, which can achieve the same operation without affecting the technical effects of the embodiments of the present application. The above units are divided based on logical functions; in practical applications, the function of one unit may be implemented by a plurality of units, or the functions of a plurality of units may be implemented by one unit. In other embodiments of the present application, the expression tracking apparatus 300 may also include other units, and in practical applications these functions may be implemented with the assistance of other units or through the cooperation of a plurality of units. According to another embodiment of the present application, the expression tracking apparatus 300 may be constructed, and the expression tracking method of the embodiments of the present application may be implemented, by running a computer program (including program code) capable of executing the steps involved in the corresponding method on a general-purpose computing device, such as a general-purpose computer including processing elements and storage media such as a central processing unit (CPU), a random access memory (RAM) and a read-only memory (ROM). The computer program may be recorded on a computer-readable storage medium, loaded into an electronic device through the computer-readable storage medium, and executed therein to implement the corresponding method provided by the embodiments of the present application.
In other words, the units referred to above may be implemented in hardware, by instructions in software, or by a combination of hardware and software. Specifically, each step of the method embodiments in the embodiments of the present application may be completed by an integrated logic circuit of hardware in a processor and/or by instructions in the form of software, and the steps of the methods disclosed in connection with the embodiments of the present application may be directly performed by a hardware decoding processor or by a combination of hardware and software modules in the decoding processor. Alternatively, the software module may be located in a storage medium well established in the art, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory, an electrically erasable programmable memory, or a register. The storage medium is located in the memory, and the processor reads the information in the memory and completes the steps in the above method embodiments in combination with its hardware.
Fig. 10 is a schematic structural diagram of an electronic device 400 provided in an embodiment of the present application.
As shown in fig. 10, the electronic device 400 includes at least a processor 410 and a computer-readable storage medium 420, which may be connected by a bus or in other manners. The computer-readable storage medium 420 is configured to store a computer program 421, the computer program 421 including computer instructions, and the processor 410 is configured to execute the computer instructions stored in the computer-readable storage medium 420. The processor 410 is the computing core and control core of the electronic device 400; it is adapted to implement one or more computer instructions, in particular to load and execute one or more computer instructions so as to implement the corresponding method flow or function.
As an example, the processor 410 may also be referred to as a central processing unit (Central Processing Unit, CPU). The processor 410 may include, but is not limited to: a general purpose processor, digital signal processor (Digital Signal Processor, DSP), application specific integrated circuit (Application Specific Integrated Circuit, ASIC), field programmable gate array (Field Programmable Gate Array, FPGA) or other programmable logic device, discrete gate or transistor logic device, discrete hardware components, or the like.
By way of example, the computer-readable storage medium 420 may be a high-speed RAM memory or a non-volatile memory (Non-Volatile Memory), such as at least one magnetic disk memory; alternatively, it may be at least one computer-readable storage medium located remotely from the aforementioned processor 410. In particular, the computer-readable storage medium 420 includes, but is not limited to: volatile memory and/or non-volatile memory. The non-volatile memory may be a read-only memory (Read-Only Memory, ROM), a programmable ROM (Programmable ROM, PROM), an erasable PROM (Erasable PROM, EPROM), an electrically erasable PROM (Electrically EPROM, EEPROM), or a flash memory. The volatile memory may be a random access memory (Random Access Memory, RAM), which acts as an external cache. By way of example, and not limitation, many forms of RAM are available, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), and direct Rambus RAM (DR RAM).
As shown in fig. 10, the electronic device 400 may also include a transceiver 430.
The processor 410 may control the transceiver 430 to communicate with other devices, and in particular, may send information or data to other devices or receive information or data sent by other devices. Transceiver 430 may include a transmitter and a receiver. Transceiver 430 may further include antennas, the number of which may be one or more.
It will be appreciated that the various components in the electronic device 400 are connected by a bus system that includes, in addition to a data bus, a power bus, a control bus, and a status signal bus.
In one implementation, the electronic device 400 may be any electronic device having data processing capabilities; the computer readable storage medium 420 has stored therein first computer instructions; first computer instructions stored in computer readable storage medium 420 are loaded and executed by processor 410 to implement corresponding steps in the method embodiment shown in fig. 1; in particular, the first computer instructions in the computer readable storage medium 420 are loaded by the processor 410 and perform the corresponding steps, and are not repeated here.
According to another aspect of the present application, the embodiments of the present application also provide a computer-readable storage medium (Memory), which is a memory device in the electronic device 400 for storing programs and data, for example the computer-readable storage medium 420. It is understood that the computer-readable storage medium 420 here may include a built-in storage medium in the electronic device 400 and may also include an extended storage medium supported by the electronic device 400. The computer-readable storage medium provides storage space that stores the operating system of the electronic device 400. Also stored in this storage space are one or more computer instructions, which may be one or more computer programs 421 (including program code), adapted to be loaded and executed by the processor 410.
According to another aspect of the present application, the embodiments of the present application also provide a computer program product or computer program comprising computer instructions stored in a computer-readable storage medium, for example the computer program 421. In this case, the electronic device 400 may be a computer; the processor 410 reads the computer instructions from the computer-readable storage medium 420 and executes them, so that the computer performs the expression tracking method provided in the various alternative implementations described above.
In other words, when implemented in software, the above may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions of the embodiments of the present application are carried out in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another, for example, from one website, computer, server or data center to another by wired (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (e.g., infrared, radio, microwave) means.
Those of ordinary skill in the art will appreciate that the elements and process steps of the examples described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or as a combination of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
Finally, it should be noted that the above is only a specific embodiment of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily think about the changes or substitutions within the technical scope of the present application, and the changes or substitutions are covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.
Claims (16)
1. An expression tracking method, comprising:
acquiring a neutral face model of a target person and M expression bases BS corresponding to the neutral face model, wherein M is more than 0;
collecting the face of the target person in a plurality of view angles to obtain a plurality of current images corresponding to the view angles respectively;
calculating M BS coefficients which are used for tracking facial expressions in the current images and respectively correspond to the M BSs by using an expression tracking algorithm;
and constructing tracking images of a plurality of current images based on the neutral face model, the M BSs and the M coefficients.
2. The method of claim 1, wherein calculating M BS coefficients for tracking facial expressions in the plurality of current images and corresponding to the M BSs, respectively, using an expression tracking algorithm, comprises:
Constructing an intermediate image with expression by using the M BSs and M initial coefficients respectively corresponding to the M BSs;
based on the pose information of the neutral face model, respectively projecting the intermediate images to the plurality of current images to obtain a plurality of projection images;
calculating at least one loss of the plurality of projection images;
and adjusting the M initial coefficients based on the at least one loss to obtain M BS coefficients.
3. The method of claim 2, wherein the at least one penalty comprises a first penalty; the first loss is characterized by a difference between a face key point of the first current image and a face key point of the first projection image;
the first current image is an image obtained by collecting a face of the target person in a middle view angle direction of the view angle directions, and the first projection image is a projection image obtained by projecting the middle image to the first current image.
4. The method of claim 2, wherein the at least one penalty comprises a second penalty; the second loss is characterized by at least one of: an error between a face key point of a left region in the second current image and a face key point of a left region in the second projection image, and a difference between a face key point of a right region in the third current image and a face key point of a right region in the third projection image;
The second current image is an image obtained by collecting the face of the target person in the left side view angle direction in the multiple view angle directions, and the second projection image is a projection image obtained by projecting the intermediate image to the second current image;
the third current image is an image obtained by collecting the face of the target person in the right side view angle direction in the view angle directions, and the third projection image is a projection image obtained by projecting the intermediate image to the third current image.
5. The method of claim 2, wherein the at least one penalty comprises a third penalty characterized by optical flow errors between the plurality of projected images and a plurality of reference images, respectively;
the plurality of reference images are projection images obtained by respectively projecting the intermediate image to a plurality of adjacent images, and the plurality of adjacent images are images which are acquired in the plurality of view angles before the plurality of current images are acquired and are used for tracking facial expressions and are adjacent to the plurality of current images.
6. The method of claim 2, wherein the at least one penalty comprises a fourth penalty characterized by at least one of:
The sum of the M initial coefficients, the difference between the M initial coefficients and the M reference coefficients, and the difference between the pose information of the plurality of current images and the pose information of the plurality of adjacent images respectively;
wherein the plurality of neighboring images are images which are acquired in the plurality of viewing directions before the plurality of current images are acquired, are used for tracking facial expressions, and are neighboring the plurality of current images, and the M reference coefficients are BS coefficients used for constructing tracking images of the plurality of neighboring images.
7. The method according to any one of claims 1 to 6, wherein the constructing a tracking image of a plurality of current images based on the neutral face model, the M BSs, and the M coefficients comprises:
if the M BSs respectively correspond to M areas of the face and M is an even number, determining average values of a first coefficient corresponding to a first area and a second coefficient corresponding to a second area in the M coefficients when the first coefficient is different from the second coefficient corresponding to the second area and the first area and the second area are symmetrical;
determining the average value as a coefficient corresponding to the first region and a coefficient corresponding to the second region;
And constructing tracking images of the plurality of current images based on the neutral face model, the M BSs, coefficients corresponding to regions except the first region and the second region in the M regions, coefficients corresponding to the first region and coefficients corresponding to the second region.
8. The method according to any one of claims 1 to 6, wherein the acquiring the neutral face model of the target person and the M expression bases BS corresponding to the neutral face model includes:
acquiring a plurality of videos acquired in the plurality of view angle directions;
three-dimensional construction is carried out on the face of the target person by utilizing the videos, so that the neutral face model is obtained;
acquiring a BS corresponding to the average face model;
and migrating the BS corresponding to the average face model to the neutral face model to obtain the M BSs.
9. The method of claim 8, wherein the three-dimensionally constructing the face of the target person using the plurality of videos to obtain the neutral face model comprises:
respectively extracting frames from the plurality of videos to obtain N images;
selecting K images from the N images; n is more than or equal to K > 0;
and carrying out three-dimensional construction on the face of the target person based on the K images to obtain the neutral face model.
10. The method of claim 8, wherein selecting K images from the N images comprises:
deleting at least one of the following images in the N images to obtain the K images:
the method comprises the steps of (1) an image with a difference between a face key point and a face key point in a neutral face image exceeding a preset range, an image with a variance smaller than or equal to a first preset threshold value, and a second image in two adjacent images;
wherein the variance of the difference between the second image and the first of the two images is less than or equal to a second preset threshold.
11. The method of claim 8, wherein the three-dimensionally constructing the face of the target person using the plurality of videos to obtain the neutral face model comprises:
carrying out three-dimensional construction on the face of the target person based on the K images to obtain a pre-constructed model;
based on pose information of the pre-built model, respectively projecting the pre-built model to the K images to obtain K projection images;
respectively calculating the number of face key points matched between the K images and the K projection images;
Selecting images with the number of the matched face key points being greater than or equal to a third preset threshold value from the K images to obtain T images; k is more than or equal to T > 1;
and carrying out three-dimensional construction on the face of the target person based on the T images to obtain the neutral face model.
12. The method of claim 11, wherein the three-dimensionally constructing the face of the target person based on the T images to obtain the neutral face model comprises:
three-dimensional construction is carried out on the face of the target person based on the T images, so that a first intermediate model is obtained;
smoothing the vertex coordinates of the first intermediate model to obtain a second intermediate model;
repairing the deformed region in the second intermediate model to obtain a third intermediate model;
and scanning the third intermediate model by using a non-rigid iterative closest point NICP algorithm to obtain the neutral face model.
13. An expression tracking device, comprising:
an obtaining unit, configured to obtain a neutral face model of a target person and M expression bases BS corresponding to the neutral face model, where M is greater than 0;
an acquisition unit, configured to collect the face of the target person in a plurality of view angle directions, so as to obtain a plurality of current images respectively corresponding to the plurality of view angle directions;
A calculating unit, configured to calculate M BS coefficients corresponding to the M BSs, respectively, for tracking facial expressions in the plurality of current images using an expression tracking algorithm;
and the construction unit is used for constructing tracking images of a plurality of current images based on the neutral face model, the M BSs and the M coefficients.
14. An electronic device, comprising:
a processor adapted to execute a computer program;
a computer readable storage medium having stored therein a computer program which, when executed by the processor, implements the method of any of claims 1 to 12.
15. A computer readable storage medium storing a computer program for causing a computer to perform the method of any one of claims 1 to 12.
16. A computer program product comprising a computer program and/or instructions which, when executed by a processor, implement the method of any one of claims 1 to 12.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111478236 | 2021-12-06 | ||
CN2021114782362 | 2021-12-06 |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116228808A true CN116228808A (en) | 2023-06-06 |
Family
ID=86570290
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111671707.1A Pending CN116228808A (en) | 2021-12-06 | 2021-12-31 | Expression tracking method, device, equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116228808A (en) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
REG | Reference to a national code | Ref country code: HK; Ref legal event code: DE; Ref document number: 40088333; Country of ref document: HK