CN111311712B - Video frame processing method and device - Google Patents

Video frame processing method and device

Info

Publication number
CN111311712B
Authority
CN
China
Prior art keywords
expression
dimensional
preset
face
key points
Prior art date
Legal status
Active
Application number
CN202010112686.9A
Other languages
Chinese (zh)
Other versions
CN111311712A (en)
Inventor
赵洋
陈睿智
Current Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202010112686.9A
Publication of CN111311712A
Application granted
Publication of CN111311712B
Legal status: Active (Current)
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 13/00 Animation
    • G06T 13/20 3D [Three Dimensional] animation
    • G06T 13/40 3D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/161 Detection; Localisation; Normalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/174 Facial expression recognition

Abstract

Embodiments of the present disclosure disclose a video frame processing method and device. One embodiment of the method includes: acquiring face key points in a video frame containing a target face; projecting key points of preset three-dimensional expressions into the two-dimensional plane where the face key points are located, so that the reprojection error between the projection result and the face key points is minimized; determining, based on the minimized reprojection error, an expression coefficient of each preset three-dimensional expression in the three-dimensional expression corresponding to the target face; and generating, based on the determined expression coefficients, a three-dimensional animation having the three-dimensional expression corresponding to the target face. The scheme provided by the embodiments of the present disclosure can accurately determine the expression coefficients by using the minimized reprojection error, and the use of preset three-dimensional expressions facilitates generating a more accurate three-dimensional animation.

Description

Video frame processing method and device
Technical Field
Embodiments of the present disclosure relate to the field of computer technology, and in particular to a video frame processing method and device.
Background
With the development of internet technology, more and more live streaming platforms and short video platforms are emerging, and users can watch video programs on these platforms anytime and anywhere.
In scenarios such as taking selfies or live streaming, the user may choose to convert his or her face into an animated character. In this process, the user's face may be mapped to a two-dimensional or three-dimensional animated character, so that the user's facial movements are represented by the animated character, thereby making the picture more interesting.
Disclosure of Invention
The embodiment of the disclosure provides a video frame processing method and device.
In a first aspect, an embodiment of the present disclosure provides a video frame processing method, including: acquiring face key points in a video frame containing a target face; projecting key points of preset three-dimensional expressions into the two-dimensional plane where the face key points are located, so that the reprojection error between the projection result and the face key points is minimized; determining, based on the minimized reprojection error, an expression coefficient of each preset three-dimensional expression in the three-dimensional expression corresponding to the target face; and generating, based on the determined expression coefficients, a three-dimensional animation having the three-dimensional expression corresponding to the target face.
In some embodiments, determining, based on the minimized reprojection error, an expression coefficient of each preset three-dimensional expression in the three-dimensional expression corresponding to the target face includes: for the key point subset of each item of the preset five sense organs among the key points of the preset three-dimensional expressions, determining the minimized reprojection error between the sub-projection result of that key point subset and the face key points of that item among the face key points; and for each item of the preset five sense organs, determining, based on the minimized reprojection error corresponding to that item, the expression coefficient of each preset three-dimensional expression in the three-dimensional expression corresponding to the target face.
In some embodiments, determining, based on the minimized reprojection error, an expression coefficient of each preset three-dimensional expression in the three-dimensional expression corresponding to the target face further includes: for the minimized reprojection error of the projection result of the key points of the preset three-dimensional expressions, determining the pose of the target face corresponding to that minimized reprojection error; and, for each item of the preset five sense organs, determining, based on the minimized reprojection error corresponding to that item, the expression coefficient of each preset three-dimensional expression in the three-dimensional expression corresponding to the target face includes: taking the determined pose as an initial value for iteration, and iterating the pose of the target face and the expression coefficient of each preset three-dimensional expression in the three-dimensional expression corresponding to the target face, so as to minimize the reprojection error between the sub-projection result of the key point subset of each item of the preset five sense organs and the face key points of that item among the face key points.
In some embodiments, acquiring face key points in a video frame containing a target face includes: aligning the key points in each video frame to obtain aligned face key points.
In some embodiments, generating, based on the determined expression coefficients, a three-dimensional animation having the three-dimensional expression corresponding to the target face includes: performing a weighted average on the expression coefficients of two of the video frames, and updating the expression coefficients of the later of the two video frames to the weighted-average result, where the two video frames are adjacent frames, or the number of video frames between the two video frames is a preset number that does not exceed a preset threshold.
In a second aspect, an embodiment of the present disclosure provides a video frame processing apparatus, including: an acquisition unit configured to acquire face key points in a video frame containing a target face; the projection unit is configured to project key points of a preset three-dimensional expression into a two-dimensional plane where the key points of the human face are located, so that a reprojection error between a projection result and the key points of the human face is minimized; a determining unit configured to determine, based on the minimized re-projection error, an expression coefficient of each preset three-dimensional expression in the three-dimensional expressions corresponding to the target face; and a generation unit configured to generate a three-dimensional animation having a three-dimensional expression corresponding to the target face based on the determined expression coefficient.
In some embodiments, the determining unit is further configured to determine, based on the minimized reprojection error, the expression coefficient of each preset three-dimensional expression in the three-dimensional expression corresponding to the target face in the following manner: for the key point subset of each item of the preset five sense organs among the key points of the preset three-dimensional expressions, determining the minimized reprojection error between the sub-projection result of that key point subset and the face key points of that item among the face key points; and for each item of the preset five sense organs, determining, based on the minimized reprojection error corresponding to that item, the expression coefficient of each preset three-dimensional expression in the three-dimensional expression corresponding to the target face.
In some embodiments, the determining unit is further configured to determine, based on the minimized reprojection error, the expression coefficient of each preset three-dimensional expression in the three-dimensional expression corresponding to the target face in the following manner: for the minimized reprojection error of the projection result of the key points of the preset three-dimensional expressions, determining the pose of the target face corresponding to that minimized reprojection error; and the determining unit is further configured to determine, for each item of the preset five sense organs and based on the minimized reprojection error corresponding to that item, the expression coefficient of each preset three-dimensional expression in the three-dimensional expression corresponding to the target face in the following manner: taking the determined pose as an initial value for iteration, and iterating the pose of the target face and the expression coefficient of each preset three-dimensional expression in the three-dimensional expression corresponding to the target face, so as to minimize the reprojection error between the sub-projection result of the key point subset of each item of the preset five sense organs and the face key points of that item among the face key points.
In some embodiments, the acquiring unit is further configured to acquire the face key points in the video frame containing the target face in the following manner: aligning the key points in each video frame to obtain aligned face key points.
In some embodiments, the determining unit is further configured to generate, based on the determined expression coefficients, the three-dimensional animation having the three-dimensional expression corresponding to the target face in the following manner: performing a weighted average on the expression coefficients of two of the video frames, and updating the expression coefficients of the later of the two video frames to the weighted-average result, where the two video frames are adjacent frames, or the number of video frames between the two video frames is a preset number that does not exceed a preset threshold.
In a third aspect, an embodiment of the present disclosure provides an electronic device, including: one or more processors; and a storage device having one or more programs stored thereon, which when executed by the one or more processors, cause the one or more processors to implement the method as described in any of the implementations of the first aspect.
In a fourth aspect, embodiments of the present disclosure provide a computer-readable medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements a method as described in any of the implementations of the first aspect.
Embodiments of the present disclosure provide a video frame processing method and device, which first acquire face key points in a video frame containing a target face. The key points of the preset three-dimensional expressions are then projected into the two-dimensional plane where the face key points are located, so that the reprojection error between the projection result and the face key points is minimized. Next, the expression coefficient of each preset three-dimensional expression in the three-dimensional expression corresponding to the target face is determined based on the minimized reprojection error. Finally, a three-dimensional animation having the three-dimensional expression corresponding to the target face is generated based on the determined expression coefficients. The scheme provided by the embodiments of the present disclosure can accurately determine the expression coefficients by using the minimized reprojection error, and the use of preset three-dimensional expressions facilitates generating a more accurate three-dimensional animation.
Drawings
Other features, objects and advantages of the present disclosure will become more apparent upon reading of the detailed description of non-limiting embodiments, made with reference to the following drawings:
FIG. 1 is an exemplary system architecture diagram in which an embodiment of the present disclosure may be applied;
FIG. 2 is a flow chart of one embodiment of a video frame processing method according to the present disclosure;
FIG. 3 is a schematic illustration of one application scenario of a video frame processing method according to the present disclosure;
FIG. 4 is a flow chart of yet another embodiment of a video frame processing method according to the present disclosure;
FIG. 5 is a schematic diagram of the architecture of one embodiment of a video frame processing apparatus according to the present disclosure;
FIG. 6 is a schematic diagram of a computer system suitable for use in implementing an embodiment of the present disclosure.
Detailed Description
The present disclosure is described in further detail below with reference to the drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting of the invention. It should be noted that, for convenience of description, only the portions related to the present invention are shown in the drawings.
It should be noted that, without conflict, the embodiments of the present disclosure and features of the embodiments may be combined with each other. The present disclosure will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
Fig. 1 illustrates an exemplary system architecture 100 to which a video frame processing method or video frame processing apparatus of embodiments of the present disclosure may be applied.
As shown in fig. 1, a system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 is used as a medium to provide communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.
The user may interact with the server 105 via the network 104 using the terminal devices 101, 102, 103 to receive or send messages or the like. Various communication client applications, such as a live broadcast application, a short video application, a web browser application, a shopping class application, a search class application, an instant messaging tool, a mailbox client, social platform software, etc., may be installed on the terminal devices 101, 102, 103.
The terminal devices 101, 102, 103 may be hardware or software. When the terminal devices 101, 102, 103 are hardware, they may be various electronic devices having a display screen and supporting web browsing, including but not limited to smartphones, tablet computers, electronic book readers, laptop and desktop computers, and the like. When the terminal devices 101, 102, 103 are software, they may be installed in the electronic devices listed above, and may be implemented as multiple pieces of software or software modules (e.g., to provide distributed services) or as a single piece of software or software module. No specific limitation is made here.
The server 105 may be a server that provides various services, such as a background server that provides support for three-dimensional animations displayed on the terminal devices 101, 102, 103. The background server may analyze and process the received video frame containing the face (such as the target face), and feed back the processing result (for example, the three-dimensional animation corresponding to the target face) to the terminal device.
It should be noted that, the video frame processing method provided by the embodiments of the present disclosure may be performed by the server 105 or the terminal devices 101, 102, 103, and accordingly, the video frame processing apparatus may be disposed in the server 105 or the terminal devices 101, 102, 103.
It should be understood that the number of terminal devices, networks and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
With continued reference to fig. 2, a flow 200 of one embodiment of a video frame processing method according to the present disclosure is shown. The video frame processing method comprises the following steps:
step 201, obtaining a face key point in a video frame containing a target face.
In this embodiment, the execution body of the video frame processing method (e.g., the server shown in fig. 1) may acquire the face key points in the video frame. The video frame contains a target face, and all or part of key points of the target face can be used as the determined face key points. The executing body may determine the face key points in the video frame in various manners. For example, the executing body may directly obtain the face key point from a local or other electronic device, or may first obtain the video frame, and detect the key point from the video frame as the face key point.
In some alternative implementations of the present embodiment, step 201 may include: and aligning the key points in each video frame to obtain the aligned face key points.
In these alternative implementations, the execution body may align the key points across the video frames, and then use the aligned face key points as the face key points for determining the minimized reprojection error.
These implementations facilitate subsequent continuous animation of multiple video frames by aligning face keypoints between the video frames.
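As an illustration of what such alignment could look like, the following is a minimal Python sketch that aligns each frame's detected key points to a reference frame with a similarity (Procrustes-style) transform. The alignment method, the function name align_keypoints, and the frames_keypoints input are assumptions for illustration; the text does not prescribe a specific alignment procedure.

# A minimal sketch of aligning face key points across video frames, assuming a
# similarity (Procrustes-style) alignment to the key points of a reference frame.
import numpy as np

def align_keypoints(keypoints: np.ndarray, reference: np.ndarray) -> np.ndarray:
    """Align (N, 2) key points to (N, 2) reference key points with scale, rotation, translation."""
    mu_k, mu_r = keypoints.mean(axis=0), reference.mean(axis=0)
    k, r = keypoints - mu_k, reference - mu_r
    # Optimal rotation via SVD of the cross-covariance matrix.
    u, s, vt = np.linalg.svd(k.T @ r)
    rot = u @ vt
    if np.linalg.det(rot) < 0:   # avoid reflections
        vt[-1] *= -1
        rot = u @ vt
    scale = np.trace(r.T @ (k @ rot)) / (k ** 2).sum()
    return scale * (k @ rot) + mu_r

# Usage (hypothetical input): align every frame's key points to the first frame.
# frames_keypoints: list of (N, 2) arrays, one per video frame.
# aligned = [frames_keypoints[0]] + [align_keypoints(kp, frames_keypoints[0])
#                                    for kp in frames_keypoints[1:]]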
Step 202, projecting the key points of the preset three-dimensional expression into the two-dimensional plane where the key points of the face are located, so as to minimize the reprojection error between the projection result and the key points of the face.
In this embodiment, the execution body may acquire the key points of the preset three-dimensional expressions and project them into the two-dimensional plane where the face key points are located, so as to minimize the reprojection error between the projection result and the face key points. Specifically, the two-dimensional plane where the face key points are located may be the two-dimensional image containing the face key points. In this two-dimensional plane, the coordinate difference between the projection result and the face key points is the reprojection error. The reprojection error is affected by several parameters, for example at least one of the following: the pose of the face (i.e., the head pose), the intrinsic parameters of the camera that captured the video frame, the key points of a reference three-dimensional face (an expressionless face), the key points of each preset three-dimensional expression, and the expression coefficient of each preset three-dimensional expression. The execution body may adjust one or more of these parameters, thereby adjusting the magnitude of the reprojection error, so as to obtain the minimized reprojection error while the parameters remain reasonable (for example, the value of each parameter is within its corresponding preset range).
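To make the projection and the reprojection error concrete, the following is a minimal Python sketch assuming a simple pinhole camera model; the function names project and reprojection_error are illustrative, and the pinhole model is an assumption since this embodiment does not fix a particular camera model.

# A minimal sketch of the projection and reprojection error described above,
# assuming a simple pinhole camera model.
import numpy as np

def project(points_3d: np.ndarray, R: np.ndarray, t: np.ndarray, K: np.ndarray) -> np.ndarray:
    """Project (N, 3) points into the image plane given rotation R (3x3),
    translation t (3,), and camera intrinsics K (3x3)."""
    cam = points_3d @ R.T + t        # rigid transform into camera coordinates
    uvw = cam @ K.T                  # apply camera intrinsics
    return uvw[:, :2] / uvw[:, 2:3]  # perspective division -> (N, 2) pixel coordinates

def reprojection_error(projected_2d: np.ndarray, face_keypoints_2d: np.ndarray) -> float:
    """Mean coordinate difference between the projection result and the face key points."""
    return float(np.mean(np.linalg.norm(projected_2d - face_keypoints_2d, axis=1)))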
Step 203, determining an expression coefficient of each preset three-dimensional expression in the three-dimensional expressions corresponding to the target face based on the minimized reprojection error.
In this embodiment, the execution body may determine, based on the minimized reprojection error obtained in step 202, the expression coefficient of each preset three-dimensional expression in the three-dimensional expression corresponding to the expression presented by the target face (i.e., the three-dimensional expression corresponding to the target face). In practice, with the expression coefficients so determined for the preset three-dimensional expressions, the reprojection error of the projection is minimized. Specifically, the expression presented by the target face is a two-dimensional expression, and the three-dimensional expression corresponding to the target face can be expressed as a combination of one or more preset three-dimensional expressions.
For example, the plurality of preset three-dimensional expressions may include smiling, mouth opening, blinking, and so on. If the preset three-dimensional expression "mouth opening" has an expression coefficient of 0.4 and the preset three-dimensional expression "smiling" has an expression coefficient of 0.6, the three-dimensional expression corresponding to the target face formed by these two preset expressions is a smile.
In the projection process, a preset formula or algorithm can be used to represent the key points of the three-dimensional expression corresponding to the target face in terms of the key points of the preset three-dimensional expressions. Specifically, the execution body may weight the preset three-dimensional expressions to obtain the three-dimensional expression corresponding to the target face, where the weights are the expression coefficients. Alternatively, the execution body may weight the differences between the key point coordinates of each preset three-dimensional expression and the key point coordinates of the reference three-dimensional face, and take the sum of the weighted result and the key point coordinates of the reference three-dimensional face as the three-dimensional expression corresponding to the target face; the weights used are the expression coefficients.
In practice, the parameters affecting the reprojection error can be expressed by the following projection formulas:

P_2D = Proj(Rt × P_3D),
P_3D = B0 + Σ_i (a_i × ΔB_i),

where P_2D is the projection result, P_3D is the three-dimensional expression corresponding to the target face, Rt is the pose of the target face (i.e., the head pose of the target face), Proj is the projection given by the intrinsic parameters (internal reference) of the camera that captured the video frame, B0 is the key points of the reference three-dimensional face, ΔB_i is the coordinate (vector) difference between the key points of the i-th preset three-dimensional expression and the key points of the reference three-dimensional face, and a_i is the expression coefficient of the i-th preset three-dimensional expression. In this projection formula, the differences between the key point coordinates of each preset three-dimensional expression and those of the reference three-dimensional face are weighted.
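The projection formulas above can be written out directly in code. The following Python sketch composes P_3D from B0, ΔB_i, and the coefficients a_i, and reuses the project function from the earlier sketch to obtain P_2D; the array shapes and names are assumptions for illustration.

# The projection formula as code: P_3D = B0 + sum_i(a_i * dB_i), P_2D = Proj(Rt x P_3D).
import numpy as np

def compose_expression(B0: np.ndarray, delta_B: np.ndarray, coeffs: np.ndarray) -> np.ndarray:
    """B0: (N, 3) key points of the reference (expressionless) 3D face.
    delta_B: (M, N, 3) key point offsets of the M preset 3D expressions.
    coeffs: (M,) expression coefficients a_i. Returns the (N, 3) key points P_3D."""
    return B0 + np.tensordot(coeffs, delta_B, axes=1)  # weighted sum of offsets added to B0

# P_2D would then be obtained with the project sketch shown earlier:
# p_2d = project(compose_expression(B0, delta_B, coeffs), R, t, K)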
Step 204, based on the determined expression coefficient, generating a three-dimensional animation with the three-dimensional expression corresponding to the target face.
In this embodiment, the executing body may generate the three-dimensional animation having the three-dimensional expression corresponding to the target face based on the expression coefficients determined for the respective preset three-dimensional expressions. Specifically, the executing body may input the expression coefficient into the animation driving model, so that the generated animation has the three-dimensional expression of the target face.
In some optional implementations of this embodiment, step 204 may include: performing a weighted average on the expression coefficients of two of the video frames, and updating the expression coefficients of the later of the two video frames to the weighted-average result, where the two video frames are adjacent frames, or the number of video frames between the two video frames is a preset number that does not exceed a preset threshold.
In these alternative implementations, the execution body may smooth the determined expression coefficients, thereby updating them. Specifically, the execution body may perform a weighted average on the expression coefficients of two video frames and take the weighted-average result as the expression coefficients of the later frame. The expression coefficients of the earlier frame either have already been updated or do not need updating. Between the two video frames, the weight of the earlier frame and the weight of the later frame are preset. It should be noted that the expression coefficient of every preset expression is updated here.
For example, the execution body may take the expression coefficients of the 1st frame and the 3rd frame for a weighted average, where the weight of the 1st frame is a preset X (X between 0 and 1) and the weight of the 3rd frame is 1 - X, and use the weighted-average result as the updated expression coefficients of the 3rd frame. Then, the frame following the later of these two frames may be taken as the earlier frame for the next smoothing step; that is, the 4th frame and the 6th frame may be taken next.
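The following is a minimal Python sketch of this smoothing, following the 1st-and-3rd-frame example above; the default weight of 0.5 and the stride of 2 frames are illustrative assumptions, since X and the frame interval are described only as preset values.

# A minimal sketch of smoothing expression coefficients across video frames:
# the later frame's coefficients become a weighted average of an earlier frame and itself.
import numpy as np

def smooth_coefficients(coeffs_per_frame: list, x: float = 0.5, stride: int = 2) -> list:
    """coeffs_per_frame: list of (M,) arrays, one per video frame.
    x: weight of the earlier frame (0 < x < 1); 1 - x is the weight of the later frame.
    stride: index difference between the two frames (2 pairs frame 1 with frame 3)."""
    smoothed = [np.asarray(c, dtype=float) for c in coeffs_per_frame]
    for later in range(stride, len(smoothed), stride + 1):   # pairs (1,3), (4,6), ... in 1-based terms
        earlier = later - stride
        smoothed[later] = x * smoothed[earlier] + (1.0 - x) * smoothed[later]
    return smoothed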
These implementations can prevent outlier values in individual video frames, caused by inaccurately determined expression coefficients, from affecting the stability of the picture, thereby making the picture of the generated animation more stable.
With continued reference to fig. 3, fig. 3 is a schematic diagram of an application scenario of the video frame processing method according to this embodiment. In the application scenario of fig. 3, the execution body 301 acquires face key points 302 in a video frame containing a target face, and projects the key points of the preset three-dimensional expressions into the two-dimensional plane where the face key points are located, so that the reprojection error between the projection result 303 and the face key points is minimized. The execution body 301 determines, based on the minimized reprojection error 304, the expression coefficient 305 of each preset three-dimensional expression in the three-dimensional expression corresponding to the target face. The execution body 301 then generates, based on the determined expression coefficient 305, a three-dimensional animation 306 having the three-dimensional expression corresponding to the target face.
The method provided by this embodiment of the present disclosure can accurately determine the expression coefficients by using the minimized reprojection error. Moreover, the use of preset three-dimensional expressions facilitates generating a more accurate three-dimensional animation.
With further reference to fig. 4, a flow 400 of yet another embodiment of a video frame processing method is shown. The video frame processing method flow 400 includes the steps of:
step 401, obtaining a face key point in a video frame containing a target face.
In this embodiment, the execution body of the video frame processing method (e.g., the server shown in fig. 1) may acquire the face key points in the video frame. The video frame comprises a target face, and all or part of key points of the target face are the determined face key points. The executing body may determine the face key points in the video frame in various manners. For example, the executing body may directly obtain the face key point from a local or other electronic device, or may first obtain the video frame, and detect the key point from the video frame as the face key point.
Step 402, projecting the key points of the preset three-dimensional expression into the two-dimensional plane where the key points of the face are located, so as to minimize the reprojection error between the projection result and the key points of the face.
In this embodiment, the execution body may acquire key points of a preset three-dimensional expression, and project the key points into a two-dimensional plane where the key points of the face are located, so as to minimize a reprojection error between the projection result and the key points of the face. Specifically, the two-dimensional plane in which the face key points are located may be a two-dimensional image containing the face key points. In the two-dimensional plane, the coordinate difference between the projection result and the key point of the face is a reprojection error.
Step 403, for the key point subset of each item of the preset five sense organs among the key points of the preset three-dimensional expressions, determining the minimized reprojection error between the sub-projection result of that key point subset and the face key points of that item among the face key points.
In this embodiment, the key points of the preset three-dimensional expressions include the key points of each item of the preset five sense organs. For the key point subset of each item, the execution body may determine the minimized reprojection error between the sub-projection result of that subset within the overall projection result and the face key points of that item among the face key points. In this way, the execution body can compute the minimized reprojection error separately for each item of the preset five sense organs.
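As an illustration, the following Python sketch splits the projected key points into per-feature subsets and computes a reprojection error for each of the five sense organs separately; the index ranges in FEATURE_INDICES are purely hypothetical, since the actual key point layout is not specified.

# A sketch of computing the reprojection error separately for each of the preset five sense organs.
import numpy as np

FEATURE_INDICES = {                      # hypothetical key point indices per facial feature
    "left_eyebrow": range(0, 5),
    "right_eyebrow": range(5, 10),
    "left_eye": range(10, 16),
    "right_eye": range(16, 22),
    "nose": range(22, 31),
    "mouth": range(31, 51),
}

def per_feature_errors(projected_2d: np.ndarray, face_keypoints_2d: np.ndarray) -> dict:
    """Return the mean reprojection error of each feature's sub-projection result
    against that feature's face key points."""
    return {
        name: float(np.mean(np.linalg.norm(
            projected_2d[list(idx)] - face_keypoints_2d[list(idx)], axis=1)))
        for name, idx in FEATURE_INDICES.items()
    }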
Step 404, for each item in the preset five sense organs, determining an expression coefficient of each preset three-dimensional expression in the three-dimensional expression corresponding to the target face based on the minimized re-projection error corresponding to the item.
In this embodiment, for each item of the preset five sense organs, the execution body may determine, based on the minimized reprojection error corresponding to that item, the expression coefficient of each preset three-dimensional expression corresponding to that item. In this way, expression coefficients are determined separately for each item of the preset five sense organs, instead of adopting a single unified set of expression coefficients for the preset three-dimensional expressions over the whole face.
In some optional implementations of this embodiment, the method may further include: for the minimized reprojection error of the projection result of the key points of the preset three-dimensional expressions, determining the pose of the target face corresponding to that minimized reprojection error. Accordingly, step 404 may include: taking the determined pose as an initial value for iteration, and iterating the pose of the target face and the expression coefficient of each preset three-dimensional expression in the three-dimensional expression corresponding to the target face, so as to minimize the reprojection error between the sub-projection result of the key point subset of each item of the preset five sense organs and the face key points of that item among the face key points.
In these optional implementations, for the projection result of the key points of each preset three-dimensional expression on the two-dimensional plane, the execution body determines the minimized reprojection error corresponding to that projection result and the face pose corresponding to that minimized reprojection error, that is, the pose of the target face. In practice, the face pose and the expression coefficients can be iterated with minimizing the reprojection error as the objective. Because this iteration is carried out over the whole target face, the accuracy of the expression coefficients it yields is not high, so only the face pose obtained from this iteration may be kept.
After the minimized reprojection error has been determined for the key points of the whole face to obtain the face pose, the execution body may determine the minimized reprojection error separately for each item of the five sense organs and iterate out the expression coefficients of that item. In practice, during the iterative process of determining the minimized reprojection error, the face pose and the expression coefficients are updated continuously, and the corresponding reprojection error is reduced along the gradient. Specifically, the iteration may be performed using the projection formula described above.
These implementations may iterate multiple times with minimizing the reprojection error as the objective, resulting in more accurate expression coefficients.
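The following Python sketch illustrates this two-stage fitting: the pose is first iterated over all face key points, and then, with that pose as the initial value, the pose and the expression coefficients are iterated for each item of the five sense organs. It reuses the project, compose_expression, and FEATURE_INDICES sketches from above; the rotation-vector pose parameterization, the non-linear least-squares solver, and the [0, 1] bounds on the coefficients are assumptions for illustration rather than the patent's prescribed procedure.

# A sketch of the two-stage fitting: pose over the whole face, then per-feature coefficients.
import numpy as np
from scipy.optimize import least_squares
from scipy.spatial.transform import Rotation

def fit_frame(face_kp_2d, B0, delta_B, K):
    """face_kp_2d: (N, 2) detected face key points; B0: (N, 3); delta_B: (M, N, 3); K: (3, 3)."""
    n_expr = delta_B.shape[0]

    def residuals(params, indices, fixed_coeffs=None):
        # params = [rotation vector (3), translation (3), optional coefficients (M)]
        rvec, t = params[:3], params[3:6]
        coeffs = fixed_coeffs if fixed_coeffs is not None else params[6:]
        p3d = compose_expression(B0, delta_B, coeffs)
        p2d = project(p3d, Rotation.from_rotvec(rvec).as_matrix(), t, K)
        return (p2d[indices] - face_kp_2d[indices]).ravel()

    # Stage 1: iterate the pose over the whole face, expression coefficients held at zero.
    all_idx = np.arange(B0.shape[0])
    pose0 = np.array([0.0, 0.0, 0.0, 0.0, 0.0, 1.0])   # face assumed in front of the camera
    pose = least_squares(residuals, pose0, args=(all_idx, np.zeros(n_expr))).x

    # Stage 2: with that pose as the initial value, iterate pose and coefficients per feature.
    coeffs_per_feature = {}
    lower = np.concatenate([np.full(6, -np.inf), np.zeros(n_expr)])
    upper = np.concatenate([np.full(6, np.inf), np.ones(n_expr)])
    for name, idx in FEATURE_INDICES.items():
        x0 = np.concatenate([pose, np.zeros(n_expr)])
        sol = least_squares(residuals, x0, args=(np.asarray(list(idx)),), bounds=(lower, upper))
        coeffs_per_feature[name] = sol.x[6:]             # coefficients fitted for this feature
    return pose, coeffs_per_feature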
Step 405, based on the determined expression coefficient, generating a three-dimensional animation with a three-dimensional expression corresponding to the target face.
In this embodiment, the execution body may generate the three-dimensional animation having the three-dimensional expression corresponding to the target face based on the determined respective expression coefficients. Specifically, the executing body may input the expression coefficients determined for each of the preset five sense organs into the animation driving model, so that the generated animation has the three-dimensional expression of the target face.
This embodiment can avoid misjudging the expression due to differences in the relationships among the five sense organs, such as differences in the distances between them. For example, if the distance between the eyebrows and the eyes is naturally large, judging the expression over the entire face might wrongly classify it as an eyebrow-raising expression. This embodiment instead pays attention to factors such as the shape of each of the five sense organs, which improves the accuracy of expression judgment.
With further reference to fig. 5, as an implementation of the method shown in the foregoing figures, the present disclosure provides an embodiment of a video frame processing apparatus, where the embodiment of the apparatus corresponds to the embodiment of the method shown in fig. 2, and the apparatus is particularly applicable to various electronic devices.
As shown in fig. 5, the video frame processing apparatus 500 of the present embodiment includes: an acquisition unit 501, a projection unit 502, a determination unit 503, and a generation unit 504. Wherein, the acquiring unit 501 is configured to acquire a face key point in a video frame containing a target face; a projection unit 502 configured to project key points of a preset three-dimensional expression into a two-dimensional plane in which the key points of the face are located, so as to minimize a reprojection error between the projection result and the key points of the face; a determining unit 503 configured to determine, based on the minimized re-projection error, an expression coefficient of each preset three-dimensional expression in the three-dimensional expressions corresponding to the target face; a generating unit 504 configured to generate a three-dimensional animation having a three-dimensional expression corresponding to the target face based on the determined expression coefficient.
In this embodiment, the specific processing and the technical effects of the acquiring unit 501, the projecting unit 502, the determining unit 503 and the generating unit 504 of the video frame processing apparatus 500 may refer to the relevant descriptions of the steps 201, 202, 203 and 204 in the corresponding embodiment of fig. 2, and are not repeated herein.
In some optional implementations of this embodiment, the determining unit is further configured to determine, based on the minimized reprojection error, the expression coefficient of each preset three-dimensional expression in the three-dimensional expression corresponding to the target face in the following manner: for the key point subset of each item of the preset five sense organs among the key points of the preset three-dimensional expressions, determining the minimized reprojection error between the sub-projection result of that key point subset and the face key points of that item among the face key points; and for each item of the preset five sense organs, determining, based on the minimized reprojection error corresponding to that item, the expression coefficient of each preset three-dimensional expression in the three-dimensional expression corresponding to the target face.
In some optional implementations of this embodiment, the determining unit is further configured to determine, based on the minimized reprojection error, the expression coefficient of each preset three-dimensional expression in the three-dimensional expression corresponding to the target face in the following manner: for the minimized reprojection error of the projection result of the key points of the preset three-dimensional expressions, determining the pose of the target face corresponding to that minimized reprojection error; and the determining unit is further configured to determine, for each item of the preset five sense organs and based on the minimized reprojection error corresponding to that item, the expression coefficient of each preset three-dimensional expression in the three-dimensional expression corresponding to the target face in the following manner: taking the determined pose as an initial value for iteration, and iterating the pose of the target face and the expression coefficient of each preset three-dimensional expression in the three-dimensional expression corresponding to the target face, so as to minimize the reprojection error between the sub-projection result of the key point subset of each item of the preset five sense organs and the face key points of that item among the face key points.
In some optional implementations of this embodiment, the acquiring unit is further configured to acquire the face key points in the video frame containing the target face in the following manner: aligning the key points in each video frame to obtain aligned face key points.
In some optional implementations of this embodiment, the determining unit is further configured to generate, based on the determined expression coefficients, the three-dimensional animation having the three-dimensional expression corresponding to the target face in the following manner: performing a weighted average on the expression coefficients of two of the video frames, and updating the expression coefficients of the later of the two video frames to the weighted-average result, where the two video frames are adjacent frames, or the number of video frames between the two video frames is a preset number that does not exceed a preset threshold.
As shown in fig. 6, the electronic device 600 may include a processing means (e.g., a central processing unit, a graphics processor, etc.) 601, which may perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 602 or a program loaded from a storage means 608 into a Random Access Memory (RAM) 603. In the RAM 603, various programs and data required for the operation of the electronic device 600 are also stored. The processing device 601, the ROM 602, and the RAM 603 are connected to each other through a bus 604. An input/output (I/O) interface 605 is also connected to the bus 604.
In general, the following devices may be connected to the I/O interface 605: input devices 606 including, for example, a touch screen, touchpad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, and the like; an output device 607 including, for example, a Liquid Crystal Display (LCD), a speaker, a vibrator, and the like; storage 608 including, for example, magnetic tape, hard disk, etc.; and a communication device 609. The communication means 609 may allow the electronic device 600 to communicate with other devices wirelessly or by wire to exchange data. While fig. 6 shows an electronic device 600 having various means, it is to be understood that not all of the illustrated means are required to be implemented or provided. More or fewer devices may be implemented or provided instead. Each block shown in fig. 6 may represent one device or a plurality of devices as needed.
In particular, according to embodiments of the present disclosure, the processes described above with reference to flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method shown in the flowcharts. In such an embodiment, the computer program may be downloaded and installed from a network via communication means 609, or from storage means 608, or from ROM 602. The above-described functions defined in the methods of the embodiments of the present disclosure are performed when the computer program is executed by the processing means 601. It should be noted that the computer readable medium of the embodiments of the present disclosure may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In an embodiment of the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. Whereas in embodiments of the present disclosure, the computer-readable signal medium may comprise a data signal propagated in baseband or as part of a carrier wave, with computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, fiber optic cables, RF (radio frequency), and the like, or any suitable combination of the foregoing.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units involved in the embodiments of the present application may be implemented by software, or may be implemented by hardware. The described units may also be provided in a processor, for example, described as: a processor includes an acquisition unit, a projection unit, a determination unit, and a generation unit. The names of these units do not constitute limitations on the unit itself in some cases, and for example, the acquisition unit may also be described as "a unit that acquires a face key point in a video frame containing a target face".
As another aspect, the present application also provides a computer-readable medium that may be contained in the apparatus described in the above embodiments; or may be present alone without being fitted into the device. The computer readable medium carries one or more programs which, when executed by the apparatus, cause the apparatus to: acquiring a face key point in a video frame containing a target face; projecting key points of a preset three-dimensional expression into a two-dimensional plane where key points of a human face are located, so that a reprojection error between a projection result and the key points of the human face is minimized; determining an expression coefficient of each preset three-dimensional expression in the three-dimensional expression corresponding to the target face based on the minimized reprojection error; and generating a three-dimensional animation with the three-dimensional expression corresponding to the target face based on the determined expression coefficient.
The foregoing description is only of the preferred embodiments of the present application and an explanation of the technical principles employed. It will be appreciated by persons skilled in the art that the scope of the invention referred to in this application is not limited to the specific combinations of the features described above, and is also intended to cover other technical solutions formed by any combination of the above features or their equivalents without departing from the inventive concept, for example, technical solutions formed by replacing the above features with technical features having similar functions disclosed in (but not limited to) the present application.

Claims (12)

1. A video frame processing method, comprising:
acquiring a face key point in a video frame containing a target face;
projecting key points of a preset three-dimensional expression into a two-dimensional plane where the key points of the face are located so as to minimize a reprojection error between a projection result and the key points of the face, wherein the two-dimensional plane where the key points of the face are located comprises: a two-dimensional image containing the face key points;
determining an expression coefficient of each preset three-dimensional expression in the three-dimensional expression corresponding to the target face based on the minimized reprojection error;
and generating a three-dimensional animation with the three-dimensional expression corresponding to the target face based on the determined expression coefficient.
2. The method of claim 1, wherein determining, based on the minimized re-projection error, an expression coefficient of each preset three-dimensional expression in the three-dimensional expressions corresponding to the target face includes:
for the key point subset of each item of the preset five sense organs among the key points of the preset three-dimensional expressions, determining a minimized reprojection error between a sub-projection result of that key point subset and the face key points of that item among the face key points;
and for each item in the preset five sense organs, determining the expression coefficient of each preset three-dimensional expression in the three-dimensional expression corresponding to the target face based on the minimized reprojection error corresponding to the item.
3. The method of claim 2, wherein determining the expression coefficient of each preset three-dimensional expression in the three-dimensional expression corresponding to the target face based on the minimized re-projection error, further comprises:
for the minimized reprojection error of the projection result of the key point of the preset three-dimensional expression, determining the pose of the target face corresponding to the minimized reprojection error; and
for each item in the preset five sense organs, determining the expression coefficient of each preset three-dimensional expression in the three-dimensional expression corresponding to the target face based on the minimized reprojection error corresponding to the item, including:
and taking the determined pose as an iteration initial value, and iterating the pose of the target face and the expression coefficients of each preset three-dimensional expression in the total surface condition of the target face so as to minimize the sub-projection result of the key point subset of each item in the preset five sense organs and the re-projection error of the key point of the item in the key points of the face.
4. The method of claim 1, wherein the acquiring face keypoints in the video frame containing the target face comprises:
and aligning the key points in each video frame to obtain the aligned face key points.
5. The method of claim 1, wherein the generating a three-dimensional animation having the corresponding three-dimensional expression of the target face based on the determined expression factor comprises:
and carrying out weighted average on the expression coefficients of two video frames, and updating the expression coefficients of the next frame in the two video frames to be the weighted average result, wherein the two video frames are adjacent frames or the number of video frames at intervals of the two video frames is a preset number and does not exceed a preset threshold value.
6. A video frame processing apparatus comprising:
an acquisition unit configured to acquire face key points in a video frame containing a target face;
the projection unit is configured to project key points of a preset three-dimensional expression into a two-dimensional plane where the key points of the face are located so as to minimize a reprojection error between a projection result and the key points of the face, wherein the two-dimensional plane where the key points of the face are located comprises: a two-dimensional image containing the face key points;
a determining unit configured to determine, based on the minimized re-projection error, an expression coefficient of each preset three-dimensional expression in the three-dimensional expressions corresponding to the target face;
and the generation unit is configured to generate a three-dimensional animation with the three-dimensional expression corresponding to the target face based on the determined expression coefficient.
7. The apparatus of claim 6, the determining unit further configured to perform the determining of the expression coefficients of the respective preset three-dimensional expressions in the three-dimensional expressions corresponding to the target face based on the minimized re-projection error in such a manner that:
for the key point subset of each item of the preset five sense organs among the key points of the preset three-dimensional expressions, determining a minimized reprojection error between a sub-projection result of that key point subset and the face key points of that item among the face key points;
and for each item in the preset five sense organs, determining the expression coefficient of each preset three-dimensional expression in the three-dimensional expression corresponding to the target face based on the minimized reprojection error corresponding to the item.
8. The apparatus of claim 7, the determining unit further configured to perform the determining of the expression coefficients of the respective preset three-dimensional expressions in the three-dimensional expressions corresponding to the target face based on the minimized re-projection error in such a manner that:
for the minimized reprojection error of the projection result of the key point of the preset three-dimensional expression, determining the pose of the target face corresponding to the minimized reprojection error; and
the determining unit is further configured to perform, for each of the preset five sense organs, determining, based on the minimized re-projection error corresponding to the item, an expression coefficient of each of the preset three-dimensional expressions in the three-dimensional expression corresponding to the target face in the following manner:
and taking the determined pose as an iteration initial value, and iterating the pose of the target face and the expression coefficients of each preset three-dimensional expression in the total surface condition of the target face so as to minimize the sub-projection result of the key point subset of each item in the preset five sense organs and the re-projection error of the key point of the item in the key points of the face.
9. The apparatus of claim 6, wherein the acquisition unit is further configured to perform the acquiring face keypoints in a video frame containing a target face in the following manner:
and aligning the key points in each video frame to obtain the aligned face key points.
10. The apparatus of claim 6, wherein the determining unit is further configured to perform the generating the three-dimensional animation having the three-dimensional expression corresponding to the target face based on the determined expression coefficient in a manner as follows:
and carrying out weighted average on the expression coefficients of two video frames, and updating the expression coefficients of the next frame in the two video frames to be the weighted average result, wherein the two video frames are adjacent frames or the number of video frames at intervals of the two video frames is a preset number and does not exceed a preset threshold value.
11. An electronic device, comprising:
one or more processors;
storage means for storing one or more programs,
when executed by the one or more processors, causes the one or more processors to implement the method of any of claims 1-5.
12. A computer readable storage medium having stored thereon a computer program, wherein the program when executed by a processor implements the method of any of claims 1-5.
CN202010112686.9A 2020-02-24 2020-02-24 Video frame processing method and device Active CN111311712B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010112686.9A CN111311712B (en) 2020-02-24 2020-02-24 Video frame processing method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010112686.9A CN111311712B (en) 2020-02-24 2020-02-24 Video frame processing method and device

Publications (2)

Publication Number Publication Date
CN111311712A CN111311712A (en) 2020-06-19
CN111311712B true CN111311712B (en) 2023-06-16

Family

ID=71154698

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010112686.9A Active CN111311712B (en) 2020-02-24 2020-02-24 Video frame processing method and device

Country Status (1)

Country Link
CN (1) CN111311712B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20080051007A (en) * 2006-12-04 2008-06-10 한국전자통신연구원 Method of face modeling and animation from a single video stream
CN106600667A (en) * 2016-12-12 2017-04-26 南京大学 Method for driving face animation with video based on convolution neural network
CN109087379A (en) * 2018-08-09 2018-12-25 北京华捷艾米科技有限公司 The moving method of human face expression and the moving apparatus of human face expression
CN115512014A (en) * 2022-10-08 2022-12-23 北京世纪好未来教育科技有限公司 Method for training expression driving generation model, expression driving method and device

Family Cites Families (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0664527A1 (en) * 1993-12-30 1995-07-26 Eastman Kodak Company Method and apparatus for standardizing facial images for personalized video entertainment
US6828972B2 (en) * 2002-04-24 2004-12-07 Microsoft Corp. System and method for expression mapping
US7742624B2 (en) * 2006-04-25 2010-06-22 Motorola, Inc. Perspective improvement for image and video applications
US7751599B2 (en) * 2006-08-09 2010-07-06 Arcsoft, Inc. Method for driving virtual facial expressions by automatically detecting facial expressions of a face image
US20080136814A1 (en) * 2006-09-17 2008-06-12 Chang Woo Chu System and method for generating 3-d facial model and animation using one video camera
KR101555347B1 (en) * 2009-04-09 2015-09-24 삼성전자 주식회사 Apparatus and method for generating video-guided facial animation
CN103093490B (en) * 2013-02-02 2015-08-26 浙江大学 Based on the real-time face animation method of single video camera
US9852543B2 (en) * 2015-03-27 2017-12-26 Snap Inc. Automated three dimensional model generation
US10467793B2 (en) * 2018-02-08 2019-11-05 King.Com Ltd. Computer implemented method and device
CN108399383B (en) * 2018-02-14 2021-03-23 深圳市商汤科技有限公司 Expression migration method, device storage medium, and program
CN108986190A (en) * 2018-06-21 2018-12-11 珠海金山网络游戏科技有限公司 A kind of method and system of the virtual newscaster based on human-like persona non-in three-dimensional animation
CN110796719A (en) * 2018-07-16 2020-02-14 北京奇幻科技有限公司 Real-time facial expression reconstruction method
CN109377539B (en) * 2018-11-06 2023-04-11 北京百度网讯科技有限公司 Method and apparatus for generating animation
CN109816704B (en) * 2019-01-28 2021-08-03 北京百度网讯科技有限公司 Method and device for acquiring three-dimensional information of object
CN110288705B (en) * 2019-07-02 2023-08-04 北京字节跳动网络技术有限公司 Method and device for generating three-dimensional model
CN110349152A (en) * 2019-07-16 2019-10-18 广州图普网络科技有限公司 Method for detecting quality of human face image and device
CN110531860B (en) * 2019-09-02 2020-07-24 腾讯科技(深圳)有限公司 Animation image driving method and device based on artificial intelligence

Also Published As

Publication number Publication date
CN111311712A (en) 2020-06-19

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant