CN114140563A - Virtual object processing method and device - Google Patents

Virtual object processing method and device

Info

Publication number
CN114140563A
Authority
CN
China
Prior art keywords
virtual object
expression
parameters
image
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111467068.7A
Other languages
Chinese (zh)
Inventor
秦泽奎
李强
刘明聪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Dajia Internet Information Technology Co Ltd
Original Assignee
Beijing Dajia Internet Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Dajia Internet Information Technology Co Ltd filed Critical Beijing Dajia Internet Information Technology Co Ltd
Priority to CN202111467068.7A priority Critical patent/CN114140563A/en
Publication of CN114140563A publication Critical patent/CN114140563A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T13/00 Animation
    • G06T13/20 3D [Three Dimensional] animation
    • G06T13/40 3D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The present disclosure relates to a method and an apparatus for processing a virtual object, the method including: acquiring a virtual object to be processed, expression parameters and posture parameters, wherein the expression parameters are obtained by extracting a real facial expression, and the posture parameters are obtained by extracting a real facial posture; inputting the virtual object and the expression parameters into a pre-trained expression driving model, and performing expression conversion on the virtual object through the expression parameters to obtain a first target virtual object; and inputting the first target virtual object and the posture parameters into a pre-trained posture driving model to obtain a second target virtual object.

Description

Virtual object processing method and device
Technical Field
The present disclosure relates to the field of image processing, and more particularly, to a method and an apparatus for processing a virtual object.
Background
Currently, motion migration technology is used to drive a virtual object (e.g., a two-dimensional anime-style character, a cartoon character or a cartoon animal) to change its expression and posture according to instructions. In some application scenarios, for example live streaming, an anchor can broadcast through a two-dimensional character whose expression and posture change synchronously with the anchor's own, which protects privacy and makes the live stream more engaging. In one driving scheme for two-dimensional characters, expression and posture driving is realized by decomposing the motion migration into changes of a plurality of facial organs. However, the operation efficiency of this scheme is relatively low and it cannot run in real time on a GPU (Graphics Processing Unit), so a virtual object driving technique with higher operation efficiency is needed.
Disclosure of Invention
The present disclosure provides a method and an apparatus for processing a virtual object, so as to at least solve the problems in the related art.
According to a first aspect of the embodiments of the present disclosure, a method for processing a virtual object is provided, including: acquiring a virtual object to be driven, expression parameters and posture parameters, wherein the expression parameters are obtained by extracting a real facial expression, and the posture parameters are obtained by extracting a real facial posture; inputting the virtual object and the expression parameters into a pre-trained expression driving model, and performing expression conversion on the virtual object through the expression parameters to obtain a first target virtual object; and inputting the first target virtual object and the posture parameters into a pre-trained posture driving model, and performing posture adjustment on the first target virtual object through the posture parameters to obtain a second target virtual object.
Optionally, the expression driving model is generated in the following manner: acquiring a first training sample, wherein the first training sample comprises a first initial virtual object without a target expression, expression parameters representing the target expression and a second initial virtual object with the target expression; obtaining a first facial optical flow and a first image feature through a first convolutional neural network based on the first initial virtual object and the expression parameters, wherein the first facial optical flow represents the pixel movement caused when the target expression acts on the first initial virtual object; performing grid sampling on the first initial virtual object based on the first facial optical flow to obtain a first pre-estimated image simulating the target expression; correcting the first pre-estimated image based on the first image feature to obtain a second pre-estimated image simulating the target expression; calculating a value of an expression driving loss function based on the second pre-estimated image and the second initial virtual object; and updating the parameters of the first convolutional neural network based on the value of the expression driving loss function to obtain a trained expression driving model.
Optionally, the second initial virtual object includes a first type image and a second type image, the first type image is obtained by inputting the first initial virtual object and the expression parameter into a first image model, the second type image is obtained by inputting the first initial virtual object and the expression parameter into a second image model, and the first type image and the second type image have different facial cleanliness.
Optionally, the method further comprises: cutting out an image within a preset range from the central position of the first initial virtual object, wherein the preset range is determined according to the range of pixel movement caused when the target expression acts on the first initial virtual object; and the obtaining of the first facial optical flow and the first image feature through the first convolutional neural network based on the first initial virtual object and the expression parameters includes: obtaining the first facial optical flow and the first image feature through the first convolutional neural network based on the cut-out image within the preset range and the expression parameters.
Optionally, the obtaining, by the first convolutional neural network based on the first initial virtual object and the expression parameter, a first facial optical flow and a first image feature includes: and performing downsampling on a first convolution layer of the first convolution neural network based on the first initial virtual object and the expression parameters, and obtaining a first facial optical flow and a first image feature through a subsequent convolution layer based on data obtained by downsampling.
Optionally, the number of channels of each convolutional layer of the convolutional layers of the first convolutional neural network is a preset number, where the preset number represents a lower limit number of channels corresponding to each convolutional layer when the convolutional layer is operated.
Optionally, the generating manner of the posture driving model includes: acquiring a second training sample, wherein the second training sample comprises a third initial virtual object without a target posture, posture parameters representing the target posture and a fourth initial virtual object with the target posture, and the third initial virtual object and the fourth initial virtual object have the same expression; obtaining a second facial optical flow through a second convolutional neural network based on the third initial virtual object and the posture parameters, wherein the second facial optical flow represents the pixel movement caused when the target posture acts on the third initial virtual object; performing grid sampling on the third initial virtual object based on the second facial optical flow to obtain a pre-estimated virtual object simulating the target posture; calculating a value of a posture driving loss function based on the pre-estimated virtual object simulating the target posture and the fourth initial virtual object; and updating the parameters of the second convolutional neural network based on the value of the posture driving loss function to obtain a trained posture driving model.
Optionally, the fourth initial virtual object includes a third type image and a fourth type image, the third type image is obtained by inputting the third initial virtual object and the posture parameter into a third image model, the fourth type image is obtained by inputting the third initial virtual object and the posture parameter into a fourth image model, and the third type image and the fourth type image have different face cleanliness degrees.
Optionally, the second convolutional neural network includes a plurality of convolutional layers, and the deriving the second facial optical flow through the second convolutional neural network based on the third initial virtual object and the pose parameter includes: and based on the third initial virtual object and the attitude parameters, performing down-sampling on a first convolution layer of the second convolution neural network, and based on data obtained by the down-sampling, obtaining a second facial optical flow through a subsequent convolution layer.
Optionally, the number of channels of each convolutional layer of the second convolutional neural network is a preset number, where the preset number represents a lower limit number of channels corresponding to each convolutional layer when the convolutional layer is in operation.
Optionally, the posture driving loss function comprises a cycle consistency loss function, wherein the cycle consistency loss function is expressed as:
L_cycle = ‖x_i - x_c‖_1
where x_i denotes the third initial virtual object, and x_c denotes the image obtained after the posture driving model changes the posture according to the posture parameters and then changes it again according to the inverted posture parameters.
According to a second aspect of the embodiments of the present disclosure, there is provided an apparatus for processing a virtual object, including: an acquisition unit configured to acquire a virtual object to be processed, expression parameters and posture parameters, wherein the expression parameters are obtained by extracting a real facial expression, and the posture parameters are obtained by extracting a real facial posture; an expression driving unit configured to input the virtual object and the expression parameters into a pre-trained expression driving model, and perform expression conversion on the virtual object through the expression parameters to obtain a first target virtual object; and a posture driving unit configured to input the first target virtual object and the posture parameters into a pre-trained posture driving model, and perform posture adjustment on the first target virtual object through the posture parameters to obtain a second target virtual object.
Optionally, the expression driving model is generated in the following manner: acquiring a first training sample, wherein the first training sample comprises a first initial virtual object without a target expression, expression parameters representing the target expression and a second initial virtual object with the target expression; obtaining a first facial optical flow and a first image feature through a first convolutional neural network based on the first initial virtual object and the expression parameters, wherein the first facial optical flow represents the pixel movement caused when the target expression acts on the first initial virtual object; performing grid sampling on the first initial virtual object based on the first facial optical flow to obtain a first pre-estimated image simulating the target expression; correcting the first pre-estimated image based on the first image feature to obtain a second pre-estimated image simulating the target expression; calculating a value of an expression driving loss function based on the second pre-estimated image and the second initial virtual object; and updating parameters of the first convolutional neural network based on the value of the expression driving loss function to train the expression driving model.
Optionally, the second initial virtual object includes a first type image and a second type image, the first type image is obtained by inputting the first initial virtual object and the expression parameter into a first image model, the second type image is obtained by inputting the first initial virtual object and the expression parameter into a second image model, and the first type image and the second type image have different facial cleanliness.
Optionally, the expression driving unit is configured to: cut out an image within a preset range from the central position of the first initial virtual object, wherein the preset range is determined according to the range of pixel movement caused when the target expression acts on the first initial virtual object; and the expression driving model is configured to: obtain a first facial optical flow and a first image feature through the first convolutional neural network based on the cut-out image within the preset range and the expression parameters.
Optionally, the first convolutional neural network comprises a plurality of convolutional layers, the expression-driven model is configured to: and performing downsampling on a first convolution layer of the first convolution neural network based on the first initial virtual object and the expression parameters, and obtaining a first facial optical flow and a first image feature through a subsequent convolution layer based on data obtained by downsampling.
Optionally, the number of channels of each convolutional layer of the first convolutional neural network is a preset number, where the preset number represents a lower limit number of channels corresponding to each convolutional layer when the convolutional layer operates.
Optionally, the generating manner of the posture driving model includes: acquiring a second training sample, wherein the second training sample comprises a third initial virtual object without a target posture, posture parameters representing the target posture and a fourth initial virtual object with the target posture, and the third initial virtual object and the fourth initial virtual object have the same expression; obtaining a second facial optical flow through a second convolutional neural network based on the third initial virtual object and the posture parameters, wherein the second facial optical flow represents the pixel movement caused when the target posture acts on the third initial virtual object; performing grid sampling on the third initial virtual object based on the second facial optical flow to obtain a pre-estimated virtual object simulating the target posture; calculating a value of a posture driving loss function based on the pre-estimated virtual object simulating the target posture and the fourth initial virtual object; and updating the parameters of the second convolutional neural network based on the value of the posture driving loss function to obtain a trained posture driving model.
Optionally, the fourth initial virtual object includes a third type image and a fourth type image, the third type image is obtained by inputting the third initial virtual object and the posture parameter into a third image model, the fourth type image is obtained by inputting the third initial virtual object and the posture parameter into a fourth image model, and the third type image and the fourth type image have different face cleanliness degrees.
Optionally, the second convolutional neural network comprises a plurality of convolutional layers, the attitude-driven model is configured to: and based on the third initial virtual object and the attitude parameters, performing down-sampling on a first convolution layer of the second convolution neural network, and based on data obtained by the down-sampling, obtaining a second facial optical flow through a subsequent convolution layer.
Optionally, the number of channels of each convolutional layer of the second convolutional neural network is a preset number, where the preset number represents a lower limit number of channels corresponding to each convolutional layer when the convolutional layer is in operation.
Optionally, the posture driving loss function comprises a cycle consistency loss function, wherein the cycle consistency loss function is expressed as:
L_cycle = ‖x_i - x_c‖_1
where x_i denotes the third initial virtual object, and x_c denotes the image obtained after the posture driving model changes the posture according to the posture parameters and then changes it again according to the inverted posture parameters.
According to a third aspect of the embodiments of the present disclosure, there is provided an electronic apparatus including: at least one processor; at least one memory storing computer-executable instructions, wherein the computer-executable instructions, when executed by the at least one processor, cause the at least one processor to perform a method of processing a virtual object according to the present disclosure.
According to a fourth aspect of embodiments of the present disclosure, there is provided a computer-readable storage medium storing instructions that, when executed by at least one processor, cause the at least one processor to perform a method of processing a virtual object according to the present disclosure.
According to a fifth aspect of embodiments of the present disclosure, there is provided a computer program product, instructions in which are executable by a processor of a computer device to perform a method of processing a virtual object according to the present disclosure.
The technical scheme provided by the embodiment of the disclosure at least brings the following beneficial effects:
According to the virtual object processing method and device of the present disclosure, based on the expression parameters extracted from the real facial expression and the posture parameters extracted from the real facial posture, the expression driving model and the posture driving model enable the virtual object to simulate the real facial expression and the real facial posture, thereby driving the virtual object. Because the driving is realized only through the expression driving model and the posture driving model, the amount of computation can be greatly reduced, the operation efficiency is improved, and real-time operation on a GPU becomes possible.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the disclosure and are not to be construed as limiting the disclosure.
Fig. 1 is a schematic view illustrating an overall process of processing a virtual object according to an exemplary embodiment of the present disclosure.
Fig. 2 is a flowchart illustrating a processing method of a virtual object according to an exemplary embodiment of the present disclosure.
Fig. 3 is a schematic diagram illustrating an overall framework of an expression driving model according to an exemplary embodiment of the present disclosure.
Fig. 4 is a schematic diagram illustrating an overall framework of a posture-driven model according to an exemplary embodiment of the present disclosure.
Fig. 5 is a block diagram illustrating a processing apparatus of a virtual object according to an exemplary embodiment of the present disclosure.
Fig. 6 is a block diagram illustrating an electronic device 600 according to an example embodiment of the present disclosure.
Detailed Description
In order to make the technical solutions of the present disclosure better understood by those of ordinary skill in the art, the technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the disclosure described herein are capable of operation in sequences other than those illustrated or otherwise described herein. The embodiments described in the following examples do not represent all embodiments consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
It should be noted that the expression "at least one of the items" in the present disclosure covers three parallel cases: "any one of the items", "a combination of any plurality of the items" and "all of the items". For example, "including at least one of A and B" covers the following three parallel cases: (1) including A; (2) including B; (3) including A and B. For another example, "performing at least one of step one and step two" covers the following three parallel cases: (1) performing step one; (2) performing step two; (3) performing step one and step two.
Currently, motion migration techniques are used to drive virtual objects (e.g., two-dimensional anime-style characters, cartoon characters or cartoon animals) to change their expressions and poses according to instructions. Motion migration methods can be divided into those that use 3D techniques and those that use only 2D techniques.
The 3D methods generally use a 3D Morphable Model (3DMM) to control the expression of a person and the movement of facial organs, and use a 3DMM obtained by face scanning to implement motion migration. For example, one technical scheme uses a generative adversarial network to generate a texture map for the 3DMM so as to improve the realism of the driving, but that scheme drives a real person rather than a virtual character.
The 2D methods treat motion migration as an image translation task, i.e., translating an input image into a target image of a specified modality. For example, one technical scheme implements motion migration by predicting the apparent optical flow of a person and sampling the original image accordingly; another scheme predicts key points of a person in an unsupervised manner and then describes motion migration as local affine transformations centered on those key points. A further scheme realizes expression and posture driving of a two-dimensional virtual character by using a large number of 3D character models and decomposing the motion migration into changes of a plurality of organs. Specifically, this scheme drives the two-dimensional virtual character through expression instructions and posture instructions: the expression driving covers changes of eyebrow shape and position, opening and closing of the eyes, movement and contraction of the eyeballs, opening and closing of the mouth, and the mouth shape; the posture driving covers rotation of the character's head in the three directions of pitch, yaw and roll. The driving effect of this scheme is achieved through a deep neural network comprising five modules: eyebrow segmentation, eyebrow deformation, eye and mouth deformation, two head rotations, and head-rotation result fusion. However, because serial input-output dependencies exist among the five modules, the pre- and post-processing logic is complex, algorithm debugging is difficult, the theoretical amount of computation is very high (230G in total for the five modules), the operation efficiency is low, and real-time operation on a GPU is difficult. In addition, this scheme only supports driving a two-dimensional virtual character at a resolution of 256, so the overall definition is low; for example, a certain degree of blurring appears when the character's head rotates.
In order to solve the problem of low driving efficiency of virtual objects and to improve definition, the present disclosure provides a virtual object processing method and apparatus. Specifically, based on expression parameters extracted from a real facial expression and posture parameters extracted from a real facial posture, an expression driving model and a posture driving model enable the virtual object to simulate the real facial expression and the real facial posture, thereby driving the virtual object. Because the driving is realized only through the expression driving model and the posture driving model, the amount of computation can be greatly reduced and the operation efficiency improved, so that real-time operation on a GPU becomes possible. In addition, early downsampling and channel pruning are applied to both the expression driving model and the posture driving model, and attention cropping is applied to the input image of the expression driving model to reduce the input size and focus on the region changed by the expression, which further reduces the amount of computation and improves the operation efficiency. Moreover, a cycle consistency loss function is added when training the posture driving model, which significantly weakens the driving flaws produced when posture driving is performed on the virtual object and improves the definition and cleanliness of the image. Hereinafter, a method and apparatus for processing a virtual object according to exemplary embodiments of the present disclosure will be described in detail with reference to fig. 1 to 6.
Fig. 1 is a schematic view illustrating an overall process of processing a virtual object according to an exemplary embodiment of the present disclosure. Here, the virtual object is an avatar distinguished from a real person, and may be, for example, a two-dimensional avatar, a cartoon character, a cartoon animal, or the like.
Referring to fig. 1, the virtual object is a two-dimensional character. In some application scenarios, for example during a live stream, the two-dimensional character can be used in place of the anchor's own image, and to ensure the live-stream effect, the expression and posture of the character need to be synchronized in real time with the anchor's real expression and posture. To this end, the anchor's expression information and posture information can be obtained through face models; in fig. 1, the expression parameters and the posture parameters of the anchor's real face are extracted through an Animoji model and a 3DMM model, respectively. The two-dimensional character to be driven and the extracted expression parameters are then input into the expression driving model to obtain a character that simulates the real facial expression, and this character and the extracted posture parameters are then input into the posture driving model to obtain a character that simulates both the real facial expression and the real facial posture. Because the driving of the virtual object is realized only through the expression driving model and the posture driving model, the amount of computation can be greatly reduced, the operation efficiency is improved, and real-time operation on a GPU becomes possible.
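As an illustration only, the two-stage flow above might be sketched in Python roughly as follows; the function names, model interfaces and parameter extractors here are assumptions for illustration and are not details taken from the disclosure.

def drive_virtual_object(virtual_object, real_face_frame,
                         expression_model, pose_model,
                         extract_expression_params, extract_pose_params):
    # Step 201: extract expression and posture parameters from the real face,
    # e.g., with an Animoji-style extractor and a 3DMM fit as in fig. 1.
    expr_params = extract_expression_params(real_face_frame)
    pose_params = extract_pose_params(real_face_frame)
    # Step 202: expression conversion through the expression driving model.
    first_target = expression_model(virtual_object, expr_params)
    # Step 203: posture adjustment through the posture driving model.
    second_target = pose_model(first_target, pose_params)
    return second_target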
Fig. 2 is a flowchart illustrating a processing method of a virtual object according to an exemplary embodiment of the present disclosure.
Referring to fig. 2, in step 201, a virtual object to be processed, expression parameters and posture parameters may be obtained, where the expression parameters are obtained by extracting real facial expressions, and the posture parameters are obtained by extracting real facial postures. Here, the real facial expression and the real facial pose may be extracted through the face model to obtain the expression parameters and the pose parameters.
In step 202, the virtual object and the expression parameters may be input into a pre-trained expression driving model, so as to perform expression conversion on the virtual object through the expression parameters, thereby obtaining a first target virtual object.
According to an exemplary embodiment of the present disclosure, the virtual object and the expression parameters may be input into a pre-trained expression driving model to obtain a first target virtual object that simulates the real facial expression. Specifically, the expression driving model is generated in the following manner. First, a first training sample is acquired, wherein the first training sample includes a first initial virtual object without a target expression, expression parameters representing the target expression, and a second initial virtual object with the target expression. Here, the first initial virtual object may be downloaded from a dedicated virtual object database (for example, but not limited to, The animal Gallery website), and the target expression may be an artificially specified expression to be displayed by the virtual object, such as "happy", "sad" or "angry"; each target expression has corresponding expression parameters (for example, the degree to which the eyes open, the eyebrows move or the mouth opens). The second initial virtual object with the target expression serves as the label during training. According to an exemplary embodiment of the present disclosure, the second initial virtual object may include a first type image and a second type image, wherein the first type image is obtained by inputting the first initial virtual object and the expression parameters into a first image model, the second type image is obtained by inputting the first initial virtual object and the expression parameters into a second image model, and the two types of images have different degrees of facial cleanliness. Here, the first image model is, for example, but not limited to, a 3D rendering model; the second initial virtual object obtained through this model is accurate and clean, which reduces the probability that the image output by the expression driving model is defective. The second image model is, for example, but not limited to, a talking-head-animal model; the second initial virtual object obtained through this model has certain flaws (for example, black spots on the face of the virtual object), so an expression driving model trained with the second type image generalizes well. During training, the first type image or the second type image is randomly selected as the label, so that the trained expression driving model outputs expression-driven virtual objects with few flaws while still generalizing well.
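For illustration, one way such a first training sample with mixed labels might be assembled is sketched below; the renderer interfaces, the 50% mixing probability and the PyTorch-style tensor types are assumptions rather than details of the disclosure.

import random
import torch
from dataclasses import dataclass

@dataclass
class ExpressionSample:
    neutral_image: torch.Tensor       # first initial virtual object (no target expression)
    expression_params: torch.Tensor   # expression parameters representing the target expression
    label_image: torch.Tensor         # second initial virtual object (with the target expression)

def build_expression_sample(neutral_image, expression_params,
                            clean_renderer, flawed_renderer, clean_prob=0.5):
    # First-type label: accurate and clean (e.g., produced by a 3D rendering model).
    # Second-type label: contains facial flaws, which improves generalization.
    if random.random() < clean_prob:
        label = clean_renderer(neutral_image, expression_params)
    else:
        label = flawed_renderer(neutral_image, expression_params)
    return ExpressionSample(neutral_image, expression_params, label)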
Returning to the training process of the expression driving model: after the first training sample is obtained, a first facial optical flow and a plurality of first image features may be obtained through the first convolutional neural network based on the first initial virtual object and the expression parameters, where the first facial optical flow characterizes the pixel movement caused when the target expression acts on the first initial virtual object and each first image feature reflects a feature of the first initial virtual object. For example, fig. 3 is a schematic diagram showing the overall framework of an expression driving model according to an exemplary embodiment of the present disclosure. Referring to fig. 3, the first facial optical flow and the first image features are obtained by inputting a two-dimensional character image without an expression, together with the expression parameters, into the first convolutional neural network; fig. 3 shows four image features as an example, but more features may be acquired according to the actual situation, and this is not limited. According to an exemplary embodiment of the present disclosure, in order to reduce the amount of computation of the expression driving model, an image within a preset range may be cut out from the center position of the first initial virtual object, and the first facial optical flow and the first image features may then be obtained through the first convolutional neural network based on the cut-out image and the expression parameters. Here, the preset range may be determined according to the range of pixel movement caused when the target expression acts on the first initial virtual object; in some embodiments, the preset range may be a 128 × 128 image region sufficient to cover the whole area changed when the target expression acts on the first initial virtual object.
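A minimal sketch of this attention cropping, assuming the crop is a 128 × 128 window centered on the image and that images are channel-first tensors; the helper name is hypothetical.

def crop_attention_region(image, crop_size=128):
    # image: tensor of shape (C, H, W); crop a square window around the center
    # that is large enough to cover the region changed by the target expression.
    _, h, w = image.shape
    top = max(h // 2 - crop_size // 2, 0)
    left = max(w // 2 - crop_size // 2, 0)
    return image[:, top:top + crop_size, left:left + crop_size]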
According to an exemplary embodiment of the present disclosure, the first convolutional neural network includes a plurality of convolutional layers. Likewise to reduce the amount of computation of the expression driving model, downsampling may be performed at the first convolutional layer of the first convolutional neural network based on the first initial virtual object and the expression parameters, and the first facial optical flow and the first image features may be obtained through the subsequent convolutional layers based on the downsampled data. This lowers the resolution of the feature maps throughout the network and significantly reduces the amount of computation of the expression driving model. In addition, according to an exemplary embodiment of the present disclosure, the number of channels of each convolutional layer of the first convolutional neural network is a preset number, which represents the lower-limit number of channels for that layer (i.e., the minimum number of channels that allows each convolutional layer to operate normally), further reducing the amount of computation of the expression driving model.
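The early downsampling and pruned channel counts can be illustrated with the rough sketch below; the channel numbers, kernel sizes, the two output heads and the way the expression parameters are broadcast into input channels are all assumptions chosen only to convey the idea.

import torch
import torch.nn as nn

class LightFlowEncoder(nn.Module):
    # Illustrative first convolutional neural network: the very first layer uses
    # stride 2 (early downsampling) and every layer keeps a small channel count
    # (channel pruning). All numbers here are illustrative, not from the disclosure.
    def __init__(self, num_expr_params=10, channels=(16, 32, 64)):
        super().__init__()
        layers, c_prev = [], 3 + num_expr_params
        for i, c in enumerate(channels):
            stride = 2 if i == 0 else 1          # downsample at the first layer only
            layers += [nn.Conv2d(c_prev, c, 3, stride=stride, padding=1),
                       nn.ReLU(inplace=True)]
            c_prev = c
        self.backbone = nn.Sequential(*layers)
        self.flow_head = nn.Conv2d(c_prev, 2, 3, padding=1)   # facial optical flow (dx, dy)
        self.feat_head = nn.Conv2d(c_prev, 8, 3, padding=1)   # image features

    def forward(self, image, expr_params):
        # image: (B, 3, H, W); expr_params: (B, num_expr_params).
        b, _, h, w = image.shape
        expr_map = expr_params[:, :, None, None].expand(-1, -1, h, w)
        x = self.backbone(torch.cat([image, expr_map], dim=1))
        # Outputs are at half resolution because of the early downsampling.
        return self.flow_head(x), self.feat_head(x)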
Returning to the training process of the expression driving model: the first initial virtual object may be grid-sampled based on the obtained first facial optical flow to obtain a first pre-estimated image simulating the target expression. Here, the grid sampling may be implemented by the first grid sampling module shown in fig. 3 (e.g., a grid_sample() function). The resulting first pre-estimated image may have flaws (for example, marks such as black spots or scratches at some positions of the face, or no mouth interior appearing when a closed mouth is driven to open); therefore, the first pre-estimated image may be further corrected based on the first image features to obtain a second pre-estimated image simulating the target expression. Here, the first pre-estimated image may be corrected by two alpha corrections. Specifically, referring to fig. 3, a first image feature (referred to as feature A, for example) is first fused with the first pre-estimated image, and another first image feature (a segmentation map obtained after feature A is fused with the contour of the first initial virtual object) is used as a reference in the fusion process to perform the first alpha correction; the second alpha correction is similar and can be understood with reference to fig. 3, so it is not repeated here. Next, the value of an expression driving loss function may be calculated based on the second pre-estimated image and the second initial virtual object, and the parameters of the first convolutional neural network may be updated based on this value to obtain the trained expression driving model. The expression driving loss function may include three components: a generative adversarial loss, a pixel reconstruction loss and a perceptual loss.
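For illustration, warping an image with a predicted facial optical flow via grid_sample and applying one alpha-correction step might look roughly as follows; the flow normalization and the single blending step shown here are simplifying assumptions (the description above uses two alpha corrections guided by the first image features).

import torch
import torch.nn.functional as F

def warp_with_optical_flow(image, flow):
    # image: (B, C, H, W); flow: (B, 2, H, W) pixel offsets (dx, dy).
    b, _, h, w = image.shape
    ys, xs = torch.meshgrid(torch.linspace(-1.0, 1.0, h, device=image.device),
                            torch.linspace(-1.0, 1.0, w, device=image.device),
                            indexing="ij")
    base_grid = torch.stack((xs, ys), dim=-1).unsqueeze(0).expand(b, h, w, 2)
    # Convert pixel offsets to the normalized [-1, 1] offsets grid_sample expects.
    offsets = torch.stack((flow[:, 0] * 2.0 / max(w - 1, 1),
                           flow[:, 1] * 2.0 / max(h - 1, 1)), dim=-1)
    return F.grid_sample(image, base_grid + offsets, mode="bilinear", align_corners=True)

def alpha_correct(warped, correction, alpha):
    # One alpha-correction step: blend the warped estimate with a correction image
    # under a per-pixel alpha map in [0, 1].
    return alpha * correction + (1.0 - alpha) * warped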
Referring back to fig. 2, in step 203, the first target virtual object and the posture parameters may be input into a pre-trained posture driving model to perform posture adjustment on the first target virtual object through the posture parameters, so as to obtain a second target virtual object.
According to an exemplary embodiment of the present disclosure, the first target virtual object simulating the real facial expression and the posture parameters may be input into a pre-trained posture driving model to obtain a second target virtual object, i.e., a virtual object that simulates both the real facial expression and the real facial posture. The posture driving model is generated in the following manner. First, a second training sample may be acquired, wherein the second training sample includes a third initial virtual object without a target posture, posture parameters representing the target posture, and a fourth initial virtual object with the target posture, and the third initial virtual object and the fourth initial virtual object have the same expression. Here, the third initial virtual object may be downloaded from a dedicated virtual object database (for example, but not limited to, The animal Gallery website), and the target posture may be an artificially specified posture to be displayed by the virtual object, such as "head up", "head down" or "turn left"; each target posture has corresponding posture parameters (for example, for "head up", the posture parameter corresponds to the pitch angle of the raised head). The fourth initial virtual object serves as the label during training. According to an exemplary embodiment of the present disclosure, the fourth initial virtual object may include a third type image and a fourth type image, wherein the third type image may be obtained by inputting the third initial virtual object and the posture parameters into a third image model, the fourth type image may be obtained by inputting the third initial virtual object and the posture parameters into a fourth image model, and the two types of images have different degrees of facial cleanliness. Similarly to the expression driving model, the third type image is accurate and clean, which reduces the probability that the image obtained through the posture driving model is defective, while the fourth type image has certain flaws (for example, scratch-like marks on the face of the virtual object), so a posture driving model trained with it generalizes well. By mixing the third type image and the fourth type image as labels during model training, the trained posture driving model generalizes well and outputs virtual objects with few flaws.
Returning to the training process of the posture driving model: after the second training sample is obtained, a second facial optical flow may be obtained through the second convolutional neural network based on the third initial virtual object and the posture parameters, where the second facial optical flow characterizes the pixel movement caused when the target posture acts on the third initial virtual object. For example, fig. 4 is a schematic diagram showing the overall framework of a posture driving model according to an exemplary embodiment of the present disclosure; referring to fig. 4, the second facial optical flow is obtained by inputting a two-dimensional character that has an expression but not the target posture, together with the posture parameters, into the second convolutional neural network. According to an exemplary embodiment of the present disclosure, the second convolutional neural network includes a plurality of convolutional layers, and in order to reduce the amount of computation of the posture driving model, downsampling may be performed at the first convolutional layer of the second convolutional neural network based on the third initial virtual object and the posture parameters to lower the resolution of the image, and the second facial optical flow is then obtained through the subsequent convolutional layers based on the downsampled data, which significantly reduces the amount of computation of the posture driving model. In addition, according to an exemplary embodiment of the present disclosure, the number of channels of each convolutional layer of the second convolutional neural network is a preset number, which represents the lower-limit number of channels for that layer (i.e., the minimum number of channels that allows each convolutional layer to operate normally), further reducing the amount of computation of the posture driving model.
Returning to the training process of the posture driving model: the third initial virtual object may be grid-sampled based on the second facial optical flow to obtain a pre-estimated virtual object simulating the target posture; referring to fig. 4, this may be performed, for example, by the second grid sampling module shown in fig. 4, and a grid_sample() function, for example but not limited to, may be used to implement the grid sampling. Next, the value of a posture driving loss function may be calculated based on the pre-estimated virtual object simulating the target posture and the fourth initial virtual object, and the parameters of the second convolutional neural network may be updated based on this value to obtain the trained posture driving model. The posture driving loss function likewise includes a generative adversarial loss, a pixel reconstruction loss and a perceptual loss; the only difference from the expression driving loss function is the objects these functions operate on, and since these three losses are commonly used in virtual object driving, they are not described in detail here. In addition, according to an exemplary embodiment of the present disclosure, in order to alleviate the picture blurring that occurs when the virtual object changes posture (for example, the blurring caused when the character's head rotates) and to improve the definition and cleanliness of the driving result, a cycle consistency loss function is introduced to impose a posture consistency constraint on the posture driving model. Specifically, for a two-dimensional virtual character, it is expected that after the posture is changed according to certain posture parameters, inverting those posture parameters and applying them to the posture-changed character should yield a result consistent with the original character whose posture was never changed. Through this posture constraint, the blurring that may exist in the output of the posture driving model is significantly reduced and the definition is improved. The cycle consistency loss function may, for example but not limited to, be expressed as:
L_cycle = ‖x_i - x_c‖_1
where x_i denotes the third initial virtual object, and x_c denotes the image obtained after the posture driving model changes the posture according to the posture parameters and then changes it again according to the inverted posture parameters.
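A rough sketch of this cycle consistency constraint, assuming the posture driving model is callable as a function of an image and posture parameters and that inverting the parameters is simple negation:

import torch

def cycle_consistency_loss(pose_model, x_i, pose_params):
    # Drive the posture forward with the given parameters, drive it back with the
    # inverted parameters, and penalize the L1 difference from the original image.
    x_driven = pose_model(x_i, pose_params)       # posture changed according to the parameters
    x_c = pose_model(x_driven, -pose_params)      # posture changed back with inverted parameters
    return torch.mean(torch.abs(x_i - x_c))       # L_cycle = ||x_i - x_c||_1 (pixel-averaged)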
Fig. 5 is a block diagram illustrating a processing apparatus of a virtual object according to an exemplary embodiment of the present disclosure.
Referring to fig. 5, a processing apparatus 500 of a virtual object according to an exemplary embodiment of the present disclosure may include an acquisition unit 501, an expression driving unit 502, and a gesture driving unit 503.
The obtaining unit 501 may obtain a virtual object to be processed, expression parameters and posture parameters, where the expression parameters are obtained by extracting real facial expressions, and the posture parameters are obtained by extracting real facial postures; the expression driving unit 502 may input the virtual object and the expression parameters into a pre-trained expression driving model, so as to perform expression conversion on the virtual object through the expression parameters to obtain a first target virtual object; the posture driving unit 503 may input the first target virtual object and the posture parameters to a posture driving model trained in advance, so as to perform posture adjustment on the first target virtual object through the posture parameters, thereby obtaining a second target virtual object.
Since the processing method of the virtual object shown in fig. 2 can be executed by the processing apparatus 500 of the virtual object shown in fig. 5, and the obtaining unit 501, the expression driving unit 502, and the gesture driving unit 503 can respectively execute the operations corresponding to step 201, step 202, and step 203 in fig. 2, any relevant details related to the operations executed by the units in fig. 5 can be referred to the corresponding description of fig. 2, and are not repeated here.
Fig. 6 is a block diagram illustrating an electronic device 600 according to an example embodiment of the present disclosure.
Referring to fig. 6, the electronic device 600 includes at least one memory 601 and at least one processor 602, the at least one memory 601 having stored therein a set of computer-executable instructions that, when executed by the at least one processor 602, perform a method of processing a virtual object according to an exemplary embodiment of the present disclosure.
By way of example, the electronic device 600 may be a PC, a tablet device, a personal digital assistant, a smartphone, or another device capable of executing the above set of instructions. Here, the electronic device 600 need not be a single electronic device but can be any arrangement or collection of circuits capable of executing the above instructions (or instruction sets), either individually or in combination. The electronic device 600 may also be part of an integrated control system or system manager, or may be configured as a portable electronic device that interfaces with local or remote devices (e.g., via wireless transmission).
In the electronic device 600, the processor 602 may include a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a programmable logic device, a special purpose processor system, a microcontroller, or a microprocessor. By way of example, and not limitation, processors may also include analog processors, digital processors, microprocessors, multi-core processors, processor arrays, network processors, and the like.
The processor 602 may execute instructions or code stored in the memory 601, wherein the memory 601 may also store data. The instructions and data may also be transmitted or received over a network via a network interface device, which may employ any known transmission protocol.
The memory 601 may be integrated with the processor 602, for example, with RAM or flash memory disposed within an integrated circuit microprocessor or the like. Further, memory 601 may comprise a stand-alone device, such as an external disk drive, storage array, or any other storage device usable by a database system. The memory 601 and the processor 602 may be operatively coupled or may communicate with each other, e.g., through I/O ports, network connections, etc., such that the processor 602 can read files stored in the memory.
Further, the electronic device 600 may also include a video display (such as a liquid crystal display) and a user interaction interface (such as a keyboard, mouse, touch input device, etc.). All components of the electronic device 600 may be connected to each other via a bus and/or a network.
According to an exemplary embodiment of the present disclosure, there may also be provided a computer-readable storage medium storing instructions that, when executed by at least one processor, cause the at least one processor to perform the method of processing a virtual object according to the present disclosure. Examples of the computer-readable storage medium here include: read-only memory (ROM), random-access programmable read-only memory (PROM), electrically erasable programmable read-only memory (EEPROM), random-access memory (RAM), dynamic random-access memory (DRAM), static random-access memory (SRAM), flash memory, non-volatile memory, CD-ROM, CD-R, CD+R, CD-RW, CD+RW, DVD-ROM, DVD-R, DVD+R, DVD-RW, DVD+RW, DVD-RAM, BD-ROM, BD-R, BD-R LTH, BD-RE, Blu-ray or optical disc storage, a hard disk drive (HDD), a solid-state drive (SSD), card-type memory (such as a multimedia card, a Secure Digital (SD) card or an eXtreme Digital (XD) card), magnetic tape, a floppy disk, a magneto-optical data storage device, an optical data storage device, a hard disk, a solid-state disk, and any other device configured to store a computer program and any associated data, data files and data structures in a non-transitory manner and to provide them to a processor or computer so that the processor or computer can execute the computer program. The computer program in the above computer-readable storage medium can run in an environment deployed in computer equipment such as a client, a host, a proxy device or a server; further, in one example, the computer program and any associated data, data files and data structures are distributed across a networked computer system such that they are stored, accessed and executed in a distributed fashion by one or more processors or computers.
According to an exemplary embodiment of the present disclosure, there may also be provided a computer program product in which instructions are executable by a processor of a computer device to perform a method of processing a virtual object according to an exemplary embodiment of the present disclosure.
According to the virtual object processing method and apparatus of the present disclosure, based on expression parameters extracted from a real facial expression and posture parameters extracted from a real facial posture, the expression driving model and the posture driving model enable the virtual object to simulate the real facial expression and the real facial posture, thereby driving the virtual object. Because the driving is realized only through the expression driving model and the posture driving model, the amount of computation can be greatly reduced and the operation efficiency improved, so that real-time operation on a GPU becomes possible. In addition, early downsampling and channel pruning are applied to both the expression driving model and the posture driving model, and attention cropping is applied to the input image of the expression driving model to reduce the input size and focus on the region changed by the expression, which further reduces the amount of computation and improves the operation efficiency. Moreover, a cycle consistency loss function is added when training the posture driving model, which significantly weakens the driving flaws produced when posture driving is performed on the virtual object and improves the definition and cleanliness of the image.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (10)

1. A method for processing a virtual object, comprising:
acquiring a virtual object to be processed, expression parameters and posture parameters, wherein the expression parameters are obtained by extracting real facial expressions, and the posture parameters are obtained by extracting real facial postures;
inputting the virtual object and the expression parameters into a pre-trained expression driving model, and performing expression conversion on the virtual object through the expression parameters to obtain a first target virtual object;
and inputting the first target virtual object and the posture parameters into a pre-trained posture driving model, and carrying out posture adjustment on the first target virtual object through the posture parameters to obtain a second target virtual object.
2. The method for processing the virtual object according to claim 1, wherein the expression driving model is generated in a manner comprising:
acquiring a first training sample, wherein the first training sample comprises a first initial virtual object without a target expression, expression parameters representing the target expression, and a second initial virtual object having the target expression;
obtaining a first facial optical flow and a first image feature through a first convolutional neural network based on the first initial virtual object and the expression parameters, wherein the first facial optical flow represents the pixel movement caused when the target expression acts on the first initial virtual object;
performing grid sampling on the first initial virtual object based on the first facial optical flow to obtain a first estimated image simulating the target expression;
correcting the first estimated image based on the first image feature to obtain a second estimated image simulating the target expression;
calculating a value of an expression driving loss function based on the second estimated image and the second initial virtual object;
and updating parameters of the first convolutional neural network based on the value of the expression driving loss function to obtain a trained expression driving model.
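As an informal illustration of the training procedure of claim 2, one training step could look like the sketch below in Python/PyTorch. The network interfaces, tensor shapes, the choice of an L1 loss, and the assumption that the predicted flow is already normalized to grid_sample's [-1, 1] coordinates are all assumptions made for this sketch, not details taken from the claim:

import torch
import torch.nn.functional as F

def identity_grid(n, h, w, device):
    # Base sampling grid in [-1, 1] with shape (n, h, w, 2), as expected by grid_sample.
    ys, xs = torch.meshgrid(
        torch.linspace(-1, 1, h, device=device),
        torch.linspace(-1, 1, w, device=device),
        indexing="ij",
    )
    return torch.stack((xs, ys), dim=-1).unsqueeze(0).expand(n, -1, -1, -1)

def expression_train_step(flow_net, refine_net, optimizer,
                          first_initial, expression_params, second_initial):
    n, _, h, w = first_initial.shape
    # The CNN predicts a dense facial optical flow (pixel movement caused by the
    # target expression) together with intermediate image features.
    flow, features = flow_net(first_initial, expression_params)  # flow: (n, 2, h, w), normalized
    grid = identity_grid(n, h, w, first_initial.device) + flow.permute(0, 2, 3, 1)
    # Grid-sample the initial object with the flow -> first estimated image.
    first_estimate = F.grid_sample(first_initial, grid, align_corners=False)
    # Correct the warped estimate using the image features -> second estimated image.
    second_estimate = refine_net(first_estimate, features)
    # Expression driving loss against the ground-truth expressive object (L1 assumed here).
    loss = F.l1_loss(second_estimate, second_initial)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()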
3. The method for processing a virtual object according to claim 2, wherein
the second initial virtual object comprises a first type image and a second type image, the first type image is obtained by inputting the first initial virtual object and the expression parameters into a first image model, the second type image is obtained by inputting the first initial virtual object and the expression parameters into a second image model, and the first type image and the second type image have different degrees of facial cleanliness.
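Purely as an illustration of how such a training pair might be assembled (the image models are passed in as opaque callables; all names here are hypothetical), the two ground-truth types of claim 3 could be produced as follows:

def build_training_samples(first_initial, expression_params, image_model_a, image_model_b):
    # Two ground-truth targets produced by two different image models that differ
    # in facial cleanliness; both serve as "second initial virtual objects".
    target_a = image_model_a(first_initial, expression_params)  # first type image
    target_b = image_model_b(first_initial, expression_params)  # second type image
    return [(first_initial, expression_params, target_a),
            (first_initial, expression_params, target_b)]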
4. The method for processing a virtual object according to claim 2, further comprising:
cropping an image within a preset range around the center position of the first initial virtual object, wherein the preset range is determined according to the range of pixel movement caused when the target expression acts on the first initial virtual object;
wherein the obtaining of the first facial optical flow and the first image feature through the first convolutional neural network based on the first initial virtual object and the expression parameters comprises:
obtaining the first facial optical flow and the first image feature through the first convolutional neural network based on the cropped image within the preset range and the expression parameters.
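For illustration, the attention cropping of claim 4 (feeding only a preset window around the object's center, sized to cover the expression-induced pixel movement) might be sketched as follows; the crop size is a hypothetical hyper-parameter and is assumed not to exceed the image size:

import torch

def attention_crop(image: torch.Tensor, crop_h: int, crop_w: int) -> torch.Tensor:
    # image: (n, c, h, w); the preset range is centered on the virtual object's center.
    _, _, h, w = image.shape
    top = (h - crop_h) // 2
    left = (w - crop_w) // 2
    return image[:, :, top:top + crop_h, left:left + crop_w]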
5. The method for processing the virtual object according to claim 2, wherein the first convolutional neural network comprises a plurality of convolutional layers, and the obtaining of the first facial optical flow and the first image feature through the first convolutional neural network based on the first initial virtual object and the expression parameters comprises:
performing downsampling at a first convolutional layer of the first convolutional neural network based on the first initial virtual object and the expression parameters, and obtaining the first facial optical flow and the first image feature through the subsequent convolutional layers based on the data obtained by the downsampling.
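A minimal sketch of the front-layer downsampling of claim 5, assuming hypothetical channel counts and a stride-2 first convolution:

import torch.nn as nn

# The first convolutional layer halves the spatial resolution, so every subsequent
# layer works on a smaller feature map and the overall amount of computation drops.
first_conv = nn.Conv2d(in_channels=3, out_channels=32,
                       kernel_size=3, stride=2, padding=1)
# The subsequent convolutional layers would then derive the facial optical flow
# and the image features from the downsampled activations.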
6. The method for processing the virtual object according to claim 2, wherein the number of channels of each convolutional layer of the first convolutional neural network is a preset number, and the preset number represents the lower limit on the number of channels with which the corresponding convolutional layer can run.
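For illustration of the channel pruning of claim 6 (the concrete channel numbers below are assumptions, not values from the patent), each convolutional layer is built with a preset, reduced channel count instead of a wider default:

import torch.nn as nn

default_channels = [64, 128, 256]  # hypothetical unpruned widths
pruned_channels = [16, 32, 64]     # preset lower-bound widths after pruning

def make_backbone(channels, in_ch=3):
    # Stack stride-2 conv blocks with the given per-layer channel counts.
    layers = []
    for out_ch in channels:
        layers += [nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2, padding=1),
                   nn.ReLU(inplace=True)]
        in_ch = out_ch
    return nn.Sequential(*layers)

pruned_backbone = make_backbone(pruned_channels)  # fewer channels -> fewer FLOPs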
7. An apparatus for processing a virtual object, comprising:
an acquisition unit configured to acquire a virtual object to be processed, expression parameters, and posture parameters, wherein the expression parameters are extracted from a real facial expression and the posture parameters are extracted from a real facial posture;
an expression driving unit configured to input the virtual object and the expression parameters into a pre-trained expression driving model, and perform expression conversion on the virtual object through the expression parameters to obtain a first target virtual object;
a posture driving unit configured to input the first target virtual object and the posture parameters into a pre-trained posture driving model, and perform posture adjustment on the first target virtual object through the posture parameters to obtain a second target virtual object.
8. An electronic device, comprising:
at least one processor;
at least one memory storing computer-executable instructions,
wherein the computer-executable instructions, when executed by the at least one processor, cause the at least one processor to perform a method of processing a virtual object as claimed in any one of claims 1 to 6.
9. A computer-readable storage medium storing instructions that, when executed by at least one processor, cause the at least one processor to perform a method of processing a virtual object as claimed in any one of claims 1 to 6.
10. A computer program product comprising computer instructions, characterized in that the computer instructions, when executed by at least one processor, cause the at least one processor to perform the method of processing a virtual object according to any one of claims 1 to 6.
CN202111467068.7A 2021-12-03 2021-12-03 Virtual object processing method and device Pending CN114140563A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111467068.7A CN114140563A (en) 2021-12-03 2021-12-03 Virtual object processing method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111467068.7A CN114140563A (en) 2021-12-03 2021-12-03 Virtual object processing method and device

Publications (1)

Publication Number Publication Date
CN114140563A true CN114140563A (en) 2022-03-04

Family

ID=80388103

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111467068.7A Pending CN114140563A (en) 2021-12-03 2021-12-03 Virtual object processing method and device

Country Status (1)

Country Link
CN (1) CN114140563A (en)


Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114879877A (en) * 2022-05-23 2022-08-09 北京新唐思创教育科技有限公司 State data synchronization method, device, equipment and storage medium
CN114879877B (en) * 2022-05-23 2023-03-28 北京新唐思创教育科技有限公司 State data synchronization method, device, equipment and storage medium
CN115346262A (en) * 2022-08-23 2022-11-15 北京字跳网络技术有限公司 Method, device and equipment for determining expression driving parameters and storage medium
CN115775024A (en) * 2022-12-09 2023-03-10 支付宝(杭州)信息技术有限公司 Virtual image model training method and device
CN115775024B (en) * 2022-12-09 2024-04-16 支付宝(杭州)信息技术有限公司 Virtual image model training method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination