CN115100707A - Model training method, video information generation method, device and storage medium - Google Patents

Model training method, video information generation method, device and storage medium

Info

Publication number
CN115100707A
CN115100707A
Authority
CN
China
Prior art keywords
image
depth
driving
face
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210689472.7A
Other languages
Chinese (zh)
Inventor
张隆昊
陈长国
申丽
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba China Co Ltd
Original Assignee
Alibaba China Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba China Co Ltd filed Critical Alibaba China Co Ltd
Priority to CN202210689472.7A priority Critical patent/CN115100707A/en
Publication of CN115100707A publication Critical patent/CN115100707A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/174Facial expression recognition
    • G06V40/176Dynamic expression
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T13/00Animation
    • G06T13/203D [Three Dimensional] animation
    • G06T13/403D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/50Depth or shape recovery
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The present application provides a model training method, a video information generation method, a device and a storage medium. The model training method includes: acquiring a first video sample that presents dynamic expression information of a face, performing iterative training of a model according to the image frames in the first video sample, and obtaining a depth recognition model when the value of the loss function converges to a preset value. In the i-th iterative training process, the i-th image frame is input into an initial recognition model to obtain a depth image of the i-th image frame, where i is a positive integer, and the loss function value of the initial recognition model under the i-th iterative training is determined according to the depth image of the i-th image frame and the (i+n)-th image frame. The depth recognition model is used for recognizing a depth image of an image to be recognized. The recognized depth image can accurately reflect the depth information of the image, so that the target face image in a video determined based on the depth image is more vivid, and the presentation effect of the dynamic picture is improved.

Description

Model training method, video information generation method, device and storage medium
Technical Field
The present application relates to the field of computer technologies, and in particular, to a model training method, a video information generating method, a device, and a storage medium.
Background
The metaverse (Metaverse) is a virtual world that is linked to and created from the real world by technical means, and that maps to and interacts with the real world. With the rapid growth of the metaverse ecosystem, the demand for producing virtual character (or digital character) images keeps increasing. In modeling a virtual character, how to make the character present facial expressions vividly and improve the presentation quality of dynamic pictures is a problem that urgently needs to be solved.
Disclosure of Invention
The embodiment of the application provides a model training method, a video information generating method, a device and a storage medium, so as to improve the presenting effect of the facial expression of a virtual character and improve the presenting quality of a dynamic picture.
In a first aspect, an embodiment of the present application provides a method for training a model, including: acquiring a first video sample, wherein the first video sample presents dynamic expression information of a face; performing iterative training of a model according to the image frames in the first video sample until the value of the loss function converges to a preset value, and obtaining a depth recognition model; in the ith iterative training process, inputting an ith image frame into an initial recognition model to obtain a depth image of the ith image frame, wherein i is a positive integer; determining a loss function value of the initial recognition model under the ith iterative training according to the depth image of the ith image frame and the (i + n) th image frame, wherein n is a positive integer; the depth recognition model is used for recognizing a depth image of an image to be recognized, the depth image is used for reflecting the three-dimensional geometric shape of a visible surface in the image to be recognized, and the image to be recognized presents facial expressions.
In a second aspect, an embodiment of the present application provides a video information generating method, including: acquiring a face image and a driving image, wherein the driving image is an image frame in a driving video, and the driving video comprises dynamic expression information for synthesizing the face video; acquiring depth images corresponding to the face image and the driving image respectively through a depth recognition model, wherein the depth images are used for reflecting the three-dimensional geometric shape of a visible surface in the image, the depth recognition model is obtained based on training of every two image frames with the interval of n in a video sample, and n is a positive integer; and performing feature transformation on the face image according to the depth images respectively corresponding to the face image and the driving image to obtain a target face image of the face image under the driving image, wherein the similarity between the facial expression presented by the target face image and the facial expression presented by the driving image is greater than a threshold value.
In a third aspect, an embodiment of the present application provides a training apparatus for a model, including: a sample acquisition unit configured to acquire a first video sample presenting dynamic expression information of a face; the training unit is used for carrying out iterative training on the model according to the image frames in the first video sample until the value of the loss function is converged to a preset value, and then a depth recognition model is obtained; in the ith iterative training process, inputting an ith image frame into an initial recognition model to obtain a depth image of the ith image frame, wherein i is a positive integer; determining a loss function value of the initial recognition model under the ith iterative training according to the depth image of the ith image frame and the (i + n) th image frame, wherein n is a positive integer; the depth recognition model is used for recognizing a depth image of an image to be recognized, the depth image is used for reflecting the three-dimensional geometric shape of a visible surface in the image to be recognized, and the image to be recognized presents facial expressions.
In a fourth aspect, an embodiment of the present application provides a system for generating video information, including: the image acquisition component is used for acquiring a face image and a driving image, wherein the driving image is an image frame in a driving video, and the driving video comprises dynamic expression information for synthesizing the face video; the video information generation component is used for acquiring depth images corresponding to the face image and the driving image respectively through a depth recognition model, the depth images are used for reflecting the three-dimensional geometrical shape of a visible surface in the image, the depth recognition model is obtained based on the training of every two image frames with the interval of n in a video sample, and n is a positive integer; the video information generation component is further configured to perform feature transformation on the face image according to the depth images corresponding to the face image and the driving image respectively to obtain a target face image of the face image in the driving image, where a similarity between a facial expression presented by the target face image and a facial expression presented by the driving image is greater than a threshold.
In a fifth aspect, an embodiment of the present application provides an electronic device, including: at least one processor and memory; the memory stores computer-executable instructions; the at least one processor executes computer-executable instructions stored by the memory, causing the at least one processor to perform the method as provided by the first or second aspects.
In a sixth aspect, embodiments of the present application provide a computer-readable storage medium, where computer-executable instructions are stored, and when a processor executes the computer-executable instructions, the method as provided in the first aspect or the second aspect is implemented.
In a seventh aspect, an embodiment of the present application provides a computer program product, which includes computer instructions, and when executed by a processor, the computer instructions implement the first aspect or the method provided in the second aspect.
In the embodiment of the application, the face transformation relation among the image frames is learned based on different image frames of the same video sample, so that the depth information of the image can be accurately reflected by the recognized depth image, and further, the target face image in the video determined based on the depth image is more vivid, and the presenting effect of a dynamic picture is improved.
Drawings
Fig. 1 is a schematic structural diagram of a video information generating system according to an embodiment of the present application;
fig. 2 is a schematic diagram of an internal architecture of a video information generating system according to an embodiment of the present application;
fig. 3 is a schematic block diagram of a feature transformation unit provided in an embodiment of the present application;
FIG. 4 is a schematic block diagram of an attention unit provided in an embodiment of the present application;
FIG. 5 is a schematic diagram of a model training apparatus according to an embodiment of the present disclosure;
FIG. 6 is a schematic diagram of model training provided in an embodiment of the present application;
FIG. 7 is a schematic diagram of a first stage model training provided by an embodiment of the present application;
fig. 8 is a schematic flowchart of a method for generating video information according to an embodiment of the present application;
FIG. 9 is a schematic flowchart of a method for training a model according to an embodiment of the present disclosure;
fig. 10 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The facial expression simulation scheme provided by the embodiments of the present application is a face driving (or face reenactment) technique. It can be applied to human-computer interaction, artificial intelligence and similar scenarios in various fields, for example: live streaming in which a virtual character acts as the host, including social-entertainment live streaming, e-commerce live streaming and knowledge-sharing live streaming; broadcasting road conditions through a virtual character in an intelligent vehicle; and having a virtual character participate in discussion in a video conference.
In order to generate a simulated facial expression in a virtual environment, a dynamic picture of the facial expression is often generated based on two-dimensional (2D) image features of the facial expression learned by a network. However, for regions of a facial expression with obvious three-dimensional (3D) features, such as the eyes and the mouth, it is difficult to acquire depth information. As a result, the generated dynamic pictures of facial expressions look overly smooth in those regions, and the overall face does not appear realistic enough.
In order to solve the above problem, the scheme for generating facial expressions provided by the embodiment of the application introduces a depth recognition model capable of capturing depth information in a facial image, so that dynamic facial expressions generated based on the depth information are more vivid.
It should be noted that the face in the embodiment of the present application may refer to a face of a real person, a face of a virtual cartoon character, a face of an animal, and the like, which are presented in an image or a video, and the present application is not limited thereto.
Fig. 1 is a schematic structural diagram of a video information generation system according to an embodiment of the present application. As shown in fig. 1, the system 100 includes: an image acquisition component 110 and a video information generation component 120. The image acquisition component 110 and the video information generation component 120 may be connected via a network.
The embodiments of the present application do not limit the deployment manner of the image acquisition component 110 and the video information generation component 120. Optionally, the image acquisition component 110 and the video information generation component 120 are implemented as distributed deployments, and the overall system function is implemented by means of a cloud computing system. For example, the image acquisition component 110 and the video information generation component 120 may both be deployed in the cloud, such as in a data center or a data center cloud system. Alternatively, the video information generation component 120 may be deployed in the cloud, such as a data center or a central cloud system, and implemented as a cloud server, so as to run various network models by virtue of resources on the cloud; compared with cloud deployment, the image acquisition component 110 can be deployed at the end sides of various e-commerce platforms and user terminals, so as to facilitate collection of facial images and drive videos uploaded by users. Regardless of the location at which the image acquisition component 110 is deployed, the image acquisition component 110 can be deployed at a terminal device or server. The video information generation component 120 and the image acquisition component 110 have similar deployment manners, and are not described herein again.
The terminal device may be, for example, a mobile phone, a tablet computer (Pad), a desktop computer, a terminal device in industrial control, or the like. The terminal device in the embodiments of the present application may also be a wearable device, also called a wearable smart device, which is a general term for everyday wearable items, such as glasses, gloves, watches, clothes and shoes, that are intelligently designed and developed by applying wearable technology. A wearable device is a portable device that is worn directly on the body or integrated into the clothing or accessories of the user.
The server may be, for example, a conventional server, a cloud server, or a cluster of servers, etc.
The image capturing component 110 is configured to capture a facial image and a driving video, for example, the image capturing component 110 may receive the facial image and the driving video input by the user, and for example, the image capturing component 110 may capture the facial image and the driving video from a network. As described above, the face image may be a 2D avatar of a real person, a virtual person, or an animal, and the driving video includes dynamic expression information for synthesizing the face video. For example, the facial image is a photograph of person a, the driving video is a video of the face of person B, and generally, the facial expression of person B is dynamically changed in the video of the face of person B.
Note that the face image presents a face, and the other content presented in the face image is not limited; for example, the face image may also include the person's body, objects in the surrounding environment, and the like. Similarly, the driving video presents a face without limiting its other content, such as the person's limbs or objects in the surrounding environment that also appear in the driving video. In general, a face is present in every image frame of the driving video, although it is not excluded that a face is absent from some image frames.
It should also be understood that the driving video includes multiple image frames. Each image frame in the driving video may be treated as a driving image; alternatively, some image frames may be selected from the driving video as driving images, for example, image frames spaced a preset number of frames apart in the driving video. The image acquisition component 110 may determine the driving image from the driving video after acquiring the driving video, or may directly acquire the driving image in the driving video. The driving image presents a facial expression.
The video information generation component 120 may receive the face image and the driving image sent by the image acquisition component 110, and based on the face image and the driving image, obtain a target face image of the face image under the driving image, in other words, a facial expression in the target face image is the same as or similar to a facial expression in the driving image, and a character feature in the target face image is the same as or similar to a character feature in the face image. For example, if the face image is a photograph of person a, the driving image is a face video of person B, the driving image shows the closed-eye expression of person B, and the target face image should show the closed-eye expression of person a.
It should be understood that after the video information generation component 120 generates a corresponding target face image for each driving image, a plurality of target face images may constitute a target face video that realizes a simulation of the facial expression of a virtual character, which takes the above-mentioned face image as an example, and realizes a dynamic expression of the character a.
The video information generation component 120 may be implemented as a video information generation model, which may be trained based on a generative adversarial network. The process of training the video information generation model using a generative adversarial network is described below.
The video information generation component 120 may input both the face image and the driving image into the depth recognition model to obtain depth images corresponding to the face image and the driving image, respectively, and then perform feature transformation on the face image according to the depth images corresponding to the face image and the driving image, respectively, to obtain a target face image of the face image in the driving image. Optionally, when the video information generation component 120 is implemented as a video information generation model, the depth recognition model may be a component in the video information generation model.
In some embodiments, the video information generating component 120 performs feature transformation on the face image according to the depth images corresponding to the face image and the driving image respectively, obtains a feature transformation vector of the face image under the driving image, and then generates the target face image based on the feature transformation vector. The target face image thus carries the image features of the face image as well as the depth of the face, with a particularly good depth expression in regions such as the mouth and eyes. At the same time, the target face image is fused with the facial expression information of the driving image, so that the facial expression of the target face image is the same as or similar to that of the driving image; in other words, the similarity between the facial expression of the target face image and the facial expression of the driving image is greater than a threshold. It should be noted that expression-related image features may be extracted from the target face image and the driving image respectively, and the similarity between the two may be determined based on the extracted image features.
Continuing with the above embodiment, in order to make the generated image more realistic, the video information generating component 120 may reconstruct the generated feature transformation vector according to the depth image corresponding to the face image based on the self-attention mechanism, further filter the irrelevant information, improve the attention of the neural network to the depth information, and further obtain the target face image based on the reconstructed feature transformation vector.
Illustratively, the video information generation component 120 may include: some or all of the depth recognition unit 121, the keypoint estimation unit 122, the feature transformation unit 123, and the attention unit 124. The embodiment of the present application does not limit the deployment manner of each unit in the video information generation component 120. Optionally, the units in the video information generating component 120 may be implemented as a distributed deployment of the cloud, for example, deployed in different servers or server clusters, or may be implemented as a centralized deployment of the cloud, for example, deployed in the same server or server cluster.
The depth recognition unit 121 may include a depth recognition model. As shown in fig. 2, the depth recognition unit 121 may input the above-described face image I_s and driving image I_d into the depth recognition model to obtain a depth image D_s of the face image I_s and a depth image D_d of the driving image I_d. The depth image D_s can be used to reflect the three-dimensional geometry of the visible surface of the face image I_s, and the depth image D_d can be used to reflect the three-dimensional geometry of the visible surface of the driving image I_d. It is to be understood that the depth recognition unit 121 may process the face image I_s and the driving image I_d in a parallel or serial manner to obtain the respective depth images.
The depth recognition model may be obtained by training based on adjacent image frames in the video sample or every two image frames with an interval of n, where n is a positive integer, and it is understood that two image frames with an interval of 1 are adjacent image frames. The training process of the depth recognition model is explained below.
Still referring to FIG. 2, the video information generation component 120 may stitch the depth image D_s output by the depth recognition model with the face image I_s to obtain a first image, stitch the depth image D_d with the driving image I_d to obtain a second image, and input the first image and the second image into the keypoint estimation unit 122 to perform keypoint estimation. Optionally, the stitching of the first image and the second image may be performed along the color channels; for example, the stitched image of the depth image D_s and the face image I_s may comprise the depth channel of D_s together with all of the color channels of I_s.
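As a concrete illustration of the depth recognition and channel-wise stitching described above, the following Python (PyTorch) sketch wires the two steps together. The depth_net module, the 256 × 256 resolution and the single depth channel are assumptions made for the sketch, not details taken from the patent.

import torch
import torch.nn as nn

# Hypothetical stand-in for the trained depth recognition model (unit 121).
depth_net = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 1, 3, padding=1), nn.Sigmoid(),   # one depth value per pixel
)

face = torch.rand(1, 3, 256, 256)     # face image I_s
drive = torch.rand(1, 3, 256, 256)    # driving image I_d

with torch.no_grad():
    depth_s = depth_net(face)         # D_s, shape (1, 1, 256, 256)
    depth_d = depth_net(drive)        # D_d, shape (1, 1, 256, 256)

# Channel-wise stitching: each stitched image keeps all RGB channels of the
# input image plus the recognised depth channel (3 + 1 = 4 channels).
first_image = torch.cat([face, depth_s], dim=1)    # I_s || D_s
second_image = torch.cat([drive, depth_d], dim=1)  # I_d || D_d
print(first_image.shape)  # torch.Size([1, 4, 256, 256])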
For example, the keypoint estimation unit 122 may generate keypoints of the first image based on the first image and generate keypoints of the second image based on the second image, where the keypoints of the first image and the keypoints of the second image have a correspondence, that is, the keypoint estimation unit 122 may generate keypoint pairs based on the first image and the second image.
For example, the keypoint estimation unit 122 may determine a set of keypoints of the first image and a set of keypoints of the second image, respectively, based on the following formula (1):

{x_s^p}_{p=1..K} = F_kp(I_s || D_s),  {x_d^p}_{p=1..K} = F_kp(I_d || D_d)    (1)

where F_kp represents the keypoint estimation function, I_s || D_s represents the first image, I_d || D_d represents the second image, K is the total number of keypoints, and p is the keypoint index, taking integer values from 1 to K (inclusive).
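For illustration only, the following Python (PyTorch) sketch shows one plausible shape for the keypoint estimation function F_kp in formula (1): a heatmap head followed by a soft-argmax. The patent does not specify the estimator's architecture, so the class name, layer sizes and number of keypoints below are assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class KeypointEstimator(nn.Module):
    """Toy stand-in for the keypoint estimation function F_kp of formula (1)."""
    def __init__(self, num_keypoints=10):
        super().__init__()
        # A real implementation would use a deeper backbone; two conv layers keep the sketch short.
        self.backbone = nn.Sequential(
            nn.Conv2d(4, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, num_keypoints, 3, padding=1),    # one heatmap per keypoint
        )

    def forward(self, x):                                   # x: stitched image I || D, 4 channels
        heat = self.backbone(x)                             # (B, K, H, W)
        b, k, h, w = heat.shape
        prob = F.softmax(heat.view(b, k, -1), dim=-1).view(b, k, h, w)
        ys = torch.linspace(-1.0, 1.0, h, device=x.device)
        xs = torch.linspace(-1.0, 1.0, w, device=x.device)
        grid_y, grid_x = torch.meshgrid(ys, xs, indexing="ij")
        kp_x = (prob * grid_x).sum(dim=(2, 3))              # soft-argmax: expected x coordinate
        kp_y = (prob * grid_y).sum(dim=(2, 3))              # soft-argmax: expected y coordinate
        return torch.stack([kp_x, kp_y], dim=-1)            # (B, K, 2) keypoints in [-1, 1]

estimator = KeypointEstimator(num_keypoints=10)
first_image = torch.rand(1, 4, 64, 64)    # I_s || D_s
second_image = torch.rand(1, 4, 64, 64)   # I_d || D_d
keypoint_pair = (estimator(first_image), estimator(second_image))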
Referring to fig. 2, the feature transformation unit 123 may generate a motion field W_m between the face image and the driving image according to the above keypoint pairs, for example {x_s^p} and {x_d^p}. Further, the feature transformation unit 123 may perform feature transformation on the image feature vector of the face image based on the motion field W_m between the face image and the driving image, so as to obtain a feature transformation vector of the face image under the driving image.
As shown in fig. 3, the feature transformation unit 123 may perform motion estimation on the first image and the second image based on the following formula (2) and determine the motion field W_m between the face image and the driving image:

W_m(z) = F_m(z; {x_s^p}_{p=1..K}, {x_d^p}_{p=1..K})    (2)

where F_m denotes the motion estimation function and z is a two-dimensional dense coordinate map.
Still referring to fig. 3, for example, the feature transformation unit 123 may perform feature transformation on the face image using the motion field W_m to obtain an initial feature transformation vector, and input the initial feature transformation vector into the occlusion estimator τ to obtain a motion flow mask M_w and a feature occlusion map M_0. The feature occlusion map M_0 marks the feature regions that need to be restored because of face rotation. For example, if the face image shows the right side of person A's face and the driving image shows the front of person B's face, then after feature transformation under the action of W_m, person A's face is turned from the right side to the front; in this case the initial feature transformation vector lacks the facial features of the left part of person A's frontal face, or the corresponding elements of those features in the initial feature transformation vector are zero. The motion flow mask M_w may be used to mask the part of the motion field W_m corresponding to the missing features (e.g., the motion field part corresponding to elements whose feature value is zero).
Continuing with the above example, the feature transformation unit 123 may take the dot product of the motion field and the motion flow mask to obtain a masked motion field, perform feature transformation on the image feature vector of the face image based on the masked motion field to obtain an initial feature vector, and then take the dot product of (or fuse) the initial feature vector with the feature occlusion map M_0 to obtain the feature transformation vector F_w.
The image feature vector of the face image may be obtained by inputting the face image into the feature encoder ε_1.
Based on the process shown in FIG. 3, the feature transformation vector F_w may satisfy the following formula (3):

F_w = M_0 ⊙ T(ε_1(I_s), M_w ⊙ W_m)    (3)

where T(·) represents the feature transformation (warping) function, ε_1(I_s) is the image feature vector of the face image, and ⊙ denotes the dot product. Based on the above implementation process, the feature transformation vector F_w simultaneously retains the 2D image features of the face in the face image and the head motion information between the face image and the driving image.
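The masking-and-warping composition of formula (3) can be sketched as follows in Python (PyTorch), with grid_sample standing in for the feature transformation function T. The tensor shapes and the random placeholder inputs are assumptions; a real pipeline would obtain W_m, M_w and M_0 from the motion estimator and occlusion estimator described above.

import torch
import torch.nn.functional as F

B, C, H, W = 1, 64, 64, 64
feat_s = torch.rand(B, C, H, W)                  # epsilon_1(I_s): image feature vector of the face image
motion_field = torch.rand(B, H, W, 2) * 2 - 1    # W_m in grid_sample's [-1, 1] coordinate convention
flow_mask = torch.rand(B, H, W, 1)               # M_w: motion flow mask
occlusion_map = torch.rand(B, 1, H, W)           # M_0: feature occlusion map

# Step 1 of formula (3): mask the motion field with the motion flow mask.
masked_field = motion_field * flow_mask
# Step 2: warp the source feature map with the masked field (feature transformation T).
warped = F.grid_sample(feat_s, masked_field, align_corners=True)
# Step 3: gate the warped features with the feature occlusion map.
F_w = warped * occlusion_map
print(F_w.shape)   # torch.Size([1, 64, 64, 64])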
It will be appreciated that the process by which the feature transformation unit 123 determines the feature transformation vector F_w described above is merely an example and not a restrictive illustration; any feature transformation method may be adopted to transform the face image into the target face image under the driving image. Moreover, the video information generation component 120 may not include the above keypoint estimation unit 122, in which case the feature transformation unit 123 may perform feature transformation on the face image based on a preset motion field or a motion field transmitted by another device.
The attention unit 124 may reconstruct, based on a self-attention mechanism, the feature transformation vector F_w obtained by the feature transformation unit 123, so that the generated target face image is more realistic.
The self-attention mechanism is explained first. Self-attention is a form of attention that lets the vectors within a feature transformation vector F_w autonomously learn the relevance between one another, so that vectors with higher relevance receive more attention. In practice, self-attention can be implemented through matrix computation based on Query vectors (Q vectors for short), Key vectors (K vectors for short) and Value vectors (V vectors for short). The Q, K and V vectors may be obtained by multiplying the input vector I by different coefficient matrices; for example, the Q, K and V vectors corresponding to the input vector I satisfy the following formulas (4) to (6), respectively:

Q = W_q · I    (4)
K = W_k · I    (5)
V = W_v · I    (6)

where W_q, W_k and W_v are the coefficient matrices of the respective vectors.
The self-attention mechanism described above is implemented based only on the input vector. In the embodiments of the present application, the presentation effect of the target face image can be further improved so that it appears more vivid: the feature transformation vector is reconstructed using the depth information of the face image (such as the depth image of the face image), so that the reconstructed feature transformation vector contains more depth information of the face, improving the presentation effect of the target face image.
Referring to FIG. 4, the attention unit 124 may use a depth encoder ε_d to perform depth feature mapping on the depth image D_s corresponding to the face image to obtain a depth feature vector F_d, and convolve the depth feature vector F_d to obtain a vector F_q (corresponding to the Q vector above); the vector F_q reflects the geometric characteristics of the face image. In addition, the attention unit 124 separately convolves the feature transformation vector F_w of the face image under the driving image to obtain a vector F_k (corresponding to the K vector above) and a vector F_v (corresponding to the V vector). Note that the present application does not limit the structure of the convolutional layers used in any of the above convolutions; in general, 1 × 1 convolutional layers may be adopted.
Continuing with the above example, the attention unit 124 may perform feature fusion on the Q vector and the K vector to obtain a similarity vector A (also called a similarity matrix) between the depth feature vector F_d and the feature transformation vector F_w. Further, the attention unit 124 may perform feature fusion on the similarity vector A and the V vector to obtain a reconstructed feature transformation vector F_g.
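A minimal Python (PyTorch) sketch of the depth-guided attention described above: Q comes from the depth feature vector F_d while K and V come from the warped feature F_w, each through a 1 × 1 convolution. The 1/sqrt(C) scaling and the tensor shapes are assumptions added for numerical convention, not details stated in the patent.

import torch
import torch.nn as nn

B, C, H, W = 1, 64, 16, 16
F_d = torch.rand(B, C, H, W)   # depth feature vector from the depth encoder (assumed shape)
F_w = torch.rand(B, C, H, W)   # feature transformation vector of the face image under the driving image

to_q = nn.Conv2d(C, C, kernel_size=1)   # 1x1 convolutions, as suggested in the text
to_k = nn.Conv2d(C, C, kernel_size=1)
to_v = nn.Conv2d(C, C, kernel_size=1)

q = to_q(F_d).flatten(2).transpose(1, 2)    # (B, HW, C): geometry of the face image
k = to_k(F_w).flatten(2)                    # (B, C, HW)
v = to_v(F_w).flatten(2).transpose(1, 2)    # (B, HW, C)

# Similarity matrix A between the depth features and the warped features.
attn = torch.softmax(q @ k / (C ** 0.5), dim=-1)      # (B, HW, HW)
# Fuse A with the value vectors to obtain the reconstructed vector F_g.
F_g = (attn @ v).transpose(1, 2).reshape(B, C, H, W)
print(F_g.shape)   # torch.Size([1, 64, 16, 16])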
Referring to fig. 2, the video information generation component 120 may further include a decoder, which may construct an image based on the feature transformation vector F_g to obtain the target face image.
In some embodiments, the video information generation component 120 may not include the attention unit 124, in which case the decoder may construct the image based on the feature transformation vector F_w to obtain the target face image.
Fig. 5 is a schematic diagram of a model training apparatus according to an embodiment of the present disclosure. As shown in fig. 5, the training apparatus 200 for the model includes: a sample acquisition unit 210 and a training unit 220. A sample acquisition unit 210 for acquiring a first video sample presenting dynamic expression information of a face; a training unit 220, configured to perform iterative training of a model according to the image frames in the first video sample, until a value of the loss function converges to a preset value, to obtain a depth recognition model; in the ith iterative training process, inputting an ith image frame into an initial recognition model to obtain a depth image of the ith image frame, wherein i is a positive integer; determining a loss function value of the initial recognition model under the ith iterative training according to the depth image of the ith image frame and the (i + n) th image frame, wherein n is a positive integer; the depth recognition model is used for recognizing a depth image of an image to be recognized, and the depth image is used for reflecting the three-dimensional geometric shape of a visible surface in the image to be recognized.
The model training apparatus 200 may be independent of the video information generation system 100 described above, or the model training apparatus 200 may be deployed in the video information generation system 100. Hereinafter, the model training apparatus 200 is described as being deployed in the video information generation system 100. When the model training apparatus 200 is deployed in the video information generation system 100, the sample acquisition unit 210 may be, for example, the image acquisition component 110 in the video information generation system 100, and the training unit 220 may be, for example, the video information generation component 120 in the video information generation system 100. Of course, this should not be construed as limiting the present application in any way; for example, the sample acquisition unit 210 may also be a component of the video information generation system 100 separate from the image acquisition component 110, and the training unit 220 may also be a component of the video information generation system 100 separate from the video information generation component 120.
Fig. 6 is a schematic diagram of model training according to an embodiment of the present disclosure. With reference to fig. 6, in the embodiment of the present application, model training is performed through the following two stages to obtain the depth recognition model and the video information generation model.
The first stage is as follows: referring to fig. 6, a first video sample is obtained, where the first video sample presents dynamic facial expression information, for example, the first video sample may be a lecture video, a news broadcast video, or the like. The first video sample includes a plurality of image frames, and generally, facial expressions of each image frame are different, but it is not excluded that some adjacent image frames have slight facial expression changes and are regarded as the same facial expression. Model training may be performed by using every two image frames with an interval of n in the first video sample as training samples, where n is a positive integer, and when n is equal to 1, the two image frames are adjacent image frames. For example, n may be set equal to 1 when the facial expression difference in each image frame is significant, n may be set greater than 1 when the facial expression difference in each image frame is small, and the value of n is correlated with the degree of difference between each image frame. For example, every two image frames with an interval of n in the first video sample may be used as a pair of training samples to form a first training set.
In the first stage, the depth recognition model training unit may perform iterative training on the initial recognition model based on the first training set until the value of the loss function converges to a preset value, so as to obtain the depth recognition model. It should be noted that the depth recognition model training unit may be a unit independent from the video information generation system 100, or may be a unit deployed in the video information generation system 100, which is not limited in this application.
For example, in the i-th iterative training process, the depth recognition model training unit may input the i-th image frame I_i into the initial recognition model to obtain a depth image D_i of the i-th image frame I_i, where i is a positive integer. Further, the depth recognition model training unit may determine the loss function value of the initial model under the i-th iterative training according to the depth image D_i of the i-th image frame and the (i+n)-th image frame I_{i+n}.
As shown in FIG. 7, the depth recognition model training unit may input the stitched image of the (i+n)-th image frame I_{i+n} and the i-th image frame I_i into a pose network to obtain transformation parameters between the i-th image frame I_i and the (i+n)-th image frame I_{i+n}. The transformation parameters may include camera internal parameters and/or camera external parameters: the camera internal parameters may be preset when the transformation parameters only include camera external parameters, and the camera external parameters may be preset or transmitted by another device when the transformation parameters only include camera internal parameters. Optionally, the camera external parameters may include a rotation matrix (also called the camera pose) R_{i→i+n} between the i-th image frame and the (i+n)-th image frame, and may further include a translation matrix t_{i→i+n} between the i-th image frame and the (i+n)-th image frame.
Continuing with the above example, the depth recognition model training unit may determine the loss function value of the initial recognition model according to the depth image of the i-th image frame, the (i+n)-th image frame and the transformation parameters. For example, the depth recognition model training unit may perform a geometric transformation according to the depth image D_i of the i-th image frame, the pixel matrix of the (i+n)-th image frame I_{i+n} and the transformation parameters to obtain the transformed depth image of the i-th image frame I_i, in which a pixel q_k may satisfy the following formula (7):

q_k ~ K_c ( R_{i→i+n} · D_i(p_j) · K_c^{-1} · p_j + t_{i→i+n} )    (7)

where p_j represents a pixel of the i-th image frame I_i and K_c denotes the camera internal parameter (intrinsics) matrix.
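The geometric transformation of formula (7) corresponds to the standard monocular re-projection used in self-supervised depth estimation; the Python (PyTorch) sketch below follows that convention, with the intrinsics matrix K_c, the pixel-coordinate handling and the toy inputs treated as assumptions rather than patent specifics.

import torch

def reproject(depth_i, K_c, R, t):
    """Project every pixel p_j of frame i into frame i+n using D_i, R and t.

    depth_i: (H, W) predicted depth of the i-th frame
    K_c:     (3, 3) camera intrinsics (assumed known or predicted)
    R, t:    (3, 3) rotation and (3,) translation from frame i to frame i+n
    Returns the (H, W, 2) pixel coordinates q_k in frame i+n.
    """
    H, W = depth_i.shape
    ys, xs = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                            torch.arange(W, dtype=torch.float32), indexing="ij")
    ones = torch.ones_like(xs)
    pix = torch.stack([xs, ys, ones], dim=-1).reshape(-1, 3)         # homogeneous pixels p_j
    cam = (torch.linalg.inv(K_c) @ pix.T) * depth_i.reshape(1, -1)   # back-project to 3-D
    cam_next = R @ cam + t.reshape(3, 1)                             # rigid transform to frame i+n
    proj = K_c @ cam_next                                            # project back to pixels
    q = (proj[:2] / proj[2].clamp(min=1e-6)).T.reshape(H, W, 2)      # q_k, pixel coordinates
    return q

q_k = reproject(torch.rand(64, 64) + 0.5,
                torch.tensor([[100., 0., 32.], [0., 100., 32.], [0., 0., 1.]]),
                torch.eye(3), torch.zeros(3))
print(q_k.shape)   # torch.Size([64, 64, 2])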
Furthermore, the depth recognition model training unit may perform image reconstruction according to the transformed depth image of the i-th image frame to obtain a reconstructed image Î_i of the i-th image frame. Illustratively, the reconstructed image Î_i may satisfy the following formula (8):

Î_i(p_j) = B_I( I_{i+n}(q_k) ),  j = 1, ..., N    (8)

where B_I(·) is a differentiable bilinear interpolation function and N is the total number of pixels of the transformed depth image.
Further, the depth recognition model training unit may determine, based on the reconstructed image Î_i of the i-th image frame, a consistency error between the i-th image frame I_i and the reconstructed image Î_i, and take the consistency error as the loss function value. Illustratively, the consistency error Pe satisfies the following formula (9):

Pe = α · ( 1 − SSIM(I_i, Î_i) ) + (1 − α) · L2(I_i, Î_i)    (9)

where SSIM(·) computes the structural similarity between the i-th image frame I_i and the reconstructed image Î_i, L2(·) denotes an L2 loss computed between the i-th image frame I_i and the reconstructed image Î_i, and α is a preset weight value.
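A sketch of how the reconstruction of formula (8) and the consistency error of formula (9) might be computed, assuming grid_sample plays the role of the differentiable bilinear interpolation B_I and using a simplified single-window SSIM; the exact weighting and windowing scheme are not given in the patent, so treat these details as assumptions.

import torch
import torch.nn.functional as F

def reconstruct_frame(frame_next, q_k):
    """Formula (8): sample I_{i+n} at q_k with differentiable bilinear interpolation."""
    H, W, _ = q_k.shape
    grid = q_k.clone()
    grid[..., 0] = 2.0 * grid[..., 0] / (W - 1) - 1.0   # normalise pixel coords to [-1, 1]
    grid[..., 1] = 2.0 * grid[..., 1] / (H - 1) - 1.0
    return F.grid_sample(frame_next.unsqueeze(0), grid.unsqueeze(0),
                         align_corners=True).squeeze(0)  # reconstructed I_i

def consistency_error(frame_i, recon_i, alpha=0.85):
    """Formula (9): weighted SSIM + L2 photometric error (simplified, single window)."""
    mu_x, mu_y = frame_i.mean(), recon_i.mean()
    var_x, var_y = frame_i.var(), recon_i.var()
    cov = ((frame_i - mu_x) * (recon_i - mu_y)).mean()
    c1, c2 = 0.01 ** 2, 0.03 ** 2
    ssim = ((2 * mu_x * mu_y + c1) * (2 * cov + c2)) / \
           ((mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))
    l2 = F.mse_loss(recon_i, frame_i)
    return alpha * (1 - ssim) + (1 - alpha) * l2

frame_i = torch.rand(3, 64, 64)
frame_next = torch.rand(3, 64, 64)
q_k = torch.rand(64, 64, 2) * 63          # toy projected coordinates from formula (7)
recon = reconstruct_frame(frame_next, q_k)
print(consistency_error(frame_i, recon))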
And a second stage: referring to fig. 6, the image acquisition component 110 may acquire a second training set, which may include one or more second training samples, and one training sample may include one frame image (which may be referred to as a driving image sample) of the driving video samples and one face image sample, and the driving image sample includes dynamic expression information for synthesizing the face video. The video information generation component 120 may sample and generate the confrontation network, and perform model training on the initial generation model by using the second training set, so as to obtain the video generation model. It should be noted that the initial generative model includes the depth recognition model, and the depth recognition model in the initial generative model may be trained in the first stage, or may be preconfigured, which is not limited in this application.
In the second stage, the process by which the video information generation component 120 generates a target face image I'_g based on a driving image sample I'_d and a face image sample I'_s is the same as or similar to the process described above in which the video information generation component 120 generates the target face image I_g based on the face image I_s and the driving image I_d. For example, during one training pass, the video information generation component may determine, through the depth recognition unit 121, the depth images D'_d and D'_s corresponding respectively to the driving image sample I'_d and the face image sample I'_s; then stitch the driving image sample I'_d with the depth image D'_d, stitch the face image sample I'_s with the depth image D'_s, and input the two stitched images into the keypoint estimation unit 122 to obtain the keypoint pairs {x'_s^p} and {x'_d^p}. The feature transformation unit 123 performs feature transformation on the face image sample I'_s based on the keypoint pairs {x'_s^p} and {x'_d^p} to obtain a feature transformation vector F'_w; further, the attention unit 124 may reconstruct the feature transformation vector F'_w to obtain a reconstructed feature transformation vector F'_g; finally, the decoder constructs an image based on the reconstructed feature transformation vector F'_g to obtain the target face image.
It should be noted that, from the perspective of the model training principle, the initial generative model is composed of a generative model (Generative Model) and a discriminative model (Discriminative Model). In the second-stage training process described above, the generative model mainly learns the distribution of real images, so that the images it generates are more realistic and the discriminative model cannot tell whether the data is real. The discriminative model is used to judge whether a received image is an image from the real world, i.e., to output the probability that the data is a real image rather than a generated one. The discriminative model can feed the training loss back to the generative model, which improves the generative model's ability to produce near-real images. The whole process can be regarded as a game between the generative model and the discriminative model; through continuous alternate iteration, the two network models eventually reach a dynamic balance: the discriminative model can no longer judge whether the data given by the generative model is a real image, and its judgment probability is about 0.5, similar to a random guess.
Referring to fig. 6, the video information generation component 120 may also include a discriminator, which implements the discriminative model. In the second stage, adversarial training is performed directly between the generative model and the discriminative model of the initial generative model to obtain the video generation model. In the embodiments of the present application, adversarial training refers to a training process in which the generative model and the discriminative model are balanced against each other.
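The alternating updates of the generative model and the discriminative model described above can be sketched as follows in Python (PyTorch); the placeholder networks, learning rates and binary cross-entropy losses are assumptions chosen only to make the alternation concrete, not the patent's actual architecture or objectives.

import torch
import torch.nn as nn

# Placeholder networks standing in for the initial generative model and the discriminator.
generator = nn.Sequential(nn.Conv2d(6, 3, 3, padding=1))          # (I'_s || I'_d) -> I'_g
discriminator = nn.Sequential(nn.Conv2d(3, 1, 4, stride=4),
                              nn.Flatten(), nn.Linear(64 * 64, 1))

opt_g = torch.optim.Adam(generator.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

for step in range(2):                                  # toy loop over the second training set
    face = torch.rand(1, 3, 256, 256)                  # face image sample I'_s
    drive = torch.rand(1, 3, 256, 256)                 # driving image sample I'_d
    fake = generator(torch.cat([face, drive], dim=1))  # target face image I'_g

    # Discriminator step: real frames vs generated frames.
    d_real = discriminator(drive)
    d_fake = discriminator(fake.detach())
    loss_d = bce(d_real, torch.ones_like(d_real)) + bce(d_fake, torch.zeros_like(d_fake))
    opt_d.zero_grad()
    loss_d.backward()
    opt_d.step()

    # Generator step: fool the discriminator so generated frames look real.
    d_fake = discriminator(fake)
    loss_g = bce(d_fake, torch.ones_like(d_fake))
    opt_g.zero_grad()
    loss_g.backward()
    opt_g.step()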
In the second stage, the component for performing the video generation model training may be a component in the video information generation system, or may be a component independent from the component in the video information generation system, which is not limited in this application, and the embodiment of the present application only takes the component for performing the video generation model training as the video information generation component 120 in the video information generation system as an example for description.
In the embodiment of the present application, in addition to providing a video information generation system, a video information generation method (see fig. 8) is also provided, which can realize generation of video information in scenes such as human-computer interaction, artificial intelligence and the like in various fields, and a model training method (see fig. 9).
As shown in fig. 8, the method 300 for generating video information includes:
s310, acquiring a face image and a driving image, wherein the driving image is an image frame in a driving video, and the driving video comprises dynamic expression information for synthesizing the face video;
s320, obtaining depth images corresponding to the face image and the driving image respectively through a depth recognition model, wherein the depth images are used for reflecting the three-dimensional geometrical shape of the visible surface in the image, the depth recognition model is obtained based on the training of every two image frames with the interval of n in the video sample, and n is a positive integer;
and S330, performing feature transformation on the face image according to the depth images respectively corresponding to the face image and the driving image to obtain a target face image of the face image under the driving image, wherein the facial expression presented by the target face image matches the facial expression in the driving image.
In some embodiments, the performing feature transformation on the face image according to the depth images corresponding to the face image and the driving image respectively to obtain a target face image of the face image under the driving image includes: according to the depth images corresponding to the face image and the driving image respectively, performing feature transformation on the face image to obtain a feature transformation vector of the face image under the driving image; reconstructing a feature transformation vector of the face image according to the depth image corresponding to the face image based on a self-attention mechanism; and obtaining the target face image according to the reconstructed feature transformation vector.
In some embodiments, the reconstructing the feature transformation vector of the face image according to the depth image corresponding to the face image based on the self-attention mechanism includes: obtaining a Query vector by convolution according to the depth feature vector, and obtaining a Key Key vector and a Value vector by convolution respectively according to the feature transformation vector of the face image under the driving image, wherein the depth feature vector is obtained by performing feature mapping on the depth image corresponding to the face image; performing feature fusion on the Query vector and the Key vector to obtain a similarity vector between the depth feature vector and the feature transformation vector; and performing feature fusion on the similarity vector and the Value vector to obtain a reconstructed feature transformation vector.
In some embodiments, the performing feature transformation on the face image according to the depth images corresponding to the face image and the driving image respectively to obtain a feature transformation vector of the face image under the driving image includes: generating a key point pair according to a first image and a second image, wherein the first image is a spliced image of the face image and the depth image of the face image, and the second image is a spliced image of the driving image and the depth image of the driving image; generating a motion field between the face image and the driving image according to the key point pair; and according to the motion field, performing feature conversion on the image feature vector of the face image to obtain a feature conversion vector of the face image under the driving image.
In some embodiments, the performing feature transformation on the image feature vector of the face image according to the motion field to obtain a feature transformation vector of the face image under the driving image includes: performing feature transformation on the face image according to the motion field to obtain an initial transformation feature map; inputting the initial feature transformation image into an occlusion estimator to obtain a motion flow mask and a feature occlusion image, wherein the feature occlusion image comprises a feature region to be repaired due to the rotation of the face; after dot product is carried out on the motion field and the motion flow mask, feature conversion is carried out on the image feature vector of the face image to obtain an initial feature vector; and performing dot product on the initial feature vector and the feature occlusion image to obtain the feature transformation vector.
According to the method for generating the video information, the depth recognition model obtained by training based on different image frames in the same video sample is introduced, the depth image obtained by the depth recognition model can accurately reflect the depth information of the image, and the presentation effect of the target face image determined based on the depth image is more vivid.
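Putting the pieces of S310 to S330 together, a schematic composition might look like the following Python sketch, where every callable argument (depth_net, keypoint_net, motion_net, encoder, attention, decoder) is a placeholder for the corresponding unit described above rather than an implementation taken from the patent.

import torch

def generate_target_face(face, drive, depth_net, keypoint_net, motion_net,
                         encoder, attention, decoder):
    """Schematic composition of S310-S330; every callable argument is a placeholder."""
    # S320: depth images for the face image and the driving image.
    d_s, d_d = depth_net(face), depth_net(drive)
    # S330 (first sub-step): keypoint pairs from the stitched images.
    kp_s = keypoint_net(torch.cat([face, d_s], dim=1))
    kp_d = keypoint_net(torch.cat([drive, d_d], dim=1))
    # Motion-field warping of the encoded face features.
    f_w = motion_net(encoder(face), kp_s, kp_d)
    # Self-attention reconstruction guided by the face depth image, then decoding.
    f_g = attention(f_w, d_s)
    return decoder(f_g)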
As shown in fig. 9, the training method 400 of the model includes:
s410, acquiring a first video sample, wherein the first video sample presents dynamic expression information of a face;
s420, performing iterative training of the model according to the image frames in the first video sample until the value of the loss function converges to a preset value, and obtaining a depth recognition model;
in the ith iterative training process, inputting an ith image frame into an initial recognition model to obtain a depth image of the ith image frame, wherein i is a positive integer;
determining a loss function value of the initial recognition model under the ith iterative training according to the depth image of the ith image frame and the (i + n) th image frame, wherein n is a positive integer;
the depth recognition model is used for recognizing a depth image of an image to be recognized, the depth image is used for reflecting the three-dimensional geometric shape of a visible surface in the image to be recognized, and the image to be recognized presents facial expressions.
In some embodiments, the determining the loss function value of the initial recognition model under the i-th iterative training according to the depth image of the i-th image frame and the i + n-th image frame includes: inputting the spliced image of the (i + n) th image frame and the ith image frame into an attitude network to obtain a transformation parameter between the ith image frame and the (i + n) th image frame; and determining a loss function value of the initial identification model according to the depth image of the ith image frame, the (i + n) th image frame and the transformation parameter.
In some embodiments, the determining the loss function value of the initial recognition model according to the depth image of the i-th image frame, the (i+n)-th image frame and the transformation parameters includes: performing geometric transformation according to the depth image of the i-th image frame, the pixel matrix of the (i+n)-th image frame and the transformation parameters to obtain a transformed depth image of the i-th image frame; performing image reconstruction according to the transformed depth image of the i-th image frame to obtain a reconstructed image of the i-th image frame; and determining a consistency error between the i-th image frame and the reconstructed image, and taking the consistency error as the loss function value.
In some embodiments, the transformation parameters include camera internal parameters and/or camera external parameters.
In some embodiments, the method further comprises: acquiring a training set, wherein the training set comprises a driving video sample and a facial image sample, and the driving video sample comprises dynamic expression information for synthesizing a facial video;
and performing model training on the initial generation model by using a generation countermeasure network and utilizing the training set to obtain a video generation model, wherein the initial generation model comprises the depth recognition model.
According to the model training method provided by the embodiments of the present application, the face transformation relationship between adjacent image frames, or between image frames spaced n frames apart, can be learned based on different image frames of the same video sample, so that the recognized depth image can accurately reflect the depth information of the image.
It should be noted that the execution subjects of the steps of the methods provided in the above embodiments may be the same device, or different devices may be used as the execution subjects of the methods. For example, the execution subjects of steps 310 to 330 may be device a; for another example, the execution subject of step 310 may be device a, and the execution subjects of steps 320 and 330 may be device B; and so on.
Referring to fig. 10, it should be noted that fig. 10 is only an illustration of an embodiment of the present application and does not mean that the present application is limited thereto.
Fig. 10 is a schematic structural diagram of an electronic device 500 according to an embodiment of the present application. As shown in the figure, the electronic device 500 may be a terminal device or a server, and when the electronic device 500 is deployed in a cloud as the server, the electronic device 500 may be implemented as the cloud server, and the electronic device 500 includes a processor 510, and the processor 510 may call and execute a computer program from a memory to implement the method in the embodiment of the present application.
Optionally, as shown in Fig. 10, the electronic device 500 may further include a memory 530. The processor 510 may call and execute a computer program from the memory 530 to implement the methods in the embodiments of the present application.
The memory 530 may be a separate device from the processor 510, or may be integrated in the processor 510.
Optionally, as shown in Fig. 10, the electronic device 500 may further include a transceiver 520, and the processor 510 may control the transceiver 520 to communicate with other devices; specifically, the transceiver 520 may transmit information or data to other devices, or receive information or data transmitted by other devices.
The transceiver 520 may include a transmitter and a receiver. The transceiver 520 may further include one or more antennas.
Optionally, the electronic device 500 may implement the corresponding processes of the video information generation system in the methods of the embodiments of the present application, or the corresponding processes of the model training apparatus in the methods of the embodiments of the present application; for brevity, details are not repeated here.
It should be understood that the processor of the embodiments of the present application may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method embodiments may be completed by integrated logic circuits of hardware in the processor or by instructions in the form of software. The processor may be a general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and may implement or perform the methods, steps, and logic blocks disclosed in the embodiments of the present application. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The steps of the methods disclosed in connection with the embodiments of the present application may be directly embodied as being executed by a hardware decoding processor, or executed by a combination of hardware and software modules in a decoding processor. The software module may be located in a storage medium well known in the art, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory, an electrically erasable programmable memory, or a register. The storage medium is located in the memory, and the processor reads the information in the memory and completes the steps of the above methods in combination with its hardware.
It will be appreciated that the memory in the embodiments of the present application may be a volatile memory or a non-volatile memory, or may include both volatile and non-volatile memory. The non-volatile memory may be a Read-Only Memory (ROM), a Programmable ROM (PROM), an Erasable PROM (EPROM), an Electrically Erasable PROM (EEPROM), or a flash memory. The volatile memory may be a Random Access Memory (RAM), which is used as an external cache. By way of example, but not limitation, many forms of RAM are available, such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDR SDRAM), Enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), and Direct Rambus RAM (DR RAM). It should be noted that the memory of the systems and methods described herein is intended to comprise, without being limited to, these and any other suitable types of memory.
The embodiment of the application also provides a computer readable storage medium for storing the computer program.
Optionally, the computer-readable storage medium may be applied to the electronic device in the embodiments of the present application, and the computer program enables the computer to execute the corresponding processes executed by the video information generation system in the methods of the embodiments of the present application, which are not described herein again for brevity.
Embodiments of the present application also provide a computer program product comprising computer program instructions.
Optionally, the computer program product may be applied to the electronic device in the embodiment of the present application, and the computer program instructions enable the computer to execute the corresponding process executed by the video information generation system in each method in the embodiment of the present application, which is not described herein again for brevity.
The embodiment of the application also provides a computer program.
Optionally, the computer program may be applied to the electronic device in the embodiment of the present application, and when the computer program runs on a computer, the computer executes a corresponding process executed by the video information generation system in each method in the embodiment of the present application, and for brevity, details are not described here again.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
If the functions are implemented in the form of software functional units and sold or used as independent products, they may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application, in essence, or the part contributing to the prior art, or a part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to perform all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disc, or other media capable of storing program code.
The above description covers only specific embodiments of the present application, but the protection scope of the present application is not limited thereto. Any changes or substitutions that can be easily conceived by a person skilled in the art within the technical scope disclosed in the present application shall fall within the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (14)

1. A method of training a model, comprising:
acquiring a first video sample, wherein the first video sample presents dynamic expression information of a face;
performing iterative training of a model according to the image frames in the first video sample until the value of the loss function converges to a preset value, and obtaining a depth recognition model;
in the ith iterative training process, inputting an ith image frame into an initial recognition model to obtain a depth image of the ith image frame, wherein i is a positive integer;
determining a loss function value of the initial recognition model under the ith iterative training according to the depth image of the ith image frame and the (i + n) th image frame, wherein n is a positive integer;
the depth recognition model is used for recognizing a depth image of an image to be recognized, the depth image is used for reflecting the three-dimensional geometric shape of a visible surface in the image to be recognized, and the image to be recognized presents facial expressions.
2. The method of claim 1, wherein determining the loss function value of the initial recognition model under the i-th iterative training from the depth image of the i-th image frame and the i + n-th image frame comprises:
inputting a spliced image of the (i + n)th image frame and the ith image frame into a pose network to obtain a transformation parameter between the ith image frame and the (i + n)th image frame;
and determining a loss function value of the initial recognition model according to the depth image of the ith image frame, the (i + n)th image frame and the transformation parameter.
3. The method of claim 2, wherein determining the loss function value of the initial recognition model from the depth image of the ith image frame, the (i + n)th image frame and the transformation parameter comprises:
performing geometric transformation according to the depth image of the ith image frame, the pixel matrix of the (i + n)th image frame and the transformation parameter to obtain a transformed depth image of the ith image frame;
carrying out image reconstruction according to the transformed depth image of the ith image frame to obtain a reconstructed image of the ith image frame;
and determining a consistency error between the ith image frame and the reconstructed image, and taking the consistency error as the loss function value.
4. The method according to claim 2 or 3, wherein the transformation parameters comprise camera internal parameters and/or camera external parameters.
5. The method according to any one of claims 1 to 3, further comprising:
acquiring a training set, wherein the training set comprises a driving video sample and a facial image sample, and the driving video sample comprises dynamic expression information for synthesizing a facial video;
and performing model training on an initial generation model with a generative adversarial network and the training set to obtain a video generation model, wherein the initial generation model comprises the depth recognition model.
6. A method for generating video information, comprising:
acquiring a face image and a driving image, wherein the driving image is an image frame in a driving video, and the driving video comprises dynamic expression information for synthesizing the face video;
obtaining depth images corresponding to the face image and the driving image respectively through a depth recognition model, wherein the depth images are used for reflecting the three-dimensional geometric shape of a visible surface in an image, the depth recognition model is obtained based on training of every two image frames with the interval of n in a video sample, and n is a positive integer;
and according to the depth images respectively corresponding to the face image and the driving image, carrying out feature transformation on the face image to obtain a target face image of the face image under the driving image, wherein the similarity between the facial expression presented by the target face image and the facial expression presented by the driving image is greater than a threshold value.
7. The method according to claim 6, wherein the performing feature transformation on the face image according to the depth images corresponding to the face image and the driving image respectively to obtain a target face image of the face image under the driving image comprises:
according to the depth images corresponding to the face image and the driving image respectively, performing feature transformation on the face image to obtain a feature transformation vector of the face image under the driving image;
reconstructing a feature transformation vector of the face image according to a depth image corresponding to the face image based on an attention mechanism;
and obtaining the target face image according to the reconstructed feature transformation vector.
8. The method according to claim 7, wherein the reconstructing the feature transformation vector of the face image according to the depth image corresponding to the face image based on the attention mechanism comprises:
convolving to obtain a Query vector according to a depth feature vector, and convolving to obtain a Key vector and a Value vector according to the feature transformation vector of the face image under the driving image, wherein the depth feature vector is obtained by performing feature mapping based on the depth image corresponding to the face image;
performing feature fusion on the Query vector and the Key vector to obtain a similarity vector between the depth feature vector and the feature transformation vector;
and performing feature fusion on the similarity vector and the Value vector to obtain a reconstructed feature transformation vector.
9. The method according to claim 7 or 8, wherein the performing feature transformation on the face image according to the depth images corresponding to the face image and the driving image respectively to obtain a feature transformation vector of the face image under the driving image comprises:
generating key point pairs according to a first image and a second image, wherein the first image is a spliced image of the face image and the depth image of the face image, and the second image is a spliced image of the driving image and the depth image of the driving image;
generating a motion field between the face image and the driving image according to the key point pairs;
and according to the motion field, carrying out feature conversion on the image feature vector of the face image to obtain a feature transformation vector of the face image under the driving image.
10. The method according to claim 9, wherein the performing feature transformation on the image feature vector of the face image according to the motion field to obtain a feature transformation vector of the face image under the driving image comprises:
performing feature transformation on the face image according to the motion field to obtain an initial transformation feature map;
inputting the initial transformation feature map into an occlusion estimator to obtain a motion flow mask and a feature occlusion map, wherein the feature occlusion map comprises a feature region to be repaired due to the rotation of a face;
after a dot product is carried out on the motion field and the motion flow mask, feature conversion is carried out on the image feature vector of the face image to obtain an initial feature vector;
and performing a dot product on the initial feature vector and the feature occlusion map to obtain the feature transformation vector.
11. An apparatus for training a model, comprising:
a sample acquisition unit configured to acquire a first video sample that presents dynamic expression information of a face;
the training unit is used for carrying out iterative training on a model according to the image frames in the first video sample until the value of the loss function converges to a preset value, and then a depth recognition model is obtained; in the ith iterative training process, inputting an ith image frame into an initial recognition model to obtain a depth image of the ith image frame, wherein i is a positive integer; determining a loss function value of the initial recognition model under the ith iterative training according to the depth image of the ith image frame and the (i + n) th image frame, wherein n is a positive integer;
the depth recognition model is used for recognizing a depth image of an image to be recognized, the depth image is used for reflecting the three-dimensional geometric shape of a visible surface in the image to be recognized, and the image to be recognized presents facial expressions.
12. A system for generating video information, comprising:
the image acquisition component is used for acquiring a face image and a driving image, wherein the driving image is an image frame in a driving video, and the driving video comprises dynamic expression information for synthesizing the face video;
the video information generation component is used for acquiring depth images corresponding to the face image and the driving image respectively through a depth recognition model, the depth images are used for reflecting the three-dimensional geometric shape of a visible surface in an image, the depth recognition model is obtained based on training of every two image frames with the interval of n in a video sample, and n is a positive integer;
the video information generation component is further configured to perform feature transformation on the face image according to the depth images corresponding to the face image and the driving image respectively to obtain a target face image of the face image in the driving image, and a similarity between a facial expression presented by the target face image and a facial expression presented by the driving image is greater than a threshold value.
13. An electronic device, comprising: at least one processor and memory;
the memory stores computer-executable instructions;
the at least one processor executes the computer-executable instructions stored in the memory, so that the at least one processor performs the method of any one of claims 1 to 11.
14. A computer-readable storage medium having computer-executable instructions stored thereon which, when executed by a processor, implement the method of any one of claims 1 to 11.
CN202210689472.7A 2022-06-16 2022-06-16 Model training method, video information generation method, device and storage medium Pending CN115100707A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210689472.7A CN115100707A (en) 2022-06-16 2022-06-16 Model training method, video information generation method, device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210689472.7A CN115100707A (en) 2022-06-16 2022-06-16 Model training method, video information generation method, device and storage medium

Publications (1)

Publication Number Publication Date
CN115100707A true CN115100707A (en) 2022-09-23

Family

ID=83291311

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210689472.7A Pending CN115100707A (en) 2022-06-16 2022-06-16 Model training method, video information generation method, device and storage medium

Country Status (1)

Country Link
CN (1) CN115100707A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116109991A (en) * 2022-12-07 2023-05-12 北京百度网讯科技有限公司 Constraint parameter determination method and device of model and electronic equipment
CN116109991B (en) * 2022-12-07 2024-01-09 北京百度网讯科技有限公司 Constraint parameter determination method and device of model and electronic equipment
CN116612213A (en) * 2023-07-19 2023-08-18 南京硅基智能科技有限公司 Digital business card generation method and system based on face recalculation algorithm

Similar Documents

Publication Publication Date Title
US11232286B2 (en) Method and apparatus for generating face rotation image
US11030772B2 (en) Pose synthesis
CN110599395B (en) Target image generation method, device, server and storage medium
CN111402399B (en) Face driving and live broadcasting method and device, electronic equipment and storage medium
CN113706699B (en) Data processing method and device, electronic equipment and computer readable storage medium
CN115100707A (en) Model training method, video information generation method, device and storage medium
CN112530019B (en) Three-dimensional human body reconstruction method and device, computer equipment and storage medium
CN113507627B (en) Video generation method and device, electronic equipment and storage medium
CN111833360B (en) Image processing method, device, equipment and computer readable storage medium
US11918412B2 (en) Generating a simulated image of a baby
CN111583399A (en) Image processing method, device, equipment, medium and electronic equipment
CN117036583A (en) Video generation method, device, storage medium and computer equipment
Wang et al. Faithful face image completion for HMD occlusion removal
Nguyen-Ha et al. Free-viewpoint rgb-d human performance capture and rendering
WO2021057091A1 (en) Viewpoint image processing method and related device
CN113542758A (en) Generating antagonistic neural network assisted video compression and broadcast
CN113538254A (en) Image restoration method and device, electronic equipment and computer readable storage medium
CN116996654A (en) New viewpoint image generation method, training method and device for new viewpoint generation model
CN115239857B (en) Image generation method and electronic device
Khan et al. Towards monocular neural facial depth estimation: Past, present, and future
Luo et al. Put myself in your shoes: Lifting the egocentric perspective from exocentric videos
CN113313133A (en) Training method for generating countermeasure network and animation image generation method
CN113542759B (en) Generating an antagonistic neural network assisted video reconstruction
CN116958451B (en) Model processing, image generating method, image generating device, computer device and storage medium
CN115984094B (en) Face safety generation method and equipment based on multi-loss constraint visual angle consistency

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination