CN116824650A - Video generation method and related device of target object

Publication number: CN116824650A
Application number: CN202210272888.9A
Authority: CN (China)
Original language: Chinese (zh)
Inventors: 孙萁浩, 贲晛烨, 杜兆臣, 袁嫡伽, 李柏岩, 马琳杰
Applicant / current assignee: Qingdao Hisense Electronic Technology Services Co., Ltd.
Legal status: Pending
Abstract

The application discloses a video generation method of a target object and a related device. The method cuts a video to be processed into video clips corresponding to emotional characteristics, based on the emotional characteristics contained in the audio content of the video to be processed, and inputs each video clip together with its emotional characteristics into a generation countermeasure network. After the generation countermeasure network splits the video clip into multiple frames of first images, emotional noise corresponding to the emotional characteristics is added to each first image, so that the facial expression of the target object in the noise-processed second image carries the emotional characteristics. Finally, the second images are integrated into a video based on the video timing position of each first image in the video to be processed. Through this flow, the facial expression of the target object in the video to be processed is adjusted according to the emotional characteristics of the audio content of the video to be processed, so that the emotional characteristics conveyed by the facial expression of the target object are consistent with the audio content, and the realism of the video is improved.

Description

Video generation method and related device of target object
Technical Field
The application relates to the technical field of artificial intelligence, in particular to a video generation method of a target object and a related device.
Background
The speaker face video technology is widely applied to fields such as virtual interaction, cartoon character construction, and video dubbing. The technique generates, from a given piece of audio and a single face image, a dynamic video in which the face dynamically recites the audio content. Generating a natural and fluent talking-face video from a single face image and audio content is challenging: multi-frame face generation that preserves identity features needs to be achieved, and the facial changes need to be kept as consistent as possible with the audio content.
In the related art, a deep learning approach is mostly adopted to extract features of a face image; after multi-frame position prediction is performed on the extracted feature points, the audio content is embedded into the lip feature point region, so that a continuous video picture with facial expression changes is obtained. Although the motion trajectory of the lips in the video can be well controlled to correspond to the audio content, the semantic emotion in the audio content is not considered, so the generated video suffers from problems such as stiff facial expressions and facial expressions that do not match the semantic emotion; the overall effect of the video is therefore not natural enough, which affects the user experience.
Disclosure of Invention
The embodiment of the application provides a video generation method and a related device of a target object, which are used for solving the problem that facial expression is stiff and semantic emotion is inconsistent in the existing speaker face video.
In a first aspect, an embodiment of the present invention provides a method for generating a video of a target object, where the method includes:
carrying out semantic recognition on audio content of a video to be processed, and determining emotion characteristics contained in the audio content; wherein each frame of video image of the video to be processed contains the same target object;
if the audio content contains multiple emotion characteristics, performing preset cutting processing on the video to be processed based on the emotion characteristics to obtain multiple video clips; wherein the audio content of each video clip corresponds to an emotional characteristic;
inputting the video clips and the emotion characteristics of the video clips into a generation countermeasure network for each video clip, so that after the video clips are split into multiple frames of first images according to video duration through the generation countermeasure network, emotion noise corresponding to the emotion characteristics is added to each first image, and a second image corresponding to the first image is obtained; the emotional noise is used for adjusting the facial expression of the target object so that the facial expression has the emotional characteristics;
and sequencing the second images corresponding to the first images according to the video time sequence positions of the first images in the video to be processed, and integrating the sequenced second images into a video.
The embodiment of the application cuts the video to be processed into video clips corresponding to the emotion characteristics based on the emotion characteristics contained in the audio content of the video to be processed. The method comprises the steps of inputting a video segment and emotion characteristics of the video segment into a generating countermeasure network, dividing the video segment into a plurality of frames of first images according to video duration through the generating countermeasure network, and adding emotion noise corresponding to the emotion characteristics to each first image so that facial expressions of target objects in a second image after noise processing have the emotion characteristics. And finally, sequencing the second images after noise processing according to the video time sequence position of the first images in the video to be processed, and integrating the sequenced second images into the video. According to the flow, the facial expression of the target object in the video to be processed is adjusted according to the emotional characteristics of the video audio content to be processed, so that the emotional characteristics of the target object in the video, which are possessed by the facial expression, are consistent with the audio content, and the sense of realism of the video is improved.
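For illustration only, the above flow can be sketched in Python as follows; the helper names recognize_emotions, cut_by_emotion and emotion_gan are assumptions introduced here and do not appear in the application, and the sketch only fixes the order of operations rather than any concrete implementation.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class VideoClip:
    frames: list           # first images of this clip, in video-timing order
    frame_positions: list  # position of each first image in the original video
    emotion: str           # single emotion label of this clip's audio content

def generate_emotional_video(video_to_process,
                             recognize_emotions: Callable,
                             cut_by_emotion: Callable,
                             emotion_gan: Callable) -> list:
    """Illustrative pipeline: emotion recognition -> cutting -> per-clip GAN -> merge."""
    # 1. Semantic recognition of the audio content yields the emotional characteristics.
    emotions = recognize_emotions(video_to_process.audio)

    # 2. Cut the video so that the audio of each clip carries exactly one emotion.
    clips: List[VideoClip] = cut_by_emotion(video_to_process, emotions)

    # 3. The GAN splits each clip into first images, adds emotional noise, and
    #    returns second images whose facial expression carries the clip's emotion.
    second_images: Dict[int, object] = {}
    for clip in clips:
        for position, first_image in zip(clip.frame_positions, clip.frames):
            second_images[position] = emotion_gan(first_image, clip.emotion)

    # 4. Sort the second images by video-timing position and integrate them into a video.
    return [second_images[p] for p in sorted(second_images)]
```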
In some possible embodiments, the generation of the antagonism network is trained by:
acquiring a sample video and a reference video; wherein each video image of the sample video and the reference video includes the target object; the video duration of the sample video is the same as that of the reference video, and the emotional characteristics of the audio content of the sample video and the emotional characteristics of the audio content of the reference video are the same;
Splitting the sample video and the reference video respectively to obtain a multi-frame sample image of the sample video and a multi-frame reference image of the reference video; wherein each sample image corresponds to a reference image on video timing;
training an originally generated countermeasure network in an iterative mode based on the sample image and the reference image until a first convergence condition is met; determining a video to be tested according to the model parameters obtained in the last iteration;
training the original generation countermeasure network in an iterative mode based on the video to be detected and the reference video until a second convergence condition is met; and determining the generated countermeasure network according to the model parameters obtained in the last iteration.
In the embodiment of the application, two modes of image comparison and video comparison are adopted to adjust the parameters of the network model in the training stage, and the model parameters are adjusted through the comparison result of the sample image and the reference image until the facial expression of the target object in the image to be detected and the reference image has the same emotional characteristics. Further, after the image to be detected is integrated into the video to be detected, the model parameters are adjusted again according to the comparison result of the video to be detected and the reference video until the facial expression of the target object in the video to be detected and the reference video has the same emotion characteristics. Therefore, the emotional characteristics of the generated countermeasure network in the two dimensions of the static image and the dynamic video are matched with the reference video, and the sense of reality of the generated video is improved.
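A minimal sketch of this two-stage training loop is given below; the model interface (generate_random_noise, update_from_image_losses, update_from_video_loss), the loss callables and the thresholds are placeholders chosen for illustration, not the application's actual implementation.

```python
def train_two_stage(model, sample_images, reference_images, timing_positions,
                    reference_video, image_loss, video_loss, build_video,
                    first_threshold: float, second_threshold: float,
                    max_iters: int = 1000):
    """Stage 1: converge on image comparison; stage 2: converge on video comparison."""
    # Stage 1: compare every image under test with its reference image.
    for _ in range(max_iters):
        noise_segments = model.generate_random_noise(len(sample_images))
        images_under_test = [img + seg for img, seg in zip(sample_images, noise_segments)]
        losses = [image_loss(t, r) for t, r in zip(images_under_test, reference_images)]
        if max(losses) <= first_threshold:      # first convergence condition
            break
        model.update_from_image_losses(losses)  # adjust parameters; noise is regenerated next pass

    # Stage 2: integrate the images under test into a video and compare with the reference video.
    for _ in range(max_iters):
        noise_segments = model.generate_random_noise(len(sample_images))
        images_under_test = [img + seg for img, seg in zip(sample_images, noise_segments)]
        video_under_test = build_video(images_under_test, timing_positions)
        loss = video_loss(video_under_test, reference_video)
        if loss <= second_threshold:            # second convergence condition
            break
        model.update_from_video_loss(loss)
    return model
```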
In some possible embodiments, the training the originally generated countermeasure network in an iterative manner based on the sample image and the reference image until a first convergence condition is satisfied includes:
for each image to be measured, acquiring a first loss function between the image to be measured and a target image corresponding to the image to be measured; the target image is a reference image with the same video time sequence as the sample image corresponding to the image to be detected;
and if the first loss function is larger than a first threshold value, adjusting the model parameters according to the first loss function, regenerating random noise according to the adjusted model parameters, and determining an image to be detected based on the regenerated random noise until the first loss function corresponding to the image to be detected is not larger than the first threshold value.
According to the embodiment of the application, random noise is generated based on the model parameters of the current network model, and the facial expression of the target object in the sample image is changed by adding the random noise to the sample image, so that the image to be detected is obtained. And determining whether the emotion characteristics of the target object in the image to be detected are identical with the reference image or not by comparing the first loss function between the image to be detected and the reference image corresponding to the sample image. Training the converged network model in the manner described above can generate an image that matches the emotional characteristics of the reference image.
In some possible embodiments, the training the originally generated countermeasure network in an iterative manner based on the video to be tested and the reference video until a second convergence condition is satisfied includes:
determining a second loss function between the video to be detected and the reference video;
and if the second loss function is larger than a second threshold, adjusting the model parameters according to the second loss function, regenerating random noise according to the adjusted model parameters, and determining the video to be detected based on the regenerated random noise until the second loss function of the video to be detected is not larger than the second threshold.
After the convergence of the model parameters is determined from the angle of the static image, the embodiment of the application constructs the video to be tested from the images to be tested generated with those model parameters. A second loss function between the video to be tested and the reference video is then computed, and the model parameters of the network model are adjusted through the second loss function, so that the emotional characteristics of the video generated by the network model match the reference video in both the static and dynamic dimensions, improving the realism of the generated video.
In some possible embodiments, the determining the video to be tested according to the model parameters obtained in the last iteration includes:
Dividing random noise generated by the model parameters based on the number of images of the sample images to obtain noise segments with the same number as the images; wherein each sample image corresponds to a unique noise segment;
for each sample image, adding a corresponding noise segment to the sample image to obtain an image to be synthesized;
determining video time sequence positions of the sample images corresponding to the images to be synthesized in the sample video, sequencing the images to be synthesized according to the video time sequence positions, and integrating the sequenced images to be synthesized into the video to be tested.
The embodiment of the application splits the random noise generated by the model parameters of the trained network model into noise segments with the same number as the images of the sample images. Because the sample images are obtained by splitting sample videos, the sorting order of the split sample images on the video time sequence can correspond to the sorting order of the noise segments, so that each noise segment corresponds to one sample image, the images to be synthesized can be obtained by adding the corresponding noise segments in each sample image, and then the images to be synthesized are integrated according to the video time sequence sorting of the sample images, so that the video to be detected can be obtained.
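A small numpy sketch of this noise-splitting step is given below; the array shapes and the way a noise segment is spread over its sample image are assumptions made only to keep the example self-contained.

```python
import numpy as np

def build_video_under_test(sample_images: np.ndarray, random_noise: np.ndarray,
                           timing_positions) -> np.ndarray:
    """sample_images: (n, H, W, C); random_noise: flat vector; timing_positions: length-n sequence."""
    n = sample_images.shape[0]
    # Split the noise generated by the model parameters into n segments, one per sample image.
    noise_segments = np.array_split(random_noise, n)

    frames = []
    for image, segment in zip(sample_images, noise_segments):
        # Spread the segment over the image to perturb the facial expression
        # (a stand-in for however the generator actually injects the noise).
        perturbation = np.resize(segment, image.shape)
        frames.append(image.astype(np.float32) + perturbation.astype(np.float32))

    # Sort the images to be synthesized by the video-timing position of their sample images.
    order = np.argsort(timing_positions)
    return np.stack([frames[i] for i in order])
```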
In some possible embodiments, the emotional noise is determined according to the following manner:
and after determining the generated countermeasure network according to the model parameters obtained in the last iteration, taking random noise generated by the model parameters as emotion noise corresponding to the emotion characteristics.
According to the embodiment of the application, after the convergence of the model parameters is determined, the random noise generated by the model parameters is used as the emotion noise corresponding to the emotion characteristics, so that when the video to be processed input into the network model has the same emotion, the emotion noise can be directly added to the video to be processed, and the sense of reality of the generated video is improved.
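In code form this amounts to caching the converged noise under its emotion label so it can be reused at inference time; the dictionary below is only an assumed stand-in for whatever storage the implementation actually uses.

```python
emotion_noise_bank = {}

def register_emotion_noise(emotion: str, converged_model) -> None:
    # After the last training iteration, keep the noise generated by the final
    # model parameters as the emotional noise for this emotion.
    emotion_noise_bank[emotion] = converged_model.generate_random_noise()

def get_emotion_noise(emotion: str):
    # At inference time, a clip whose audio carries this emotion can have the
    # cached noise added directly, without re-deriving it.
    return emotion_noise_bank[emotion]
```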
In some possible embodiments, the video to be processed is determined according to the following manner:
receiving audio to be processed and at least one image to be processed containing the target object;
inputting the audio to be processed and the image to be processed into a video generation network, so as to generate multi-frame virtual images based on the facial feature points after the facial feature points of the target object are identified through the video generation network; each virtual image contains the target object, and at least one identical facial feature point of the target object has different position coordinates across the virtual images;
And integrating the audio to be processed and the multi-frame virtual image into the video to be processed.
According to the embodiment of the application, based on a pre-trained video generation network, after the facial feature points of the target object in the image to be processed are identified, the multi-frame virtual image is generated based on the facial feature points, and then the multi-frame virtual image and the audio to be processed are integrated into the video to be processed, and the video to be processed, each frame of which contains the target object, can be directly generated according to the image to be processed and the audio to be processed of the target object through the flow.
In some possible embodiments, the generating a multi-frame virtual image based on the facial feature points includes:
determining a first motion track of a first feature point in the facial feature points according to semantic content of the audio to be processed, and determining the target number of virtual images to be generated according to the first motion track; wherein the first feature point is a feature point located in a lip region in the facial feature point;
determining a second motion trail of a second feature point according to the target quantity, wherein the second feature point is a feature point which is positioned outside the lip region in the facial feature point;
The virtual image is generated based on the first motion profile and the second motion profile.
According to the embodiment of the application, the first motion track of the feature points of the lip region in the facial feature points is determined according to the semantic content of the audio to be processed, and the target number required to generate the virtual image is calculated according to the first motion track. Further, a second motion track of the remaining facial feature points outside the lip region is predicted according to the target quantity, so that the position coordinates of the facial feature points of the target object in each virtual image can be determined according to the first motion track and the second motion track, and further, the corresponding virtual image is generated.
In some possible embodiments, the generating the virtual image based on the first motion profile and the second motion profile includes:
determining a first position coordinate of the first feature point in each virtual image according to the first motion trail and the target quantity, and determining a second position coordinate of the second feature point in each virtual image according to the second motion trail and the target quantity;
copying the image to be processed into a plurality of frames of copied images with the target quantity;
For each duplicate image, moving a first feature point in the duplicate image to the first position coordinate of the duplicate image and moving the second feature point to the second position coordinate of the duplicate image to obtain the virtual image.
According to the embodiment of the application, the first position coordinates of the first characteristic points in each virtual image are determined according to the first motion trail and the target number of the virtual images, and the second position coordinates of the second characteristic points in each virtual image are determined according to the second motion trail and the target number. After the image to be processed is copied into the multi-frame copied images with the target quantity, the facial feature point position of the target object in each copied image is the same as that of the image to be processed, and thus, the corresponding virtual image can be obtained by adjusting the position coordinates of the first feature point and the second feature point in each copied image.
In a second aspect, an embodiment of the present application provides a video generating apparatus for a target object, the apparatus including:
the feature extraction module is configured to perform semantic recognition on the audio content of the video to be processed and determine emotion features contained in the audio content; wherein each frame of video image of the video to be processed contains the same target object;
The video clipping module is configured to execute preset clipping processing on the video to be processed based on the emotion characteristics if the audio content contains multiple emotion characteristics, so as to obtain multiple video clips; wherein the audio content of each video clip corresponds to an emotional characteristic;
the emotion adding module is configured to execute the steps of inputting the video clips and emotion characteristics of the video clips into a generation countermeasure network for each video clip, so that after the video clips are split into multiple frames of first images according to video time length through the generation countermeasure network, emotion noise corresponding to the emotion characteristics is added to each first image, and a second image corresponding to the first image is obtained; the emotional noise is used for adjusting the facial expression of the target object so that the facial expression has the emotional characteristics;
the video generation module is configured to execute sorting of the second images corresponding to the first images according to the video time sequence positions of the first images in the video to be processed, and integrate the sorted second images into a video.
In some possible embodiments, the apparatus further comprises a network training module for deriving the generation of the countermeasure network by:
Acquiring a sample video and a reference video; wherein each video image of the sample video and the reference video includes the target object; the video duration of the sample video is the same as that of the target video, and the emotional characteristics of the audio content of the sample video and the emotional characteristics of the audio content of the reference video are the same;
splitting the sample video and the reference video respectively to obtain a multi-frame sample image of the sample video and a multi-frame reference image of the reference video; wherein each sample image corresponds to a reference image on video timing;
training an originally generated countermeasure network in an iterative mode based on the sample image and the reference image until a first convergence condition is met; determining a video to be tested according to the model parameters obtained in the last iteration;
training the original generation countermeasure network in an iterative mode based on the video to be detected and the reference video until a second convergence condition is met; and determining the generated countermeasure network according to the model parameters obtained in the last iteration.
In some possible embodiments, the training of the originally generated countermeasure network in an iterative manner based on the sample image and the reference image is performed until a first convergence condition is met, the network training module being configured to:
For each image to be measured, acquiring a first loss function between the image to be measured and a target image corresponding to the image to be measured; the target image is a reference image with the same video time sequence as the sample image corresponding to the image to be detected;
and if the first loss function is larger than a first threshold value, adjusting the model parameters according to the first loss function, regenerating random noise according to the adjusted model parameters, and determining an image to be detected based on the regenerated random noise until the first loss function corresponding to the image to be detected is not larger than the first threshold value.
In some possible embodiments, the training of the originally generated countermeasure network in an iterative manner based on the video under test and the reference video is performed until a second convergence condition is met, the network training module being configured to:
determining a second loss function between the video to be detected and the reference video;
and if the second loss function is larger than a second threshold, adjusting the model parameters according to the second loss function, regenerating random noise according to the adjusted model parameters, and determining the video to be detected based on the regenerated random noise until the second loss function of the video to be detected is not larger than the second threshold.
In some possible embodiments, the determining the video to be tested according to the model parameters obtained in the last iteration is performed, and the network training module is configured to:
dividing random noise generated by the model parameters based on the number of images of the sample images to obtain noise segments with the same number as the images; wherein each sample image corresponds to a unique noise segment;
for each sample image, adding a corresponding noise segment to the sample image to obtain an image to be synthesized;
determining video time sequence positions of sample images corresponding to the images to be synthesized in the sample video, sequencing the images to be synthesized according to the video time sequence positions, and integrating the sequenced images to be synthesized into the video to be synthesized.
In some possible embodiments, the emotional noise is determined according to the following manner:
and after determining the generated countermeasure network according to the model parameters obtained in the last iteration, taking random noise generated by the model parameters as emotion noise corresponding to the emotion characteristics.
In some possible embodiments, the apparatus further comprises a video acquisition module for determining the video to be processed by:
Receiving audio to be processed and at least one image to be processed containing the target object;
inputting the audio to be processed and the image to be processed into a video generation network to generate a multi-frame virtual image based on the facial feature points after the facial feature points of the target object are identified through the video generation network; each virtual image comprises the target object, and the position coordinates of the target object, which at least have the same facial feature point in each virtual image, are different;
and integrating the audio to be processed and the multi-frame virtual image into the video to be processed.
In some possible embodiments, performing the generating a multi-frame virtual image based on the facial feature points, the video acquisition module is configured to:
determining a first motion track of a first feature point in the facial feature points according to semantic content of the audio to be processed, and determining the target number of virtual images to be generated according to the first motion track; wherein the first feature point is a feature point located in a lip region in the facial feature point;
determining a second motion trail of a second feature point according to the target quantity, wherein the second feature point is a feature point which is positioned outside the lip region in the facial feature point;
The virtual image is generated based on the first motion profile and the second motion profile.
In some possible embodiments, performing the generating the virtual image based on the first motion profile and the second motion profile, the video acquisition module is configured to:
determining a first position coordinate of the first feature point in each virtual image according to the first motion trail and the target quantity, and determining a second position coordinate of the second feature point in each virtual image according to the second motion trail and the target quantity;
copying the image to be processed into a plurality of frames of copied images with the target quantity;
for each duplicate image, moving a first feature point in the duplicate image to the first position coordinate of the duplicate image and moving the second feature point to the second position coordinate of the duplicate image to obtain the virtual image.
In a third aspect, an embodiment of the present application provides an electronic device, including:
a memory for storing program instructions;
a processor for invoking program instructions stored in the memory and executing the steps comprised by the method according to any of the first aspects in accordance with the obtained program instructions.
In a fourth aspect, embodiments of the present application provide a computer readable storage medium storing a computer program comprising program instructions which, when executed by a computer, cause the computer to perform the method of any one of the first aspects.
In a fifth aspect, embodiments of the present application provide a computer program product comprising: computer program code which, when run on a computer, causes the computer to perform the method of any of the first aspects.
Drawings
Fig. 1 is a schematic view of an application scenario provided in an embodiment of the present application;
FIG. 2 is a flowchart of a method for generating video of a target object according to an embodiment of the present application;
FIG. 3 is a schematic diagram of motion trail prediction according to an embodiment of the present application;
FIG. 4 is a schematic diagram of virtual image generation according to an embodiment of the present application;
fig. 5 is a schematic diagram of video splitting according to an embodiment of the present application;
FIG. 6 is a schematic diagram of a generating countermeasure network according to an embodiment of the present application;
FIG. 7 is a schematic diagram of comparing a generated video with a real video according to an embodiment of the present application;
Fig. 8 is a block diagram of a video generating apparatus 800 of a target object according to an embodiment of the present application;
fig. 9 is a schematic diagram of an electronic device according to an embodiment of the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application, and it is apparent that the described embodiments are only some embodiments of the present application, not all embodiments of the present application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application. Embodiments of the application and features of the embodiments may be combined with one another arbitrarily without conflict. Also, while a logical order of illustration is depicted in the flowchart, in some cases the steps shown or described may be performed in a different order than presented.
The terms "first" and "second" in the description and claims of the application and in the above-mentioned figures are used for distinguishing between different objects and not for describing a particular sequential order. Furthermore, the term "include" and any variations thereof are intended to cover non-exclusive inclusion. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those listed steps or elements but may include other steps or elements not listed or inherent to such process, method, system, article, or apparatus. The term "plurality" in the present application may mean at least two, for example, two, three or more, and embodiments of the present application are not limited thereto.
In the technical scheme of the application, the data is collected, transmitted, used and the like, and all meet the requirements of national relevant laws and regulations.
As mentioned above, the related art mostly adopts a deep learning approach to generate the speaker face video. Although the motion trajectory of the lips in the video can be well controlled to correspond to the audio content, the semantic emotion in the audio content is not considered, so the generated video suffers from problems such as stiff facial expressions and facial expressions that do not match the semantic emotion; the overall effect of the video is therefore not natural enough, which affects the user experience.
To solve the above problems, the inventive concept of the present application is: based on the emotion characteristics contained in the audio content of the video to be processed, the video to be processed is cut into video clips corresponding to the emotion characteristics. The method comprises the steps of inputting a video segment and emotion characteristics of the video segment into a generating countermeasure network, dividing the video segment into a plurality of frames of first images according to video duration through the generating countermeasure network, and adding emotion noise corresponding to the emotion characteristics to each first image so that facial expressions of target objects in a second image after noise processing have the emotion characteristics. And finally, sequencing the second images after noise processing according to the video time sequence position of the first images in the video to be processed, and integrating the sequenced second images into the video. According to the flow, the facial expression of the target object in the video to be processed is adjusted according to the emotional characteristics of the video audio content to be processed, so that the emotional characteristics of the target object in the video, which are possessed by the facial expression, are consistent with the audio content, and the sense of realism of the video is improved.
Referring to fig. 1, an application scenario is schematically shown according to an embodiment of the present application.
As shown in fig. 1, the application scenario may include, for example, a network 10, a smart device 20, and a server 30. Wherein: the smart device 20 includes various electronic devices with man-machine interaction capability, such as the smart computer 20_1, the smart phone 20_2, and the smart television 20_n shown in fig. 1.
In the application scenario illustrated in fig. 1, a user may send a request for generating a speaker face video through the smart device 20. In operation, the user uploads his or her own photo (a selfie) and a piece of audio to the smart device 20. The smart device 20 transmits the received selfie and audio to the server 30 through the network 10.
The server 30 stores a neural network model for generating speaker face videos. The server 30 inputs the selfie and the audio into the video generation model, performs semantic recognition on the audio through the video generation model, and predicts the lip movement track of the user when speaking the audio. Then, after the facial features of the user in the selfie are identified, the remaining facial features other than the lip region are adjusted frame by frame to generate the speaker face video. The server 30 sends the speaker face video to the smart device 20 over the network 10 for the user to view.
In some possible embodiments, the neural network model determines the duration of the video to be generated according to the length of the lip motion trajectory, and determines the number of image frames to be included in the face video according to that duration. Specifically, a corresponding number of virtual images is generated based on the selfie. The lip features of the user in each virtual image are first determined according to the lip motion track, and then the remaining facial features outside the lip region of the user are adjusted by adding random noise to each virtual image, thereby obtaining a dynamic face video. Finally, the audio is embedded into the face video to obtain the speaker face video.
It should be noted that only a single server is described in detail in the description of the present application, but it should be understood by those skilled in the art that the server 30 shown in fig. 1 is intended to represent the operation of the server according to the technical solution of the present application. The details of a single server are provided for convenience of explanation at least, and are not meant to imply limitations on the number, type, location, etc. of servers. It should be noted that the underlying concepts of the exemplary embodiments of this application are not altered if additional modules are added to or individual modules are removed from the illustrated environment.
Fig. 2 schematically illustrates a flowchart of a method for generating video of a target object according to an embodiment of the present application. As shown in fig. 2, the method comprises the following steps:
step 201: carrying out semantic recognition on audio content of a video to be processed, and determining emotion characteristics contained in the audio content; wherein each frame of video image of the video to be processed contains the same target object;
the speaker face video technique is used to generate a dynamic video that dynamically recites the audio content through the face, given audio and individual face images. Based on the above, the embodiment of the application aims to generate the video to be processed, of which the lip movement track is matched with the audio content, in a deep learning mode. And then, generating an antagonism network to carry out fine processing on the facial expression of the target object in the video to be processed, so that the facial expression of the target object is matched with the emotion characteristics of the audio, and the sense of reality of the video is improved.
For the video generation network, a plurality of sample images marked with facial feature points can be used in a training stage to train the video generation network so as to enable the video generation network to have the capability of identifying the facial feature points of the image to be processed. Furthermore, the video generation network has the capability of obtaining the lip movement track corresponding to the audio information based on the input audio information through semantic training, the converged video generation network can accurately identify the facial feature points of the target object in the image to be input, and the mouth shape change of the target object when describing the audio content is obtained through semantic identification of the audio content.
Thus, the audio to be processed and the image to be processed are input into the pre-trained video generation network, facial feature points of the target object can be identified through the video generation network, and a multi-frame virtual image can be generated based on the facial feature points. And finally integrating the audio to be processed and the multi-frame virtual image into a video to be processed.
In implementation, the video generation network determines a first motion trajectory of a first feature point in the facial feature points according to the semantic content of the audio to be processed. The first feature points are the feature points located in the lip region among the facial feature points. In this way, the mouth shape change of the target object when describing the semantic content, namely the first motion trajectory, is obtained. Further, the target number of virtual images to be generated can be determined according to the first motion trajectory, and after the second motion trajectory of the second feature point is determined according to the target number, the virtual images are generated based on the first motion trajectory and the second motion trajectory. It should be understood that the second feature points are the remaining facial feature points located outside the lip region.
It should be noted that, to enhance the sense of reality of the video, each virtual image to be generated includes the target object, and the target object has at least different position coordinates of the same facial feature point in each virtual image. That is, the target object has a difference in facial expression within the screen of each virtual image.
Specifically, taking the semantic content of the audio to be processed as "two and three" as an example, the first motion track of the first feature point corresponding to the semantic content is shown in fig. 3, and according to the conversion rule of the mouth shape and the frame number, it can be determined that 3 seconds are required for describing the semantic content, and one second corresponds to 8 frames of virtual images, that is, taking the semantic content as "two and three" as an example, 24 frames of virtual images for synthesizing the video to be processed need to be generated. After knowing that the target number is 24 frames, 24 sections of noise segments meeting 24 frames can be split from random noise generated according to model parameters of the video generation network, and then the 24 sections of noise segments correspond to 24 frames of virtual images to be generated according to the splitting sequence. It will be appreciated that the purpose of adding the noise segment is to change the position of the second feature point of the target object so as to produce a change in the remaining facial expression outside the lip region of the target object in the 24-frame virtual image. The change of the facial expression is the second motion trail of the second characteristic point in the 24 frames of virtual images.
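The arithmetic of this example can be written out directly: assuming 8 frames per second and a 3-second utterance, 24 virtual frames are needed, and the first motion trajectory is sampled at 24 instants; the linear interpolation below is only one assumed way of doing that sampling.

```python
import numpy as np

def plan_virtual_frames(utterance_seconds: float, frames_per_second: int,
                        lip_trajectory: np.ndarray) -> np.ndarray:
    """lip_trajectory: (num_trajectory_keyframes, num_lip_points, 2) coordinates."""
    target_count = int(round(utterance_seconds * frames_per_second))  # e.g. 3 s * 8 fps = 24
    # Sample the first motion trajectory at target_count evenly spaced instants to get
    # the lip (first feature point) coordinates of every virtual frame.
    source_t = np.linspace(0.0, 1.0, lip_trajectory.shape[0])
    target_t = np.linspace(0.0, 1.0, target_count)
    per_frame = np.empty((target_count,) + lip_trajectory.shape[1:])
    for point in range(lip_trajectory.shape[1]):
        for axis in range(2):
            per_frame[:, point, axis] = np.interp(target_t, source_t,
                                                  lip_trajectory[:, point, axis])
    return per_frame  # first position coordinates for each of the 24 virtual frames

# plan_virtual_frames(3.0, 8, trajectory) returns 24 sets of lip coordinates
```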
In some possible embodiments, the second motion trajectory of the second feature point may be determined by a preset facial feature variation curve. The facial feature change curve is a change track of fixed facial feature points, and is used for predicting the corresponding positions of the facial feature points under the corresponding frame number based on the position coordinates of the given facial feature points. For example, knowing that the virtual image to be generated is 24 frames, the position coordinates of the second feature point in each frame of virtual image in the 24 frames of virtual images can be predicted based on the facial feature change curve.
Then, a first position coordinate of the first feature point in each virtual image can be determined according to the first motion track and the number of targets, and a second position coordinate of the second feature point in each virtual image can be determined according to the second motion track and the number of targets. After the image to be processed is copied into a target number of multi-frame copied images, for each copied image, a first characteristic point in the copied image is moved to a first position coordinate of the copied image, and a second characteristic point is moved to a second position coordinate of the copied image, so that a virtual image is obtained.
Specifically, taking the virtual images with the target number of 24 frames as an example, the first motion track may be divided according to 24 frames, so as to obtain the position coordinates of the first feature point in each frame of virtual image. And then the position coordinates of the second feature point are obtained in the same manner. Next, as shown in fig. 4, the feature points located in the rectangular region in fig. 4 are first feature points, and the remaining feature points are second feature points. In implementation, 24 copies of the image to be processed are obtained, and 24 copies of the image are obtained. As shown in the left side of fig. 4, the position coordinates of the facial feature point of the target object in each copied image are the same as those of the image to be processed, so that after the position coordinates of the first feature point and the second feature point in each virtual image are obtained through the above procedure, the virtual image can be obtained by moving the first feature point in the copied image to the first position coordinates and moving the second feature point to the second position coordinates. And finally integrating the obtained 24 frames of virtual images into a video, and embedding the audio to be processed into the video to obtain the video to be processed of the target object.
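A condensed sketch of this frame-assembly step follows; the image warp itself is abstracted behind a hypothetical warp_to_landmarks callable, since the application does not specify how the pixels are actually moved.

```python
import numpy as np

def assemble_virtual_frames(image_to_process: np.ndarray,
                            first_coords: np.ndarray,    # (24, n_lip_points, 2)
                            second_coords: np.ndarray,   # (24, n_other_points, 2)
                            warp_to_landmarks) -> list:
    """Copy the input image once per frame and move its facial feature points."""
    target_count = first_coords.shape[0]                 # 24 in the running example
    virtual_frames = []
    for k in range(target_count):
        copied = image_to_process.copy()                 # same landmarks as the original image
        # Move the lip-region (first) feature points and the remaining (second) feature
        # points to their per-frame coordinates to obtain the k-th virtual image.
        landmarks = np.concatenate([first_coords[k], second_coords[k]], axis=0)
        virtual_frames.append(warp_to_landmarks(copied, landmarks))
    return virtual_frames
```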
Step 202: if the audio content contains multiple emotion characteristics, performing preset cutting processing on the video to be processed based on the emotion characteristics to obtain multiple video clips; wherein the audio content of each video clip corresponds to an emotional characteristic;
in the related art, the semantic emotion in the audio content is not considered, so the generated video suffers from problems such as stiff facial expressions and facial expressions that do not match the semantic emotion; the overall effect of the video is therefore not natural enough, which affects the user experience. To solve this problem, the embodiment of the application adds, for different emotional characteristics, emotional noise corresponding to the emotional characteristics to the video to be processed, so that the facial expression of the target object in the video carries those emotional characteristics, improving the authenticity of the video.
Based on this, the embodiment of the application needs to adjust the facial expression of the target object according to the emotion characteristics contained in the audio content, so that the facial expression of the target object coincides with the emotion characteristics when the target object describes the audio content corresponding to the emotion characteristics. Therefore, by performing semantic recognition on the audio content of the video to be processed, the emotion that the target object should carry when describing the audio content, such as happiness, sadness, anger, neutral (i.e., substantially no emotion), etc., can be determined specifically based on the recognized speaking content and the speech intonation. Then clipping the video to be processed according to the number of the emotion features contained in the audio content, so that each video segment after clipping contains one emotion feature.
Taking a 10 second to-be-processed video as an example, as shown in fig. 5, assuming that the emotional characteristics of the to-be-processed video within 0 to 4 seconds are neutral and the emotional characteristics within 5 to 10 seconds are happy, the to-be-processed video can be divided into a video segment 1 of 0 to 4 seconds and a video segment 2 of 5 to 10 seconds according to the emotional characteristics corresponding to different video periods. Thus, the resulting overall audio content within each video clip corresponds to one emotional characteristic.
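The 10-second example can be reproduced with a small helper that groups consecutive seconds sharing one emotion label into clips; second-level granularity is an assumption made here for brevity, not the cutting rule of the application.

```python
from itertools import groupby

def cut_by_emotion(per_second_emotions):
    """per_second_emotions: one emotion label per second of the video to be processed."""
    clips, start = [], 0
    for emotion, run in groupby(per_second_emotions):
        length = len(list(run))
        clips.append({"start": start, "end": start + length - 1, "emotion": emotion})
        start += length
    return clips

# cut_by_emotion(['neutral'] * 5 + ['happy'] * 6)
# -> [{'start': 0, 'end': 4, 'emotion': 'neutral'},
#     {'start': 5, 'end': 10, 'emotion': 'happy'}]
```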
Step 203: inputting the video clips and the emotion characteristics of the video clips into a generation countermeasure network for each video clip, so that after the video clips are split into multiple frames of first images according to video duration through the generation countermeasure network, emotion noise corresponding to the emotion characteristics is added to each first image, and a second image corresponding to the first image is obtained; the emotional noise is used for adjusting the facial expression of the target object so that the facial expression has the emotional characteristics;
the generation countermeasure network in the embodiment of the application needs to train based on a large number of sample videos and reference videos in the training stage. The sample video may be a video to be processed which is generated by the video generation network and does not consider emotional characteristics, and the reference video is a real life video of the target object.
When the method is implemented, video frame images that do not contain the target object are first removed through a preprocessing operation, so that every video image of the sample video and the reference video contains the target object, which effectively improves the convergence efficiency of the model parameters. Second, the video durations of the sample video and the reference video need to be the same, and the emotional characteristics of the audio content of the sample video and of the reference video need to be the same. In this way, the facial expression of the target object in the sample images can be given the emotional characteristic by adding noise to the sample video.
The generation countermeasure network of the embodiment of the application is shown in FIG. 6 and comprises an image generator G_I, an image discriminator D_I, and a video discriminator D_V.
In the training stage, firstly, after a video to be processed and a reference video are input into an original generation countermeasure network, the network model respectively splits the sample video and the reference video to obtain a multi-frame sample image of the sample video and a multi-frame reference image of the reference video. That is, the video is split into corresponding video frame images, and each sample image corresponds to one reference image in video timing since the sample video has the same video duration as the reference video. Training the originally generated countermeasure network in an iterative mode based on the sample image and the reference image until a first convergence condition is met; and determining the video to be tested according to the model parameters obtained in the last iteration.
As shown in FIG. 6, the image generator G_I works together with a recurrent neural network R_M: R_M learns a series of motion codes from the noise input and uses them so that the image generator G_I can generate a continuous sequence of images. Specifically, after sample images 1 to n and reference images 1 to n are obtained through the above procedure, a set of random noise is generated according to the model parameters of the original generation countermeasure network, and the random noise is split, according to the number n of sample images, into n noise segments C_1 to C_n.
Then, corresponding noise segments are added to each sample image according to the splitting order; the purpose of adding the noise segments is to change the facial expression of the target object in the sample image. Specifically, the content subspace and the motion subspace can be modeled through Gaussian distributions to construct the image to be detected corresponding to the sample image. In FIG. 6, Z_c models the content subspace, which characterizes the processing of the lip region in the sample image; Z_M models the motion subspace, which characterizes the processing of the facial regions other than the lip region in the sample image.
Taking sample image 1 as an example, noise segment C_1 is added to sample image 1 to obtain image 1 to be detected. A first loss function between image 1 to be detected and reference image 1 is then obtained, which can be expressed specifically as the following formula (1):
$$T_I(D_I, G_I) = \mathbb{E}_{x \sim p_x}\!\left[-\log D_I(x)\right] + \mathbb{E}_{z \sim p_z}\!\left[-\log\!\left(1 - D_I\!\left(G_I(z)\right)\right)\right] \tag{1}$$
wherein T_I is the first loss function of the I-th image to be detected; x is a reference image drawn from the image distribution p_x, namely the reference image corresponding to the sample image; z is a random vector in the latent space of the image to be detected; G_I is the image generator; D_I is the image discriminator.
After the first loss function is determined according to formula (1), if the first loss function is greater than the first threshold, it indicates that the facial expression of the target object in image 1 to be detected differs considerably from reference image 1, that is, the facial expression of the target object in image 1 to be detected does not yet carry the emotional characteristic. In this case, the model parameters are adjusted according to the first loss function, and random noise is regenerated according to the adjusted model parameters. A new noise segment C_1 corresponding to sample image 1 is then obtained, and a new image to be detected is obtained based on the regenerated random noise, until the first loss function corresponding to the image to be detected is not greater than the first threshold; at that point the facial expression of the target object in the image to be detected carries the emotional characteristic.
By means of the method, each image to be detected and the corresponding reference image are compared, and after the model parameters are repeatedly corrected according to the first loss function obtained through comparison, the random noise generated by the model parameters can enable the target object in the sample image to have the same emotion characteristics as those in the reference image.
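Formula (1) is the standard image-level adversarial loss; a hedged PyTorch rendering is given below, assuming the discriminator d_img outputs probabilities in (0, 1) and that the real images and latent vectors are batched tensors.

```python
import torch

def image_adversarial_loss(d_img, g_img, real_images: torch.Tensor,
                           z: torch.Tensor) -> torch.Tensor:
    """T_I(D_I, G_I) = E_x[-log D_I(x)] + E_z[-log(1 - D_I(G_I(z)))]."""
    eps = 1e-8  # numerical safety for the logarithm
    real_term = -torch.log(d_img(real_images) + eps).mean()
    fake_term = -torch.log(1.0 - d_img(g_img(z)) + eps).mean()
    return real_term + fake_term
```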
Since the image discriminator D_I discriminates individual images, the images generated after the model parameters are corrected on the basis of D_I already have good static characteristics. To further improve the authenticity of the generated video, the video discriminator D_V is also needed: the video to be detected, composed of the images to be detected that have passed the image discriminator D_I, is compared with the reference video, and the model parameters are corrected according to the comparison result to improve the dynamic characteristics of the video, so that the generated video is more realistic.
As shown in FIG. 6, after the network model has converged with respect to the image discriminator D_I as described above, the random noise generated by the model parameters can be segmented based on the number of sample images, so that noise segments equal in number to the images are obtained. Then, for each sample image, the corresponding noise segment is added to the sample image to obtain the images to be synthesized X_1 to X_n, which represent the samples generated by the trained image generator G_I from the content and motion vectors. The video timing positions, in the sample video, of the sample images corresponding to the images to be synthesized are then determined, and the images to be synthesized are sorted according to these positions. Finally, the sorted images to be synthesized are integrated into the video to be detected.
Further, training the originally generated countermeasure network in an iterative mode based on the video to be detected and the reference video until a second convergence condition is met; and determining and generating an countermeasure network according to the model parameters obtained in the last iteration. In practice, a second loss function between the video under test and the reference video may be determined. And if the second loss function is larger than the second threshold, adjusting the model parameters according to the second loss function, regenerating random noise according to the adjusted model parameters, and determining the video to be detected based on the regenerated random noise until the second loss function of the video to be detected is not larger than the second threshold.
The above process of adjusting the model parameters based on the image discriminator D_I and the video discriminator D_V can be expressed with the following formulas. If the video length of the video to be processed is denoted by K, the recurrent neural network R_M is run for K steps in total, each step taking a random variable ε as input, and the finally generated video can be expressed as the following formula (2):

$$\tilde{V} = \left[\, G_I\!\left(Z_c, Z_M^{(1)}\right),\ G_I\!\left(Z_c, Z_M^{(2)}\right),\ \ldots,\ G_I\!\left(Z_c, Z_M^{(K)}\right) \right] \tag{2}$$

wherein \tilde{V} characterizes the finally generated video; G_I is the image generator; Z_c models the content subspace; Z_M^{(1)} to Z_M^{(K)} are the motion codes produced over the K executed steps. The entire generation countermeasure network is modeled as the following minimax problem, formula (3):

$$\max_{G_I,\, R_M}\ \min_{D_I,\, D_V}\ F_V\!\left(D_I, D_V, G_I, R_M\right) \tag{3}$$
wherein F_V(D_I, D_V, G_I, R_M) is the loss function of the generation countermeasure network as a whole, and it can be expressed specifically as the following formula (4):

$$\begin{aligned} F_V(D_I, D_V, G_I, R_M) =\;& \mathbb{E}_{V}\!\left[-\log D_I\!\left(S_I(V)\right)\right] + \mathbb{E}_{\tilde{V}}\!\left[-\log\!\left(1 - D_I\!\left(S_I(\tilde{V})\right)\right)\right] \\ +\;& \mathbb{E}_{V}\!\left[-\log D_V\!\left(S_T(V)\right)\right] + \mathbb{E}_{\tilde{V}}\!\left[-\log\!\left(1 - D_V\!\left(S_T(\tilde{V})\right)\right)\right] \end{aligned} \tag{4}$$

wherein the first term E_V[-log D_I(S_I(V))] characterizes the loss on the reference image; the second term E_{\tilde{V}}[-log(1 - D_I(S_I(\tilde{V})))] characterizes the loss on the image to be detected; the third term E_V[-log D_V(S_T(V))] characterizes the loss on the reference video; and the fourth term E_{\tilde{V}}[-log(1 - D_V(S_T(\tilde{V})))] characterizes the loss on the finally generated video. R_M is the recurrent neural network; G_I is the image generator; V denotes the reference target, being the reference image in the first term and the reference video in the third term; \tilde{V} denotes the sample object, being the image to be detected in the second term and the finally generated video in the fourth term. S_I and S_T are two random sampling functions: given an input video, S_I outputs a single frame of the video, and S_T randomly returns T consecutive frames of the video.
D_I and D_V in the first and third terms of formula (4) learn from the input single frame and consecutive frames respectively, while D_I and D_V in the second and fourth terms discriminate the generated single frame and consecutive frames, outputting 1 if the input is judged real and 0 if it is judged fake. The whole network is trained continuously with an alternating gradient update algorithm: D_I and D_V are updated while G_I and R_M are held fixed, and in alternation G_I and R_M are updated while D_I and D_V are held fixed, thereby obtaining an image generator G_I with converged model parameters and completing the training of the originally generated countermeasure network.
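The alternating update can be sketched with PyTorch-style modules as below; the networks, their optimizers and a loss_fn implementing formula (4) are assumed to be supplied by the caller, none of the signatures come from the application itself, and negating the loss for the generator step is one simple way to realize the maximization in formula (3).

import torch

def alternating_update(d_i, d_v, g_i, r_m, opt_d, opt_g,
                       real_frames, real_clips, z_content, eps_seq, loss_fn):
    """One round of the alternating gradient update for formulas (3) and (4)."""
    # Step 1: fix G_I and R_M, update D_I and D_V.
    with torch.no_grad():
        motion = r_m(eps_seq)                                   # K motion vectors
        fake_clip = torch.stack([g_i(z_content, m) for m in motion], dim=0)
    opt_d.zero_grad()
    d_loss = loss_fn(d_i, d_v, real_frames, real_clips, fake_clip)
    d_loss.backward()
    opt_d.step()

    # Step 2: fix D_I and D_V, update G_I and R_M (maximize F_V, hence the minus).
    opt_g.zero_grad()
    motion = r_m(eps_seq)
    fake_clip = torch.stack([g_i(z_content, m) for m in motion], dim=0)
    g_loss = -loss_fn(d_i, d_v, real_frames, real_clips, fake_clip)
    g_loss.backward()
    opt_g.step()
    return d_loss.item(), g_loss.item()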
Thus, through the above step 203, the generated countermeasure network splits the video clip into multiple frames of first images according to the video duration and adds the emotional noise corresponding to the emotional characteristics to each first image, after which the following step 204 is performed.
Step 204: and sequencing the second images corresponding to the first images according to the video time sequence positions of the first images in the video to be processed, and integrating the sequenced second images into a video.
As mentioned above, the video clip is split into first images according to the video duration, so each split first image has a sorting position in the video timing. The second images corresponding to the first images are therefore sorted according to the video timing positions of the first images in the video to be processed, and the sorted second images are then integrated into a video. In addition, in the video to be processed the audio content is already consistent with the motion track of the lips of the target object in each frame of video image (namely the first motion track obtained through the video generation network). Therefore, the video integrated from the second images can still use the audio of the video to be processed, which ensures that each video frame image corresponds to the audio content.
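A minimal sketch of this integration step is given below; OpenCV is assumed for writing the sorted frames, the separately installed ffmpeg command-line tool is assumed for copying the audio of the video to be processed back in, and all file names are placeholders.

import subprocess
import cv2

def integrate_video(second_images, timing_positions, audio_path, out_path, fps=25):
    """Order the second images by the video timing of their first images,
    write them as a video, then reuse the original audio track.
    Frames are assumed to be uint8 BGR arrays of identical size."""
    ordered = [img for _, img in sorted(zip(timing_positions, second_images),
                                        key=lambda pair: pair[0])]
    h, w = ordered[0].shape[:2]
    writer = cv2.VideoWriter("frames_only.mp4",
                             cv2.VideoWriter_fourcc(*"mp4v"), fps, (w, h))
    for frame in ordered:
        writer.write(frame)
    writer.release()
    # Mux in the audio content of the video to be processed unchanged.
    subprocess.run(["ffmpeg", "-y", "-i", "frames_only.mp4", "-i", audio_path,
                    "-c:v", "copy", "-c:a", "aac", "-shortest", out_path],
                   check=True)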
The actual effect of the embodiment of the application at the application stage is shown in fig. 7. The upper rectangular frame of fig. 7 shows the frame-by-frame comparison between the original video frame data of a female target object and the video frame data generated by the generation countermeasure network of the application; the lower rectangular frame shows the same comparison for a male target object. The comparison shows that the technical scheme provided by the embodiment of the application can adjust the facial expression of the target object in the video to be processed according to the emotional characteristics of the audio content of the video to be processed, so that the emotional characteristics conveyed by the facial expression of the target object in the video are consistent with the audio content, thereby improving the sense of reality of the video.
Based on the same inventive concept, the embodiment of the present application further provides a video generating apparatus 800 of a target object, specifically as shown in fig. 8, including:
a feature extraction module 801 configured to perform semantic recognition on audio content of a video to be processed, and determine emotional features contained in the audio content; wherein each frame of video image of the video to be processed contains the same target object;
the video clipping module 802 is configured to perform preset clipping processing on the video to be processed based on the emotion characteristics if the audio content contains multiple emotion characteristics, so as to obtain multiple video clips, wherein the audio content of each video clip corresponds to one emotional characteristic (a clipping sketch is given after this module list);
The emotion adding module 803 is configured to execute, for each video clip, inputting the video clip and emotion characteristics of the video clip into a generation countermeasure network, so as to split the video clip into multiple frames of first images according to video duration through the generation countermeasure network, and then adding emotion noise corresponding to the emotion characteristics to each first image to obtain a second image corresponding to the first image; the emotional noise is used for adjusting the facial expression of the target object so that the facial expression has the emotional characteristics;
the video generating module 804 is configured to perform ranking of the second images corresponding to the first images according to the video time sequence positions of the first images in the video to be processed, and integrate the ranked second images into a video.
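Taking the video clipping module 802 as an example, the preset clipping can be sketched in Python as follows; the emotion_timeline format and all names are hypothetical, and adjacent segments sharing one emotional characteristic are assumed to be merged so that every clip maps to exactly one emotion.

def clip_by_emotion(emotion_timeline, min_len=0.5):
    """Sketch of the preset clipping performed by the video clipping module 802.

    emotion_timeline: list of (start_sec, end_sec, emotion) tuples obtained from
    semantic recognition of the audio content of the video to be processed.
    """
    clips = []
    for start, end, emotion in emotion_timeline:
        if clips and clips[-1][2] == emotion:
            clips[-1] = (clips[-1][0], end, emotion)   # extend the previous clip
        elif end - start >= min_len:
            clips.append((start, end, emotion))
    return clips  # [(clip_start, clip_end, emotion), ...]

For instance, clip_by_emotion([(0.0, 2.4, "happy"), (2.4, 5.0, "happy"), (5.0, 7.1, "sad")]) yields one "happy" clip spanning 0.0-5.0 s and one "sad" clip spanning 5.0-7.1 s.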
In some possible embodiments, the apparatus further comprises a network training module for deriving the generation of the countermeasure network by:
acquiring a sample video and a reference video; wherein each video image of the sample video and the reference video includes the target object; the video duration of the sample video is the same as that of the target video, and the emotional characteristics of the audio content of the sample video and the emotional characteristics of the audio content of the reference video are the same;
Splitting the sample video and the reference video respectively to obtain a multi-frame sample image of the sample video and a multi-frame reference image of the reference video; wherein each sample image corresponds to a reference image on video timing;
training an originally generated countermeasure network in an iterative mode based on the sample image and the reference image until a first convergence condition is met; determining a video to be tested according to the model parameters obtained in the last iteration;
training the original generation countermeasure network in an iterative mode based on the video to be detected and the reference video until a second convergence condition is met; and determining the generated countermeasure network according to the model parameters obtained in the last iteration.
In some possible embodiments, the training of the originally generated countermeasure network in an iterative manner based on the sample image and the reference image is performed until a first convergence condition is met, the network training module being configured to:
for each image to be measured, acquiring a first loss function between the image to be measured and a target image corresponding to the image to be measured; the target image is a reference image with the same video time sequence as the sample image corresponding to the image to be detected;
And if the first loss function is larger than a first threshold value, adjusting the model parameters according to the first loss function, regenerating random noise according to the adjusted model parameters, and determining an image to be detected based on the regenerated random noise until the first loss function corresponding to the image to be detected is not larger than the first threshold value.
In some possible embodiments, the training of the originally generated countermeasure network in an iterative manner based on the video under test and the reference video is performed until a second convergence condition is met, the network training module being configured to:
determining a second loss function between the video to be detected and the reference video;
and if the second loss function is larger than a second threshold, adjusting the model parameters according to the second loss function, regenerating random noise according to the adjusted model parameters, and determining the video to be detected based on the regenerated random noise until the second loss function of the video to be detected is not larger than the second threshold.
In some possible embodiments, the determining the video to be tested according to the model parameters obtained in the last iteration is performed, and the network training module is configured to:
Dividing random noise generated by the model parameters based on the number of images of the sample images to obtain noise segments with the same number as the images; wherein each sample image corresponds to a unique noise segment;
for each sample image, adding a corresponding noise segment to the sample image to obtain an image to be synthesized;
determining video time sequence positions of sample images corresponding to the images to be synthesized in the sample video, sequencing the images to be synthesized according to the video time sequence positions, and integrating the sequenced images to be synthesized into the video to be tested.
In some possible embodiments, the emotional noise is determined according to the following manner:
and after determining the generated countermeasure network according to the model parameters obtained in the last iteration, taking random noise generated by the model parameters as emotion noise corresponding to the emotion characteristics.
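One simple way to organize this, sketched below with hypothetical names, is a per-emotion lookup that stores the random noise produced by each converged network and adds it to a first image on demand.

import numpy as np

class EmotionNoiseBank:
    """Keep, per emotional characteristic, the random noise generated by the
    model parameters obtained in the last training iteration."""

    def __init__(self):
        self._noise = {}

    def register(self, emotion, converged_noise):
        # converged_noise: random noise generated by the converged model
        # parameters of the generated countermeasure network for this emotion.
        self._noise[emotion] = np.asarray(converged_noise)

    def add_to_image(self, emotion, first_image):
        """Add this emotion's noise to a first image to obtain the second image."""
        noise = np.resize(self._noise[emotion], first_image.shape)
        return first_image + noise

bank = EmotionNoiseBank()
bank.register("happy", np.random.default_rng(0).standard_normal(128))
second_image = bank.add_to_image("happy", np.zeros((64, 64, 3)))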
In some possible embodiments, the apparatus further comprises a video acquisition module for determining the video to be processed by:
receiving audio to be processed and at least one image to be processed containing the target object;
inputting the audio to be processed and the image to be processed into a video generation network to generate a multi-frame virtual image based on the facial feature points after the facial feature points of the target object are identified through the video generation network; each virtual image comprises the target object, and the position coordinates of the target object, which at least have the same facial feature point in each virtual image, are different;
And integrating the audio to be processed and the multi-frame virtual image into the video to be processed.
In some possible embodiments, performing the generating a multi-frame virtual image based on the facial feature points, the video acquisition module is configured to:
determining a first motion track of a first feature point in the facial feature points according to semantic content of the audio to be processed, and determining the target number of virtual images to be generated according to the first motion track; wherein the first feature point is a feature point located in a lip region in the facial feature point;
determining a second motion trail of a second feature point according to the target quantity, wherein the second feature point is a feature point which is positioned outside the lip region in the facial feature point;
the virtual image is generated based on the first motion profile and the second motion profile.
In some possible embodiments, performing the generating the virtual image based on the first motion profile and the second motion profile, the video acquisition module is configured to:
determining a first position coordinate of the first feature point in each virtual image according to the first motion trail and the target quantity, and determining a second position coordinate of the second feature point in each virtual image according to the second motion trail and the target quantity;
Copying the image to be processed into a plurality of frames of copied images with the target quantity;
for each duplicate image, moving a first feature point in the duplicate image to the first position coordinate of the duplicate image and moving the second feature point to the second position coordinate of the duplicate image to obtain the virtual image.
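The above flow of the video acquisition module can be sketched as follows, with landmark detection, audio analysis and image warping left abstract and "moving a feature point" simplified to overwriting its coordinates (the data layout and every name are assumptions).

import copy

def generate_virtual_images(image_to_process, face_landmarks,
                            first_track, second_track_fn):
    """Sketch of generating multi-frame virtual images from one image to be processed.

    image_to_process: dict holding the base pixels and its facial feature points
    face_landmarks  : {"lip": [...indices...], "other": [...indices...]}
    first_track     : per-frame lip-point coordinates derived from the semantic
                      content of the audio to be processed (first motion track)
    second_track_fn : given the target number of frames, returns per-frame
                      coordinates of the feature points outside the lip region
    """
    target_count = len(first_track)                 # target number of virtual images
    second_track = second_track_fn(target_count)    # second motion track
    virtual_images = []
    for k in range(target_count):
        frame = copy.deepcopy(image_to_process)     # copy the image to be processed
        for idx, coord in zip(face_landmarks["lip"], first_track[k]):
            frame["points"][idx] = coord            # move the first feature points
        for idx, coord in zip(face_landmarks["other"], second_track[k]):
            frame["points"][idx] = coord            # move the second feature points
        virtual_images.append(frame)
    return virtual_images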
An electronic device 130 according to this embodiment of the application is described below with reference to fig. 9. The electronic device 130 shown in fig. 9 is merely an example, and should not be construed as limiting the functionality and scope of use of embodiments of the present application.
As shown in fig. 9, the electronic device 130 is embodied in the form of a general-purpose electronic device. Components of electronic device 130 may include, but are not limited to: the at least one processor 131, the at least one memory 132, and a bus 133 connecting the various system components, including the memory 132 and the processor 131.
Bus 133 represents one or more of several types of bus structures, including a memory bus or memory controller, a peripheral bus, a processor, and a local bus using any of a variety of bus architectures.
Memory 132 may include readable media in the form of volatile memory such as Random Access Memory (RAM) 1321 and/or cache memory 1322, and may further include Read Only Memory (ROM) 1323.
Memory 132 may also include a program/utility 1325 having a set (at least one) of program modules 1324. Such program modules 1324 include, but are not limited to: an operating system, one or more application programs, other program modules, and program data; each of these examples, or some combination of them, may include an implementation of a network environment.
The electronic device 130 may also communicate with one or more external devices 134 (e.g., keyboard, pointing device, etc.), one or more devices that enable a user to interact with the electronic device 130, and/or any device (e.g., router, modem, etc.) that enables the electronic device 130 to communicate with one or more other electronic devices. Such communication may occur through an input/output (I/O) interface 135. Also, electronic device 130 may communicate with one or more networks such as a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network, such as the Internet, through network adapter 136. As shown, network adapter 136 communicates with other modules for electronic device 130 over bus 133. It should be appreciated that although not shown, other hardware and/or software modules may be used in connection with electronic device 130, including, but not limited to: microcode, device drivers, redundant processors, external disk drive arrays, RAID systems, tape drives, data backup storage systems, and the like.
In an exemplary embodiment, a computer readable storage medium is also provided, such as the memory 132, comprising instructions executable by the processor 131 of the electronic device 130 to perform the above-described method. Alternatively, the computer readable storage medium may be a ROM, a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, or the like.
In an exemplary embodiment, a computer program product is also provided, comprising a computer program/instruction which, when executed by the processor 131, implements the method for generating video of a target object as provided by the present application.

In an exemplary embodiment, aspects of the method for generating video of a target object provided by the present application may also be implemented in the form of a program product, which includes program code for causing a computer device to execute the steps in the method for generating video of a target object according to the various exemplary embodiments of the present application described above when the program product is run on the computer device.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. The readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium would include the following: an electrical connection having one or more wires, a portable disk, a hard disk, random Access Memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The program product for video generation of a target object of an embodiment of the present application may employ a portable compact disc read-only memory (CD-ROM), comprise program code, and be run on an electronic device. However, the program product of the present application is not limited thereto; in this document, a readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.
The readable signal medium may include a data signal propagated in baseband or as part of a carrier wave with readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out operations of the present application may be written in any combination of one or more programming languages, including object oriented programming languages such as Java or C++ and conventional procedural programming languages such as the "C" language or similar programming languages. The program code may execute entirely on the consumer electronic device, partly on the consumer electronic device, as a stand-alone software package, partly on the consumer electronic device and partly on a remote electronic device, or entirely on the remote electronic device or server. In the case of a remote electronic device, the remote electronic device may be connected to the consumer electronic device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external electronic device (for example, through the internet using an internet service provider).
It should be noted that although several units or sub-units of the apparatus are mentioned in the above detailed description, such a division is merely exemplary and not mandatory. Indeed, the features and functions of two or more of the elements described above may be embodied in one element in accordance with embodiments of the present application. Conversely, the features and functions of one unit described above may be further divided into a plurality of units to be embodied.
Furthermore, although the operations of the methods of the present application are depicted in the drawings in a particular order, this is not required to either imply that the operations must be performed in that particular order or that all of the illustrated operations be performed to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step to perform, and/or one step decomposed into multiple steps to perform.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable image scaling device to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable image scaling device, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable image scaling device to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable image scaling apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present application have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the following claims be interpreted as including the preferred embodiments and all such alterations and modifications as fall within the scope of the application.
It will be apparent to those skilled in the art that various modifications and variations can be made to the present application without departing from the spirit or scope of the application. Thus, it is intended that the present application also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.

Claims (10)

1. A method for generating video of a target object, the method comprising:
Carrying out semantic recognition on audio content of a video to be processed, and determining emotion characteristics contained in the audio content; wherein each frame of video image of the video to be processed contains the same target object;
if the audio content contains multiple emotion characteristics, performing preset cutting processing on the video to be processed based on the emotion characteristics to obtain multiple video clips; wherein the audio content of each video clip corresponds to an emotional characteristic;
inputting the video clips and the emotion characteristics of the video clips into a generation countermeasure network for each video clip, so that after the video clips are split into multiple frames of first images according to video duration through the generation countermeasure network, emotion noise corresponding to the emotion characteristics is added to each first image, and a second image corresponding to the first image is obtained; the emotional noise is used for adjusting the facial expression of the target object so that the facial expression has the emotional characteristics;
and sequencing the second images corresponding to the first images according to the video time sequence positions of the first images in the video to be processed, and integrating the sequenced second images into a video.
2. The method of claim 1, wherein the generating the antagonism network is trained by:
acquiring a sample video and a reference video; wherein each video image of the sample video and the reference video includes the target object; the video duration of the sample video is the same as that of the target video, and the emotional characteristics of the audio content of the sample video and the emotional characteristics of the audio content of the reference video are the same;
splitting the sample video and the reference video respectively to obtain a multi-frame sample image of the sample video and a multi-frame reference image of the reference video; wherein each sample image corresponds to a reference image on video timing;
training an originally generated countermeasure network in an iterative mode based on the sample image and the reference image until a first convergence condition is met; determining a video to be tested according to the model parameters obtained in the last iteration;
training the original generation countermeasure network in an iterative mode based on the video to be detected and the reference video until a second convergence condition is met; and determining the generated countermeasure network according to the model parameters obtained in the last iteration.
3. The method of claim 2, wherein iteratively training the originally generated countermeasure network based on the sample image and the reference image until a first convergence condition is satisfied, comprises:
for each image to be measured, acquiring a first loss function between the image to be measured and a target image corresponding to the image to be measured; the target image is a reference image with the same video time sequence as the sample image corresponding to the image to be detected;
and if the first loss function is larger than a first threshold value, adjusting the model parameters according to the first loss function, regenerating random noise according to the adjusted model parameters, and determining an image to be detected based on the regenerated random noise until the first loss function corresponding to the image to be detected is not larger than the first threshold value.
4. A method according to claim 3, wherein the iteratively training the originally generated countermeasure network based on the video under test and the reference video until a second convergence condition is satisfied comprises:
determining a second loss function between the video to be detected and the reference video;
And if the second loss function is larger than a second threshold, adjusting the model parameters according to the second loss function, regenerating random noise according to the adjusted model parameters, and determining the video to be detected based on the regenerated random noise until the second loss function of the video to be detected is not larger than the second threshold.
5. The method according to claim 2, wherein determining the video to be tested according to the model parameters obtained in the last iteration comprises:
dividing random noise generated by the model parameters based on the number of images of the sample images to obtain noise segments with the same number as the images; wherein each sample image corresponds to a unique noise segment;
for each sample image, adding a corresponding noise segment to the sample image to obtain an image to be synthesized;
determining video time sequence positions of sample images corresponding to the images to be synthesized in the sample video, sequencing the images to be synthesized according to the video time sequence positions, and integrating the sequenced images to be synthesized into the video to be tested.
6. The method of claim 2, wherein the emotional noise is determined according to the following:
And after determining the generated countermeasure network according to the model parameters obtained in the last iteration, taking random noise generated by the model parameters as emotion noise corresponding to the emotion characteristics.
7. The method according to any of claims 1-6, wherein the video to be processed is determined according to the following manner:
receiving audio to be processed and at least one image to be processed containing the target object;
inputting the audio to be processed and the image to be processed into a video generation network to generate a multi-frame virtual image based on the facial feature points after the facial feature points of the target object are identified through the video generation network; each virtual image comprises the target object, and the position coordinates of the target object, which at least have the same facial feature point in each virtual image, are different;
and integrating the audio to be processed and the multi-frame virtual image into the video to be processed.
8. The method of claim 7, wherein the generating a multi-frame virtual image based on the facial feature points comprises:
determining a first motion track of a first feature point in the facial feature points according to semantic content of the audio to be processed, and determining the target number of virtual images to be generated according to the first motion track; wherein the first feature point is a feature point located in a lip region in the facial feature point;
Determining a second motion trail of a second feature point according to the target quantity, wherein the second feature point is a feature point which is positioned outside the lip region in the facial feature point;
the virtual image is generated based on the first motion profile and the second motion profile.
9. The method of claim 8, wherein the generating the virtual image based on the first motion profile and the second motion profile comprises:
determining a first position coordinate of the first feature point in each virtual image according to the first motion trail and the target quantity, and determining a second position coordinate of the second feature point in each virtual image according to the second motion trail and the target quantity;
copying the image to be processed into a plurality of frames of copied images with the target quantity;
for each duplicate image, moving a first feature point in the duplicate image to the first position coordinate of the duplicate image and moving the second feature point to the second position coordinate of the duplicate image to obtain the virtual image.
10. An electronic device, comprising:
A memory for storing program instructions;
a processor for invoking program instructions stored in the memory and for performing the steps comprised in the method according to any of claims 1-9 in accordance with the obtained program instructions.
CN202210272888.9A 2022-03-18 2022-03-18 Video generation method and related device of target object Pending CN116824650A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210272888.9A CN116824650A (en) 2022-03-18 2022-03-18 Video generation method and related device of target object

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210272888.9A CN116824650A (en) 2022-03-18 2022-03-18 Video generation method and related device of target object

Publications (1)

Publication Number Publication Date
CN116824650A true CN116824650A (en) 2023-09-29

Family

ID=88118967

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210272888.9A Pending CN116824650A (en) 2022-03-18 2022-03-18 Video generation method and related device of target object

Country Status (1)

Country Link
CN (1) CN116824650A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117523051A (en) * 2024-01-08 2024-02-06 南京硅基智能科技有限公司 Method, device, equipment and storage medium for generating dynamic image based on audio

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117523051A (en) * 2024-01-08 2024-02-06 南京硅基智能科技有限公司 Method, device, equipment and storage medium for generating dynamic image based on audio

Similar Documents

Publication Publication Date Title
CN111341341B (en) Training method of audio separation network, audio separation method, device and medium
CN108986186B (en) Method and system for converting text into video
CN113709561B (en) Video editing method, device, equipment and storage medium
CN107423398B (en) Interaction method, interaction device, storage medium and computer equipment
CN110148400B (en) Pronunciation type recognition method, model training method, device and equipment
CN111901626B (en) Background audio determining method, video editing method, device and computer equipment
CN110622176A (en) Video partitioning
US11875269B2 (en) Large scale generative neural network model with inference for representation learning using adversarial training
US11663823B2 (en) Dual-modality relation networks for audio-visual event localization
US11276419B2 (en) Synchronized sound generation from videos
US10614347B2 (en) Identifying parameter image adjustments using image variation and sequential processing
CN106227792B (en) Method and apparatus for pushed information
CN110166650A (en) Generation method and device, the computer equipment and readable medium of video set
WO2023197979A1 (en) Data processing method and apparatus, and computer device and storage medium
CN108304376A (en) Determination method, apparatus, storage medium and the electronic device of text vector
KR102192210B1 (en) Method and Apparatus for Generation of LSTM-based Dance Motion
CN114443899A (en) Video classification method, device, equipment and medium
Dang et al. Dynamic multi-rater gaussian mixture regression incorporating temporal dependencies of emotion uncertainty using kalman filters
CN113704506A (en) Media content duplication eliminating method and related device
US20220358658A1 (en) Semi Supervised Training from Coarse Labels of Image Segmentation
CN116824650A (en) Video generation method and related device of target object
CN116958342A (en) Method for generating actions of virtual image, method and device for constructing action library
US20110305366A1 (en) Adaptive Action Detection
CN113962417A (en) Video processing method and device, electronic equipment and storage medium
CN115292528B (en) Intelligent operation method, equipment and storage medium for new media video

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication