CN111368137A - Video generation method and device, electronic equipment and readable storage medium - Google Patents

Video generation method and device, electronic equipment and readable storage medium

Info

Publication number
CN111368137A
CN111368137A (application CN202010088384.2A)
Authority
CN
China
Prior art keywords: image, face, mesh, face image, dimensional
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010088384.2A
Other languages
Chinese (zh)
Inventor
彭哲 (Peng Zhe)
鲍冠伯 (Bao Guanbo)
刘玉强 (Liu Yuqiang)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Baidu Online Network Technology Beijing Co Ltd
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202010088384.2A priority Critical patent/CN111368137A/en
Publication of CN111368137A publication Critical patent/CN111368137A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/70 - Information retrieval of video data
    • G06F 16/73 - Querying
    • G06F 16/738 - Presentation of query results
    • G06F 16/739 - Presentation of query results in form of a video summary, e.g. the video summary being a video sequence, a composite still image or having synthesized frames
    • G06F 16/78 - Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F 16/783 - Retrieval using metadata automatically derived from the content
    • G06F 16/7834 - Retrieval using metadata automatically derived from the content, using audio features
    • G06F 16/7837 - Retrieval using metadata automatically derived from the content, using objects detected or recognised in the video content
    • G06F 16/784 - Retrieval using metadata automatically derived from the content, the detected or recognised objects being people
    • G06F 16/7847 - Retrieval using metadata automatically derived from the content, using low-level visual features of the video content
    • G06F 16/7857 - Retrieval using low-level visual features of the video content, using texture
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/25 - Fusion techniques
    • G06F 18/253 - Fusion techniques of extracted features

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Library & Information Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computational Linguistics (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The application discloses a video generation method and apparatus, an electronic device and a readable storage medium, and relates to computer vision technology. In a specific implementation, the method includes: acquiring a three-dimensional face mesh of a face image of a target object in a video to be generated, and a face image texture of the face image; obtaining each expression parameter of the face image according to audio features of the audio content of the target object; obtaining each rendered face image of the three-dimensional face mesh according to the three-dimensional face mesh of the face image, each expression parameter of the face image, and the face image texture of the face image; performing fusion processing on each rendered face image of the three-dimensional face mesh and each video frame image of a template video to obtain fused video frame images; and synthesizing the fused video frame images to generate a fused video.

Description

Video generation method and device, electronic equipment and readable storage medium
Technical Field
The present disclosure relates to computer technologies, and in particular, to a method and an apparatus for generating a video, an electronic device, and a readable storage medium.
Background
With the continued development of the internet, terminals can integrate more and more functions, and the applications (APPs) running on them have diversified accordingly. Some applications involve expressing content as video, and such videos, in all their various forms, are typically recorded manually.
However, relying entirely on manual recording makes video generation inefficient. This is especially true for videos with fixed content expression, such as news broadcasting or subject teaching: because the content these videos express is fixed, a fully manual recording approach is not only inefficient but also causes unnecessary waste of human resources.
Disclosure of Invention
Aspects of the present application provide a video generation method, apparatus, electronic device and readable storage medium, so as to improve efficiency of video generation.
In one aspect of the present application, a method for generating a video is provided, including:
acquiring a three-dimensional face mesh of a face image of a target object in a video to be generated and face image textures of the face image;
obtaining each expression parameter of the facial image according to the audio characteristics of the audio content of the target object;
obtaining each rendering face image of the three-dimensional face mesh according to the three-dimensional face mesh of the face image, each expression parameter of the face image and the face image texture of the face image;
performing fusion processing on each rendered face image of the three-dimensional face mesh and each video frame image of a template video to obtain each fused video frame image after fusion;
and synthesizing the fused video frame images to generate a fused video.
The above-described aspect and any possible implementation manner further provide an implementation manner, where the obtaining a three-dimensional face mesh of a face image of a target object in a video to be generated and a face image texture of the face image includes:
obtaining a three-dimensional face mesh of the face image according to the image content of the target object;
and obtaining the facial image texture of the facial image according to the projection relation between the three-dimensional facial mesh of the facial image and the image content of the target object.
The above-described aspect and any possible implementation manner further provide an implementation manner in which the obtaining of the three-dimensional face mesh of the face image according to the image content of the target object includes:
acquiring image characteristic information of the target object according to the image content of the target object; acquiring a three-dimensional face mesh of the face image by utilizing a pre-trained first neural network according to the image characteristic information of the target object; or
According to the image content of the target object, a pre-trained second neural network is utilized to obtain a three-dimensional face mesh of the face image; or
and performing modeling processing on the image content of the target object by using manually calibrated key points, so as to obtain a three-dimensional face mesh of the face image.
The above-described aspect and any possible implementation manner further provide an implementation manner, where obtaining a three-dimensional face mesh of the face image by using a pre-trained first neural network according to image feature information of the target object, includes:
according to the image characteristic information of the target object, utilizing a pre-trained first neural network to obtain each shape parameter of the face image;
and obtaining a three-dimensional face mesh of the face image according to the shape parameters of the face image.
The above-described aspect and any possible implementation manner further provide an implementation manner, where the obtaining a three-dimensional face mesh of a face image of a target object in a video to be generated and a face image texture of the face image includes:
acquiring the face shape and the face texture of the basic cartoon image;
and obtaining a three-dimensional face mesh of the face image and the face image texture of the face image according to the face shape and the face texture of the basic cartoon image.
The above-mentioned aspect and any possible implementation manner further provide an implementation manner, where after obtaining each expression parameter of the facial image according to an audio feature of the audio content of the target object, the method further includes:
and adjusting each expression parameter of the facial image by using the image data of the target object and the audio data corresponding to the image data to obtain each expression parameter of the adjusted facial image.
The above-mentioned aspect and any possible implementation manner further provide an implementation manner, where obtaining each rendered face image of the three-dimensional face mesh according to the three-dimensional face mesh of the face image, each expression parameter of the face image, and a face image texture of the face image, includes:
obtaining a three-dimensional tooth mesh and a tooth image texture corresponding to each expression parameter, according to each expression parameter of the face image;
and obtaining each rendering face image of the three-dimensional face mesh according to the three-dimensional face mesh of the face image, each expression parameter of the face image, the face image texture of the face image, the three-dimensional tooth mesh and the tooth image texture.
In another aspect of the present application, there is provided a video generation apparatus including:
the mesh texture acquiring unit is used for acquiring a three-dimensional face mesh of a face image of a target object in a video to be generated and a face image texture of the face image;
the expression parameter acquisition unit is used for acquiring each expression parameter of the facial image according to the audio characteristics of the audio content of the target object;
the mesh rendering unit is used for obtaining each rendered face image of the three-dimensional face mesh according to the three-dimensional face mesh of the face image, each expression parameter of the face image and the face image texture of the face image;
the image fusion unit is used for performing fusion processing on each rendered face image of the three-dimensional face mesh and each video frame image of the template video so as to obtain each fused video frame image after fusion;
and the video synthesis unit is used for synthesizing each fused video frame image to generate a fused video.
The above-mentioned aspects and any possible implementation further provide an implementation of the mesh texture obtaining unit, which is specifically configured to
Obtaining a three-dimensional face mesh of the face image according to the image content of the target object; and
and obtaining the facial image texture of the facial image according to the projection relation between the three-dimensional facial mesh of the facial image and the image content of the target object.
The above-mentioned aspects and any possible implementation further provide an implementation of the mesh texture obtaining unit, which is specifically configured to
Acquiring image characteristic information of the target object according to the image content of the target object; acquiring a three-dimensional face mesh of the face image by utilizing a pre-trained first neural network according to the image characteristic information of the target object; or
According to the image content of the target object, a pre-trained second neural network is utilized to obtain a three-dimensional face mesh of the face image; or
And performing modeling processing on the image content of the target object by using manually calibrated key points, so as to obtain a three-dimensional face mesh of the face image.
The above-mentioned aspects and any possible implementation further provide an implementation of the mesh texture obtaining unit, which is specifically configured to
According to the image characteristic information of the target object, utilizing a pre-trained first neural network to obtain each shape parameter of the face image; and
and obtaining a three-dimensional face mesh of the face image according to the shape parameters of the face image.
The above-mentioned aspects and any possible implementation further provide an implementation of the mesh texture obtaining unit, which is specifically configured to
Acquiring the face shape and the face texture of the basic cartoon image; and
and obtaining a three-dimensional face mesh of the face image and the face image texture of the face image according to the face shape and the face texture of the basic cartoon image.
The above-mentioned aspects and any possible implementation manners further provide an implementation manner, where the expression parameter obtaining unit is further configured to
And adjusting each expression parameter of the facial image by using the image data of the target object and the audio data corresponding to the image data to obtain each expression parameter of the adjusted facial image.
The above-described aspects and any possible implementation further provide an implementation of the mesh rendering unit, particularly for
Obtaining a three-dimensional tooth mesh and a tooth image texture corresponding to each expression parameter, according to each expression parameter of the face image; and
and obtaining each rendering face image of the three-dimensional face mesh according to the three-dimensional face mesh of the face image, each expression parameter of the face image, the face image texture of the face image, the three-dimensional tooth mesh and the tooth image texture.
In another aspect of the present invention, an electronic device is provided, including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to cause the at least one processor to perform the method of the aspects and any possible implementation described above.
In another aspect of the invention, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of the above described aspects and any possible implementation.
According to the technical scheme, the expression parameters of the face image of the target object corresponding to the audio content are obtained based on the audio content of the target object in the video to be generated, and then, the rendering face images of the three-dimensional face mesh are obtained by further utilizing the expression parameters of the face image and combining the three-dimensional face mesh and the face image texture of the face image of the target object, so that a fused video can be obtained according to the rendering face images and the video frame images of the template video without manual participation, and the video generation efficiency is effectively improved.
In addition, by adopting the technical scheme provided by the application, a character image can be virtualized: the three-dimensional face mesh and the face image texture of the face image of the target object can typically be obtained from images of one or several target objects, and the expressions generated from the audio are applied to the three-dimensional face mesh of the face image, thereby realizing video generation for a virtual character.
In addition, by adopting the technical scheme provided by the application, a three-dimensional cartoon image can be virtualized in addition to the virtual character image. The three-dimensional cartoon image can be driven by a group of prefabricated shape fusion deformers (blend shape), and the expression parameters of the face image can be transferred to the three-dimensional cartoon image by labeling corresponding vertices between the three-dimensional cartoon image and the three-dimensional face mesh of the face image, thereby realizing video generation for a virtual three-dimensional cartoon image.
In addition, by adopting the technical scheme provided by the application, the user experience can be effectively improved.
Further effects of the above aspects or possible implementations will be described below in connection with specific embodiments.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description show some embodiments of the present application, and that those skilled in the art can obtain other drawings from them without inventive labor. The drawings are only for the purpose of illustration and are not to be construed as limiting the present application. Wherein:
fig. 1 is a schematic flowchart of a video generation method according to an embodiment of the present application;
fig. 2 is a schematic structural diagram of a video generation apparatus according to another embodiment of the present application;
fig. 3 is a schematic diagram of an electronic device for implementing a video generation method provided by an embodiment of the present application.
Detailed Description
The following description of exemplary embodiments of the present application, taken in conjunction with the accompanying drawings, includes various details of those embodiments to aid understanding; these details are to be considered exemplary only. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of the present application. Likewise, descriptions of well-known functions and constructions are omitted from the following description for clarity and conciseness.
It is to be understood that the embodiments described are only a few embodiments of the present application and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
It should be noted that the terminal involved in the embodiments of the present application may include, but is not limited to, a mobile phone, a Personal Digital Assistant (PDA), a wireless handheld device, a tablet computer, a personal computer (PC), an MP3 player, an MP4 player, a wearable device (e.g., smart glasses, a smart watch, a smart bracelet, etc.), and the like.
In addition, the term "and/or" herein describes only an association relationship between associated objects, and indicates that three relationships may exist; for example, A and/or B may mean: A exists alone, A and B exist simultaneously, or B exists alone. In addition, the character "/" herein generally indicates that the former and latter associated objects are in an "or" relationship.
Fig. 1 is a schematic flowchart of a video generation method according to an embodiment of the present application, as shown in fig. 1.
101. The method comprises the steps of obtaining a three-dimensional face mesh of a face image of a target object in a video to be generated and face image textures of the face image.
102. And obtaining each expression parameter of the facial image according to the audio characteristics of the audio content of the target object.
103. And obtaining each rendering face image of the three-dimensional face mesh according to the three-dimensional face mesh of the face image, each expression parameter of the face image and the face image texture of the face image.
104. And performing fusion processing on each rendered face image of the three-dimensional face grid and each video frame image of the template video to obtain each fused video frame image after fusion.
105. And synthesizing the fused video frame images to generate a fused video.
It should be noted that part or all of the execution subjects 101 to 105 may be an application located at the local terminal, or may also be a functional unit such as a plug-in or Software Development Kit (SDK) set in the application located at the local terminal, or may also be a processing engine located in a server on the network side, or may also be a distributed system located on the network side, for example, a processing engine or a distributed system in a video processing platform on the network side, and the like, which is not particularly limited in this embodiment.
It is to be understood that the application may be a native app (native app) installed on the terminal, or may also be a web page program (webApp) of a browser on the terminal, which is not limited in this embodiment.
Therefore, each expression parameter of the face image of the target object corresponding to the audio content is obtained based on the audio content of the target object in the video to be generated, and then each rendering face image of the three-dimensional face grid is obtained by further utilizing each expression parameter of the face image and combining the three-dimensional face grid and the face image texture of the face image of the target object, so that a fusion video can be obtained according to each rendering face image and each video frame image of the template video without manual participation, and the video generation efficiency is effectively improved.
In the present application, the target object in the video to be generated may be a character image or may also be a cartoon image, which is not particularly limited in the present application.
Optionally, in a possible implementation manner of this embodiment, for a scene in which the target object is a character, the image content of the target object may be acquired in advance.
In a specific implementation process, one or more images of the target object provided by an operator may be specifically obtained, and further, the image content of the target object may be obtained according to the obtained one or more images of the target object.
In another specific implementation process, a video of the target object provided by the operator may be obtained, and video frame images may then be extracted from that video. Specifically, a video decoding method (e.g., x264) may be used to decode the video into a video data stream, the image data of each frame may be acquired from the data stream, and an image coding method (e.g., png or jpg) may then be applied to the image data of each frame to obtain the video frame images of the video.
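As an illustration only, a minimal sketch of this decode-then-encode pipeline might look as follows. The use of OpenCV is an assumption of the sketch; the patent names x264 and png/jpg merely as examples of decoding and coding methods.

```python
import cv2  # OpenCV; an assumed tooling choice, not prescribed by the patent

def extract_frames(video_path: str, out_dir: str) -> int:
    """Decode a video into per-frame PNG images, as described above."""
    cap = cv2.VideoCapture(video_path)   # video decoding into a frame stream
    idx = 0
    while True:
        ok, frame = cap.read()           # acquire the image data of each frame
        if not ok:
            break
        # image-code each frame (png here) to obtain a video frame image
        cv2.imwrite(f"{out_dir}/frame_{idx:06d}.png", frame)
        idx += 1
    cap.release()
    return idx                           # number of frames extracted
```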
After obtaining the video frame image of the video, the image content of the target object may be obtained according to the video frame image.
Specifically, in 101, after the image content of the target object is obtained, the three-dimensional face mesh of the face image may be obtained according to the image content of the target object, and then the face image texture of the face image, which may be referred to herein as the UV map of the face image, may be obtained according to the projection relationship between the three-dimensional face mesh of the face image and the image content of the target object.
For example, image feature information of the target object may be obtained according to image content of the target object, and then, a three-dimensional face mesh of the face image may be obtained by using a pre-trained first neural network according to the image feature information of the target object.
Specifically, each shape parameter of the face image, that is, the shape of the face, may be obtained by using a pre-trained first neural network according to the image feature information of the target object. Further, a three-dimensional face mesh of the face image may be obtained from each shape parameter of the face image.
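The patent does not fix the internal form of the face model; the sketch below assumes a linear morphable-model construction, in which the predicted shape parameters weight a fixed shape basis. All array shapes and names are hypothetical.

```python
import numpy as np

def mesh_from_shape_params(mean_shape: np.ndarray,
                           shape_basis: np.ndarray,
                           shape_params: np.ndarray) -> np.ndarray:
    """Hypothetical linear face model: vertices = mean + basis @ params.

    mean_shape:   (3N,)   mean face vertex coordinates
    shape_basis:  (3N, K) per-parameter displacement directions
    shape_params: (K,)    predicted by the first neural network
    """
    vertices = mean_shape + shape_basis @ shape_params
    return vertices.reshape(-1, 3)  # N vertices in 3D: the face mesh
```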
The first neural network may be a Recurrent Neural Network (RNN), or may also be another neural network, which is not particularly limited in this embodiment.
Specifically, continuous video data may be used for model training to obtain the first neural network. During training, constraints can be applied through labeled key points of the human face: the key region positions of the face, including the eyebrows, eyes, nose, mouth, face contour and the like, can be located in the face image of each video frame image of the video data.
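As a hedged illustration of such a key-point constraint, a simple loss term could penalize the distance between projected mesh vertices at annotated indices and the labeled two-dimensional key points. The vertex-to-landmark mapping and the use of PyTorch are assumptions of this sketch.

```python
import torch

def keypoint_loss(projected_vertices: torch.Tensor,   # (B, N, 2) projected mesh points
                  gt_landmarks: torch.Tensor,         # (B, L, 2) labeled key points
                  landmark_vertex_ids: torch.Tensor   # (L,) hypothetical vertex indices
                  ) -> torch.Tensor:
    """Constrain mesh vertices at landmark indices to the labeled key points
    (eyebrows, eyes, nose, mouth, face contour)."""
    pred = projected_vertices[:, landmark_vertex_ids]  # (B, L, 2)
    return torch.mean((pred - gt_landmarks) ** 2)      # simple MSE constraint
```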
Or, for another example, the three-dimensional face mesh of the face image may be obtained by using a pre-trained second neural network according to the image content of the target object.
The second neural network may be a Recurrent Neural Network (RNN), or may also be another neural network, which is not particularly limited in this embodiment.
Specifically, a large number of face images can be used for model training to obtain a second neural network based on image features.
Or, as another example, the image content of the target object may be modeled using manually calibrated key points, so as to obtain a three-dimensional face mesh of the face image.
Optionally, in a possible implementation manner of this embodiment, for a scene in which the target object is a cartoon image, a shape fusion deformer (blend shape) of the target object may be obtained in advance.
Specifically, in 101, after the shape fusion deformer (blend shape) of the target object is obtained, the face shape and face texture of the basic cartoon image may be obtained based on it, and then the three-dimensional face mesh of the face image and the face image texture of the face image may be obtained according to the face shape and face texture of the basic cartoon image.
Optionally, in a possible implementation manner of this embodiment, before 102, the audio content of the target object may be further acquired.
In a specific implementation process, the text of the target object may be obtained, and the text may then be converted to speech using a text-to-speech technique, so as to obtain the audio content of the target object.
In another specific implementation process, the audio content of the target object may be directly obtained.
Optionally, in a possible implementation manner of this embodiment, in 102, each expression parameter of the facial image may be obtained by using a pre-trained third neural network according to an audio feature of the audio content of the target object.
The expression parameters refer to the weight coefficients of each expression base of an expression base matrix, wherein the expression base matrix is composed of a group of expression bases for describing facial expression changes.
The third neural network may be a Recurrent Neural Network (RNN), or may also be another neural network, which is not particularly limited in this embodiment.
Specifically, model training may be performed using continuous video data of a training object and the audio data corresponding to that video data, so as to obtain the third neural network. During training, constraints can be applied through labeled key points of the human face, and further through the texture changes of the continuous video. The key region positions of the face, including the eyebrows, eyes, nose, mouth, face contour and the like, can be located in the face image of each video frame image of the video data.
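A minimal sketch of such an audio-to-expression network follows, assuming a GRU-based RNN implemented in PyTorch; the layer sizes, feature dimension, and number of expression bases are all hypothetical, since the patent only requires an RNN or another network.

```python
import torch
import torch.nn as nn

class AudioToExpression(nn.Module):
    """Illustrative RNN mapping audio features to expression-base weights."""

    def __init__(self, audio_dim: int = 80, hidden: int = 256, n_bases: int = 51):
        super().__init__()
        self.rnn = nn.GRU(audio_dim, hidden, batch_first=True)  # temporal model
        self.head = nn.Linear(hidden, n_bases)                  # weight coefficients

    def forward(self, audio_feats: torch.Tensor) -> torch.Tensor:
        # audio_feats: (batch, time, audio_dim) audio features of the content
        h, _ = self.rnn(audio_feats)       # (batch, time, hidden)
        return self.head(h)                # (batch, time, n_bases) expression parameters
```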
In practical applications, the expressions of different character images may differ slightly even though the overall deformation is similar. Therefore, each obtained expression parameter of the facial image can be finely adjusted using the image data of the target object and the audio data corresponding to that image data, so that the adjusted expression parameters better conform to the characteristics of the target object.
Optionally, in a possible implementation manner of this embodiment, after 102, the image data of the target object and the audio data corresponding to the image data may be further utilized to perform adjustment processing on each expression parameter of the facial image, so as to obtain each expression parameter of the facial image after adjustment.
Specifically, the expression parameters of the facial image may be adjusted by using a pre-trained fourth neural network according to the expression parameters of the facial image, so as to obtain the adjusted expression parameters of the facial image.
The fourth neural network may be a Recurrent Neural Network (RNN), or may also be another neural network, which is not particularly limited in this embodiment.
Specifically, the image data of the target object and the audio data corresponding to the image data may be specifically adopted to perform model training to obtain the fourth neural network.
Optionally, in a possible implementation manner of this embodiment, in 103, each expression parameter of the facial image may be specifically applied to a three-dimensional face mesh of the facial image, so as to obtain each three-dimensional face mesh of the facial image with an expression. Then, projection processing may be performed on each three-dimensional face mesh of the expressive face image. Finally, the projection result of the projection processing may be rendered by using the facial image texture of the facial image, so as to obtain each rendered facial image of the three-dimensional facial mesh.
In the application process, each expression parameter of the facial image can be used as the weight coefficient of the corresponding expression base of the facial image, and the expression bases are linearly superposed with these weights, so that an expressive three-dimensional mesh can be obtained on the three-dimensional face mesh of the facial image.
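A sketch of this linear superposition with NumPy, under the assumption that the expression bases are stored as per-vertex displacements from the neutral mesh:

```python
import numpy as np

def apply_expression(base_vertices: np.ndarray,  # (N, 3) neutral face mesh
                     expr_bases: np.ndarray,     # (M, N, 3) expression-base displacements
                     expr_params: np.ndarray     # (M,) weight coefficients from the audio model
                     ) -> np.ndarray:
    """Linear blendshape superposition: neutral mesh plus weighted bases."""
    # weighted sum over the M expression bases, added onto the neutral mesh
    return base_vertices + np.tensordot(expr_params, expr_bases, axes=1)
```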
Generally, each expression parameter of the facial image is learned with a specific character image as the training object, using the mapping between audio features and expression parameters learned from continuous video data of that training object and the corresponding audio data. The obtained expression parameters can therefore be applied directly to each three-dimensional face mesh of the facial image obtained when that character image is the target object.
For each three-dimensional face mesh of a facial image obtained when a cartoon image is the target object, however, the obtained expression parameters cannot be applied directly. Instead, corresponding vertices must be labeled between each three-dimensional face mesh obtained for the cartoon image and each three-dimensional face mesh obtained for the character image, so as to obtain the correspondence between the cartoon image's meshes and the character image's meshes. Based on this correspondence, the obtained expression parameters can be applied to each three-dimensional face mesh of the facial image obtained when the cartoon image is the target object, thereby transferring the expression parameters from the character image to the cartoon image.
specifically, after each three-dimensional face mesh of the expressive face image is obtained, it may be further projected onto a two-dimensional plane according to the coordinate relationship of the front view of the camera.
For example, it may be assumed that the eye of an observer viewing each three-dimensional face mesh of the expressive face image is a point, and that the line from this point to the center of each mesh is the z-axis; the projection then eliminates the z-axis coordinate, giving the coordinates of each mesh point of each three-dimensional face mesh of the expressive face image in a two-dimensional coordinate plane. Because the projected coordinate points of each three-dimensional face mesh form triangles, and each coordinate point can sample a texture color from the face image texture, after the projection result is obtained, the pixel values inside each triangle can be filled by interpolating the pixel values of its three corner points. The interpolation may be linear interpolation of the pixel values, spline interpolation of the pixel values, or another interpolation of the pixel values, which is not particularly limited in this embodiment.
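To make the projection-and-fill step concrete, here is a small illustrative sketch (not the patent's implementation): the z coordinate is dropped, and a pixel inside a projected triangle is linearly interpolated from the texture colors sampled at its three corners using barycentric weights.

```python
import numpy as np

def project_orthographic(vertices: np.ndarray) -> np.ndarray:
    """Eliminate the z-axis coordinate of each mesh point, as described above."""
    return vertices[:, :2]  # (N, 3) -> (N, 2)

def fill_pixel(px: float, py: float,
               tri_xy: np.ndarray,      # (3, 2) projected triangle corners
               tri_colors: np.ndarray   # (3, 3) texture colors sampled at the corners
               ) -> np.ndarray:
    """Linear (barycentric) interpolation of the three corner pixel values."""
    (ax, ay), (bx, by), (cx, cy) = tri_xy
    denom = (by - cy) * (ax - cx) + (cx - bx) * (ay - cy)
    w0 = ((by - cy) * (px - cx) + (cx - bx) * (py - cy)) / denom
    w1 = ((cy - ay) * (px - cx) + (ax - cx) * (py - cy)) / denom
    w2 = 1.0 - w0 - w1                  # barycentric weights sum to one
    return w0 * tri_colors[0] + w1 * tri_colors[1] + w2 * tri_colors[2]
```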
Optionally, in a possible implementation manner of this embodiment, in 104, a preset face image mask may be specifically used to perform fusion processing on each rendered face image of the three-dimensional face mesh and each video frame image of the template video, so as to obtain each fused video frame image after fusion.
Specifically, the fusion processing may be performed using various existing fusion methods, for example alpha blending, Poisson blending, and the like.
The face image mask used in this embodiment may be a face image mask of a fixed shape preset according to the facial feature positions of the face texture, where a face image range reserved for fusion is preset.
Therefore, by using a face image mask of fixed shape, preset according to the facial feature positions of the face texture, to match each rendered face image of the three-dimensional face mesh with each video frame image of the template video during fusion, the blending between the edge of the user's face in the fused video and the template background of the template video can be effectively improved.
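A minimal alpha-blending sketch (one of the fusion methods named above), assuming the mask is a float array in [0, 1] whose reserved face region is 1:

```python
import numpy as np

def alpha_blend(rendered_face: np.ndarray,    # (H, W, 3) rendered face image
                template_frame: np.ndarray,   # (H, W, 3) template video frame
                face_mask: np.ndarray         # (H, W) float mask in [0, 1]
                ) -> np.ndarray:
    """Blend the rendered face into the template frame through the mask.

    Mask value 1 keeps the rendered face, 0 keeps the template background;
    intermediate values feather the face edge into the background.
    """
    m = face_mask[..., None]  # broadcast the single-channel mask over RGB
    return (m * rendered_face + (1.0 - m) * template_frame).astype(np.uint8)
```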
After the fused video frame images are obtained, image decoding processing can be performed on them to recover the raw image data of each frame; the per-frame image data is then spliced into a data stream, and video coding processing is performed to generate the fused video.
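A sketch of this splice-and-encode step using OpenCV's VideoWriter; the codec and frame rate here are assumptions, as the patent does not prescribe a specific encoder.

```python
import cv2

def frames_to_video(frame_paths: list[str], out_path: str, fps: float = 25.0) -> None:
    """Splice fused frame images back into an encoded video (a sketch)."""
    first = cv2.imread(frame_paths[0])            # image decoding of the first frame
    h, w = first.shape[:2]
    fourcc = cv2.VideoWriter_fourcc(*"mp4v")      # codec choice is an assumption
    writer = cv2.VideoWriter(out_path, fourcc, fps, (w, h))
    for p in frame_paths:
        writer.write(cv2.imread(p))               # decode image, append to the stream
    writer.release()                              # finalize video coding
```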
Generally, the obtained three-dimensional face mesh does not include the tooth region. Therefore, in the present application, the correspondence between each expression parameter and a three-dimensional tooth mesh and tooth image texture may be constructed in advance. Correspondingly, in a possible implementation manner of this embodiment, in 103, the three-dimensional tooth mesh and tooth image texture corresponding to each expression parameter may be obtained according to each expression parameter of the facial image and the pre-constructed correspondence, and then each rendered face image of the three-dimensional face mesh may be obtained according to the three-dimensional face mesh of the face image, each expression parameter of the face image, the face image texture of the face image, the three-dimensional tooth mesh, and the tooth image texture.
By adopting the technical scheme provided by the application, the labor cost of videos with fixed content expression can be effectively reduced. For example, videos whose content expression is news broadcasting or subject teaching require no video recording and no excessive manual participation; only a piece of text or a piece of audio is needed, so real-time broadcasting can be realized. In addition, the need for a visual persona in certain scenes that previously offered only voice broadcasting can also be met.
In the embodiment, each expression parameter of the face image of the target object corresponding to the audio content is obtained based on the audio content of the target object in the video to be generated, and then each rendering face image of the three-dimensional face mesh is obtained by further using each expression parameter of the face image and combining the three-dimensional face mesh and the face image texture of the face image of the target object, so that a fusion video can be obtained according to each rendering face image and each video frame image of the template video without manual participation, and the video generation efficiency is effectively improved.
In addition, by adopting the technical scheme provided by the application, a character image can be virtualized: the three-dimensional face mesh and the face image texture of the face image of the target object can typically be obtained from images of one or several target objects, and the expressions generated from the audio are applied to the three-dimensional face mesh of the face image, thereby realizing video generation for a virtual character.
In addition, by adopting the technical scheme provided by the application, a three-dimensional cartoon image can be virtualized in addition to the virtual character image. The three-dimensional cartoon image can be driven by a group of prefabricated shape fusion deformers (blend shape), and the expression parameters of the face image can be transferred to the three-dimensional cartoon image by labeling corresponding vertices between the three-dimensional cartoon image and the three-dimensional face mesh of the face image, thereby realizing video generation for a virtual three-dimensional cartoon image.
In addition, by adopting the technical scheme provided by the application, the user experience can be effectively improved.
It should be noted that, for simplicity of description, the above-mentioned method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present application is not limited by the order of acts described, as some steps may occur in other orders or concurrently depending on the application. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required in this application.
In the foregoing embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
Fig. 2 is a schematic structural diagram of a video generation apparatus according to another embodiment of the present application, as shown in fig. 2. The video generation apparatus 200 of the present embodiment may include a mesh texture acquisition unit 201, an expression parameter acquisition unit 202, a mesh rendering unit 203, an image fusion unit 204, and a video composition unit 205. The mesh texture acquiring unit 201 is configured to acquire a three-dimensional face mesh of a face image of a target object in a video to be generated and a face image texture of the face image; an expression parameter obtaining unit 202, configured to obtain each expression parameter of the facial image according to an audio feature of the audio content of the target object; a mesh rendering unit 203, configured to obtain each rendered facial image of the three-dimensional facial mesh according to the three-dimensional facial mesh of the facial image, each expression parameter of the facial image, and a facial image texture of the facial image; an image fusion unit 204, configured to perform fusion processing on each rendered face image of the three-dimensional face mesh and each video frame image of the template video to obtain each fused video frame image after fusion; a video synthesizing unit 205, configured to perform synthesizing processing on the fused video frame images to generate a fused video.
It should be noted that, part or all of the execution main body of the video generation apparatus provided in this embodiment may be an application located at the local terminal, or may also be a functional unit such as a plug-in or Software Development Kit (SDK) set in the application located at the local terminal, or may also be a processing engine located in a server on the network side, or may also be a distributed system located on the network side, for example, a processing engine or a distributed system in a video processing platform on the network side, and this embodiment is not particularly limited thereto.
It is to be understood that the application may be a native app (native app) installed on the terminal, or may also be a web page program (webApp) of a browser on the terminal, which is not limited in this embodiment.
Optionally, in a possible implementation manner of this embodiment, the mesh texture obtaining unit 201 may be specifically configured to obtain a three-dimensional face mesh of the face image according to the image content of the target object; and obtaining the facial image texture of the facial image according to the projection relation between the three-dimensional facial mesh of the facial image and the image content of the target object.
In a specific implementation process, the mesh texture obtaining unit 201 may be specifically configured to obtain image feature information of the target object according to the image content of the target object, and obtain a three-dimensional face mesh of the face image by using a pre-trained first neural network according to the image feature information of the target object; or obtain a three-dimensional face mesh of the face image by using a pre-trained second neural network according to the image content of the target object; or perform modeling processing on the image content of the target object using manually calibrated key points, so as to obtain the three-dimensional face mesh of the face image.
For example, the mesh texture obtaining unit 201 may be specifically configured to obtain each shape parameter of the face image by using a first neural network trained in advance according to the image feature information of the target object; and obtaining a three-dimensional face mesh of the face image according to each shape parameter of the face image.
Optionally, in a possible implementation manner of this embodiment, the mesh texture obtaining unit 201 may be specifically configured to obtain a face shape and a face texture of the basic cartoon image; and obtaining a three-dimensional face mesh of the face image and a face image texture of the face image according to the face shape and the face texture of the base cartoon image.
Optionally, in a possible implementation manner of this embodiment, the expression parameter acquiring unit 202 may be further configured to perform adjustment processing on each expression parameter of the facial image by using the image data of the target object and the audio data corresponding to the image data, so as to obtain each expression parameter of the facial image after adjustment.
Optionally, in a possible implementation manner of this embodiment, the mesh rendering unit 203 may be specifically configured to obtain, according to each expression parameter of the facial image, the three-dimensional tooth mesh and tooth image texture corresponding to each expression parameter; and obtain each rendered face image of the three-dimensional face mesh according to the three-dimensional face mesh of the face image, each expression parameter of the face image, the face image texture of the face image, the three-dimensional tooth mesh and the tooth image texture.
It should be noted that the method in the embodiment corresponding to fig. 1 may be implemented by the video generation apparatus provided in this embodiment. For a detailed description, reference may be made to relevant contents in the embodiment corresponding to fig. 1, and details are not described here.
In this embodiment, the expression parameter acquiring unit acquires, based on the audio content of the target object in the video to be generated, each expression parameter of the facial image of the target object corresponding to the audio content, and then the mesh rendering unit further acquires each rendered facial image of the three-dimensional facial mesh by using each expression parameter of the facial image and combining the three-dimensional facial mesh and the facial image texture of the facial image of the target object acquired by the mesh texture acquiring unit, so that the image fusion unit and the video synthesizing unit can acquire the fused video according to each video frame image of each rendered facial image and the template video without manual participation, thereby effectively improving the efficiency of video generation.
In addition, by adopting the technical scheme provided by the application, a character image can be virtualized: the three-dimensional face mesh and the face image texture of the face image of the target object can typically be obtained from images of one or several target objects, and the expressions generated from the audio are applied to the three-dimensional face mesh of the face image, thereby realizing video generation for a virtual character.
In addition, by adopting the technical scheme provided by the application, a three-dimensional cartoon image can be virtualized in addition to the virtual character image. The three-dimensional cartoon image can be driven by a group of prefabricated shape fusion deformers (blend shape), and the expression parameters of the face image can be transferred to the three-dimensional cartoon image by labeling corresponding vertices between the three-dimensional cartoon image and the three-dimensional face mesh of the face image, thereby realizing video generation for a virtual three-dimensional cartoon image.
In addition, by adopting the technical scheme provided by the application, the user experience can be effectively improved.
The present application also provides an electronic device and a non-transitory computer readable storage medium having computer instructions stored thereon, according to embodiments of the present application.
Fig. 3 is a schematic view of an electronic device for implementing the video generation method provided in the embodiment of the present application. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the present application that are described and/or claimed herein.
As shown in fig. 3, the electronic apparatus includes: one or more processors 301, a memory 302, and interfaces for connecting the various components, including high-speed interfaces and low-speed interfaces. The various components are interconnected using different buses and may be mounted on a common motherboard or in other manners as desired. The processor may process instructions for execution within the electronic device, including instructions stored in or on the memory to display graphical information of a Graphical User Interface (GUI) on an external input/output apparatus, such as a display device coupled to the interface. In other embodiments, multiple processors and/or multiple buses may be used, along with multiple memories, as desired. Also, multiple electronic devices may be connected, with each device providing portions of the necessary operations (e.g., as a server array, a group of blade servers, or a multi-processor system). In fig. 3, one processor 301 is taken as an example.
Memory 302 is a non-transitory computer readable storage medium as provided herein. The memory stores instructions executable by at least one processor to cause the at least one processor to perform the video generation method provided by the present application. The non-transitory computer-readable storage medium of the present application stores computer instructions for causing a computer to perform the method of generating a video provided by the present application.
The memory 302 is a non-transitory computer-readable storage medium, and can be used to store non-transitory software programs, non-transitory computer-executable programs, and units, such as program instructions/units corresponding to the video generation method in the embodiment of the present application (for example, the grid texture acquisition unit 201, the expression parameter acquisition unit 202, the grid rendering unit 203, the image fusion unit 204, and the video composition unit 205 shown in fig. 2). The processor 301 executes various functional applications of the server and data processing, i.e., implements the video generation method in the above-described method embodiment, by executing the non-transitory software programs, instructions, and units stored in the memory 302.
The memory 302 may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to use of an electronic device that implements the video generation method provided by the embodiment of the present application, and the like. Further, the memory 302 may include high speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory 302 may optionally include a memory remotely located from the processor 301, and these remote memories may be connected via a network to an electronic device implementing the video generation method provided by the embodiments of the present application. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The electronic device of the video generation method may further include: an input device 303 and an output device 304. The processor 301, the memory 302, the input device 303 and the output device 304 may be connected by a bus or other means, and fig. 3 illustrates the connection by a bus as an example.
The input device 303 may receive input numeric or character information and generate key signal inputs related to user settings and function control of an electronic apparatus implementing the video generation method provided by the embodiment of the present application, such as an input device of a touch screen, a keypad, a mouse, a track pad, a touch pad, a pointing stick, one or more mouse buttons, a track ball, a joystick, or the like. The output devices 304 may include a display device, auxiliary lighting devices (e.g., LEDs), and haptic feedback devices (e.g., vibrating motors), among others. The display device may include, but is not limited to, a Liquid Crystal Display (LCD), a Light Emitting Diode (LED) display, and a plasma display. In some implementations, the display device can be a touch screen.
Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, an Application Specific Integrated Circuit (ASIC), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
These computer programs (also referred to as programs, software, software applications, or code) include machine instructions for a programmable processor, and may be implemented using high-level procedural and/or object-oriented programming languages, and/or assembly/machine languages. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (e.g., magnetic disks, optical disks, memories, programmable logic devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as machine-readable signals. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a Cathode Ray Tube (CRT) or Liquid Crystal Display (LCD) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
According to the technical solution of the embodiments of the present application, each expression parameter of the face image of the target object corresponding to the audio content is obtained based on the audio content of the target object in the video to be generated; each rendered face image of the three-dimensional face mesh is then obtained by combining these expression parameters with the three-dimensional face mesh and the face image texture of the target object's face image; finally, a fused video is obtained from the rendered face images and the video frame images of the template video. Because no manual participation is required, video generation efficiency is effectively improved.
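By way of illustration only, the following Python sketch wires the five stages of this pipeline together with toy NumPy stand-ins. Every function body, array shape, and constant here is an assumption made purely for demonstration; it is not the application's actual implementation, which uses the neural networks, renderer, and fusion described in the embodiments.

```python
# Minimal illustrative sketch of the five-stage pipeline; all stage bodies
# are toy stand-ins (assumptions), only the data flow mirrors the description.
import numpy as np

def acquire_mesh_and_texture(image):
    """Stage 1 (mesh texture acquisition): stand-in mesh and per-vertex color."""
    mesh = np.random.rand(468, 3)       # toy 3D face mesh (V x 3); image unused here
    texture = np.random.rand(468, 3)    # toy per-vertex RGB texture
    return mesh, texture

def expression_params_from_audio(audio, n_params=10):
    """Stage 2 (expression parameters): toy mapping from per-frame audio energy."""
    frame_energy = np.abs(audio).reshape(-1, 160).mean(axis=1)  # 160-sample frames
    return np.outer(frame_energy, np.linspace(0.1, 1.0, n_params))

def render_faces(mesh, texture, params, size=64):
    """Stage 3 (mesh rendering): one toy RGB frame per expression-parameter vector."""
    return [np.full((size, size, 3), p.mean() * texture.mean()) for p in params]

def fuse(rendered, template_frames, alpha=0.7):
    """Stage 4 (image fusion): simple alpha blend with the template video frames."""
    return [alpha * r + (1 - alpha) * t for r, t in zip(rendered, template_frames)]

def synthesize(frames):
    """Stage 5 (video synthesis): here just stack frames into a T x H x W x 3 clip."""
    return np.stack(frames)

image = np.zeros((256, 256, 3))
audio = np.random.randn(160 * 50)                 # ~50 frames of toy audio
mesh, tex = acquire_mesh_and_texture(image)
params = expression_params_from_audio(audio)
rendered = render_faces(mesh, tex, params)
template = [np.ones((64, 64, 3)) * 0.5 for _ in rendered]
video = synthesize(fuse(rendered, template))
print(video.shape)                                # (50, 64, 64, 3)
```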
In addition, with the technical solution provided by the present application, a character can be virtualized: the three-dimensional face mesh and the face image texture of the face image of the target object can usually be obtained from one or several images of the target object, and the expression generated from the audio is applied to the three-dimensional face mesh of the face image, thereby realizing video generation for a virtual character.
In addition, with the technical solution provided by the present application, a three-dimensional cartoon character can be virtualized in addition to a virtual human character. The three-dimensional cartoon character can be driven by a set of prefabricated blend shapes (shape fusion deformers), and the expression parameters of the face image can be transferred to the three-dimensional cartoon character by annotating corresponding vertices between the three-dimensional cartoon model and the three-dimensional face mesh of the face image, thereby realizing video generation for a virtual three-dimensional cartoon character.
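By way of illustration only, one plausible reading of this transfer (an assumption here, not stated in the application) is a least-squares fit: find cartoon blend-shape weights whose displacement at the annotated corresponding vertices best matches the face mesh's expression displacement. All array shapes and names below are hypothetical.

```python
# Illustrative sketch of expression transfer via annotated vertex
# correspondences; the least-squares scheme and all shapes are assumptions.
import numpy as np

rng = np.random.default_rng(0)
Vf, Vc, Kf, Kc, M = 468, 300, 10, 8, 50        # face/cartoon vertices, blend-shape
                                               # counts, and M annotated correspondences
face_basis = rng.normal(size=(Kf, Vf, 3))      # face expression blend-shape offsets
cartoon_basis = rng.normal(size=(Kc, Vc, 3))   # prefabricated cartoon blend shapes
face_idx = rng.choice(Vf, M, replace=False)    # annotated vertices on the face mesh
cartoon_idx = rng.choice(Vc, M, replace=False) # their counterparts on the cartoon mesh

def transfer(face_params):
    """Cartoon blend-shape weights whose displacement at the annotated
    vertices best matches the face mesh's displacement (least squares)."""
    face_disp = np.tensordot(face_params, face_basis, axes=1)[face_idx].ravel()  # (M*3,)
    A = cartoon_basis[:, cartoon_idx, :].reshape(Kc, -1).T                       # (M*3, Kc)
    weights, *_ = np.linalg.lstsq(A, face_disp, rcond=None)
    return weights

w = transfer(rng.normal(size=Kf))
print(w.shape)   # (8,) weights driving the cartoon character's blend shapes
```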
In addition, adopting the technical solution provided by the present application can effectively improve user experience.
It should be understood that the various forms of flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present application may be executed in parallel, sequentially, or in a different order; no limitation is imposed herein as long as the desired results of the technical solutions disclosed in the present application can be achieved.
The above specific embodiments do not constitute a limitation on the protection scope of the present application. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations, and substitutions may be made according to design requirements and other factors. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present application shall fall within the protection scope of the present application.

Claims (16)

1. A method for generating a video, comprising:
acquiring a three-dimensional face mesh of a face image of a target object in a video to be generated and face image textures of the face image;
obtaining each expression parameter of the face image according to the audio features of the audio content of the target object;
obtaining each rendered face image of the three-dimensional face mesh according to the three-dimensional face mesh of the face image, each expression parameter of the face image, and the face image texture of the face image;
performing fusion processing on each rendered face image of the three-dimensional face mesh and each video frame image of the template video to obtain fused video frame images;
and synthesizing the fused video frame images to generate a fused video.
2. The method according to claim 1, wherein the obtaining of the three-dimensional face mesh of the face image of the target object in the video to be generated and the face image texture of the face image comprises:
obtaining a three-dimensional face mesh of the face image according to the image content of the target object;
and obtaining the face image texture of the face image according to the projection relationship between the three-dimensional face mesh of the face image and the image content of the target object.
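By way of illustration only (not part of the claimed method), a minimal NumPy sketch of such a projection-based texture step: project the mesh vertices into the image with a camera model and sample per-vertex colors at the projected pixels. The weak-perspective camera, the nearest-pixel sampling, and all names are assumptions.

```python
# Illustrative sketch: per-vertex texture from the mesh-to-image projection
# relationship, assuming a hypothetical weak-perspective camera.
import numpy as np

def texture_from_projection(mesh, image, scale=100.0):
    """mesh: (V, 3) vertices; image: (H, W, 3) target-object image.
    Returns a (V, 3) per-vertex texture via nearest-pixel sampling."""
    H, W, _ = image.shape
    u = np.clip((mesh[:, 0] * scale + W / 2).astype(int), 0, W - 1)   # x -> column
    v = np.clip((-mesh[:, 1] * scale + H / 2).astype(int), 0, H - 1)  # y -> row (flipped)
    return image[v, u]

mesh = np.random.randn(468, 3) * 0.5
image = np.random.rand(256, 256, 3)
tex = texture_from_projection(mesh, image)
print(tex.shape)   # (468, 3)
```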
3. The method of claim 2, wherein the obtaining a three-dimensional face mesh of the face image according to the image content of the target object comprises:
acquiring image feature information of the target object according to the image content of the target object, and acquiring a three-dimensional face mesh of the face image by using a pre-trained first neural network according to the image feature information of the target object; or
obtaining a three-dimensional face mesh of the face image by using a pre-trained second neural network according to the image content of the target object; or
performing modeling processing on the image content of the target object by using manually calibrated key points to obtain a three-dimensional face mesh of the face image.
4. The method according to claim 3, wherein the obtaining a three-dimensional face mesh of the face image by using a pre-trained first neural network according to the image feature information of the target object comprises:
obtaining each shape parameter of the face image by using a pre-trained first neural network according to the image feature information of the target object;
and obtaining a three-dimensional face mesh of the face image according to the shape parameters of the face image.
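By way of illustration only, under the common linear morphable-model reading (an assumption here, not stated in the claim), the second step amounts to mesh = mean shape + the parameter-weighted sum of shape-basis offsets. A minimal NumPy sketch with hypothetical shapes:

```python
# Illustrative sketch of building a face mesh from shape parameters under an
# assumed linear model: mesh = mean_shape + sum_i alpha_i * shape_basis_i.
import numpy as np

rng = np.random.default_rng(1)
V, K = 468, 40                              # vertex count, shape-parameter count
mean_shape = rng.normal(size=(V, 3))        # neutral mean face mesh
shape_basis = rng.normal(size=(K, V, 3))    # per-parameter shape offsets

def mesh_from_shape_params(alpha):
    """alpha: (K,) shape parameters, e.g. predicted by the first neural network."""
    return mean_shape + np.tensordot(alpha, shape_basis, axes=1)

mesh = mesh_from_shape_params(rng.normal(size=K) * 0.1)
print(mesh.shape)   # (468, 3)
```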
5. The method according to claim 1, wherein the obtaining of the three-dimensional face mesh of the face image of the target object in the video to be generated and the face image texture of the face image comprises:
acquiring the face shape and the face texture of a base cartoon character;
and obtaining a three-dimensional face mesh of the face image and the face image texture of the face image according to the face shape and the face texture of the base cartoon character.
6. The method according to claim 1, wherein after the obtaining each expression parameter of the face image according to the audio features of the audio content of the target object, the method further comprises:
adjusting each expression parameter of the face image by using the image data of the target object and the audio data corresponding to the image data, to obtain each adjusted expression parameter of the face image.
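By way of illustration only, one hypothetical way to realize such an adjustment is to fit an affine correction from the audio-predicted parameters to parameters recovered from the paired image data. The pairing, the synthetic data, and the least-squares scheme below are all assumptions for demonstration.

```python
# Illustrative sketch: fit an affine correction (W, b) so that adjusted
# audio-predicted parameters match image-derived parameters on paired data.
import numpy as np

rng = np.random.default_rng(2)
N, K = 200, 10
audio_params = rng.normal(size=(N, K))                    # predicted from audio
image_params = audio_params @ rng.normal(size=(K, K)) * 0.1 + 0.05  # toy targets from image data

# Solve min ||audio_params @ W + b - image_params||^2 via least squares.
X = np.hstack([audio_params, np.ones((N, 1))])            # append bias column
Wb, *_ = np.linalg.lstsq(X, image_params, rcond=None)
W, b = Wb[:-1], Wb[-1]

def adjust(params):
    """Apply the fitted correction to new audio-predicted parameters."""
    return params @ W + b

print(adjust(audio_params[:1]).shape)   # (1, 10) adjusted expression parameters
```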
7. The method according to any one of claims 1 to 6, wherein the obtaining each rendered face image of the three-dimensional face mesh according to the three-dimensional face mesh of the face image, each expression parameter of the face image, and the face image texture of the face image comprises:
obtaining a three-dimensional teeth mesh and a teeth image texture corresponding to the expression parameters according to each expression parameter of the face image;
and obtaining each rendered face image of the three-dimensional face mesh according to the three-dimensional face mesh of the face image, each expression parameter of the face image, the face image texture of the face image, the three-dimensional teeth mesh, and the teeth image texture.
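By way of illustration only, a minimal sketch of the teeth step, assuming a single hypothetical "jaw-open" expression parameter that rigidly translates a lower-teeth mesh; the parameter index, shapes, and constants are all assumptions.

```python
# Illustrative sketch: derive a teeth-mesh pose from the expression
# parameters (assumed: one jaw-open parameter moves the lower teeth down).
import numpy as np

lower_teeth = np.random.rand(120, 3)      # toy lower-teeth mesh at rest
teeth_texture = np.random.rand(120, 3)    # toy per-vertex teeth texture
JAW_OPEN = 3                              # assumed index of the jaw-open parameter

def pose_teeth(expression_params, max_drop=0.02):
    """Translate the lower-teeth mesh down in proportion to jaw opening."""
    drop = np.clip(expression_params[JAW_OPEN], 0.0, 1.0) * max_drop
    posed = lower_teeth.copy()
    posed[:, 1] -= drop                   # vertical displacement of the mesh
    return posed, teeth_texture

posed, tex = pose_teeth(np.random.rand(10))
print(posed.shape, tex.shape)   # (120, 3) (120, 3)
```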
8. An apparatus for generating a video, comprising:
the mesh texture acquisition unit is configured to acquire a three-dimensional face mesh of a face image of a target object in a video to be generated and a face image texture of the face image;
the expression parameter acquisition unit is configured to obtain each expression parameter of the face image according to the audio features of the audio content of the target object;
the mesh rendering unit is configured to obtain each rendered face image of the three-dimensional face mesh according to the three-dimensional face mesh of the face image, each expression parameter of the face image, and the face image texture of the face image;
the image fusion unit is configured to perform fusion processing on each rendered face image of the three-dimensional face mesh and each video frame image of the template video to obtain fused video frame images;
and the video synthesis unit is configured to synthesize the fused video frame images to generate a fused video.
9. The apparatus according to claim 8, wherein the mesh texture acquisition unit is specifically configured to:
obtain a three-dimensional face mesh of the face image according to the image content of the target object; and
obtain the face image texture of the face image according to the projection relationship between the three-dimensional face mesh of the face image and the image content of the target object.
10. The apparatus according to claim 9, wherein the mesh texture acquisition unit is specifically configured to:
acquire image feature information of the target object according to the image content of the target object, and acquire a three-dimensional face mesh of the face image by using a pre-trained first neural network according to the image feature information of the target object; or
obtain a three-dimensional face mesh of the face image by using a pre-trained second neural network according to the image content of the target object; or
perform modeling processing on the image content of the target object by using manually calibrated key points to obtain a three-dimensional face mesh of the face image.
11. The apparatus according to claim 10, wherein the mesh texture acquisition unit is specifically configured to:
obtain each shape parameter of the face image by using a pre-trained first neural network according to the image feature information of the target object; and
obtain a three-dimensional face mesh of the face image according to the shape parameters of the face image.
12. The apparatus according to claim 8, wherein the mesh texture acquisition unit is specifically configured to:
acquire the face shape and the face texture of a base cartoon character; and
obtain a three-dimensional face mesh of the face image and the face image texture of the face image according to the face shape and the face texture of the base cartoon character.
13. The apparatus of claim 8, wherein the expression parameter acquisition unit is further configured to:
adjust each expression parameter of the face image by using the image data of the target object and the audio data corresponding to the image data, to obtain each adjusted expression parameter of the face image.
14. The apparatus according to any one of claims 8-13, wherein the mesh rendering unit is specifically configured to:
obtain a three-dimensional teeth mesh and a teeth image texture corresponding to the expression parameters according to each expression parameter of the face image; and
obtain each rendered face image of the three-dimensional face mesh according to the three-dimensional face mesh of the face image, each expression parameter of the face image, the face image texture of the face image, the three-dimensional teeth mesh, and the teeth image texture.
15. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-7.
16. A non-transitory computer-readable storage medium having stored thereon computer instructions for causing a computer to perform the method of any one of claims 1-7.
CN202010088384.2A 2020-02-12 2020-02-12 Video generation method and device, electronic equipment and readable storage medium Pending CN111368137A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010088384.2A CN111368137A (en) 2020-02-12 2020-02-12 Video generation method and device, electronic equipment and readable storage medium

Publications (1)

Publication Number Publication Date
CN111368137A (en) 2020-07-03

Family

ID=71208020

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010088384.2A Pending CN111368137A (en) 2020-02-12 2020-02-12 Video generation method and device, electronic equipment and readable storage medium


Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190244410A1 (en) * 2018-02-08 2019-08-08 King.Com Limited Computer implemented method and device
CN110769323A (en) * 2018-07-27 2020-02-07 Tcl集团股份有限公司 Video communication method, system, device and terminal equipment
CN110163054A (en) * 2018-08-03 2019-08-23 腾讯科技(深圳)有限公司 A kind of face three-dimensional image generating method and device
CN109325437A (en) * 2018-09-17 2019-02-12 北京旷视科技有限公司 Image processing method, device and system
CN110035271A (en) * 2019-03-21 2019-07-19 北京字节跳动网络技术有限公司 Fidelity image generation method, device and electronic equipment
CN110674706A (en) * 2019-09-05 2020-01-10 深圳追一科技有限公司 Social contact method and device, electronic equipment and storage medium
CN110677598A (en) * 2019-09-18 2020-01-10 北京市商汤科技开发有限公司 Video generation method and device, electronic equipment and computer storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
侯赛 et al.: ""13th Five-Year" Quality Course Planning Textbook: Learn-and-Use Photoshop Design Examples", 31 March 2016, Hebei Fine Arts Publishing House *
狗头山人七: "3D Face Reconstruction Using a Deep Convolutional Neural Network", https://zhuanlan.zhihu.com/p/24316690 *

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112749711B (en) * 2020-08-04 2023-08-25 腾讯科技(深圳)有限公司 Video acquisition method and device and storage medium
CN112749711A (en) * 2020-08-04 2021-05-04 腾讯科技(深圳)有限公司 Video acquisition method and device and storage medium
CN112102447A (en) * 2020-09-14 2020-12-18 北京百度网讯科技有限公司 Image processing method, device, equipment and storage medium
CN112541963A (en) * 2020-11-09 2021-03-23 北京百度网讯科技有限公司 Three-dimensional virtual image generation method and device, electronic equipment and storage medium
CN112541963B (en) * 2020-11-09 2023-12-26 北京百度网讯科技有限公司 Three-dimensional avatar generation method, three-dimensional avatar generation device, electronic equipment and storage medium
CN112669441A (en) * 2020-12-09 2021-04-16 北京达佳互联信息技术有限公司 Object reconstruction method and device, electronic equipment and storage medium
CN112669441B (en) * 2020-12-09 2023-10-17 北京达佳互联信息技术有限公司 Object reconstruction method and device, electronic equipment and storage medium
CN112734895A (en) * 2020-12-30 2021-04-30 科大讯飞股份有限公司 Three-dimensional face processing method and electronic equipment
CN113221840A (en) * 2021-06-02 2021-08-06 广东工业大学 Portrait video processing method
CN113380269A (en) * 2021-06-08 2021-09-10 北京百度网讯科技有限公司 Video image generation method, apparatus, device, medium, and computer program product
CN113380269B (en) * 2021-06-08 2023-01-10 北京百度网讯科技有限公司 Video image generation method, apparatus, device, medium, and computer program product
CN114283060A (en) * 2021-12-20 2022-04-05 北京字节跳动网络技术有限公司 Video generation method, device, equipment and storage medium
CN114549711A (en) * 2022-04-27 2022-05-27 广州公评科技有限公司 Intelligent video rendering method and system based on expression muscle positioning
CN114549711B (en) * 2022-04-27 2022-07-12 广州公评科技有限公司 Intelligent video rendering method and system based on expression muscle positioning
WO2024055379A1 (en) * 2022-09-16 2024-03-21 粤港澳大湾区数字经济研究院(福田) Video processing method and system based on character avatar model, and related device
CN116091675B (en) * 2023-04-06 2023-06-30 腾讯科技(深圳)有限公司 Image processing method, device, equipment and storage medium
CN116091675A (en) * 2023-04-06 2023-05-09 腾讯科技(深圳)有限公司 Image processing method, device, equipment and storage medium

Similar Documents

Publication Publication Date Title
CN111368137A (en) Video generation method and device, electronic equipment and readable storage medium
CN112150638B (en) Virtual object image synthesis method, device, electronic equipment and storage medium
CN111294665B (en) Video generation method and device, electronic equipment and readable storage medium
JP7212741B2 (en) 3D avatar generation method, device, electronic device and storage medium
CN111652828B (en) Face image generation method, device, equipment and medium
US20180158246A1 (en) Method and system of providing user facial displays in virtual or augmented reality for face occluding head mounted displays
CN109151540B (en) Interactive processing method and device for video image
US11778002B2 (en) Three dimensional modeling and rendering of head hair
CN111563855B (en) Image processing method and device
JP2013524357A (en) Method for real-time cropping of real entities recorded in a video sequence
JP2009252240A (en) System, method and program for incorporating reflection
US10783713B2 (en) Transmutation of virtual entity sketch using extracted features and relationships of real and virtual objects in mixed reality scene
CN114862992A (en) Virtual digital human processing method, model training method and device thereof
CN111832745A (en) Data augmentation method and device and electronic equipment
CN111291218B (en) Video fusion method, device, electronic equipment and readable storage medium
US20190259198A1 (en) Systems and methods for generating visual representations of a virtual object for display by user devices
CN112330805A (en) Face 3D model generation method, device and equipment and readable storage medium
CN111599002A (en) Method and apparatus for generating image
CN111754431A (en) Image area replacement method, device, equipment and storage medium
CN113380269B (en) Video image generation method, apparatus, device, medium, and computer program product
CN112714337A (en) Video processing method and device, electronic equipment and storage medium
US20230106330A1 (en) Method for creating a variable model of a face of a person
CN109493428B (en) Optimization method and device for three-dimensional virtual model, electronic equipment and storage medium
CN116958344A (en) Animation generation method and device for virtual image, computer equipment and storage medium
CN111680623A (en) Attitude conversion method and apparatus, electronic device, and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20200703)