CN116389850A - Method and device for generating video by utilizing audio - Google Patents
Method and device for generating video by utilizing audio
- Publication number: CN116389850A (application CN202310243642.3A)
- Authority: CN (China)
- Prior art keywords: audio, target, target audio, video, pixel
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06V40/172 — Human faces: classification, e.g. identification
- G06N3/0442 — Recurrent networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
- G06N3/0464 — Convolutional networks [CNN, ConvNet]
- G06N3/048 — Activation functions
- G06N3/08 — Learning methods
- G06V10/44 — Local feature extraction by analysis of parts of the pattern
- G06V10/75 — Organisation of the matching processes, e.g. coarse-fine or multi-scale approaches
- G06V10/82 — Image or video recognition or understanding using neural networks
- G06V20/40 — Scenes; scene-specific elements in video content
- H04N21/8106 — Monomedia components involving special audio data, e.g. different tracks for different languages
- H04N21/816 — Monomedia components involving special video data, e.g. 3D video
Abstract
The invention provides a method and a device for generating video from audio, belonging to the field of computer technology. The method comprises the following steps: extracting target audio features from target audio; performing face parsing on a target image to obtain a face-parsed image, and casting a ray through each pixel of the face-parsed image to sample ray points; inputting the spatial coordinates and directions of the ray points, together with the target audio features, into a multi-layer perceptron to obtain the colors and densities of the ray points; and rendering, by volume rendering, a target face video matched with the target audio based on those colors and densities. By feeding the target audio features and the spatial coordinates and directions of the rays cast from the face-parsed image into the multi-layer perceptron, obtaining the colors and densities of the ray points, and then applying volume rendering, the method and device generate a video whose expression and lip motion are consistent with the target audio.
Description
Technical Field
The present invention relates to the field of computer technologies, and in particular, to a method and apparatus for generating video using audio.
Background
The classroom is an important place where teachers teach and students acquire knowledge. With the continuous development of society and of information technology, the quality of online classroom teaching has become increasingly important. Processing online teaching videos with information technology not only helps teachers enrich their teaching methods, but also helps students stay attentive in class and learn more efficiently. If videos of the same teaching content can be generated with different speaking styles and different presenter characters, instruction can be adapted to different types of students, which can further improve their learning performance.
Traditional approaches rely on intermediate representations such as 2D landmarks or 3D face models to bridge the gap between the audio input and the video output; the information loss introduced by these intermediate representations may lead to a semantic mismatch between the original audio signal and the facial deformation.
In addition, the invention patent application No. 202211508415.0 discloses a method and a device for generating video from speech. Its main technical scheme is as follows: determine the speech data corresponding to a speech input operation; select a target character from a plurality of preset characters configured for the target object according to the speech data; obtain a first model for extracting gesture data from speech and use it to determine initial gesture data from the speech data; obtain standard gesture data corresponding to the target character and retarget the initial gesture data to it to obtain target gesture data; and determine a second model, corresponding to the target character, for synthesizing video from gestures, and feed the target gesture data into the second model to generate a target video of the target character. Although workable, this scheme requires the preset characters and several model transformations to be configured in advance, which increases operational complexity.
Disclosure of Invention
The invention provides a method and a device for generating video from audio, which address the prior-art defect of semantic mismatch between the original audio signal and facial deformation and generate videos whose expression and lip motion are consistent with the speech.
In a first aspect, the present invention provides a method for generating video using audio, comprising: extracting target audio features from target audio; performing face parsing on a target image to obtain a face-parsed image, and casting a ray through each pixel of the face-parsed image to sample ray points; inputting the spatial coordinates and directions of the ray points, together with the target audio features, into a multi-layer perceptron to obtain the colors and densities of the ray points; and rendering, by volume rendering, a target face video matched with the target audio based on the colors and densities of the ray points.
According to the method for generating video using audio provided by the invention, extracting the target audio features from the target audio comprises: inputting the target audio into a preset speech recognition network model to extract the target audio features of the target audio.
According to the method for generating video using audio provided by the invention, the preset speech recognition network model is a DeepSpeech2 network; the DeepSpeech2 network comprises 3 convolutional layers, 7 recurrent layers and 1 fully connected layer connected in sequence.
According to the method for generating video using audio provided by the invention, performing face parsing on the target image to obtain the face-parsed image comprises: parsing the face in the target image with a BiSeNet network to obtain the face-parsed image. The BiSeNet network structure comprises a Spatial Path and a Context Path; the Spatial Path includes three layers, each consisting of a stride-2 convolution followed by batch normalization and a ReLU activation function; the Context Path rapidly downsamples the feature map to obtain a large receptive field and encode high-level semantic context information.
According to the method for generating video using audio provided by the invention, a ray is cast through each pixel of the face-parsed image; the ray cast through any pixel is expressed as:
r = o + t·d;
where r denotes the ray, o denotes the camera origin, t denotes the distance from a point on the ray to the camera origin, and d denotes the direction of the ray.
According to the method for generating video using audio provided by the invention, the target face video matched with the target audio is rendered by volume rendering based on the colors and densities of the ray points, and the corresponding formula is:
C(r) = ∫_{tn}^{tf} T(t) σ(r(t)) c(r(t), d) dt,  with  T(t) = exp(−∫_{tn}^{t} σ(r(s)) ds);
where T(t) denotes the accumulated transmittance along the ray, tn and tf denote the near and far bounds on the ray, σ(r(t)) denotes the density and c(r(t), d) the color of each sample point, and C(r) denotes the predicted color of each pixel.
According to the method for generating the video by utilizing the audio, the target audio is teacher teaching audio.
In a second aspect, the present invention also provides an apparatus for generating video using audio, including:
the target audio feature extraction module is used for extracting target audio features from target audio;
the ray point sampling module is used for performing face parsing on the target image to obtain a face-parsed image, and casting a ray through each pixel of the face-parsed image to sample ray points;
the dynamic neural radiance field module is used for inputting the spatial coordinates and directions of the ray points, together with the target audio features, into the multi-layer perceptron to obtain the colors and densities of the ray points;
and the video generation module is used for rendering, by volume rendering, the target face video matched with the target audio based on the colors and densities of the ray points.
In a third aspect, the invention provides an electronic device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, the processor implementing the steps of any of the above methods for generating video using audio when executing the program.
In a fourth aspect, the present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of a method of generating video using audio as described in any of the above.
The method and the device for generating video using audio provided by the invention feed the target audio features and the spatial coordinates and directions of the rays cast from the face-parsed image into a multi-layer perceptron, obtain the colors and densities of the ray points, and then use volume rendering to generate a video whose expression and lip motion are consistent with the target audio.
Furthermore, the audio features and the portrait features are fed directly into the neural network, so no preset characters need to be configured in advance and no extra models need to be prepared; this saves computing resources and reduces the semantic loss caused by converting semantics between multiple models. Meanwhile, for the education scenario, DeepSpeech2 is pretrained with audio from Chinese classroom teaching, so the extracted audio semantic features are more accurate.
Drawings
In order to more clearly illustrate the invention or the technical solutions of the prior art, the following description will briefly explain the drawings used in the embodiments or the description of the prior art, and it is obvious that the drawings in the following description are some embodiments of the invention, and other drawings can be obtained according to the drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a method for generating video using audio according to the present invention;
fig. 2 is a schematic diagram of the DeepSpeech2 network structure provided by the present invention;
FIG. 3 is a schematic diagram of face parsing using a BiSeNet network according to the present invention;
FIG. 4 is a second flow chart of a method for generating video using audio according to the present invention;
fig. 5 is a schematic structural diagram of an electronic device provided by the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
It should be noted that in the description of embodiments of the present invention, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
The following describes a method and apparatus for generating video using audio according to embodiments of the present invention with reference to fig. 1 to 5.
Fig. 1 is a schematic flow chart of a method for generating video using audio according to the present invention. As shown in fig. 1, the method includes, but is not limited to, the following steps:
step 101: and extracting target audio characteristics from the target audio.
The target audio can be collected with any device having a recording function, such as a mobile phone or a voice recorder, or extracted directly from a video, yielding audio in wav format. Optionally, the target audio may be audio recorded during classroom teaching, i.e. teacher teaching audio.
Further, the target audio is input into a preset speech recognition network model to extract the target audio features of the target audio.
Step 102: performing face parsing on the target image to obtain a face-parsed image, and casting a ray through each pixel of the face-parsed image to sample ray points.
The target image may be an image containing a teacher's face. Face parsing is needed because the motion of the head is generally not consistent with the motion of the torso, so the head and the torso must be trained separately in the image.
Further, rays are cast pixel by pixel through the obtained face-parsed image and ray points are sampled along them. If the camera focal point is known, the line from the focal point through a pixel forms a ray; given the density (which depends only on the spatial coordinates) and the color (which depends on both the spatial coordinates and the viewing direction) of every point in space along such rays, the color of each pixel can be obtained by volume rendering.
Optionally, the ray points are sampled uniformly along each ray.
Step 103: inputting the spatial coordinates and directions of the ray points, together with the target audio features, into a multi-layer perceptron to obtain the colors and densities of the ray points.
This step obtains the colors and densities of the ray points with a dynamic neural radiance field. Neural radiance fields are a current research hotspot in computer vision; their advantage is that, given two-dimensional images as input, they "reconstruct" the position and shape of a three-dimensional scene. Using only a multi-layer perceptron, the mapping from two-dimensional pictures to a three-dimensional scene can be realized, which saves computing resources.
It can be understood that a multi-layer perceptron is constructed whose input consists of the audio features and the coordinates and directions of the ray points, and whose output is the color and density value of each ray point; the network has eight layers in total, the width of the intermediate layers is 256, and the dimension of the output layer is 4.
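For illustration only, the PyTorch sketch below shows one way such a perceptron could look. The audio feature dimension (64 here), the absence of positional encoding and skip connections, and the exact placement of the output activations are assumptions made for this sketch, not details stated in the patent.

```python
import torch
import torch.nn as nn

class AudioConditionedRadianceMLP(nn.Module):
    """Minimal sketch of the 8-layer, width-256 perceptron described above.

    Inputs per sampled ray point:
      x : 3-D spatial coordinate of the point
      d : 3-D viewing direction of the ray
      a : audio feature vector for the current frame (dimension is an assumption)
    Output: 4 values per point -> RGB color (3) + volume density (1).
    """

    def __init__(self, audio_dim: int = 64, width: int = 256, depth: int = 8):
        super().__init__()
        in_dim = 3 + 3 + audio_dim          # coordinates + direction + audio feature
        layers = []
        for i in range(depth):
            layers += [nn.Linear(in_dim if i == 0 else width, width), nn.ReLU(inplace=True)]
        self.backbone = nn.Sequential(*layers)
        self.head = nn.Linear(width, 4)      # output layer of dimension 4

    def forward(self, x, d, a):
        h = self.backbone(torch.cat([x, d, a], dim=-1))
        out = self.head(h)
        rgb = torch.sigmoid(out[..., :3])    # color constrained to [0, 1]
        sigma = torch.relu(out[..., 3:])     # non-negative density
        return rgb, sigma

# Example: 1024 sampled ray points conditioned on a 64-dim audio code
mlp = AudioConditionedRadianceMLP(audio_dim=64)
rgb, sigma = mlp(torch.rand(1024, 3), torch.rand(1024, 3), torch.rand(1024, 64))
```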
Step 104: rendering, by volume rendering, a target face video matched with the target audio based on the colors and densities of the ray points.
According to the method for generating video using audio provided by the invention, the target audio features and the spatial coordinates and directions of the rays cast from the face-parsed image are fed into the multi-layer perceptron, the colors and densities of the ray points are obtained, and a video whose expression and lip motion are consistent with the target audio is then generated by volume rendering.
Based on the foregoing embodiments, as an optional embodiment, in the method for generating video using audio provided by the present invention, the preset speech recognition network model is a DeepSpeech2 network; the DeepSpeech2 network comprises 3 convolutional layers, 7 recurrent layers and 1 fully connected layer connected in sequence.
FIG. 2 is a schematic diagram of the DeepSpeech2 network structure provided in the present invention. As shown in FIG. 2, DeepSpeech2 is a speech recognition network model comprising 3 convolutional layers, 7 recurrent layers and 1 fully connected layer. The convolutional layers have kernel sizes of 11×41×32, 11×21×32 and 11×11×96 respectively, each with a stride of 2 and a ReLU activation function. Each of the 7 recurrent layers consists of an LSTM structure with 2048 hidden units. Finally, a fully connected layer maps the LSTM output to the final classification layer, which contains all possible text labels. The network is used to extract 29-dimensional audio features from the target audio: the DeepSpeech2 model predicts a 29-dimensional feature code for every 20 ms audio segment and extracts its semantic information. The audio features of several consecutive frames are then fed jointly into a temporal convolution network to suppress noise in the original input; specifically, the features of 16 adjacent frames, a ∈ R^(16×29), represent the current state of the audio modality. Using audio features instead of regressed expression coefficients or facial landmarks avoids the training cost of an intermediate translation network and prevents potential semantic mismatch between the audio and visual signals.
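A minimal sketch of how the 16-frame, 29-dimensional audio windows and the temporal smoothing could be assembled is given below. The edge-padding strategy, the output dimension (64) and the exact layers of the temporal convolution are hypothetical choices for illustration; the patent only specifies the 16×29 window.

```python
import torch
import torch.nn as nn

def audio_windows(frame_feats: torch.Tensor, win: int = 16) -> torch.Tensor:
    """Stack each frame with its neighbours: (T, 29) -> (T, win, 29).

    frame_feats are the per-20ms DeepSpeech2 feature codes; edges are padded by
    repeating the first/last frame so every video frame gets a full window.
    """
    half = win // 2
    padded = torch.cat([frame_feats[:1].repeat(half, 1),
                        frame_feats,
                        frame_feats[-1:].repeat(win - half - 1, 1)], dim=0)
    return padded.unfold(0, win, 1).permute(0, 2, 1)          # (T, win, 29)

class TemporalSmoother(nn.Module):
    """Assumed 1-D convolution over the 16-frame window that suppresses noise
    and condenses it into a single audio code per video frame."""

    def __init__(self, feat_dim: int = 29, out_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(feat_dim, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(32, out_dim, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),          # pool over the 16 time steps
        )

    def forward(self, windows: torch.Tensor) -> torch.Tensor:
        # windows: (T, 16, 29) -> Conv1d expects (T, 29, 16)
        return self.net(windows.transpose(1, 2)).squeeze(-1)  # (T, out_dim)

feats = torch.randn(100, 29)                      # 100 frames of 29-dim features
codes = TemporalSmoother()(audio_windows(feats))  # (100, 64) per-frame audio codes
```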
Optionally, when the target audio is teacher teaching audio, and given that the classroom teaching language is Chinese, the WenetSpeech dataset is selected to pretrain the DeepSpeech2 network; this dataset covers over ten thousand hours of labeled Mandarin audio, so the classroom audio features extracted by the model are more representative.
Based on the foregoing embodiments, as an optional embodiment, in the method for generating video using audio provided by the present invention, face parsing is performed on the target image to obtain a face-parsed image, specifically as follows.
Fig. 3 is a schematic diagram of face parsing using a BiSeNet network. Referring to fig. 3, the network structure comprises a Spatial Path (SP) and a Context Path (CP), which address the loss of spatial information and the shrinkage of the receptive field, respectively.
Spatial Path: it contains three layers, each consisting of a convolution with a stride of 2 followed by batch normalization and ReLU. This path therefore produces an output feature map that is 1/8 of the original image. Because it works at a relatively large scale, it can encode relatively rich spatial information.
Context Path: the CP rapidly downsamples the feature map to obtain a large receptive field and encode high-level semantic context information. A global average pooling layer is added at the end of this path, providing the maximum receptive field through global context information.
ARM module: attention refinement modules (ARM) are used in the Context Path to refine the features of each stage. Each ARM comprises two branches, a global branch and a local branch. The global branch extracts global information by compressing the input feature map into a global feature vector, thereby obtaining global context; the local branch first compresses the features with a 1×1 convolution layer and then extracts features with a 3×3 convolution layer to obtain more local information.
Feature fusion module: the feature fusion module fuses features at different scales to help the network understand the image better. It consists of two branches: an up-sampling branch that deals with high-resolution features, and a down-sampling branch that extracts lower-resolution but more global features. Specifically, the up-sampling branch typically includes a deconvolution layer that up-samples a low-resolution feature map to a higher resolution so that it can be fused with other high-resolution features; the down-sampling branch typically includes a pooling or convolution layer that reduces resolution and enlarges the receptive field to obtain more global feature information. Finally, the outputs of the two branches are fused to obtain the final output of the feature fusion module.
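As an illustration of the components described above, the following sketch shows a minimal PyTorch version of the Spatial Path (three stride-2 convolution + batch-normalization + ReLU layers) and of an attention refinement module; the channel widths (64/128/256) and the exact attention gating are assumptions, not values given in the patent.

```python
import torch
import torch.nn as nn

class ConvBNReLU(nn.Module):
    """One Spatial Path layer: stride-2 convolution, batch norm, ReLU."""
    def __init__(self, c_in, c_out, stride=2):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(c_in, c_out, kernel_size=3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(c_out),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.block(x)

class SpatialPath(nn.Module):
    """Three stride-2 layers -> a feature map at 1/8 of the input resolution."""
    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(ConvBNReLU(3, 64), ConvBNReLU(64, 128), ConvBNReLU(128, 256))

    def forward(self, x):
        return self.layers(x)

class AttentionRefinement(nn.Module):
    """ARM sketch: a global branch squeezes the map into a channel vector that
    re-weights the local features produced by 1x1 and 3x3 convolutions."""
    def __init__(self, channels):
        super().__init__()
        self.local = nn.Sequential(nn.Conv2d(channels, channels, 1),
                                   nn.Conv2d(channels, channels, 3, padding=1))
        self.gate = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Conv2d(channels, channels, 1), nn.Sigmoid())

    def forward(self, x):
        return self.local(x) * self.gate(x)   # global attention re-weights local features

feat = SpatialPath()(torch.randn(1, 3, 256, 256))   # -> (1, 256, 32, 32), i.e. 1/8 of 256
refined = AttentionRefinement(256)(feat)
```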
Based on the foregoing embodiments, as an optional embodiment, in the method for generating video using audio provided by the present invention, a ray is cast through each pixel of the face-parsed image; the ray cast through any pixel is expressed as:
r = o + t·d;
where r denotes the ray, o denotes the camera origin, t denotes the distance from a point on the ray to the camera origin, and d denotes the direction of the ray. The ray points are sampled uniformly along each ray.
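The sketch below illustrates casting one ray per pixel according to r = o + t·d and sampling points uniformly along it. The pinhole camera convention and the camera-to-world matrix `c2w` are assumptions introduced only for this example.

```python
import torch

def sample_ray_points(H, W, focal, c2w, t_near=0.0, t_far=1.0, n_samples=64):
    """Cast one ray per pixel (r = o + t*d) and sample points uniformly along it.

    H, W   : height/width of the face-parsed image
    focal  : camera focal length (assumed known, as stated above)
    c2w    : 3x4 camera-to-world matrix; its last column is the camera origin o
    Returns per-pixel 3-D sample points, the ray directions and the distances t.
    """
    j, i = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                          torch.arange(W, dtype=torch.float32), indexing="ij")
    # Pixel -> camera-space direction, then rotate into world space.
    dirs = torch.stack([(i - W * 0.5) / focal, -(j - H * 0.5) / focal, -torch.ones_like(i)], dim=-1)
    rays_d = dirs @ c2w[:3, :3].T                        # (H, W, 3) ray directions d
    rays_o = c2w[:3, 3].expand_as(rays_d)                # (H, W, 3) camera origin o
    t = torch.linspace(t_near, t_far, n_samples)         # uniform sampling of distances t
    points = rays_o[..., None, :] + rays_d[..., None, :] * t[..., :, None]   # o + t*d
    return points, rays_d, t                             # points: (H, W, n_samples, 3)

pts, dirs, t = sample_ray_points(64, 64, focal=100.0, c2w=torch.eye(4)[:3])
```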
In step 103, the dynamic neural radiance field is defined as the mapping:
F: (a, d, x) → (c, σ);
where a denotes the 29-dimensional target audio feature, d the direction of the ray, x the spatial coordinates of the sampled ray point, c the color at the sampled ray point, σ the density at the sampled ray point, and the mapping F as a whole the dynamic neural radiance field.
Based on the foregoing, as an alternative embodiment, the present invention renders the face video from the resulting neural radiance field using volume rendering.
Volume rendering is implemented as follows:
C(r) = ∫_{tn}^{tf} T(t) σ(r(t)) c(r(t), d) dt,  with  T(t) = exp(−∫_{tn}^{t} σ(r(s)) ds);
where T(t) denotes the accumulated transmittance along the ray (i.e. how much light is left by the time it reaches the point), tn and tf denote the near and far bounds on the ray, σ(r(t)) denotes the density and c(r(t), d) the color of each sample point, so the contribution of a point to the pixel color is weighted by T(t)·σ(r(t)); C(r) denotes the predicted color of the pixel.
In the actual rendering process, the invention divides each ray evenly into N bins, randomly samples one point in each bin, and computes a weighted sum of the colors of the sampled points:
t_i ~ U[tn + (i−1)·(tf−tn)/N, tn + i·(tf−tn)/N];
C(r) ≈ Σ_{i=1..N} T_i · (1 − exp(−σ_i·δ_i)) · c_i,  with  T_i = exp(−Σ_{j<i} σ_j·δ_j);
where δ_i = t_{i+1} − t_i is the distance between adjacent samples.
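A minimal numerical version of this weighted summation (the standard NeRF-style quadrature) might look as follows; treating the last interval as effectively infinite is a common convention assumed here, not something stated in the patent.

```python
import torch

def composite(rgb, sigma, t_vals):
    """Numerical volume rendering sketch for the integral above.

    rgb    : (..., N, 3) colors c_i predicted by the perceptron at the N sample points
    sigma  : (..., N)    densities sigma_i at those points
    t_vals : (..., N)    distances t_i of the samples along each ray
    Returns the predicted color C(r) of each pixel.
    """
    deltas = t_vals[..., 1:] - t_vals[..., :-1]                        # delta_i = t_{i+1} - t_i
    deltas = torch.cat([deltas, torch.full_like(deltas[..., :1], 1e10)], dim=-1)
    alpha = 1.0 - torch.exp(-sigma * deltas)                           # opacity of each segment
    trans = torch.cumprod(torch.cat([torch.ones_like(alpha[..., :1]),
                                     1.0 - alpha[..., :-1] + 1e-10], dim=-1), dim=-1)  # T_i
    weights = trans * alpha                                            # T_i * (1 - exp(-sigma_i * delta_i))
    return (weights[..., None] * rgb).sum(dim=-2)                      # C(r) = sum_i w_i * c_i

# One ray with 64 samples
pixel_rgb = composite(torch.rand(1, 64, 3), torch.rand(1, 64), torch.linspace(0, 1, 64).expand(1, 64))
```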
The invention can collect classroom teaching speech (the target audio) as input, determine the speech data corresponding to the speech input operation, and extract the features of the speech with the DeepSpeech2 neural network. The features of the input audio signal are fed directly into the multi-layer perceptron to build a dynamic neural radiance field, and the outputs of the radiance field are composited by volume rendering into a high-fidelity talking-head video corresponding to the audio signal. The invention thus effectively links the audio signal with the facial deformation and avoids distortion of the expression and lip motion in the generated teacher video.
Based on the foregoing embodiments, as an alternative embodiment, fig. 4 is a second flow chart of the method for generating video using audio according to the present invention. As shown in fig. 4, the invention obtains the colors and densities of the ray points corresponding to the head and the torso separately, and then renders the target video matching the target audio using volume rendering.
The present invention also provides an apparatus for generating video using audio, the apparatus comprising:
the target audio feature extraction module is used for extracting target audio features from target audio;
the ray point sampling module is used for performing face parsing on the target image to obtain a face-parsed image, and casting a ray through each pixel of the face-parsed image to sample ray points;
the dynamic neural radiance field module is used for inputting the spatial coordinates and directions of the ray points, together with the target audio features, into the multi-layer perceptron to obtain the colors and densities of the ray points;
and the video generation module is used for rendering, by volume rendering, the target face video matched with the target audio based on the colors and densities of the ray points.
It should be noted that, when the apparatus for generating video using audio provided in the embodiment of the present invention specifically operates, the method for generating video using audio described in any one of the above embodiments may be executed, which is not described in detail in this embodiment.
Fig. 5 is a schematic structural diagram of an electronic device according to the present invention. As shown in fig. 5, the electronic device may include: a processor 510, a communication interface (Communications Interface) 520, a memory 530, and a communication bus 540, wherein the processor 510, the communication interface 520 and the memory 530 communicate with each other through the communication bus 540. The processor 510 may invoke logic instructions in the memory 530 to perform a method for generating video using audio, the method comprising: extracting target audio features from target audio; performing face parsing on a target image to obtain a face-parsed image, and casting a ray through each pixel of the face-parsed image to sample ray points; inputting the spatial coordinates and directions of the ray points, together with the target audio features, into a multi-layer perceptron to obtain the colors and densities of the ray points; and rendering, by volume rendering, a target face video matched with the target audio based on the colors and densities of the ray points.
Further, the logic instructions in the memory 530 described above may be implemented in the form of software functional units and may be stored in a computer-readable storage medium when sold or used as a stand-alone product. Based on this understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art or in a part of the technical solution, in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a random access Memory (RAM, random Access Memory), a magnetic disk, or an optical disk, or other various media capable of storing program codes.
In another aspect, the present invention also provides a computer program product comprising a computer program stored on a non-transitory computer readable storage medium, the computer program comprising program instructions which, when executed by a computer, cause the computer to perform the method for generating video using audio provided by the above embodiments, the method comprising: extracting target audio features from target audio; performing face parsing on a target image to obtain a face-parsed image, and casting a ray through each pixel of the face-parsed image to sample ray points; inputting the spatial coordinates and directions of the ray points, together with the target audio features, into a multi-layer perceptron to obtain the colors and densities of the ray points; and rendering, by volume rendering, a target face video matched with the target audio based on the colors and densities of the ray points.
In yet another aspect, the present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the method for generating video using audio provided by the above embodiments, the method comprising: extracting target audio features from target audio; performing face parsing on a target image to obtain a face-parsed image, and casting a ray through each pixel of the face-parsed image to sample ray points; inputting the spatial coordinates and directions of the ray points, together with the target audio features, into a multi-layer perceptron to obtain the colors and densities of the ray points; and rendering, by volume rendering, a target face video matched with the target audio based on the colors and densities of the ray points.
The apparatus embodiments described above are merely illustrative; the units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units, i.e. they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the invention without undue burden.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus necessary general hardware platforms, or of course may be implemented by means of hardware. Based on this understanding, the foregoing technical solution may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a computer readable storage medium, such as ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the respective embodiments or some parts of the embodiments.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.
Claims (10)
1. A method for generating video using audio, comprising:
extracting target audio features from target audio;
performing face parsing on a target image to obtain a face-parsed image, and casting a ray through each pixel of the face-parsed image to sample ray points;
inputting the spatial coordinates and directions of the ray points, together with the target audio features, into a multi-layer perceptron to obtain the colors and densities of the ray points;
and rendering, by volume rendering, a target face video matched with the target audio based on the colors and densities of the ray points.
2. The method of generating video from audio of claim 1, wherein the extracting target audio features from the target audio comprises:
inputting the target audio into a preset speech recognition network model to extract the target audio features of the target audio.
3. The method for generating video using audio according to claim 2, wherein the preset speech recognition network model is a DeepSpeech2 network;
the DeepSpeech2 network comprises 3 convolutional layers, 7 recurrent layers and 1 fully connected layer connected in sequence.
4. The method for generating video using audio according to claim 1, wherein performing face parsing on the target image to obtain a face parsed image comprises:
performing face parsing on the target image with a BiSeNet network to obtain the face-parsed image; the network structure of the BiSeNet network comprises a Spatial Path and a Context Path;
the Spatial Path includes three layers, each consisting of a stride-2 convolution followed by batch normalization and a ReLU activation function;
the Context Path rapidly downsamples the feature map to obtain a large receptive field and encodes high-level semantic context information.
5. The method of generating video using audio according to claim 1, wherein a ray is cast through each pixel of the face-parsed image, and the ray cast through any pixel is expressed as:
r = o + t·d;
where r denotes the ray, o denotes the camera origin, t denotes the distance from a point on the ray to the camera origin, and d denotes the direction of the ray.
6. The method for generating video using audio according to claim 5, wherein the target face video matched with the target audio is rendered by volume rendering based on the colors and densities of the ray points, and the corresponding formula is:
C(r) = ∫_{tn}^{tf} T(t) σ(r(t)) c(r(t), d) dt,  with  T(t) = exp(−∫_{tn}^{t} σ(r(s)) ds);
where T(t) denotes the accumulated transmittance along the ray, tn and tf denote the near and far bounds on the ray, σ(r(t)) denotes the density and c(r(t), d) the color of each sample point, and C(r) denotes the predicted color of each pixel.
7. The method of generating video from audio of claim 1, wherein the target audio is teacher teaching audio.
8. An apparatus for generating video using audio, comprising:
the target audio feature extraction module is used for extracting target audio features from target audio;
the ray point sampling module is used for performing face parsing on the target image to obtain a face-parsed image, and casting a ray through each pixel of the face-parsed image to sample ray points;
the dynamic neural radiance field module is used for inputting the spatial coordinates and directions of the ray points, together with the target audio features, into the multi-layer perceptron to obtain the colors and densities of the ray points;
and the video generation module is used for rendering, by volume rendering, the target face video matched with the target audio based on the colors and densities of the ray points.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the steps of the method of generating video using audio as claimed in any one of claims 1 to 7 when the computer program is executed.
10. A non-transitory computer readable storage medium having stored thereon a computer program, which when executed by a processor, implements the steps of the method of generating video from audio according to any of claims 1 to 7.
Priority Applications (1)
- CN202310243642.3A — priority date 2023-03-14, filing date 2023-03-14 — Method and device for generating video by utilizing audio
Applications Claiming Priority (1)
- CN202310243642.3A — priority date 2023-03-14, filing date 2023-03-14 — Method and device for generating video by utilizing audio
Publications (1)
- CN116389850A — published 2023-07-04
Family
- ID=86968512
Family Applications (1)
- CN202310243642.3A — Method and device for generating video by utilizing audio — filed 2023-03-14 in CN, published as CN116389850A, status: Pending
Legal Events
- PB01 — Publication
- SE01 — Entry into force of request for substantive examination