CN116389850A - Method and device for generating video by utilizing audio - Google Patents

Method and device for generating video by utilizing audio

Info

Publication number
CN116389850A
Authority
CN
China
Prior art keywords
audio
target
target audio
video
pixel
Prior art date
Legal status: Pending
Application number
CN202310243642.3A
Other languages
Chinese (zh)
Inventor
廖盛斌
李一鸣
Current Assignee
Central China Normal University
Original Assignee
Central China Normal University
Priority date: 2023-03-14
Filing date: 2023-03-14
Publication date: 2023-07-04
Application filed by Central China Normal University
Priority to CN202310243642.3A
Publication of CN116389850A
Legal status: Pending (current)

Classifications

    • G06V40/172 Recognition of human faces in image or video data: classification, e.g. identification
    • G06N3/0442 Recurrent networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G06N3/048 Activation functions
    • G06N3/08 Learning methods
    • G06V10/44 Local feature extraction by analysis of parts of the pattern, e.g. edges, contours, corners; connectivity analysis
    • G06V10/75 Image or video pattern matching; organisation of the matching processes, e.g. coarse-fine or multi-scale approaches; context analysis; selection of dictionaries
    • G06V10/82 Image or video recognition or understanding using neural networks
    • G06V20/40 Scenes; scene-specific elements in video content
    • H04N21/8106 Monomedia components involving special audio data, e.g. different tracks for different languages
    • H04N21/816 Monomedia components involving special video data, e.g. 3D video

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Signal Processing (AREA)
  • Human Computer Interaction (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Image Generation (AREA)

Abstract

The invention provides a method and a device for generating video from audio, belonging to the field of computer technology. The method comprises: extracting target audio features from target audio; performing face parsing on a target image to obtain a face parsing image, and casting a ray through each pixel of the face parsing image to sample ray points; inputting the spatial coordinates and directions of the ray points, together with the target audio features, into a multi-layer perceptron to obtain the colors and densities of the ray points; and rendering, by volume rendering, a target face video matched with the target audio based on those colors and densities. By feeding the target audio features and the per-pixel ray samples of the face parsing image into the multi-layer perceptron to obtain the colors and densities of the ray points, and then applying volume rendering, the method and device generate video whose expressions and lip movements are consistent with the target audio.

Description

Method and device for generating video by utilizing audio
Technical Field
The present invention relates to the field of computer technologies, and in particular, to a method and apparatus for generating video using audio.
Background
The classroom is an important place where teachers teach and students acquire knowledge. With the continued development of society and of information technology, the quality of online classroom teaching has become increasingly important. Processing online teaching videos with information technology not only helps teachers enrich their teaching methods, but also helps students stay attentive in class and learn more efficiently. If videos of the same teaching content could be generated in different speaking styles and with different personas, teaching could be tailored to different types of students, improving learning outcomes more significantly.
Traditional approaches rely on intermediate representations such as 2D landmarks or 3D face models to bridge the gap between the audio input and the video output. Because these intermediate representations lose information, they may cause a semantic mismatch between the original audio signal and the resulting facial deformation.
In addition, the invention patent application No. 202211508415.0 discloses a method and device for generating video from speech. Its main technical scheme is as follows: determine the speech data corresponding to a speech input operation; determine a target persona from a plurality of preset personas configured for the target object according to the speech data; obtain a first model for extracting pose data from speech and use it to determine initial pose data for the speech data; obtain standard pose data corresponding to the target persona and retarget the initial pose data accordingly to obtain target pose data; and determine a second model, corresponding to the target persona, for synthesizing video from pose, then input the target pose data into the second model to generate a target video of the target persona. Although workable, this scheme requires preset personas and multiple model conversions to be configured in advance, which increases operational complexity.
Disclosure of Invention
The invention provides a method and a device for generating video from audio, which address the prior-art defect of semantic mismatch between the original audio signal and facial deformation, and generate video whose expressions and lip movements are consistent with the speech.
In a first aspect, the present invention provides a method for generating video from audio, comprising: extracting target audio features from target audio; performing face parsing on a target image to obtain a face parsing image, and casting a ray through each pixel of the face parsing image to sample ray points; inputting the spatial coordinates and directions of the ray points, together with the target audio features, into a multi-layer perceptron to obtain the colors and densities of the ray points; and rendering, by volume rendering, a target face video matched with the target audio based on the colors and densities of the ray points.
According to the method for generating video from audio provided by the invention, extracting target audio features from the target audio comprises: inputting the target audio into a preset speech recognition network model to extract the target audio features of the target audio.
According to the method for generating video from audio provided by the invention, the preset speech recognition network model is a DeepSpeech2 network; the DeepSpeech2 network comprises 3 convolutional layers, 7 recurrent layers and 1 fully connected layer connected in sequence.
According to the method for generating video from audio provided by the invention, performing face parsing on the target image to obtain the face parsing image comprises: parsing the face of the target image with a BiSeNet network to obtain the face parsing image. The network structure of the BiSeNet comprises a Spatial Path and a Context Path; the Spatial Path includes three layers, each comprising a convolution with a stride of 2 followed by batch normalization and a ReLU activation function; the Context Path rapidly downsamples the feature map to obtain a large receptive field and to encode high-level semantic context information.
According to the method for generating video from audio provided by the invention, a ray is cast through each pixel of the face parsing image, the ray cast through any pixel being expressed as:
r(t) = o + td;
where r denotes the ray, o denotes the camera position taken as the origin, t denotes the distance from a point on the ray to the camera origin, and d denotes the direction of the ray.
According to the method for generating video from audio provided by the invention, the target face video matched with the target audio is rendered by volume rendering based on the colors and densities of the ray points, according to:

$$C(r) = \int_{t_n}^{t_f} T(t)\,\sigma(r(t))\,c(r(t), d)\,dt, \qquad T(t) = \exp\!\Big(-\int_{t_n}^{t} \sigma(r(s))\,ds\Big)$$

where T(t) denotes the accumulated transmittance, t_n and t_f denote the near and far bounds of the ray, σ(r(t)) denotes the density and c(r(t), d) the color at each sample point, and C(r) denotes the predicted color of each pixel.
According to the method for generating video from audio provided by the invention, the target audio is teacher lecture audio.
In a second aspect, the present invention also provides an apparatus for generating video from audio, comprising:
a target audio feature extraction module, configured to extract target audio features from target audio;
a ray point sampling module, configured to perform face parsing on a target image to obtain a face parsing image, and to cast a ray through each pixel of the face parsing image to sample ray points;
a dynamic neural radiance field module, configured to input the spatial coordinates and directions of the ray points, together with the target audio features, into a multi-layer perceptron to obtain the colors and densities of the ray points;
and a video generation module, configured to render, by volume rendering, a target face video matched with the target audio based on the colors and densities of the ray points.
In a third aspect, the invention provides an electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the steps of any method for generating video from audio described above.
In a fourth aspect, the present invention also provides a non-transitory computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of any method for generating video from audio described above.
By inputting the target audio features and the spatial coordinates and directions of the rays cast from the face parsing image into the multi-layer perceptron to obtain the colors and densities of the ray points, and then applying volume rendering, the method and device provided by the invention generate video whose expressions and lip movements are consistent with the target audio.
Furthermore, the audio features and the portrait features are fed directly into the neural network: no preset personas and no chain of models need to be configured in advance, which saves computing resources and reduces the semantic loss caused by converting semantics across multiple models. At the same time, for the education scenario, DeepSpeech2 is pretrained with audio from Chinese classroom teaching, so the extracted audio semantic features are more accurate.
Drawings
In order to illustrate the technical solutions of the present invention or of the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly introduced below. The drawings described below show some embodiments of the present invention; other drawings can be obtained from them by those of ordinary skill in the art without inventive effort.
FIG. 1 is a first flow chart of the method for generating video from audio provided by the present invention;
FIG. 2 is a schematic diagram of the DeepSpeech2 network structure provided by the present invention;
FIG. 3 is a schematic diagram of face parsing with a BiSeNet network according to the present invention;
FIG. 4 is a second flow chart of the method for generating video from audio provided by the present invention;
FIG. 5 is a schematic structural diagram of an electronic device provided by the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
It should be noted that, in the description of embodiments of the present invention, the terms "comprises", "comprising", or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but possibly also other elements not expressly listed or inherent to it. Without further limitation, an element introduced by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises it.
The following describes a method and apparatus for generating video using audio according to embodiments of the present invention with reference to fig. 1 to 5.
Fig. 1 is a schematic flow chart of the method for generating video from audio provided by the present invention. As shown in Fig. 1, the method includes, but is not limited to, the following steps:
step 101: and extracting target audio characteristics from the target audio.
The target audio may be collected with any device having a recording function, such as a mobile phone or a voice recorder, or extracted directly from a video, yielding audio in wav format. Optionally, the target audio is audio of a teacher's classroom teaching, i.e., teacher lecture audio.
Further, the target audio is input into a preset speech recognition network model to extract the target audio features of the target audio.
Step 102: performing face parsing on the target image to obtain a face parsing image, and casting a ray through each pixel of the face parsing image to sample ray points.
The target image may be an image containing a teacher's face. Face parsing is needed because the motion of the head is generally not consistent with the motion of the torso, so the head and the torso in the image must be trained separately.
Further, a ray is cast through each pixel of the obtained face parsing image and ray points are sampled along it. Given the camera's optical center, the line from the optical center through a pixel forms a ray; once the density (which depends only on the spatial coordinates) and the color (which depends on both the spatial coordinates and the viewing direction) of each point along the ray are known, the color of the pixel can be obtained by volume rendering.
Optionally, the ray points are sampled uniformly along each ray.
Step 103: inputting the spatial coordinates and directions of the ray points, together with the target audio features, into a multi-layer perceptron to obtain the colors and densities of the ray points.
This step obtains the colors and densities of the ray points with a dynamic neural radiance field. Neural radiance fields are a current hot topic in computer vision research: given two-dimensional images as input, they "reconstruct" the position and shape of a three-dimensional scene. Using only a multi-layer perceptron, the mapping from two-dimensional pictures to a three-dimensional scene can be realized, saving computing resources.
Concretely, a multi-layer perceptron is constructed whose input consists of the audio features and the coordinates and directions of the ray points, and whose output is the color and density values of the ray points; the network has eight layers in total, the width of the hidden layers is 256, and the dimension of the output layer is 4.
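To make this construction concrete, the following is a minimal PyTorch sketch of such a perceptron. The simple concatenation of the 29-dimensional audio feature with the 3D coordinates and directions, the layer names, and the sigmoid/ReLU output activations are assumptions for illustration; the patent specifies only the depth (8), the hidden width (256), and the 4-dimensional output.

import torch
import torch.nn as nn

class AudioNeRFMLP(nn.Module):
    """Sketch of the eight-layer, width-256 perceptron described above.
    Input handling and output split are assumptions; the patent fixes
    only depth 8, hidden width 256 and output dimension 4."""

    def __init__(self, audio_dim: int = 29, hidden: int = 256, depth: int = 8):
        super().__init__()
        in_dim = audio_dim + 3 + 3  # audio feature + spatial coordinates + ray direction
        layers = []
        for i in range(depth - 1):  # 7 hidden layers; the output head is the 8th layer
            layers += [nn.Linear(in_dim if i == 0 else hidden, hidden), nn.ReLU()]
        self.trunk = nn.Sequential(*layers)
        self.head = nn.Linear(hidden, 4)  # 4-dimensional output: RGB color + density

    def forward(self, audio_feat, xyz, direction):
        h = self.trunk(torch.cat([audio_feat, xyz, direction], dim=-1))
        out = self.head(h)
        rgb = torch.sigmoid(out[..., :3])  # color c in [0, 1]
        sigma = torch.relu(out[..., 3:])   # non-negative density sigma
        return rgb, sigma

A forward pass over a batch of sampled ray points then yields the per-point colors and densities consumed by the volume rendering step below.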
Step 104: rendering, by volume rendering, a target face video matched with the target audio based on the colors and densities of the ray points.
In the method for generating video from audio provided by the invention, the target audio features and the spatial coordinates and directions of the rays cast from the face parsing image are input into the multi-layer perceptron to obtain the colors and densities of the ray points, and volume rendering then generates video whose expressions and lip movements are consistent with the target audio.
Based on the foregoing embodiments, as an optional embodiment of the method for generating video from audio provided by the invention, the preset speech recognition network model is a DeepSpeech2 network comprising 3 convolutional layers, 7 recurrent layers and 1 fully connected layer connected in sequence.
FIG. 2 is a schematic diagram of the DeepSpeech2 network structure provided by the present invention. As shown in FIG. 2, the DeepSpeech2 network is a speech recognition model comprising 3 convolutional layers, 7 recurrent layers and 1 fully connected layer. The convolutional layers have kernel sizes of 11×41×32, 11×21×32 and 11×11×96 respectively, each with a stride of 2 and a ReLU activation function. Each of the 7 recurrent layers consists of an LSTM structure with 2048 hidden units. Finally, a fully connected layer connects the LSTM output to the final classification layer, which contains all possible text labels. The model predicts a 29-dimensional feature code for every 20 ms audio segment, extracting the semantic information. The audio features of several consecutive frames are then fed jointly into a temporal convolutional network to cancel noise in the raw input; specifically, the features of 16 adjacent frames, A ∈ R^{16×29}, are used to represent the current state of the audio modality. Using audio features instead of regressed expression coefficients or facial landmarks reduces the training cost of an intermediate translation network and prevents potential semantic mismatch between the audio and visual signals.
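As an illustration of how the per-frame codes are grouped into the 16×29 audio state, here is a hedged sketch; the function name and the repeat-padding at sequence boundaries are assumptions, since the patent specifies only the window shape.

import torch

def audio_context_windows(frame_feats: torch.Tensor, win: int = 16) -> torch.Tensor:
    """Stack `win` adjacent 29-dim DeepSpeech2 frame codes into one
    (win, 29) audio state A per frame. `frame_feats` is a (T, 29) tensor
    of per-20ms codes from a pretrained DeepSpeech2 model (not shown here).
    Boundary frames are repeat-padded, which is an assumption."""
    half = win // 2
    padded = torch.cat([frame_feats[:1].expand(half, -1),
                        frame_feats,
                        frame_feats[-1:].expand(half - 1, -1)], dim=0)
    # Sliding windows of length `win`, one per original frame: (T, win, 29)
    return padded.unfold(0, win, 1).permute(0, 2, 1)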
Optionally, when the target audio is teacher lecture audio, and given that the classroom teaching language is Chinese, the WenetSpeech dataset, which covers ten thousand hours of labeled Mandarin audio, is selected to pretrain the DeepSpeech2 network, so that the classroom audio features extracted by the model are more representative.
Based on the foregoing embodiments, as an optional embodiment of the method for generating video from audio provided by the invention, face parsing is performed on the target image to obtain a face parsing image as follows.
Fig. 3 is a schematic diagram of face parsing with a BiSeNet network. Referring to Fig. 3, the network structure comprises a Spatial Path (SP) and a Context Path (CP), which address the loss of spatial information and the shrinkage of the receptive field, respectively.
Spatial Path: it contains three layers, each comprising a convolution with a stride of 2 followed by batch normalization and ReLU. This path therefore produces an output feature map that is 1/8 the size of the original image. Because it operates on feature maps of a relatively large scale, it can encode rich spatial information.
Context Path: the CP rapidly downsamples the feature map to obtain a large receptive field and to encode high-level semantic context information. A global average pooling layer is added at the end of this path, providing the maximum receptive field through global context information.
ARM module: attention refinement modules (ARM) are used in the Context Path to refine the features of each stage. An ARM comprises two branches, a global branch and a local branch. The global branch extracts global information, compressing the input feature map into a global feature vector to capture global context; the local branch compresses features with a 1×1 convolution layer and then extracts features with a 3×3 convolution layer to capture more local information.
Feature fusion module: the feature fusion module fuses features at different scales to help the network understand the image better. It consists of two branches: an upsampling branch, typically used to extract high-resolution detail features, and a downsampling branch, used to extract lower-resolution but more global features. Specifically, the upsampling branch usually includes a deconvolution layer that upsamples the low-resolution feature map to high resolution so it can be fused with other high-resolution features, while the downsampling branch usually includes pooling or convolution layers that reduce resolution and enlarge the receptive field to obtain more global feature information. Finally, the outputs of the upsampling and downsampling branches are fused to give the final output of the feature fusion module. A sketch of the Spatial Path follows.
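The following is a minimal sketch of the Spatial Path described above, under its stated structure (three stride-2 convolution + batch normalization + ReLU layers yielding a 1/8-resolution output); the kernel sizes and channel widths are assumptions, as the patent does not give them.

import torch.nn as nn

class SpatialPath(nn.Module):
    """Sketch of the three-layer Spatial Path: each layer is a stride-2
    convolution followed by batch normalization and ReLU, so the output
    feature map is 1/8 of the input resolution."""

    def __init__(self, in_ch: int = 3, widths=(64, 128, 256)):
        super().__init__()
        layers, prev = [], in_ch
        for w in widths:  # three stride-2 stages: H/2 -> H/4 -> H/8
            layers += [nn.Conv2d(prev, w, kernel_size=3, stride=2, padding=1),
                       nn.BatchNorm2d(w),
                       nn.ReLU(inplace=True)]
            prev = w
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)  # spatial size H/8 x W/8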
Based on the foregoing embodiments, as an optional embodiment of the method for generating video from audio provided by the invention, a ray is cast through each pixel of the face parsing image, the ray cast through any pixel being expressed as:
r(t) = o + td;
where r denotes the ray, o denotes the camera position taken as the origin, t denotes the distance from a point on the ray to the camera origin, and d denotes the direction of the ray. The ray points are sampled uniformly along each ray.
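A minimal sketch of this per-pixel ray casting with uniform sampling follows, assuming the camera origin o = (0, 0, 0) as stated; the sample count and near/far bounds are illustrative parameters.

import torch

def sample_ray_points(pixel_dirs: torch.Tensor, t_near: float, t_far: float,
                      n_samples: int = 64):
    """Sample points r(t) = o + t*d uniformly along one ray per pixel.
    `pixel_dirs` holds a unit direction d per pixel, shape (P, 3); the
    camera origin o is the coordinate origin, so o + t*d reduces to t*d."""
    t = torch.linspace(t_near, t_far, n_samples)        # uniform depths, shape (N,)
    points = pixel_dirs[:, None, :] * t[None, :, None]  # (P, N, 3) sampled ray points
    return points, t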
In step 103, the dynamic neural radiance field is formalized as:
F : (a, d, x) → (c, σ);
where a denotes the 29-dimensional target audio feature, d the direction of the ray, x the spatial coordinates of a sampled ray point, c the color at the sampled ray point, σ the density at the sampled ray point, and the mapping F the dynamic neural radiance field.
Based on the foregoing, as an alternative embodiment, the invention renders the face video from the resulting neural radiance field by volume rendering.
Volume rendering is implemented as follows:
$$C(r) = \int_{t_n}^{t_f} T(t)\,\sigma(r(t))\,c(r(t), d)\,dt, \qquad T(t) = \exp\!\Big(-\int_{t_n}^{t} \sigma(r(s))\,ds\Big)$$

where T(t) denotes the accumulated transmittance, t_n and t_f denote the near and far bounds of the ray, σ(r(t)) denotes the density, and c(r(t), d) denotes the color of each sample point. The contribution of a point is weighted by T(t)σ(r(t)), i.e., by how much light survives to reach that point, and C(r) denotes the predicted color of each pixel.
In the actual rendering process, the invention divides each ray into N equal strata, randomly samples one point within each stratum, and computes a weighted sum of the colors of the sampled points:

$$t_i \sim \mathcal{U}\Big[t_n + \frac{(i-1)(t_f - t_n)}{N},\; t_n + \frac{i(t_f - t_n)}{N}\Big]$$

$$\hat{C}(r) = \sum_{i=1}^{N} T_i \big(1 - \exp(-\sigma_i \delta_i)\big)\, c_i, \qquad T_i = \exp\Big(-\sum_{j=1}^{i-1} \sigma_j \delta_j\Big)$$

where δ_i = t_{i+1} − t_i denotes the distance between adjacent samples.
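The quadrature above maps directly to code. The following is a minimal sketch for a single pixel; treating the final interval δ_N as effectively infinite is a common convention and an assumption here, since the patent leaves it unspecified.

import torch

def stratified_t(t_near: float, t_far: float, n: int) -> torch.Tensor:
    """Draw t_i uniformly from the i-th of n equal strata of [t_near, t_far]."""
    edges = torch.linspace(t_near, t_far, n + 1)
    return edges[:-1] + torch.rand(n) * (edges[1:] - edges[:-1])

def render_pixel(sigma: torch.Tensor, rgb: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
    """Composite (N,) densities and (N, 3) colors sampled at sorted depths t
    into one predicted pixel color, following the weighted sum above."""
    delta = torch.cat([t[1:] - t[:-1], torch.tensor([1e10])])  # delta_i; last one ~infinite
    alpha = 1.0 - torch.exp(-sigma * delta)                    # per-segment opacity
    trans = torch.cumprod(torch.cat([torch.ones(1), 1.0 - alpha[:-1]]), dim=0)  # T_i
    return ((trans * alpha)[:, None] * rgb).sum(dim=0)         # weighted color sum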
The invention can collect classroom teaching speech (target audio) as input, determine the speech data corresponding to the speech input operation, and extract features of the speech with a DeepSpeech2 neural network. The features of the input audio signal are fed directly into a multi-layer perceptron to generate a dynamic neural radiance field, and the outputs of the dynamic neural radiance field are synthesized by volume rendering into a high-fidelity talking-head video corresponding to the audio signal. The invention effectively bridges the audio signal and the facial deformation, avoiding distortion of the generated teacher's expressions and lip movements.
Based on the foregoing embodiments, as an alternative embodiment, FIG. 4 is a second flow chart of the method for generating video from audio provided by the present invention. As shown in FIG. 4, the invention obtains the colors and densities of the ray points corresponding to the head and the torso separately, and then renders the target video matched with the target audio by volume rendering. Putting the pieces together, the overall data flow can be sketched as follows.
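For orientation only, the following glue code strings the earlier sketches together for a few frames; every class and function name refers to those sketches, and the tensor shapes (and the use of a single 29-dimensional code per frame, matching F : (a, d, x) → (c, σ)) are assumptions chosen to keep the example self-contained.

import torch

# Tiny illustrative sizes; real frames would be much larger.
H, W, N = 4, 4, 64
pixel_dirs = torch.nn.functional.normalize(torch.randn(H * W, 3), dim=-1)
per_frame_audio_feats = torch.randn(3, 29)  # three frames of 29-dim DeepSpeech2 codes

model = AudioNeRFMLP()
frames = []
for audio_feat in per_frame_audio_feats:
    points, t = sample_ray_points(pixel_dirs, t_near=2.0, t_far=6.0, n_samples=N)
    a = audio_feat.expand(H * W, N, -1)           # broadcast the audio code to every sample
    d = pixel_dirs[:, None, :].expand(-1, N, -1)  # ray direction for every sample
    rgb, sigma = model(a, points, d)              # colors and densities of the ray points
    pixels = torch.stack([render_pixel(sigma[i, :, 0], rgb[i], t) for i in range(H * W)])
    frames.append(pixels.reshape(H, W, 3))        # one rendered frame of the target video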
The present invention also provides an apparatus for generating video from audio, the apparatus comprising:
a target audio feature extraction module, configured to extract target audio features from target audio;
a ray point sampling module, configured to perform face parsing on a target image to obtain a face parsing image, and to cast a ray through each pixel of the face parsing image to sample ray points;
a dynamic neural radiance field module, configured to input the spatial coordinates and directions of the ray points, together with the target audio features, into a multi-layer perceptron to obtain the colors and densities of the ray points;
and a video generation module, configured to render, by volume rendering, a target face video matched with the target audio based on the colors and densities of the ray points.
It should be noted that, in operation, the apparatus for generating video from audio provided in this embodiment of the present invention may execute the method for generating video from audio described in any of the above embodiments, which is not described again here.
Fig. 5 is a schematic structural diagram of an electronic device according to the present invention. As shown in Fig. 5, the electronic device may include a processor 510, a communication interface (Communications Interface) 520, a memory 530 and a communication bus 540, wherein the processor 510, the communication interface 520 and the memory 530 communicate with one another through the communication bus 540. The processor 510 may invoke logic instructions in the memory 530 to perform the method for generating video from audio, the method comprising: extracting target audio features from target audio; performing face parsing on a target image to obtain a face parsing image, and casting a ray through each pixel of the face parsing image to sample ray points; inputting the spatial coordinates and directions of the ray points, together with the target audio features, into a multi-layer perceptron to obtain the colors and densities of the ray points; and rendering, by volume rendering, a target face video matched with the target audio based on the colors and densities of the ray points.
Furthermore, the logic instructions in the memory 530 may be implemented as software functional units and, when sold or used as an independent product, stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence or in the part contributing to the prior art, may be embodied in the form of a software product stored in a storage medium and comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or some of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes any medium capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk.
In another aspect, the present invention also provides a computer program product comprising a computer program stored on a non-transitory computer-readable storage medium, the computer program comprising program instructions which, when executed by a computer, cause the computer to perform the method for generating video from audio provided by the above embodiments, the method comprising: extracting target audio features from target audio; performing face parsing on a target image to obtain a face parsing image, and casting a ray through each pixel of the face parsing image to sample ray points; inputting the spatial coordinates and directions of the ray points, together with the target audio features, into a multi-layer perceptron to obtain the colors and densities of the ray points; and rendering, by volume rendering, a target face video matched with the target audio based on the colors and densities of the ray points.
In yet another aspect, the present invention also provides a non-transitory computer-readable storage medium storing a computer program which, when executed by a processor, performs the method for generating video from audio provided by the above embodiments, the method comprising: extracting target audio features from target audio; performing face parsing on a target image to obtain a face parsing image, and casting a ray through each pixel of the face parsing image to sample ray points; inputting the spatial coordinates and directions of the ray points, together with the target audio features, into a multi-layer perceptron to obtain the colors and densities of the ray points; and rendering, by volume rendering, a target face video matched with the target audio based on the colors and densities of the ray points.
The apparatus embodiments described above are merely illustrative; units described as separate components may or may not be physically separate, and components displayed as units may or may not be physical units, i.e., they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement this without inventive effort.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by software plus a necessary general hardware platform, or by hardware. Based on this understanding, the technical solution, in essence or in the part contributing to the prior art, may be embodied in the form of a software product stored in a computer-readable storage medium, such as ROM/RAM, a magnetic disk or an optical disk, comprising several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in each embodiment or in some parts of the embodiments.
Finally, it should be noted that the above embodiments are only intended to illustrate, not to limit, the technical solution of the present invention. Although the invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical schemes described in the foregoing embodiments can still be modified, or some of their technical features can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical schemes of the embodiments of the present invention.

Claims (10)

1. A method for generating video from audio, comprising:
extracting target audio features from target audio;
performing face parsing on a target image to obtain a face parsing image, and casting a ray through each pixel of the face parsing image to sample ray points;
inputting the spatial coordinates and directions of the ray points, together with the target audio features, into a multi-layer perceptron to obtain the colors and densities of the ray points;
and rendering, by volume rendering, a target face video matched with the target audio based on the colors and densities of the ray points.
2. The method for generating video from audio according to claim 1, wherein extracting target audio features from the target audio comprises:
inputting the target audio into a preset speech recognition network model to extract the target audio features of the target audio.
3. The method for generating video from audio according to claim 2, wherein the preset speech recognition network model is a DeepSpeech2 network;
the DeepSpeech2 network comprises 3 convolutional layers, 7 recurrent layers and 1 fully connected layer connected in sequence.
4. The method for generating video from audio according to claim 1, wherein performing face parsing on the target image to obtain a face parsing image comprises:
parsing the face of the target image with a BiSeNet network to obtain the face parsing image, wherein the network structure of the BiSeNet comprises a Spatial Path and a Context Path;
the Spatial Path includes three layers, each comprising a convolution with a stride of 2 followed by batch normalization and a ReLU activation function;
and the Context Path rapidly downsamples the feature map to obtain a large receptive field and to encode high-level semantic context information.
5. The method for generating video from audio according to claim 1, wherein a ray is cast through each pixel of the face parsing image, the ray cast through any pixel being expressed as:
r(t) = o + td;
where r denotes the ray, o denotes the camera position taken as the origin, t denotes the distance from a point on the ray to the camera origin, and d denotes the direction of the ray.
6. The method for generating video from audio according to claim 5, wherein the target face video matched with the target audio is rendered by volume rendering based on the colors and densities of the ray points, according to:

$$C(r) = \int_{t_n}^{t_f} T(t)\,\sigma(r(t))\,c(r(t), d)\,dt, \qquad T(t) = \exp\!\Big(-\int_{t_n}^{t} \sigma(r(s))\,ds\Big)$$

where T(t) denotes the accumulated transmittance, t_n and t_f denote the near and far bounds of the ray, σ(r(t)) denotes the density and c(r(t), d) the color at each sample point, and C(r) denotes the predicted color of each pixel.
7. The method for generating video from audio according to claim 1, wherein the target audio is teacher lecture audio.
8. An apparatus for generating video from audio, comprising:
a target audio feature extraction module, configured to extract target audio features from target audio;
a ray point sampling module, configured to perform face parsing on a target image to obtain a face parsing image, and to cast a ray through each pixel of the face parsing image to sample ray points;
a dynamic neural radiance field module, configured to input the spatial coordinates and directions of the ray points, together with the target audio features, into a multi-layer perceptron to obtain the colors and densities of the ray points;
and a video generation module, configured to render, by volume rendering, a target face video matched with the target audio based on the colors and densities of the ray points.
9. An electronic device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the steps of the method for generating video from audio according to any one of claims 1 to 7.
10. A non-transitory computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the steps of the method for generating video from audio according to any one of claims 1 to 7.
CN202310243642.3A 2023-03-14 2023-03-14 Method and device for generating video by utilizing audio Pending CN116389850A (en)

Priority Applications (1)

Application number: CN202310243642.3A
Priority date: 2023-03-14
Filing date: 2023-03-14
Title: Method and device for generating video by utilizing audio

Applications Claiming Priority (1)

Application number: CN202310243642.3A
Priority date: 2023-03-14
Filing date: 2023-03-14
Title: Method and device for generating video by utilizing audio

Publications (1)

Publication number: CN116389850A
Publication date: 2023-07-04

Family

ID=86968512

Family Applications (1)

Application number: CN202310243642.3A
Status: Pending (CN116389850A)
Priority date: 2023-03-14
Filing date: 2023-03-14
Title: Method and device for generating video by utilizing audio

Country Status (1)

Country Link
CN (1) CN116389850A (en)

Similar Documents

Publication Publication Date Title
US10593021B1 (en) Motion deblurring using neural network architectures
CN107979764B (en) Video subtitle generating method based on semantic segmentation and multi-layer attention framework
CN112887698B (en) High-quality face voice driving method based on nerve radiation field
CN110706302B (en) System and method for synthesizing images by text
CN113554737A (en) Target object motion driving method, device, equipment and storage medium
JP2017091525A (en) System and method for attention-based configurable convolutional neural network (abc-cnn) for visual question answering
CN114144790A (en) Personalized speech-to-video with three-dimensional skeletal regularization and representative body gestures
CN113901894A (en) Video generation method, device, server and storage medium
CN116258652B (en) Text image restoration model and method based on structure attention and text perception
Li et al. Uphdr-gan: Generative adversarial network for high dynamic range imaging with unpaired data
CN114581905B (en) Scene text recognition method and system based on semantic enhancement mechanism
CN117237521A (en) Speech driving face generation model construction method and target person speaking video generation method
Vasani et al. Generation of indian sign language by sentence processing and generative adversarial networks
CN116597857A (en) Method, system, device and storage medium for driving image by voice
CN116797868A (en) Text image generation method and diffusion generation model training method
Zakraoui et al. Improving text-to-image generation with object layout guidance
Rastgoo et al. All you need in sign language production
CN111275778A (en) Face sketch generating method and device
Rastgoo et al. A survey on recent advances in Sign Language Production
CN114240811A (en) Method for generating new image based on multiple images
CN117456587A (en) Multi-mode information control-based speaker face video generation method and device
CN117292017A (en) Sketch-to-picture cross-domain synthesis method, system and equipment
CN116977903A (en) AIGC method for intelligently generating short video through text
Wang [Retracted] An Old Photo Image Restoration Processing Based on Deep Neural Network Structure
CN116912268A (en) Skin lesion image segmentation method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination