CN116389850A - Method and device for generating video by utilizing audio - Google Patents
Method and device for generating video by utilizing audio
- Publication number: CN116389850A (application CN202310243642.3A)
- Authority: CN (China)
- Prior art keywords: audio, target, target audio, video, pixel
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06V40/172 — Human faces: classification, e.g. identification
- G06N3/0442 — Recurrent networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
- G06N3/0464 — Convolutional networks [CNN, ConvNet]
- G06N3/048 — Activation functions
- G06N3/08 — Learning methods
- G06V10/44 — Local feature extraction by analysis of parts of the pattern
- G06V10/75 — Organisation of the matching processes, e.g. coarse-fine or multi-scale approaches
- G06V10/82 — Image or video recognition or understanding using neural networks
- G06V20/40 — Scenes; scene-specific elements in video content
- H04N21/8106 — Monomedia components involving special audio data, e.g. different tracks for different languages
- H04N21/816 — Monomedia components involving special video data, e.g. 3D video
Abstract
The invention provides a method and a device for generating video from audio, belonging to the field of computer technology. The method comprises the following steps: extracting target audio features from target audio; performing face parsing on a target image to obtain a face-parsed image, and casting a ray through each pixel of the face-parsed image to sample ray points; inputting the spatial coordinates and directions of the ray points, together with the target audio features, into a multi-layer perceptron to obtain the colors and densities of the ray points; and rendering, by volume rendering, a target face video matched with the target audio based on those colors and densities. By feeding the target audio features and the spatial coordinates and directions of the rays cast from the face-parsed image into the multi-layer perceptron, obtaining the colors and densities of the ray points, and then applying volume rendering, the method and device generate a video whose expression and lip motion are consistent with the target audio.
Description
Technical Field
The present invention relates to the field of computer technologies, and in particular, to a method and apparatus for generating video using audio.
Background
The classroom is an important place where teachers teach and students acquire knowledge. With the continuous development of society and of information technology, the quality of online classroom teaching has become increasingly important. Processing online teaching videos with information technology not only helps teachers enrich their teaching methods, but also helps students stay attentive in class and learn more efficiently. If videos of the same teaching content can be generated with different speaking styles and different presenter characters, instruction can be adapted to different types of students, which can further improve their learning performance.
Traditional approaches rely on intermediate representations such as 2D landmarks or 3D face models to bridge the gap between the audio input and the video output; the information loss introduced by these intermediate representations may lead to a semantic mismatch between the original audio signal and the facial deformation.
In addition, the invention patent application No. 202211508415.0 discloses a method and a device for generating video from speech. Its main technical scheme is as follows: determine the speech data corresponding to a speech input operation; select a target character from a plurality of preset characters configured for the target object according to the speech data; obtain a first model for extracting gesture data from speech and use it to determine initial gesture data from the speech data; obtain standard gesture data corresponding to the target character and retarget the initial gesture data to it to obtain target gesture data; and determine a second model, corresponding to the target character, for synthesizing video from gestures, and feed the target gesture data into the second model to generate a target video of the target character. Although workable, this scheme requires the preset characters and several model transformations to be configured in advance, which increases operational complexity.
Disclosure of Invention
The invention provides a method and a device for generating video from audio, which address the prior-art defect of semantic mismatch between the original audio signal and facial deformation and generate videos whose expression and lip motion are consistent with the speech.
In a first aspect, the present invention provides a method for generating video using audio, comprising: extracting target audio features from target audio; performing face parsing on a target image to obtain a face-parsed image, and casting a ray through each pixel of the face-parsed image to sample ray points; inputting the spatial coordinates and directions of the ray points, together with the target audio features, into a multi-layer perceptron to obtain the colors and densities of the ray points; and rendering, by volume rendering, a target face video matched with the target audio based on the colors and densities of the ray points.
According to the method for generating video using audio provided by the invention, extracting the target audio features from the target audio comprises: inputting the target audio into a preset speech recognition network model to extract the target audio features of the target audio.
According to the method for generating video using audio provided by the invention, the preset speech recognition network model is a DeepSpeech2 network; the DeepSpeech2 network comprises 3 convolutional layers, 7 recurrent layers and 1 fully connected layer connected in sequence.
According to the method for generating video using audio provided by the invention, performing face parsing on the target image to obtain the face-parsed image comprises: parsing the face in the target image with a BiSeNet network to obtain the face-parsed image. The BiSeNet network structure comprises a Spatial Path and a Context Path; the Spatial Path includes three layers, each consisting of a stride-2 convolution followed by batch normalization and a ReLU activation function; the Context Path rapidly downsamples the feature map to obtain a large receptive field and encode high-level semantic context information.
According to the method for generating video using audio provided by the invention, a ray is cast through each pixel of the face-parsed image; the ray cast through any pixel is expressed as:
r = o + t·d;
where r denotes the ray, o denotes the camera origin, t denotes the distance from a point on the ray to the camera origin, and d denotes the direction of the ray.
According to the method for generating video using audio provided by the invention, the target face video matched with the target audio is rendered by volume rendering based on the colors and densities of the ray points, and the corresponding formula is:
C(r) = ∫_{tn}^{tf} T(t) σ(r(t)) c(r(t), d) dt,  with  T(t) = exp(−∫_{tn}^{t} σ(r(s)) ds);
where T(t) denotes the accumulated transmittance along the ray, tn and tf denote the near and far bounds on the ray, σ(r(t)) denotes the density and c(r(t), d) the color of each sample point, and C(r) denotes the predicted color of each pixel.
According to the method for generating the video by utilizing the audio, the target audio is teacher teaching audio.
In a second aspect, the present invention also provides an apparatus for generating video using audio, including:
the target audio feature extraction module is used for extracting target audio features from target audio;
the ray point sampling module is used for performing face parsing on the target image to obtain a face-parsed image, and casting a ray through each pixel of the face-parsed image to sample ray points;
the dynamic neural radiance field module is used for inputting the spatial coordinates and directions of the ray points, together with the target audio features, into the multi-layer perceptron to obtain the colors and densities of the ray points;
and the video generation module is used for rendering, by volume rendering, the target face video matched with the target audio based on the colors and densities of the ray points.
In a third aspect, the invention provides an electronic device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, the processor implementing the steps of any of the above methods for generating video using audio when executing the program.
In a fourth aspect, the present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of a method of generating video using audio as described in any of the above.
The method and the device for generating video using audio provided by the invention feed the target audio features and the spatial coordinates and directions of the rays cast from the face-parsed image into a multi-layer perceptron, obtain the colors and densities of the ray points, and then use volume rendering to generate a video whose expression and lip motion are consistent with the target audio.
Furthermore, the audio features and the portrait features are fed directly into the neural network, so no preset characters need to be configured in advance and no extra models need to be prepared; this saves computing resources and reduces the semantic loss caused by converting semantics between multiple models. Meanwhile, for the education scenario, DeepSpeech2 is pretrained with audio from Chinese classroom teaching, so the extracted audio semantic features are more accurate.
Drawings
In order to more clearly illustrate the invention or the technical solutions of the prior art, the following description will briefly explain the drawings used in the embodiments or the description of the prior art, and it is obvious that the drawings in the following description are some embodiments of the invention, and other drawings can be obtained according to the drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a method for generating video using audio according to the present invention;
fig. 2 is a schematic diagram of the DeepSpeech2 network structure provided by the present invention;
FIG. 3 is a schematic diagram of face parsing using a BiSeNet network according to the present invention;
FIG. 4 is a second flow chart of a method for generating video using audio according to the present invention;
fig. 5 is a schematic structural diagram of an electronic device provided by the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
It should be noted that in the description of embodiments of the present invention, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
The following describes a method and apparatus for generating video using audio according to embodiments of the present invention with reference to fig. 1 to 5.
Fig. 1 is a schematic flow chart of a method for generating video using audio according to the present invention. As shown in fig. 1, the method includes, but is not limited to, the following steps:
step 101: and extracting target audio characteristics from the target audio.
The target audio can be collected with any device having a recording function, such as a mobile phone or a voice recorder, or extracted directly from a video, yielding audio in wav format. Optionally, the target audio may be audio recorded during classroom teaching, i.e. teacher teaching audio.
Further, the target audio is input into a preset speech recognition network model to extract the target audio features of the target audio.
Step 102: performing face parsing on the target image to obtain a face-parsed image, and casting a ray through each pixel of the face-parsed image to sample ray points.
The target image may be an image containing a teacher's face. Face parsing is needed because the motion of the head is generally not consistent with the motion of the torso, so the head and the torso must be trained separately in the image.
Further, rays are cast pixel by pixel through the obtained face-parsed image and ray points are sampled along them. If the camera focal point is known, the line from the focal point through a pixel forms a ray; given the density (which depends only on the spatial coordinates) and the color (which depends on both the spatial coordinates and the viewing direction) of every point in space along such rays, the color of each pixel can be obtained by volume rendering.
Optionally, the ray points are sampled uniformly along each ray.
Step 103: inputting the spatial coordinates and directions of the ray points, together with the target audio features, into a multi-layer perceptron to obtain the colors and densities of the ray points.
This step obtains the colors and densities of the ray points with a dynamic neural radiance field. Neural radiance fields are a current research hotspot in computer vision; their advantage is that, given two-dimensional images as input, they "reconstruct" the position and shape of a three-dimensional scene. Using only a multi-layer perceptron, the mapping from two-dimensional pictures to a three-dimensional scene can be realized, which saves computing resources.
It can be understood that a multi-layer perceptron is constructed whose input consists of the audio features and the coordinates and directions of the ray points, and whose output is the color and density value of each ray point; the network has eight layers in total, the width of the intermediate layers is 256, and the dimension of the output layer is 4.
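For illustration only, the PyTorch sketch below shows one way such a perceptron could look. The audio feature dimension (64 here), the absence of positional encoding and skip connections, and the exact placement of the output activations are assumptions made for this sketch, not details stated in the patent.

```python
import torch
import torch.nn as nn

class AudioConditionedRadianceMLP(nn.Module):
    """Minimal sketch of the 8-layer, width-256 perceptron described above.

    Inputs per sampled ray point:
      x : 3-D spatial coordinate of the point
      d : 3-D viewing direction of the ray
      a : audio feature vector for the current frame (dimension is an assumption)
    Output: 4 values per point -> RGB color (3) + volume density (1).
    """

    def __init__(self, audio_dim: int = 64, width: int = 256, depth: int = 8):
        super().__init__()
        in_dim = 3 + 3 + audio_dim          # coordinates + direction + audio feature
        layers = []
        for i in range(depth):
            layers += [nn.Linear(in_dim if i == 0 else width, width), nn.ReLU(inplace=True)]
        self.backbone = nn.Sequential(*layers)
        self.head = nn.Linear(width, 4)      # output layer of dimension 4

    def forward(self, x, d, a):
        h = self.backbone(torch.cat([x, d, a], dim=-1))
        out = self.head(h)
        rgb = torch.sigmoid(out[..., :3])    # color constrained to [0, 1]
        sigma = torch.relu(out[..., 3:])     # non-negative density
        return rgb, sigma

# Example: 1024 sampled ray points conditioned on a 64-dim audio code
mlp = AudioConditionedRadianceMLP(audio_dim=64)
rgb, sigma = mlp(torch.rand(1024, 3), torch.rand(1024, 3), torch.rand(1024, 64))
```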
Step 104: rendering, by volume rendering, a target face video matched with the target audio based on the colors and densities of the ray points.
According to the method for generating video using audio provided by the invention, the target audio features and the spatial coordinates and directions of the rays cast from the face-parsed image are fed into the multi-layer perceptron, the colors and densities of the ray points are obtained, and a video whose expression and lip motion are consistent with the target audio is then generated by volume rendering.
Based on the foregoing embodiments, as an optional embodiment, in the method for generating video using audio provided by the present invention, the preset speech recognition network model is a DeepSpeech2 network; the DeepSpeech2 network comprises 3 convolutional layers, 7 recurrent layers and 1 fully connected layer connected in sequence.
FIG. 2 is a schematic diagram of the DeepSpeech2 network structure provided in the present invention. As shown in FIG. 2, DeepSpeech2 is a speech recognition network model comprising 3 convolutional layers, 7 recurrent layers and 1 fully connected layer. The convolutional layers have kernel sizes of 11×41×32, 11×21×32 and 11×11×96 respectively, each with a stride of 2 and a ReLU activation function. Each of the 7 recurrent layers consists of an LSTM structure with 2048 hidden units. Finally, a fully connected layer maps the LSTM output to the final classification layer, which contains all possible text labels. The network is used to extract 29-dimensional audio features from the target audio: the DeepSpeech2 model predicts a 29-dimensional feature code for every 20 ms audio segment and extracts its semantic information. The audio features of several consecutive frames are then fed jointly into a temporal convolution network to suppress noise in the original input; specifically, the features of 16 adjacent frames, a ∈ R^(16×29), represent the current state of the audio modality. Using audio features instead of regressed expression coefficients or facial landmarks avoids the training cost of an intermediate translation network and prevents potential semantic mismatch between the audio and visual signals.
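A minimal sketch of how the 16-frame, 29-dimensional audio windows and the temporal smoothing could be assembled is given below. The edge-padding strategy, the output dimension (64) and the exact layers of the temporal convolution are hypothetical choices for illustration; the patent only specifies the 16×29 window.

```python
import torch
import torch.nn as nn

def audio_windows(frame_feats: torch.Tensor, win: int = 16) -> torch.Tensor:
    """Stack each frame with its neighbours: (T, 29) -> (T, win, 29).

    frame_feats are the per-20ms DeepSpeech2 feature codes; edges are padded by
    repeating the first/last frame so every video frame gets a full window.
    """
    half = win // 2
    padded = torch.cat([frame_feats[:1].repeat(half, 1),
                        frame_feats,
                        frame_feats[-1:].repeat(win - half - 1, 1)], dim=0)
    return padded.unfold(0, win, 1).permute(0, 2, 1)          # (T, win, 29)

class TemporalSmoother(nn.Module):
    """Assumed 1-D convolution over the 16-frame window that suppresses noise
    and condenses it into a single audio code per video frame."""

    def __init__(self, feat_dim: int = 29, out_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(feat_dim, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(32, out_dim, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),          # pool over the 16 time steps
        )

    def forward(self, windows: torch.Tensor) -> torch.Tensor:
        # windows: (T, 16, 29) -> Conv1d expects (T, 29, 16)
        return self.net(windows.transpose(1, 2)).squeeze(-1)  # (T, out_dim)

feats = torch.randn(100, 29)                      # 100 frames of 29-dim features
codes = TemporalSmoother()(audio_windows(feats))  # (100, 64) per-frame audio codes
```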
Optionally, when the target audio is teacher teaching audio, and given that the classroom teaching language is Chinese, the WenetSpeech dataset is selected to pretrain the DeepSpeech2 network; this dataset covers over ten thousand hours of labeled Mandarin audio, so the classroom audio features extracted by the model are more representative.
Based on the foregoing embodiments, as an optional embodiment, in the method for generating video using audio provided by the present invention, face parsing is performed on the target image to obtain a face-parsed image, specifically as follows.
Fig. 3 is a schematic diagram of face parsing using a BiSeNet network. Referring to fig. 3, the network structure comprises a Spatial Path (SP) and a Context Path (CP), which address the loss of spatial information and the shrinkage of the receptive field, respectively.
Spatial Path: it contains three layers, each consisting of a convolution with a stride of 2 followed by batch normalization and ReLU. This path therefore produces an output feature map that is 1/8 of the original image. Because it works at a relatively large scale, it can encode relatively rich spatial information.
Context Path: the CP rapidly downsamples the feature map to obtain a large receptive field and encode high-level semantic context information. A global average pooling layer is added at the end of this path, providing the maximum receptive field through global context information.
ARM module: attention refinement modules (ARM) are used in the Context Path to refine the features of each stage. Each ARM comprises two branches, a global branch and a local branch. The global branch extracts global information by compressing the input feature map into a global feature vector, thereby obtaining global context; the local branch first compresses the features with a 1×1 convolution layer and then extracts features with a 3×3 convolution layer to obtain more local information.
Feature fusion module: the feature fusion module fuses features at different scales to help the network understand the image better. It consists of two branches: an up-sampling branch that deals with high-resolution features, and a down-sampling branch that extracts lower-resolution but more global features. Specifically, the up-sampling branch typically includes a deconvolution layer that up-samples a low-resolution feature map to a higher resolution so that it can be fused with other high-resolution features; the down-sampling branch typically includes a pooling or convolution layer that reduces resolution and enlarges the receptive field to obtain more global feature information. Finally, the outputs of the two branches are fused to obtain the final output of the feature fusion module.
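As an illustration of the components described above, the following sketch shows a minimal PyTorch version of the Spatial Path (three stride-2 convolution + batch-normalization + ReLU layers) and of an attention refinement module; the channel widths (64/128/256) and the exact attention gating are assumptions, not values given in the patent.

```python
import torch
import torch.nn as nn

class ConvBNReLU(nn.Module):
    """One Spatial Path layer: stride-2 convolution, batch norm, ReLU."""
    def __init__(self, c_in, c_out, stride=2):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(c_in, c_out, kernel_size=3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(c_out),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.block(x)

class SpatialPath(nn.Module):
    """Three stride-2 layers -> a feature map at 1/8 of the input resolution."""
    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(ConvBNReLU(3, 64), ConvBNReLU(64, 128), ConvBNReLU(128, 256))

    def forward(self, x):
        return self.layers(x)

class AttentionRefinement(nn.Module):
    """ARM sketch: a global branch squeezes the map into a channel vector that
    re-weights the local features produced by 1x1 and 3x3 convolutions."""
    def __init__(self, channels):
        super().__init__()
        self.local = nn.Sequential(nn.Conv2d(channels, channels, 1),
                                   nn.Conv2d(channels, channels, 3, padding=1))
        self.gate = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Conv2d(channels, channels, 1), nn.Sigmoid())

    def forward(self, x):
        return self.local(x) * self.gate(x)   # global attention re-weights local features

feat = SpatialPath()(torch.randn(1, 3, 256, 256))   # -> (1, 256, 32, 32), i.e. 1/8 of 256
refined = AttentionRefinement(256)(feat)
```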
Based on the foregoing embodiments, as an optional embodiment, in the method for generating video using audio provided by the present invention, a ray is cast through each pixel of the face-parsed image; the ray cast through any pixel is expressed as:
r = o + t·d;
where r denotes the ray, o denotes the camera origin, t denotes the distance from a point on the ray to the camera origin, and d denotes the direction of the ray. The ray points are sampled uniformly along each ray.
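The sketch below illustrates casting one ray per pixel according to r = o + t·d and sampling points uniformly along it. The pinhole camera convention and the camera-to-world matrix `c2w` are assumptions introduced only for this example.

```python
import torch

def sample_ray_points(H, W, focal, c2w, t_near=0.0, t_far=1.0, n_samples=64):
    """Cast one ray per pixel (r = o + t*d) and sample points uniformly along it.

    H, W   : height/width of the face-parsed image
    focal  : camera focal length (assumed known, as stated above)
    c2w    : 3x4 camera-to-world matrix; its last column is the camera origin o
    Returns per-pixel 3-D sample points, the ray directions and the distances t.
    """
    j, i = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                          torch.arange(W, dtype=torch.float32), indexing="ij")
    # Pixel -> camera-space direction, then rotate into world space.
    dirs = torch.stack([(i - W * 0.5) / focal, -(j - H * 0.5) / focal, -torch.ones_like(i)], dim=-1)
    rays_d = dirs @ c2w[:3, :3].T                        # (H, W, 3) ray directions d
    rays_o = c2w[:3, 3].expand_as(rays_d)                # (H, W, 3) camera origin o
    t = torch.linspace(t_near, t_far, n_samples)         # uniform sampling of distances t
    points = rays_o[..., None, :] + rays_d[..., None, :] * t[..., :, None]   # o + t*d
    return points, rays_d, t                             # points: (H, W, n_samples, 3)

pts, dirs, t = sample_ray_points(64, 64, focal=100.0, c2w=torch.eye(4)[:3])
```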
In step 103, the dynamic neural radiance field is defined as the mapping:
F: (a, d, x) → (c, σ);
where a denotes the 29-dimensional target audio feature, d the direction of the ray, x the spatial coordinates of the sampled ray point, c the color at the sampled ray point, σ the density at the sampled ray point, and the mapping F as a whole the dynamic neural radiance field.
Based on the foregoing, as an alternative embodiment, the present invention renders the face video from the resulting neural radiance field using volume rendering.
Volume rendering is implemented as follows:
C(r) = ∫_{tn}^{tf} T(t) σ(r(t)) c(r(t), d) dt,  with  T(t) = exp(−∫_{tn}^{t} σ(r(s)) ds);
where T(t) denotes the accumulated transmittance along the ray (i.e. how much light is left by the time it reaches the point), tn and tf denote the near and far bounds on the ray, σ(r(t)) denotes the density and c(r(t), d) the color of each sample point, so the contribution of a point to the pixel color is weighted by T(t)·σ(r(t)); C(r) denotes the predicted color of the pixel.
In the actual rendering process, the invention divides each ray evenly into N bins, randomly samples one point in each bin, and computes a weighted sum of the colors of the sampled points:
t_i ~ U[tn + (i−1)·(tf−tn)/N, tn + i·(tf−tn)/N];
C(r) ≈ Σ_{i=1..N} T_i · (1 − exp(−σ_i·δ_i)) · c_i,  with  T_i = exp(−Σ_{j<i} σ_j·δ_j);
where δ_i = t_{i+1} − t_i is the distance between adjacent samples.
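A minimal numerical version of this weighted summation (the standard NeRF-style quadrature) might look as follows; treating the last interval as effectively infinite is a common convention assumed here, not something stated in the patent.

```python
import torch

def composite(rgb, sigma, t_vals):
    """Numerical volume rendering sketch for the integral above.

    rgb    : (..., N, 3) colors c_i predicted by the perceptron at the N sample points
    sigma  : (..., N)    densities sigma_i at those points
    t_vals : (..., N)    distances t_i of the samples along each ray
    Returns the predicted color C(r) of each pixel.
    """
    deltas = t_vals[..., 1:] - t_vals[..., :-1]                        # delta_i = t_{i+1} - t_i
    deltas = torch.cat([deltas, torch.full_like(deltas[..., :1], 1e10)], dim=-1)
    alpha = 1.0 - torch.exp(-sigma * deltas)                           # opacity of each segment
    trans = torch.cumprod(torch.cat([torch.ones_like(alpha[..., :1]),
                                     1.0 - alpha[..., :-1] + 1e-10], dim=-1), dim=-1)  # T_i
    weights = trans * alpha                                            # T_i * (1 - exp(-sigma_i * delta_i))
    return (weights[..., None] * rgb).sum(dim=-2)                      # C(r) = sum_i w_i * c_i

# One ray with 64 samples
pixel_rgb = composite(torch.rand(1, 64, 3), torch.rand(1, 64), torch.linspace(0, 1, 64).expand(1, 64))
```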
The invention can collect classroom teaching speech (the target audio) as input, determine the speech data corresponding to the speech input operation, and extract the features of the speech with the DeepSpeech2 neural network. The features of the input audio signal are fed directly into the multi-layer perceptron to build a dynamic neural radiance field, and the outputs of the radiance field are composited by volume rendering into a high-fidelity talking-head video corresponding to the audio signal. The invention thus effectively links the audio signal with the facial deformation and avoids distortion of the expression and lip motion in the generated teacher video.
Based on the foregoing embodiments, as an alternative embodiment, fig. 4 is a second flow chart of the method for generating video using audio according to the present invention. As shown in fig. 4, the invention obtains the colors and densities of the ray points corresponding to the head and the torso separately, and then renders the target video matching the target audio using volume rendering.
The present invention also provides an apparatus for generating video using audio, the apparatus comprising:
the target audio feature extraction module is used for extracting target audio features from target audio;
the ray point sampling module is used for performing face parsing on the target image to obtain a face-parsed image, and casting a ray through each pixel of the face-parsed image to sample ray points;
the dynamic neural radiance field module is used for inputting the spatial coordinates and directions of the ray points, together with the target audio features, into the multi-layer perceptron to obtain the colors and densities of the ray points;
and the video generation module is used for rendering, by volume rendering, the target face video matched with the target audio based on the colors and densities of the ray points.
It should be noted that, when the apparatus for generating video using audio provided in the embodiment of the present invention specifically operates, the method for generating video using audio described in any one of the above embodiments may be executed, which is not described in detail in this embodiment.
Fig. 5 is a schematic structural diagram of an electronic device according to the present invention. As shown in fig. 5, the electronic device may include: a processor 510, a communication interface (Communications Interface) 520, a memory 530, and a communication bus 540, wherein the processor 510, the communication interface 520 and the memory 530 communicate with each other through the communication bus 540. The processor 510 may invoke logic instructions in the memory 530 to perform a method for generating video using audio, the method comprising: extracting target audio features from target audio; performing face parsing on a target image to obtain a face-parsed image, and casting a ray through each pixel of the face-parsed image to sample ray points; inputting the spatial coordinates and directions of the ray points, together with the target audio features, into a multi-layer perceptron to obtain the colors and densities of the ray points; and rendering, by volume rendering, a target face video matched with the target audio based on the colors and densities of the ray points.
Further, the logic instructions in the memory 530 described above may be implemented in the form of software functional units and may be stored in a computer-readable storage medium when sold or used as a stand-alone product. Based on this understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art or in a part of the technical solution, in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a random access Memory (RAM, random Access Memory), a magnetic disk, or an optical disk, or other various media capable of storing program codes.
In another aspect, the present invention also provides a computer program product comprising a computer program stored on a non-transitory computer readable storage medium, the computer program comprising program instructions which, when executed by a computer, cause the computer to perform the method for generating video using audio provided by the above embodiments, the method comprising: extracting target audio features from target audio; performing face parsing on a target image to obtain a face-parsed image, and casting a ray through each pixel of the face-parsed image to sample ray points; inputting the spatial coordinates and directions of the ray points, together with the target audio features, into a multi-layer perceptron to obtain the colors and densities of the ray points; and rendering, by volume rendering, a target face video matched with the target audio based on the colors and densities of the ray points.
In yet another aspect, the present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the method for generating video using audio provided by the above embodiments, the method comprising: extracting target audio features from target audio; performing face parsing on a target image to obtain a face-parsed image, and casting a ray through each pixel of the face-parsed image to sample ray points; inputting the spatial coordinates and directions of the ray points, together with the target audio features, into a multi-layer perceptron to obtain the colors and densities of the ray points; and rendering, by volume rendering, a target face video matched with the target audio based on the colors and densities of the ray points.
The apparatus embodiments described above are merely illustrative; the units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units, i.e. they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the invention without undue burden.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus necessary general hardware platforms, or of course may be implemented by means of hardware. Based on this understanding, the foregoing technical solution may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a computer readable storage medium, such as ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the respective embodiments or some parts of the embodiments.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.
Claims (10)
1. A method for generating video using audio, comprising:
extracting target audio features from target audio;
performing face parsing on a target image to obtain a face-parsed image, and casting a ray through each pixel of the face-parsed image to sample ray points;
inputting the spatial coordinates and directions of the ray points, together with the target audio features, into a multi-layer perceptron to obtain the colors and densities of the ray points;
and rendering, by volume rendering, a target face video matched with the target audio based on the colors and densities of the ray points.
2. The method of generating video from audio of claim 1, wherein the extracting target audio features from the target audio comprises:
inputting the target audio into a preset speech recognition network model to extract the target audio features of the target audio.
3. The method for generating video using audio according to claim 2, wherein the preset speech recognition network model is a DeepSpeech2 network;
the DeepSpeech2 network comprises 3 convolutional layers, 7 recurrent layers and 1 fully connected layer connected in sequence.
4. The method for generating video using audio according to claim 1, wherein performing face parsing on the target image to obtain a face parsed image comprises:
performing face parsing on the target image with a BiSeNet network to obtain the face-parsed image; the network structure of the BiSeNet network comprises a Spatial Path and a Context Path;
the Spatial Path includes three layers, each consisting of a stride-2 convolution followed by batch normalization and a ReLU activation function;
the Context Path rapidly downsamples the feature map to obtain a large receptive field and encodes high-level semantic context information.
5. The method of generating video using audio according to claim 1, wherein a ray is cast through each pixel of the face-parsed image, and the ray cast through any pixel is expressed as:
r = o + t·d;
where r denotes the ray, o denotes the camera origin, t denotes the distance from a point on the ray to the camera origin, and d denotes the direction of the ray.
6. The method for generating video using audio according to claim 5, wherein the target face video matched with the target audio is rendered by volume rendering based on the colors and densities of the ray points, and the corresponding formula is:
C(r) = ∫_{tn}^{tf} T(t) σ(r(t)) c(r(t), d) dt,  with  T(t) = exp(−∫_{tn}^{t} σ(r(s)) ds);
where T(t) denotes the accumulated transmittance along the ray, tn and tf denote the near and far bounds on the ray, σ(r(t)) denotes the density and c(r(t), d) the color of each sample point, and C(r) denotes the predicted color of each pixel.
7. The method of generating video from audio of claim 1, wherein the target audio is teacher teaching audio.
8. An apparatus for generating video using audio, comprising:
the target audio feature extraction module is used for extracting target audio features from target audio;
the ray point sampling module is used for performing face parsing on the target image to obtain a face-parsed image, and casting a ray through each pixel of the face-parsed image to sample ray points;
the dynamic neural radiance field module is used for inputting the spatial coordinates and directions of the ray points, together with the target audio features, into the multi-layer perceptron to obtain the colors and densities of the ray points;
and the video generation module is used for rendering, by volume rendering, the target face video matched with the target audio based on the colors and densities of the ray points.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the steps of the method of generating video using audio as claimed in any one of claims 1 to 7 when the computer program is executed.
10. A non-transitory computer readable storage medium having stored thereon a computer program, which when executed by a processor, implements the steps of the method of generating video from audio according to any of claims 1 to 7.
Priority Applications (1)
- CN202310243642.3A — priority date 2023-03-14, filing date 2023-03-14 — Method and device for generating video by utilizing audio
Applications Claiming Priority (1)
- CN202310243642.3A — priority date 2023-03-14, filing date 2023-03-14 — Method and device for generating video by utilizing audio
Publications (1)
- CN116389850A — published 2023-07-04
Family
- ID=86968512
Family Applications (1)
- CN202310243642.3A — Method and device for generating video by utilizing audio — filed 2023-03-14 in CN, published as CN116389850A, status: Pending
Legal Events
- PB01 — Publication
- SE01 — Entry into force of request for substantive examination