WO2023184714A1 - Method, apparatus, computing device and system for driving a virtual human to speak and for model training - Google Patents
Method, apparatus, computing device and system for driving a virtual human to speak and for model training
- Publication number
- WO2023184714A1 (PCT/CN2022/098739)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- video
- training
- speaking
- model
- lip synchronization
- Prior art date
Classifications
- G06T17/00 — Three dimensional [3D] modelling, e.g. data description of 3D objects
- G06V10/82 — Image or video recognition or understanding using pattern recognition or machine learning, using neural networks
- G06V10/774 — Image or video recognition or understanding; generating sets of training patterns; bootstrap methods, e.g. bagging or boosting
- G06V20/46 — Scene-specific elements in video content; extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
- G06V40/168 — Human faces, e.g. facial parts, sketches or expressions; feature extraction; face representation
- G06V40/18 — Eye characteristics, e.g. of the iris
- G06V40/20 — Movements or behaviour, e.g. gesture recognition
- G10L21/10 — Transformation of speech into a non-audible representation; transforming into visible information
- G10L2021/105 — Synthesis of the lips movements from speech, e.g. for talking heads
Definitions
- the present application relates to the field of artificial intelligence, and in particular to a method, device, computing device and system for driving virtual human speech and model training.
- Virtual Human refers to a synthetic three-dimensional model based on virtual human technology that simulates the movements, expressions, pronunciation, etc. of real people.
- in the related art, collected videos of people talking are used as a training set, and a model is trained on this training set so that the model can derive, from real-time speech, the parameters that drive the virtual human to speak and thereby obtain a video of the virtual human talking.
- collecting videos of people talking requires substantial manpower and material resources, so the amount of such video data that can be collected is limited. Because the training set built from videos of people talking is limited, the accuracy of the trained model is low, which in turn makes the parameters the model generates for driving the virtual human to speak inaccurate.
- Embodiments of the present application provide a method, device, computing device and system for driving virtual human speech and model training, which can solve the problem that a limited amount of training data leads to low model accuracy and inaccurate parameters for driving the virtual human to speak, thereby improving the accuracy of the parameters that drive virtual humans to speak.
- a first aspect provides a model training method for driving a virtual human to speak.
- this method can be executed by a computing device, such as a terminal on the end side or a training device on the cloud side. Specifically, it includes the following steps: the computing device generates an initial virtual human speaking video based on the audio data set and the character speaking video, and then generates a lip synchronization parameter generation model based on the initial virtual human speaking video, so that the lip synchronization parameters generated by the lip synchronization parameter generation model can be used to drive the target virtual human to speak and generate the video of the target virtual human speaking.
- the computing device uses the initial virtual human speaking video to expand the amount of training data for the lip synchronization parameter generation model.
- compared with extracting training data directly from recorded real-person speaking videos, the real-person speaking video that needs to be recorded is shorter, which reduces the manpower, material and other resources consumed by recording videos of real people talking. Therefore, the computing device can obtain a larger amount of training data based on a shorter video of a real person speaking, ensuring sufficient training data for the lip synchronization parameter generation model, so that the model trained by the computing device is highly accurate.
- this improves the generalization performance of the lip synchronization parameter generation model obtained through training, and enables the target virtual human driven by the lip synchronization parameters output by the lip synchronization parameter generation model to have better lip synchronization.
- the computing device uses the initial virtual human speaking video and the three-dimensional face reconstruction model to generate the lip synchronization parameter generation model. For example, the computing device determines the lip synchronization training parameters based on the initial virtual human speaking video and the three-dimensional face reconstruction model, the lip synchronization training parameters are used as labels for training the lip synchronization parameter generation model, and the model is then trained based on the audio data set and the lip synchronization training parameters. The lip synchronization training parameters serve as labels for model training, and the audio data set serves as model input data. The lip synchronization parameter generation model is used to generate lip synchronization parameters based on the input audio, and the lip synchronization parameters are used to drive the target virtual human to speak to obtain a video of the target virtual human speaking.
- the computing device determines lip synchronization training parameters based on the initial virtual human speaking video and the three-dimensional face reconstruction model.
- the computing device may map the character's speaking actions in the initial virtual human speaking video onto the three-dimensional face model, and extract the lip synchronization training parameters from the three-dimensional face model. Therefore, when the clarity of the initial virtual human speaking video is lower than the clarity of the target virtual human speaking video, directly extracting lip synchronization training parameters from the low-definition video is avoided, which ensures the accuracy of the lip synchronization training parameters and thereby improves the accuracy of the lip synchronization parameter generation model.
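- as an illustration of this label-extraction step, the following is a minimal sketch only: the 3D face reconstruction (3DMM fitting) stage is represented by a placeholder function, since this application does not prescribe a specific reconstruction implementation.

```python
# Illustrative sketch: extracting lip-synchronization training labels from the
# low-definition initial virtual-human speaking video via a 3D face reconstruction
# model. `fit_3dmm_frame` is a hypothetical placeholder for any 3DMM fitter that
# returns expression coefficients; it is not an API defined by this application.
import numpy as np
import cv2  # pip install opencv-python


def fit_3dmm_frame(frame_bgr: np.ndarray) -> np.ndarray:
    """Placeholder: map one video frame onto a 3D face model and return its
    expression (blendshape) coefficients, e.g. a 64-dim vector covering eye
    and lip movements. Replace with a real 3DMM / face-reconstruction fitter."""
    return np.zeros(64, dtype=np.float32)


def extract_lip_sync_labels(video_path: str) -> np.ndarray:
    """Return one row of lip-synchronization training parameters per frame."""
    capture = cv2.VideoCapture(video_path)
    labels = []
    while True:
        ok, frame = capture.read()
        if not ok:
            break
        labels.append(fit_3dmm_frame(frame))
    capture.release()
    return np.stack(labels) if labels else np.empty((0, 64), dtype=np.float32)
```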
- the computing device generates the initial virtual human speaking video based on the audio data set and the character speaking video, which may include: inputting the audio data set and the character speaking video into a pre-trained model to obtain an initial virtual human speaking video in which the speech in the audio data set drives the character in the character speaking video to speak. The duration of the character speaking video is shorter than the duration of the speech in the audio data set.
- the computing device uses the pre-trained model to quickly and simply generate a large amount of initial virtual human speaking video based on the multiple language voices, multiple timbre voices, and multiple content voices in the audio data set, thereby expanding the amount of training data.
- the pre-trained model can be a cross-modal speech-driven facial movement network model.
- the definition of the above-mentioned initial virtual human speaking video is lower than the definition of the virtual human speaking video to be generated, thereby reducing the computing resource overhead of the computing device for processing the video.
- the pre-trained model is used to extract character speaking features from character speaking videos, and output an initial virtual human speaking video based on the audio data set and the character speaking features.
- character speaking features are the facial features of the character speaking in the character speaking video.
- the computing device can use the pre-trained model to preprocess the character speaking video to obtain the character speaking features. For example, the computing device crops the facial area of the character in the character speaking video to obtain a face video, and then performs feature extraction on the face video to obtain the character speaking features.
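- the following is a minimal sketch of the face-cropping preprocessing step, assuming OpenCV's bundled Haar cascade as the face detector; the application itself does not prescribe a particular detection method.

```python
# Minimal preprocessing sketch: crop the speaker's facial area from one frame
# of the character speaking video. The detector choice and output size are
# assumptions for illustration only.
from typing import Optional
import cv2
import numpy as np

face_detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")


def crop_face_region(frame_bgr: np.ndarray, output_size=(96, 96)) -> Optional[np.ndarray]:
    """Return the largest detected face region, resized, or None if no face is found."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    faces = face_detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None
    x, y, w, h = max(faces, key=lambda box: box[2] * box[3])  # largest face box
    return cv2.resize(frame_bgr[y:y + h, x:x + w], output_size)
```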
- the computing device uses preprocessing steps to retain the facial movements of the character's speaking movements in the character's speaking video, ensuring the accuracy of the character's speaking features, thereby improving the accuracy of the lip synchronization parameter generation model obtained through training.
- the audio data set includes multiple language voices, multiple timbre voices, and multiple content voices.
- the recording of audio data sets does not require attention to the sitting posture, expressions, movements, lighting conditions, etc. of the real person being recorded, and the requirements for data recording are lower than those of video.
- the collection of audio data sets is faster and simpler than the video of real people speaking, so that a large number of audio data sets can be quickly collected and the difficulty of collecting training data is reduced.
- the audio data set includes speech data of different languages, timbres and contents, ensuring the diversity of training data, thereby improving the generalization ability of the lip synchronization parameter generation model.
- the voice data in the audio data set can be obtained by recording a person's speech, or can be obtained from the network or from a local database on the computing device.
- the audio data set may also include voice data in videos of people talking. This ensures the diversity of training data and improves the generalization ability of the lip synchronization parameter generation model.
- the audio data set can also include audio from videos of people talking to further expand the training set of the lip synchronization parameter generation model.
- the lip synchronization training parameters and the lip synchronization parameters are the same parameters that represent the expression movements of the three-dimensional face model.
- the lip synchronization training parameters may include eye feature parameters and lip feature parameters.
- the eye feature parameters may include parameters representing eye movements such as eye opening, eye closing, eye opening size, and gaze direction.
- Lip feature parameters may include parameters representing mouth movements such as mouth opening, mouth closing, and mouth opening size.
- lip synchronization training parameters may also include head feature parameters, eyebrow feature parameters, etc.
- the lip synchronization training parameters include characteristic parameters of multiple parts of the human face, which is beneficial to improving the diversity of virtual human speech movements and enhancing the generalization performance of the lip synchronization parameter generation model.
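- purely for illustration, the grouping of these characteristic parameters can be represented as follows; the field names and dimensions below are assumptions, not values defined by this application.

```python
# Illustrative grouping of the lip-synchronization (training) parameters
# described above; dimensions are placeholders, not specified by the application.
from dataclasses import dataclass, field
import numpy as np


@dataclass
class LipSyncParameters:
    eye: np.ndarray = field(default_factory=lambda: np.zeros(8))      # opening/closing, opening size, gaze direction
    lip: np.ndarray = field(default_factory=lambda: np.zeros(20))     # mouth opening/closing, mouth opening size
    head: np.ndarray = field(default_factory=lambda: np.zeros(3))     # optional head feature parameters
    eyebrow: np.ndarray = field(default_factory=lambda: np.zeros(4))  # optional eyebrow feature parameters

    def as_vector(self) -> np.ndarray:
        """Concatenate all groups into the flat vector a driving model would output."""
        return np.concatenate([self.eye, self.lip, self.head, self.eyebrow])
```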
- a method for driving a virtual person to speak is provided.
- the method can be executed by a computing device, such as a terminal on the end side, and specifically includes the following steps: the computing device obtains the input audio and the first-definition video of the character speaking, inputs the input audio into the lip synchronization parameter generation model, and obtains the video of the target virtual human speaking.
- the training set of the lip synchronization parameter generation model and the target virtual human are obtained based on the first-definition video of the person speaking, and the first definition is lower than the definition of the video of the target virtual human speaking.
- the first-definition video of the character talking is the initial video of the virtual person talking.
- the amount of training data for the lip synchronization parameter generation model is larger, so the accuracy and generalization performance of the lip synchronization parameter generation model obtained through training is better. Therefore, the accuracy of the lip synchronization parameters output by the lip synchronization parameter generation model is high, and the lip synchronization of the virtual human driven by the computing device based on the lip synchronization parameters is high.
- the step of generating a video of the target virtual person speaking may be that the computing device obtains the lip synchronization parameters output by the lip synchronization parameter generation model, and drives the target virtual person to speak according to the lip synchronization parameters to obtain a video of the target virtual person speaking.
- the computing device may also update the lip synchronization parameter generation model. For example, the computing device generates an initial virtual human speaking video based on the input audio and the target virtual human speaking video, where the duration of the initial virtual human speaking video is greater than the duration of the target virtual human speaking video; the computing device then uses the initial virtual human speaking video to update the lip synchronization parameter generation model. This improves the generalization ability and accuracy of the lip synchronization parameter generation model.
- a model training device for driving virtual human speech including: a video generation module and a training module.
- the video generation module is used to generate an initial virtual human speech video based on audio data sets and character speech videos. Among them, the duration of the initial virtual human talking video is longer than the duration of the character talking video.
- the training module is used to generate a lip synchronization parameter generation model using the initial virtual human speaking video.
- the lip synchronization parameter generation model is used to obtain the target virtual human speaking video.
- the clarity of the initial virtual human speaking video is lower than the clarity of the target virtual human speaking video.
- the training module is specifically used to: generate a lip synchronization parameter generation model using an initial virtual human speaking video and a three-dimensional face reconstruction model.
- the training module is specifically used to: use a three-dimensional face reconstruction model to extract lip synchronization training parameters from the initial virtual human speaking video; and use the lip synchronization training parameters as labels and the audio data set as model input data to train and obtain the lip synchronization parameter generation model.
- the video generation module is specifically used to: input the audio data set and the character speaking video into the pre-trained model to obtain an initial virtual human speaking video in which the speech in the audio data set drives the character in the character speaking video to speak. The duration of the character speaking video is shorter than the duration of the speech in the audio data set.
- the pre-trained model is used to extract character speaking features from character speaking videos, and output an initial virtual human speaking video based on the audio data set and character speaking features.
- the duration of the character talking video is less than or equal to 5 minutes, and the duration of the initial virtual person talking video is greater than or equal to ten hours.
- the audio data set includes multiple language voices, multiple timbre voices, and multiple content voices.
- the lip synchronization parameters include eye feature parameters and lip feature parameters.
- the audio data set contains audio from videos of people talking.
- a device for driving a virtual human to speak including: an input module and a model processing module.
- the input module is used to obtain input audio and target virtual humans.
- the model processing module is used to generate a video of the target virtual human speaking based on the input audio, using the lip synchronization parameter generation model; the training set of the lip synchronization parameter generation model is obtained based on a first-definition video of a person speaking and a three-dimensional face reconstruction model, and the first definition is lower than the definition of the target virtual human speaking video.
- the device for driving the virtual human to speak further includes: a training module configured to update the lip synchronization parameter generation model according to the input audio.
- the training module is specifically used to: generate an initial virtual human speaking video based on the input audio and the target virtual human speaking video, where the duration of the initial virtual human speaking video is longer than the duration of the target virtual human speaking video; and use the initial virtual human speaking video to update the lip synchronization parameter generation model.
- the model training device for driving a virtual human to speak described in the third aspect or the device for driving a virtual human to speak described in the fourth aspect may be a terminal device or a network device, may be a chip (system) or another component or assembly configured in a terminal device or a network device, or may be an apparatus that includes a terminal device or a network device; this is not limited in this application.
- for the technical effects of the model training device for driving a virtual human to speak described in the third aspect, refer to the technical effects of the model training method for driving a virtual human to speak described in the first aspect; for the technical effects of the device for driving a virtual human to speak described in the fourth aspect, refer to the technical effects of the method for driving a virtual human to speak described in the second aspect. Details are not described again here.
- a computing device including a memory and a processor.
- the memory is used to store a set of computer instructions.
- when the processor executes the set of computer instructions, it is used to perform the operation steps of the method in any possible implementation of the first aspect or the second aspect.
- the technical effects of the computing device described in the fifth aspect can be referred to the technical effects of the model training method for driving the virtual human to speak described in the first aspect, or the technical effects of the method of driving the virtual human to speak described in the second aspect, No further details will be given here.
- a sixth aspect provides a system for driving a virtual person to speak.
- the system for driving a virtual person to speak includes a training device and at least one terminal. At least one terminal is connected to the training device.
- the training device is used to perform any of the possible implementation methods in the first aspect.
- at least one terminal is used to perform the operation steps of the method for driving a virtual human to speak in any possible implementation manner of the second aspect.
- a computer-readable storage medium is provided, including computer software instructions; when the computer software instructions are run in a data processing system, the system for driving a virtual human to speak is caused to perform the operation steps of the method in any possible implementation of the first aspect or the second aspect.
- a computer program product is provided.
- when the computer program product runs on a computer, it causes the data processing system to perform the operation steps of the method described in any possible implementation manner of the first aspect or the second aspect.
- Figure 1 is a schematic structural diagram of a neural network provided by an embodiment of the present application.
- Figure 2 is a schematic structural diagram of a convolutional neural network provided by an embodiment of the present application.
- Figure 3 is a schematic structural diagram of a virtual human generation model provided by an embodiment of the present application.
- Figure 4 is a schematic diagram of a virtual human construction cycle provided by an embodiment of the present application.
- Figure 5 is an architectural schematic diagram of a system for driving virtual human speech provided by an embodiment of the present application.
- Figure 6a is a schematic diagram of a model training method for driving a virtual human to speak provided by an embodiment of the present application
- Figure 6b is a schematic diagram of another model training method for driving a virtual human to speak provided by an embodiment of the present application
- Figure 7 is a schematic diagram of a method for driving a virtual human to speak provided by an embodiment of the present application.
- Figure 8 is a schematic diagram of a model training device for driving a virtual human to speak provided by an embodiment of the present application
- Figure 9 is a schematic diagram of a device for driving a virtual human to speak provided by an embodiment of the present application.
- FIG. 10 is a schematic structural diagram of a computing device provided by an embodiment of the present application.
- a neural network can be composed of neurons, and a neuron can refer to an operation unit that takes $x_s$ and an intercept of 1 as input. The output of this operation unit satisfies the following formula (1): $h_{W,b}(x) = f(W^{T}x) = f\left(\sum_{s=1}^{n} W_{s} x_{s} + b\right)$
- where $s = 1, 2, \ldots, n$, $n$ is a natural number greater than 1, $W_s$ is the weight of $x_s$, and $b$ is the bias of the neuron.
- $f$ is the activation function of the neuron, which is used to introduce nonlinear characteristics into the neural network to convert the input signal of the neuron into an output signal.
- the output signal of the activation function can be used as the input of the next layer, and the activation function can be a sigmoid function.
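- as a minimal sketch, formula (1) with a sigmoid activation can be written as:

```python
# Minimal numpy sketch of formula (1): a single neuron with sigmoid activation.
import numpy as np


def sigmoid(a: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-a))


def neuron_output(x: np.ndarray, W: np.ndarray, b: float) -> float:
    """h_{W,b}(x) = f(sum_s W_s * x_s + b) with f = sigmoid."""
    return float(sigmoid(np.dot(W, x) + b))


print(neuron_output(np.array([0.5, -1.2, 3.0]), np.array([0.1, 0.4, -0.2]), b=0.05))
```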
- a neural network is a network formed by connecting multiple single neurons mentioned above, that is, the output of one neuron can be the input of another neuron.
- the input of each neuron can be connected to the local receptive field of the previous layer to extract the features of the local receptive field.
- the local receptive field can be an area composed of several neurons. Weights represent the strength of connections between different neurons. Weight determines the influence of input on output. A weight close to 0 means that changing the input does not change the output. Negative weights mean that increasing input decreases output.
- the neural network 100 includes N processing layers, where N is an integer greater than or equal to 3.
- the first layer of the neural network 100 is the input layer 110, which is responsible for receiving input signals.
- the last layer of the neural network 100 is the output layer 130, which is responsible for outputting the processing results of the neural network.
- the other layers except the first layer and the last layer are intermediate layers 140. These intermediate layers 140 together form a hidden layer 120.
- Each intermediate layer 140 in the hidden layer 120 can both receive input signals and output signals.
- the hidden layer 120 is responsible for the processing of the input signal.
- Each layer represents a logical level of signal processing. Through multiple layers, the data signal can be processed by multi-level logic.
- the input signal of the neural network may be a video signal, a voice signal, a text signal, an image signal or a temperature signal, etc. in various forms.
- the voice signal can be a sensor signal such as a human voice audio signal, for example talking or singing, recorded by a microphone (sound sensor).
- the input signals of the neural network also include various other computer-processable engineering signals, which will not be listed here. If a neural network is used to perform deep learning on image signals, the quality of images processed by the neural network can be improved.
- Convolutional Neural Network is a deep neural network with a convolutional structure.
- the convolutional neural network contains a feature extractor composed of convolutional layers and subsampling layers.
- the feature extractor can be regarded as a filter, and the convolution process can be regarded as using a trainable filter to convolve with an input image or feature map.
- the convolutional layer refers to the neuron layer in the convolutional neural network that convolves the input signal.
- a neuron can be connected to only some of the neighboring layer neurons.
- a convolutional layer can output several feature maps, and the feature map can refer to the intermediate result during the operation of the convolutional neural network.
- Neurons in the same feature map share weights, and the shared weights here are convolution kernels.
- Shared weights can be understood as a way to extract image information independent of position. That is, the statistics of one part of the image are the same as those of other parts. This means that the image information learned in one part can also be used in another part. Therefore, the same learned image information can be used for all positions on the image.
- multiple convolution kernels can be used to extract different image information. Generally, the greater the number of convolution kernels, the richer the image information reflected by the convolution operation.
- the convolution kernel can be initialized in the form of a random-sized matrix. During the training process of the convolutional neural network, the convolution kernel can obtain reasonable weights through learning. In addition, the direct benefit of sharing weights is to reduce the connections between the layers of the convolutional neural network, while reducing the risk of overfitting.
- the convolutional neural network 200 may include an input layer 210, a convolutional/pooling layer 220 (where the pooling layer is optional), and a neural network layer 230.
- the convolutional layer/pooling layer 220 may include layers 221 to 226, for example.
- layer 221 may be a convolution layer
- layer 222 may be a pooling layer
- layer 223 may be a convolution layer
- layer 224 may be a pooling layer
- layer 225 may be a convolution layer
- layer 226 may be, for example, a pooling layer.
- layer 221 and layer 222 may be, for example, a convolution layer
- layer 223 may be, for example, a pooling layer
- layer 224 and layer 225 may be, for example, a convolution layer
- layer 226 may be, for example, a pooling layer.
- the output of a convolutional layer can be used as the input of a subsequent pooling layer or as the input of another convolutional layer to continue the convolution operation.
- the convolution layer 221 may include many convolution operators, and the convolution operators may also be called kernels.
- the role of the convolution operator in image processing is equivalent to a filter that extracts specific information from the input image matrix.
- the convolution operator can essentially be a weight matrix, which is usually predefined. The size of this weight matrix is related to the size of the image. It should be noted that the depth dimension of the weight matrix is the same as the depth dimension of the input image. During the convolution operation, the weight matrix extends to the entire depth of the input image.
- convolution with a single weight matrix produces a convolved output with a single depth dimension, but in most cases, instead of a single weight matrix, multiple weight matrices of the same size (rows × columns), that is, multiple matrices of the same type, are applied.
- the output of each weight matrix is stacked to form the depth dimension of the convolved image.
- different weight matrices can be used to extract different features of the image. For example, one weight matrix is used to extract edge information of the image, another weight matrix is used to extract specific colors of the image, and yet another weight matrix is used to blur unwanted noise in the image, and so on.
- the multiple weight matrices have the same size (rows × columns), so the feature maps extracted by them also have the same size; the extracted feature maps of the same size are then merged to form the output of the convolution operation.
- weight values in these weight matrices require a large amount of training in practical applications.
- each weight matrix formed by the weight values obtained through training can be used to extract information from the input image, thereby allowing the convolutional neural network 200 to make correct predictions.
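- the following PyTorch sketch illustrates this point: several convolution kernels (weight matrices) applied to one input image produce feature maps that are stacked along the depth (channel) dimension; the sizes used are illustrative only.

```python
# Sketch: multiple convolution kernels applied to one image stack their
# outputs along the depth (channel) dimension.
import torch
import torch.nn as nn

image = torch.randn(1, 3, 224, 224)          # batch of one RGB image
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)

features = conv(image)                       # 16 kernels -> 16 stacked feature maps
print(features.shape)                        # torch.Size([1, 16, 224, 224])
```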
- the features extracted by the initial convolutional layer (for example, layer 221) are often relatively general low-level features;
- the features extracted by the later convolutional layers become more and more complex, such as high-level semantic features.
- features with higher-level semantics are more applicable to the problem to be solved.
- pooling layers are often introduced periodically after the convolutional layer.
- each layer from layer 221 to layer 226 shown in the convolutional layer/pooling layer 220 in Figure 2 may be one convolutional layer followed by one pooling layer, or multiple convolutional layers followed by one or more pooling layers.
- the pooling layer may include an average pooling operator and/or a maximum pooling operator for sampling the input image to obtain a smaller size image.
- the average pooling operator can calculate the pixel values in the image within a specific range to generate an average value as the result of average pooling.
- the max pooling operator can take the pixel with the largest value in a specific range as the result of max pooling.
- the operators in the pooling layer should also be related to the size of the image.
- the size of the image output after processing by the pooling layer can be smaller than the size of the image input to the pooling layer.
- Each pixel in the image output by the pooling layer represents the average or maximum value of the corresponding sub-region of the image input to the pooling layer.
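- a short sketch of average pooling and max pooling reducing the spatial size of a feature map (sizes illustrative):

```python
# Sketch: average and max pooling shrink the spatial size of a feature map.
import torch
import torch.nn as nn

feature_map = torch.randn(1, 16, 224, 224)
avg_pool = nn.AvgPool2d(kernel_size=2, stride=2)   # each output pixel = mean of a 2x2 region
max_pool = nn.MaxPool2d(kernel_size=2, stride=2)   # each output pixel = max of a 2x2 region

print(avg_pool(feature_map).shape, max_pool(feature_map).shape)  # both [1, 16, 112, 112]
```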
- after being processed by the convolutional layer/pooling layer 220, the convolutional neural network 200 is not yet able to output the required output information, because, as mentioned above, the convolutional layer/pooling layer 220 only extracts features and reduces the parameters brought by the input image. However, in order to generate the final output information (the required class information or other related information), the convolutional neural network 200 needs to use the neural network layer 230 to generate one output or a group of outputs whose number equals the required number of classes. Therefore, the neural network layer 230 may include multiple hidden layers (layer 231, layer 232 to layer 23n as shown in Figure 2) and an output layer 240. The parameters included in the multiple hidden layers may be pre-trained based on training data relevant to a specific task type. For example, the task type can include image recognition, image classification, image super-resolution reconstruction, etc.
- the output layer 240 has a loss function similar to classification cross entropy, specifically used to calculate the prediction error.
- once the forward propagation of the entire convolutional neural network 200 (propagation in the direction from layer 210 to layer 240 in Figure 2) is completed, back propagation (propagation in the direction from layer 240 to layer 210 in Figure 2) starts to update the weight values and biases of the layers mentioned above, so as to reduce the loss of the convolutional neural network 200 and the error between the result output by the convolutional neural network 200 through the output layer and the ideal result.
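- the following is a minimal sketch of one forward/backward pass: a classification loss is computed at the output layer and back-propagated to update the weights; the tiny network and random data are placeholders, not the network of Figure 2.

```python
# Sketch of one training step: forward pass, loss at the output layer, back
# propagation, and a weight/bias update.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 128), nn.ReLU(), nn.Linear(128, 10))
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

images = torch.randn(8, 3, 32, 32)
labels = torch.randint(0, 10, (8,))

logits = model(images)            # forward propagation
loss = criterion(logits, labels)  # prediction error at the output layer
optimizer.zero_grad()
loss.backward()                   # back propagation of the error loss
optimizer.step()                  # update weights and biases to reduce the loss
```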
- the convolutional neural network 200 shown in Figure 2 is only an example of a convolutional neural network.
- the convolutional neural network can also exist in the form of other network models, such as U-Net, 3D Morphable Face Model (3DMM) and Residual Network (ResNet), etc.
- RNN (Recurrent Neural Network): in an ordinary neural network model, the layers are fully connected, while the nodes within each layer are unconnected.
- although this ordinary neural network has solved many difficult problems, it is still incapable of handling many others. For example, to predict the next word of a sentence, the previous words are generally needed, because the preceding and following words in a sentence are not independent. The reason why an RNN is called a recurrent neural network is that the current output of a sequence is also related to the previous outputs.
- RNN can process sequence data of any length.
- the training of RNN is the same as the training of traditional CNN or DNN.
- the error backpropagation algorithm is also used, but there is one difference: that is, if the RNN is expanded into a network, then the parameters, such as W, are shared; this is not the case with the traditional neural network as shown in the example above.
- the output of each step not only depends on the network of the current step, but also depends on the status of the network of several previous steps. This learning algorithm is called Back propagation Through Time (BPTT).
- recurrent neural networks can exist in the form of various network models, such as Long Short Term Memory Networks (LSTM), end-to-end speech synthesis model (Char2Wav) based on deep learning, etc.
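- as a small sketch, an LSTM consuming a sequence (for example, per-frame audio features) produces an output at each step that depends on the current input and on the states of previous steps; the dimensions are illustrative.

```python
# Sketch: an LSTM over a sequence of 40-dim audio feature frames.
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=40, hidden_size=128, num_layers=2, batch_first=True)
audio_features = torch.randn(1, 100, 40)        # 100 time steps of 40-dim features

outputs, (h_n, c_n) = lstm(audio_features)      # one output vector per time step
print(outputs.shape)                            # torch.Size([1, 100, 128])
```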
- a Generative Adversarial Network (GAN) is a deep learning model.
- the model includes at least two modules: one module is a generative model (Generative Model), and the other module is a discriminative model (Discriminative Model). Through these two modules, they learn from each other to produce better output.
- Both the generative model and the discriminative model can be neural networks, specifically deep neural networks or convolutional neural networks.
- the basic principle of GAN is as follows: Take the GAN that generates pictures as an example. Suppose there are two networks, G (Generator) and D (Discriminator), where G is a network that generates pictures.
- D is a discriminative network used to judge whether a picture is "real". Its input parameter is x, where x represents a picture, and the output D(x) represents the probability that x is a real picture. If the output is 1, the picture is definitely real; if the output is 0, the picture cannot be real.
- the goal of the generative network G is to generate pictures that are as realistic as possible to deceive the discriminative network D, and the goal of the discriminative network D is to distinguish the pictures generated by G from real pictures as well as possible.
- G and D thus constitute a dynamic "game" process, that is, the "confrontation" in the "generative adversarial network".
- generative adversarial networks can exist in the form of various network models, such as the lip synthesis model (Wav2Lip) based on generative adversarial networks.
- the lip synthesis model is used to synchronize the mouth shape in a dynamic image with the input audio.
- Another example is the Pix2Pix network using U-Net network structure.
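- the adversarial training described above can be sketched as follows; the shapes and hyper-parameters are illustrative and this is not the Wav2Lip or Pix2Pix architecture itself.

```python
# Highly simplified GAN sketch: G maps noise to fake samples, D scores real vs.
# fake, and the two networks are trained adversarially.
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 28 * 28), nn.Tanh())
D = nn.Sequential(nn.Linear(28 * 28, 64), nn.LeakyReLU(0.2), nn.Linear(64, 1), nn.Sigmoid())
bce = nn.BCELoss()
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)

for _ in range(100):
    real = torch.rand(32, 28 * 28) * 2 - 1            # stand-in for real pictures
    noise = torch.randn(32, 16)

    # D tries to tell real pictures (label 1) from generated ones (label 0)
    d_loss = bce(D(real), torch.ones(32, 1)) + bce(D(G(noise).detach()), torch.zeros(32, 1))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # G tries to make D score its generated pictures as real
    g_loss = bce(D(G(noise)), torch.ones(32, 1))
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()
```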
- the convolutional neural network can use the error back propagation (BP) algorithm to modify the size of the parameters in the initial super-resolution model during the training process, so that the reconstruction error loss of the super-resolution model becomes smaller and smaller.
- specifically, the input signal is propagated forward until the output produces an error loss, and the parameters in the initial super-resolution model are updated by back-propagating the error loss information, so that the error loss converges.
- the backpropagation algorithm is a backpropagation movement dominated by error loss, aiming to obtain the optimal parameters of the super-resolution model, such as the weight matrix.
- Modality refers to the existence form of data, such as text, audio, image, video and other file formats. Some data exist in different forms, but they all describe the same thing or event. For example, the voice and action images of a character speaking a word in a video are all data of the character speaking the word.
- Cross-modal driving refers to using data from one modality to generate data from another modality. For example, the text data is generated based on voice data in which a person reads the text data, or the voice data is generated based on a speaking video in which the voice in the voice data is spoken.
- ObamaNet is a virtual human generation model dedicated to a specific image.
- the steps to build a virtual human based on ObamaNet can include processes such as shooting, voice processing, transcription processing and acceptance, so as to develop and provide customer-specific virtual human models.
- ObamaNet mainly includes a text-to-speech network based on Char2Wav, an LSTM for generating lip synchronization parameters (Keypoints) related to audio synchronization, and a virtual human generation model for generating a virtual human speaking video through lip synchronization parameters.
- the virtual human generation model is a Pix2Pix network using the U-Net network structure.
- the working process of ObamaNet is as follows: the computing device obtains the text, converts the text to speech through the text-to-speech network, converts the speech into lip synchronization parameters that drive the virtual human to speak through the LSTM, and inputs the lip synchronization parameters into the virtual human generation model to generate the final virtual human speaking video.
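- schematically, this ObamaNet-style pipeline can be sketched with placeholder stages; the real system uses a Char2Wav-based text-to-speech network, an LSTM keypoint predictor and a U-Net-based Pix2Pix renderer, while the function bodies below are stubs, not the actual models.

```python
# Schematic sketch of the text -> speech -> lip-sync parameters -> video pipeline.
# Every stage below is a hypothetical placeholder.
from typing import List


def text_to_speech(text: str) -> List[float]:
    """Placeholder for the Char2Wav-based text-to-speech network."""
    return [0.0] * 16000


def speech_to_keypoints(audio: List[float]) -> List[List[float]]:
    """Placeholder for the LSTM that maps audio to lip-synchronization keypoints."""
    return [[0.0] * 20 for _ in range(25)]


def keypoints_to_video(keypoints: List[List[float]]) -> List[bytes]:
    """Placeholder for the Pix2Pix (U-Net) virtual-human generation model."""
    return [b"" for _ in keypoints]


def obamanet_like_pipeline(text: str) -> List[bytes]:
    audio = text_to_speech(text)                 # text -> speech
    keypoints = speech_to_keypoints(audio)       # speech -> lip-sync parameters
    return keypoints_to_video(keypoints)         # parameters -> speaking video frames
```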
- the virtual human model construction process is supervised training, which requires a large amount of real-person speaking video and audio to train a virtual human that can adapt to any text (audio). Therefore, the cycle of building a virtual human based on ObamaNet is relatively long. For example, the cycle of building a virtual human based on ObamaNet can refer to Figure 4.
- the production cycle of building a virtual human can include: two weeks to capture data (collect photos, audio and other data), three weeks to process data (convert text to speech), one week to learn the data (train the deep learning neural network), and one week to adapt to the service (configure the model and provide a data interface for the virtual human model).
- this application provides a model training method for driving a virtual human to speak, and in particular a model training method that expands the training data based on an audio data set: the computing device generates an initial virtual human speaking video based on the audio data set and the character speaking video.
- the lip synchronization training parameters are determined based on the initial virtual human speaking video, and the computing device then uses the lip synchronization training parameters as labels, and trains the lip synchronization parameter generation model based on the audio data set and the lip synchronization training parameters.
- the computing device expands the training data of the lip synchronization parameter generation model based on the initial virtual human speaking video generated from the audio data set.
- the computing device can obtain more training data in a short time, thereby improving the accuracy and generalization performance of the lip synchronization parameter generation model trained with lip synchronization training parameters.
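- a minimal training sketch under these assumptions: audio features from the audio data set are the model input, the lip synchronization training parameters extracted from the initial virtual human speaking video are the labels, and the network architecture and feature dimensions below are illustrative assumptions rather than the model defined by this application.

```python
# Sketch of training a lip-synchronization parameter generation model with
# audio features as input and lip-sync training parameters as labels.
import torch
import torch.nn as nn


class LipSyncParameterModel(nn.Module):
    def __init__(self, audio_dim: int = 40, param_dim: int = 64):
        super().__init__()
        self.lstm = nn.LSTM(audio_dim, 128, num_layers=2, batch_first=True)
        self.head = nn.Linear(128, param_dim)

    def forward(self, audio_features: torch.Tensor) -> torch.Tensor:
        hidden, _ = self.lstm(audio_features)
        return self.head(hidden)                 # one parameter vector per frame


model = LipSyncParameterModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.MSELoss()

# Placeholder batch: 8 clips, 100 frames, 40-dim audio features / 64-dim labels.
audio_batch = torch.randn(8, 100, 40)
label_batch = torch.randn(8, 100, 64)            # lip-sync training parameters (labels)

for epoch in range(10):
    prediction = model(audio_batch)
    loss = criterion(prediction, label_batch)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```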
- Figure 5 is an architectural schematic diagram of a system for driving virtual human speech provided by an embodiment of the present application.
- the system 500 includes an execution device 510, a training device 520, a database 530, a terminal device 540, a data storage system 550 and a data collection device 560.
- the execution device 510 may be a terminal, such as a mobile phone terminal, a tablet computer, a laptop, a virtual reality (VR) device, an augmented reality (AR) device, a mixed reality (MR) device, an extended reality (ER) device, a camera or a vehicle-mounted terminal, etc., or an edge device (for example, a box carrying a chip with processing capabilities), etc.
- the training device 520 may be a terminal or other computing device, such as a server or cloud device.
- the execution device 510 and the training device 520 are different processors deployed on different physical devices (such as servers or servers in a cluster).
- the execution device 510 may be a graphics processing unit (GPU), a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc.
- a general-purpose processor can be a microprocessor or any conventional processor, etc.
- the training device 520 may be a graphics processing unit (GPU), a neural network processing unit (NPU), a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits used to control the execution of the program of this application.
- the execution device 510 and the training device 520 are deployed on the same physical device, or the execution device 510 and the training device 520 are the same physical device.
- the data collection device 560 is used to collect training data and store the training data in the database 530.
- the data collection device 560, the execution device 510 and the training device 520 can be the same or different devices.
- the training data includes at least one form of data among images, speech and text.
- the training data includes training audio and targets in the training audio, and the targets in the training audio may refer to labels of the training audio.
- the training device 520 is used to train the neural network using the training data until the loss function in the neural network converges and the loss function value is less than a specific threshold, at which point the neural network training is completed, so that the neural network reaches a certain accuracy. Alternatively, if all the training data in the database 530 has been used for training, the neural network training is completed, so that the trained neural network has functions such as cross-modal driving of virtual humans and lip synchronization parameter generation. Furthermore, the training device 520 configures the trained neural network 501 to the execution device 510. The execution device 510 is used to implement the function of processing application data according to the trained neural network 501.
- the execution device 510 and the training device 520 are the same computing device.
- the computing device can configure the trained neural network 501 to itself, and use the trained neural network 501 to realize functions such as cross-modal driving of virtual humans and lip synchronization parameter generation.
- the training device 520 can configure the trained neural network 501 to multiple execution devices 510 .
- each execution device 510 uses the trained neural network 501 to implement functions such as cross-modal driving of virtual humans and lip synchronization parameter generation.
- the model training method for driving a virtual human to speak provided in this embodiment can be applied in cross-modal driving scenarios.
- the model training method of the embodiment of the present application can be applied in scenarios such as terminal production of dynamic images and virtual image live broadcast.
- the scenarios of terminal production of dynamic images and virtual image live broadcast are briefly introduced below.
- the user uses the training device 520 (for example: mobile phone, computer, tablet) to obtain the audio data set and the video of the person talking.
- the audio data set and the video of the person talking can be downloaded by the training device 520 from a network database or obtained by recording real people speaking; the audio data set can contain audio of people speaking.
- the user operation training device 520 obtains a lip synchronization parameter generation model based on the audio data set and the character speaking video.
- the training device 520 generates an initial virtual human speaking video based on the audio data set and the character speaking features in the character speaking video, determines the lip synchronization training parameters based on the initial virtual human speaking video, and then uses the lip synchronization training parameters as labels and the audio data set as training data to train the lip synchronization parameter generation model.
- because the training device 520 uses the initial virtual human speaking video to expand the data volume of the lip synchronization training parameters, compared with extracting the lip synchronization training parameters from a recorded real-person speaking video, the real-person speaking video to be recorded is shorter, and the amount of character speaking video data that the training device 520 needs to obtain or record is small.
- the training device 520 generates lip synchronization training parameters based on the initial virtual human speaking video. Since the lip synchronization training parameters are used as labels for model training, the requirements for the clarity of the initial virtual human speaking video are low.
- the training device 520 generates lip synchronization training parameters based on the initial virtual human speaking video with low definition, which reduces the calculation amount of generating labels and increases the processing speed. Therefore, terminals such as mobile phones, computers, and tablets whose processing capabilities are weaker than that of a dedicated graphics and audio processing server can also be used as the training device 520 to train the lip synchronization parameter generation model.
- the operator uses the training device 520 (for example, a cloud device, a server) to obtain an audio data set and a video of a character talking.
- the audio data set and the video of the character talking can be downloaded by the training device 520 from a network database or obtained by recording real people speaking; the audio data set may contain audio of people singing.
- the steps for the operator to operate the training device 520 to obtain the lip synchronization parameter generation model based on the audio data set and the character speaking video are the same as the steps for obtaining the lip synchronization parameter generation model in the scene where the terminal produces dynamic images, and will not be described again here.
- the model training method of the embodiment of the present application is applied to the scene where the virtual image performs online live broadcast.
- the training device 520 needs to obtain or record only a small amount of character speaking video data, and the calculation amount of generating the labels is small and the processing speed is fast, which improves the construction efficiency of the lip synchronization parameter generation model and reduces the cost of model training.
- the training device 520 generates lip synchronization training parameters based on the initial virtual human speaking video, ensuring sufficient training data for the lip synchronization parameter generation model and improving the generalization performance and accuracy of the lip synchronization parameter generation model.
- the training data (for example, audio data sets, people talking videos) maintained in the database 530 may not necessarily come from the data collection device 560, but may also be received from other devices.
- the training device 520 does not necessarily train the neural network entirely based on the training data maintained by the database 530. It may also obtain training data from the cloud or other places to train the neural network.
- the above description should not be used as a limitation on the embodiments of the present application.
- the execution device 510 can be further subdivided into the architecture shown in Figure 5. As shown in the figure, the execution device 510 is configured with a computing module 511, an I/O interface 512 and a preprocessing module 513.
- the I/O interface 512 is used for data interaction with external devices.
- the user can input data to the I/O interface 512 through the terminal device 540.
- Input data can include images or videos. Additionally, input data may also come from database 530.
- The preprocessing module 513 is used to perform preprocessing on the input data received by the I/O interface 512.
- In this embodiment of the present application, the preprocessing module 513 may be used to identify the application scenario characteristics of the application data received from the I/O interface 512.
- When the execution device 510 preprocesses the input data, or when the computing module 511 of the execution device 510 performs calculation and other related processing, the execution device 510 can call data, code, etc. in the data storage system 550 for the corresponding processing, and the data and instructions obtained from that processing can also be stored in the data storage system 550.
- For example, the first neural network stored by the execution device 510 may be applied on the execution device 510.
- After the execution device 510 obtains application data, the computing module 511 inputs the application data into the first neural network to obtain a processing result. Since the first neural network is trained by the training device 520 based on grouped data with similar application scenario characteristics, using the first neural network to process the application data can meet the user's accuracy requirements for data processing.
- the I/O interface 512 returns the processing result to the terminal device 540, thereby providing it to the user so that the user can view the processing result. It should be understood that the terminal device 540 and the execution device 510 may also be the same physical device.
- In the situation shown in Figure 5, the user can manually specify the input data, and this manual specification can be operated through the interface provided by the I/O interface 512.
- In another situation, the terminal device 540 can automatically send input data to the I/O interface 512; if the user's authorization is required before the terminal device 540 automatically sends input data, the user can set the corresponding permission in the terminal device 540.
- the user can view the processing results output by the execution device 510 on the terminal device 540, and the specific presentation form may be display, sound, action, etc.
- the terminal device 540 can also be used as a data collection terminal to collect the input data input to the I/O interface 512 and the processing results output from the I/O interface 512 as new sample data, and store them in the database 530.
- Alternatively, instead of collecting the data through the terminal device 540, the I/O interface 512 may directly store the input data input to the I/O interface 512 and the processing results output from the I/O interface 512 in the database 530 as new sample data.
- Figure 5 is only a schematic diagram of a system architecture provided by an embodiment of the present application.
- The positional relationship between the devices, components, modules, etc. shown in Figure 5 does not constitute any limitation.
- For example, in Figure 5, the data storage system 550 is external memory relative to the execution device 510; in other cases, the data storage system 550 can also be placed inside the execution device 510.
- Step 610a: The training device 520 generates an initial virtual human speaking video based on the audio data set and character speaking features.
- the training device 520 outputs an initial virtual human speaking video that matches the voice in the audio data set according to the audio and character speaking features in the audio data set.
- Character speaking features can be obtained from character speaking videos.
- The training device 520 can input the audio and the character speaking video into a cross-modal speech-driven virtual human model; the cross-modal speech-driven virtual human model extracts the character speaking features from the character speaking video and outputs the initial virtual human speaking video according to the audio data set and the character speaking features.
- the video of the person talking can be obtained by the training device 520 from a network database, or it can be a video of a real person talking recorded by the training device 520 using a camera.
- the duration of the character talking video can be several minutes, such as 3 minutes, 5 minutes, etc.
- the training device 520 uses the cross-modal voice-driven virtual human model to preprocess the character speaking video to obtain the character speaking characteristics. Preprocessing can include cropping and feature extraction. For example, there are character images and background images in videos of characters talking, and character images include body part images and face part images.
- The preprocessing performed by the cross-modal voice-driven virtual human model includes: cropping the face part image out of the character speaking video, and extracting the character speaking features from the cropped face part image.
- In addition, before inputting data to the cross-modal voice-driven virtual human model, the training device 520 can also use a program with cropping and feature extraction functions to preprocess the character speaking video, and then input the audio data set and the preprocessed character speaking features into the cross-modal voice-driven virtual human model.
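- A minimal sketch of this preprocessing step is shown below, assuming an OpenCV Haar-cascade face detector and a 256×256 crop size; the detector choice, crop size, and function name are illustrative assumptions rather than part of the embodiment.

```python
# Illustrative sketch only: cropping the face region from each frame of a character
# speaking video before it is fed to a cross-modal voice-driven model.
import cv2

def crop_face_frames(video_path, crop_size=(256, 256)):
    detector = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    cap = cv2.VideoCapture(video_path)
    frames = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        faces = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
        if len(faces) == 0:
            continue                                   # skip frames with no detected face
        x, y, w, h = max(faces, key=lambda f: f[2] * f[3])  # keep the largest face
        frames.append(cv2.resize(frame[y:y + h, x:x + w], crop_size))
    cap.release()
    return frames
```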
- Audio data sets can include multiple language voices, multiple timbre voices, and multiple content voices.
- the content in a variety of content speech can refer to the words, short sentences, long sentences, and tone contained in the speech.
- For example, the audio data set includes the voice of a man saying words such as "Hello", "Goodbye", and "Sorry" in Chinese in a gentle tone, the voice of a woman saying short sentences such as "Please pay attention to network security" and "Sending illegal information is prohibited" in English in a stern tone, and the voice of a man saying long sentences such as "The concert is about to start, everyone please hurry up" in French in an urgent tone.
- Male and female speech sounds can differ in frequency.
- the duration of speech contained in the audio data set can be tens of hours, such as 30 hours, 35 hours, or 50 hours, etc.
- the training device 520 can extract the audio data in the video of the person talking, and add the audio data in the video of the person talking to the audio data set.
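- As a sketch of this optional step, the audio track of the character speaking video could be extracted with the ffmpeg command-line tool and appended to the audio data set, for example as below; the 16 kHz mono format, the file names, and the list-of-file-paths representation of the data set are assumptions.

```python
# Illustrative sketch only: extract the audio track of the character speaking video
# and note it as an additional entry of the audio data set.
import subprocess

def add_video_audio_to_dataset(video_path, out_wav, dataset_index):
    # -vn drops the video stream; -ac 1 / -ar 16000 force mono 16 kHz output.
    subprocess.run(
        ["ffmpeg", "-y", "-i", video_path, "-vn", "-ac", "1", "-ar", "16000", out_wav],
        check=True)
    dataset_index.append(out_wav)   # the audio data set is modeled here as a list of file paths
    return dataset_index
```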
- the cross-modal speech-driven virtual human model is a lip synthesis model or a GAN model.
- the cross-modal voice-driven virtual human model is a pre-trained model.
- The training device 520 uses the audio data set and a short-duration character speaking video to generate a large number of initial virtual human speaking videos. The collection of the audio data set does not require attention to character expressions, movements, environment brightness, and so on, which reduces the difficulty of acquiring training materials and ensures that the training materials are sufficient.
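- A minimal sketch of this bulk-generation step is shown below. The `model` object and its `generate()` interface are placeholders for whatever pretrained lip-synthesis or GAN model is actually used; only the loop over the audio data set and the low output resolution follow the description above.

```python
# Illustrative sketch only: drive one short character speaking video with every clip
# in the audio data set to mass-produce low-definition initial virtual human
# speaking videos (the source of the pseudo labels).
from pathlib import Path

def build_initial_videos(audio_dir, character_video, out_dir, model):
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    outputs = []
    for wav in sorted(Path(audio_dir).glob("*.wav")):
        out_path = out_dir / f"init_{wav.stem}.mp4"
        # The pretrained cross-modal model animates the character in `character_video`
        # so that its mouth follows the speech in `wav` (interface assumed).
        model.generate(audio=str(wav), face_video=character_video,
                       out_path=str(out_path), resolution=(128, 128))
        outputs.append(out_path)
    return outputs
```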
- Step 620a: The training device 520 determines lip synchronization training parameters based on the initial virtual human speaking video.
- The training device 520 maps the character speaking movements in the initial virtual human speaking video onto a three-dimensional face model, that is, it makes the facial expressions and movements of the three-dimensional face model consistent with those of the character speaking in the initial virtual human speaking video, and then extracts the lip synchronization training parameters from the three-dimensional face model.
- Lip synchronization training parameters are parameters used to represent the characteristics of the three-dimensional face model's speaking action. Lip synchronization training parameters may include eye features and lip features. Eye features may include parameters representing eye movements such as eye opening, eye closing, eye opening size, and gaze direction. The lip synchronization training parameters are used as labels for training the lip synchronization parameter generation model. Since the initial virtual human speaking video is generated based on the audio data set and not extracted from the real person speaking video, the set of lip synchronization training parameters can also be called a pseudo label library.
- the lip feature parameters may include parameters used to represent mouth movements such as mouth opening, mouth closing, and mouth opening size.
- For example, the character speaking features corresponding to the character uttering the sound "ah" are parameters indicating that the mouth is open and the eyes are open, and the character speaking features corresponding to the character uttering the sound "um" are parameters indicating that the mouth is closed and the eyes are closed.
- the lip synchronization training parameters may also include head feature parameters, eyebrow feature parameters, etc.
- the head characteristic parameters may include parameters representing the head rotation angle and rotation speed.
- the eyebrow feature parameters may include parameters representing eyebrow movement distance.
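- For illustration, the lip synchronization (training) parameters could be organized as a simple record such as the one below; the required eye and lip features and the optional head and eyebrow features follow the description above, while the field names and concrete types are assumptions.

```python
# Illustrative sketch only: one possible container for lip synchronization parameters.
from dataclasses import dataclass, field
from typing import List

@dataclass
class LipSyncParams:
    eye: List[float]                                     # eye open/close amount, gaze direction, ...
    lip: List[float]                                     # mouth open/close amount, opening size, ...
    head: List[float] = field(default_factory=list)      # optional: head rotation angle and speed
    eyebrow: List[float] = field(default_factory=list)   # optional: eyebrow movement distance
```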
- The three-dimensional face model can be any pre-trained three-dimensional model that can represent a human face, such as a morphable face model (3D Morphable Face Model, 3DMM).
- The morphable face model is a statistical three-dimensional face model.
- Using the morphable face model, the training device 520 can fit the three-dimensional face model to the initial virtual human speaking video and then extract the lip synchronization training parameters from the fitted three-dimensional face model.
- In other possible examples, the training device 520 can use any model with a fitting function to map the character speaking movements in the initial virtual human speaking video onto the three-dimensional face model, for example a Bayesian forecasting model or a residual network model.
- If the training device 520 extracted the lip synchronization training parameters directly from the initial virtual human speaking video, the parameters would have low accuracy, because the expression details and the overall naturalness of the virtual human's movements in the initial virtual human speaking video are poor.
- the training device 520 maps the character's speaking movements in the initial virtual human speaking video to the three-dimensional face model, and then extracts the lip synchronization training parameters from the three-dimensional face model, ensuring the accuracy of the lip synchronization training parameters.
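- A minimal sketch of this per-frame fitting and extraction step is shown below, assuming a generic `fit_3dmm` fitting routine whose output exposes eye and mouth coefficients; the routine and its attribute names are hypothetical stand-ins for the 3DMM, Bayesian forecasting, or residual-network fitting models mentioned above.

```python
# Illustrative sketch only: fit a 3D face model to every frame of an initial virtual
# human speaking video and read back lip synchronization training parameters
# (pseudo labels), instead of extracting them from the low-definition frames directly.
def extract_training_params(frames, fit_3dmm):
    pseudo_labels = []
    for frame in frames:
        face3d = fit_3dmm(frame)                 # fit the 3D face model to this frame (assumed API)
        pseudo_labels.append({
            "eye": face3d.eye_coefficients,      # e.g. blink / gaze coefficients
            "lip": face3d.mouth_coefficients,    # e.g. jaw-open / lip-shape coefficients
        })
    return pseudo_labels                         # one pseudo label per video frame
```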
- Step 630a: The training device 520 trains the lip synchronization parameter generation model according to the audio data set and the lip synchronization training parameters.
- the training device 520 uses the audio data set as input data, the lip synchronization parameters as output data, the lip synchronization training parameters as supervision labels, and trains the lip synchronization parameter generation model.
- the lip synchronization parameters are used to drive the virtual person to speak to obtain the target virtual person speaking video.
- the parameter types included in the lip synchronization parameters are the same as those of the lip synchronization training parameters, and will not be described again here.
- the network structure of the lip synchronization parameter generation model can be U-Net, convolutional neural network, long short-term memory network, etc.
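- The following sketch shows one possible supervised training setup for step 630a, using an LSTM regressor in PyTorch with mean-squared-error loss; the network size, optimizer, and feature dimensions are assumptions, and a U-Net or convolutional backbone could be substituted as noted above.

```python
# Illustrative sketch only: train an audio-to-lip-sync-parameter model with the
# audio data set as input and the pseudo labels from the 3D face model as targets.
import torch
import torch.nn as nn

class LipSyncParamNet(nn.Module):
    def __init__(self, audio_feat_dim=80, param_dim=32, hidden=256):
        super().__init__()
        self.lstm = nn.LSTM(audio_feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, param_dim)

    def forward(self, audio_feats):              # (batch, frames, audio_feat_dim)
        h, _ = self.lstm(audio_feats)
        return self.head(h)                      # (batch, frames, param_dim)

def train(model, loader, epochs=10, lr=1e-4, device="cpu"):
    # `loader` is assumed to yield (audio_feats, pseudo_labels) tensor pairs.
    model.to(device)
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        for audio_feats, pseudo_labels in loader:
            audio_feats = audio_feats.to(device)
            pseudo_labels = pseudo_labels.to(device)
            opt.zero_grad()
            loss = loss_fn(model(audio_feats), pseudo_labels)
            loss.backward()
            opt.step()
    return model
```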
- The training device 520 generates the initial virtual human speaking video based on the audio data set and determines the lip synchronization training parameters based on the initial virtual human speaking video.
- Although the definition of the initial virtual human speaking video (for example, a resolution of 128×128) is lower than the required definition of the target virtual person speaking video (for example, a resolution of 1280×720), the training device 520 can quickly generate a large number of initial virtual person speaking videos based on the audio data set, and thus obtain a large number of lip synchronization training parameters from those videos as supervision labels for model training. This improves the generalization ability of the lip synchronization parameter generation model, so that the expressions and movements of the target virtual human driven by the lip synchronization parameters output by the model are more closely synchronized with the speech in the input audio. Therefore, a user who uses the training device 520 to obtain the lip synchronization parameter generation model only needs to record a small amount of character speaking video as input data to complete the training of the lip synchronization parameter generation model.
- the training device 520 can configure the lip synchronization parameter generation model to an end-side device, such as the execution device 510.
- the execution device 510 uses the lip synchronization parameter generation model to drive the virtual human to speak.
- Training device 520 and execution device 510 may be the same or different computing devices.
- In addition to the model training method shown in Figure 6a, this embodiment also provides another model training method for driving a virtual person to speak.
- As shown in Figure 6b, the steps of this model training method for driving a virtual person to speak can be as follows:
- Step 610b: The computing device 520 generates an initial virtual human speaking video based on the audio data set and the character speaking video.
- the computing device 520 obtains an initial virtual human speaking video based on the voice-driven character speaking video in the audio data set and the character speaking video.
- the computing device 520 inputs the audio data set and the character speaking video into the pre-trained model, and the pre-trained model drives the character speaking in the character speaking video based on the voice in the audio data set, and outputs the initial virtual human speaking video.
- The pre-trained model may be a cross-modal voice-driven virtual human model, and the duration of the character speaking video is shorter than the duration of the speech in the audio data set.
- For the specific steps of generating the initial virtual human speaking video, please refer to step 610a in Figure 6a, which will not be described again here.
- Step 620b: The computing device 520 generates the lip synchronization parameter generation model using the initial virtual human speaking video.
- The computing device 520 uses the initial virtual human speaking video and a three-dimensional face reconstruction model to generate the lip synchronization parameter generation model. For example, the computing device 520 uses the three-dimensional face reconstruction model to extract the lip synchronization training parameters from the initial virtual human speaking video, then uses the lip synchronization training parameters as labels and the audio data set as model input data, and obtains the lip synchronization parameter generation model through training.
- For the specific steps in which the computing device 520 uses the three-dimensional face reconstruction model to extract the lip synchronization training parameters from the initial virtual human speaking video, please refer to step 620a in Figure 6a; for the specific steps of training the lip synchronization parameter generation model with the lip synchronization training parameters as labels and the audio data set as model input data, please refer to step 630a in Figure 6a. They will not be described again here.
- Step 710: The execution device 510 obtains the input audio and a character speaking video of the first definition.
- the execution device 510 can read input audio from a network database or a local database, and can also use an audio recording device to collect audio of people speaking as input audio.
- the target virtual person is a virtual person driven by the input audio to speak in the target virtual person speaking video to be generated.
- the target virtual person may be generated by the lip-synchronization parameter generation model based on the first-definition video of the person speaking.
- Step 720: The execution device 510 uses the lip synchronization parameter generation model to generate a target virtual person speaking video based on the input audio.
- the execution device 510 inputs the input audio into the lip synchronization parameter generation model, outputs the lip synchronization parameter, and drives the target virtual person to speak according to the lip synchronization parameter.
- The training set of the lip synchronization parameter generation model is obtained from a video containing the character speaking video of the first definition and a three-dimensional face reconstruction model, and the first definition is lower than the definition of the target virtual person speaking video.
- For the specific steps, please refer to step 610a to step 630a in Figure 6a; the character speaking video of the first definition may be the initial virtual human speaking video generated in step 610a in Figure 6a, which will not be described again here.
- the lip synchronization parameters output by the lip synchronization parameter generation model are used to drive the target virtual person to speak, that is, to drive the target virtual person to make facial expressions that match the input audio.
- the lip synchronization parameters contain the same parameter types as the lip synchronization training parameters, so they will not be described again here.
- the step of generating a target virtual human speaking video may include: the execution device 510 inputs the lip synchronization parameters into the virtual human generation model, and the virtual human generation model drives the target virtual human to speak according to the lip synchronization parameters to obtain the target virtual human speaking video.
- the target virtual person speaking video output by the execution device 510 may include input audio.
- the virtual human generation model may be a neural network model based on U-Net.
- the character image of the virtual person is the same as the character image in the initial virtual person speaking video described in step 610a in FIG. 6a.
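- The inference path of step 720 could be sketched as below, assuming a separate audio-feature extractor, a virtual human generation model that renders a silent video file from the lip synchronization parameters, and an ffmpeg step that muxes the rendered video with the input audio; only the chaining of the two models follows the described method.

```python
# Illustrative sketch only: input audio -> lip synchronization parameters ->
# target virtual person speaking video with the input audio attached.
import subprocess
import torch

@torch.no_grad()
def drive_target_virtual_human(input_wav, audio_to_feats, lipsync_model,
                               virtual_human_generator, out_path="target.mp4"):
    feats = audio_to_feats(input_wav)              # assumed: (frames, feat_dim) tensor, e.g. mel frames
    params = lipsync_model(feats.unsqueeze(0))[0]  # lip synchronization parameters per frame
    silent_video = virtual_human_generator(params) # assumed to render frames and return a video path
    # Mux the rendered frames with the input audio so the played video is lip-synchronized.
    subprocess.run(["ffmpeg", "-y", "-i", silent_video, "-i", input_wav,
                    "-c:v", "copy", "-c:a", "aac", out_path], check=True)
    return out_path
```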
- Since the lip synchronization parameter generation model is trained with a large amount of audio data and lip synchronization training parameters, the lip synchronization parameters output by the lip synchronization parameter generation model are highly accurate and the generalization ability of the lip synchronization parameter generation model is ensured, which improves the lip synchronization between the input audio and the speaking movements of the target virtual person driven by the lip synchronization parameters.
- the execution device 510 may also send the target virtual person speaking video to the end-side device, so that the end-side device plays the target virtual person speaking video to the user.
- In a scenario where the end-side device plays the target virtual person speaking video to the user, for example, a display in a service hall plays the target virtual person speaking video.
- The target virtual person can be a virtual lobby manager, and the target virtual person speaking video can be a video of the virtual lobby manager speaking.
- the monitor plays the video of the virtual lobby manager speaking while playing the audio of the virtual lobby manager speaking.
- the audio played by the monitor is lip-synchronized with the facial expressions and movements of the virtual lobby manager in the video.
- As an optional implementation, when the execution device 510 drives the target virtual person to speak by executing the steps of the method for driving a virtual person to speak, the execution device 510 may also use the input audio as an updated audio data set, replace the audio data set in step 610a with the updated audio data set, perform step 610a to step 620a again to obtain fine-tuned lip synchronization training parameters, and update the lip synchronization parameter generation model according to the fine-tuned lip synchronization training parameters.
- the execution device 510 may use audio data in the input audio that is different from the audio data set as audio data in the updated audio data set.
- the execution device 510 may determine the audio difference value between each piece of audio data in the input audio and each piece of audio data in the audio data set, and if the audio difference value is greater than the threshold, add the piece of audio data in the input audio to the updated audio data set.
- the execution device 510 may use a Dynamic Time Warping (DTW) algorithm and a Mel Frequency Cepstrum Coefficient (MFCC) algorithm to obtain the audio difference value of the two pieces of audio data.
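- A sketch of this audio-difference check using the librosa library is given below; the 16 kHz sampling rate, 13 MFCC coefficients, the normalization of the DTW cost by path length, and the rule that a clip must differ from every existing clip are assumptions.

```python
# Illustrative sketch only: measure how different an input audio clip is from the
# existing audio data set with MFCC features and DTW alignment cost, and keep only
# sufficiently different clips for the updated audio data set.
import librosa

def audio_difference(wav_a, wav_b, sr=16000, n_mfcc=13):
    ya, _ = librosa.load(wav_a, sr=sr)
    yb, _ = librosa.load(wav_b, sr=sr)
    ma = librosa.feature.mfcc(y=ya, sr=sr, n_mfcc=n_mfcc)
    mb = librosa.feature.mfcc(y=yb, sr=sr, n_mfcc=n_mfcc)
    D, wp = librosa.sequence.dtw(X=ma, Y=mb, metric="euclidean")
    return D[-1, -1] / len(wp)            # average alignment cost along the warping path

def select_new_audio(input_clips, dataset_clips, threshold):
    updated = []
    for clip in input_clips:
        # Assumption: a clip is added only if it differs from every existing clip.
        if all(audio_difference(clip, ref) > threshold for ref in dataset_clips):
            updated.append(clip)
    return updated
```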
- the execution device 510 uses the fine-tuned lip synchronization training parameters as pseudo labels to fine-tune the lip synchronization parameter generation model to complete the update of the lip synchronization parameter generation model.
- the execution device 510 updates the lip synchronization parameter generation model by fine-tuning the weight of the last layer or the last multiple layers in the hierarchical structure of the lip synchronization parameter generation model.
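- A minimal sketch of this last-layer fine-tuning is shown below, reusing the `LipSyncParamNet` sketch from step 630a; freezing everything except the output layer and the chosen learning rate are assumptions, and more of the final layers could be unfrozen instead.

```python
# Illustrative sketch only: fine-tune only the last layer of the lip synchronization
# parameter generation model with the fine-tuned pseudo labels.
import torch
import torch.nn as nn

def finetune_last_layer(model, finetune_loader, lr=1e-5, epochs=1, device="cpu"):
    model.to(device)
    for p in model.parameters():
        p.requires_grad = False
    for p in model.head.parameters():      # only the last (output) layer stays trainable
        p.requires_grad = True
    opt = torch.optim.Adam(model.head.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        for audio_feats, pseudo_labels in finetune_loader:
            audio_feats, pseudo_labels = audio_feats.to(device), pseudo_labels.to(device)
            opt.zero_grad()
            loss = loss_fn(model(audio_feats), pseudo_labels)
            loss.backward()
            opt.step()
    return model
```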
- The execution device 510 fine-tunes the lip synchronization parameter generation model in the process of using it, thereby improving the accuracy with which the lip synchronization parameter generation model generates lip synchronization parameters from the input audio and improving the generalization ability of the lip synchronization parameter generation model.
- the execution device 510 and the training device 520 may be the same or different terminals, and the terminals include corresponding hardware structures and/or software modules for executing each function.
- the model training method for driving a virtual human to speak according to this embodiment is described in detail above with reference to FIG. 6 .
- the model training device for driving a virtual human to speak according to this embodiment will be described below with reference to FIG. 8 .
- FIG. 8 is a schematic diagram of a possible model training device for driving a virtual human to speak provided in this embodiment.
- the model training device that drives the virtual human to speak can be used to implement the functions of the execution device in the above method embodiments, and therefore can also achieve the beneficial effects of the above method embodiments.
- the model training device that drives the virtual human to speak may be the training device 520 as shown in FIG. 5 , or may be a module (such as a chip) applied to the server.
- the model training device 800 for driving a virtual human to speak includes a video generation module 810, a parameter generation module 820 and a training module 830.
- the model training device 800 for driving a virtual human to speak is used to implement the functions of the computing device in the above method embodiment shown in Figure 6a and Figure 6b.
- the video generation module 810 is used to generate an initial virtual human speech video based on the audio data set and the character speech video;
- the parameter generation module 820 is used to determine lip synchronization training parameters based on the initial virtual human speaking video, and the lip synchronization training parameters are used as labels for training the lip synchronization parameter generation model;
- the training module 830 is used to train a lip synchronization parameter generation model based on the audio data set and lip synchronization training parameters.
- the lip synchronization parameter generation model is used to generate lip synchronization parameters based on the input audio.
- the lip synchronization parameters are used to drive the virtual human to speak. , to get a video of a virtual person talking.
- the audio data set includes multiple language voices, multiple timbre voices, and multiple content voices.
- the definition of the initial virtual human speaking video is lower than the definition of the virtual human speaking video.
- the lip synchronization parameters include eye feature parameters and lip feature parameters.
- the model training device 800 for driving the virtual human to speak also includes a preprocessing module.
- the preprocessing module is used to preprocess the video of people talking to obtain the characteristics of the people talking.
- the preprocessing includes cropping and feature extraction.
- the characteristics of the people talking include eye features and lip features.
- the audio dataset contains audio from videos of people speaking.
- It should be noted that, if another module division manner is adopted in some embodiments, the functions of the parameter generation module 820 and the training module 830 can both be implemented by the training module 830.
- Figure 9 is a schematic diagram of a possible device for driving a virtual human to speak provided in this embodiment.
- the driving virtual human speaking device can be used to realize the functions of the execution device in the above method embodiments, and therefore can also achieve the beneficial effects of the above method embodiments.
- the device for driving the virtual human to speak may be the execution device 510 as shown in FIG. 5 , or may be a module (such as a chip) applied to the server.
- the driving virtual human speaking device 900 includes an input module 910 , a model processing module 920 and a driving module 930 .
- the driving virtual human speaking device 900 is used to implement the functions of the computing device in the method embodiment shown in FIG. 7 .
- The input module 910 is used to obtain input audio;
- the model processing module 920 is used to input the input audio into the lip synchronization parameter generation model and output the lip synchronization parameters.
- the lip synchronization parameter generation model is trained based on the audio data set and the lip synchronization training parameters;
- the driving module 930 is used to drive the virtual person to speak according to the lip synchronization parameters to obtain a video of the virtual person speaking.
- the device 900 for driving virtual human speech also includes a training module, which is used to update the lip synchronization parameter generation model according to the input audio.
- If another module division manner is adopted in some embodiments, the functions of the model processing module 920 and the driving module 930 can be implemented by the model processing module 920.
- The model training device 800 for driving a virtual human to speak and the device 900 for driving a virtual human to speak in the embodiments of the present application can be implemented by a GPU, an NPU, an ASIC, or a programmable logic device (PLD).
- The PLD can be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), generic array logic (GAL), or any combination thereof.
- When the methods described above are implemented by software, the model training device 800 for driving the virtual human to speak, the device 900 for driving the virtual human to speak, and their respective modules can also be software modules.
- The model training device 800 for driving a virtual human to speak and the device 900 for driving a virtual human to speak according to the embodiments of the present application may correspond to performing the methods described in the embodiments of the present application.
- The above and other operations and/or functions of the units in the model training device 800 and the device 900 are respectively intended to implement the corresponding processes of the methods in Figure 6 or Figure 7. For the sake of brevity, they will not be described again here.
- FIG. 10 is a schematic structural diagram of a computing device provided by an embodiment of the present application.
- Computing device 1000 includes memory 1001, processor 1002, communication interface 1003, and bus 1004. Among them, the memory 1001, the processor 1002, and the communication interface 1003 implement communication connections between each other through the bus 1004.
- Memory 1001 may be a read-only memory, a static storage device, a dynamic storage device, or a random access memory.
- The memory 1001 can store computer instructions. When the computer instructions stored in the memory 1001 are executed by the processor 1002, the processor 1002 and the communication interface 1003 are used to execute the steps of the model training method for driving a virtual human to speak and of the method for driving a virtual human to speak of the software system.
- The memory can also store a data set. For example, a part of the storage resources in the memory 1001 is divided into an area for storing a program that implements the functions of the lip synchronization parameter generation model in the embodiment of the present application.
- the processor 1002 can be a general CPU, an application specific integrated circuit (ASIC), a GPU or any combination thereof.
- Processor 1002 may include one or more chips.
- Processor 1002 may include an AI accelerator, such as an NPU.
- the communication interface 1003 uses a transceiver module such as but not limited to a transceiver to implement communication between the computing device 1000 and other devices or communication networks.
- the iterative training request can be obtained through the communication interface 1003, and the iteratively trained neural network can be fed back.
- Bus 1004 may include a path that carries information between various components of computing device 1000 (eg, memory 1001, processor 1002, communications interface 1003).
- the computing device 1000 may be a computer (for example, a server) in a cloud data center, a computer in an edge data center, or a terminal.
- training device 520 may be deployed on each computing device 1000.
- a GPU is used to implement the functions of the training device 520.
- If the functions of the training device 520 and the functions of the execution device 510 are deployed in the same computing device 1000, the training device 520 can communicate with the execution device 510 through the bus 1004.
- If the functions of the training device 520 and the functions of the execution device 510 are deployed in different computing devices 1000, the training device 520 can communicate with the execution device 510 through a communication network.
- the method steps in this embodiment can be implemented by hardware or by a processor executing software instructions.
- Software instructions can be composed of corresponding software modules, and the software modules can be stored in random access memory (RAM), flash memory, read-only memory (ROM), programmable ROM (PROM), erasable PROM (EPROM), electrically erasable PROM (EEPROM), a register, a hard disk, a removable hard disk, a CD-ROM, or any other form of storage medium well known in the art.
- An exemplary storage medium is coupled to the processor such that the processor can read information from the storage medium and write information to the storage medium.
- the storage medium can also be an integral part of the processor.
- the processor and storage media may be located in an ASIC.
- the ASIC can be located in the terminal device.
- the processor and the storage medium can also exist as discrete components in network equipment or terminal equipment.
- the computer program product includes one or more computer programs or instructions.
- the computer may be a general purpose computer, a special purpose computer, a computer network, a network device, a user equipment, or other programmable device.
- the computer program or instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another.
- For example, the computer program or instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired or wireless means.
- the computer-readable storage medium may be any available medium that can be accessed by a computer or a data storage device such as a server or data center that integrates one or more available media.
- The available media may be magnetic media, such as floppy disks, hard disks, and magnetic tapes; they may also be optical media, such as digital video discs (DVDs); they may also be semiconductor media, such as solid state drives (SSDs).
Abstract
一种驱动虚拟人说话和模型训练方法、装置、计算设备及系统,包括:计算设备根据音频数据集和人物说话视频生成初始虚拟人说话视频,计算设备根据音频数据集与作为模型训练标签的音唇同步训练参数训练音唇同步参数生成模型,音唇同步参数生成模型用于根据输入音频生成音唇同步参数,音唇同步参数用于驱动虚拟人说话。由于计算设备利用初始虚拟人说话视频对音唇同步训练参数的数据量进行扩充,降低了训练数据的采集难度,保证音唇同步参数生成模型的训练数据充足,从而有效地提高了音唇同步参数生成模型的精度和泛化性能,提高了音唇同步参数驱动的虚拟人的音唇同步性。
Description
本申请要求于2022年3月29日提交国家知识产权局、申请号为202210326144.0、发明名称为“驱动虚拟人说话和模型训练方法、装置、计算设备及系统”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
本申请涉及人工智能领域,尤其涉及一种驱动虚拟人说话和模型训练方法、装置、计算设备及系统。
虚拟人(Virtual Human)指基于虚拟人技术模拟真实人物的动作、表情、发音等合成的三维模型。目前,将采集的人物说话视频作为训练集,基于训练集训练模型,使模型利用实时语音得到驱动虚拟人说话的参数,以获得虚拟人说话视频。但是,采集人物说话视频需要投入大量的人力物力,因此采集的人物说话视频的数据量有限。由于作为训练集的人物说话视频的数据量有限,导致训练得到的模型精度较低,进而导致模型生成的驱动虚拟人说话的参数不准确。
发明内容
本申请实施例提供一种驱动虚拟人说话和模型训练方法、装置、计算设备及系统,能够解决训练集数据量有限导致的模型精度较低,生成的驱动虚拟人说话的参数不准确的问题,从而提高驱动虚拟人说话的参数的准确度。
第一方面,提供一种驱动虚拟人说话和模型训练方法,该方法可以由计算设备执行,例如端侧的终端或者云侧的训练设备,具体包括如下步骤:计算设备基于音频数据集、人物说话视频生成初始虚拟人说话视频,再根据初始虚拟人说话视频生成音唇同步参数生成模型,从而能够使用音唇同步参数生成模型生成的音唇同步参数驱动目标虚拟人说话,来生成目标虚拟人说话视频。
如此,计算设备利用初始虚拟人说话视频对音唇同步参数生成模型的训练数据量进行扩充,相对于从录制的真人说话视频中提取训练数据,要录制的真人说话视频的时长较短,减少了录制真人说话视频对人力、物力等资源的消耗。因此计算设备基于较短的真人说话视频可以获得较大训练数据量,保证了音唇同步参数生成模型的训练数据充足,使计算设备训练得到的模型精度高,进而提高了训练获得的音唇同步参数生成模型的泛化性能,使音唇同步参数生成模型输出的音唇同步参数驱动的目标虚拟人具有更好的音唇同步性。
作为一种可能的实现方式,计算设备利用所述初始虚拟人说话视频和三维人脸重建模型生成音唇同步参数生成模型。例如,计算设备根据初始虚拟人说话视频和三维人脸重建模型确定音唇同步训练参数,音唇同步训练参数用于作为训练音唇同步参数生成模型的标签,然后根据音频数据集和音唇同步训练参数训练音唇同步参数生成模型。其中,音唇同步训练参数作为模型训练的标签,音频数据集作为模型输入数据。音唇同步参数生成模型用于根据输入音频生成音唇同步参数。音唇同步参数用于驱动目标虚拟人说话,以得到目标虚拟人说话视频。
计算设备根据初始虚拟人说话视频和三维人脸重建模型确定音唇同步训练参数,可以是计算设备将初始虚拟人说话视频中的人物说话动作映射至三维人脸模型中,从三维人脸模型中提取音唇同步训练参数。从而在初始虚拟人说话视频的清晰度低于目标虚拟人说话视频的清晰度的情况下,避免从低清晰度的视频中直接提取音唇同步训练参数,保证了音唇同步训练参数的精确度,进而提高了音唇同步参数生成模型的精度。
作为一种可能的实施方式,计算设备基于音频数据集、人物说话视频生成初始虚拟人说话视频,可以包括:将音频数据集和人物说话视频输入预训练模型,得到根据音频数据集中的语音驱动人物说话视频中的人物说话的初始虚拟人说话视频。其中,人物说话视频的时长小于音频数据集中语音的时长。计算设备利用预训练模型根据音频数据集中的多种语种语音、多种音色语音和多种内容语音,能够迅速、简单地生成大数据量的初始虚拟人说话视频,从而扩大了训练数据的数据量。例如,预训练模型可以是跨模态语音驱动人脸动作的网络模型。此外,上述初始虚拟人说话视频的清晰度低于需生成的虚拟人说话视频的清晰度,从而减少了计算设备处理视频的计算资源开销。
作为一种可能的实施方式,预训练模型用于从人物说话视频中提取人物说话特征,并根据音频数据集和所述人物说话特征输出初始虚拟人说话视频。人物说话特征是人物说话图像中人物说话时的脸部特征。
可选地,计算设备可以利用预训练模型对人物说话视频进行预处理来获得人物说话特征,例如计算设备对人物说话视频中的人物的脸部区域进行裁剪,得到脸部视频,再对脸部视频进行特征提取,得到人物说话特征。计算设备利用预处理步骤保留了人物说话视频中人物说话动作的脸部动作,保证了人物说话特征的精确度,由此提高了训练获得的音唇同步参数生成模型的准确率。
作为一种可能的实施方式,音频数据集包括多种语种语音、多种音色语音和多种内容语音。音频数据集的录制无需注意被录制的真人的坐姿、表情、动作以及光照条件等,对数据录制的要求低于视频。音频数据集的采集相对于真人说话视频更加迅速和简单,从而能够快速采集大量的音频数据集,降低了训练数据的采集难度。音频数据集包括不同语种、音色和内容的语音数据,保证了训练数据的多样性,从而提高了音唇同步参数生成模型的泛化能力。
例如,音频数据集中的语音数据可以是对人物说话进行录制得到,也可以是从网络或计算设备本地的数据库中获取。
又如,音频数据集中除从数据库中获得或录制获得的语音数据之外,音频数据集中还可以包括人物说话视频中的语音数据。由此保证训练数据的多样性,提高音唇同步参数生成模型的泛化能力。另外,音频数据集还可以包含人物说话视频中的音频,以进一步扩张音唇同步参数生成模型的训练集。
音唇同步训练参数以及音唇同步参数均为表示三维人脸模型的表情动作的同一种参数。
例如,音唇同步训练参数可以包括眼部特征参数和唇部特征参数。眼部特征参数可以包括表示眼睛睁开、眼睛闭合、眼睛睁开大小和目视方向等眼部动作的参数。唇 部特征参数可以包括表示嘴部张开、嘴部闭合和嘴部张开大小等嘴部动作的参数。
又如,音唇同步训练参数还可以包括头部特征参数、眉部特征参数等。
音唇同步训练参数包括人脸多个部位的特征参数,有利于提高虚拟人说话动作的多样性,增强了音唇同步参数生成模型的泛化性能。
第二方面,提供一种驱动虚拟人说话方法,该方法可以由计算设备执行,例如端侧的终端,具体包括如下步骤:计算设备获取输入音频和第一清晰度的人物说话视频,将输入音频输入音唇同步参数生成模型,得到目标虚拟人说话视频。音唇同步参数生成模型的训练集和目标虚拟人是基于包含有第一清晰度的人物说话视频的视频得到的,第一清晰度低于目标虚拟人说话视频的清晰度。可选地,第一清晰度的人物说话视频是初始虚拟人说话视频。相较于采用录制的真人视频来训练获得的模型,音唇同步参数生成模型的训练数据的数据量更大,则训练获得的音唇同步参数生成模型的精度和泛化性能更好。因此,音唇同步参数生成模型输出的音唇同步参数的精度高,计算设备根据音唇同步参数驱动的虚拟人的音唇同步性高。
例如,目标虚拟人说话视频的生成步骤可以是计算设备得到音唇同步参数生成模型输出的音唇同步参数,根据所述音唇同步参数驱动目标虚拟人说话,以得到目标虚拟人说话视频。
作为一种可能的实施方式,在音唇同步参数生成模型根据输入音频输出音唇同步参数的过程中,计算设备还可以对音唇同步参数生成模型进行更新。例如,计算设备基于输入音频和目标虚拟人的说话视频生成初始虚拟人说话视频;其中,初始虚拟人说话视频的时长大于目标虚拟人的说话视频的时长,然后计算设备利用初始虚拟人说话视频更新音唇同步参数生成模型。从而提高了音唇同步参数生成模型的泛化能力和精度。
第三方面,提供了一种驱动虚拟人说话的模型训练装置,包括:视频生成模块和训练模块。视频生成模块用于基于音频数据集、人物说话视频生成初始虚拟人说话视频。其中,初始虚拟人说话视频的时长大于人物说话视频的时长。训练模块用于利用初始虚拟人说话视频生成音唇同步参数生成模型。音唇同步参数生成模型用于得到目标虚拟人说话视频。初始虚拟人说话视频的清晰度低于目标虚拟人说话视频的清晰度。
作为一种可能的实施方式,训练模块具体用于:利用初始虚拟人说话视频和三维人脸重建模型生成音唇同步参数生成模型。
作为一种可能的实施方式,训练模块具体用于:利用三维人脸重建模型从初始虚拟人说话视频中提取音唇同步训练参数;将音唇同步训练参数作为标签,音频数据集作为模型输入数据,训练得到音唇同步参数生成模型。
作为一种可能的实施方式,视频生成模块具体用于:将音频数据集和人物说话视频输入预训练模型,得到根据音频数据集中的语音驱动人物说话视频中的人物说话的初始虚拟人说话视频,人物说话视频的时长小于所述音频数据集中语音的时长。
作为一种可能的实施方式,预训练模型用于从人物说话视频中提取人物说话特征,并根据音频数据集和人物说话特征输出初始虚拟人说话视频。
作为一种可能的实施方式,人物说话视频的时长小于或等于5分钟,初始虚拟人说话视频的时长大于或等于十小时。
作为一种可能的实施方式,音频数据集包括多种语种语音、多种音色语音和多种内容语音。
作为一种可能的实施方式,音唇同步参数包括眼部特征参数和唇部特征参数。
作为一种可能的实施方式,音频数据集包含人物说话视频中的音频。
第四方面,提供了一种驱动虚拟人说话装置,包括:输入模块和模型处理模块。输入模块用于获取输入音频和目标虚拟人。模型处理模块用于基于输入音频,利用音唇同步参数生成模型生成目标虚拟人说话视频;音唇同步参数生成模型的训练集是基于包含有第一清晰度的人物说话视频的视频和三维人脸重建模型得到的,第一清晰度低于目标虚拟人说话视频的清晰度。
作为一种可能的实施方式,驱动虚拟人说话装置还包括:训练模块,用于根据输入音频更新所述音唇同步参数生成模型。
例如,训练模块具体用于:基于输入音频和目标虚拟人说话视频生成初始虚拟人说话视频;其中,初始虚拟人说话视频的时长大于目标虚拟人说话视频的时长;利用初始虚拟人说话视频更新音唇同步参数生成模型。
需要说明的是,第三方面所述的驱动虚拟人说话的模型训练装置或第四方面所述的驱动虚拟人说话装置可以是终端设备或网络设备,也可以是可设置于终端设备或网络设备中的芯片(系统)或其他部件或组件,还可以是包含终端设备或网络设备的装置,本申请对此不做限定。
此外,第三方面所述的驱动虚拟人说话的模型训练装置的技术效果可以参考第一方面所述的驱动虚拟人说话的模型训练方法的技术效果,第四方面所述的驱动虚拟人说话装置的技术效果可以参考第二方面所述的驱动虚拟人说话方法的技术效果,此处不再赘述。
第五方面,提供了一种计算设备,包括存储器和处理器,所述存储器用于存储一组计算机指令,当所述处理器执行所述一组计算机指令时,用于执行第一方面中任一种可能设计中的驱动虚拟人说话的模型训练方法的操作步骤,或执行第二方面中任一种可能设计中的驱动虚拟人说话方法的操作步骤。
此外,第五方面所述的计算设备的技术效果可以参考第一方面所述的驱动虚拟人说话的模型训练方法的技术效果,或者参考第二方面所述的驱动虚拟人说话方法的技术效果,此处不再赘述。
第六方面,提供了一种驱动虚拟人说话系统,驱动虚拟人说话系统包括训练设备和至少一个终端,至少一个终端与训练设备连接,训练设备用于执行第一方面中任一种可能实现方式中的驱动虚拟人说话的模型训练方法的操作步骤,至少一个终端用于执行第二方面中任一种可能实现方式中的驱动虚拟人说话方法的操作步骤。
第七方面,提供一种计算机可读存储介质,包括:计算机软件指令;当计算机软件指令在数据处理系统中运行时,使得驱动虚拟人说话系统执行如第一方面或第二方面中任意一种可能的实现方式中所述方法的操作步骤。
第八方面,提供一种计算机程序产品,当计算机程序产品在计算机上运行时,使得数据处理系统执行如第一方面或第二方面中任意一种可能的实现方式中所述方法的操作步骤。
本申请在上述各方面提供的实现方式的基础上,还可以进行进一步组合以提供更多实现方式。
图1为本申请实施例提供的一种神经网络的结构示意图;
图2为本申请实施例提供的一种卷积神经网络的结构示意图;
图3为本申请实施例提供的一种虚拟人生成模型的结构示意图;
图4为本申请实施例提供的一种虚拟人构建周期的示意图;
图5为本申请实施例提供的一种驱动虚拟人说话系统的架构示意图;
图6a为本申请实施例提供的一种驱动虚拟人说话的模型训练方法的示意图;
图6b为本申请实施例提供的另一种驱动虚拟人说话的模型训练方法的示意图;
图7为本申请实施例提供的一种驱动虚拟人说话方法的示意图;
图8为本申请实施例提供的一种驱动虚拟人说话的模型训练装置的示意图;
图9为本申请实施例提供的一种驱动虚拟人说话装置的示意图;
图10为本申请实施例提供的一种计算设备的结构示意图。
为了便于理解,下面先对本申请实施例涉及的相关术语及神经网络等相关概念进行介绍。
(1)神经网络
神经网络可以是由神经元组成的,神经元可以是指以x
s和截距1为输入的运算单元。该运算单元的输出满足如下公式(1)。
其中,s=1、2、……n,n为大于1的自然数,W
s为x
s的权重,b为神经元的偏置。f为神经元的激活函数(activation functions),用于将非线性特性引入神经网络中,来将神经元中的输入信号转换为输出信号。该激活函数的输出信号可以作为下一层的输入,激活函数可以是sigmoid函数。神经网络是将多个上述单一的神经元联结在一起形成的网络,即一个神经元的输出可以是另一个神经元的输入。每个神经元的输入可以与前一层的局部接受域相连,来提取局部接受域的特征,局部接受域可以是由若干个神经元组成的区域。权重表征不同神经元之间连接的强度。权重决定着输入对输出的影响力。权重近于0意味着改变输入不改变输出。负权重意味着增加输入降低输出。
如图1所示,为本申请实施例提供的一种神经网络的结构示意图。神经网络100包括N个处理层,N为大于或等于3的整数。神经网络100的第一层为输入层110,负责接收输入信号,神经网络100的最后一层为输出层130,负责输出神经网络的处理结果。除去第一层和最后一层的其他层为中间层140,这些中间层140共同组成隐藏层120,隐藏层120中的每一层中间层140既可以接收输入信号,也可以输出信号。隐藏层120负责输入信号的处理过程。每一层代表了信号处理的一个逻辑级别,通过多个层,数据信号可经过多级逻辑的处理。
在一些可行的实施例中该神经网络的输入信号可以是视频信号、语音信号、文本信号、图像信号或温度信号等各种形式的信号。语音信号可以是麦克风(声音传感器) 录制的人说话、唱歌的人声音频信号等各类传感器信号。该神经网络的输入信号还包括其他各种计算机可处理的工程信号,在此不再一一列举。若利用神经网络对图像信号进行深度学习,可以提高神经网络处理图像的质量。
(2)卷积神经网络
卷积神经网络(Convolutional Neuron Network,CNN)是一种带有卷积结构的深度神经网络。卷积神经网络包含了一个由卷积层和子采样层构成的特征抽取器。该特征抽取器可以看作是滤波器,卷积过程可以看作是使用一个可训练的滤波器与一个输入的图像或者特征图(feature map)做卷积。卷积层是指卷积神经网络中对输入信号进行卷积处理的神经元层。在卷积神经网络的卷积层中,一个神经元可以只与部分邻层神经元连接。一个卷积层可以输出若干个特征图,特征图可以是指卷积神经网络运算过程中的中间结果。同一特征图的神经元共享权重,这里共享的权重就是卷积核。共享权重可以理解为提取图像信息的方式与位置无关。也就是,图像的某一部分的统计信息与其他部分是一样的。即意味着在某一部分学习的图像信息也能用在另一部分上。所以对于图像上的所有位置,都能使用同样的学习得到的图像信息。在同一卷积层中,可以使用多个卷积核来提取不同的图像信息,一般地,卷积核数量越多,卷积操作反映的图像信息越丰富。
卷积核可以以随机大小的矩阵的形式初始化,在卷积神经网络的训练过程中卷积核可以通过学习得到合理的权重。另外,共享权重带来的直接好处是减少卷积神经网络各层之间的连接,同时又降低了过拟合的风险。
示例地,如图2所示,为本申请实施例提供的一种卷积神经网络的结构示意图。卷积神经网络200可以包括输入层210、卷积层/池化层220(其中池化层为可选的)和神经网络层230。
卷积层/池化层220例如可以包括层221至层226。在一种示例中,层221例如可以为卷积层,层222例如可以为池化层,层223例如可以为卷积层,层224例如可以为池化层,层225例如可以为卷积层,层226例如可以为池化层。在另一种示例中,层221和层222例如可以为卷积层,层223例如可以为池化层,层224和层225例如可以为卷积层,层226例如可以为池化层。卷积层的输出可以作为随后的池化层的输入,也可以作为另一个卷积层的输入以继续进行卷积操作。
将以卷积层221为例,介绍一层卷积层的内部工作原理。
卷积层221可以包括很多个卷积算子,卷积算子也可称为核。卷积算子在图像处理中的作用相当于一个从输入图像矩阵中提取特定信息的过滤器。卷积算子本质上可以是一个权重矩阵,这个权重矩阵通常被预先定义。该权重矩阵的大小与图像的大小相关。需要注意的是,权重矩阵的纵深维度(depth dimension)和输入图像的纵深维度是相同的。在进行卷积运算的过程中,权重矩阵会延伸到输入图像的整个深度。因此,与一个单一的权重矩阵进行卷积会产生一个单一纵深维度的卷积化输出,但是大多数情况下不使用单一权重矩阵,而是应用多个尺寸(行×列)相同的权重矩阵,即多个同型矩阵。每个权重矩阵的输出被堆叠起来形成卷积图像的纵深维度。不同的权重矩阵可以用来提取图像中不同的特征,例如一个权重矩阵用来提取图像边缘信息,另一个权重矩阵用来提取图像的特定颜色,又一个权重矩阵用来对图像中不需要的噪点 进行模糊化等。该多个权重矩阵尺寸(行×列)相同,经过该多个尺寸相同的权重矩阵提取后的特征图的尺寸也相同,再将提取到的多个尺寸相同的特征图合并形成卷积运算的输出。
这些权重矩阵中的权重值在实际应用中需要经过大量的训练得到,通过训练得到的权重值形成的各个权重矩阵可以用来从输入图像中提取信息,从而使得卷积神经网络200进行正确的预测。
当卷积神经网络200有多个卷积层的时候,初始的卷积层(例如层221)往往提取较多的一般特征,该一般特征也可以称之为低级别的特征。随着卷积神经网络200深度的加深,越往后的卷积层(例如层226)提取到的特征越来越复杂,比如高级别的语义之类的特征,语义越高的特征越适用于待解决的问题。
由于常常需要减少训练参数的数量,因此卷积层之后常常需要周期性的引入池化层。在如图2中卷积层/池化层220所示例的层221至层226各层,可以是一层卷积层后面跟一层池化层,也可以是多层卷积层后面接一层或多层池化层。在图像或音频的处理过程中,池化层的唯一目的就是减少图像的空间大小。池化层可以包括平均池化算子和/或最大池化算子,以用于对输入图像进行采样得到较小尺寸的图像。平均池化算子可以在特定范围内对图像中的像素值进行计算产生平均值作为平均池化的结果。最大池化算子可以在特定范围内取该范围内值最大的像素作为最大池化的结果。另外,就像卷积层中用权重矩阵的大小应该与图像尺寸相关一样,池化层中的运算符也应该与图像的大小相关。通过池化层处理后输出的图像尺寸可以小于输入池化层的图像的尺寸,池化层输出的图像中每个像素点表示输入池化层的图像的对应子区域的平均值或最大值。
在经过卷积层/池化层220的处理后,卷积神经网络200还不足以输出所需要的输出信息。因为如前所述,卷积层/池化层220提取特征,并减少输入图像带来的参数。然而为了生成最终的输出信息(所需要的类信息或其他相关信息),卷积神经网络200需要利用神经网络层230来生成一个或者一组所需要的类的数量的输出。因此,在神经网络层230中可以包括多层隐藏层(如图2所示的层231、层232至层23n)以及输出层240,该多层隐藏层中所包含的参数可以根据具体的任务类型的相关训练数据进行预先训练得到,例如该任务类型可以包括图像识别,图像分类,图像超分辨率重建等等。
在神经网络层230中的多层隐藏层之后,也就是整个卷积神经网络200的最后层为输出层240,该输出层240具有类似分类交叉熵的损失函数,具体用于计算预测误差,一旦整个卷积神经网络200的前向传播(如图2由层210至层240方向的传播为前向传播)完成,反向传播(如图2由层240至层210方向的传播为反向传播)就会开始更新前面提到的各层的权重值以及偏差,以减少卷积神经网络200的损失,及卷积神经网络200通过输出层输出的结果和理想结果之间的误差。
需要说明的是,如图2所示的卷积神经网络200仅作为一种卷积神经网络的示例,在具体的应用中,卷积神经网络还可以以其他网络模型的形式存在,例如U-Net、可变性人脸模型(3D Morphable Face Model,3DMM)和残差网络(Residual Network,ResNet)等。
(3)循环神经网络
循环神经网络(RNN,Recurrent Neural Networks)是用来处理序列数据的。在传统的神经网络模型中,是从输入层到隐含层再到输出层,层与层之间是全连接的,而对于每一层层内之间的各个节点是无连接的。这种普通的神经网络虽然解决了很多难题,但是却仍然对很多问题却无能无力。例如,你要预测句子的下一个单词是什么,一般需要用到前面的单词,因为一个句子中前后单词并不是独立的。RNN之所以称为循环神经网路,即一个序列当前的输出与前面的输出也有关。具体的表现形式为网络会对前面的信息进行记忆并应用于当前输出的计算中,即隐含层本层之间的节点不再无连接而是有连接的,并且隐含层的输入不仅包括输入层的输出还包括上一时刻隐含层的输出。理论上,RNN能够对任何长度的序列数据进行处理。对于RNN的训练和对传统的CNN或DNN的训练一样。同样使用误差反向传播算法,不过有一点区别:即,如果将RNN进行网络展开,那么其中的参数,如W,是共享的;而如上举例上述的传统神经网络却不是这样。并且在使用梯度下降算法中,每一步的输出不仅依赖当前步的网络,还依赖前面若干步网络的状态。该学习算法称为基于时间的反向传播算法Back propagation Through Time(BPTT)。
既然已经有了卷积神经网络,为什么还要循环神经网络?原因很简单,在卷积神经网络中,有一个前提假设是:元素之间是相互独立的,输入与输出也是独立的,比如猫和狗。但现实世界中,很多元素都是相互连接的,比如股票随时间的变化,再比如一个人说了:我喜欢旅游,其中最喜欢的地方是云南,以后有机会一定要去。这里填空,人类应该都知道是填“云南”。因为人类会根据上下文的内容进行推断,但如何让机器做到这一步?RNN就应运而生了。RNN旨在让机器像人一样拥有记忆的能力。因此,RNN的输出就需要依赖当前的输入信息和历史的记忆信息。
在具体的应用中,循环神经网络可以以各种网络模型的形式存在,例如长短期记忆网络(Long Short Term Memory Networks,LSTM)、基于深度学习的端对端语音合成模型(Char2Wav)等。
(4)生成式对抗网络
生成式对抗网络(GAN,Generative Adversarial Networks)是一种深度学习模型。该模型中至少包括两个模块:一个模块是生成模型(Generative Model),另一个模块是判别模型(Discriminative Model),通过这两个模块互相博弈学习,从而产生更好的输出。生成模型和判别模型都可以是神经网络,具体可以是深度神经网络,或者卷积神经网络。GAN的基本原理如下:以生成图片的GAN为例,假设有两个网络,G(Generator)和D(Discriminator),其中G是一个生成图片的网络,它接收一个随机的噪声z,通过这个噪声生成图片,记做G(z);D是一个判别网络,用于判别一张图片是不是“真实的”。它的输入参数是x,x代表一张图片,输出D(x)代表x为真实图片的概率,如果为1,就代表100%是真实的图片,如果为0,就代表不可能是真实的图片。在对该生成式对抗网络进行训练的过程中,生成网络G的目标就是尽可能生成真实的图片去欺骗判别网络D,而判别网络D的目标就是尽量把G生成的图片和真实的图片区分开来。这样,G和D就构成了一个动态的“博弈”过程,也即“生成式对抗网络”中的“对抗”。最后博弈的结果,在理想的状态下,G可以生成足以“以假乱真”的图片G(z),而D难以判定G生成的图片究竟是不是 真实的,即D(G(z))=0.5。这样就得到了一个优异的生成模型G,它可以用来生成图片。
在具体的应用中,生成式对抗网络可以以各种网络模型的形式存在,例如基于生成式对抗网络的唇形合成模型(wave2lip),唇形合成模型用于实现动态图像中的口型与输入音频的同步。又如使用U-Net网络结构的Pix2Pix网络。
(5)损失函数
在训练深度神经网络的过程中,因为希望深度神经网络的输出尽可能的接近真正想要预测的值,所以可以通过比较当前网络的预测值和真正想要的目标值,再根据两者之间的差异情况来更新每一层神经网络的权重向量(当然,在第一次更新之前通常会有初始化的过程,即为深度神经网络中的各层预先配置参数),比如,如果网络的预测值高了,就调整权重向量让它预测低一些,不断的调整,直到深度神经网络能够预测出真正想要的目标值或与真正想要的目标值非常接近的值。因此,就需要预先定义“如何比较预测值和目标值之间的差异”,这便是损失函数(loss function)或目标函数(objective function),它们是用于衡量预测值和目标值的差异的重要方程。其中,以损失函数举例,损失函数的输出值(loss)越高表示差异越大,那么深度神经网络的训练就变成了尽可能缩小这个loss的过程。
(6)反向传播算法
卷积神经网络可以采用误差反向传播(back propagation,BP)算法在训练过程中修正初始的超分辨率模型中参数的大小,使得超分辨率模型的重建误差损失越来越小。具体地,前向传递输入信号直至输出会产生误差损失,通过反向传播误差损失信息来更新初始的超分辨率模型中参数,从而使误差损失收敛。反向传播算法是以误差损失为主导的反向传播运动,旨在得到最优的超分辨率模型的参数,例如权重矩阵。
(7)跨模态驱动
模态是指数据的存在形式,比如文本、音频、图像、视频等文件格式。有些数据的存在形式不同,但都是描述同一事物或事件的,例如人物说话视频中人物说一个词语时的语音和动作图像,均是人物说这个词语的数据。跨模态驱动是指采用一个模态的数据生成另一个模态的数据。例如,根据文本数据生成人阅读该文本数据的语音数据,或是根据语音数据生成人发出语音数据中语音的说话视频。
下面结合图3和图4对现有技术中构建虚拟人的方式进行说明。
如图3所示,奥巴马网络(ObamaNet)是一种用于专用于某种特定形象的虚拟人生成模型。基于ObamaNet构建虚拟人的步骤可以包括拍摄、语音处理、转写处理、验收等过程,开发并提供客户专属的虚拟人模型,从而开发客户专属的虚拟人模型。
ObamaNet主要包括基于Char2Wav的文本转语音网络、用于生成与音频同步相关的音唇同步参数(Keypoints)的LSTM,以及用于通过音唇同步参数生成虚拟人说话视频的虚拟人生成模型,虚拟人生成模型是使用U-Net网络结构的Pix2Pix网络。
ObamaNet的工作过程为:计算设备获取文字文本,通过文本转语音网络实现文字到语音的转化,通过LSTM将语音转化成驱动虚拟人说话的音唇同步参数,将音唇同步参数输入虚拟人生成模型,生成最终的虚拟人说话视频。虚拟人模型构建过程为有监督训练,需要大量真人说话视频和音频,才能训练出能够适应任意文本(音频)的虚拟人。因此,基于奥巴马网络构建虚拟人的周期较长。示例地,基于奥巴马网络构建虚拟人的周期可以参考图4,构建虚拟人的制作周期可以包括:两周时间拍摄数据(收集照片和音频等数据)、 三周时间加工数据(将文本转语音)、一周时间学习数据(训练深度学习神经网络),以及一周时间适应服务(配置模型以及提供虚拟人模型的数据接口)。
基于ObamaNet构建虚拟人还存在如下问题:(1)需要拍摄大量的训练素材,拍摄周期长,拍摄成本高,拍摄过程中对模特的坐姿、表情、动作和背景要求严格,需要说话人注意录制中途不能做出身体活动较大的表情动作,例如喝水、转头等,录制时还要考虑人脸光照情况等;(2)人工采集特定形象的情况下即使投入大量人力物力,训练数据依然很难得到足够数据量的训练数据。所以,由于训练数据的限制,导致模型泛化能力差,生成的虚拟人说话视频的音唇同步效果差。
本申请提供一种驱动虚拟人说话的模型训练方法,尤其是提供一种根据音频数据集扩充训练数据的驱动虚拟人说话的模型训练方法,即计算设备根据音频数据集和人物说话视频生成初始虚拟人说话视频后,根据初始虚拟人说话视频确定音唇同步训练参数,计算设备再将音唇同步训练参数作为标签,根据音频数据集和音唇同步训练参数训练音唇同步参数生成模型。由此,计算设备根据音频数据集生成的初始虚拟人说话视频对音唇同步参数生成模型的训练数据进行扩充,相对于录制的真人说话视频中提取训练数据,降低了训练数据的采集难度和采集耗时,计算设备能够在短时间内获得更多的训练数据,从而提高采用音唇同步训练参数训练得到的音唇同步参数生成模型的精度和泛化性能。
下面将结合附图对本申请实施例的实施方式进行详细描述。
图5为本申请实施例提供的一种驱动虚拟人说话系统的架构示意图。如图5所示,系统500包括执行设备510、训练设备520、数据库530、终端设备540、数据存储系统550和数据采集设备560。
执行设备510可以是终端,如手机终端、平板电脑、笔记本电脑、虚拟现实(virtual reality,VR)设备、增强现实(augmented reality,AR)设备、混合现实(Mixed Reality,MR)设备、扩展现实(Extended Reality,ER)设备、摄像头或车载终端等,还可以是边缘设备(例如,携带具有处理能力芯片的盒子)等。
训练设备520可以是终端,还可以是其他计算设备,如服务器或者云端设备等。
作为一种可能的实施例,执行设备510和训练设备520是部署在不同物理设备(如:服务器或集群中的服务器)上的不同处理器。例如,执行设备510可以是图形处理单元(graphic processing unit,GPU)、中央处理器(central processing unit,CPU)、其他通用处理器、数字信号处理器(digital signal processing,DSP)、专用集成电路(application-specific integrated circuit,ASIC)、现场可编程门阵列(field-programmable gate array,FPGA)或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件等。通用处理器可以是微处理器或者是任何常规的处理器等。训练设备520可以是图形处理器(graphics processing unit,GPU)、神经网络处理器(neural network processing unit,NPU)、微处理器、特定应用集成电路(application-specific integrated circuit,ASIC)、或一个或多个用于控制本申请方案程序执行的集成电路。
在另一可能的实施例中,执行设备510和训练设备520部署在同一物理设备,或执行设备510和训练设备520为同一物理设备。
数据采集设备560用于采集训练数据,并将训练数据存入数据库530,数据采集 设备560与执行设备510、训练设备520可以是相同或不同的设备。训练数据包括图像、语音和文字中至少一种形式的数据。例如,训练数据包括训练音频和训练音频中的目标,训练音频中的目标可以是指训练音频的标签。
训练设备520用于利用训练数据对神经网络进行训练,直到神经网络中的损失函数收敛,且损失函数值小于特定阈值则神经网络训练完成,从而使得神经网络达到一定精度。或者,数据库530中所有的训练数据被用于训练,则神经网络训练完成,使训练完成的神经网络具有跨模态驱动虚拟人、音唇同步参数生成等功能。进而,训练设备520将训练完成的神经网络501配置到执行设备510。执行设备510用于实现根据训练完成的神经网络501处理应用数据的功能。
在一些实施例中,执行设备510和训练设备520为同一计算设备,计算设备可以将训练完成的神经网络501配置到自身,利用训练完成的神经网络501实现跨模态驱动虚拟人、音唇同步参数生成等功能。
在另一些实施例中,训练设备520可以将训练完成的神经网络501配置到多个执行设备510。每个执行设备510利用训练完成的神经网络501实现跨模态驱动虚拟人、音唇同步参数生成等功能。
结合驱动虚拟人说话系统500,本实施例提供的驱动虚拟人说话的模型训练方法能够应用在跨模态驱动的场景。具体而言,本申请实施例的模型训练方法能够应用在终端制作动态图像、虚拟形象网络直播等场景中,下面分别对终端制作动态图像、虚拟形象网络直播的场景进行简单介绍。
例如,对于终端制作动态图像的场景:用户使用训练设备520(例如:手机、电脑、平板电脑)获取音频数据集和人物说话视频,音频数据集和人物说话视频可以是训练设备520从网络数据库下载或对真人说话进行录制得到的,音频数据集可以包含人说话的音频。用户操作训练设备520根据音频数据集和人物说话视频得到音唇同步参数生成模型。训练设备520根据音频数据集和人物说话视频中的人物说话特征生成初始虚拟人说话视频,根据初始虚拟人说话视频确定音唇同步训练参数,然后将音唇同步训练参数作为标签,与音频数据集作为训练数据训练音唇同步参数生成模型。
由于训练设备520利用初始虚拟人说话视频对音唇同步训练参数的数据量进行扩充,相对于从录制的真人说话视频中提取音唇同步训练参数,要录制的真人说话视频的时长较短,训练设备520需要获取或录制的人物说话视频的数据量较小。此外,训练设备520根据初始虚拟人说话视频生成音唇同步训练参数,由于音唇同步训练参数是作为模型训练的标签,对初始虚拟人说话视频的清晰度要求较低。训练设备520根据清晰度较低的初始虚拟人说话视频生成音唇同步训练参数,降低了生成标签的计算量小、处理速度快。因此,手机、电脑和平板电脑等处理能力弱于专用图形与音频处理服务器的终端也能够作为训练设备520进行音唇同步参数生成模型的训练。
例如,对于虚拟形象进行网络直播的场景:操作人员使用训练设备520(例如:云端设备、服务器)获取音频数据集和人物说话视频,音频数据集和人物说话视频可以是训练设备520从网络数据库下载或对真人说话进行录制得到的,音频数据集可以是包含人唱歌的音频。操作人员操作训练设备520根据音频数据集和人物说话视频得到音唇同步参数生成模型的步骤与终端制作动态图像的场景中得到音唇同步参数生成 模型的步骤相同,在此不再赘述。
本申请实施例的模型训练方法应用于虚拟形象进行网络直播的场景,训练设备520需要获取或录制的人物说话视频的数据量较小,且生成标签的计算量小、处理速度快,提高了音唇同步参数生成模型的构建效率,减少了模型训练的成本。另外,训练设备520根据初始虚拟人说话视频生成音唇同步训练参数,保证了音唇同步参数生成模型的训练数据充足,能够提高音唇同步参数生成模型的泛化性能和准确性。
需要说明的是,在实际的应用中,数据库530中维护的训练数据(例如:音频数据集、人物说话视频)不一定都来自于数据采集设备560,也有可能是从其他设备接收得到的。另外,训练设备520也不一定完全基于数据库530维护的训练数据训练神经网络,也有可能从云端或其他地方获取训练数据训练神经网络。上述描述不应该作为对本申请实施例的限定。
进一步地,根据执行设备510所执行的功能,还可以进一步将执行设备510细分为如图5所示的架构,如图所示,执行设备510配置有计算模块511、I/O接口512和预处理模块513。
I/O接口512用于与外部设备进行数据交互。用户可以通过终端设备540向I/O接口512输入数据。输入数据可以包括图像或视频。另外,输入数据也可以来自数据库530。
预处理模块513用于根据I/O接口512接收到的输入数据进行预处理。在本申请实施例中,预处理模块513可以用于识别从I/O接口512接收到的应用数据的应用场景特征。
在执行设备510对输入数据进行预处理,或者在执行设备510的计算模块511执行计算等相关的处理过程中,执行设备510可以调用数据存储系统550中的数据、代码等以用于相应的处理,也可以将相应处理得到的数据和指令等存入数据存储系统550中。
例如,执行设备510存储的第一神经网络可以应用于执行设备510。执行设备510获取到应用数据后,计算模块511将应用数据输入第一神经网络得到处理结果。由于第一神经网络是由训练设备520依据类群获取的具有相似的应用场景特征的数据训练得到的,因此,利用第一神经网络对应用数据进行处理,可以满足用户对数据处理的精度需求。
最后,I/O接口512将处理结果返回给终端设备540,从而提供给用户,以便用户查看处理结果。应理解的是,终端设备540与执行设备510也可以是同一物理设备。
在图5所示情况下,用户可以手动给定输入数据,该手动给定可以通过I/O接口512提供的界面进行操作。另一种情况下,终端设备540可以自动地向I/O接口512发送输入数据,如果要求终端设备540自动发送输入数据需要获得用户的授权,则用户可以在终端设备540中设置相应权限。用户可以在终端设备540查看执行设备510输出的处理结果,具体的呈现形式可以是显示、声音、动作等具体方式。终端设备540也可以作为数据采集端,采集如图所示输入I/O接口512的输入数据及输出I/O接口512的处理结果作为新的样本数据,并存入数据库530。当然,也可以不经过终端设备540进行采集,而是由I/O接口512将如图所示输入I/O接口512的输入数据及输出I/O 接口512的处理结果,作为新的样本数据存入数据库530。
图5仅是本申请实施例提供的一种系统架构的示意图,图5中所示设备、器件、模块等之间的位置关系不构成任何限制,例如,在图5中,数据存储系统550相对执行设备510是外部存储器,在其它情况下,也可以将数据存储系统550置于执行设备510中。
接下来请参考图6a,对驱动虚拟人说话的模型训练方法进行详细阐述。在这里以图5中的训练设备520为例进行说明。
步骤610a、训练设备520根据音频数据集和人物说话特征生成初始虚拟人说话视频。
训练设备520根据音频数据集中的音频和人物说话特征输出与音频数据集中的语音匹配的初始虚拟人说话视频。人物说话特征可以是从人物说话视频中得到的。训练设备520可以将音频和人物说话视频输入跨模态语音驱动虚拟人模型,跨模态语音驱动虚拟人模型从人物说话视频提取人物说话特征,根据音频数据集和人物说话特征输出初始虚拟人说话视频。
人物说话视频可以是训练设备520从网络数据库中获取的,或者是训练设备520利用摄像机录制的真人说话的视频。人物说话视频的时长可以是数分钟,例如3分钟、5分钟等。
训练设备520利用跨模态语音驱动虚拟人模型对人物说话视频进行预处理,来得到人物说话特征。预处理可以包括裁剪和特征提取。例如,人物说话视频中存在人物图像和背景图像,人物图像包括身体部分图像和人脸部分图像。跨模态语音驱动虚拟人模型的预处理包括:将人脸部分图像从人物说话视频中裁剪处出来,并从人脸部分图像中提取人物说话特征。
另外,训练设备520也可以在对跨模态语音驱动虚拟人模型输入数据之前,利用具有裁剪和特征提取功能的程序对人物说话视频进行预处理,然后将音频数据集和预处理得到的人物说话特征输入跨模态语音驱动虚拟人模型。
音频数据集可以包括多种语种语音、多种音色语音和多种内容语音。多种内容语音中的内容可以是指语音包含的词语、短句、长句以及语气等。例如,音频数据集包括男性语气温和地用中文说“你好”、“再见”、“抱歉”等词语的语音,女性语气严厉地用英语说“请注意网络安全”、“禁止发送违规信息”等短句的语音,男性语气急促地用法语说“演唱会快开始了,大家快一点”等长句的语音。男性和女性说话的语音的频率可以不同。音频数据集包含的语音的时长可以为数十小时,例如30小时、35小时或50小时等。
可选地,训练设备520在获得人物说话视频后,可以将人物说话视频中的音频数据提取出来,并将人物说话视频中的音频数据加入音频数据集。
作为一种可选的实施方式,跨模态语音驱动虚拟人模型为唇形合成模型或GAN模型。跨模态语音驱动虚拟人模型是一种预训练模型。
采用大量真人说话视频进行有监督的模型训练,需要拍摄大量的训练素材,例如数十个小时时长的真人说话视频,训练素材采集的时间周期长,拍摄成本高,对被拍摄人物的坐姿、表情、动作和背景具有要求严格。录制中途不能大的表情动作、喝水、 转头等,录制是还要考虑避免反光等。因此,训练素材录制难度大,训练素材获取不足,导致训练获得的模型泛化性能较差。本实施例中训练设备520利用音频数据集和短时长的人物说话视频生成大量的初始虚拟人说话视频,针对音频数据集的采集,无需注意人物表情、动作和环境亮度等,降低了训练素材的获取难度,保证了训练素材的充足程度。
步骤620a、训练设备520根据初始虚拟人说话视频确定音唇同步训练参数。
训练设备520将初始虚拟人说话视频中的人物说话动作映射至三维人脸模型中,即让三维人脸模型的表情动作与初始虚拟人说话视频中人物说话的表情动作一致,再从三维人脸模型中提取音唇同步训练参数。
音唇同步训练参数是用于表示三维人脸模型进行说话动作的特征的参数。音唇同步训练参数可以包括眼部特征和唇部特征。眼部特征可以包括用于表示眼睛睁开、眼睛闭合、眼睛睁开大小和目视方向等眼部动作的参数。音唇同步训练参数用于作为训练音唇同步参数生成模型的标签。由于初始虚拟人说话视频是根据音频数据集生成的,不是根据真人说话视频提取的,也可以将音唇同步训练参数的集合称为伪标签库。
在一些可能的实施例中,唇部特征参数可以包括用于表示嘴部张开、嘴部闭合和嘴部张开大小等嘴部动作的参数。例如,与人物发出“啊”的语音对应的人物说话特征是表示嘴部张开、眼睛睁开的参数,人物发出“嗯”的语音对应的人物说话特征是表示嘴部闭合、眼睛闭合的参数。
在另外一些可能的实施例中,音唇同步训练参数还可以包括头部特征参数、眉部特征参数等。例如,头部特征参数可以包括用于表示头部转动角度和转动速度的参数。眉部特征参数可以包括用于表示眉毛移动距离的参数。
示例地,三维人脸模型可以是任意一种能够表征人脸的预训练的三维模型,例如可变性人脸模型,可变性人脸模型是一种三维人脸统计模型,训练设备520使用可变性人脸模型能够根据初始虚拟人说话视频与三维人脸模型进行拟合,并从三维人脸模型中提取音唇同步训练参数。
在另外的可能实现的例子中,训练设备520可以采用任意一种具有拟合功能的模型将初始虚拟人说话视频中的人物说话动作映射至三维人脸模型,例如贝叶斯预测模型(Bayesian Forecasting Model)、残差网络模型等。
若训练设备520直接从初始虚拟人说话视频中提取音唇同步训练参数,则由于初始虚拟人说话视频中虚拟人说话的表情细节表现以及整体动作自然性较差,存在音唇同步训练参数准确度低的问题。训练设备520将初始虚拟人说话视频中的人物说话动作映射至三维人脸模型中,再从三维人脸模型中提取音唇同步训练参数,确保了音唇同步训练参数的准确度。
步骤630a、训练设备520根据音频数据集和音唇同步训练参数训练音唇同步参数生成模型。
训练设备520将音频数据集作为输入数据,将音唇同步参数作为输出数据,将音唇同步训练参数作为监督标签,对音唇同步参数生成模型进行训练。音唇同步参数用于驱动虚拟人说话,以得到目标虚拟人说话视频。音唇同步参数包含的参数类型与音唇同步训练参数相同,在此不再赘述。可选地,音唇同步参数生成模型的网络结构可 以是U-Net、卷积神经网络、长短期记忆网络等。
训练设备520根据音频数据集生成的初始虚拟人说话视频,并根据初始虚拟人说话视频确定音唇同步训练参数,在初始虚拟人说话视频的清晰度(例如:分辨率为128*128)低于需得到的目标虚拟人说话视频的清晰度(例如:分辨率为1280×720)的情况下,训练设备520能够根据音频数据集迅速生成大量的初始虚拟人说话视频,从而根据初始虚拟人说话视频得到大量音唇同步训练参数来作为模型训练的监督标签,提高了音唇同步参数生成模型的泛化能力,使模型输出的音唇同步参数驱动的目标虚拟人的表情动作与输入音频中语音的同步性更高。因此,用户使用训练设备520得到音唇同步参数生成模型,仅需录制少量的人物说话视频作为输入数据,即可完成音唇同步参数生成模型的训练。
训练设备520训练获得音唇同步参数生成模型后,训练设备520可以将音唇同步参数生成模型配置到端侧设备,例如执行设备510,执行设备510使用音唇同步参数生成模型驱动虚拟人说话。训练设备520和执行设备510可以是同一个或不同的计算设备。
除了图6a所示的驱动虚拟人说话的模型训练方法,本实施例还提供了另一种驱动虚拟人说话的模型训练方法,如图6b所示,该驱动虚拟人说话的模型训练方法的步骤可以如下:
步骤610b、计算设备520基于音频数据集、人物说话视频生成初始虚拟人说话视频。
计算设备520根据音频数据集和人物说话视频得到根据音频数据集中的语音驱动人物说话视频中的人物说话的初始虚拟人说话视频。
例如,计算设备520将音频数据集和人物说话视频输入预训练模型,预训练模型根据音频数据集中的语音驱动人物说话视频中的人物说话,输出初始虚拟人说话视频。
其中,预训练模型可以是跨模态语音驱动虚拟人模型,人物说话视频的时长小于所述音频数据集中语音的时长。初始虚拟人说话视频的具体生成步骤请参考图6a中的步骤610a,在此不再赘述。
步骤620b、计算设备520利用初始虚拟人说话视频生成音唇同步参数生成模型。
计算设备520利用初始虚拟人说话视频和三维人脸重建模型生成音唇同步参数生成模型,例如,计算设备520利用三维人脸重建模型从初始虚拟人说话视频中提取音唇同步训练参数,将音唇同步训练参数作为标签,音频数据集作为模型输入数据,训练得到音唇同步参数生成模型。计算设备520利用三维人脸重建模型从初始虚拟人说话视频中提取音唇同步训练参数的具体步骤,请参考图6a中的步骤620a的内容,计算设备520将音唇同步训练参数作为标签,音频数据集作为模型输入数据,训练得到音唇同步参数生成模型的具体步骤,请参考图6a中的步骤630a,在此不再赘述。
接下来请参考图7,对执行设备510执行驱动虚拟人说话方法的步骤进行详细阐述。
步骤710、执行设备510获取输入音频和第一清晰度的人物说话视频。
执行设备510可以从网络数据库或本地数据库读取输入音频,还可以采用音频录制设备采集人说话的音频,来作为输入音频。
目标虚拟人是要生成的目标虚拟人说话视频中由输入音频所驱动进行说话的虚拟人。目标虚拟人可以是音唇同步参数生成模型根据第一清晰度的人物说话视频生成的。
步骤720、执行设备510基于输入音频,利用音唇同步参数生成模型生成目标虚拟人说话视频。
执行设备510将输入音频输入音唇同步参数生成模型,输出音唇同步参数,根据音唇同步参数驱动目标虚拟人说话。
音唇同步参数生成模型的训练集是包含有第一清晰度的人物说话视频的视频和三维人脸重建模型得到的,第一清晰度低于目标虚拟人说话视频的清晰度,具体步骤请参考图6a中的步骤610a至步骤630a,第一清晰度的人物说话视频可以是图6a中步骤610a生成的初始虚拟人说话视频,在此不再赘述。
音唇同步参数生成模型输出的音唇同步参数用于驱动目标虚拟人说话,即驱动目标虚拟人做出与输入音频匹配的表情动作。音唇同步参数与音唇同步训练参数包含的参数类型相同,在此不在赘述。
生成目标虚拟人说话视频的步骤可以包括:执行设备510将音唇同步参数输入虚拟人生成模型,虚拟人生成模型根据音唇同步参数驱动目标虚拟人说话,以得到目标虚拟人说话视频。执行设备510输出的目标虚拟人说话视频可以包括输入音频。可选地,虚拟人生成模型可以是基于U-Net的神经网络模型。
示例地,虚拟人的人物形象与图6a中的步骤610a所述的初始虚拟人说话视频中的人物形象相同。
由于音唇同步参数生成模型是采用大量的音频数据和音唇同步训练参数训练得到,音唇同步参数生成模型输出的音唇同步参数的准确性高,且保证了音唇同步参数生成模型的泛化能力,则提高了音唇同步参数驱动的目标虚拟人的说话动作与输入音频的音唇同步性。
执行设备510得到输出目标虚拟人说话视频后,还可以将目标虚拟人说话视频发送至端侧设备,以使端侧设备向用户播放目标虚拟人说话视频。端侧设备向用户播放目标虚拟人说话视频的情景下,例如办事大厅的显示器播放目标虚拟人说话视频,目标虚拟人可以是虚拟大堂经理,目标虚拟人说话视频可以是虚拟大堂经理说话的视频,显示器在播放虚拟大堂经理说话的音频的同时播放虚拟大堂经理说话的视频,显示器播放的音频和视频中虚拟大堂经理的表情动作音唇同步。作为一种可选的实施方式,执行设备510在执行驱动虚拟人说话方法中的步骤驱动目标虚拟人说话时,执行设备510还可以将输入音频作为更新音频数据集,以更新音频数据集替代步骤610a中的音频数据集,再次执行步骤610a至步骤620a获得微调音唇同步训练参数,根据微调音唇同步训练参数更新音唇同步参数生成模型。
作为另一种可选的实施方式,执行设备510可以将输入音频中与音频数据集不同的音频数据作为更新音频数据集中的音频数据。执行设备510可以确定输入音频中的每一条音频数据与音频数据集中的每一条音频数据的音频差异值,若音频差异值大于阈值则将输入音频中的该条音频数据加入更新音频数据集。可选地,执行设备510可以采用动态时间归整(Dynamic Time Warping,DTW)算法和梅尔频率倒谱系数(Mel Frequency Cepstrum Coefficient,MFCC)算法获得两条音频数据的音频差异值。
执行设备510将微调音唇同步训练参数作为伪标签对音唇同步参数生成模型进行微调,以完成对音唇同步参数生成模型的更新。可选地,执行设备510更新音唇同步参数生成模型的方式是对音唇同步参数生成模型的层级结构中最后一层或最后多层的权重进行微调。
本实施例中执行设备510在使用音唇同步参数生成模型的过程中对音唇同步参数生成模型进行微调,提高了音唇同步参数生成模型对输入音频生成音唇同步参数的准确性,以此提高音唇同步参数生成模型的泛化能力。
可以理解的是,为了实现上述实施例中的功能,执行设备510和训练设备520可以是相同或不同的终端,终端包括了执行各个功能相应的硬件结构和/或软件模块。本领域技术人员应该很容易意识到,结合本申请中所公开的实施例描述的各示例的单元及方法步骤,本申请能够以硬件或硬件和计算机软件相结合的形式来实现。某个功能究竟以硬件还是计算机软件驱动硬件的方式来执行,取决于技术方案的特定应用场景和设计约束条件。
上文结合图6详细描述了根据本实施例所提供的驱动虚拟人说话的模型训练方法,下面将结合图8,描述根据本实施例所提供的驱动虚拟人说话的模型训练装置。
图8为本实施例提供的可能的驱动虚拟人说话的模型训练装置的示意图。驱动虚拟人说话的模型训练装置可以用于实现上述方法实施例中执行设备的功能,因此也能实现上述方法实施例所具备的有益效果。在本实施例中,该驱动虚拟人说话的模型训练装置可以是如图5所示的训练设备520,还可以是应用于服务器的模块(如芯片)。
驱动虚拟人说话的模型训练装置800包括视频生成模块810、参数生成模块820和训练模块830。驱动虚拟人说话的模型训练装置800用于实现上述图6a和图6b中所示的方法实施例中计算设备的功能。
视频生成模块810,用于根据音频数据集和人物说话视频生成初始虚拟人说话视频;
参数生成模块820,用于根据初始虚拟人说话视频确定音唇同步训练参数,音唇同步训练参数用于作为训练音唇同步参数生成模型的标签;
训练模块830,用于根据音频数据集和音唇同步训练参数训练音唇同步参数生成模型,音唇同步参数生成模型用于根据输入音频生成音唇同步参数,音唇同步参数用于驱动虚拟人说话,以得到虚拟人说话视频。
可选地,音频数据集包括多种语种语音、多种音色语音和多种内容语音。
可选地,初始虚拟人说话视频的清晰度低于虚拟人说话视频的清晰度。
可选地,音唇同步参数包括眼部特征参数和唇部特征参数。
可选地,驱动虚拟人说话的模型训练装置800还包括预处理模块。预处理模块用于对人物说话视频进行预处理得到人物说话特征,预处理包含裁剪和特征提取,人物说话特征包括眼部特征和唇部特征。
可选地,音频数据集包含人物说话视频中的音频。
应说明的是,在一些实施例中若采用另外的模块划分方式,参数生成模块820和训练模块830的功能可以均由训练模块830实现。
The method for driving a virtual human to speak provided in this embodiment is described in detail above with reference to FIG. 7. The apparatus for driving a virtual human to speak provided in this embodiment is described below with reference to FIG. 9.
FIG. 9 is a schematic diagram of a possible apparatus for driving a virtual human to speak according to this embodiment. The apparatus for driving a virtual human to speak can be used to implement the functions of the execution device in the foregoing method embodiments, and therefore can also achieve the beneficial effects of the foregoing method embodiments. In this embodiment, the apparatus for driving a virtual human to speak may be the execution device 510 shown in FIG. 5, or may be a module (such as a chip) applied to a server.
The apparatus 900 for driving a virtual human to speak includes an input module 910, a model processing module 920, and a driving module 930. The apparatus 900 for driving a virtual human to speak is configured to implement the functions of the computing device in the method embodiment shown in FIG. 7.
The input module 910 is configured to obtain input audio;
The model processing module 920 is configured to input the input audio into the lip synchronization parameter generation model and output lip synchronization parameters, where the lip synchronization parameter generation model is trained based on an audio dataset and lip synchronization training parameters;
The driving module 930 is configured to drive a virtual human to speak according to the lip synchronization parameters, so as to obtain a virtual human speaking video.
Optionally, the apparatus 900 for driving a virtual human to speak further includes a training module, and the training module is configured to update the lip synchronization parameter generation model according to the input audio.
It should be noted that, in some embodiments, if another module division manner is used, the functions of the model processing module 920 and the driving module 930 may be implemented by the model processing module 920.
It should be understood that the model training apparatus 800 for driving a virtual human to speak and the apparatus 900 for driving a virtual human to speak in the embodiments of this application may be implemented by a GPU, an NPU, an ASIC, or a programmable logic device (programmable logic device, PLD). The PLD may be a complex programmable logical device (complex programmable logical device, CPLD), a field-programmable gate array (field-programmable gate array, FPGA), a generic array logic (generic array logic, GAL), or any combination thereof. Alternatively, when the methods shown in FIG. 4 or FIG. 5 are implemented by software, the model training apparatus 800 for driving a virtual human to speak, the apparatus 900 for driving a virtual human to speak, and their modules may also be software modules.
The model training apparatus 800 for driving a virtual human to speak and the apparatus 900 for driving a virtual human to speak according to the embodiments of this application may correspondingly perform the methods described in the embodiments of this application, and the foregoing and other operations and/or functions of the units in the model training apparatus 800 for driving a virtual human to speak and the apparatus 900 for driving a virtual human to speak are respectively used to implement the corresponding procedures of the methods in FIG. 6 or FIG. 7. For brevity, details are not repeated here.
An embodiment of this application further provides a computing device. Refer to FIG. 10, which is a schematic structural diagram of a computing device according to an embodiment of this application. The computing device 1000 includes a memory 1001, a processor 1002, a communication interface 1003, and a bus 1004. The memory 1001, the processor 1002, and the communication interface 1003 are communicatively connected to one another through the bus 1004.
The memory 1001 may be a read-only memory, a static storage device, a dynamic storage device, or a random access memory. The memory 1001 may store computer instructions. When the computer instructions stored in the memory 1001 are executed by the processor 1002, the processor 1002 and the communication interface 1003 are configured to perform the steps of the model training method for driving a virtual human to speak and the method for driving a virtual human to speak of the software system. The memory may further store a data collection; for example, a part of the storage resources in the memory 1001 is divided into an area for storing a program that implements the functions of the lip synchronization parameter generation model in the embodiments of this application.
The processor 1002 may be a general-purpose CPU, an application-specific integrated circuit (application specific integrated circuit, ASIC), a GPU, or any combination thereof. The processor 1002 may include one or more chips. The processor 1002 may include an AI accelerator, for example, an NPU.
The communication interface 1003 uses a transceiver module, for example but not limited to a transceiver, to implement communication between the computing device 1000 and other devices or communication networks. For example, an iterative training request may be obtained through the communication interface 1003, and the neural network after iterative training may be fed back.
The bus 1004 may include a path for transmitting information between the components of the computing device 1000 (for example, the memory 1001, the processor 1002, and the communication interface 1003).
The computing device 1000 may be a computer (for example, a server) in a cloud data center, a computer in an edge data center, or a terminal.
The functions of the training device 520 may be deployed on each computing device 1000. For example, a GPU is used to implement the functions of the training device 520.
When the functions of the training device 520 and the functions of the execution device 510 are deployed in the same computing device 1000, the training device 520 may communicate with the execution device 510 through the bus 1004.
When the functions of the training device 520 and the functions of the execution device 510 are deployed in different computing devices 1000, the training device 520 may communicate with the execution device 510 through a communication network.
The method steps in this embodiment may be implemented by hardware, or may be implemented by a processor executing software instructions. The software instructions may consist of corresponding software modules, and the software modules may be stored in a random access memory (random access memory, RAM), a flash memory, a read-only memory (read-only memory, ROM), a programmable read-only memory (programmable ROM, PROM), an erasable programmable read-only memory (erasable PROM, EPROM), an electrically erasable programmable read-only memory (electrically EPROM, EEPROM), a register, a hard disk, a removable hard disk, a CD-ROM, or any other form of storage medium well known in the art. An exemplary storage medium is coupled to the processor, so that the processor can read information from the storage medium and write information to the storage medium. Certainly, the storage medium may also be a component of the processor. The processor and the storage medium may be located in an ASIC. In addition, the ASIC may be located in a terminal device. Certainly, the processor and the storage medium may also exist as discrete components in a network device or a terminal device.
All or some of the foregoing embodiments may be implemented by software, hardware, firmware, or any combination thereof. When software is used for implementation, the embodiments may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer programs or instructions. When the computer programs or instructions are loaded and executed on a computer, the procedures or functions described in the embodiments of this application are performed in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, a network device, a user device, or another programmable apparatus. The computer programs or instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another computer-readable storage medium; for example, the computer programs or instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center in a wired or wireless manner. The computer-readable storage medium may be any usable medium accessible by a computer, or a data storage device such as a server or a data center integrating one or more usable media. The usable medium may be a magnetic medium, for example, a floppy disk, a hard disk, or a magnetic tape; or an optical medium, for example, a digital video disc (digital video disc, DVD); or a semiconductor medium, for example, a solid state drive (solid state drive, SSD).
The foregoing descriptions are merely specific implementations of this application, but the protection scope of this application is not limited thereto. Any person skilled in the art can readily conceive of various equivalent modifications or replacements within the technical scope disclosed in this application, and such modifications or replacements shall fall within the protection scope of this application. Therefore, the protection scope of this application shall be subject to the protection scope of the claims.
Claims (26)
- A model training method for driving a virtual human to speak, characterized by comprising: generating an initial virtual human speaking video based on an audio dataset and a person speaking video, wherein the duration of the initial virtual human speaking video is greater than the duration of the person speaking video; and generating a lip synchronization parameter generation model by using the initial virtual human speaking video, wherein the lip synchronization parameter generation model is used to obtain a target virtual human speaking video, and the definition of the initial virtual human speaking video is lower than the definition of the target virtual human speaking video.
- The method according to claim 1, characterized in that generating a lip synchronization parameter generation model by using the initial virtual human speaking video comprises: generating the lip synchronization parameter generation model by using the initial virtual human speaking video and a three-dimensional face reconstruction model.
- The method according to claim 2, characterized in that generating the lip synchronization parameter generation model by using the initial virtual human speaking video and a three-dimensional face reconstruction model comprises: extracting lip synchronization training parameters from the initial virtual human speaking video by using the three-dimensional face reconstruction model; and training the lip synchronization parameter generation model by using the lip synchronization training parameters as labels and the audio dataset as model input data.
- The method according to any one of claims 1 to 3, characterized in that generating an initial virtual human speaking video based on an audio dataset and a person speaking video comprises: inputting the audio dataset and the person speaking video into a pretrained model to obtain an initial virtual human speaking video in which the person in the person speaking video is driven to speak by the speech in the audio dataset, wherein the duration of the person speaking video is less than the duration of the speech in the audio dataset.
- The method according to claim 4, characterized in that the pretrained model is used to extract person speaking features from the person speaking video and output the initial virtual human speaking video according to the audio dataset and the person speaking features.
- The method according to claim 4 or 5, characterized in that the duration of the person speaking video is less than or equal to 5 minutes, and the duration of the initial virtual human speaking video is greater than or equal to 10 hours.
- The method according to any one of claims 1 to 6, characterized in that the audio dataset includes speech in multiple languages, speech with multiple timbres, and speech with multiple contents.
- The method according to any one of claims 1 to 7, characterized in that the lip synchronization parameters include eye feature parameters and lip feature parameters.
- The method according to any one of claims 1 to 8, characterized in that the audio dataset includes the audio in the person speaking video.
- A method for driving a virtual human to speak, characterized by comprising: obtaining input audio and a person speaking video of a first definition; and generating a target virtual human speaking video based on the input audio by using a lip synchronization parameter generation model, wherein the training set of the lip synchronization parameter generation model is obtained based on a video containing the person speaking video of the first definition, the first definition is lower than the definition of the target virtual human speaking video, and the target virtual human is obtained based on the person speaking video of the first definition.
- The method according to claim 10, characterized in that before generating the target virtual human speaking video based on the input audio by using the lip synchronization parameter generation model, the method further comprises: updating the lip synchronization parameter generation model.
- The method according to claim 11, characterized in that updating the lip synchronization parameter generation model comprises: generating an initial virtual human speaking video based on the input audio and the person speaking video of the first definition, wherein the duration of the initial virtual human speaking video is greater than the duration of the speaking video of the target virtual human; and updating the lip synchronization parameter generation model by using the initial virtual human speaking video.
- A model training apparatus for driving a virtual human to speak, characterized by comprising: a video generation module, configured to generate an initial virtual human speaking video based on an audio dataset and a person speaking video, wherein the duration of the initial virtual human speaking video is greater than the duration of the person speaking video; and a training module, configured to generate a lip synchronization parameter generation model by using the initial virtual human speaking video, wherein the lip synchronization parameter generation model is used to obtain a target virtual human speaking video, and the definition of the initial virtual human speaking video is lower than the definition of the target virtual human speaking video.
- The apparatus according to claim 13, characterized in that the training module is specifically configured to: generate the lip synchronization parameter generation model by using the initial virtual human speaking video and a three-dimensional face reconstruction model.
- The apparatus according to claim 14, characterized in that the training module is specifically configured to: extract lip synchronization training parameters from the initial virtual human speaking video by using the three-dimensional face reconstruction model; and train the lip synchronization parameter generation model by using the lip synchronization training parameters as labels and the audio dataset as model input data.
- The apparatus according to any one of claims 13 to 15, characterized in that the video generation module is specifically configured to: input the audio dataset and the person speaking video into a pretrained model to obtain an initial virtual human speaking video in which the person in the person speaking video is driven to speak by the speech in the audio dataset, wherein the duration of the person speaking video is less than the duration of the speech in the audio dataset.
- The apparatus according to claim 16, characterized in that the pretrained model is used to extract person speaking features from the person speaking video and output the initial virtual human speaking video according to the audio dataset and the person speaking features.
- The apparatus according to claim 16 or 17, characterized in that the duration of the person speaking video is less than or equal to 5 minutes, and the duration of the initial virtual human speaking video is greater than or equal to 10 hours.
- The apparatus according to any one of claims 13 to 18, characterized in that the audio dataset includes speech in multiple languages, speech with multiple timbres, and speech with multiple contents.
- The apparatus according to any one of claims 13 to 19, characterized in that the lip synchronization parameters include eye feature parameters and lip feature parameters.
- The apparatus according to any one of claims 13 to 20, characterized in that the audio dataset includes the audio in the person speaking video.
- An apparatus for driving a virtual human to speak, characterized by comprising: an input module, configured to obtain input audio and a person speaking video of a first definition; and a model processing module, configured to generate a target virtual human speaking video based on the input audio by using a lip synchronization parameter generation model, wherein the training set of the lip synchronization parameter generation model is obtained based on a video containing the person speaking video of the first definition, the first definition is lower than the definition of the target virtual human speaking video, and the target virtual human is obtained based on the person speaking video of the first definition.
- The apparatus according to claim 22, characterized in that the apparatus further comprises a training module, and the training module is configured to: update the lip synchronization parameter generation model.
- The apparatus according to claim 23, characterized in that the training module is specifically configured to: generate an initial virtual human speaking video based on the input audio and the person speaking video of the first definition, wherein the duration of the initial virtual human speaking video is greater than the duration of the speaking video of the target virtual human; and update the lip synchronization parameter generation model by using the initial virtual human speaking video.
- A computing device, characterized by comprising: a processor and a memory, wherein the memory is configured to store computer instructions, and when the processor executes the instructions, the computing device is caused to perform the model training method for driving a virtual human to speak according to any one of claims 1 to 9, or perform the method for driving a virtual human to speak according to any one of claims 10 to 12.
- A system for driving a virtual human to speak, characterized by comprising: a training device and a terminal, wherein the training device is configured to perform the model training method for driving a virtual human to speak according to any one of claims 1 to 9, and the terminal is configured to perform the method for driving a virtual human to speak according to any one of claims 10 to 12.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
EP22934539.2A EP4394716A1 (en) | 2022-03-29 | 2022-06-14 | Method and apparatus for driving virtual human to speak and performing model training, computing device, and system |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210326144.0A CN116934953A (zh) | 2022-03-29 | 2022-03-29 | 驱动虚拟人说话和模型训练方法、装置、计算设备及系统 |
CN202210326144.0 | 2022-03-29 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2023184714A1 true WO2023184714A1 (zh) | 2023-10-05 |
Family
ID=88198779
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2022/098739 WO2023184714A1 (zh) | 2022-03-29 | 2022-06-14 | 驱动虚拟人说话和模型训练方法、装置、计算设备及系统 |
Country Status (3)
Country | Link |
---|---|
EP (1) | EP4394716A1 (zh) |
CN (1) | CN116934953A (zh) |
WO (1) | WO2023184714A1 (zh) |
- 2022-03-29 CN CN202210326144.0A patent/CN116934953A/zh active Pending
- 2022-06-14 WO PCT/CN2022/098739 patent/WO2023184714A1/zh active Application Filing
- 2022-06-14 EP EP22934539.2A patent/EP4394716A1/en active Pending
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112001992A (zh) * | 2020-07-02 | 2020-11-27 | 超维视界(北京)传媒科技有限公司 | 基于深度学习的语音驱动3d虚拟人表情音画同步方法及系统 |
CN113192161A (zh) * | 2021-04-22 | 2021-07-30 | 清华珠三角研究院 | 一种虚拟人形象视频生成方法、系统、装置及存储介质 |
CN113851131A (zh) * | 2021-08-17 | 2021-12-28 | 西安电子科技大学广州研究院 | 一种跨模态唇语识别方法 |
CN113783771A (zh) * | 2021-09-17 | 2021-12-10 | 杭州一知智能科技有限公司 | 一种基于微信的ai虚拟人交互方法和系统 |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117671093A (zh) * | 2023-11-29 | 2024-03-08 | 上海积图科技有限公司 | 数字人视频制作方法、装置、设备及存储介质 |
CN117336519A (zh) * | 2023-11-30 | 2024-01-02 | 江西拓世智能科技股份有限公司 | 基于ai数字人的多直播间同步直播的方法及装置 |
CN117336519B (zh) * | 2023-11-30 | 2024-04-26 | 江西拓世智能科技股份有限公司 | 基于ai数字人的多直播间同步直播的方法及装置 |
Also Published As
Publication number | Publication date |
---|---|
CN116934953A (zh) | 2023-10-24 |
EP4394716A1 (en) | 2024-07-03 |
Legal Events
Date | Code | Title | Description
---|---|---|---
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 22934539; Country of ref document: EP; Kind code of ref document: A1
| WWE | Wipo information: entry into national phase | Ref document number: 2022934539; Country of ref document: EP
| ENP | Entry into the national phase | Ref document number: 2022934539; Country of ref document: EP; Effective date: 20240327