WO2022116977A1 - Action driving method, apparatus, device, storage medium and computer program product for a target object - Google Patents
Action driving method, apparatus, device, storage medium and computer program product for a target object
- Publication number
- WO2022116977A1 (PCT/CN2021/134541)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- parameters
- target
- image
- parameter
- source
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/16—Human faces, e.g. facial parts, sketches or expressions
- G06V40/168—Feature extraction; Face representation
- G06V40/171—Local features and components; Facial parts ; Occluding parts, e.g. glasses; Geometrical relationships
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T15/00—3D [Three Dimensional] image rendering
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/43—Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
- H04N21/4302—Content synchronisation processes, e.g. decoder synchronisation
- H04N21/4307—Synchronising the rendering of multiple content streams or additional data on devices, e.g. synchronisation of audio on a mobile phone with the video output on the TV screen
- H04N21/43072—Synchronising the rendering of multiple content streams or additional data on devices, e.g. synchronisation of audio on a mobile phone with the video output on the TV screen of multiple content streams on the same device
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/16—Human faces, e.g. facial parts, sketches or expressions
- G06V40/174—Facial expression recognition
- G06V40/176—Dynamic expression
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/16—Speech classification or search using artificial neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/24—Speech recognition using non-acoustical features
- G10L15/25—Speech recognition using non-acoustical features using position of the lips, movement of the lips or face analysis
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/10—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
- H04N19/102—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or selection affected or controlled by the adaptive coding
- H04N19/132—Sampling, masking or truncation of coding units, e.g. adaptive resampling, frame skipping, frame interpolation or high-frequency transform coefficient masking
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/20—Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
- H04N21/21—Server components or server architectures
- H04N21/218—Source of audio or video content, e.g. local disk arrays
- H04N21/2187—Live feed
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/20—Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
- H04N21/23—Processing of content or additional data; Elementary server operations; Server middleware
- H04N21/234—Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs
- H04N21/23412—Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs for generating or manipulating the scene composition of objects, e.g. MPEG-4 objects
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/20—Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
- H04N21/23—Processing of content or additional data; Elementary server operations; Server middleware
- H04N21/234—Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs
- H04N21/2343—Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs involving reformatting operations of video signals for distribution or compliance with end-user requests or end-user device requirements
- H04N21/234336—Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs involving reformatting operations of video signals for distribution or compliance with end-user requests or end-user device requirements by media transcoding, e.g. video is transformed into a slideshow of still pictures or audio is converted into text
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/20—Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
- H04N21/23—Processing of content or additional data; Elementary server operations; Server middleware
- H04N21/236—Assembling of a multiplex stream, e.g. transport stream, by combining a video stream with other content or additional data, e.g. inserting a URL [Uniform Resource Locator] into a video stream, multiplexing software data into a video stream; Remultiplexing of multiplex streams; Insertion of stuffing bits into the multiplex stream, e.g. to obtain a constant bit-rate; Assembling of a packetised elementary stream
- H04N21/2368—Multiplexing of audio and video streams
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/20—Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
- H04N21/25—Management operations performed by the server for facilitating the content distribution or administrating data related to end-users or client devices, e.g. end-user or client device authentication, learning user preferences for recommending movies
- H04N21/251—Learning process for intelligent management, e.g. learning user preferences for recommending movies
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/80—Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
- H04N21/81—Monomedia components thereof
- H04N21/8146—Monomedia components thereof involving graphical data, e.g. 3D object, 2D graphics
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N7/00—Television systems
- H04N7/14—Systems for two-way working
- H04N7/15—Conference systems
- H04N7/157—Conference systems defining a virtual conference space and using avatars or agents
Definitions
- The embodiments of the present application are based on, and claim priority to, the Chinese patent application with application number 202011413461.3 filed on December 4, 2020.
- The entire content of that Chinese patent application is incorporated into the embodiments of the present application by reference.
- the present application relates to the field of Internet technologies, and relates to, but is not limited to, an action driving method, apparatus, device, computer-readable storage medium, and computer program product for a target object.
- an implementation method is to use a recurrent neural network to learn the key points of the mouth from the speech features, and then generate the mouth texture based on the information of the key points of the mouth, and finally combine it with the target video frame to obtain the lip sync speech video frame.
- Another implementation is to first learn a common and shared "voice-expression" space based on multiple sound clips from different sources, and then determine the final lip-synched video frame based on the obtained expression parameters.
- Embodiments of the present application provide an action driving method, apparatus, device, computer-readable storage medium, and computer program product for a target object, which can improve the smoothness and authenticity of a final synthesized video.
- An embodiment of the present application provides an action driving method for a target object, the method comprising:
- image reconstruction processing is performed on the target object in the target video to obtain a reconstructed image
- a synthetic video is generated from the reconstructed image, wherein the synthetic video includes the target object, and the action of the target object corresponds to the source speech.
- An embodiment of the present application provides an action driving device for a target object, and the device includes:
- an acquisition module configured to acquire a source voice and acquire a target video, where the target video includes a target object
- a facial parameter conversion module configured to perform facial parameter conversion processing on the voice parameters of the source voice at each moment, to obtain the source parameters of the source voice at the corresponding moment;
- a parameter extraction module configured to perform parameter extraction processing on the target video to obtain target parameters of the target video
- an image reconstruction module configured to perform image reconstruction processing on the target object in the target video according to a combination parameter obtained by combining the source parameter and the target parameter to obtain a reconstructed image
- the generating module is configured to generate a synthetic video from the reconstructed image, wherein the synthetic video includes the target object, and the action of the target object corresponds to the source voice.
- the embodiment of the present application provides an action driving system for a target object, which at least includes: a terminal and a server;
- the terminal is configured to send an action-driven request for the target object to the server, where the action-driven request includes a source voice and a target video, and the target video includes a target object;
- the server is configured to implement the above-mentioned action-driven method of the target object in response to the action-driven request.
- Embodiments of the present application provide a computer program product or computer program, where the computer program product or computer program includes computer instructions, and the computer instructions are stored in a computer-readable storage medium; a processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them to implement the above-mentioned action driving method for the target object.
- An embodiment of the present application provides an action driving device for a target object, including: a memory for storing executable instructions; and a processor configured to implement the above-mentioned action driving method for the target object when executing the executable instructions stored in the memory.
- An embodiment of the present application provides a computer-readable storage medium storing executable instructions for causing a processor to execute the executable instructions to implement the above-mentioned action driving method for a target object.
- The embodiments of the present application have the following beneficial effects: by obtaining the synthetic video of the final voice-driven action of the target object through the combination parameters of the source parameters and the target parameters, the smoothness and authenticity of the resulting synthetic video are improved, as is the visual effect of video synthesis.
- FIG. 1 is a system frame diagram of an action-driven method for a target object in the related art
- FIG. 2 is a schematic structural diagram of an action-driven system for a target object provided by an embodiment of the present application
- FIG. 3 is a schematic structural diagram of an action driving device for a target object provided by an embodiment of the present application.
- FIG. 4 is a schematic flowchart of an action driving method for a target object provided by an embodiment of the present application
- FIG. 5 is a schematic flowchart of an action driving method for a target object provided by an embodiment of the present application.
- FIGS. 6A-6B are schematic flowcharts of an action driving method for a target object provided by an embodiment of the present application.
- FIG. 7 is a schematic diagram of an implementation flow of a training method for an image rendering model provided by an embodiment of the present application.
- FIG. 8 is a system framework diagram of a method for driving an action of a target object provided by an embodiment of the present application.
- FIG. 9 is a frame diagram of a text-to-speech module provided by an embodiment of the present application.
- FIG. 10 is a frame diagram of a speech-to-face parameter network provided by an embodiment of the present application.
- FIG. 11 is an effect diagram of the Dlib algorithm provided by an embodiment of the present application.
- FIG. 12 is a frame diagram of an image rendering model provided by an embodiment of the present application.
- FIG. 13 is a framework diagram of a condition-based generative adversarial network provided by an embodiment of the present application.
- FIG. 14 is a virtual human synchronized speech video synthesized by a method in the related art
- FIG. 15 is a composite video generated by the action driving method of the target object according to the embodiment of the present application.
- lip sync speech video generation schemes are mainly divided into two categories: text-driven and speech-driven.
- In the text-driven approach, a piece of text and a video of the target person are input, the text is converted into speech through text-to-speech (TTS) technology, facial features are then learned from the speech features, and a video of the target person reading the text is finally output.
- the text-driven method is an extension of the voice-driven method.
- the lip sync speech video generation scheme is mainly based on deep learning.
- In one approach, a recurrent neural network is used to learn 20 key points of the mouth from the speech features, the mouth texture is then generated based on the key point information, and the texture is finally combined with the target video frame to obtain the lip sync speech video frame.
- A text-driven ObamaNet method mainly includes three modules, namely a "text-to-speech" module, a "speech-to-keypoint" module and a "keypoint-to-video frame" module. The "text-to-speech" module adopts Char2Wav from the TTS algorithms, the "speech-to-keypoint" module also uses a recurrent neural network to learn keypoint information from speech features, and the "keypoint-to-video frame" module uses a U-Net network with skip connections to realize information transfer; this was the first deep-learning-based text-driven model for lip-sync speech video generation.
- A UV map is a mapping of 3D face coordinates onto a two-dimensional plane.
- This method also uses a U-Net network to render video frames.
- Another method proposes a voice identity information removal network to convert the voice features of different speakers into a global domain, then uses a recurrent neural network to learn expression parameters from the voice features, and combines the obtained expression parameters with the 3D face parameters of the target person for reconstruction to obtain a 3D mesh; the 3D mesh is input to a U-Net network to obtain the final video frame.
- In another method, the rendering module is mainly improved, and a memory-augmented generative adversarial network (GAN) is proposed to store identity-feature and spatial-feature pairs of different speakers, so as to realize video synthesis for different speakers.
- an action-driven method for target objects based on a speech-driven model is also proposed.
- The method first learns a general and shared "voice-expression" space based on multiple sound clips from different sources; the space is composed of blend shapes, so the expression parameters of different people can be formed as linear combinations of different blend shapes in the space. 3D face reconstruction is then performed according to the obtained expression parameters to obtain the corresponding UV map, and a U-Net based on dilated convolution is used to render the video frame.
- FIG. 1 shows an action driving method for a target object in the related art.
- The system framework of this action driving method is composed of a generalized network 11 and a specialized network 12. The processing flow is as follows: first, sound clips 111 from different sources are input into a speech recognition system (DeepSpeech RNN) 112 for speech feature extraction; the obtained speech features then pass through a convolutional neural network (CNN) 113, which maps the speech features of different people to a common and shared latent audio expression space 114, in which the speech features of different people can be formed as linear combinations of different blend shapes.
- The output of the generalized network 11 then enters the content-aware filtering module 121 of the specialized network 12 to obtain smooth audio-expression parameters 122, from which a reconstructed 3D face model 123 and a UV map 124 are obtained. Finally, the UV map 124 and the background image 125 are input to the neural rendering network 126 to obtain the final output image 127.
- The related art has the following limitations: it is a voice-driven method, so it cannot take a given text as input and output the corresponding lip sync speech video; the face parameters used in the related art are only the UV map determined by the 3D face model, but the UV map can only provide the network with prior data about the mouth shape, without any auxiliary information about details such as the teeth; and when training the rendering network, only the corresponding frames of the predicted value and the real value are penalized, so the difference between preceding and following frames is not considered or optimized, resulting in jitter in the final video.
- As a result, the generated lip-synced speech video is still not smooth and not realistic.
- the current main challenges in the field of 3D virtual lip sync speech video generation include two points: face reconstruction and video frame rendering.
- The embodiment of the present application proposes a speech-to-face parameter network, which can simultaneously learn 2D mouth key points and 3D facial expression parameters from speech features, so as to obtain the precise positions provided by the 2D key points while retaining the depth information of the 3D face parameters; reconstructing the face from the combination of 2D and 3D parameters can ensure its accuracy. After the reconstructed face is obtained, it is also fused with the background.
- The embodiment of the present application also proposes a two-stage rendering network: the first rendering network renders the mouth texture region from the reconstructed face, and the second rendering network combines the mouth texture region with the background to render the final video frame.
- The advantages of using a two-stage rendering network are: 1) training the two rendering networks separately reduces the training difficulty while ensuring the accuracy of the mouth texture generated by the first rendering network; 2) when training the second rendering network, the mouth shape region is penalized to correct the mouth shape and optimize details such as teeth and wrinkles.
- A video frame similarity loss is also used to ensure that there is little difference between consecutive output frames, avoiding video jitter and preventing the video from appearing unsmooth and unrealistic.
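- As a concrete illustration of how such a frame similarity term could be combined with the per-frame and mouth-region penalties described above, the following PyTorch sketch shows one possible formulation; the L1 form of each term and the loss weights are assumptions for illustration, not the patent's exact training objective.

```python
import torch.nn.functional as F

def rendering_loss(pred_frames, real_frames, mouth_masks, w_mouth=10.0, w_smooth=1.0):
    """Illustrative loss for training the rendering networks.

    pred_frames, real_frames: tensors of shape (T, C, H, W) holding T consecutive frames.
    mouth_masks: (T, 1, H, W) binary masks of the mouth-shape region.
    The loss weights are assumptions, not values from the patent.
    """
    # Per-frame reconstruction penalty between predicted and real frames.
    recon = F.l1_loss(pred_frames, real_frames)

    # Extra penalty restricted to the mouth-shape region, to correct the mouth
    # shape and sharpen details such as teeth.
    mouth = F.l1_loss(pred_frames * mouth_masks, real_frames * mouth_masks)

    # Video frame similarity term: the change between consecutive predicted frames
    # should match the change between consecutive real frames, suppressing jitter.
    smooth = F.l1_loss(pred_frames[1:] - pred_frames[:-1],
                       real_frames[1:] - real_frames[:-1])

    return recon + w_mouth * mouth + w_smooth * smooth
```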
- a source voice and a target video are obtained, and the target video includes the target object;
- a synthetic video is generated by reconstructing the image, and the synthetic video has a target object, and the action of the target object corresponds to the source speech.
- the face in the embodiment of the present application is not limited to a human face, and may also be the face of an animal or the face of a virtual object.
- The action driving device for the target object provided in the embodiments of the present application can be implemented as a notebook computer, a tablet computer, a desktop computer, a mobile device (e.g., a mobile phone, a portable music player, a personal digital assistant, a dedicated messaging device, or a portable game device), an intelligent robot, or another terminal with a video playback function.
- The action driving device for the target object provided in the embodiments of the present application can also be implemented as a server. Next, an exemplary application in which the action driving device of the target object is implemented as a server will be described.
- FIG. 2 is a schematic structural diagram of an action driving system 20 for a target object provided by an embodiment of the present application.
- the action driving system 20 for the target object provided in the embodiment of the present application includes a terminal 100, a network 200, and a server 300.
- the terminal 100 obtains the target video and the source voice, generates an action-driven request of the target object according to the target video and the source voice, and sends the action-driven request to the server 300 through the network 200.
- After receiving the action-driven request, the server performs facial parameter conversion processing on the voice parameters of the source voice at each moment to obtain the source parameters of the source voice at the corresponding moment; performs parameter extraction on the target video to obtain the target parameters; then, according to the combination parameters obtained by combining the source parameters and the target parameters, performs image reconstruction on the target object in the target video to obtain a reconstructed image; and generates a synthetic video from the reconstructed image, wherein the synthetic video contains the target object and the action of the target object corresponds to the source voice.
- the composite video is sent to the terminal 100 through the network 200.
- the terminal 100 plays the composite video on the current interface 100-1 of the terminal 100.
- In some embodiments, the terminal 100 obtains the target video and the source voice, where the target video and the source voice may be locally stored video and voice, or may be video and voice recorded in real time.
- the terminal performs facial parameter conversion processing on the voice parameters of the source voice at each moment to obtain the source parameters of the source voice at the corresponding moment; and extracts the parameters of the target video to obtain the target parameters.
- Then, image reconstruction is performed on the target object in the target video to obtain a reconstructed image, and a composite video is generated from the reconstructed image, wherein the composite video contains the target object and the action of the target object corresponds to the source speech.
- the composite video is played on the current interface 100-1 of the terminal 100.
- the action driving method of the target object provided by the embodiment of the present application also relates to the field of artificial intelligence technology, and the synthetic video is synthesized through the artificial intelligence technology.
- it can be implemented at least through computer vision technology, speech technology and natural language processing technology in artificial intelligence technology.
- Computer vision (CV) technology is a science that studies how to make machines "see"; it refers to using cameras and computers instead of human eyes to perform machine vision tasks such as identifying, tracking and measuring targets, and to further perform graphics processing so that the processed images are more suitable for human observation or for transmission to instruments for detection.
- computer vision studies related theories and technologies trying to build artificial intelligence systems that can obtain information from images or multidimensional data.
- Computer vision technology usually includes image processing, image recognition, image semantic understanding, image retrieval, optical character recognition (OCR), video processing, video semantic understanding, video content/behavior recognition, 3D object reconstruction, 3D technology, virtual reality, augmented reality, and simultaneous localization and mapping, as well as common biometric identification technologies such as face recognition and fingerprint recognition.
- The key technologies of speech technology are automatic speech recognition (ASR), text-to-speech synthesis (TTS) and voiceprint recognition. Enabling computers to hear, see, speak and feel is the future direction of human-computer interaction, and voice is expected to become one of the most promising human-computer interaction methods.
- Natural Language Processing is an important direction in the field of computer science and artificial intelligence. It studies various theories and methods that can realize effective communication between humans and computers using natural language. Natural language processing is a science that integrates linguistics, computer science, and mathematics. Therefore, research in this field will involve natural language, the language that people use on a daily basis, so it is closely related to the study of linguistics. Natural language processing technology usually includes text processing, semantic understanding, machine translation, robot question answering, knowledge graph and other technologies.
- the action driving method of the target object provided by the embodiments of the present application may also be implemented based on a cloud platform and through cloud technology.
- the above server 300 may be a cloud server.
- FIG. 3 is a schematic structural diagram of an action driving device of a target object provided by an embodiment of the present application.
- The server 300 shown in FIG. 3 includes: at least one processor 310, a memory 350, at least one network interface 320, and a user interface 330.
- the various components in server 300 are coupled together by bus system 340 .
- bus system 340 is used to implement the connection communication between these components.
- the bus system 340 also includes a power bus, a control bus and a status signal bus.
- the various buses are labeled as bus system 340 in FIG. 3 .
- the processor 310 may be an integrated circuit chip with signal processing capabilities, such as a general-purpose processor, a digital signal processor (DSP, Digital Signal Processor), or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc., where a general-purpose processor may be a microprocessor or any conventional processor or the like.
- User interface 330 includes one or more output devices 331 that enable presentation of media content, including one or more speakers and/or one or more visual display screens.
- User interface 330 also includes one or more input devices 332, including user interface components that facilitate user input, such as a keyboard, mouse, microphone, touch screen display, camera, and other input buttons and controls.
- Memory 350 may be removable, non-removable, or a combination thereof. Exemplary hardware devices include solid state memory, hard drives, optical drives, and the like. Memory 350 optionally includes one or more storage devices that are physically remote from processor 310 . Memory 350 includes volatile memory or non-volatile memory, and may also include both volatile and non-volatile memory. The non-volatile memory may be a read-only memory (ROM, Read Only Memory), and the volatile memory may be a random access memory (RAM, Random Access Memory). The memory 350 described in the embodiments of the present application is intended to include any suitable type of memory. In some embodiments, memory 350 is capable of storing data to support various operations, examples of which include programs, modules, and data structures, or subsets or supersets thereof, as exemplified below.
- the operating system 351 includes system programs for processing various basic system services and performing hardware-related tasks, such as a framework layer, a core library layer, a driver layer, etc., for implementing various basic services and processing hardware-based tasks;
- An input processing module 353 for detecting one or more user inputs or interactions from one of the one or more input devices 332 and translating the detected inputs or interactions.
- FIG. 3 shows an action driving device 354 of a target object stored in the memory 350. The action driving device 354 of the target object in the server 300 may be software in the form of programs and plug-ins, and includes the following software modules: an acquisition module 3541, a facial parameter conversion module 3542, a parameter extraction module 3543, an image reconstruction module 3544, and a generation module 3545. These modules are logical, so they can be arbitrarily combined or further split according to the functions implemented. The function of each module will be explained below.
- the apparatus provided by the embodiments of the present application may be implemented in hardware.
- As an example, the apparatus provided by the embodiments of the present application may be a processor in the form of a hardware decoding processor, which is programmed to execute the action driving method of the target object provided by the embodiments of the present application. For example, the processor in the form of a hardware decoding processor may adopt one or more application-specific integrated circuits (ASIC), DSPs, programmable logic devices (PLD), complex programmable logic devices (CPLD), field-programmable gate arrays (FPGA) or other electronic components.
- FIG. 4 is a schematic flowchart of a method for driving an action of a target object provided by an embodiment of the present application. The steps shown in FIG. 4 will be described below.
- step S401 a source voice and a target video are acquired, and the target video includes a target object.
- The server may receive an action-driven request for the target object sent by the user through the terminal. The action-driven request is used to request that the source voice and the target video be synthesized into a video that contains both the target object and the source voice, with the source voice driving the action of the target object; that is, the synthetic video requested to be generated contains the target object from the target video, and the voice corresponding to the target object is the source voice.
- the source voice may be the voice pre-recorded by the user, the voice downloaded from the network, or the voice obtained by converting a specific text.
- the sound feature of the source speech may be the sound feature of a specific object, and may also be the sound feature of the target object in the target video.
- Step S402 performing facial parameter conversion processing on the speech parameters of the source speech at each moment to obtain the source parameters of the source speech at the corresponding moment.
- The source parameters at each moment include but are not limited to expression parameters and mouth key point parameters, where the expression parameters correspond to the speech parameters at that moment; for example, when the speech parameters correspond to cheerful speech, the expression parameter may be a smiling expression parameter, and when the speech parameters correspond to sad speech, the expression parameter may be a frowning expression parameter.
- the mouth key point parameter is the mouth shape parameter when expressing the speech parameter at the moment.
- the expression parameters are 3D expression parameters
- the mouth key point parameters are 2D key point parameters
- Step S403 performing parameter extraction processing on the target video to obtain target parameters.
- a preset algorithm can be used to extract parameters from the target video, that is, to extract parameters from the target object in the target video.
- The target parameters include but are not limited to the target mouth key point parameters and the target face parameters; they may also include the pose parameters, position parameters, shape parameters and action parameters of the target object.
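- As an illustration of this parameter extraction step, the following sketch uses the Dlib facial landmark detector (see the effect diagram in FIG. 11) to pull the 20 mouth key points of the target object from a video frame; the 68-point predictor model file and the grayscale input are assumptions for illustration, not requirements of the method.

```python
import dlib
import numpy as np

# The model path is an assumption; the patent only states that a Dlib-style
# algorithm extracts facial key points from the target video frame.
detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def extract_mouth_keypoints(frame_gray):
    """Return the 20 mouth key points (indices 48-67 of the 68-point model)
    for the first detected face, as a (20, 2) array of (x, y) pixel coordinates."""
    faces = detector(frame_gray, 1)
    if not faces:
        return None
    shape = predictor(frame_gray, faces[0])
    return np.array([[shape.part(i).x, shape.part(i).y] for i in range(48, 68)])
```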
- Step S404 Perform image reconstruction processing on the target object in the target video according to the combination parameter obtained by combining the source parameter and the target parameter to obtain a reconstructed image.
- the source parameters and the target parameters are first combined to obtain the combined parameters, and the combined parameters are parameters used to characterize the posture, position, shape, action and mouth shape of the target object in the final composite video.
- image reconstruction is performed on the target object according to the combination parameters to obtain a reconstructed image, and the reconstructed image is an image used to generate a final composite video.
- Step S405, a composite video is generated from the reconstructed image.
- the action of the target object corresponds to the source speech.
- For the voice parameters at each moment, a corresponding reconstructed image is generated, and each reconstructed image is rendered to generate a composite image. There is at least one reconstructed image, and the duration of the synthesized video is equal to or greater than the duration of the source voice.
- When there is one reconstructed image, the final composite video is a single composite image; when there are multiple reconstructed images, the final composite video has the same duration as the source voice and is a video formed by concatenating multiple composite images in chronological order.
- the target video may have at least one video frame, and the target video has the target object.
- When the target video includes one video frame, the video frame contains the target object, the video synthesis request is used to request the generation of a synthesized video of the target object, and the synthesized video is a dynamic video obtained based on that single video frame; when the target video includes multiple video frames, at least one video frame contains the target object, the video synthesis request is used to request the generation of a composite video of the target object, and the composite video is a dynamic video obtained based on the multiple video frames.
- the duration of the target video may be the same as or different from the duration of the source voice.
- When the duration of the target video is the same as the duration of the source voice, a composite image can be formed for each video frame according to the voice parameters of the source voice at the corresponding moment, and finally a composite video with the same duration as the target video can be formed.
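- A minimal sketch of how the per-moment composite images and the source voice could be assembled into the final composite video is shown below, assuming the composite images have been written to disk as numbered frames and that ffmpeg is available; the frame rate and file layout are illustrative assumptions.

```python
import subprocess

def mux_frames_with_audio(frame_pattern, audio_path, out_path, fps=25):
    """Combine the chronological composite images with the source voice into one video.

    frame_pattern: e.g. "frames/%05d.png"; fps of 25 is an assumed value.
    """
    subprocess.run([
        "ffmpeg", "-y",
        "-framerate", str(fps), "-i", frame_pattern,   # composite images in order
        "-i", audio_path,                              # source voice track
        "-c:v", "libx264", "-pix_fmt", "yuv420p",
        "-shortest",                                   # stop at the shorter stream
        out_path,
    ], check=True)
```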
- The embodiments of the present application can be applied to the following scenario: in the education industry, to generate a teaching video about a certain knowledge point, the source voice corresponding to the knowledge point (that is, the teacher's explanation) and a target video containing the teacher giving a lecture can be input to the server, and the server can directly generate and output a teaching video (i.e., a synthetic video) in which the teacher explains the knowledge point, using the method of the embodiments of the present application.
- In the action driving method for a target object provided by the embodiments of the present application, facial parameter conversion processing is performed on the voice parameters of the source voice at each moment to obtain the source parameters of the source voice at the corresponding moment; parameters are extracted from the target video to obtain the target parameters; the image of the target object is reconstructed according to the combination parameters of the source parameters and the target parameters to obtain a reconstructed image; and finally a synthetic video is generated from the reconstructed image, wherein the synthetic video contains the target object and the action of the target object corresponds to the source voice.
- In this way, since the synthetic video of the final voice-driven action of the target object is obtained based on the combined parameters of the source parameters and the target parameters, the resulting synthetic video is smoother and more realistic, and the visual effect of video synthesis is improved.
- the action driving system of the target object includes at least a terminal and a server, and through the interaction between the terminal and the server, a response to an action driving request of the terminal is implemented, and a composite video desired by the user is generated.
- the action-driven request includes source voice and target video, and the action-driven request may also include source text, and the source voice can be obtained through the source text.
- FIG. 5 is a schematic flowchart of a method for driving an action of a target object provided by an embodiment of the present application. As shown in FIG. 5 , the method includes the following steps:
- Step S501 the terminal acquires the source voice and the target video.
- the source voice may be the voice collected by the user through a voice collection device on the terminal, or may be the voice downloaded by the user through the terminal.
- the target video can be a video of any duration, and the target video has a target object.
- Step S502 the terminal obtains the source text and the target video.
- the source text is the text used to generate the source voice.
- In the embodiments of the present application, not only can the input source voice be processed to generate a synthetic video with the source voice, but the input source text can also be parsed and converted to generate the source speech, which in turn is used to form a synthesized video with the source speech.
- Step S503 the terminal performs text analysis on the source text to obtain linguistic features of the source text.
- The linguistic features include, but are not limited to, features such as pinyin, pauses, punctuation, and tones.
- text parsing of the source text may also be performed based on artificial intelligence technology to obtain linguistic features of the source text.
- Step S504 the terminal extracts the acoustic parameters of the linguistic feature to obtain the acoustic parameters of the source text in the time domain.
- the acoustic parameters are the parameter representation of the source text in the time domain, and the acoustic parameters of the source text in the time domain are obtained by extracting the acoustic parameters of the linguistic features.
- Step S505 the terminal performs conversion processing on the acoustic parameters to obtain the speech waveform of the source text in the frequency domain.
- the speech waveform is the acoustic representation corresponding to the acoustic parameters
- the speech waveform is the parametric representation of the source text in the frequency domain.
- Step S506 the terminal determines the voice corresponding to the voice waveform as the source voice.
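- The text-to-speech steps S503 to S506 can be summarized by the following structural sketch; the frontend, acoustic model and vocoder objects are hypothetical placeholders for whatever components the text-to-speech module actually uses, so this shows the shape of the pipeline rather than a concrete implementation.

```python
def text_to_source_speech(source_text, frontend, acoustic_model, vocoder):
    """Structural sketch of steps S503-S506 (all three components are hypothetical)."""
    # S503: text analysis -> linguistic features (e.g., pinyin, pauses, punctuation, tones)
    linguistic_features = frontend.analyze(source_text)
    # S504: acoustic parameter extraction -> acoustic parameters of the source text in the time domain
    acoustic_params = acoustic_model.predict(linguistic_features)
    # S505: conversion processing -> speech waveform of the source text in the frequency domain
    waveform = vocoder.synthesize(acoustic_params)
    # S506: the voice corresponding to the waveform is taken as the source voice
    return waveform
```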
- Step S507 the terminal encapsulates the source voice and the target video to form an action-driven request.
- the terminal may also encapsulate the source text in the action-driven request, and send the action-driven request to the server, and the server implements the steps of converting the source text into the source speech in steps S503 to S506.
- Step S508 the terminal sends the action driving request to the server.
- Step S509 the server parses the action-driven request to obtain the source voice and the target video.
- step S510 the server performs facial parameter conversion processing on the speech parameters of the source speech at each moment to obtain the source parameters of the source speech at the corresponding moment.
- Step S511 the server extracts parameters from the target video to obtain target parameters.
- Step S512 the server performs image reconstruction on the target object in the target video according to the combination parameter obtained by combining the source parameter and the target parameter to obtain a reconstructed image.
- Step S513 the server generates a composite video by reconstructing the image, wherein the composite video has a target object, and the action of the target object corresponds to the source voice.
- steps S510 to S513 are the same as the above-mentioned steps S402 to S405, and are not repeated in this embodiment of the present application.
- Step S514 the server sends the composite video to the terminal.
- Step S515 the terminal plays the composite video on the current interface.
- FIG. 6A is a schematic flowchart of a method for driving an action of a target object provided by an embodiment of the present application. As shown in FIG. 6A , step S402 can be implemented by the following steps:
- Step S601 perform feature extraction on the source speech to obtain a speech feature vector of the source speech.
- step S602 convolution processing and full connection processing are sequentially performed on the speech feature vector to obtain expression parameters and mouth key point parameters of the source speech at the corresponding moment.
- Step S602 may be implemented in the following manner: convolution processing is sequentially performed on the speech feature vector through at least two first convolution layers with specific convolution kernels to obtain a convolution processing vector; then, full connection processing is performed on the convolution processing vector in turn through fully connected layers to obtain a fully connected processing vector.
- the full connection processing vector includes the vector corresponding to the expression parameter and the vector corresponding to the mouth key point parameter, wherein the sum of the dimensions of the vector corresponding to the expression parameter and the vector corresponding to the mouth key point parameter is equal to the full connection processing vector. dimension.
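- A minimal PyTorch sketch of this speech-to-face parameter conversion is given below: stacked convolution layers process the speech feature vector, and fully connected layers output a single vector that is split into the expression parameters and the mouth key point parameters, so that the two dimensionalities sum to the fully connected output dimension. The input feature size, the 64 expression dimensions and the 20 mouth key points are illustrative assumptions.

```python
import torch.nn as nn

class SpeechToFaceParams(nn.Module):
    """Sketch of the speech-to-face parameter network (all sizes are assumptions)."""

    def __init__(self, feat_dim=29, n_expr=64, n_mouth_pts=20):
        super().__init__()
        self.conv = nn.Sequential(                      # at least two convolution layers
            nn.Conv1d(feat_dim, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        self.fc = nn.Sequential(                        # fully connected layers
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, n_expr + 2 * n_mouth_pts),   # dims sum to the FC output size
        )
        self.n_expr = n_expr

    def forward(self, speech_feats):                    # speech_feats: (B, feat_dim, T)
        h = self.conv(speech_feats).squeeze(-1)         # (B, 128)
        out = self.fc(h)                                # (B, n_expr + 2 * n_mouth_pts)
        expr = out[:, :self.n_expr]                     # 3D expression parameters
        mouth = out[:, self.n_expr:].view(out.size(0), -1, 2)  # 2D mouth key points
        return expr, mouth
```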
- step S403 can be implemented by the following steps:
- Step S603 perform mouth parameter extraction and face parameter extraction sequentially on the target object in the current video frame of the target video, and correspondingly obtain target mouth key point parameters and target face parameters.
- the target mouth key point parameters and the target face parameters are the parameters of the target object.
- The target mouth key point parameters and the target face parameters of the target object in each video frame can be extracted.
- Step S604 determining the target mouth key point parameter and the target face parameter as target parameters.
- step S404 can be implemented by the following steps:
- Step S605 combine the source parameter and the target parameter to obtain the combined parameter.
- the parameters used to generate the final composite image can be extracted, and the parameters not used to generate the final composite image are deleted to obtain the combined parameters.
- Step S606 Perform image reconstruction on the target object in the target video according to the combination parameters to obtain a mouth contour map and a UV map.
- The reconstructed image includes a mouth contour map and a UV map, wherein the mouth contour map is used to reflect the mouth contour of the target object in the final composite image, and the UV map is combined with the mouth contour map to generate the texture of the mouth region of the target object in the composite image.
- Step S607 using the mouth contour map and the UV map as the reconstructed image.
- In some embodiments, the source parameters include expression parameters and mouth key point parameters; the target parameters include target mouth key point parameters and target face parameters; and the target face parameters include at least target pose parameters, target shape parameters and target expression parameters.
- Step S605 can be implemented in the following way: the target expression parameters in the target face parameters are replaced with the expression parameters to obtain the replaced face parameters; the target mouth key point parameters are replaced with the mouth key point parameters to obtain the replaced mouth key point parameters; and the replaced face parameters and the replaced mouth key point parameters are used as the combined parameters.
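- A minimal sketch of this replacement-based combination is shown below, assuming the parameters are carried in dictionaries; the field names are chosen here for illustration only.

```python
def combine_parameters(source_params, target_params):
    """Replace the target's expression and mouth key points with those derived from
    the source voice, keeping the target's other parameters (pose, shape, etc.)."""
    combined = dict(target_params)                       # keep target pose/shape/position
    face = dict(target_params["face"])                   # target 3D face parameters
    face["expression"] = source_params["expression"]     # replaced face parameters
    combined["face"] = face
    combined["mouth_keypoints"] = source_params["mouth_keypoints"]  # replaced key points
    return combined
```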
- In some embodiments, the process of generating a composite video from the reconstructed image in step S405 can be realized by the following steps:
- Step S6054 the image rendering model is invoked based on the replaced face parameters, the replaced mouth key point parameters and the background image corresponding to the target video at each moment.
- the replaced face parameters, the replaced mouth key point parameters and the background image corresponding to the target video at each moment are input into the image rendering model.
- the reconstructed image includes the replaced face parameters and the replaced mouth key point parameters.
- Step S6055, through the first rendering network in the image rendering model, mouth shape region rendering is performed on the replaced face parameters and the replaced mouth key point parameters at each moment, to obtain the mouth shape region texture image at each moment.
- In some embodiments, the first rendering network includes at least one second convolution layer, at least one first down-sampling layer and at least one first up-sampling layer. The mouth shape region rendering process can be implemented by the following steps: convolution and down-sampling are performed on the replaced face parameters and the replaced mouth key point parameters through the second convolution layer and the first down-sampling layer in turn to obtain the depth features of the reconstructed image; then, through the first up-sampling layer, up-sampling processing is performed on the depth features of the reconstructed image to restore the resolution of the reconstructed image and obtain the mouth shape region texture image.
- Step S6056 Perform splicing processing on the texture image of the mouth shape region and the background image through the second rendering network in the image rendering model to obtain a composite image at each moment.
- the second rendering network includes at least one third convolution layer, at least one second down-sampling layer and at least one second up-sampling layer. The splicing processing can be implemented by the following steps: convolution and down-sampling are performed on the mouth shape region texture image and the background image through the third convolution layer and the second down-sampling layer in turn, to obtain the depth features of the mouth shape region texture image and the background image; the depth features are then up-sampled through the second up-sampling layer to restore the resolution of the mouth shape region texture image and the background image, and obtain the composite image at the current moment.
- Step S6057 Determine a synthesized video including the target object and the source voice according to the synthesized image at each moment.
- the above image rendering model is used to render the reconstructed image at each moment to generate a composite image at the corresponding moment; the composite image includes the target object, and the mouth shape of the target object corresponds to the source voice at the corresponding moment.
- the image rendering model includes at least a first rendering network and a second rendering network.
- the first rendering network is used to perform feature extraction and mouth shape region rendering on the reconstructed image and the target image, and the second rendering network is used to stitch the mouth shape region texture image and the target image together.
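- As a hedged illustration of the two-stage rendering described in steps S6055 and S6056, the following sketch chains two small encoder-decoder networks: the first renders a mouth-region texture from the reconstructed image, and the second splices that texture with the background frame. The module names, channel counts and layer choices are assumptions for illustration only, not the disclosed network structure.

```python
# Minimal sketch of the two-stage rendering pipeline (module names and channel counts assumed).
import torch
import torch.nn as nn

class TinyEncoderDecoder(nn.Module):
    """Stand-in for one rendering network: convolution + downsampling, then upsampling."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.down = nn.Sequential(                      # convolution and downsampling
            nn.Conv2d(in_ch, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.up = nn.Sequential(                        # upsampling back to input resolution
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, out_ch, 3, padding=1),
        )

    def forward(self, x):
        return self.up(self.down(x))

# First stage: reconstructed image (assumed 6 channels: mouth contour map + UV map) -> mouth texture.
first_net = TinyEncoderDecoder(in_ch=6, out_ch=3)
# Second stage: mouth texture concatenated channel-wise with the 3-channel background frame.
second_net = TinyEncoderDecoder(in_ch=6, out_ch=3)

recon = torch.rand(1, 6, 256, 256)        # mouth contour map stacked with UV map (assumption)
background = torch.rand(1, 3, 256, 256)   # background frame of the target video

mouth_texture = first_net(recon)                                        # cf. step S6055
composite = second_net(torch.cat([mouth_texture, background], dim=1))   # cf. step S6056
print(composite.shape)  # torch.Size([1, 3, 256, 256])
```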
- FIG. 7 is a schematic diagram of an implementation flow of a training method for an image rendering model provided by an embodiment of the present application. As shown in FIG. 7 , the method includes the following steps:
- Step S701 calling an image rendering model based on the reconstructed image sample and the target image sample.
- the reconstructed image sample may be obtained through the following steps: performing facial parameter conversion processing on the voice parameters of the voice sample at the current moment to obtain a voice parameter sample; performing parameter extraction on the target image sample to obtain a target parameter sample; combining the voice parameter sample and the target parameter sample to obtain a combined parameter sample; and performing image reconstruction on the target object in the target image sample according to the combined parameter sample to obtain the reconstructed image sample.
- the reconstructed image sample can also be obtained by the following steps: performing text analysis on a text sample to obtain linguistic features of the text sample, and performing acoustic parameter extraction on the linguistic features to obtain acoustic parameters of the text sample in the time domain; converting the acoustic parameters to obtain the speech waveform of the text sample in the frequency domain, and determining the speech corresponding to the speech waveform as the voice sample.
- the voice parameter of the voice sample at the current moment is processed by face parameter conversion to obtain a voice parameter sample; the parameter extraction is performed on the target image sample to obtain the target parameter sample; the voice parameter sample and the target parameter sample are combined to obtain a combined parameter sample , and image reconstruction is performed on the target object in the target image sample according to the combined parameter sample to obtain the reconstructed image sample.
- the target image sample includes the target object sample, and the final generated composite image sample also includes the target object sample.
- Step S702 Perform feature extraction processing and mouth shape region rendering on the reconstructed image sample and the target image sample through the first rendering network of the image rendering model to obtain a mouth shape texture image sample.
- the first rendering network includes at least one second convolution layer, at least one first downsampling layer, and at least one first upsampling layer.
- the parameters corresponding to the input reconstructed image sample and the target image sample can be convolved through the second convolution layer, and the parameters after convolution processing can be down-sampled through the first down-sampling layer, so as to extract the depth features of the reconstructed image sample and the target image sample, that is, to obtain first image feature samples.
- the first upsampling layer can perform upsampling processing on the extracted first image feature samples to restore the resolution of the reconstructed image samples and the target image samples, and obtain the mouth shape texture image samples .
- a second convolution layer is connected before each first down-sampling layer, and a second convolution layer is also connected after each first up-sampling layer; that is, a convolution process is performed before each down-sampling process, and a convolution process is performed after each up-sampling process.
- a skip connection is introduced between the first down-sampling layer and the first up-sampling layer, and feature information of different resolutions is preserved through the skip connection.
- Step S703 performing splicing processing on the sample mouth shape texture image and the sample target image through the second rendering network in the image rendering model to obtain a composite image sample.
- the second rendering network includes at least one third convolutional layer, at least one second downsampling layer and at least one second upsampling layer.
- the parameters corresponding to the input mouth shape texture image sample and the target image sample can first be convolved through the third convolution layer, and the parameters after convolution processing can be down-sampled through the second down-sampling layer, so as to extract the depth features of the mouth shape texture image sample and the target image sample, that is, to obtain second image feature samples.
- an upsampling process is performed on the extracted second image feature samples through the second upsampling layer, so as to restore the resolutions of the mouth shape texture image samples and the target image samples, and obtain a composite image sample.
- a third convolution layer is connected before each second down-sampling layer, and a third convolution layer is also connected after each second up-sampling layer; that is, a convolution process is performed before each down-sampling process, and a convolution process is performed after each up-sampling process.
- skip connections may be introduced between the second down-sampling layer and the second up-sampling layer, and feature information of different resolutions can be preserved through skip connections.
- Step S704 calling a preset loss model based on the synthetic image sample to obtain a loss result.
- step S704 can be implemented in the following way: obtaining a real synthetic image corresponding to the reconstructed image sample and the target image sample; splicing the synthetic image sample and the real synthetic image and inputting them into a preset loss model; and calculating, through the preset loss model, the similarity loss between previous and subsequent frames for the synthetic image sample and the real synthetic image, to obtain the loss result.
- the following loss functions can be calculated: the generative adversarial loss of the image rendering model with respect to the real synthetic image and the synthetic image sample; the L1 loss between the synthetic image sample and the real synthetic image; the feature-level loss obtained by using the L1 loss to calculate the differences between the feature maps output by the real synthetic image and the synthetic image sample in N activation layers and linearly weighting these differences; and the similarity loss between previous and subsequent frames. The loss result is obtained as a weighted sum of at least one of the above loss functions.
- Step S705 correcting the parameters in the first rendering network and the second rendering network according to the loss result, to obtain a trained image rendering model.
- when training the image rendering model, a generative adversarial strategy can be adopted, the similarity between preceding and following frames is taken into account during training, and the loss result of the image rendering model in each prediction is then calculated.
- in this way, the image rendering model can be accurately trained, and the trained image rendering model takes the continuous change between previous and subsequent frames into account, so that the change between two consecutive video frames in the generated composite video is smoother; the resulting composite video is therefore smoother and more realistic, which improves the visual effect of the composite video generated by the image rendering model.
- the embodiments of the present application can be applied to video generation scenarios of lip-synched speech, such as smart speaker screens, smart TVs, artificial intelligence (AI) education, virtual anchors, and live broadcasts.
- the action driving method for the target object provided by the embodiments of the present application can synthesize a synchronized speaking video of a specific target person according to the input text or speech, which significantly improves the human-computer interaction effect and user experience of smart products.
- when the target object is a virtual teacher, the action driving method for the target object provided in the embodiment of the present application automatically generates, according to the text or voice input by the teacher, a personalized 3D virtual teacher that speaks synchronously, that is, a teaching video of the teacher; teaching on the student side then simulates a real teacher's online teaching, which not only improves the user experience on the student side, but also reduces the workload on the teacher side.
- the action driving method for the target object provided by the embodiment of the present application can also automatically generate a live video of a virtual anchor that speaks synchronously according to the text or voice input by the anchor.
- the virtual anchor can broadcast the game live to attract attention, and can also enhance interaction through chat programs, and can also obtain high clicks through cover dance, etc., thereby improving the efficiency of live broadcast.
- the action driving method of the target object provided by the embodiment of the present application is a text-driven or voice-driven 3D virtual mouth shape synchronous speech video generation technology.
- the mouth shape is predicted by combining 2D and 3D face parameters, and then the rendering network trained with the inter-frame similarity loss synthesizes the final output picture; the embodiments of the present application thus solve the problems that a speech-driven model is limited to speech input and that the synthesized video looks unreal and jittery.
- a piece of text or speech can be used to learn 2D/3D face parameters, and a realistic speech video of a specific target person synchronizing with the mouth shape can be synthesized accordingly.
- a new 2D/3D face model is reconstructed by replacing the corresponding parameters of the target person with the learned face parameters, and the reconstructed face model (i.e., the reconstructed image) is input into the rendering network to generate video frames, so as to realize generation of a video in which the target person speaks with a synchronized mouth shape.
- FIG. 8 is a system frame diagram of an action-driven method for a target object provided by an embodiment of the present application.
- the input of the system may be a piece of source text 801 or a source voice 802. If the input is the source text 801, the corresponding source voice is generated through the text-to-speech module 803, and the source voice then obtains the corresponding face parameters through the voice-to-face parameter network 804.
- the face parameters here include 2D mouth key points and 3D expression parameters.
- the obtained face parameters are combined with the target parameters obtained by the face parameter extraction module 805 to reconstruct a new face model 806, from which the UV Map 8061 and the reconstructed mouth key points 8062 can be obtained; the UV Map 8061 and the reconstructed mouth key points 8062 are then input into the two-stage image rendering model 807 trained with the similarity loss between front and rear frames to generate the final output picture 808 (i.e., the composite image).
- Text-to-speech module 803: this module converts a given piece of input source text into the corresponding source speech, which is used as the input of the speech-to-face parameter network.
- FIG. 9 is a frame diagram of a text-to-speech module provided by an embodiment of the present application.
- the text-to-speech module mainly includes three sub-modules: a text analysis module 901 , an acoustic model module 902 and a vocoder module 903 .
- the text analysis module 901 is used to parse the input text (that is, the source text), determine the pronunciation, tone, intonation, etc. of each character, and map the text to linguistic features.
- the linguistic features here include, but are not limited to, pinyin, pauses, punctuation, and tones; the acoustic model module 902 is used to map the linguistic features to acoustic parameters, where the acoustic parameters are the parametric representation of the source text in the time domain; the vocoder module 903 is used to convert the acoustic parameters into a speech waveform, where the speech waveform is the parametric representation of the source text in the frequency domain.
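- A minimal sketch of the three TTS sub-modules as a pipeline is given below; the function bodies are placeholders (the embodiment does not fix a particular acoustic model or vocoder), and the LinguisticFeatures fields are illustrative assumptions rather than the exact feature set.

```python
# Hedged sketch of the text analysis -> acoustic model -> vocoder pipeline; bodies are stubs.
from dataclasses import dataclass
from typing import List

@dataclass
class LinguisticFeatures:
    pinyin: List[str]   # assumed fields for illustration
    tones: List[int]
    pauses: List[int]

def text_analysis(source_text: str) -> LinguisticFeatures:
    """Maps text to linguistic features (pronunciation, tone, pauses) -- module 901."""
    chars = list(source_text)
    return LinguisticFeatures(pinyin=chars, tones=[0] * len(chars), pauses=[0] * len(chars))

def acoustic_model(feats: LinguisticFeatures) -> List[float]:
    """Maps linguistic features to time-domain acoustic parameters -- module 902."""
    return [float(t) for t in feats.tones]   # placeholder acoustic parameters

def vocoder(acoustic_params: List[float]) -> List[float]:
    """Converts acoustic parameters into a speech waveform -- module 903."""
    return acoustic_params                    # placeholder waveform samples

source_voice = vocoder(acoustic_model(text_analysis("你好，世界")))
```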
- FIG. 10 is a frame diagram of the voice-to-face parameter network provided by the embodiment of the present application. As shown in FIG. 10, A_I represents the input audio segment (Input Audio), i.e., the source voice, obtained from the user speaking or from the above text-to-speech module; F_A represents the audio features; c1-c4 represent four convolution layers; f1-f3 represent three fully connected layers; T_S represents the source 3D facial expression parameters; and K_S represents the source 2D mouth key points.
- the purpose of the speech-to-face parameter network is to predict the corresponding source 3D facial expression parameters and 2D mouth key points from the input speech fragment, where the 3D facial expression parameters have 10-dimensional coefficients, and the 2D mouth key points are based on the 20 key points used in the Dlib algorithm; since each 2D key point consists of two coordinates (x, y), the 20 key points correspond to a 40-dimensional vector.
- the speech feature F_A is first extracted by the recurrent neural network (RNN) proposed in the DeepSpeech method, and then passes through a convolutional neural network (CNN) consisting of the four convolution layers c1-c4 and the three fully connected layers f1-f3; the CNN finally outputs two sets of face parameters, namely the 3D facial expression parameters T_S and the 2D mouth key points K_S.
- the extracted speech feature F_A can be a 16×29 tensor, and the convolution layers c1-c4 all use 3×1 convolution kernels, reducing the dimensions of F_A to 8×32, 4×32, 2×64 and 1×64 in turn; the feature map output by the convolution layer c4 then passes through the three fully connected layers f1-f3 to obtain 128-, 64- and 50-dimensional vectors respectively.
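- The following sketch reproduces the stated shapes of the speech-to-face-parameter network (16×29 input, 3×1 kernels, 8×32 → 4×32 → 2×64 → 1×64, then 128-, 64- and 50-dimensional fully connected outputs). Treating the 16×29 feature as 29 channels of length 16, and the stride-2 convolutions and ReLU activations, are assumptions chosen only so that the shapes match; this is not the exact disclosed network.

```python
import torch
import torch.nn as nn

class SpeechToFaceParams(nn.Module):
    """Sketch of the speech-to-face-parameter CNN; strides/activations are assumptions
    chosen so the intermediate shapes match those stated in the text."""
    def __init__(self):
        super().__init__()
        self.convs = nn.Sequential(                                   # c1-c4, all 3x1 kernels
            nn.Conv1d(29, 32, 3, stride=2, padding=1), nn.ReLU(),     # 16x29 -> 8x32
            nn.Conv1d(32, 32, 3, stride=2, padding=1), nn.ReLU(),     # 8x32  -> 4x32
            nn.Conv1d(32, 64, 3, stride=2, padding=1), nn.ReLU(),     # 4x32  -> 2x64
            nn.Conv1d(64, 64, 3, stride=2, padding=1), nn.ReLU(),     # 2x64  -> 1x64
        )
        self.fcs = nn.Sequential(                                     # f1-f3: 128, 64, 50 dims
            nn.Linear(64, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, 50),
        )

    def forward(self, audio_feat):                 # audio_feat: (batch, 29, 16), e.g. DeepSpeech features
        x = self.convs(audio_feat).flatten(1)      # -> (batch, 64)
        out = self.fcs(x)                          # -> (batch, 50)
        expr_params, mouth_keypoints = out[:, :10], out[:, 10:]  # 10-dim T_S, 40-dim K_S (20 (x, y) points)
        return expr_params, mouth_keypoints

net = SpeechToFaceParams()
t_s, k_s = net(torch.rand(1, 29, 16))
print(t_s.shape, k_s.shape)   # torch.Size([1, 10]) torch.Size([1, 40])
```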
- Face parameter extraction module 805: this module is designed to extract the target person's 2D mouth key point positions and 3D face parameters from the target person's video frames. The 2D mouth key points are obtained by the Dlib algorithm: given a picture, the algorithm predicts 68 key points on the face, as shown in FIG. 11, which is an effect diagram of the Dlib algorithm provided by the embodiment of the present application, where the left picture 1101 is the original picture and the points on the face in the right picture 1102 are the key points predicted by the Dlib algorithm. In this embodiment of the present application, only the 20 predicted mouth key points may be used as the 2D face parameters.
- 62-dimensional 3D face parameters are predicted, including 12-dimensional pose parameters, 40-dimensional shape parameters and 10-dimensional expression parameters.
- the 2D mouth key points and 3D facial expression parameters obtained by the face parameter extraction module are replaced by the results obtained from the speech-to-face parameter network, while the pose parameters and shape parameters of the target person are retained, so as to obtain the recombined 3D face parameters. The recombined 3D face parameters are then used to reconstruct the face of the target person and obtain the corresponding UV Map, and the new 2D mouth key point information is directly used as one of the inputs for subsequent rendering.
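- A hedged sketch of this recombination step is shown below; the ordering of the 62 dimensions (pose first, then shape, then expression) is an assumption made purely for illustration.

```python
import numpy as np

# Recombining parameters: the 62-dim target 3D face vector is split into pose (12),
# shape (40) and expression (10); the expression block and the 2D mouth key points are
# replaced by the values predicted from speech, while the rest of the target is kept.
target_face_3d = np.random.rand(62)      # 12 pose + 40 shape + 10 expression (target person)
target_mouth_2d = np.random.rand(40)     # 20 mouth key points as (x, y) pairs (target person)
source_expression = np.random.rand(10)   # T_S from the speech-to-face-parameter network
source_mouth_2d = np.random.rand(40)     # K_S from the speech-to-face-parameter network

recombined_face_3d = np.concatenate([
    target_face_3d[:12],     # keep target pose parameters
    target_face_3d[12:52],   # keep target shape parameters
    source_expression,       # replace target expression parameters
])
recombined_mouth_2d = source_mouth_2d    # replace the target mouth key points directly

assert recombined_face_3d.shape == (62,)
```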
- FIG. 12 is a frame diagram of the image rendering model provided by the embodiment of the present application.
- the purpose of the image rendering model is to synthesize the final lip-synchronized speech video frames.
- the resolutions of K_R and U_R are both 256×256, and they are stitched together as the input of the image rendering model.
- the image rendering model is divided into two stages.
- the first stage (i.e., the first rendering network) synthesizes the mouth region texture r_1; r_1 and the target video background frame b_g (i.e., the background image) are spliced together as the input of the second rendering network; the second stage (i.e., the second rendering network) combines the background image to synthesize the final output r_2.
- the structures used by the two rendering networks are U-Net networks.
- the U-Net network continuously applies down-sampling and convolution operations to the input to extract deep features, and then restores the resolution through step-by-step up-sampling layers; skip connections are introduced between the down-sampling and up-sampling layers to preserve feature information at different resolutions.
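- The sketch below shows a minimal U-Net-style block with one down-sampling step, one up-sampling step and a skip connection; the channel counts and layer choices are illustrative assumptions rather than the exact rendering-network configuration.

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    """Minimal U-Net-style sketch: convolution + downsampling to extract deep features,
    upsampling to restore resolution, with a skip connection between matching scales."""
    def __init__(self, in_ch=3, out_ch=3):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv2d(in_ch, 32, 3, padding=1), nn.ReLU())
        self.down = nn.Sequential(nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU())
        self.up = nn.Sequential(nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU())
        self.dec1 = nn.Conv2d(64, out_ch, 3, padding=1)   # 64 = 32 upsampled + 32 skipped

    def forward(self, x):
        e1 = self.enc1(x)            # full-resolution features
        bottleneck = self.down(e1)   # downsampled deep features
        u1 = self.up(bottleneck)     # resolution restored by upsampling
        return self.dec1(torch.cat([u1, e1], dim=1))   # skip connection preserves detail

out = TinyUNet()(torch.rand(1, 3, 256, 256))
print(out.shape)   # torch.Size([1, 3, 256, 256])
```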
- conditional generative adversarial networks may be used when training the rendering networks, as shown in FIG. 13.
- the predicted value F (that is, the synthetic image F) and the real value R (that is, the real image R) of the rendering network are spliced with the input I of the rendering network (that is, the input image I) and sent to the discriminator 1301 to obtain discrimination results about the real value and the predicted value.
- the final loss function L_D of the discriminator 1301 is represented by the following formula (1-1):
- the rendering network can be regarded as a generator, and its loss function includes the generative adversarial loss L_G_GAN.
- L_G_GAN is the same quantity as L_D_fake in the discriminator, but the generator maximizes this value, with the goal of making the discriminator unable to distinguish true from false, while the discriminator minimizes it, with the goal of accurately identifying the composite image.
- L1 loss is also used in the generator, as shown in the following formula (1-2):
- L_G_L1 = L1(F, R)    (1-2)
- L_G_L1 represents the loss value corresponding to the L1 loss.
- the synthetic image and the real image are also constrained at the feature level; for example, the synthetic image and the real image are input into the VGG19 network respectively, the L1 loss is then used to calculate the differences between the feature maps output by the two in five activation layers, and the differences are linearly weighted to obtain the final loss L_G_VGG, as shown in the following formula (1-3):
- Relu_f^i and Relu_r^i represent the feature maps of the synthetic image and the real image in the i-th activation layer, respectively.
- the embodiments of the present application also introduce a similarity loss L_G_Smi between previous and subsequent frames to reduce the difference between two adjacent frames of the synthetic video relative to that of the real video. Please continue to refer to FIG. 13; the final loss function L_G of the generator is shown in the following formula (1-4):
- L_G = L_G_GAN + λ*L_G_L1 + β*L_G_VGG + γ*L_G_Smi    (1-4)
- λ, β and γ are all hyperparameters.
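- A hedged sketch of how the total generator loss of formula (1-4) could be assembled is given below. The VGG feature extractor is passed in as a user-supplied callable, the frame-similarity term is written as an L1 difference of inter-frame changes (an assumption, since the exact form of L_G_Smi is not reproduced here), and the weights are illustrative only.

```python
import torch
import torch.nn.functional as F

def generator_loss(fake, real, fake_prev, real_prev, disc_fake_logits,
                   vgg_feats, lam=100.0, beta=10.0, gamma=10.0):
    """fake/real: current composite and ground-truth frames; *_prev: previous frames;
    disc_fake_logits: discriminator output for the composite frame;
    vgg_feats: callable returning a list of feature maps from chosen layers (assumption)."""
    # L_G_GAN: the generator tries to make the discriminator call the composite image real.
    l_gan = F.binary_cross_entropy_with_logits(disc_fake_logits,
                                               torch.ones_like(disc_fake_logits))
    # L_G_L1: pixel-level L1 between the composite image and the real image.
    l_l1 = F.l1_loss(fake, real)
    # L_G_VGG: L1 between feature maps of the two images across the selected layers.
    l_vgg = sum(F.l1_loss(f, r) for f, r in zip(vgg_feats(fake), vgg_feats(real)))
    # L_G_Smi (assumed form): keep the change between consecutive composite frames
    # close to the change between the corresponding real frames.
    l_smi = F.l1_loss(fake - fake_prev, real - real_prev)
    return l_gan + lam * l_l1 + beta * l_vgg + gamma * l_smi
```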
- FIG. 14 shows a synchronized speaking video of a virtual person synthesized by a method in the related art. As shown in FIG. 14, the synthesized video frames are often not smooth enough and not realistic enough; the picture of video frame 1401 and the picture of video frame 1402 are not continuous enough.
- the embodiment of the present application overcomes the above problems by combining 2D and 3D face parameters and introducing the loss of similarity between front and rear frames.
- FIG. 15 shows video frames of the composite video generated by the embodiment of the present application, where the ten video frames are ordered from left to right and from top to bottom. It can be seen from FIG. 15 that the composite video generated by the embodiment of the present application is smoother and more realistic, with a better visual effect.
- the method in the embodiment of the present application is a text-driven method; by combining mature TTS technology, given a piece of text and any video of the target person, a video of the target person speaking can be generated.
- Typical application scenarios of the embodiments of the present application include the AI education industry that has emerged in recent years.
- the embodiments of the present application extend the input requirements into text or voice, which can further enhance the user experience.
- in the above speech-to-face parameter network, a convolutional neural network is used to predict the face parameters from the speech features extracted using DeepSpeech.
- the embodiment of the present application does not limit the model type of the deep convolutional network.
- a recurrent neural network or a generative adversarial network can also be used in place of the convolutional neural network, and the choice can be made by trading off accuracy and efficiency according to the requirements of practical applications or products.
- the two rendering networks in the image rendering model can not only use the U-Net structure, but also other encoder-decoder structures, such as hourglass networks.
- the action driving device 354 of the target object provided by the embodiments of the present application is implemented as a software module.
- the software modules of the action driving device 354 of the target object stored in the memory 350 may constitute the action driving device of the target object in the server 300, and the device includes:
- the acquisition module 3541 is configured to acquire the source voice and acquire the target video, the target video includes the target object;
- the face parameter conversion module 3542 is configured to perform facial parameter conversion processing on the voice parameters of the source voice at each moment to obtain the source parameters of the source voice at the corresponding moment;
- the parameter extraction module 3543 is configured to perform parameter extraction processing on the target video to obtain the target parameters of the target video;
- the image reconstruction module 3544 is configured to perform image reconstruction processing on the target object in the target video according to the combined parameters obtained by combining the source parameters and the target parameters, to obtain a reconstructed image;
- the generating module 3545 is configured to generate a composite video by using the reconstructed image, wherein the composite video includes the target object, and the action of the target object corresponds to the source speech.
- the obtaining module 3541 is further configured to: obtain source text, and perform text parsing processing on the source text to obtain linguistic features of the source text; perform acoustic parameter extraction processing on the linguistic features to obtain the acoustic parameters of the source text in the time domain; convert the acoustic parameters to obtain the voice waveform of the source text in the frequency domain; and take the voice corresponding to the voice waveform as the source voice.
- the source parameters include: expression parameters and mouth key point parameters;
- the face parameter conversion module 3542 is further configured to perform the following processing on the speech parameters of the source speech at any moment: perform feature extraction processing on the voice parameters to obtain the voice feature vector of the source voice; and perform convolution processing and full connection processing on the voice feature vector in turn to obtain the expression parameters and the mouth key point parameters of the source voice at that moment.
- the face parameter conversion module 3542 is further configured to: perform the convolution processing on the speech feature vector through at least two first convolution layers containing a specific convolution kernel to obtain a convolution processing vector; and perform the full connection processing on the convolution processing vector through at least two fully connected layers to obtain a fully connected processing vector; wherein the fully connected processing vector includes the vector corresponding to the expression parameters and the vector corresponding to the mouth key point parameters, and the sum of the dimensions of these two vectors is equal to the dimension of the fully connected processing vector.
- the target parameters include target mouth key point parameters and target face parameters; the parameter extraction module 3543 is further configured to: perform mouth parameter extraction processing on the target object in the target video to obtain the target mouth key point parameters; and perform facial parameter extraction processing on the target object in the target video to obtain the target face parameters.
- the image reconstruction module 3544 is further configured to: combine the source parameters and the target parameters to obtain the combined parameters; perform image reconstruction processing on the target object in the target video according to the combined parameters to obtain a mouth contour map and a face coordinate map; and use the mouth contour map and the face coordinate map as the reconstructed image.
- the source parameters include: expression parameters and mouth key point parameters;
- the target parameters include target mouth key point parameters and target face parameters;
- the target face parameters include: target pose parameters, target shape parameters and target expression parameters;
- the image reconstruction module 3544 is further configured to: replace the target expression parameters in the target facial parameters by the expression parameters, to obtain the replaced facial parameters;
- replace the target mouth key point parameters with the mouth key point parameters to obtain the replaced mouth key point parameters;
- use the replaced face parameters and the replaced mouth key point parameters as the combined parameters.
- the reconstructed image includes the replaced face parameters and the replaced mouth key point parameters; the generating module 3545 is further configured to: call the image rendering model based on the replaced face parameters at each moment, the replaced mouth key point parameters at each moment and the background image corresponding to the target video; perform, through the first rendering network in the image rendering model, mouth shape region rendering on the replaced face parameters at each moment and the replaced mouth key point parameters at each moment to obtain the mouth shape region texture image at each moment; perform, through the second rendering network in the image rendering model, splicing processing on the mouth shape region texture image at each moment and the background image to obtain the composite image at each moment; and determine, according to the composite image at each moment, the composite video of the target object and the source speech.
- the first rendering network includes at least one second convolution layer, at least one first down-sampling layer, and at least one first up-sampling layer; the generating module 3545 is further configured to: perform convolution processing and down-sampling processing on the replaced face parameters and the replaced mouth key point parameters through the second convolution layer and the first down-sampling layer to obtain the depth features of the reconstructed image; and perform up-sampling processing on the depth features of the reconstructed image through the first up-sampling layer to obtain the mouth shape region texture image.
- the second rendering network includes at least one third convolution layer, at least one second down-sampling layer, and at least one second up-sampling layer; the generating module 3545 is further configured to: perform convolution processing and down-sampling processing on the mouth shape region texture image and the background image through the third convolution layer and the second down-sampling layer to obtain the depth features of the mouth shape region texture image and the background image; and perform up-sampling processing on the depth features through the second up-sampling layer to obtain the composite image at each moment.
- the image rendering model is trained by the following steps: calling the image rendering model based on a reconstructed image sample and a target image sample; performing, through the first rendering network of the image rendering model, feature extraction processing and mouth shape region rendering on the reconstructed image sample and the target image sample to obtain a mouth shape texture image sample; performing, through the second rendering network in the image rendering model, splicing processing on the mouth shape texture image sample and the target image sample to obtain a composite image sample; calling a preset loss model based on the composite image sample to obtain a loss result; and correcting the parameters in the first rendering network and the second rendering network according to the loss result to obtain the trained image rendering model.
- the image rendering model is further trained by the following steps: acquiring a real synthetic image corresponding to the reconstructed image sample and the target image sample; splicing the synthetic image sample and the real synthetic image and then inputting them into the preset loss model; and performing, through the preset loss model, similarity loss calculation between previous and subsequent frames on the synthetic image sample and the real synthetic image to obtain the loss result.
- Embodiments of the present application provide a computer program product or computer program, where the computer program product or computer program includes computer instructions, and the computer instructions are stored in a computer-readable storage medium.
- the processor of the computer device reads the computer instruction from the computer-readable storage medium, and the processor executes the computer instruction, so that the computer device executes the foregoing method in the embodiments of the present application.
- the embodiments of the present application provide a storage medium storing executable instructions; when the executable instructions are executed by a processor, the processor is caused to execute the method provided by the embodiments of the present application, for example, the method shown in FIG. 4.
- the storage medium may be a computer-readable storage medium, for example, a Ferromagnetic Random Access Memory (FRAM), a Read Only Memory (ROM), a Programmable Read Only Memory (PROM), an Erasable Programmable Read Only Memory (EPROM), an Electrically Erasable Programmable Read Only Memory (EEPROM), a flash memory, a magnetic surface memory, an optical disc, or a Compact Disc Read-Only Memory (CD-ROM); it may also be various devices including one or any combination of the above memories.
- executable instructions may take the form of programs, software, software modules, scripts, or code, written in any form of programming language, including compiled or interpreted languages, or declarative or procedural languages, and they may be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment.
- executable instructions may, but do not necessarily, correspond to files in a file system; they may be stored as part of a file that holds other programs or data, for example, in one or more scripts in a Hyper Text Markup Language (HTML) document, in a single file dedicated to the program in question, or in multiple cooperating files (e.g., files that store one or more modules, subroutines, or code sections).
- executable instructions may be deployed to be executed on one computing device, on multiple computing devices located at one site, or on multiple computing devices distributed across multiple sites and interconnected by a communication network.
Claims (17)
- An action driving method for a target object, performed by an action driving device for the target object, the method comprising: acquiring a source voice and acquiring a target video, the target video including a target object; performing facial parameter conversion processing on voice parameters of the source voice at each moment to obtain source parameters of the source voice at the corresponding moment; performing parameter extraction processing on the target video to obtain target parameters of the target video; performing image reconstruction processing on the target object in the target video according to combined parameters obtained by combining the source parameters and the target parameters, to obtain a reconstructed image; and generating a composite video by using the reconstructed image, wherein the composite video includes the target object, and an action of the target object corresponds to the source voice.
- The method according to claim 1, wherein the acquiring a source voice comprises: acquiring source text, and performing text parsing processing on the source text to obtain linguistic features of the source text; performing acoustic parameter extraction processing on the linguistic features to obtain acoustic parameters of the source text in the time domain; performing conversion processing on the acoustic parameters to obtain a speech waveform of the source text in the frequency domain; and using the voice corresponding to the speech waveform as the source voice.
- The method according to claim 1, wherein the source parameters include expression parameters and mouth key point parameters; and the performing facial parameter conversion processing on the voice parameters of the source voice at each moment to obtain the source parameters of the source voice at the corresponding moment comprises: performing the following processing on the voice parameters of the source voice at any moment: performing feature extraction processing on the voice parameters to obtain a voice feature vector of the source voice; and performing convolution processing and full connection processing on the voice feature vector in turn to obtain the expression parameters and the mouth key point parameters of the source voice at that moment.
- The method according to claim 3, wherein the performing convolution processing and full connection processing on the voice feature vector in turn to obtain the expression parameters and the mouth key point parameters of the source voice at that moment comprises: performing the convolution processing on the voice feature vector through at least two first convolution layers containing a specific convolution kernel to obtain a convolution processing vector; and performing the full connection processing on the convolution processing vector through at least two fully connected layers to obtain a fully connected processing vector; wherein the fully connected processing vector includes a vector corresponding to the expression parameters and a vector corresponding to the mouth key point parameters, and the sum of the dimensions of the vector corresponding to the expression parameters and the vector corresponding to the mouth key point parameters is equal to the dimension of the fully connected processing vector.
- The method according to claim 1, wherein the target parameters include target mouth key point parameters and target face parameters; and the performing parameter extraction processing on the target video to obtain the target parameters of the target video comprises: performing mouth parameter extraction processing on the target object in the target video to obtain the target mouth key point parameters; and performing facial parameter extraction processing on the target object in the target video to obtain the target face parameters.
- The method according to claim 1, wherein before the performing image reconstruction processing on the target object in the target video to obtain the reconstructed image, the method further comprises: performing combination processing on the source parameters and the target parameters to obtain the combined parameters; and the performing image reconstruction processing on the target object in the target video according to the combined parameters obtained by combining the source parameters and the target parameters to obtain the reconstructed image comprises: performing image reconstruction processing on the target object in the target video according to the combined parameters to obtain a mouth contour map and a face coordinate map; and using the mouth contour map and the face coordinate map as the reconstructed image.
- The method according to claim 1, wherein the source parameters include expression parameters and mouth key point parameters; the target parameters include target mouth key point parameters and target face parameters; the target face parameters include target pose parameters, target shape parameters and target expression parameters; and the performing combination processing on the source parameters and the target parameters to obtain the combined parameters comprises: replacing the target expression parameters in the target face parameters with the expression parameters to obtain replaced face parameters; replacing the target mouth key point parameters with the mouth key point parameters to obtain replaced mouth key point parameters; and using the replaced face parameters and the replaced mouth key point parameters as the combined parameters.
- The method according to claim 7, wherein the reconstructed image includes the replaced face parameters and the replaced mouth key point parameters; and the generating a composite video by using the reconstructed image comprises: calling an image rendering model based on the replaced face parameters at each moment, the replaced mouth key point parameters at each moment and a background image corresponding to the target video; performing mouth shape region rendering on the replaced face parameters at each moment and the replaced mouth key point parameters at each moment through a first rendering network in the image rendering model to obtain a mouth shape region texture image at each moment; performing splicing processing on the mouth shape region texture image at each moment and the background image through a second rendering network in the image rendering model to obtain a composite image at each moment; and determining the composite video of the target object and the source voice according to the composite image at each moment.
- The method according to claim 8, wherein the first rendering network includes at least one second convolution layer, at least one first down-sampling layer and at least one first up-sampling layer; and the performing mouth shape region rendering on the replaced face parameters at each moment and the replaced mouth key point parameters at each moment through the first rendering network in the image rendering model to obtain the mouth shape region texture image at each moment comprises: performing convolution processing and down-sampling processing on the replaced face parameters and the replaced mouth key point parameters through the second convolution layer and the first down-sampling layer to obtain depth features of the reconstructed image; and performing up-sampling processing on the depth features of the reconstructed image through the first up-sampling layer to obtain the mouth shape region texture image.
- The method according to claim 8, wherein the second rendering network includes at least one third convolution layer, at least one second down-sampling layer and at least one second up-sampling layer; and the performing splicing processing on the mouth shape region texture image at each moment and the background image through the second rendering network in the image rendering model to obtain the composite image at each moment comprises: performing convolution processing and down-sampling processing on the mouth shape region texture image and the background image through the third convolution layer and the second down-sampling layer to obtain depth features of the mouth shape region texture image and the background image; and performing up-sampling processing on the depth features through the second up-sampling layer to obtain the composite image at each moment.
- The method according to claim 8, wherein the image rendering model is trained by the following steps: calling the image rendering model based on a reconstructed image sample and a target image sample; performing feature extraction processing and mouth shape region rendering on the reconstructed image sample and the target image sample through the first rendering network of the image rendering model to obtain a mouth shape texture image sample; performing splicing processing on the mouth shape texture image sample and the target image sample through the second rendering network in the image rendering model to obtain a composite image sample; calling a preset loss model based on the composite image sample to obtain a loss result; and correcting parameters in the first rendering network and the second rendering network according to the loss result to obtain the trained image rendering model.
- The method according to claim 11, wherein the calling a preset loss model based on the composite image sample to obtain a loss result comprises: acquiring a real composite image corresponding to the reconstructed image sample and the target image sample; and splicing the composite image sample and the real composite image and inputting them into the preset loss model, and performing similarity loss calculation between previous and subsequent frames on the composite image sample and the real composite image through the preset loss model to obtain the loss result.
- An action driving apparatus for a target object, the apparatus comprising: an acquisition module, configured to acquire a source voice and acquire a target video, the target video including a target object; a facial parameter conversion module, configured to perform facial parameter conversion processing on voice parameters of the source voice at each moment to obtain source parameters of the source voice at the corresponding moment; a parameter extraction module, configured to perform parameter extraction processing on the target video to obtain target parameters of the target video; an image reconstruction module, configured to perform image reconstruction processing on the target object in the target video according to combined parameters obtained by combining the source parameters and the target parameters, to obtain a reconstructed image; and a generation module, configured to generate a composite video by using the reconstructed image, wherein the composite video includes the target object, and an action of the target object corresponds to the source voice.
- An action driving system for a target object, comprising a terminal and a server; the terminal being configured to send an action driving request for the target object to the server, the action driving request including a source voice and a target video, the target video including the target object; and the server being configured to implement, in response to the action driving request, the action driving method for the target object according to any one of claims 1 to 12.
- An action driving device for a target object, comprising: a memory, configured to store executable instructions; and a processor, configured to implement, when executing the executable instructions stored in the memory, the action driving method for the target object according to any one of claims 1 to 12.
- A computer-readable storage medium storing executable instructions which, when executed by a processor, cause the processor to implement the action driving method for the target object according to any one of claims 1 to 12.
- A computer program product, comprising a computer program or instructions, wherein the computer program or instructions cause a computer to perform the action driving method for the target object according to any one of claims 1 to 12.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP2023518520A JP7557055B2 (ja) | 2020-12-04 | 2021-11-30 | 目標対象の動作駆動方法、装置、機器及びコンピュータプログラム |
US17/968,747 US20230042654A1 (en) | 2020-12-04 | 2022-10-18 | Action synchronization for target object |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011413461.3A CN113554737A (zh) | 2020-12-04 | 2020-12-04 | 目标对象的动作驱动方法、装置、设备及存储介质 |
CN202011413461.3 | 2020-12-04 |
Related Child Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US17/968,747 Continuation US20230042654A1 (en) | 2020-12-04 | 2022-10-18 | Action synchronization for target object |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2022116977A1 true WO2022116977A1 (zh) | 2022-06-09 |
Family
ID=78129986
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2021/134541 WO2022116977A1 (zh) | 2020-12-04 | 2021-11-30 | 目标对象的动作驱动方法、装置、设备及存储介质及计算机程序产品 |
Country Status (4)
Country | Link |
---|---|
US (1) | US20230042654A1 (zh) |
JP (1) | JP7557055B2 (zh) |
CN (1) | CN113554737A (zh) |
WO (1) | WO2022116977A1 (zh) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115330913A (zh) * | 2022-10-17 | 2022-11-11 | 广州趣丸网络科技有限公司 | 三维数字人口型生成方法、装置、电子设备及存储介质 |
CN116071811A (zh) * | 2023-04-06 | 2023-05-05 | 中国工商银行股份有限公司 | 人脸信息验证方法及装置 |
Families Citing this family (18)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113554737A (zh) * | 2020-12-04 | 2021-10-26 | 腾讯科技(深圳)有限公司 | 目标对象的动作驱动方法、装置、设备及存储介质 |
KR102251781B1 (ko) * | 2020-12-30 | 2021-05-14 | (주)라이언로켓 | 인공신경망을 이용한 입모양 합성 장치 및 방법 |
KR102540763B1 (ko) * | 2021-06-03 | 2023-06-07 | 주식회사 딥브레인에이아이 | 머신 러닝 기반의 립싱크 영상 생성을 위한 학습 방법 및 이를 수행하기 위한 립싱크 영상 생성 장치 |
US20230135244A1 (en) * | 2021-10-28 | 2023-05-04 | Lenovo (United States) Inc. | Method and system to modify speech impaired messages utilizing neural network audio filters |
US20230252714A1 (en) * | 2022-02-10 | 2023-08-10 | Disney Enterprises, Inc. | Shape and appearance reconstruction with deep geometric refinement |
CN114782596A (zh) * | 2022-02-28 | 2022-07-22 | 清华大学 | 语音驱动的人脸动画生成方法、装置、设备及存储介质 |
CN114881023B (zh) * | 2022-04-07 | 2024-08-16 | 长沙千博信息技术有限公司 | 一种文本驱动虚拟人非语言行为的系统及方法 |
CN115170703A (zh) * | 2022-06-30 | 2022-10-11 | 北京百度网讯科技有限公司 | 虚拟形象驱动方法、装置、电子设备及存储介质 |
CN115767181A (zh) * | 2022-11-17 | 2023-03-07 | 北京字跳网络技术有限公司 | 直播视频流渲染方法、装置、设备、存储介质及产品 |
CN115550744B (zh) * | 2022-11-29 | 2023-03-14 | 苏州浪潮智能科技有限公司 | 一种语音生成视频的方法和装置 |
CN116074577B (zh) * | 2022-12-23 | 2023-09-26 | 北京生数科技有限公司 | 视频处理方法、相关装置及存储介质 |
CN115914505B (zh) * | 2023-01-06 | 2023-07-14 | 粤港澳大湾区数字经济研究院(福田) | 基于语音驱动数字人模型的视频生成方法及系统 |
CN116312612B (zh) * | 2023-02-02 | 2024-04-16 | 北京甲板智慧科技有限公司 | 基于深度学习的音频处理方法和装置 |
CN116310146B (zh) * | 2023-05-16 | 2023-10-27 | 北京邃芒科技有限公司 | 人脸图像重演方法、系统、电子设备、存储介质 |
CN117523677B (zh) * | 2024-01-02 | 2024-06-11 | 武汉纺织大学 | 一种基于深度学习的课堂行为识别方法 |
CN117710543A (zh) * | 2024-02-04 | 2024-03-15 | 淘宝(中国)软件有限公司 | 基于数字人的视频生成与交互方法、设备、存储介质与程序产品 |
CN117974849B (zh) * | 2024-03-28 | 2024-06-04 | 粤港澳大湾区数字经济研究院(福田) | 音频驱动面部运动生成的方法、系统、终端及存储介质 |
CN118283219B (zh) * | 2024-06-03 | 2024-08-30 | 深圳市顺恒利科技工程有限公司 | 一种音视频会议的实现方法及系统 |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110277099A (zh) * | 2019-06-13 | 2019-09-24 | 北京百度网讯科技有限公司 | 基于语音的嘴型生成方法和装置 |
CN111028318A (zh) * | 2019-11-25 | 2020-04-17 | 天脉聚源(杭州)传媒科技有限公司 | 一种虚拟人脸合成方法、系统、装置和存储介质 |
WO2020155299A1 (zh) * | 2019-02-01 | 2020-08-06 | 网宿科技股份有限公司 | 视频帧中目标对象的拟合方法、系统及设备 |
CN111508064A (zh) * | 2020-04-14 | 2020-08-07 | 北京世纪好未来教育科技有限公司 | 基于音素驱动的表情合成方法、装置和计算机存储介质 |
CN113554737A (zh) * | 2020-12-04 | 2021-10-26 | 腾讯科技(深圳)有限公司 | 目标对象的动作驱动方法、装置、设备及存储介质 |
Family Cites Families (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2010182150A (ja) * | 2009-02-06 | 2010-08-19 | Seiko Epson Corp | 顔の特徴部位の座標位置を検出する画像処理装置 |
WO2018167522A1 (en) * | 2017-03-14 | 2018-09-20 | Google Llc | Speech synthesis unit selection |
AU2018244917B2 (en) * | 2017-03-29 | 2019-12-05 | Google Llc | End-to-end text-to-speech conversion |
US10891969B2 (en) * | 2018-10-19 | 2021-01-12 | Microsoft Technology Licensing, Llc | Transforming audio content into images |
CN109508678B (zh) * | 2018-11-16 | 2021-03-30 | 广州市百果园信息技术有限公司 | 人脸检测模型的训练方法、人脸关键点的检测方法和装置 |
KR20210009596A (ko) * | 2019-07-17 | 2021-01-27 | 엘지전자 주식회사 | 지능적 음성 인식 방법, 음성 인식 장치 및 지능형 컴퓨팅 디바이스 |
CN111370020B (zh) | 2020-02-04 | 2023-02-14 | 清华珠三角研究院 | 一种将语音转换成唇形的方法、系统、装置和存储介质 |
US11477366B2 (en) * | 2020-03-31 | 2022-10-18 | Snap Inc. | Selfie setup and stock videos creation |
US11735204B2 (en) * | 2020-08-21 | 2023-08-22 | SomniQ, Inc. | Methods and systems for computer-generated visualization of speech |
EP4205105A1 (en) * | 2020-08-28 | 2023-07-05 | Microsoft Technology Licensing, LLC | System and method for cross-speaker style transfer in text-to-speech and training data generation |
-
2020
- 2020-12-04 CN CN202011413461.3A patent/CN113554737A/zh active Pending
-
2021
- 2021-11-30 WO PCT/CN2021/134541 patent/WO2022116977A1/zh active Application Filing
- 2021-11-30 JP JP2023518520A patent/JP7557055B2/ja active Active
-
2022
- 2022-10-18 US US17/968,747 patent/US20230042654A1/en active Pending
Also Published As
Publication number | Publication date |
---|---|
JP2023545642A (ja) | 2023-10-31 |
JP7557055B2 (ja) | 2024-09-26 |
CN113554737A (zh) | 2021-10-26 |
US20230042654A1 (en) | 2023-02-09 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 21900008 Country of ref document: EP Kind code of ref document: A1 |
ENP | Entry into the national phase |
Ref document number: 2023518520 Country of ref document: JP Kind code of ref document: A |
NENP | Non-entry into the national phase |
Ref country code: DE |
32PN | Ep: public notification in the ep bulletin as address of the adressee cannot be established |
Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 27.10.2023) |
122 | Ep: pct application non-entry in european phase |
Ref document number: 21900008 Country of ref document: EP Kind code of ref document: A1 |