WO2022116977A1 - Action driving method, apparatus, device, storage medium and computer program product for a target object - Google Patents

Action driving method, apparatus, device, storage medium and computer program product for a target object - Download PDF

Info

Publication number
WO2022116977A1
WO2022116977A1 PCT/CN2021/134541 CN2021134541W WO2022116977A1 WO 2022116977 A1 WO2022116977 A1 WO 2022116977A1 CN 2021134541 W CN2021134541 W CN 2021134541W WO 2022116977 A1 WO2022116977 A1 WO 2022116977A1
Authority
WO
WIPO (PCT)
Prior art keywords
parameters
target
image
parameter
source
Prior art date
Application number
PCT/CN2021/134541
Other languages
English (en)
French (fr)
Inventor
张文杰
Original Assignee
腾讯科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 腾讯科技(深圳)有限公司
Priority to JP2023518520A (JP7557055B2)
Publication of WO2022116977A1
Priority to US17/968,747 (US20230042654A1)

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168Feature extraction; Face representation
    • G06V40/171Local features and components; Facial parts ; Occluding parts, e.g. glasses; Geometrical relationships
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T15/003D [Three Dimensional] image rendering
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/4302Content synchronisation processes, e.g. decoder synchronisation
    • H04N21/4307Synchronising the rendering of multiple content streams or additional data on devices, e.g. synchronisation of audio on a mobile phone with the video output on the TV screen
    • H04N21/43072Synchronising the rendering of multiple content streams or additional data on devices, e.g. synchronisation of audio on a mobile phone with the video output on the TV screen of multiple content streams on the same device
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/174Facial expression recognition
    • G06V40/176Dynamic expression
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/02Feature extraction for speech recognition; Selection of recognition unit
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/06Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063Training
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/08Speech classification or search
    • G10L15/16Speech classification or search using artificial neural networks
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/24Speech recognition using non-acoustical features
    • G10L15/25Speech recognition using non-acoustical features using position of the lips, movement of the lips or face analysis
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/102Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or selection affected or controlled by the adaptive coding
    • H04N19/132Sampling, masking or truncation of coding units, e.g. adaptive resampling, frame skipping, frame interpolation or high-frequency transform coefficient masking
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/21Server components or server architectures
    • H04N21/218Source of audio or video content, e.g. local disk arrays
    • H04N21/2187Live feed
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/234Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs
    • H04N21/23412Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs for generating or manipulating the scene composition of objects, e.g. MPEG-4 objects
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/234Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs
    • H04N21/2343Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs involving reformatting operations of video signals for distribution or compliance with end-user requests or end-user device requirements
    • H04N21/234336Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs involving reformatting operations of video signals for distribution or compliance with end-user requests or end-user device requirements by media transcoding, e.g. video is transformed into a slideshow of still pictures or audio is converted into text
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/236Assembling of a multiplex stream, e.g. transport stream, by combining a video stream with other content or additional data, e.g. inserting a URL [Uniform Resource Locator] into a video stream, multiplexing software data into a video stream; Remultiplexing of multiplex streams; Insertion of stuffing bits into the multiplex stream, e.g. to obtain a constant bit-rate; Assembling of a packetised elementary stream
    • H04N21/2368Multiplexing of audio and video streams
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/25Management operations performed by the server for facilitating the content distribution or administrating data related to end-users or client devices, e.g. end-user or client device authentication, learning user preferences for recommending movies
    • H04N21/251Learning process for intelligent management, e.g. learning user preferences for recommending movies
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/81Monomedia components thereof
    • H04N21/8146Monomedia components thereof involving graphical data, e.g. 3D object, 2D graphics
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N7/00Television systems
    • H04N7/14Systems for two-way working
    • H04N7/15Conference systems
    • H04N7/157Conference systems defining a virtual conference space and using avatars or agents

Definitions

  • The embodiments of the present application are based on, and claim priority to, the Chinese patent application with application number 202011413461.3 filed on December 4, 2020.
  • The entire content of the Chinese patent application is incorporated into the embodiments of the present application by reference.
  • the present application relates to the field of Internet technologies, and relates to, but is not limited to, an action driving method, apparatus, device, computer-readable storage medium, and computer program product for a target object.
  • an implementation method is to use a recurrent neural network to learn the key points of the mouth from the speech features, and then generate the mouth texture based on the information of the key points of the mouth, and finally combine it with the target video frame to obtain the lip sync speech video frame.
  • Another implementation is to first learn a common and shared "voice-expression" space based on multiple sound clips from different sources, and then determine the final lip-synched video frame based on the obtained expression parameters.
  • Embodiments of the present application provide an action driving method, apparatus, device, computer-readable storage medium, and computer program product for a target object, which can improve the smoothness and authenticity of a final synthesized video.
  • An embodiment of the present application provides an action driving method for a target object, the method comprising:
  • a source voice and a target video are acquired, where the target video includes a target object;
  • facial parameter conversion processing is performed on voice parameters of the source voice at each moment to obtain source parameters of the source voice at the corresponding moment;
  • parameter extraction processing is performed on the target video to obtain target parameters of the target video;
  • image reconstruction processing is performed on the target object in the target video according to a combination parameter obtained by combining the source parameters and the target parameters, to obtain a reconstructed image;
  • a synthetic video is generated from the reconstructed image, wherein the synthetic video includes the target object, and the action of the target object corresponds to the source speech.
  • An embodiment of the present application provides an action driving device for a target object, and the device includes:
  • an acquisition module configured to acquire a source voice and acquire a target video, where the target video includes a target object
  • a facial parameter conversion module configured to perform facial parameter conversion processing on the voice parameters of the source voice at each moment, to obtain the source parameters of the source voice at the corresponding moment;
  • a parameter extraction module configured to perform parameter extraction processing on the target video to obtain target parameters of the target video
  • an image reconstruction module configured to perform image reconstruction processing on the target object in the target video according to a combination parameter obtained by combining the source parameter and the target parameter to obtain a reconstructed image
  • the generating module is configured to generate a synthetic video from the reconstructed image, wherein the synthetic video includes the target object, and the action of the target object corresponds to the source voice.
  • the embodiment of the present application provides an action driving system for a target object, which at least includes: a terminal and a server;
  • the terminal configured to send an action-driven request of the target object to the server, where the action-driven request includes a source voice and a target video, and the target video includes a target object;
  • the server is configured to implement the above-mentioned action-driven method of the target object in response to the action-driven request.
  • Embodiments of the present application provide a computer program product or computer program, where the computer program product or computer program includes computer instructions, and the computer instructions are stored in a computer-readable storage medium; a processor of a computer device reads the computer instructions from the computer-readable storage medium, and the processor is configured to execute the computer instructions to implement the above-mentioned action driving method for the target object.
  • An embodiment of the present application provides an action driving device for a target object, including: a memory for storing executable instructions; and a processor for implementing the above-mentioned action driving method for the target object when executing the executable instructions stored in the memory.
  • An embodiment of the present application provides a computer-readable storage medium storing executable instructions for causing a processor to execute the executable instructions to implement the above-mentioned action driving method for a target object.
  • The embodiments of the present application have the following beneficial effects: through the combination parameters of the source parameters and the target parameters, a synthetic video in which the voice finally drives the action of the target object is obtained, which improves the smoothness and authenticity of the finally obtained synthetic video as well as the visual effect of the video synthesis.
  • FIG. 1 is a system frame diagram of an action-driven method for a target object in the related art
  • FIG. 2 is a schematic structural diagram of an action-driven system for a target object provided by an embodiment of the present application
  • FIG. 3 is a schematic structural diagram of an action driving device for a target object provided by an embodiment of the present application.
  • FIG. 4 is a schematic flowchart of an action driving method for a target object provided by an embodiment of the present application
  • FIG. 5 is a schematic flowchart of an action driving method for a target object provided by an embodiment of the present application.
  • FIGS. 6A-6B are schematic flowcharts of an action driving method for a target object provided by an embodiment of the present application.
  • FIG. 7 is a schematic diagram of an implementation flow of a training method for an image rendering model provided by an embodiment of the present application.
  • FIG. 8 is a system framework diagram of a method for driving an action of a target object provided by an embodiment of the present application.
  • FIG. 9 is a frame diagram of a text-to-speech module provided by an embodiment of the present application.
  • FIG. 10 is a frame diagram of a speech-to-face parameter network provided by an embodiment of the present application.
  • FIG. 11 is an effect diagram of the Dlib algorithm provided by an embodiment of the present application.
  • FIG. 12 is a frame diagram of an image rendering model provided by an embodiment of the present application.
  • FIG. 13 is a framework diagram of a condition-based generative adversarial network provided by an embodiment of the present application.
  • Figure 14 is a virtual human synchronous speech video synthesized by a method in the related art
  • FIG. 15 is a composite video generated by the action driving method of the target object according to the embodiment of the present application.
  • lip sync speech video generation schemes are mainly divided into two categories: text-driven and speech-driven.
  • Text-driven generation takes a piece of text and a video of the target person as input, converts the text into speech through text-to-speech (TTS, Text To Speech) technology, then learns facial features from the speech features, and finally outputs a video of the target person reading the text.
  • the text-driven method is an extension of the voice-driven method.
  • the lip sync speech video generation scheme is mainly based on deep learning.
  • A recurrent neural network is used to learn 20 key points of the mouth from the speech features, then the mouth texture is generated based on the key point information, and finally the texture is combined with the target video frame to obtain the lip sync speech video frame.
  • A text-driven ObamaNet method mainly includes three modules, namely a "text-to-speech" module, a "speech-to-keypoint" module and a "keypoint-to-video frame" module. The "text-to-speech" module adopts the Char2Wav TTS algorithm, the "speech-to-keypoint" module also uses a recurrent neural network to learn keypoint information from speech features, and the "keypoint-to-video frame" module uses a U-Net with skip connections to achieve information transfer; this is the first deep learning-based text-driven model for lip sync speech video generation.
  • A UV map is a mapping of 3D face coordinates onto a two-dimensional plane.
  • This method also uses a U-Net network to render video frames.
  • Another method proposes a voice identity information removal network to convert the voice features of different speakers into a global domain, then uses a recurrent neural network to learn expression parameters from the voice features, and combines the obtained expression parameters with the 3D face parameters of the target person for reconstruction to obtain a 3D mesh; the 3D mesh is input to a U-Net network to obtain the final video frame.
  • The rendering module is mainly improved, and a memory-enhanced generative adversarial network (GAN, Generative Adversarial Networks) is proposed to save the identity features and spatial feature pairs of different speakers, so as to realize video synthesis for different speakers.
  • an action-driven method for target objects based on a speech-driven model is also proposed.
  • The method first learns a general and shared "voice-expression" space based on multiple sound clips from different sources; the space is composed of blend shapes, and the expression parameters of different people can be expressed as linear combinations of different blend shapes in this space. 3D face reconstruction is then performed according to the obtained expression parameters to obtain the corresponding UV map, and a U-Net based on dilated convolution is used to render the video frame.
  • FIG. 1 is a system framework diagram of an action driving method for a target object in the related art.
  • The system framework of this action driving method consists of a generalized network (Generalized Network) 11 and a specialized network (Specialized Network) 12. The processing flow of the framework is as follows: first, sound clips 111 from different sources are input into the speech recognition system (DeepSpeech RNN) 112 for speech feature extraction; the obtained speech features then pass through a convolutional neural network (CNN, Convolutional Neural Networks) 113, which maps the speech features of different people to a common and shared latent audio expression space (Latent Audio Expression Space) 114, in which the speech features of different people can be formed by linear combinations of different blend shapes.
  • The output of the generalized network 11 then enters the content-aware filter (Content-Aware Filtering) 121 of the specialized network 12 to obtain smooth audio-expression parameters (Smooth Audio-Expressions) 122, from which a reconstructed 3D face model (3D Model) 123 and a UV map 124 are obtained. Finally, the UV map 124 and the background image 125 are input to the neural rendering network (Neural Rendering Network) 126 to obtain the final output image 127.
  • The related art is a voice-driven method, which cannot take a given text as input and output the corresponding lip sync speech video; moreover, the face parameters used in the related art are only the UV map determined by the 3D face model, and the UV map can only provide the network with prior data about the mouth shape, leaving the network without any auxiliary information for details such as the teeth; in addition, when training the rendering network, only the corresponding frames of the predicted value and the real value are penalized, and the preceding and following input frames are not considered, so the difference between adjacent frames is not optimized, resulting in jitter in the final video.
  • Therefore, there is still the problem that the generated video corresponding to the final lip-synced speech frames is not smooth and not realistic.
  • the current main challenges in the field of 3D virtual lip sync speech video generation include two points: face reconstruction and video frame rendering.
  • The embodiment of the present application proposes a speech-to-face parameter network, which can simultaneously learn 2D mouth key points and 3D facial expression parameters from speech features, so as to obtain the precise positions provided by the 2D key points while retaining the depth information provided by the 3D face parameters; reconstructing the face by combining the 2D and 3D parameters ensures its accuracy. After the reconstructed face is obtained, it is further fused with the background.
  • The embodiment of the present application proposes a two-stage rendering network: the first rendering network renders the mouth texture region from the reconstructed face, and the second rendering network combines the mouth texture region with the background to render the final video frame.
  • The advantages of using a two-stage rendering network are: 1) training the two rendering networks separately reduces the training difficulty while ensuring the accuracy of the mouth texture generated by the first rendering network; 2) when training the second rendering network, the mouth shape region is penalized to correct the mouth shape and optimize details such as teeth and wrinkles.
  • In addition, a video frame similarity loss is used to ensure that there is little difference between adjacent output frames, avoiding video jitter and preventing the resulting video from appearing unsmooth and unrealistic.
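  • As a rough, hedged illustration of such a video frame similarity loss (the norm, weighting, and tensor layout below are assumptions rather than the patent's exact formulation), a PyTorch sketch could look like this:

```python
import torch

def frame_similarity_loss(pred_frames: torch.Tensor) -> torch.Tensor:
    """Penalize large differences between consecutive predicted frames.

    pred_frames: (T, C, H, W) rendered frames of one clip in temporal order.
    An L1 penalty on adjacent-frame differences is assumed for illustration;
    the patent does not fix the exact norm or weighting here.
    """
    diff = pred_frames[1:] - pred_frames[:-1]      # differences between adjacent frames
    return diff.abs().mean()

# Hypothetical usage next to a per-frame reconstruction loss:
# total = (pred_frames - real_frames).abs().mean() + 0.1 * frame_similarity_loss(pred_frames)
```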
  • a source voice and a target video are obtained, and the target video includes the target object;
  • a synthetic video is generated from the reconstructed image; the synthetic video has the target object, and the action of the target object corresponds to the source speech.
  • the face in the embodiment of the present application is not limited to a human face, and may also be the face of an animal or the face of a virtual object.
  • The action driving device for the target object provided in the embodiment of the present application can be implemented as a notebook computer, a tablet computer, a desktop computer, a mobile device (e.g., a mobile phone, a portable music player, a personal digital assistant, a dedicated messaging device, or a portable game device), an intelligent robot, or another terminal with a video playback function.
  • The action driving device for the target object provided by the embodiments of the present application can also be implemented as a server. Next, an exemplary application in which the action driving device of the target object is implemented as a server will be described.
  • FIG. 2 is a schematic structural diagram of an action driving system 20 for a target object provided by an embodiment of the present application.
  • the action driving system 20 for the target object provided in the embodiment of the present application includes a terminal 100, a network 200, and a server 300.
  • the terminal 100 obtains the target video and the source voice, generates an action-driven request of the target object according to the target video and the source voice, and sends the action-driven request to the server 300 through the network 200.
  • After receiving the action-driven request, the server 300 performs facial parameter conversion processing on the voice parameters of the source voice at each moment to obtain the source parameters of the source voice at the corresponding moment; performs parameter extraction on the target video to obtain the target parameters; then, according to the combination parameters obtained by combining the source parameters and the target parameters, performs image reconstruction on the target object in the target video to obtain a reconstructed image; and generates a synthetic video from the reconstructed image, wherein the synthetic video has the target object, and the action of the target object corresponds to the source voice.
  • the composite video is sent to the terminal 100 through the network 200.
  • the terminal 100 plays the composite video on the current interface 100-1 of the terminal 100.
  • In some embodiments, the terminal 100 obtains the target video and the source voice, where the target video and the source voice may be locally stored video and voice, or video and voice recorded in real time.
  • the terminal performs facial parameter conversion processing on the voice parameters of the source voice at each moment to obtain the source parameters of the source voice at the corresponding moment; and extracts the parameters of the target video to obtain the target parameters.
  • Then, according to the combination parameters obtained by combining the source parameters and the target parameters, image reconstruction is performed on the target object in the target video to obtain a reconstructed image; a composite video is generated from the reconstructed image, wherein the composite video has the target object, and the action of the target object corresponds to the source speech.
  • the composite video is played on the current interface 100-1 of the terminal 100.
  • the action driving method of the target object provided by the embodiment of the present application also relates to the field of artificial intelligence technology, and the synthetic video is synthesized through the artificial intelligence technology.
  • it can be implemented at least through computer vision technology, speech technology and natural language processing technology in artificial intelligence technology.
  • Computer vision technology (CV, Computer Vision) is a science that studies how to make machines "see": it uses cameras and computers instead of human eyes to identify, track and measure targets, and further performs graphics processing so that the processed image is more suitable for human observation or for transmission to instruments for detection.
  • computer vision studies related theories and technologies trying to build artificial intelligence systems that can obtain information from images or multidimensional data.
  • Computer vision technology usually includes image processing, image recognition, image semantic understanding, image retrieval, optical character recognition (OCR, Optical Character Recognition), video processing, video semantic understanding, video content/behavior recognition, 3D object reconstruction, 3D technology, virtual reality, augmented reality, simultaneous localization and mapping, as well as common biometric identification technologies such as face recognition and fingerprint recognition.
  • the key technologies of speech technology are automatic speech recognition technology (ASR, Automatic Speech Recognition), speech synthesis technology (TTS, Text To Speech) and voiceprint recognition technology. Making computers able to hear, see, speak, and feel is the development direction of human-computer interaction in the future, and voice will become one of the most promising human-computer interaction methods in the future.
  • Natural Language Processing is an important direction in the field of computer science and artificial intelligence. It studies various theories and methods that can realize effective communication between humans and computers using natural language. Natural language processing is a science that integrates linguistics, computer science, and mathematics. Therefore, research in this field will involve natural language, the language that people use on a daily basis, so it is closely related to the study of linguistics. Natural language processing technology usually includes text processing, semantic understanding, machine translation, robot question answering, knowledge graph and other technologies.
  • the action driving method of the target object provided by the embodiments of the present application may also be implemented based on a cloud platform and through cloud technology.
  • the above server 300 may be a cloud server.
  • FIG. 3 is a schematic structural diagram of an action driving device of a target object provided by an embodiment of the present application.
  • the server 300 shown in FIG. 3 includes: at least one processor 310 and a memory 350 , at least one network interface 320 and user interface 330 .
  • the various components in server 300 are coupled together by bus system 340 .
  • bus system 340 is used to implement the connection communication between these components.
  • the bus system 340 also includes a power bus, a control bus and a status signal bus.
  • the various buses are labeled as bus system 340 in FIG. 3 .
  • the processor 310 may be an integrated circuit chip with signal processing capabilities, such as a general-purpose processor, a digital signal processor (DSP, Digital Signal Processor), or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc., where a general-purpose processor may be a microprocessor or any conventional processor or the like.
  • User interface 330 includes one or more output devices 331 that enable presentation of media content, including one or more speakers and/or one or more visual display screens.
  • User interface 330 also includes one or more input devices 332, including user interface components that facilitate user input, such as a keyboard, mouse, microphone, touch screen display, camera, and other input buttons and controls.
  • Memory 350 may be removable, non-removable, or a combination thereof. Exemplary hardware devices include solid state memory, hard drives, optical drives, and the like. Memory 350 optionally includes one or more storage devices that are physically remote from processor 310 . Memory 350 includes volatile memory or non-volatile memory, and may also include both volatile and non-volatile memory. The non-volatile memory may be a read-only memory (ROM, Read Only Memory), and the volatile memory may be a random access memory (RAM, Random Access Memory). The memory 350 described in the embodiments of the present application is intended to include any suitable type of memory. In some embodiments, memory 350 is capable of storing data to support various operations, examples of which include programs, modules, and data structures, or subsets or supersets thereof, as exemplified below.
  • the operating system 351 includes system programs for processing various basic system services and performing hardware-related tasks, such as a framework layer, a core library layer, a driver layer, etc., for implementing various basic services and processing hardware-based tasks;
  • An input processing module 353 for detecting one or more user inputs or interactions from one of the one or more input devices 332 and translating the detected inputs or interactions.
  • FIG. 3 shows an action driving device 354 of a target object stored in the memory 350. The action driving device 354 of the target object in the server 300 may be software in the form of programs and plug-ins, and includes the following software modules: an acquisition module 3541, a face parameter conversion module 3542, a parameter extraction module 3543, an image reconstruction module 3544, and a generation module 3545. These modules are logical, so they can be arbitrarily combined or further split according to the functions implemented. The function of each module will be explained below.
  • the apparatus provided by the embodiments of the present application may be implemented in hardware.
  • As an example, the apparatus provided by the embodiments of the present application may be a processor in the form of a hardware decoding processor, which is programmed to execute the action driving method of the target object provided by the embodiments of the present application; for example, the processor in the form of a hardware decoding processor may adopt one or more application-specific integrated circuits (ASIC, Application Specific Integrated Circuit), DSPs, programmable logic devices (PLD, Programmable Logic Device), complex programmable logic devices (CPLD, Complex Programmable Logic Device), field-programmable gate arrays (FPGA, Field-Programmable Gate Array) or other electronic components.
  • FIG. 4 is a schematic flowchart of a method for driving an action of a target object provided by an embodiment of the present application. The steps shown in FIG. 4 will be described below.
  • step S401 a source voice and a target video are acquired, and the target video includes a target object.
  • The server may receive an action-driven request for the target object sent by the user through the terminal. The action-driven request is used to request that the source voice and the target video be synthesized to generate a synthetic video that contains both the target object and the source voice, with the source voice driving the action of the target object; that is, the synthetic video requested to be generated contains the target object in the target video, and the voice corresponding to the target object is the source voice.
  • the source voice may be the voice pre-recorded by the user, the voice downloaded from the network, or the voice obtained by converting a specific text.
  • the sound feature of the source speech may be the sound feature of a specific object, and may also be the sound feature of the target object in the target video.
  • Step S402 performing facial parameter conversion processing on the speech parameters of the source speech at each moment to obtain the source parameters of the source speech at the corresponding moment.
  • the source parameters at each moment include but are not limited to expression parameters and mouth key point parameters, wherein the expression parameters are expression parameters corresponding to the speech parameters at the moment, for example, when the speech parameters correspond to cheerful speech,
  • the expression parameter may be a smiling expression parameter, and when the speech parameter corresponds to a sad speech, the expression parameter may be a frowning expression parameter.
  • the mouth key point parameter is the mouth shape parameter when expressing the speech parameter at the moment.
  • the expression parameters are 3D expression parameters
  • the mouth key point parameters are 2D key point parameters
  • Step S403 performing parameter extraction processing on the target video to obtain target parameters.
  • a preset algorithm can be used to extract parameters from the target video, that is, to extract parameters from the target object in the target video.
  • the target parameters include but are not limited to the target mouth key point parameters and the target face parameters.
  • The target parameters can also include the pose parameters, position parameters, shape parameters and action parameters of the target object.
  • Step S404 Perform image reconstruction processing on the target object in the target video according to the combination parameter obtained by combining the source parameter and the target parameter to obtain a reconstructed image.
  • the source parameters and the target parameters are first combined to obtain the combined parameters, and the combined parameters are parameters used to characterize the posture, position, shape, action and mouth shape of the target object in the final composite video.
  • image reconstruction is performed on the target object according to the combination parameters to obtain a reconstructed image, and the reconstructed image is an image used to generate a final composite video.
  • Step S405 generating a composite video from the reconstructed image.
  • the action of the target object corresponds to the source speech.
  • In some embodiments, for the voice parameters at each moment, a corresponding reconstructed image is generated, and each reconstructed image is rendered to generate a composite image.
  • There may be at least one reconstructed image, and the duration of the synthesized video is equal to the duration of the source voice, or the duration of the synthesized video is greater than the duration of the source voice.
  • When there is one reconstructed image, the finally generated composite video is a single composite image; when there are multiple reconstructed images, the finally generated composite video has the same duration as the source voice, and the composite video is a video formed by concatenating multiple composite images in chronological order.
  • the target video may have at least one video frame, and the target video has the target object.
  • When the target video includes one video frame, the video frame has the target object, the video composition request is used to request the generation of a composite video of the target object, and the composite video is a dynamic video obtained based on that single video frame; when the target video includes multiple video frames, at least one video frame has the target object, the video synthesis request is used to request the generation of a composite video of the target object, and the composite video is a dynamic video obtained based on the multiple video frames.
  • the duration of the target video may be the same as or different from the duration of the source voice.
  • When the duration of the target video is the same as the duration of the source voice, a composite image can be formed for each video frame according to the voice parameters of the source voice at the corresponding moment, and finally a composite video with the same duration as the target video can be formed.
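  • As a rough illustration of how the composite images and the source voice might be assembled into a synchronized video (the patent does not prescribe a tool; the frame rate, file naming pattern, and the use of the ffmpeg command line are assumptions):

```python
import subprocess

def mux_frames_with_audio(frame_pattern: str, audio_path: str,
                          out_path: str, fps: int = 25) -> None:
    """Concatenate composite images in chronological order and attach the
    source voice, so that the video and the voice stay in sync."""
    subprocess.run([
        "ffmpeg", "-y",
        "-framerate", str(fps),
        "-i", frame_pattern,        # e.g. "composite_%05d.png"
        "-i", audio_path,           # the source voice, e.g. "source.wav"
        "-c:v", "libx264", "-pix_fmt", "yuv420p",
        "-c:a", "aac",
        "-shortest",                # stop at the shorter of video/audio
        out_path,
    ], check=True)

# mux_frames_with_audio("composite_%05d.png", "source.wav", "synthetic.mp4")
```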
  • The embodiments of the present application can be applied to the following scenario: in the education industry, to generate a teaching video about a certain knowledge point, the source voice corresponding to the knowledge point (that is, the voice of the lecturing teacher) and a target video containing the teacher can be input to the server, and the server can directly generate and output a teaching video (i.e., a synthetic video) in which the teacher explains the knowledge point by using the method of the embodiment of the present application.
  • The action driving method for a target object provided by the embodiment of the present application performs facial parameter conversion processing on the voice parameters of the source voice at each moment to obtain the source parameters of the source voice at the corresponding moment, extracts parameters from the target video to obtain the target parameters, reconstructs the image of the target object according to the combination parameters of the source parameters and the target parameters to obtain a reconstructed image, and finally generates a synthetic video from the reconstructed image, wherein the synthetic video has the target object, and the action of the target object corresponds to the source voice.
  • In this way, since the synthetic video in which the voice finally drives the action of the target object is obtained based on the combined parameters of the source parameters and the target parameters, the finally obtained synthetic video is smoother and more realistic, and the visual effect of the video synthesis is improved.
  • the action driving system of the target object includes at least a terminal and a server, and through the interaction between the terminal and the server, a response to an action driving request of the terminal is implemented, and a composite video desired by the user is generated.
  • the action-driven request includes source voice and target video, and the action-driven request may also include source text, and the source voice can be obtained through the source text.
  • FIG. 5 is a schematic flowchart of a method for driving an action of a target object provided by an embodiment of the present application. As shown in FIG. 5 , the method includes the following steps:
  • Step S501 the terminal acquires the source voice and the target video.
  • the source voice may be the voice collected by the user through a voice collection device on the terminal, or may be the voice downloaded by the user through the terminal.
  • the target video can be a video of any duration, and the target video has a target object.
  • Step S502 the terminal obtains the source text and the target video.
  • the source text is the text used to generate the source voice.
  • In this way, not only can the input source voice be processed to generate a synthetic video with the source voice, but the input source text can also be parsed and converted to generate the source speech, which in turn is used to form a synthesized video with the source speech.
  • Step S503 the terminal performs text analysis on the source text to obtain linguistic features of the source text.
  • The linguistic features include, but are not limited to, pinyin, pauses, punctuation, and tones.
  • text parsing of the source text may also be performed based on artificial intelligence technology to obtain linguistic features of the source text.
  • Step S504 the terminal extracts the acoustic parameters of the linguistic feature to obtain the acoustic parameters of the source text in the time domain.
  • the acoustic parameters are the parameter representation of the source text in the time domain, and the acoustic parameters of the source text in the time domain are obtained by extracting the acoustic parameters of the linguistic features.
  • Step S505 the terminal performs conversion processing on the acoustic parameters to obtain the speech waveform of the source text in the frequency domain.
  • the speech waveform is the acoustic representation corresponding to the acoustic parameters
  • the speech waveform is the parametric representation of the source text in the frequency domain.
  • Step S506 the terminal determines the voice corresponding to the voice waveform as the source voice.
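  • A schematic outline of this text-to-speech flow (steps S503 to S506) is sketched below; the data class, function names, and the trivial placeholder bodies are illustrative assumptions standing in for a real linguistic front-end, acoustic model, and vocoder, not the patent's specified components:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class LinguisticFeatures:
    phonemes: List[str]   # e.g. a pinyin sequence
    pauses: List[int]     # positions of pauses
    tones: List[int]      # tone marks

def text_analysis(source_text: str) -> LinguisticFeatures:
    """Step S503: parse the source text into linguistic features (placeholder logic)."""
    tokens = source_text.split()
    return LinguisticFeatures(
        phonemes=tokens,
        pauses=[i for i, t in enumerate(tokens) if t.endswith((",", "."))],
        tones=[0] * len(tokens),
    )

def acoustic_model(feats: LinguisticFeatures) -> List[List[float]]:
    """Step S504: map linguistic features to acoustic parameters (dummy frames here)."""
    return [[0.0] * 80 for _ in feats.phonemes]

def vocoder(acoustic_params: List[List[float]]) -> List[float]:
    """Step S505: convert acoustic parameters into a speech waveform (silence here)."""
    return [0.0] * (len(acoustic_params) * 256)

def text_to_source_voice(source_text: str) -> List[float]:
    """Steps S503-S506 chained: the resulting waveform is used as the source voice."""
    return vocoder(acoustic_model(text_analysis(source_text)))
```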
  • Step S507 the terminal encapsulates the source voice and the target video to form an action-driven request.
  • the terminal may also encapsulate the source text in the action-driven request, and send the action-driven request to the server, and the server implements the steps of converting the source text into the source speech in steps S503 to S506.
  • Step S508 the terminal sends the action driving request to the server.
  • Step S509 the server parses the action-driven request to obtain the source voice and the target video.
  • step S510 the server performs facial parameter conversion processing on the speech parameters of the source speech at each moment to obtain the source parameters of the source speech at the corresponding moment.
  • Step S511 the server extracts parameters from the target video to obtain target parameters.
  • Step S512 the server performs image reconstruction on the target object in the target video according to the combination parameter obtained by combining the source parameter and the target parameter to obtain a reconstructed image.
  • Step S513 the server generates a composite video from the reconstructed image, wherein the composite video has the target object, and the action of the target object corresponds to the source voice.
  • steps S510 to S513 are the same as the above-mentioned steps S402 to S405, and are not repeated in this embodiment of the present application.
  • Step S514 the server sends the composite video to the terminal.
  • Step S515 the terminal plays the composite video on the current interface.
  • FIG. 6A is a schematic flowchart of a method for driving an action of a target object provided by an embodiment of the present application. As shown in FIG. 6A , step S402 can be implemented by the following steps:
  • Step S601 perform feature extraction on the source speech to obtain a speech feature vector of the source speech.
  • step S602 convolution processing and full connection processing are sequentially performed on the speech feature vector to obtain expression parameters and mouth key point parameters of the source speech at the corresponding moment.
  • step S602 may be implemented in the following manner: convolution processing is sequentially performed on the speech feature vector through at least two first convolution layers with specific convolution kernels to obtain a convolution-processed vector; full connection processing is then performed on the convolution-processed vector through the fully connected layers in turn to obtain a fully connected processing vector.
  • the full connection processing vector includes the vector corresponding to the expression parameter and the vector corresponding to the mouth key point parameter, wherein the sum of the dimensions of the vector corresponding to the expression parameter and the vector corresponding to the mouth key point parameter is equal to the full connection processing vector. dimension.
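  • A PyTorch sketch of this speech-to-face parameter conversion is given below; the feature dimension, channel widths, the number of expression parameters, and the choice of 20 mouth key points are illustrative assumptions (only the related-art description above mentions 20 key points), not sizes mandated by the patent:

```python
import torch
import torch.nn as nn

class SpeechToFaceParams(nn.Module):
    """Stacked 1D convolutions followed by fully connected layers; the output
    vector is split into a 3D expression-parameter part and a 2D mouth
    key point part, whose dimensions sum to the fully connected output size."""

    def __init__(self, feat_dim: int = 29, n_expr: int = 64, n_mouth_kp: int = 20):
        super().__init__()
        self.n_expr, self.n_mouth_kp = n_expr, n_mouth_kp
        self.conv = nn.Sequential(                     # "at least two first convolution layers"
            nn.Conv1d(feat_dim, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        self.fc = nn.Sequential(                       # fully connected layers
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, n_expr + 2 * n_mouth_kp),
        )

    def forward(self, speech_feats: torch.Tensor):
        # speech_feats: (batch, feat_dim, time) speech feature vectors for one moment
        h = self.conv(speech_feats).squeeze(-1)        # (batch, 128) convolution-processed vector
        out = self.fc(h)                               # fully connected processing vector
        expr = out[:, :self.n_expr]                    # 3D expression parameters
        mouth_kp = out[:, self.n_expr:].view(-1, self.n_mouth_kp, 2)  # 2D mouth key points
        return expr, mouth_kp
```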
  • step S403 can be implemented by the following steps:
  • Step S603 perform mouth parameter extraction and face parameter extraction sequentially on the target object in the current video frame of the target video, and correspondingly obtain target mouth key point parameters and target face parameters.
  • the target mouth key point parameters and the target face parameters are the parameters of the target object.
  • the target mouth key point parameters and target face of the target object in each video frame can be extracted. part parameters.
  • Step S604 determining the target mouth key point parameter and the target face parameter as target parameters.
  • step S404 can be implemented by the following steps:
  • Step S605 combine the source parameter and the target parameter to obtain the combined parameter.
  • the parameters used to generate the final composite image can be extracted, and the parameters not used to generate the final composite image are deleted to obtain the combined parameters.
  • Step S606 Perform image reconstruction on the target object in the target video according to the combination parameters to obtain a mouth contour map and a UV map.
  • The reconstructed image includes a mouth contour map and a UV map, where the mouth contour map is used to reflect the mouth contour of the target object in the finally generated composite image, and the UV map is combined with the mouth contour map to generate the texture of the mouth region of the target object in the composite image.
  • Step S607 using the mouth contour map and the UV map as the reconstructed image.
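  • For context, a common way to turn such combined parameters into a reconstructed face is a 3D morphable model, where the mesh is a mean shape plus linear shape and expression bases and the mouth contour comes from projecting the relevant vertices; the sketch below assumes this generic formulation and a weak-perspective pose encoding, which the patent does not spell out:

```python
import numpy as np

def reconstruct_face(mean_shape: np.ndarray,    # (3N,) mean face vertices
                     shape_basis: np.ndarray,   # (3N, n_shape)
                     expr_basis: np.ndarray,    # (3N, n_expr)
                     shape_params: np.ndarray,  # (n_shape,) from the target parameters
                     expr_params: np.ndarray    # (n_expr,) from the source parameters
                     ) -> np.ndarray:
    """3D morphable model style reconstruction: vertices = mean + S @ alpha + E @ beta.
    A generic formulation assumed for illustration."""
    vertices = mean_shape + shape_basis @ shape_params + expr_basis @ expr_params
    return vertices.reshape(-1, 3)              # (N, 3)

def project_to_image(vertices: np.ndarray, pose: np.ndarray) -> np.ndarray:
    """Weak-perspective projection of the reconstructed vertices using the target
    pose parameters; the packing (scale s, rotation R, translation t) is assumed."""
    s, R, t = pose[0], pose[1:10].reshape(3, 3), pose[10:12]
    return s * (vertices @ R.T)[:, :2] + t      # (N, 2) image-plane coordinates
```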
  • the source parameters include: expression parameters and mouth key point parameters; the target parameters include target mouth key point parameters and target face parameters; the target face parameters include at least: target pose parameters, target shape parameters and target parameters Expression parameters.
  • step S605 can be implemented in the following way: replacing the target expression parameters in the target face parameters with the expression parameters to obtain the replaced face parameters; replacing the target mouth key point parameters with the mouth key point parameters to obtain the replaced mouth key point parameters; and using the replaced face parameters and the replaced mouth key point parameters as the combined parameters.
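  • A minimal sketch of this replacement-based combination (the dictionary keys and the parameter grouping are assumptions introduced for illustration):

```python
from typing import Dict
import numpy as np

def combine_parameters(source: Dict[str, np.ndarray],
                       target: Dict[str, np.ndarray]) -> Dict[str, np.ndarray]:
    """Step S605: keep the target's pose and shape, but replace its expression
    parameters and mouth key points with those derived from the source voice."""
    combined = dict(target)                                   # start from the target face parameters
    combined["expression"] = source["expression"]             # replaced expression parameters
    combined["mouth_keypoints"] = source["mouth_keypoints"]   # replaced mouth key points
    return combined

# Hypothetical usage:
# combined = combine_parameters(
#     source={"expression": expr, "mouth_keypoints": mouth_kp},
#     target={"pose": pose, "shape": shape, "expression": tgt_expr,
#             "mouth_keypoints": tgt_mouth_kp},
# )
```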
  • In some embodiments, the process of generating a composite video from the reconstructed image in step S405 can be realized by the following steps:
  • Step S6054 the image rendering model is invoked based on the replaced face parameters, the replaced mouth key point parameters and the background image corresponding to the target video at each moment.
  • the replaced face parameters, the replaced mouth key point parameters and the background image corresponding to the target video at each moment are input into the image rendering model.
  • the reconstructed image includes the replaced face parameters and the replaced mouth key point parameters.
  • Step S6055 through the first rendering network in the image rendering model, mouth shape region rendering is performed on the replaced face parameters at each moment and the replaced mouth key point parameters at each moment, to obtain the mouth shape region texture image at each moment.
  • In some embodiments, the first rendering network includes at least one second convolution layer, at least one first down-sampling layer and at least one first up-sampling layer; the mouth shape region rendering process can be implemented by the following steps: convolution and down-sampling are performed on the replaced face parameters and the replaced mouth key point parameters through the second convolution layer and the first down-sampling layer in turn to obtain the depth features of the reconstructed image; then, through the first up-sampling layer, up-sampling processing is performed on the depth features of the reconstructed image to restore the resolution of the reconstructed image and obtain the mouth shape region texture image.
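  • A simplified PyTorch encoder-decoder is sketched below as a stand-in for such a rendering network; channel widths, layer counts, and the packing of the replaced parameters into image-like tensors are assumptions, not the patent's exact architecture:

```python
import torch
import torch.nn as nn

class RenderNet(nn.Module):
    """Encoder-decoder stand-in for a rendering network: convolution plus
    down-sampling to extract depth features, then up-sampling (each followed
    by a convolution) to restore the resolution of the output image."""

    def __init__(self, in_ch: int, out_ch: int = 3, width: int = 64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(in_ch, width, 3, padding=1), nn.ReLU(),
            nn.Conv2d(width, width, 3, stride=2, padding=1), nn.ReLU(),      # down-sample
            nn.Conv2d(width, width * 2, 3, stride=2, padding=1), nn.ReLU(),  # down-sample
        )
        self.decoder = nn.Sequential(
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            nn.Conv2d(width * 2, width, 3, padding=1), nn.ReLU(),
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            nn.Conv2d(width, out_ch, 3, padding=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: reconstructed-image inputs (e.g. rasterized face parameters and
        # mouth key point maps stacked along the channel dimension)
        return self.decoder(self.encoder(x))
```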
  • Step S6056 Perform splicing processing on the texture image of the mouth shape region and the background image through the second rendering network in the image rendering model to obtain a composite image at each moment.
  • the second rendering network includes at least one third convolutional layer, at least one second downsampling layer and at least one second upsampling layer; wherein, the splicing process may be implemented by the following steps: sequentially Through the third convolution layer and the second downsampling layer, convolution and downsampling are performed on the texture image of the mouth shape region and the background image to obtain the depth features of the texture image and the background image of the mouth shape region; through the second upsampling layer , perform up-sampling processing on the depth features of the texture image of the mouth shape region and the background image to restore the resolution of the texture image of the mouth shape region and the background image, and obtain the composite image at the current moment.
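  • Reusing the RenderNet sketch above, the two rendering stages of steps S6055 and S6056 could be wired together roughly as follows (using two separately trained instances and this particular channel bookkeeping is an assumption for illustration):

```python
import torch

# First stage: from the reconstructed-face inputs to a mouth shape region texture image.
mouth_renderer = RenderNet(in_ch=3, out_ch=3)
# Second stage: splice the mouth texture with the background frame.
frame_renderer = RenderNet(in_ch=6, out_ch=3)   # mouth texture (3) + background (3) channels

def render_frame(reconstructed: torch.Tensor, background: torch.Tensor) -> torch.Tensor:
    """reconstructed, background: (batch, 3, H, W) tensors for one moment."""
    mouth_texture = mouth_renderer(reconstructed)                               # step S6055
    composite = frame_renderer(torch.cat([mouth_texture, background], dim=1))   # step S6056
    return composite                                                            # composite image at this moment
```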
  • Step S6057 Determine a synthesized video including the target object and the source voice according to the synthesized image at each moment.
  • the above image rendering model is used to render the reconstructed image at each moment to generate a composite image at the corresponding moment, and the composite image includes not only the target object but also the voice of the source voice at the corresponding moment.
  • the image rendering model includes at least a first rendering network and a second rendering network.
  • The first rendering network is used to perform feature extraction and mouth shape region rendering on the reconstructed image and the target image, and the second rendering network is used to stitch the mouth shape region texture image and the target image together.
  • FIG. 7 is a schematic diagram of an implementation flow of a training method for an image rendering model provided by an embodiment of the present application. As shown in FIG. 7 , the method includes the following steps:
  • Step S701 calling an image rendering model based on the reconstructed image sample and the target image sample.
  • the reconstructed image sample may be obtained through the following steps: performing facial parameter conversion processing on the voice parameter of the voice sample at the current moment to obtain a voice parameter sample; performing parameter extraction on the target image sample to obtain the target parameter sample; The speech parameter sample and the target parameter sample are combined to obtain a combined parameter sample, and an image of the target object in the target image sample is reconstructed according to the combined parameter sample to obtain the reconstructed image sample.
  • the reconstructed image sample can also be obtained by the following steps: performing text analysis on a text sample to obtain linguistic features of the text sample, and performing acoustic parameter extraction on the linguistic features of the text sample to obtain the acoustic parameters of the text sample in the time domain; converting the acoustic parameters to obtain the speech waveform of the text sample in the frequency domain, and determining the speech corresponding to the speech waveform as the voice sample.
  • then, the voice parameters of the voice sample at the current moment are subjected to face parameter conversion processing to obtain a voice parameter sample; parameter extraction is performed on the target image sample to obtain a target parameter sample; the voice parameter sample and the target parameter sample are combined to obtain a combined parameter sample, and image reconstruction is performed on the target object in the target image sample according to the combined parameter sample to obtain the reconstructed image sample.
  • the target image sample includes the target object sample, and the final generated composite image sample also includes the target object sample.
  • Step S702 Perform feature extraction processing and mouth shape region rendering on the reconstructed image sample and the target image sample through the first rendering network of the image rendering model to obtain a mouth shape texture image sample.
  • the first rendering network includes at least one second convolution layer, at least one first downsampling layer, and at least one first upsampling layer.
  • when performing feature extraction, the parameters corresponding to the input reconstructed image sample and the target image sample can be convolved through the second convolution layer, and the parameters after convolution processing can be down-sampled through the first down-sampling layer, so as to extract the depth features of the reconstructed image sample and the target image sample, that is, to obtain a first image feature sample.
  • when performing mouth shape region rendering, the first up-sampling layer can perform up-sampling processing on the extracted first image feature sample to restore the resolution of the reconstructed image sample and the target image sample, and obtain the mouth shape texture image sample.
  • a second convolution layer is connected before each first down-sampling layer, and a second convolution layer is also connected after each first up-sampling layer; that is, a convolution operation is performed before each down-sampling operation and after each up-sampling operation.
  • a skip connection is introduced between the first down-sampling layer and the first up-sampling layer, and feature information of different resolutions is preserved through the skip connection.
  • Step S703 performing splicing processing on the mouth shape texture image sample and the target image sample through the second rendering network in the image rendering model to obtain a composite image sample.
  • the second rendering network includes at least one third convolutional layer, at least one second downsampling layer and at least one second upsampling layer.
  • when performing the splicing processing, the parameters corresponding to the input mouth shape texture image sample and the target image sample can first be convolved through the third convolution layer, and the parameters after convolution processing can be down-sampled through the second down-sampling layer, so as to extract the depth features of the mouth shape texture image sample and the target image sample, that is, to obtain a second image feature sample.
  • an upsampling process is performed on the extracted second image feature samples through the second upsampling layer, so as to restore the resolutions of the mouth shape texture image samples and the target image samples, and obtain a composite image sample.
  • a third convolution layer is connected before each second down-sampling layer, and a third convolution layer is also connected after each second up-sampling layer; that is, a convolution operation is performed before each down-sampling operation and after each up-sampling operation.
  • skip connections may be introduced between the second down-sampling layer and the second up-sampling layer, and feature information of different resolutions can be preserved through skip connections.
  • Step S704 calling a preset loss model based on the synthetic image sample to obtain a loss result.
  • step S704 can be implemented in the following way: obtaining a real synthetic image corresponding to the reconstructed image sample and the target image sample; splicing the composite image sample and the real synthetic image and inputting them into a preset loss model, where the preset loss model calculates the inter-frame similarity loss between the composite image sample and the real synthetic image to obtain the loss result.
  • when calculating the inter-frame similarity loss, the following loss functions can be calculated: the loss combining the two discriminator losses of the image rendering model on the real synthetic image and the composite image sample, the generative adversarial loss, the L1 loss, the loss obtained by using the L1 loss to compute the differences between the feature maps output by the real synthetic image and the composite image sample at N activation layers and linearly weighting these differences, and the inter-frame similarity loss. The loss result is calculated according to at least one of the above loss functions; that is, the loss result can be obtained by a weighted sum of these losses. A compact sketch of the inter-frame similarity term is given below.
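  • The following is a minimal sketch of that term, assuming consecutive synthetic and real frames are available as tensors; the L1 distance between frame-to-frame differences follows the description of L_G_Smi given later in this document, and the function name is illustrative.

```python
import torch.nn.functional as F

def frame_similarity_loss(fake_t, fake_prev, real_t, real_prev):
    # d_fake: change between two consecutive synthetic frames;
    # d_real: change between the corresponding real frames.
    d_fake = fake_t - fake_prev
    d_real = real_t - real_prev
    # Penalize the gap between the two changes so that the synthetic
    # video varies between frames as smoothly as the real video.
    return F.l1_loss(d_fake, d_real)
```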
  • Step S705 correcting the parameters in the first rendering network and the second rendering network according to the loss result, to obtain a trained image rendering model.
  • when training the image rendering model, a generative adversarial strategy can be adopted, and the similarity between preceding and following frames is taken into account during training, so that the loss result of the image rendering model in each prediction is calculated accordingly.
  • in this way, the image rendering model can be accurately trained, and the trained image rendering model takes into account the continuous change between preceding and following frames, so that the change between two consecutive video frames in the generated composite video is smoother; as a result, the composite video is smoother and more realistic, which improves the visual effect of the composite video generated by the image rendering model.
  • the embodiments of the present application can be applied to video generation scenarios of lip-synched speech, such as smart speaker screens, smart TVs, artificial intelligence (AI) education, virtual anchors, and live broadcasts.
  • the action driving method for the target object provided by the embodiments of the present application can synthesize a synchronized speaking video of a specific target person according to the input text or speech, which significantly improves the human-computer interaction effect and user experience of smart products.
  • as an example, for an AI education application, the target object is a virtual teacher;
  • the action driving method for the target object provided in the embodiment of the present application automatically generates a teaching video of a personalized 3D virtual teacher that speaks synchronously according to the text or voice input on the teacher side;
  • teaching on the student side in this way simulates the function of a real teacher teaching online, which not only improves the user experience on the student side, but also reduces the workload on the teacher side.
  • as an example, for a live-streaming application, the target object is a virtual anchor; the action driving method for the target object provided by the embodiment of the present application automatically generates a live video of a virtual anchor who speaks synchronously according to the text or voice input by the anchor;
  • the virtual anchor can broadcast game sessions live to attract attention, enhance interaction through chat programs, and obtain high click-through via cover dances and the like, thereby improving the efficiency of live broadcasting.
  • the action driving method of the target object provided by the embodiment of the present application is a text-driven or voice-driven 3D virtual mouth shape synchronous speech video generation technology.
  • the mouth shape is predicted by combining 2D and 3D face parameters, and then a rendering network trained with a video-frame difference loss synthesizes the final output picture; the embodiments of the present application solve the problems that a speech-driven model is limited to speech input and that the synthesized video is unrealistic and jittery.
  • a piece of text or speech can be used to learn 2D/3D face parameters, and a realistic speech video of a specific target person synchronizing with the mouth shape can be synthesized accordingly.
  • in the implementation process, the input text is first converted into the corresponding speech by TTS technology, a convolutional neural network then learns 2D/3D face parameters from the speech features, and 2D/3D face parameters are also extracted from a video of the target person; a new 2D/3D face model is reconstructed by replacing the target person's parameters with the learned parameters, and the reconstructed face model (i.e., the reconstructed image) is input into the rendering network to generate video frames, so as to realize the video generation of the target person speaking with a synchronized mouth shape.
  • FIG. 8 is a system frame diagram of an action-driven method for a target object provided by an embodiment of the present application.
  • the input of the system may be a piece of source text 801 or a source voice 802; if the input is the source text 801, the corresponding source voice will be generated through the text-to-speech module 803, and then the source voice will obtain the corresponding face parameters through the voice-to-face parameter network 804.
  • the face parameters here include 2D mouth key points and 3D expression parameters.
  • the obtained face parameters are combined with the target parameters obtained by the face parameter extraction module 805 to reconstruct a new face model 806, from which the UV Map 8061 and the reconstructed mouth key points 8062 can be obtained; the UV Map 8061 and the reconstructed mouth key points 8062 are then input into the two-stage image rendering model 807 trained with the inter-frame similarity loss to generate the final output picture 808 (i.e., the composite image).
  • Text-to-speech module 803: this module converts a given piece of input source text into the corresponding source speech, which is used as the input of the speech-to-face parameter network.
  • FIG. 9 is a frame diagram of a text-to-speech module provided by an embodiment of the present application.
  • the text-to-speech module mainly includes three sub-modules: a text analysis module 901 , an acoustic model module 902 and a vocoder module 903 .
  • the text analysis module 901 is used to parse the input text (that is, the source text), determine the pronunciation, tone, intonation, etc. of each character, and map the text to linguistic features.
  • the linguistic features here include but are not limited to pinyin, pauses, punctuation and tones; the acoustic model module 902 is used to map the linguistic features to acoustic parameters, where the acoustic parameters are the parameter representation of the source text in the time domain; the vocoder module 903 is used to convert the acoustic parameters into a speech waveform, where the speech waveform is the parameter representation of the source text in the frequency domain. A schematic composition of these three sub-modules is sketched below.
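  • The sketch below treats the three sub-modules as placeholder callables for the text analysis module 901, the acoustic model module 902 and the vocoder module 903; it is not a reference to any specific TTS library.

```python
def text_to_speech(source_text, text_analysis, acoustic_model, vocoder):
    """Composes the three sub-modules of the text-to-speech module 803."""
    # Text analysis: pronunciation, pauses, tones, punctuation, etc.
    linguistic_features = text_analysis(source_text)
    # Acoustic model: linguistic features -> acoustic parameters (time domain).
    acoustic_params = acoustic_model(linguistic_features)
    # Vocoder: acoustic parameters -> speech waveform used as the source voice.
    return vocoder(acoustic_params)
```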
  • Voice-to-face parameter network 804: FIG. 10 is a framework diagram of the voice-to-face parameter network provided by the embodiment of the present application. As shown in FIG. 10, A_I represents the input audio segment (Input Audio), i.e., the source voice, obtained from the user speaking or from the above text-to-speech module; F_A represents the audio features; c_1–c_4 represent four convolution layers; f_1–f_3 represent three fully connected layers; T_S represents the source 3D facial expression parameters; and K_S represents the source 2D mouth key points.
  • the purpose of the speech-to-face parameter network is to predict the corresponding source 3D facial expression parameters and 2D mouth key points from the input speech segment, where the 3D facial expression parameters have 10-dimensional coefficients, and the 2D mouth key points are based on the 20 key points used in the Dlib algorithm; since each 2D key point consists of two coordinates (x, y), the 20 key points correspond to a 40-dimensional vector.
  • for the input source voice A_I, the speech feature F_A is first extracted by the recurrent neural network (RNN) proposed in the DeepSpeech method, and then enters a convolutional neural network (CNN) consisting of four convolution layers c_1–c_4 and three fully connected layers f_1–f_3; the CNN finally outputs two sets of face parameters, namely the 3D facial expression parameters T_S and the 2D mouth key points K_S.
  • the extracted speech feature F_A can be a 16×29 tensor, and the convolution layers c_1–c_4 all use 3×1 convolution kernels, reducing the dimensions of F_A to 8×32, 4×32, 2×64 and 1×64 in turn; the feature map output by the convolution layer c_4 then passes through the three fully connected layers f_1–f_3 to obtain 128-, 64- and 50-dimensional vectors, respectively.
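  • The following PyTorch sketch shows one layer configuration that reproduces the tensor shapes quoted above (a 16×29 input reduced to 8×32, 4×32, 2×64 and 1×64, then 128-, 64- and 50-dimensional fully connected outputs); the 3×1 kernels follow the text, while the stride and padding values, the activation functions and the split of the 50-dimensional output into 10 expression coefficients plus 40 key-point coordinates are assumptions made to match the stated dimensions.

```python
import torch
import torch.nn as nn

class SpeechToFaceParams(nn.Module):
    """Maps a DeepSpeech feature tensor (16 x 29) to 10 3D expression
    coefficients and 20 2D mouth key points (40 values)."""
    def __init__(self):
        super().__init__()
        # 3x1 kernels as stated above; stride/padding are assumed so that
        # the time axis shrinks 16 -> 8 -> 4 -> 2 -> 1 while the channel
        # axis grows 29 -> 32 -> 32 -> 64 -> 64.
        self.convs = nn.Sequential(
            nn.Conv1d(29, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv1d(32, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv1d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv1d(64, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
        )
        self.fcs = nn.Sequential(
            nn.Linear(64, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, 50),
        )

    def forward(self, audio_features):        # (B, 29, 16): channels = 29, time = 16
        x = self.convs(audio_features)         # (B, 64, 1)
        out = self.fcs(x.flatten(1))           # (B, 50)
        expression, mouth_keypoints = out[:, :10], out[:, 10:]
        return expression, mouth_keypoints     # T_S and K_S
```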
  • Face parameter extraction module 805: this module is designed to extract the target person's 2D mouth key point positions and 3D face parameters from the target person's video frames. The 2D mouth key points are obtained by the Dlib algorithm: given a picture, the algorithm predicts 68 key points on the face, as shown in FIG. 11, which is an effect diagram of the Dlib algorithm provided by the embodiment of the present application, where the left picture 1101 is the original picture, and the points on the face in the right picture 1102 are the key points predicted by the Dlib algorithm. In this embodiment of the present application, only the 20 predicted mouth key points may be used as the 2D face parameters.
  • for each face picture, 62-dimensional 3D face parameters are predicted, including 12-dimensional pose parameters, 40-dimensional shape parameters and 10-dimensional expression parameters.
  • the 2D mouth key points and 3D facial expression parameters obtained by the face parameter extraction module are replaced by the results obtained from the speech-to-face parameter network, while the pose parameters and shape parameters of the target person are retained, so that recombined 3D face parameters are obtained. The recombined 3D face parameters are then used to reconstruct the face of the target person and obtain the corresponding UV Map, and the new 2D mouth key point information is directly used as one of the inputs for subsequent rendering. A minimal sketch of this recombination step is given below.
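  • The following NumPy sketch assumes the 62-dimensional 3D face parameter vector is laid out as 12 pose + 40 shape + 10 expression values in that order; the ordering itself is an assumption, only the dimensions are stated above.

```python
import numpy as np

def recombine_parameters(target_face_params, source_expression, source_mouth_keypoints):
    """Keep the target person's pose and shape parameters, and replace the
    expression and 2D mouth key points with those predicted from the speech."""
    assert target_face_params.shape == (62,)
    assert source_expression.shape == (10,)
    recombined = target_face_params.copy()
    recombined[52:] = source_expression        # assumed layout: last 10 dims = expression
    # The 20 new mouth key points (40 values) are used directly as one of
    # the inputs of the subsequent rendering.
    new_keypoints = np.asarray(source_mouth_keypoints).reshape(20, 2)
    return recombined, new_keypoints
```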
  • FIG. 12 is a frame diagram of the image rendering model provided by the embodiment of the present application.
  • given the 2D mouth key points, the UV Map and the background image, the purpose of the image rendering model is to synthesize the final lip-synchronized speech video frames.
  • in the implementation process, the 20 reconstructed mouth key points can first be connected to obtain a polygon as the mouth contour, denoted K_R (reconstructed mouth keypoints), and the UV Map, denoted U_R, is mapped from the 3D face parameters based on a specific algorithm; the resolutions of K_R and U_R are both 256×256, and they are stitched together as the input of the image rendering model, as sketched below.
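  • The sketch below shows one way the rendering input could be prepared: the 20 reconstructed mouth key points are connected into a closed polygon on a 256×256 canvas and stitched channel-wise with the UV map; drawing the contour with OpenCV and concatenating along the channel axis are illustrative choices, not steps prescribed by the embodiment.

```python
import numpy as np
import cv2

def build_render_input(mouth_keypoints, uv_map):
    """mouth_keypoints: (20, 2) pixel coordinates; uv_map: (256, 256, 3)."""
    contour = np.zeros((256, 256, 3), dtype=np.uint8)
    pts = np.round(mouth_keypoints).astype(np.int32).reshape(-1, 1, 2)
    # Connect the 20 reconstructed mouth key points into a closed polygon (K_R).
    cv2.polylines(contour, [pts], True, (255, 255, 255), 2)
    # K_R and U_R are both 256x256 and are stitched together as the model input.
    return np.concatenate([contour, uv_map], axis=-1)
```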
  • the image rendering model is divided into two stages.
  • the first stage (i.e., the first rendering network) synthesizes the mouth region texture r_1; r_1 and the target video background frame b_g (i.e., the background image) are spliced together as the input of the second rendering network; the second stage (i.e., the second rendering network) combines the background image to synthesize the final output r_2.
  • the structures used by the two rendering networks are U-Net networks.
  • the U-Net network continuously applies down-sampling and convolution operations to the input to extract deep features, and then restores the resolution through step-by-step up-sampling layers, while skip connections are introduced between the down-sampling and up-sampling paths to preserve feature information at different resolutions; a compact sketch of such a structure is given below.
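  • The sketch assumes three resolution levels with illustrative channel widths and a bilinear up-sampling mode; it is not the exact configuration of the embodiment, only an example of a U-Net-style encoder–decoder with skip connections.

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    """Three-level U-Net: conv + downsample on the way in, upsample + conv
    on the way out, with skip connections at each resolution."""
    def __init__(self, in_ch=6, out_ch=3, base=32):   # in_ch=6: e.g. K_R + U_R channels (assumed)
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv2d(in_ch, base, 3, padding=1), nn.ReLU())
        self.enc2 = nn.Sequential(nn.Conv2d(base, base * 2, 3, padding=1), nn.ReLU())
        self.enc3 = nn.Sequential(nn.Conv2d(base * 2, base * 4, 3, padding=1), nn.ReLU())
        self.down = nn.MaxPool2d(2)
        self.up = nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False)
        self.dec2 = nn.Sequential(nn.Conv2d(base * 4 + base * 2, base * 2, 3, padding=1), nn.ReLU())
        self.dec1 = nn.Sequential(nn.Conv2d(base * 2 + base, base, 3, padding=1), nn.ReLU())
        self.out = nn.Conv2d(base, out_ch, 1)

    def forward(self, x):
        e1 = self.enc1(x)                 # full resolution
        e2 = self.enc2(self.down(e1))     # 1/2 resolution
        e3 = self.enc3(self.down(e2))     # 1/4 resolution
        d2 = self.dec2(torch.cat([self.up(e3), e2], dim=1))   # skip from e2
        d1 = self.dec1(torch.cat([self.up(d2), e1], dim=1))   # skip from e1
        return self.out(d1)
```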
  • a conditional generative adversarial network (GAN) may be used when training the rendering network, as shown in FIG. 13, which is a framework diagram of the conditional GAN provided by the embodiment of the present application.
  • the predicted value F (i.e., the synthetic image F) and the real value R (i.e., the real image R) of the rendering network are each spliced with the input I of the rendering network (i.e., the input image I) and sent to the discriminator 1301, obtaining two losses L_D_fake and L_D_real for the predicted value and the real value, respectively.
  • the final loss function L_D of the discriminator 1301 is represented by the following formula (1-1):
  • L_D = (L_D_fake + L_D_real) * 0.5      (1-1)
  • the rendering network can be regarded as a generator, and its loss function includes the generative adversarial loss L_G_GAN.
  • L_G_GAN is the same quantity as L_D_fake in the discriminator, but the generator maximizes this value, its goal being to make the discriminator unable to distinguish real from fake, while the discriminator minimizes this value, its goal being to accurately identify the composite image.
  • in addition, to bring the synthetic image F closer to the real image R, an L1 loss is also used in the generator, as shown in the following formula (1-2):
  • L_G_L1 = L_1(F, R)      (1-2)
  • where L_G_L1 represents the loss value corresponding to the L1 loss.
  • the synthetic image and the real image are also constrained at the feature level; for example, the synthetic image and the real image are respectively input into the VGG19 network, the L1 loss is then used to calculate the differences between the feature maps output by the two at five activation layers, and these differences are linearly weighted to obtain the final loss L_G_VGG, as shown in the following formula (1-3):
  • Relu_f^i and Relu_r^i represent the feature maps of the synthetic image and the real image at the i-th activation layer, respectively.
  • the above losses are all computed independently for each frame, without any constraint between frames, which can cause the final composite video to be unsmooth or jittery; therefore, the embodiments of the present application also introduce a similarity loss L_G_Smi between the preceding and following frames to reduce the difference between the synthetic video and the real video in terms of frame-to-frame change. Referring again to FIG. 8, for the synthesized t-th frame, the difference between the synthesized t-th frame and the (t-1)-th frame is first computed and denoted d_fake; similarly, the difference between the t-th frame and the (t-1)-th frame of the real video is computed and denoted d_real; the purpose of L_G_Smi is to reduce the gap between d_fake and d_real, i.e., min[L_1(d_fake, d_real)].
  • then, the final loss function L_G of the generator (i.e., the image rendering model) is given by the following formula (1-4):
  • L_G = L_G_GAN + α*L_G_L1 + β*L_G_VGG + γ*L_G_Smi      (1-4)
  • ⁇ , ⁇ and ⁇ are all hyperparameters.
  • FIG. 14 shows a synchronized speaking video of a virtual person synthesized by a method in the related art; as shown in FIG. 14, the synthesized video frames are often not smooth enough and not realistic enough, and the picture of video frame 1401 and the picture of video frame 1402 are not sufficiently continuous.
  • the embodiments of the present application overcome the above problems by combining 2D and 3D face parameters and introducing the inter-frame similarity loss; the effect of the generated final composite video is shown in FIG. 15, which shows ten consecutive video frames.
  • the order of the ten video frames is from left to right and from top to bottom; it can be seen from FIG. 15 that the composite video generated by the embodiment of the present application is smoother and more realistic, and the visual effect is better.
  • the method in the embodiments of the present application is a text-driven method; by combining mature TTS technology, given a piece of text and any video of the target person, a video of the target person speaking can be generated.
  • Typical application scenarios of the embodiments of the present application include the AI education industry that has emerged in recent years.
  • unlike current speech-driven virtual teacher solutions, the embodiments of the present application extend the input requirement to text or voice, which can further enhance the user experience.
  • in the above speech-to-face parameter network, a convolutional neural network is used to predict the face parameters from the speech features extracted by DeepSpeech.
  • the embodiment of the present application does not limit the model type of the deep convolutional network.
  • for example, a recurrent neural network or a generative adversarial network can also be used instead of the convolutional neural network, and the choice can be made according to the accuracy and efficiency requirements of the actual application or product.
  • the two rendering networks in the image rendering model can not only use the U-Net structure, but also other encoder-decoder structures, such as hourglass networks.
  • the following continues to describe an exemplary structure in which the action driving device 354 of the target object provided by the embodiments of the present application is implemented as software modules.
  • in some embodiments, the software modules of the action driving device 354 of the target object stored in the memory 350 can form an action driving device of the target object in the server 300, and the device includes:
  • the acquisition module 3541 is configured to acquire the source voice and acquire the target video, the target video includes the target object;
  • the face parameter conversion module 3542 is configured to perform facial parameter conversion processing on the voice parameters of the source voice at each moment to obtain the source parameters of the source voice at the corresponding moment;
  • the parameter extraction module 3543 is configured to perform parameter extraction processing on the target video to obtain the target parameters of the target video;
  • the image reconstruction module 3544 is configured to perform image reconstruction processing on the target object in the target video according to the combined parameters obtained by combining the source parameters and the target parameters, to obtain a reconstructed image;
  • the generating module 3545 is configured to generate a composite video by using the reconstructed image, wherein the composite video includes the target object, and the action of the target object corresponds to the source speech.
  • the obtaining module 3541 is further configured to: obtain source text, and perform text parsing processing on the source text to obtain linguistic features of the source text; perform acoustic parameter extraction processing on the linguistic features to obtain acoustic parameters of the source text in the time domain; convert the acoustic parameters to obtain a speech waveform of the source text in the frequency domain; and take the speech corresponding to the speech waveform as the source voice.
  • the source parameters include: expression parameters and mouth key point parameters;
  • the face parameter conversion module 3542 is further configured to perform the following processing on the voice parameters of the source voice at any moment: perform feature extraction processing on the voice parameters to obtain a voice feature vector of the source voice; and perform convolution processing and full connection processing on the voice feature vector in turn to obtain the expression parameters and the mouth key point parameters of the source voice at that moment.
  • the face parameter conversion module 3542 is further configured to: perform the convolution processing on the voice feature vector through at least two first convolution layers containing a specific convolution kernel to obtain a convolution processing vector; and perform the full connection processing on the convolution processing vector through at least two fully connected layers to obtain a fully connected processing vector; wherein the fully connected processing vector includes the vector corresponding to the expression parameters and the vector corresponding to the mouth key point parameters, and the sum of the dimensions of the vector corresponding to the expression parameters and the vector corresponding to the mouth key point parameters is equal to the dimension of the fully connected processing vector.
  • the target parameters include: target mouth key point parameters and target face parameters; the parameter extraction module 3543 is further configured to: perform mouth parameter extraction processing on the target object in the target video to obtain the target mouth key point parameters; and perform facial parameter extraction processing on the target object in the target video to obtain the target face parameters.
  • the image reconstruction module 3544 is further configured to: combine the source parameters and the target parameters to obtain the combined parameters;
  • perform image reconstruction processing on the target object in the target video according to the combined parameters to obtain a mouth contour map and a face coordinate map; and use the mouth contour map and the face coordinate map as the reconstructed image.
  • the source parameters include: expression parameters and mouth key point parameters;
  • the target parameters include target mouth key point parameters and target face parameters;
  • the target face parameters include: target pose parameters, target shape parameters and target expression parameters;
  • the image reconstruction module 3544 is further configured to: replace the target expression parameters in the target facial parameters by the expression parameters, to obtain the replaced facial parameters;
  • replace the target mouth key point parameters with the mouth key point parameters to obtain replaced mouth key point parameters;
  • and use the replaced face parameters and the replaced mouth key point parameters as the combined parameters.
  • the reconstructed image includes the replaced face parameters and the replaced mouth key point parameters; the generating module 3545 is further configured to: call the image rendering model based on the replaced face parameters at each moment, the replaced mouth key point parameters and the background image corresponding to the target video; perform mouth shape region rendering on the replaced face parameters and the replaced mouth key point parameters at each moment through the first rendering network in the image rendering model to obtain a mouth shape region texture image at each moment; perform splicing processing on the mouth shape region texture image and the background image at each moment through the second rendering network in the image rendering model to obtain a composite image at each moment; and determine, according to the composite image at each moment, the composite video including the target object and the source speech.
  • the first rendering network includes at least one second convolution layer, at least one first down-sampling layer and at least one first up-sampling layer; the generating module 3545 is further configured to: perform convolution processing and down-sampling processing on the replaced face parameters and the replaced mouth key point parameters through the second convolution layer and the first down-sampling layer to obtain the depth features of the reconstructed image; and perform up-sampling processing on the depth features of the reconstructed image through the first up-sampling layer to obtain the mouth shape region texture image.
  • the second rendering network includes at least one third convolution layer, at least one second down-sampling layer and at least one second up-sampling layer; the generating module 3545 is further configured to: perform convolution processing and down-sampling processing on the mouth shape region texture image and the background image through the third convolution layer and the second down-sampling layer to obtain the depth features of the mouth shape region texture image and the background image; and perform up-sampling processing on the depth features through the second up-sampling layer to obtain the composite image at each moment.
  • the image rendering model is trained by the following steps: calling the image rendering model based on a reconstructed image sample and a target image sample; performing feature extraction processing and mouth shape region rendering on the reconstructed image sample and the target image sample through the first rendering network of the image rendering model to obtain a mouth shape texture image sample; performing splicing processing on the mouth shape texture image sample and the target image sample through the second rendering network in the image rendering model to obtain a composite image sample; calling a preset loss model based on the composite image sample to obtain a loss result; and correcting the parameters in the first rendering network and the second rendering network according to the loss result to obtain the trained image rendering model.
  • the image rendering model is further trained by the following steps: obtaining a real synthetic image corresponding to the reconstructed image sample and the target image sample; splicing the composite image sample and the real synthetic image and inputting them into the preset loss model; and performing inter-frame similarity loss calculation on the composite image sample and the real synthetic image through the preset loss model to obtain the loss result.
  • Embodiments of the present application provide a computer program product or computer program, where the computer program product or computer program includes computer instructions, and the computer instructions are stored in a computer-readable storage medium.
  • the processor of the computer device reads the computer instruction from the computer-readable storage medium, and the processor executes the computer instruction, so that the computer device executes the foregoing method in the embodiments of the present application.
  • the embodiments of the present application provide a storage medium storing executable instructions; when the executable instructions are executed by a processor, the processor is caused to execute the method provided by the embodiments of the present application, for example, the method shown in FIG. 4.
  • the storage medium may be a computer-readable storage medium, for example, a Ferromagnetic Random Access Memory (FRAM), a Read Only Memory (ROM), a Programmable Read Only Memory (PROM), an Erasable Programmable Read Only Memory (EPROM), an Electrically Erasable Programmable Read Only Memory (EEPROM), a flash memory, a magnetic surface memory, an optical disc, or a Compact Disk-Read Only Memory (CD-ROM); it may also be various devices including one or any combination of the above memories.
  • executable instructions may take the form of programs, software, software modules, scripts, or code, written in any form of programming language, including compiled or interpreted languages, or declarative or procedural languages, and which Deployment may be in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment.
  • executable instructions may, but need not, correspond to files in a file system, and may be stored as part of a file that holds other programs or data, for example, in one or more scripts in a Hyper Text Markup Language (HTML) document, in a single file dedicated to the program in question, or in multiple cooperating files (e.g., files that store one or more modules, subroutines, or code sections).
  • executable instructions may be deployed to be executed on one computing device, on multiple computing devices located at one site, or on multiple computing devices distributed across multiple sites and interconnected by a communication network.

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Signal Processing (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Acoustics & Sound (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • Computer Graphics (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Medical Informatics (AREA)
  • Processing Or Creating Images (AREA)
  • Image Analysis (AREA)

Abstract

一种目标对象的动作驱动方法、装置、设备、存储介质及计算机程序产品,涉及人工智能技术领域。方法包括:获取源语音,并获取目标视频,目标视频中包括目标对象;对源语音在每一时刻的语音参数进行脸部参数转换处理,得到源语音在对应时刻的源参数;对目标视频进行参数提取处理,得到目标视频的目标参数;根据源参数和目标参数结合得到的结合参数,对目标视频中的目标对象进行图像重构处理,得到重构图像;通过重构图像生成合成视频,其中,合成视频中包括目标对象、且目标对象的动作与源语音对应。

Description

目标对象的动作驱动方法、装置、设备及存储介质及计算机程序产品
相关申请的交叉引用
本申请实施例基于申请号为202011413461.3、申请日为2020年12月04日的中国专利申请提出,并要求该中国专利申请的优先权,该中国专利申请的全部内容在此引入本申请实施例作为参考。
技术领域
本申请涉及互联网技术领域,涉及但不限于一种目标对象的动作驱动方法、装置、设备及计算机可读存储介质及计算机程序产品。
背景技术
近年来,在嘴型同步说话视频生成领域中,主要是基于深度学习来实现同步过程。相关技术中,一种实现方式是利用循环神经网络从语音特征中学习到嘴部关键点,然后基于嘴部关键点信息生成嘴部纹理,最后和目标视频帧结合,得到嘴型同步说话视频帧;另一种实现方式是首先根据多个不同来源的声音片段学习一个通用、共享的“语音-表情”空间,然后根据所得的表情参数,确定最终的嘴型同步说话视频帧。
但是,相关技术中的方法所生成的最终的嘴型同步说话视频帧均存在视频不平滑且不真实的问题。
发明内容
本申请实施例提供一种目标对象的动作驱动方法、装置、设备、计算机可读存储介质及计算机程序产品,能够提高最终所得到的合成视频的平滑度和真实性。
本申请实施例的技术方案是这样实现的:
本申请实施例提供一种目标对象的动作驱动方法,所述方法包括:
获取源语音,并获取目标视频,所述目标视频中包括目标对象;
对所述源语音在每一时刻的语音参数进行脸部参数转换处理,得到所述源语音在对应时刻的源参数;
对所述目标视频进行参数提取处理,得到所述目标视频的目标参数;
根据所述源参数和所述目标参数结合得到的结合参数,对所述目标视频中的目标对象进行图像重构处理,得到重构图像;
通过所述重构图像生成合成视频,其中,所述合成视频中包括所述目标对象、且所述目标对象的动作与所述源语音对应。
本申请实施例提供一种目标对象的动作驱动装置,所述装置包括:
获取模块,配置为获取源语音,并获取目标视频,所述目标视频中包括目标对象;
脸部参数转换模块,配置为对所述源语音在每一时刻的语音参数进行脸部参数转换处理,得到所述源语音在对应时刻的源参数;
参数提取模块,配置为对所述目标视频进行参数提取处理,得到所述目标视频的目标参数;
图像重构模块,配置为根据所述源参数和所述目标参数结合得到的结合参数,对所述目标视频中的目标对象进行图像重构处理,得到重构图像;
生成模块,配置为通过所述重构图像生成合成视频,其中,所述合成视频中包括所述目标对象、且所述目标对象的动作与所述源语音对应。
本申请实施例提供一种目标对象的动作驱动系统,至少包括:终端和服务器;
所述终端,用于向所述服务器发送所述目标对象的动作驱动请求,所述动作驱动请求中包括源语音和目标视频,所述目标视频中包括目标对象;
所述服务器,用于响应于所述动作驱动请求,实现上述的目标对象的动作驱动方法。
本申请实施例提供一种计算机程序产品或计算机程序,所述计算机程序产品或计算机程序包括计算机指令,所述计算机指令存储在计算机可读存储介质中;其中,计算机设备的处理器从所述计算机可读存储介质中读取所述计算机指令,所述处理器用于执行所述计算机指令,实现上述的目标对象的动作驱动方法。
本申请实施例提供一种目标对象的动作驱动设备,包括:存储器,用于存储可执行指令;处理器,用于执行所述存储器中存储的可执行指令时,实现上述的目标对象的动作驱动方法。
本申请实施例提供一种计算机可读存储介质,存储有可执行指令,用于引起处理器执行所述可执行指令时,实现上述的目标对象的动作驱动方法。
本申请实施例具有以下有益效果:通过源参数和目标参数的结合参数,得到最终语音驱动目标对象的动作的合成视频,提高最终所得到的合成视频的平滑度和真实性,进而提高了视频合成的视觉效果。
附图说明
图1是相关技术中的一种目标对象的动作驱动方法的系统框架图;
图2是本申请实施例提供的目标对象的动作驱动系统的一个架构示意图;
图3是本申请实施例提供的目标对象的动作驱动设备的结构示意图;
图4是本申请实施例提供的目标对象的动作驱动方法的一个流程示意图;
图5是本申请实施例提供的目标对象的动作驱动方法的一个流程示意图;
图6A-图6B是本申请实施例提供的目标对象的动作驱动方法的一个流程示意图;
图7是本申请实施例提供的图像渲染模型的训练方法的实现流程示意图;
图8是本申请实施例提供的目标对象的动作驱动方法的系统框架图;
图9是本申请实施例提供的文本转语音模块的框架图;
图10是本申请实施例提供的语音转人脸参数网络的框架图;
图11是本申请实施例提供的Dlib算法效果图;
图12是本申请实施例提供的图像渲染模型的框架图;
图13是本申请实施例提供的基于条件的生成式对抗网络的框架图;
图14是相关技术中的方法合成的虚拟人同步说话视频;
图15是本申请实施例的目标对象的动作驱动方法所生成的合成视频。
具体实施方式
为了使本申请的目的、技术方案和优点更加清楚,下面将结合附图对本申请作进一步地详细描述,所描述的实施例不应视为对本申请的限制,本领域普通技术人员在没有做出创造性劳动前提下所获得的所有其它实施例,都属于本申请保护的范围。
在以下的描述中,涉及到“一些实施例”,其描述了所有可能实施例的子集,但是可以理解,“一些实施例”可以是所有可能实施例的相同子集或不同子集,并且可以在不冲突的情况下相互结合。除非另有定义,本申请实施例所使用的所有的技术和科学术 语与属于本申请实施例的技术领域的技术人员通常理解的含义相同。本申请实施例所使用的术语只是为了描述本申请实施例的目的,不是旨在限制本申请。
在解释本申请实施例之前,首先对相关技术中的目标对象的动作驱动方法进行说明:
目前,嘴型同步说话视频生成方案主要分为两大类:文本驱动和语音驱动。顾名思义,文本驱动是输入一段文本和一段目标人物的视频,通过从文本到语音(TTS,Text To Speech)技术将文本转化成语音,再从语音特征中学习人脸特征,最后输出一段目标人物阅读输入文本的视频;而语音驱动则跳过TTS的步骤,直接输入一段语音和目标人物的视频,可以说,文本驱动方法是语音驱动方法的一种扩充。目前主要都是基于深度学习来实现嘴型同步说话视频生成方案,其中,Audio2Obama方法中,首先利用循环神经网络从语音特征中学习到20个嘴部关键点,然后基于关键点信息生成嘴部纹理,最后和目标视频帧结合,得到嘴型同步说话视频帧。而作为文本驱动的ObamaNet方法,则主要包含三个模块,分别为“文本-语音”模块、“语音-关键点”模块以及“关键点-视频帧”模块,其中,“文本-语音”模块采用TTS算法中的Char2Wav,“语音-关键点”模块同样利用循环神经网络从语音特征中学习到关键点信息,而“关键点-视频帧”模块则利用具有跳跃连接来实现信息传递的U-Net网络,该方法是首个基于深度学习的文本驱动的嘴型同步说话视频生成模型。
虽然上述方法都能取得比较可观的效果,但上述方法都是基于同一个人进行实验验证,模型可扩展性较差。为此,另一些方法也开始致力于设计能适应不同人物声音的网络。例如,一种方法中,首先根据多个不同来源的声音片段学习一个通用、共享的“语音-表情”空间,然后根据所得的表情参数进行3D人脸重构,进而得到对应的脸部坐标映射图,即UV贴图(UV map),UV map为一张由3D人脸坐标映射到二维平面的图。该方法同样采用U-Net网络来渲染视频帧。另一种方法则提出一个语音身份信息去除网络来将不同说话者的语音特征转换到一个全局域中,然后利用循环神经网络从语音特征中学习表情参数,将得到的表情参数与目标人物的3D人脸参数结合重构得到3D网格,将3D网格输入到U-Net网络得到最终的视频帧。再一种方法中,则主要针对渲染模块进行改进,提出了一个记忆增强生成式对抗网络(GAN,Generative Adversarial Networks)来保存不同说话人的身份特征和空间特征对,从而实现不同人说话视频合成。
相关技术中,还提出一种基于语音驱动模型的目标对象的动作驱动方法,该方法首先根据多个不同来源的声音片段学习一个通用、共享的“语音-表情”空间,该空间由多个混合形状构成,不同人的表情参数均可由空间中的不同混合形状的线性组合构成。然后根据所得的表情参数进行3D人脸重构,进而得到对应的UV map,然后采用基于空洞卷积的U-Net来渲染视频帧,图1是相关技术中的一种目标对象的动作驱动方法的系统框架图,如图1所示,目标对象的动作驱动方法的系统框架由广义网络(Generalized Network)11和专业网络(Specialized Network)12组成,其中,该技术方案系统框架的具体处理流程如下:首先将不同来源的声音片段111输入到语音识别系统(DeepSpeech RNN)112中进行语音特征提取,得到语音特征然后经过一个卷积神经网络(CNN,Convolutional Neural Networks)113将不同人的语音特征映射到一个通用、共享的隐语音表情空间(Latent Audio Expression Space)114,对于不同人的语音特征,均可用该空间中的不同混合形状(blendshape)的线性组合构成。广义网络11的输出会进入专业网络12的内容感知滤波器(Content-Aware Filtering)121,得到平滑的语音-表情参数(Smooth Audio-Expressions)122,进而得到重构的3D人脸模型(3D Model)123和UV Map 124。最后UV Map 124和背景图片125输入到神经渲染网络(Neural Rendering Network)126中得到最终的输出图片127。
相关技术中的上述方法至少存在以下问题:相关技术中是语音驱动方法,无法实现 给定一个文本,输出对应的嘴型同步说话视频;相关技术中所利用的人脸参数只有由3D人脸模型得到的UV Map,但UV Map只能为网络提供嘴型的先验数据,网络对于牙齿的细节没有任何辅助信息;相关技术中在训练渲染网络时仅惩罚预测值和真实值的对应帧,对于输入的前后帧之间没有考虑,会导致前后帧的差异性得不到优化,使最终的视频出现抖动现象。并且,对于相关技术中的上述方法,还均存在所生成的最终的嘴型同步说话视频帧对应的视频不平滑且不真实的问题。
3D虚拟人嘴型同步说话视频生成领域目前的主要挑战包括两点:人脸重构以及视频帧的渲染。针对第一个难点,本申请实施例提出了一个语音转人脸参数网络,可以从语音特征中同时学习2D嘴部关键点与3D人脸表情参数,这样既能得到2D关键点提供的精确位置信息,同时也能保留3D人脸参数具有深度信息的优势,结合2D和3D参数来重构人脸能确保其准确性。在得到重构人脸后,还与背景进行融合。针对第二个难点,本申请实施例提出了一个两阶段的渲染网络,第一个渲染网络实现从重构人脸中渲染出嘴部纹理区域,第二个渲染网络则旨在将嘴部纹理区域与背景结合渲染出最终视频帧。使用两阶段渲染网络的好处在于:1)单独训练两个渲染网络能降低训练难度,同时能确保第一个渲染网络生成的嘴型纹理的准确性;2)训练第二个渲染网络时再次对嘴型区域进行惩罚,实现对嘴型的修正以及牙齿、皱纹等细节的优化。此外,在训练渲染网络时,还采用了一个视频帧相似性损失来确保输出的前后帧之间差异性不大,避免视频抖动现象和视频不平滑且不真实的问题。
本申请实施例提供的目标对象的动作驱动方法,首先,获取源语音和目标视频,目标视频中包括目标对象;然后,对源语音在每一时刻的语音参数进行脸部参数转换处理,得到源语音在对应时刻的源参数;对目标视频进行参数提取,得到目标参数;根据对源参数和目标参数进行结合所得到的结合参数,对目标视频中的目标对象进行图像重构,得到重构图像;最后,通过重构图像生成合成视频,合成视频中具有目标对象,且目标对象的动作与源语音对应。如此,由于基于源参数和目标参数的结合参数得到最终语音驱动目标对象的动作的合成视频,使得最终所得到的合成视频更加平滑和真实,提高了视频合成的视觉效果。
需要说明的是,本申请实施例中的脸部并不局限于人脸,还可以是动物的脸部,还可以是虚拟对象的脸部。
下面说明本申请实施例的目标对象的动作驱动设备的示例性应用,在一种实现方式中,本申请实施例提供的目标对象的动作驱动设备可以实施为笔记本电脑,平板电脑,台式计算机,移动设备(例如,移动电话,便携式音乐播放器,个人数字助理,专用消息设备,便携式游戏设备)、智能机器人等任意的具备视频播放功能的终端,在另一种实现方式中,本申请实施例提供的目标对象的动作驱动设备还可以实施为服务器。下面,将说明目标对象的动作驱动设备实施为服务器时的示例性应用。
参见图2,图2是本申请实施例提供的目标对象的动作驱动系统20的一个架构示意图。为实现合成同时具有目标对象和源语音的合成视频,即生成源语音驱动目标对象的动作的合成视频,本申请实施例提供的目标对象的动作驱动系统20中包括终端100、网络200和服务器300,终端100获取目标视频和源语音,根据目标视频和源语音生成目标对象的动作驱动请求,并将动作驱动请求通过网络200发送给服务器300,服务器300响应于动作驱动请求,对源语音在每一时刻的语音参数进行脸部参数转换处理,得到源语音在对应时刻的源参数;并对目标视频进行参数提取,得到目标参数;然后,根据对源参数和目标参数进行结合所得到的结合参数,对目标视频中的目标对象进行图像重构,得到重构图像;通过重构图像生成合成视频,其中,合成视频中具有目标对象,且目标对象的动作与源语音对应。在得到合成视频之后,将合成视频通过网络200发送给终端 100。终端100在获取到合成视频之后,在终端100的当前界面100-1上播放该合成视频。
下面,将说明目标对象的动作驱动设备实施为终端时的示例性应用。
为实现合成同时具有目标对象和源语音的合成视频,即生成源语音驱动目标对象的动作的合成视频,终端100获取目标视频和源语音,其中目标视频和源语音可以是本地存储的视频和语音,还可以是实时录制的视频和语音,终端对源语音在每一时刻的语音参数进行脸部参数转换处理,得到源语音在对应时刻的源参数;并对目标视频进行参数提取,得到目标参数;然后,根据对源参数和目标参数进行结合所得到的结合参数,对目标视频中的目标对象进行图像重构,得到重构图像;通过重构图像生成合成视频,其中,合成视频中具有目标对象,且目标对象的动作与源语音对应。在得到合成视频之后,在终端100的当前界面100-1上播放该合成视频。
本申请实施例提供的目标对象的动作驱动方法还涉及人工智能技术领域,通过人工智能技术实现对合成视频进行合成。本申请实施例中,至少可以通过人工智能技术中的计算机视觉技术、语音技术和自然语言处理技术来实现。其中,计算机视觉技术(CV,Computer Vision)是一门研究如何使机器“看”的科学,更进一步的说,就是指用摄影机和电脑代替人眼对目标进行识别、跟踪和测量等机器视觉,并进一步做图形处理,使电脑处理成为更适合人眼观察或传送给仪器检测的图像。作为一个科学学科,计算机视觉研究相关的理论和技术,试图建立能够从图像或者多维数据中获取信息的人工智能系统。计算机视觉技术通常包括图像处理、图像识别、图像语义理解、图像检索、光学字符识别(OCR,Optical Character Recognition)、视频处理、视频语义理解、视频内容/行为识别、三维物体重建、3D技术、虚拟现实、增强现实、同步定位与地图构建等技术,还包括常见的脸部识别、指纹识别等生物特征识别技术。语音技术(Speech Technology)的关键技术有自动语音识别技术(ASR,Automatic Speech Recognition)和语音合成技术(TTS,Text To Speech)以及声纹识别技术。让计算机能听、能看、能说、能感觉,是未来人机交互的发展方向,其中语音成为未来最被看好的人机交互方式之一。自然语言处理(NLP,Nature Language Processing)是计算机科学领域与人工智能领域中的一个重要方向。它研究能实现人与计算机之间用自然语言进行有效通信的各种理论和方法。自然语言处理是一门融语言学、计算机科学、数学于一体的科学。因此,这一领域的研究将涉及自然语言,即人们日常使用的语言,所以它与语言学的研究有着密切的联系。自然语言处理技术通常包括文本处理、语义理解、机器翻译、机器人问答、知识图谱等技术。
本申请实施例所提供的目标对象的动作驱动方法还可以基于云平台并通过云技术来实现,例如,上述服务器300可以是云端服务器。
图3是本申请实施例提供的目标对象的动作驱动设备的结构示意图,以目标对象的动作驱动设备为上述服务器300为例,图3所示的服务器300包括:至少一个处理器310、存储器350、至少一个网络接口320和用户接口330。服务器300中的各个组件通过总线系统340耦合在一起。可理解,总线系统340用于实现这些组件之间的连接通信。总线系统340除包括数据总线之外,还包括电源总线、控制总线和状态信号总线。但是为了清楚说明起见,在图3中将各种总线都标为总线系统340。
处理器310可以是一种集成电路芯片,具有信号的处理能力,例如通用处理器、数字信号处理器(DSP,Digital Signal Processor),或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件等,其中,通用处理器可以是微处理器或者任何常规的处理器等。
用户接口330包括使得能够呈现媒体内容的一个或多个输出装置331,包括一个或 多个扬声器和/或一个或多个视觉显示屏。用户接口330还包括一个或多个输入装置332,包括有助于用户输入的用户接口部件,比如键盘、鼠标、麦克风、触屏显示屏、摄像头、其他输入按钮和控件。
存储器350可以是可移除的,不可移除的或其组合。示例性的硬件设备包括固态存储器,硬盘驱动器,光盘驱动器等。存储器350可选地包括在物理位置上远离处理器310的一个或多个存储设备。存储器350包括易失性存储器或非易失性存储器,也可包括易失性和非易失性存储器两者。非易失性存储器可以是只读存储器(ROM,Read Only Memory),易失性存储器可以是随机存取存储器(RAM,Random Access Memory)。本申请实施例描述的存储器350旨在包括任意适合类型的存储器。在一些实施例中,存储器350能够存储数据以支持各种操作,这些数据的示例包括程序、模块和数据结构或者其子集或超集,下面示例性说明。
操作系统351,包括用于处理各种基本系统服务和执行硬件相关任务的系统程序,例如框架层、核心库层、驱动层等,用于实现各种基础业务以及处理基于硬件的任务;
网络通信模块352,用于经由一个或多个(有线或无线)网络接口320到达其他计算设备,示例性的网络接口320包括:蓝牙、无线相容性认证(WiFi)、和通用串行总线(USB,Universal Serial Bus)等;
输入处理模块353,用于对一个或多个来自一个或多个输入装置332之一的一个或多个用户输入或互动进行检测以及翻译所检测的输入或互动。
在一些实施例中,本申请实施例提供的装置可采用软件方式实现,图3示出了存储在存储器350中的一种目标对象的动作驱动装置354,该目标对象的动作驱动装置354可以是服务器300中的目标对象的动作驱动装置,其可以是程序和插件等形式的软件,包括以下软件模块:获取模块3541、脸部参数转换模块3542、参数提取模块3543、图像重构模块3544和生成模块3545,这些模块是逻辑上的,因此根据所实现的功能可以进行任意的组合或进一步拆分。将在下文中说明各个模块的功能。
在另一些实施例中,本申请实施例提供的装置可以采用硬件方式实现,作为示例,本申请实施例提供的装置可以是采用硬件译码处理器形式的处理器,其被编程以执行本申请实施例提供的目标对象的动作驱动方法,例如,硬件译码处理器形式的处理器可以采用一个或多个应用专用集成电路(ASIC,Application Specific Integrated Circuit)、DSP、可编程逻辑器件(PLD,Programmable Logic Device)、复杂可编程逻辑器件(CPLD,Complex Programmable Logic Device)、现场可编程门阵列(FPGA,Field-Programmable Gate Array)或其他电子元件。
下面将结合本申请实施例提供的服务器300的示例性应用和实施,说明本申请实施例提供的目标对象的动作驱动方法,该方法可以是一种视频合成方法。参见图4,图4是本申请实施例提供的目标对象的动作驱动方法的一个流程示意图,下面将结合图4示出的步骤进行说明。
步骤S401,获取源语音和目标视频,目标视频中包括目标对象。
这里,服务器可以接收用户通过终端发送的目标对象的动作驱动请求,该动作驱动请求用于请求对源语音和目标视频进行合成,生成同时具有目标对象和源语音,且源语音驱动目标对象的动作的合成视频,即请求生成的合成视频中具有目标视频中的目标对象,且目标对象对应的语音为该源语音。
源语音可以是用户预先录制的语音,也可以是从网络上下载的语音,还可以是对特定的文本进行转换后得到的语音。在一些实施例中,源语音的声音特征可以是特定对象的声音特征,还可以是目标视频中的目标对象的声音特征。
步骤S402,对源语音在每一时刻的语音参数进行脸部参数转换处理,得到源语音 在对应时刻的源参数。
这里,每一时刻的源参数包括但不限于表情参数和嘴部关键点参数,其中,表情参数是与该时刻的语音参数对应的表情参数,例如,当语音参数对应的为欢快的语音时,则表情参数可以是微笑的表情参数,当语音参数对应的为悲伤的语音时,则表情参数可以是皱眉的表情参数。嘴部关键点参数是在表达该时刻的语音参数时的口型参数。
本申请实施例中,表情参数为3D表情参数,嘴部关键点参数为2D关键点参数。
步骤S403,对目标视频进行参数提取处理,得到目标参数。
这里,可以采用预设算法对目标视频进行参数提取,即对目标视频中的目标对象进行参数提取,其中,目标参数包括但不限于目标嘴部关键点参数和目标脸部参数,当然,目标参数还可以包括目标对象的姿态参数、位置参数、形状参数和动作参数等。
步骤S404,根据源参数和目标参数结合得到的结合参数,对目标视频中的目标对象进行图像重构处理,得到重构图像。
这里,首先对源参数和目标参数进行结合,得到结合参数,结合参数是用于表征最终合成视频中的目标对象的姿态、位置、形状、动作和口型等状态的参数。
本申请实施例中,根据结合参数对目标对象进行图像重构,得到重构图像,重构图像是用于生成最终合成视频的图像。
步骤S405,通过重构图像生成合成视频。
这里,合成视频中具有目标对象,且目标对象的动作与源语音对应。
本申请实施例中,对应于每一时刻的语音参数,生成对应的重构图像,且对每一重构图像进行渲染生成一张合成图像,由于语音参数具有一定的时长,因此重构图像可以具有至少一张,且合成视频的时长与源语音的时长相等,或者,合成视频的时长大于源语音的时长。当重构图像具有一张时,则最终生成的合成视频为一张合成图像;当重构图像具有多张时,则最终生成的合成视频的时长与源语音的时长相同,且合成视频是由多张合成图像按照时间先后顺序连接形成的视频。
在一些实施例中,目标视频可以具有至少一帧视频帧,目标视频中具有目标对象,当目标视频包括一帧视频帧时,该视频帧中具有目标对象,视频合成请求用于请求生成具有该目标对象的合成视频,且合成视频是基于一帧视频帧得到的动态的视频;当目标视频中包括多帧视频帧时,至少一帧视频帧中具有目标对象,视频合成请求用于请求生成具有该目标对象的合成视频,且合成视频是基于多帧视频帧得到的动态的视频。
在一些实施例中,当目标视频中包括多帧视频帧时,目标视频的时长可以与源语音的时长相同,也可以不同。当目标视频的时长与源语音的时长相同时,则可以根据每一视频帧对应源语音在每一时刻的语音参数,形成合成图像,最终形成具有与目标视频相同时长的合成视频。
本申请实施例可以应用于以下场景:在教育产业中,如果想生成一段关于某知识点的教学视频,可以将该知识点对应的源语音(即课堂教师语音)和具有教师讲课的目标视频输入至服务器中,服务器可以采用本申请实施例的方法直接生成教师讲解该知识点的教学视频(即合成视频)并输出。
本申请实施例提供的目标对象的动作驱动方法,对源语音在每一时刻的语音参数进行脸部参数转换处理,得到源语音在对应时刻的源参数,对目标视频进行参数提取得到目标参数,并根据源参数和目标参数的结合参数对目标对象进行图像重构,得到重构图像,最后,通过重构图像生成合成视频,其中,合成视频中具有目标对象,且目标对象的动作与源语音对应。如此,由于基于源参数和目标参数的结合参数得到最终语音驱动目标对象的动作的合成视频,使得最终所得到的合成视频更加平滑和真实,提高了视频合成的视觉效果。
在一些实施例中,目标对象的动作驱动系统中至少包括终端和服务器,通过终端和服务器之间的交互,实现对终端动作驱动请求的响应,生成用户想要的合成视频。其中,动作驱动请求中包括源语音和目标视频,动作驱动请求中还可以包括源文本,可以通过该源文本得到源语音。图5是本申请实施例提供的目标对象的动作驱动方法的一个流程示意图,如图5所示,方法包括以下步骤:
步骤S501,终端获取源语音和目标视频。
这里,源语音可以是用户通过终端上的语音采集装置采集的语音,还可以是用户通过终端下载的语音。目标视频可以是具有任意时长的视频,目标视频中具有目标对象。
步骤S502,终端获取源文本和目标视频。
这里,源文本是用于生成源语音的文本,本申请实施例中,不仅可以对输入的源语音进行处理,生成具有源语音的合成视频,还可以对输入的源文本进行解析和转换生成源语音,进而形成具有源语音的合成视频。
步骤S503,终端对源文本进行文本解析,得到源文本的语言学特征。
这里,语言学特征包括但不限于:拼音、停顿、标点符号和声调等语言学特征。在一些实施例中,还可以基于人工智能技术对源文本进行文本解析,得到源文本的语言学特征。
步骤S504,终端对语言学特征进行声学参数提取,得到源文本在时域上的声学参数。
这里,声学参数为源文本在时域上的参数表示,通过对语言学特征进行声学参数提取,得到源文本在时域上的声学参数。
步骤S505,终端对声学参数进行转换处理,得到源文本在频域上的语音波形。
这里,语音波形是与声学参数对应的声学表示,语音波形是源文本在频域上的参数表示。
步骤S506,终端将语音波形对应的语音,确定为源语音。
步骤S507,终端对源语音和目标视频进行封装,形成动作驱动请求。
在一些实施例中,终端还可以将源文本封装于动作驱动请求中,并将动作驱动请求发送给服务器,由服务器实现步骤S503至步骤S506中将源文本转换为源语音的步骤。
步骤S508,终端将动作驱动请求发送给服务器。
步骤S509,服务器解析动作驱动请求,得到源语音和目标视频。
步骤S510,服务器对源语音在每一时刻的语音参数进行脸部参数转换处理,得到源语音在对应时刻的源参数。
步骤S511,服务器对目标视频进行参数提取,得到目标参数。
步骤S512,服务器根据对源参数和目标参数进行结合所得到的结合参数,对目标视频中的目标对象进行图像重构,得到重构图像。
步骤S513,服务器通过重构图像生成合成视频,其中,合成视频中具有目标对象,且目标对象的动作与源语音对应。
需要说明的是,步骤S510至步骤S513与上述步骤S402至步骤S405相同,本申请实施例不再赘述。
步骤S514,服务器将合成视频发送给终端。
步骤S515,终端在当前界面上播放合成视频。
在一些实施例中,源参数包括:表情参数和嘴部关键点参数。基于图4,图6A是本申请实施例提供的目标对象的动作驱动方法的一个流程示意图,如图6A所示,步骤S402可以通过以下步骤实现:
步骤S601,对源语音进行特征提取,得到源语音的语音特征向量。
步骤S602,对语音特征向量依次进行卷积处理和全连接处理,得到源语音在对应时刻的表情参数和嘴部关键点参数。
在一些实施例中,步骤S602可以通过以下方式实现:通过具有特定卷积核的至少两层第一卷积层对语音特征向量依次进行卷积处理,得到卷积处理向量;通过至少两层全连接层对卷积处理向量依次进行全连接处理,得到全连接处理向量。
这里,全连接处理向量中包括表情参数对应的向量和嘴部关键点参数对应的向量,其中,表情参数对应的向量与嘴部关键点参数对应的向量的维度之和,等于全连接处理向量的维度。
请继续参照图6A,在一些实施例中,步骤S403可以通过以下步骤实现:
步骤S603,对目标视频的当前视频帧中的目标对象依次进行嘴部参数提取和脸部参数提取,对应得到目标嘴部关键点参数和目标脸部参数。
这里,目标嘴部关键点参数和目标脸部参数是目标对象的参数,当目标视频中具有多帧视频帧时,可以提取每一帧视频帧中目标对象的目标嘴部关键点参数和目标脸部参数。
步骤S604,将目标嘴部关键点参数和目标脸部参数确定为目标参数。
请继续参照图6A,在一些实施例中,步骤S404可以通过以下步骤实现:
步骤S605,对源参数和目标参数进行结合,得到结合参数。
这里,对源参数和目标参数进行结合,可以是将用于生成最终的合成图像的参数提取出来,将不用于生成最终的合成图像的参数删除,以得到结合参数。
步骤S606,根据结合参数,对目标视频中的目标对象进行图像重构,得到嘴部轮廓图和UV贴图。
本申请实施例中,重构图像包括嘴部轮廓图和UV贴图(UV map),其中,嘴部轮廓图用于反应最终所生成合成图像中目标对象的嘴部轮廓,UV贴图用于与嘴部轮廓图结合生成合成图像中目标对象的嘴部区域纹理。
步骤S607,将嘴部轮廓图和UV贴图作为重构图像。
本申请实施例中,源参数包括:表情参数和嘴部关键点参数;目标参数包括目标嘴部关键点参数和目标脸部参数;目标脸部参数至少包括:目标姿态参数、目标形状参数和目标表情参数。
在一些实施例中,步骤S605可以通过以下方式实现:通过表情参数替换目标脸部参数中的目标表情参数,得到替换后的脸部参数;通过嘴部关键点参数替换目标嘴部关键点参数,得到替换后的嘴部关键点参数;将替换后的脸部参数和替换后的嘴部关键点参数作为结合参数。
参照图6B,步骤S405中通过重构图像生成合成视频的过程,可以通过以下步骤实现:
步骤S6054,基于每一时刻的替换后的脸部参数、替换后的嘴部关键点参数和与目标视频对应的背景图像调用图像渲染模型。
将每一时刻的替换后的脸部参数、替换后的嘴部关键点参数和与目标视频对应的背景图像,输入至图像渲染模型中。其中,重构图像包括替换后的脸部参数和替换后的嘴部关键点参数。
步骤S6055,通过图像渲染模型中的第一渲染网络,对每一时刻的替换后的脸部参数和每一时刻的替换后的嘴部关键点参数进行嘴型区域渲染,得到每一时刻的嘴型区域纹理图像。
在一些实施例中,第一渲染网络包括至少一层第二卷积层、至少一层第一下采样层和至少一层第一上采样层;其中,嘴型区域渲染过程可以通过以下步骤实现:依次通过 第二卷积层和第一下采样层,对替换后的脸部参数和替换后的嘴部关键点参数进行卷积处理和下采样处理,得到重构图像的深度特征;通过第一上采样层,对重构图像的深度特征进行上采样处理,以恢复重构图像的分辨率,并得到嘴型区域纹理图像。
步骤S6056,通过图像渲染模型中的第二渲染网络,对嘴型区域纹理图像和背景图像进行拼接处理,得到每一时刻的合成图像。
在一些实施例中,第二渲染网络包括至少一层第三卷积层、至少一层第二下采样层和至少一层第二上采样层;其中,拼接处理过程可以通过以下步骤实现:依次通过第三卷积层和第二下采样层,对嘴型区域纹理图像和背景图像进行卷积处理和下采样处理,得到嘴型区域纹理图像和背景图像的深度特征;通过第二上采样层,对嘴型区域纹理图像和背景图像的深度特征进行上采样处理,以恢复嘴型区域纹理图像和背景图像的分辨率,并得到当前时刻的合成图像。
步骤S6057,根据每一时刻的合成图像,确定包括目标对象和源语音的合成视频。
在一些实施例中,上述图像渲染模型用于对每一时刻的重构图像进行渲染,以生成对应时刻的合成图像,且该合成图像中不仅具有目标对象还具有源语音在对应时刻的语音。其中,图像渲染模型至少包括第一渲染网络和第二渲染网络,第一渲染网络用于对重构图像和目标图像分别进行特征提取和嘴型区域渲染,第二渲染网络用于对嘴型区域纹理图像和目标图像进行拼接处理。下面,说明本申请实施例所提供的图像渲染模型的训练方法。
图7是本申请实施例提供的图像渲染模型的训练方法的实现流程示意图,如图7所示,方法包括以下步骤:
步骤S701,基于重构图像样本和目标图像样本调用图像渲染模型。
在一些实施例中,重构图像样本可以通过以下步骤得到:对语音样本在当前时刻的语音参数进行脸部参数转换处理后得到语音参数样本;对目标图像样本进行参数提取,得到目标参数样本;对语音参数样本和目标参数样本进行结合得到结合参数样本,并根据结合参数样本对目标图像样本中的目标对象进行图像重构,得到该重构图像样本。
在一些实施例中,重构图像样本还可以通过以下步骤得到:对文本样本进行文本解析,得到文本样本的语言学特征,对文本样本的语言学特征进行声学参数提取,得到文本样本在时域上的声学参数;对该声学参数进行转换处理,得到文本样本在频域上的语音波形,并将语音波形对应的语音,确定为语音样本。然后,对语音样本在当前时刻的语音参数进行脸部参数转换处理后得到语音参数样本;对目标图像样本进行参数提取,得到目标参数样本;对语音参数样本和目标参数样本进行结合得到结合参数样本,并根据结合参数样本对目标图像样本中的目标对象进行图像重构,得到该重构图像样本。
其中,目标图像样本中包括目标对象样本,最终生成的合成图像样本中也包括该目标对象样本。
步骤S702,通过图像渲染模型的第一渲染网络,对重构图像样本和目标图像样本进行特征提取处理和嘴型区域渲染,得到嘴型纹理图像样本。
这里,第一渲染网络包括至少一层第二卷积层、至少一层第一下采样层和至少一层第一上采样层。
在进行特征提取时,可以通过第二卷积层对输入的重构图像样本和目标图像样本对应的参数进行卷积处理,通过第一下采样层对卷积处理后的参数进行下采样处理,以提取重构图像样本和目标图像样本的深度特征,即提取得到第一图像特征样本。在进行嘴型区域渲染时,可以通过第一上采样层对提取到的第一图像特征样本进行上采样处理,以恢复重构图像样本和目标图像样本的分辨率,并得到嘴型纹理图像样本。
本申请实施例中,在每一第一下采样层之前连接有一个第二卷积层,在每一第一上 采样层之后也连接有一个第二卷积层,即在每一次下采样处理之前进行一次卷积处理,在每一次上采样处理之后进行一次卷积处理。在一些实施例中,第一下采样层与第一上采样层之间引入跳跃连接,通过跳跃连接来保留不同分辨率的特征信息。
步骤S703,通过图像渲染模型中的第二渲染网络,对样本嘴型纹理图像和样本目标图像进行拼接处理,得到合成图像样本。
这里,第二渲染网络包括至少一层第三卷积层、至少一层第二下采样层和至少一层第二上采样层。
在进行拼接处理时,可以首先通过第三卷积层对输入的嘴型纹理图像样本和目标图像样本对应的参数进行卷积处理,通过第二下采样层对卷积处理后的参数进行下采样处理,以提取嘴型纹理图像样本和目标图像样本的深度特征,即提取得到第二图像特征样本。然后,通过第二上采样层对提取到的第二图像特征样本进行上采样处理,以恢复嘴型纹理图像样本和目标图像样本的分辨率,并得到合成图像样本。
本申请实施例中,在每一第二下采样层之前连接有一个第三卷积层,在每一第二上采样层之后也连接有一个第三卷积层,即在每一次下采样处理之前进行一次卷积处理,在每一次上采样处理之后进行一次卷积处理。在一些实施例中,第二下采样层与第二上采样层之间引入可以引入跳跃连接,通过跳跃连接来保留不同分辨率的特征信息。
步骤S704,基于合成图像样本调用预设损失模型,得到损失结果。
在一些实施例中,步骤S704可以通过以下方式实现:获取对应于重构图像样本和目标图像样本的真实合成图像;将合成图像样本和真实合成图像拼接后输入至预设损失模型中,通过预设损失模型对合成图像样本和真实合成图像进行前后帧相似性损失计算,得到损失结果。
本申请实施例中,在进行前后帧相似性损失计算时,可以计算以下损失函数:图像渲染模型关于真实合成图像和合成图像样本的两个损失之间的损失、生成对抗损失、L1损失、利用L1损失所计算的真实合成图像和合成图像样本在N个激活层所输出的特征图的差异,并对该差异进行线性加权所得到最终的损失、和前后帧相似性损失,其中,损失结果是根据上述损失函数中的至少一个计算得到的,也就是说,可以对图像渲染模型关于真实合成图像和合成图像样本的两个损失之间的损失、生成对抗损失、L1损失、利用L1损失所计算的真实合成图像和合成图像样本在N个激活层所输出的特征图的差异,并对该差异进行线性加权所得到最终的损失、和前后帧相似性损失进行加权求和后,得到该损失结果。
步骤S705,根据损失结果对第一渲染网络和第二渲染网络中的参数进行修正,得到训练后的图像渲染模型。
本申请实施例中,在对图像渲染模型进行训练时,可以采用生成对抗策略,并基于前后帧之间的相似性进行模型训练考虑,进而计算图像渲染模型在每一次预测时的损失结果。如此,能够对图像渲染模型进行准确的训练,且训练得到的图像渲染模型考虑了前后帧之间的连续变化,使得所生成的合成视频中连续两帧视频帧之间的变化更加平滑,从而使得所得到的合成视频更加平滑和真实,能够提高图像渲染模型所生成的合成视频的视觉效果。
下面,将说明本申请实施例在一个实际的应用场景中的示例性应用。
本申请实施例可应用于智能音箱屏、智能电视、人工智能(AI,Artificial Intelligence)教育、虚拟主播和直播等嘴型同步说话的视频生成场景中,通过本申请实施例提供的目标对象的动作驱动方法,可根据输入文本或语音合成出特定目标人物对应的同步说话视频,显著地改善智能产品的人机交互效果和用户体验感。
作为一种示例,例如针对AI教育应用,目标对象为虚拟教师,通过本申请实施例 提供的目标对象的动作驱动方法,根据教师端输入的文本或语音,自动生成一个同步说话的个性化3D虚拟教师的教学视频,对学生端授课,模拟真实的教师在线教学的功能,在提升学生端用户体验的同时,还减少教师端的工作量。
作为一种示例,例如针对直播应用,目标对象为虚拟主播,通过本申请实施例提供的目标对象的动作驱动方法,根据主播输入的文本或语音,自动生成一个同步说话的虚拟主播的直播视频,该虚拟主播可以进行游戏实况播报以吸引关注,还可以通过杂谈节目增强互动,还可以通过翻唱舞蹈获得高点击量等,从而提高直播效率。
下面具体说明本申请实施例提供的一种目标对象的动作驱动方法:
本申请实施例提供的目标对象的动作驱动方法是一种文本驱动或语音驱动的3D虚拟人嘴型同步说话视频生成技术,通过结合2D和3D人脸参数来预测嘴型,然后用由视频帧差异性损失训练的渲染网络合成最终的输出图片;本申请实施例解决了语音驱动模型只局限于语音输入、合成视频不真实和抖动问题。
本申请实施例中,可以利用一段文本或语音来学习2D/3D人脸参数,并由此合成一段逼真的特定目标人物嘴型同步的说话视频。在实现的过程中,首先利用TTS技术将输入文本转化为对应语音,然后采用一个卷积神经网络从语音特征中学习到2D/3D人脸参数,同时对一段目标人物的视频提取出2D/3D人脸参数,通过将学习到的参数替换掉目标人物的参数来重构出新的2D/3D人脸模型,将该重构人脸模型(即重构图像)输入到渲染网络中生成视频帧,从而实现目标人物嘴型同步说话的视频生成。
图8是本申请实施例提供的目标对象的动作驱动方法的系统框架图,如图8所示,该系统输入的可以是一段源文本801或者源语音802,若输入的为源文本801,则会经过文本转语音模块803生成对应的源语音,然后源语音经过语音转人脸参数网络804得到对应的人脸参数,这里的人脸参数包括2D嘴部关键点和3D表情参数,所得到的人脸参数和由人脸参数提取模块805获取到的目标参数结合,重构出新的人脸模型806,其中,UV Map 8061和重构后的嘴部关键点8062可由该人脸模型806得到,然后将UV Map 8061和重构后的嘴部关键点8062输入到由前后帧相似性损失训练的两阶段的图像渲染模型807中,生成最终的输出图片808(即合成图像)。
下面对目标对象的动作驱动方法的系统框架中的每一部分进行详细说明。
文本转语音模块803:该模块旨在实现给定一段输入源文本,将其转换成对应的源语音,作为语音转人脸参数网络的输入。
图9是本申请实施例提供的文本转语音模块的框架图,如图9所示,文本转语音模块主要包括三个子模块:文本分析模块901、声学模型模块902和声码器模块903。文本分析模块901用于对输入的文本(即源文本)进行解析,决定每个字的发音、语气、语调等,将文本映射到语言学特征,这里的语言学特征包括但不限于:拼音、停顿、标点符号和声调等语言学特征;声学模型模块902用于将语言学特征映射为声学参数,这里的声学参数为源文本在时域上的参数表示;声码器模块903用于将声学参数转换为语音波形,这里的语音波形为源文本在频域上的参数表示。
语音转人脸参数网络804:图10是本申请实施例提供的语音转人脸参数网络的框架图,如图10所示,A I表示输入的语音片段(Input Audio)(即源语音),通过用户说话或上述文本转语音模块得到的,F A表示语音特征(Audio Features),c 1-c 4表示四个卷积层(Convolution layer),f 1-f 3表示三个全连接层(Fully connection layer),T S表示源3D表情参数(Three dimensional facial expression parameters of source),K S表示源2D嘴部关键点(2D mouth Keypoints of source)。
语音转人脸参数网络的目的在于从输入的语音片段中预测出对应的源3D人脸表情参数和2D嘴部关键点,其中,3D人脸表情参数具有10维度的系数,而2D嘴部关键 点是基于Dlib算法中所使用的20个关键点,由于2D关键点是由(x,y)两个坐标系构成的,所以20个关键点对应40维度的向量。
对于输入的源语音A I,首先经过DeepSpeech方法中提出的循环神经网络(RNN)提取出语音特征F A,然后进入一个包括四个卷积层c 1-c 4和三个全连接层f 1-f 3的卷积神经网络(CNN),最后由CNN得到两组人脸参数,分别为3D人脸表情参数T S和2D嘴部关键点K S。其中,所提取的语音特征F A可以是一个16×29的张量,卷积层c 1-c 4均采用3×1的卷积核,将F A的维度分别降低成8×32、4×32、2×64和1×64,卷积层c 4输出的特征图会经过三个全连接层f 1-f 3后,分别得到128、64和50维的向量。
人脸参数提取模块805:该模块旨在从目标人物的视频帧中提取出目标人物2D嘴部关键点位置和3D人脸参数。其中,2D嘴部关键点通过Dlib算法得到,给定一张图片,该算法会预测出人脸上68个关键点,如图11所示,图11是本申请实施例提供的Dlib算法效果图,其中左图1101是原始图片,右图1102中人脸上的点是通过Dlib算法预测出的关键点。在本申请实施例中,可以只采用所预测出的20个嘴部关键点作为2D人脸参数。对于每张人脸图片会预测出62维的3D人脸参数,包括12维姿态参数、40维形状参数和10维表情参数。人脸参数提取模块得到的2D嘴部关键点和3D人脸表情参数会被语音转人脸参数网络中得到的结果所替代,而目标人物的姿态参数和形状参数保留,已得到重新结合的3D人脸参数。然后,利用重新结合的3D人脸参数对目标人物进行人脸重构并得到对应的UV Map,而新的2D嘴部关键点信息将直接作为后续渲染的输入之一。
图像渲染模型807:图12是本申请实施例提供的图像渲染模型的框架图,如图12所示,给定2D嘴部关键点、UV Map和背景图像,图像渲染模型的目的是合成最终的嘴型同步说话视频帧。在实现的过程,可以首先对20个重构得到的嘴部关键点进行连接得到一个多边形作为嘴部轮廓,即K R(reconstructed mouth keypoints),然后基于特定算法从3D人脸参数中映射出UV Map,即U R。K R和U R的分辨率均为256×256,两者进行拼接后作为图像渲染模型的输入。图像渲染模型分为两个阶段,第一阶段(即第一个渲染网络)合成嘴型区域纹理r 1,r 1和目标视频背景帧b g(即背景图像)进行拼接作为第二个渲染网络的输入;第二阶段(即第二个渲染网络)结合背景图像合成最终的输出r 2。两个渲染网络采用的结构均为U-Net网络,U-Net网络是一个对输入不断采用下采样和卷积操作来提取深度特征,然后通过逐步的上采样层恢复其分辨率,而下采样和上采样之间引入了跳跃连接来保留不同分辨率的特征信息。
在一些实施例中,在训练渲染网络时,可以采用了基于条件的生成式对抗网络(GAN,Generative Adversarial Networks),如图13所示,是本申请实施例提供的基于条件的GAN框架图,对于渲染网络的预测值F(即合成图像F)和真实值R(即真实图像R),会分别和渲染网络的输入I(即输入图像I)拼接后送入判别器1301,得到关于真实值与预测值的两个损失L D_fake和L D_real。判别器1301最终的损失函数L D通过以下公式(1-1)表示:
L D=(L D_fake+L D_real)*0.5      (1-1)
而渲染网络可看成生成器,其损失函数包括生成对抗损失L G_GAN,L G_GAN和判别器中的L D_fake是一样的,只不过生成器将最大化该值,其目标在于使得判别器无法判别真假,而判别器将最小化该值,其目标在于准确辨别出合成图片。此外,为了使得合成图像F和真实图像R更加接近,生成器中还采用了L1损失,如以下公式(1-2)所示:
L G_L1=L 1(F,R)        (1-2)
其中,L G_L1表示L1损失对应的损失值。
此外,还在特征层面上对合成图像和真实图像进行了约束,例如,分别将合成图像和真实图像输入到VGG19网络中,然后利用L1损失分别计算两者在五个激活层输出的特征图的差异,并进行线性加权得到最终的损失L G_VGG,如以下公式(1-3)所示:
Figure PCTCN2021134541-appb-000001
其中,Relu f i和Relu r i分别表示合成图像和真实图像在第i个激活层的特征图。
上述的损失都是基于每一帧进行单独计算,在帧与帧之间没有加入任何约束,这会导致最终合成视频出现不平滑或抖动现象。为此,本申请实施例还引入了一个前后帧相似性损失L G_Smi来减少合成视频中前后两帧与真实视频的差异性。请继续参照图8,对于合成的第t帧,首先计算合成的第t帧与第t-1帧之间的差异性,记为d fake,同样地,计算真实视频中第t帧与第t-1帧的差异性,记为d real,L G_Smi的目的是减少d fake和d real的差距,即min[L 1(d fake,d real)]。
那么,生成器(即图像渲染模型)的最终损失函数L G则为以下公式(1-4):
L G=L G_GAN+α*L G_L1+β*L G_VGG+γ*L G_Smi      (1-4)
其中,α、β和γ均为超参数。
本申请实施例提供的方法,与其他相关技术中的虚拟人同步说话视频生成算法相比,申请实施例的技术方案能合成出在时间上更加平滑与真实的结果。其中,图14是相关技术中的方法合成的虚拟人同步说话视频,如图14所示,合成的视频帧之间往往会出现不够平滑与不够真实的情况,视频帧1401的画面与视频帧1402的画面不够连续。
而本申请实施例通过结合2D和3D人脸参数以及引入前后帧相似性损失,则克服了上述难题,所生成的最终合成视频的效果如图15所示,是连续的十帧视频帧,这十帧视频帧的顺序为从左到右,从上到下。由图15可以看出,本申请实施例所生成的合成视频更加平滑与真实,视觉效果更好。
需要说明的是,本申请实施例的方法属于文本驱动方法,通过结合成熟的TTS技术实现给定一段文本以及任意一段目标人物视频即可生成目标人说话视频。本申请实施例典型的应用场景包括近年来兴起的AI教育产业,与目前的语音驱动生成虚拟教师方案不同,本申请实施例对于输入的要求扩展成文本或语音,可进一步增强用户体验感。
在一些实施例中,上述的语音转人脸参数网络中对于利用DeepSpeech提取出的语音特征采用了一个卷积神经网络来预测人脸参数。但是对于该模块,本申请实施例并不限定深度卷积网络的模型类型,例如还可以使用循环神经网络或生成对抗网络来代替卷积神经网络,可根据实际应用或产品对于精度和效率的要求来选择。同样地,图像渲染模型中的两个渲染网络不仅可采用U-Net结构,其他编码-解码结构也均可使用,例如沙漏网络。
下面继续说明本申请实施例提供的目标对象的动作驱动装置354实施为软件模块的示例性结构,在一些实施例中,如图3所示,存储在存储器350的目标对象的动作驱动装置354中的软件模块可以是服务器300中的目标对象的动作驱动装置,所述装置包括:
获取模块3541,配置为获取源语音,并获取目标视频,所述目标视频中包括目标对象;脸部参数转换模块3542,配置为对所述源语音在每一时刻的语音参数进行脸部参数 转换处理,得到所述源语音在对应时刻的源参数;参数提取模块3543,配置为对所述目标视频进行参数提取处理,得到所述目标视频的目标参数;图像重构模块3544,配置为根据所述源参数和所述目标参数结合得到的结合参数,对所述目标视频中的目标对象进行图像重构处理,得到重构图像;生成模块3545,配置为通过所述重构图像生成合成视频,其中,所述合成视频中包括所述目标对象、且所述目标对象的动作与所述源语音对应。
在一些实施例中,所述获取模块3541还配置为:获取源文本,并对所述源文本进行文本解析处理,得到所述源文本的语言学特征;对所述语言学特征进行声学参数提取处理,得到所述源文本在时域上的声学参数;对所述声学参数进行转换处理,得到所述源文本在频域上的语音波形;将所述语音波形对应的语音作为所述源语音。
在一些实施例中,所述源参数包括:表情参数和嘴部关键点参数;所述人脸参数转换模块3542还配置为:针对所述源语音在任一时刻的语音参数执行以下处理:对所述语音参数进行特征提取处理,得到所述源语音的语音特征向量;对所述语音特征向量依次进行卷积处理和全连接处理,得到所述源语音在所述时刻的所述表情参数和所述嘴部关键点参数。
在一些实施例中,所述脸部参数转换模块3542还配置为:通过包含特定卷积核的至少两层第一卷积层对所述语音特征向量进行所述卷积处理,得到卷积处理向量;通过至少两层全连接层对所述卷积处理向量进行所述全连接处理,得到全连接处理向量;其中,所述全连接处理向量中包括所述表情参数对应的向量和所述嘴部关键点参数对应的向量,所述表情参数对应的向量与所述嘴部关键点参数对应的向量的维度之和,等于所述全连接处理向量的维度。
在一些实施例中,所述目标参数包括:目标嘴部关键点参数和所述目标脸部参数;所述参数提取模块3543还配置为:对所述目标视频中的所述目标对象进行嘴部参数提取处理,得到所述目标嘴部关键点参数;对所述目标视频中的所述目标对象进行脸部参数提取处理,得到所述目标脸部参数。
在一些实施例中,所述图像重构模块3544还配置为:对所述源参数和所述目标参数进行结合处理,得到所述结合参数;根据所述结合参数,对所述目标视频中的目标对象进行图像重构处理,得到嘴部轮廓图和脸部坐标映射图;将所述嘴部轮廓图和所述脸部坐标映射图作为所述重构图像。
在一些实施例中,所述源参数包括:表情参数和嘴部关键点参数;所述目标参数包括目标嘴部关键点参数和目标脸部参数;所述目标脸部参数包括:目标姿态参数、目标形状参数和目标表情参数;所述图像重构模块3544还配置为:通过所述表情参数替换所述目标脸部参数中的所述目标表情参数,得到替换后的脸部参数;通过所述嘴部关键点参数替换所述目标嘴部关键点参数,得到替换后的嘴部关键点参数;将所述替换后的脸部参数和所述替换后的嘴部关键点参数作为所述结合参数。
在一些实施例中,所述重构图像包括所述替换后的脸部参数和所述替换后的嘴部关键点参数;所述生成模块3545还配置为:基于每一时刻的所述替换后的脸部参数、所述替换后的嘴部关键点参数和与所述目标视频对应的背景图像调用图像渲染模型;通过所述图像渲染模型中的第一渲染网络,对每一时刻的所述替换后的脸部参数和每一时刻的所述替换后的嘴部关键点参数进行嘴型区域渲染,得到每一时刻的嘴型区域纹理图像;通过所述图像渲染模型中的第二渲染网络,对所述每一时刻的嘴型区域纹理图像和所述背景图像进行拼接处理,得到每一时刻的合成图像;根据所述每一时刻的合成图像,确定包括所述目标对象和所述源语音的所述合成视频。
在一些实施例中,所述第一渲染网络包括至少一层第二卷积层、至少一层第一下采 样层和至少一层第一上采样层;所述生成模块3545还配置为:通过所述第二卷积层和所述第一下采样层,对所述替换后的脸部参数和所述替换后的嘴部关键点参数进行卷积处理和下采样处理,得到所述重构图像的深度特征;通过所述第一上采样层,对所述重构图像的深度特征进行上采样处理,得到所述嘴型区域纹理图像。
在一些实施例中,所述第二渲染网络包括至少一层第三卷积层、至少一层第二下采样层和至少一层第二上采样层;所述生成模块3545还配置为:通过所述第三卷积层和所述第二下采样层,对所述嘴型区域纹理图像和所述背景图像进行卷积处理和下采样处理,得到所述嘴型区域纹理图像和所述背景图像的深度特征;通过所述第二上采样层,对所述深度特征进行上采样处理,得到每一时刻的合成图像。
在一些实施例中,所述图像渲染模型通过以下步骤进行训练:基于重构图像样本和目标图像样本调用所述图像渲染模型;通过所述图像渲染模型的第一渲染网络,对所述重构图像样本和所述目标图像样本进行特征提取处理和嘴型区域渲染,得到嘴型纹理图像样本;通过所述图像渲染模型中的第二渲染网络,对所述样本嘴型纹理图像和所述样本目标图像进行拼接处理,得到合成图像样本;基于所述合成图像样本调用预设损失模型,得到损失结果;根据所述损失结果对所述第一渲染网络和所述第二渲染网络中的参数进行修正,得到训练后的所述图像渲染模型。
在一些实施例中,所述图像渲染模型通过以下步骤进行训练:获取对应于所述重构图像样本和所述目标图像样本的真实合成图像;将所述合成图像样本和所述真实合成图像拼接后输入至所述预设损失模型中,通过所述预设损失模型对所述合成图像样本和所述真实合成图像进行前后帧相似性损失计算,得到所述损失结果。
It should be noted that the description of the apparatus embodiments of this application is similar to that of the method embodiments above and has similar beneficial effects, so it is not repeated here. For technical details not disclosed in the apparatus embodiments, please refer to the description of the method embodiments of this application.
The embodiments of this application provide a computer program product or computer program, which includes computer instructions stored in a computer-readable storage medium. A processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, causing the computer device to perform the method described above in the embodiments of this application.
The embodiments of this application provide a storage medium storing executable instructions; when the executable instructions are executed by a processor, the processor is caused to perform the method provided by the embodiments of this application, for example, the method shown in FIG. 4.
In some embodiments, the storage medium may be a computer-readable storage medium, for example, a ferroelectric random access memory (FRAM, Ferroelectric Random Access Memory), a read-only memory (ROM, Read Only Memory), a programmable read-only memory (PROM, Programmable Read Only Memory), an erasable programmable read-only memory (EPROM, Erasable Programmable Read Only Memory), an electrically erasable programmable read-only memory (EEPROM, Electrically Erasable Programmable Read Only Memory), a flash memory, a magnetic surface memory, an optical disc, or a compact disc read-only memory (CD-ROM, Compact Disc-Read Only Memory); it may also be any device including one of or any combination of the above memories.
In some embodiments, the executable instructions may take the form of a program, software, a software module, a script or code, written in any form of programming language (including compiled or interpreted languages, or declarative or procedural languages), and may be deployed in any form, including as a standalone program or as a module, component, subroutine or other unit suitable for use in a computing environment.
As an example, the executable instructions may, but need not, correspond to a file in a file system, and may be stored in a portion of a file that holds other programs or data, for example in one or more scripts in a Hyper Text Markup Language (HTML) document, in a single file dedicated to the program in question, or in multiple cooperating files (e.g., files storing one or more modules, subroutines or code portions). As an example, the executable instructions may be deployed to be executed on one computing device, on multiple computing devices located at one site, or on multiple computing devices distributed across multiple sites and interconnected by a communication network.
The above are merely embodiments of this application and are not intended to limit the protection scope of this application. Any modification, equivalent replacement, improvement and the like made within the spirit and scope of this application shall fall within the protection scope of this application.

Claims (17)

  1. An action driving method for a target object, performed by an action driving device for a target object, the method comprising:
    acquiring source speech and acquiring a target video, the target video comprising a target object;
    performing face parameter conversion processing on speech parameters of the source speech at each time instant to obtain source parameters of the source speech at the corresponding time instant;
    performing parameter extraction processing on the target video to obtain target parameters of the target video;
    performing image reconstruction processing on the target object in the target video according to combined parameters obtained by combining the source parameters and the target parameters, to obtain a reconstructed image; and
    generating a synthesized video from the reconstructed image, wherein the synthesized video comprises the target object, and actions of the target object correspond to the source speech.
  2. The method according to claim 1, wherein the acquiring source speech comprises:
    acquiring source text and performing text parsing processing on the source text to obtain linguistic features of the source text;
    performing acoustic parameter extraction processing on the linguistic features to obtain acoustic parameters of the source text in the time domain;
    performing conversion processing on the acoustic parameters to obtain a speech waveform of the source text in the frequency domain; and
    using speech corresponding to the speech waveform as the source speech.
  3. The method according to claim 1, wherein
    the source parameters comprise expression parameters and mouth keypoint parameters; and
    the performing face parameter conversion processing on the speech parameters of the source speech at each time instant to obtain the source parameters of the source speech at the corresponding time instant comprises:
    performing the following processing for the speech parameters of the source speech at any time instant:
    performing feature extraction processing on the speech parameters to obtain a speech feature vector of the source speech; and
    performing convolution processing and fully-connected processing on the speech feature vector in sequence to obtain the expression parameters and the mouth keypoint parameters of the source speech at the time instant.
  4. The method according to claim 3, wherein the performing convolution processing and fully-connected processing on the speech feature vector in sequence to obtain the expression parameters and the mouth keypoint parameters of the source speech at the time instant comprises:
    performing the convolution processing on the speech feature vector through at least two first convolution layers containing specific convolution kernels to obtain a convolution-processed vector; and
    performing the fully-connected processing on the convolution-processed vector through at least two fully-connected layers to obtain a fully-connected vector,
    wherein the fully-connected vector comprises a vector corresponding to the expression parameters and a vector corresponding to the mouth keypoint parameters, and a sum of dimensions of the vector corresponding to the expression parameters and the vector corresponding to the mouth keypoint parameters equals a dimension of the fully-connected vector.
  5. The method according to claim 1, wherein
    the target parameters comprise target mouth keypoint parameters and target face parameters; and
    the performing parameter extraction processing on the target video to obtain the target parameters of the target video comprises:
    performing mouth parameter extraction processing on the target object in the target video to obtain the target mouth keypoint parameters; and
    performing face parameter extraction processing on the target object in the target video to obtain the target face parameters.
  6. The method according to claim 1, wherein
    before the performing image reconstruction processing on the target object in the target video to obtain the reconstructed image, the method further comprises:
    performing combination processing on the source parameters and the target parameters to obtain the combined parameters; and
    the performing image reconstruction processing on the target object in the target video according to the combined parameters obtained by combining the source parameters and the target parameters, to obtain the reconstructed image, comprises:
    performing image reconstruction processing on the target object in the target video according to the combined parameters to obtain a mouth contour map and a face coordinate map; and
    using the mouth contour map and the face coordinate map as the reconstructed image.
  7. The method according to claim 1, wherein
    the source parameters comprise expression parameters and mouth keypoint parameters, the target parameters comprise target mouth keypoint parameters and target face parameters, and the target face parameters comprise target pose parameters, target shape parameters and target expression parameters; and
    the performing combination processing on the source parameters and the target parameters to obtain the combined parameters comprises:
    replacing the target expression parameters in the target face parameters with the expression parameters to obtain replaced face parameters;
    replacing the target mouth keypoint parameters with the mouth keypoint parameters to obtain replaced mouth keypoint parameters; and
    using the replaced face parameters and the replaced mouth keypoint parameters as the combined parameters.
  8. The method according to claim 7, wherein
    the reconstructed image comprises the replaced face parameters and the replaced mouth keypoint parameters; and
    the generating a synthesized video from the reconstructed image comprises:
    invoking an image rendering model based on the replaced face parameters at each time instant, the replaced mouth keypoint parameters at each time instant, and a background image corresponding to the target video;
    performing mouth-region rendering on the replaced face parameters at each time instant and the replaced mouth keypoint parameters at each time instant through a first rendering network in the image rendering model to obtain a mouth-region texture image at each time instant;
    performing stitching processing on the mouth-region texture image at each time instant and the background image through a second rendering network in the image rendering model to obtain a composite image at each time instant; and
    determining, according to the composite image at each time instant, the synthesized video comprising the target object and the source speech.
  9. The method according to claim 8, wherein
    the first rendering network comprises at least one second convolution layer, at least one first down-sampling layer and at least one first up-sampling layer; and
    the performing mouth-region rendering on the replaced face parameters at each time instant and the replaced mouth keypoint parameters at each time instant through the first rendering network in the image rendering model to obtain the mouth-region texture image at each time instant comprises:
    performing convolution processing and down-sampling processing on the replaced face parameters and the replaced mouth keypoint parameters through the second convolution layer and the first down-sampling layer to obtain deep features of the reconstructed image; and
    performing up-sampling processing on the deep features of the reconstructed image through the first up-sampling layer to obtain the mouth-region texture image.
  10. The method according to claim 8, wherein
    the second rendering network comprises at least one third convolution layer, at least one second down-sampling layer and at least one second up-sampling layer; and
    the performing stitching processing on the mouth-region texture image at each time instant and the background image through the second rendering network in the image rendering model to obtain the composite image at each time instant comprises:
    performing convolution processing and down-sampling processing on the mouth-region texture image and the background image through the third convolution layer and the second down-sampling layer to obtain deep features of the mouth-region texture image and the background image; and
    performing up-sampling processing on the deep features through the second up-sampling layer to obtain the composite image at each time instant.
  11. The method according to claim 8, wherein the image rendering model is trained through the following steps:
    invoking the image rendering model based on reconstructed image samples and target image samples;
    performing feature extraction processing and mouth-region rendering on the reconstructed image samples and the target image samples through the first rendering network of the image rendering model to obtain mouth texture image samples;
    performing stitching processing on the mouth texture image samples and the target image samples through the second rendering network in the image rendering model to obtain composite image samples;
    invoking a preset loss model based on the composite image samples to obtain a loss result; and
    correcting parameters of the first rendering network and the second rendering network according to the loss result to obtain the trained image rendering model.
  12. The method according to claim 11, wherein the invoking a preset loss model based on the composite image samples to obtain a loss result comprises:
    acquiring a real composite image corresponding to the reconstructed image samples and the target image samples; and
    concatenating the composite image samples with the real composite image and inputting them into the preset loss model, and performing, by the preset loss model, an inter-frame similarity loss calculation on the composite image samples and the real composite image to obtain the loss result.
  13. An action driving apparatus for a target object, the apparatus comprising:
    an acquisition module, configured to acquire source speech and acquire a target video, the target video comprising a target object;
    a face parameter conversion module, configured to perform face parameter conversion processing on speech parameters of the source speech at each time instant to obtain source parameters of the source speech at the corresponding time instant;
    a parameter extraction module, configured to perform parameter extraction processing on the target video to obtain target parameters of the target video;
    an image reconstruction module, configured to perform image reconstruction processing on the target object in the target video according to combined parameters obtained by combining the source parameters and the target parameters, to obtain a reconstructed image; and
    a generation module, configured to generate a synthesized video from the reconstructed image, wherein the synthesized video comprises the target object, and actions of the target object correspond to the source speech.
  14. An action driving system for a target object, comprising a terminal and a server, wherein
    the terminal is configured to send an action driving request for the target object to the server, the action driving request comprising source speech and a target video, the target video comprising a target object; and
    the server is configured to implement, in response to the action driving request, the action driving method for a target object according to any one of claims 1 to 12.
  15. An action driving device for a target object, comprising:
    a memory, configured to store executable instructions; and
    a processor, configured to implement, when executing the executable instructions stored in the memory, the action driving method for a target object according to any one of claims 1 to 12.
  16. A computer-readable storage medium storing executable instructions, wherein the executable instructions, when executed by a processor, cause the processor to implement the action driving method for a target object according to any one of claims 1 to 12.
  17. A computer program product, comprising a computer program or instructions, wherein the computer program or instructions cause a computer to perform the action driving method for a target object according to any one of claims 1 to 12.